Knowledge + Practice

AWS Certified Machine Learning Specialty MLS-C01 (MLS-C01) — Questions 976–1050

1755 questions total · 24 pages · All types, answers revealed

976

MCQeasy

A data analyst is using Amazon SageMaker Studio to perform exploratory data analysis on a dataset stored in S3. The analyst wants to generate summary statistics and visualizations quickly. Which built-in feature of SageMaker Studio should the analyst use?

A.SageMaker Ground Truth

B.SageMaker Data Wrangler

C.SageMaker Autopilot

D.SageMaker Clarify

AnswerB

Data Wrangler provides visual EDA capabilities like summary stats and charts.

Why this answer

Option B is correct because SageMaker Data Wrangler is a visual data preparation tool integrated into SageMaker Studio that provides summary statistics, histograms, and correlation matrices without code. Option A is wrong because SageMaker Ground Truth is for data labeling. Option C is wrong because SageMaker Autopilot automates model building, not EDA. Option D is wrong because SageMaker Clarify is for bias detection and explainability.

977

MCQhard

A data scientist is training a recurrent neural network (RNN) for time series forecasting. The model's training loss is not decreasing, and the gradients are vanishing. Which technique should the scientist apply to address vanishing gradients?

A.Apply gradient clipping.

B.Replace the RNN cells with LSTM or GRU units.

C.Add batch normalization layers.

D.Increase the learning rate.

AnswerB

LSTM/GRU have gating mechanisms that help preserve gradients over long sequences.

Why this answer

Option B is correct because LSTM and GRU cells are designed to mitigate vanishing gradients via gating mechanisms. Option A is wrong because gradient clipping addresses exploding gradients, not vanishing ones. Option C is wrong because batch normalization helps with internal covariate shift but does not specifically address vanishing gradients. Option D is wrong because increasing the learning rate may cause instability.

978

MCQhard

A company is using Amazon SageMaker to train a model on data stored in S3. The training job needs to access data from an S3 bucket in a different AWS account. The data owner has granted cross-account access via a bucket policy. However, the training job fails with an AccessDenied error. What is the MOST likely cause?

A.The data is encrypted with SSE-KMS and the SageMaker role lacks KMS permissions.

B.The SageMaker execution role does not have the necessary permissions to access the S3 bucket.

C.The S3 bucket is not configured with public access.

D.The S3 bucket is in a different region and requires a VPC endpoint.

AnswerB

The IAM role used by SageMaker must be allowed in the bucket policy.

Why this answer

Option B is correct because SageMaker training jobs run under an execution role, and cross-account access must be allowed on both sides: the bucket policy must grant access to that role, and the role's own IAM policy must allow the S3 actions on the bucket. Option A is wrong because nothing in the scenario indicates SSE-KMS encryption, so missing KMS permissions are a less likely cause. Option C is wrong because the data does not need to be public. Option D is wrong because S3 does not require VPC endpoints for cross-account access.

979

Multi-Selectmedium

Which THREE techniques can help reduce overfitting in a neural network? (Choose 3)

Select 3 answers

A.Dropout

B.Increasing the number of layers

C.Using a larger learning rate

D.Early stopping

E.L2 regularization

AnswersA, D, E

Dropout randomly drops neurons, reducing overfitting.

Why this answer

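Dropout is a regularization technique that randomly drops a fraction of neurons during training, which prevents the network from relying too heavily on any single neuron and forces it to learn more robust features. This reduces overfitting by introducing noise that improves generalization. Early stopping halts training once validation performance stops improving, and L2 regularization penalizes large weights; both also limit how closely the network can fit noise in the training data.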

Exam trap

The exam often tests the misconception that increasing model capacity (e.g., more layers) or adjusting the learning rate can reduce overfitting, when in fact these techniques either exacerbate overfitting or address convergence issues rather than regularization.

980

MCQmedium

A team has trained a deep learning model on Amazon SageMaker using a custom Docker container. They want to deploy the model to a SageMaker endpoint for real-time inference. Which format should the model artifacts be in?

A.A single .tar.gz file containing the model files.

B.A folder on S3 with the model files.

C.No format requirement; any file works.

D.A .zip file containing the model files.

AnswerA

SageMaker requires model artifacts as a tarball.

Why this answer

Amazon SageMaker requires model artifacts to be packaged as a single .tar.gz file when using a custom Docker container for real-time inference. This compressed archive must contain the model files (e.g., model.pth, model.h5) and any necessary inference code, as SageMaker extracts the archive to the /opt/ml/model directory during deployment. The .tar.gz format ensures consistent extraction and compatibility with SageMaker's inference pipeline.

Exam trap

The trap here is that candidates may assume SageMaker accepts common archive formats like .zip or any file structure, but the exam specifically tests the requirement for a single .tar.gz file as the only supported format for model artifacts in custom container deployments.

How to eliminate wrong answers

Option B is wrong because a folder on S3 with model files is not a valid format; SageMaker expects a single compressed archive, not a directory structure, to ensure atomic deployment and consistent extraction. Option C is wrong because SageMaker does impose a format requirement: the model artifacts must be a .tar.gz file; arbitrary files would break the deployment process. Option D is wrong because a .zip file is not supported by SageMaker for model artifacts; only .tar.gz is accepted, as SageMaker's extraction logic is built around tar-based archives.
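
The packaging step described above can be sketched with Python's standard `tarfile` module. This is a minimal illustration, not SageMaker's own tooling: the helper name `package_model` and the file name `model.pth` used in the usage note are assumptions made for the example.

```python
import os
import tarfile


def package_model(model_dir, output_path):
    """Pack everything under model_dir into a gzip-compressed tarball.

    Archive paths are made relative to model_dir so that the files land
    directly in the extraction directory rather than in a nested folder.
    Returns the list of member names for a quick sanity check.
    """
    with tarfile.open(output_path, "w:gz") as tar:
        for name in sorted(os.listdir(model_dir)):
            # arcname strips the local directory prefix from the archive path
            tar.add(os.path.join(model_dir, name), arcname=name)
    with tarfile.open(output_path, "r:gz") as tar:
        return tar.getnames()
```

For a directory that contains only a hypothetical `model.pth`, `package_model(model_dir, "model.tar.gz")` returns `["model.pth"]`; the resulting archive is what would be uploaded to S3.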

981

MCQeasy

A machine learning team is using Amazon SageMaker to tune hyperparameters for a neural network. They have defined a hyperparameter tuning job with a random search strategy. The training time per job is very long. Which strategy can reduce the total tuning time?

A.Enable early stopping to terminate poorly performing jobs.

B.Use a larger instance type for each training job.

C.Switch to Bayesian optimization.

D.Increase the number of training jobs.

AnswerA

Early stopping terminates poor trials early, saving compute time.

Why this answer

Enabling early stopping allows SageMaker to terminate training jobs that are unlikely to produce better results based on the objective metric, which directly reduces total tuning time by freeing up compute resources for more promising hyperparameter combinations. This is especially effective with random search, where many trials may converge slowly or plateau.

Exam trap

The trap here is that candidates often confuse early stopping with reducing training time per job (Option B) or assume Bayesian optimization always converges faster, but in practice, early stopping directly cuts wasted time on poor trials, which is the most effective strategy when individual training jobs are very long.

How to eliminate wrong answers

Option B is wrong because using a larger instance type speeds up individual training jobs but does not reduce the number of jobs or the time wasted on poor performers, and may increase cost without proportional benefit. Option C is wrong because switching to Bayesian optimization typically requires more initial jobs to build a surrogate model and can be less effective with very long training times per job, as it still waits for each job to complete before suggesting the next. Option D is wrong because increasing the number of training jobs would increase total tuning time, not reduce it, since each job still takes a long time to run.

982

MCQmedium

A data scientist is using Amazon SageMaker to train a model. The training dataset contains missing values in several features. The data scientist wants to impute missing values using the median of each feature. Which approach is most appropriate?

A.Drop all rows that contain missing values

B.Compute the median on the entire dataset, then split into training and test sets

C.Impute missing values with zero for all features before splitting

D.Compute the median of each feature on the training set only, then impute both training and test sets using that median

AnswerD

Fitting the median on the training set only avoids data leakage.

Why this answer

Option D is correct because the median should be computed on the training set only to avoid data leakage, then applied to both training and test sets. Option A is wrong because dropping rows with missing values may discard useful data and is not imputation. Option B is wrong because computing the median on the entire dataset and then splitting causes data leakage. Option C is wrong because imputing with zero may not be appropriate for every feature and is not median imputation.
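
A minimal sketch of leakage-free median imputation, using only the standard library. The column values and the helper names `fit_median` and `impute` are illustrative assumptions; `None` stands in for a missing value.

```python
from statistics import median


def fit_median(train_column):
    """Compute the median from the observed (non-missing) training values only."""
    return median(v for v in train_column if v is not None)


def impute(column, fill_value):
    """Replace missing entries with a statistic that was fitted elsewhere."""
    return [fill_value if v is None else v for v in column]


# Hypothetical feature column, already split into train and test.
train = [1.0, None, 3.0, 100.0]
test = [None, 2.0]

train_median = fit_median(train)             # fitted on training data only
train_imputed = impute(train, train_median)
test_imputed = impute(test, train_median)    # test set reuses the training median
```

The test column never influences `train_median`, which is the point of splitting before fitting the statistic.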

983

MCQmedium

A company uses AWS Glue ETL jobs to transform data from Amazon RDS for MySQL to Amazon S3. The transformation includes aggregations and joins. The job runs daily and processes approximately 100 GB of data. Recently, the job started failing with memory errors on the worker nodes. Which approach would MOST effectively resolve the issue without changing the logic?

A.Switch from a Spark ETL job to a Python shell job

B.Decrease the number of workers to reduce overhead

C.Change the worker type from G.2X to G.1X to increase memory per worker

D.Increase the number of workers in the job configuration

AnswerD

More workers distribute the data processing, reducing memory per node.

Why this answer

Option D is correct because increasing the number of workers distributes the workload and reduces memory pressure per node. Option A is wrong because a Python shell job is not suitable for large data transformations. Option B is wrong because reducing workers increases memory pressure. Option C is wrong because G.1X workers have less memory than G.2X workers, so the change would reduce memory per worker rather than increase it.

984

MCQhard

A company uses SageMaker to train a model that processes sensitive customer data. Due to compliance, the training data must be encrypted at rest and in transit, and the model artifacts must be stored in a secured S3 bucket with encryption. Which combination of actions is REQUIRED?

A.Store data in an S3 bucket with AWS CloudHSM integration

B.Use an S3 bucket with SSE-S3 and enable SageMaker Internet-facing mode

C.Use an S3 bucket with default encryption (SSE-S3) and enable SSL for all connections

D.Enable AWS KMS encryption for the SageMaker notebook and training job, and use an S3 bucket with default encryption using AWS KMS

AnswerD

KMS keys encrypt the SageMaker storage volumes and the S3 data at rest, while traffic between SageMaker and S3 is protected by TLS.

Why this answer

Option D is correct because it covers every storage layer in the workflow: an AWS KMS key on the SageMaker notebook and training job encrypts the attached storage volumes and the output artifacts, while an S3 bucket with default encryption using AWS KMS encrypts the training data and model artifacts at rest. In transit, SageMaker communicates with S3 and with its own API endpoints over TLS. KMS uses envelope encryption and supports customer managed keys, which gives the key-level control that compliance requirements typically call for.

Exam trap

The trap here is that candidates often assume SSE-S3 alone is sufficient for compliance, but it does not cover encryption in transit or SageMaker-specific encryption, and they overlook the requirement for KMS to encrypt the SageMaker environment itself.

How to eliminate wrong answers

Option A is wrong because AWS CloudHSM integration is not a direct encryption method for S3 buckets; it provides hardware security modules for key storage but does not inherently encrypt data at rest in S3 or in transit, and it is not a required action for SageMaker encryption. Option B is wrong because SSE-S3 encrypts data at rest but does not address encryption in transit, and enabling SageMaker Internet-facing mode exposes the endpoint to the internet without ensuring SSL/TLS for all connections, violating compliance. Option C is wrong because SSE-S3 encrypts data at rest but does not provide encryption for SageMaker notebook instances or training jobs; enabling SSL for all connections is a best practice but not a specific SageMaker configuration, and it does not cover encryption of model artifacts in transit between SageMaker and S3 without KMS integration.

985

MCQeasy

A data engineer needs to transform raw clickstream data (JSON files) stored in S3 into a partitioned Parquet dataset for querying with Athena. The transformation includes cleaning, deduplication, and enrichment. The pipeline should run daily. Which solution is MOST cost-effective and requires the least operational overhead?

A.Launch an Amazon EMR cluster with Spark, transform the data, and terminate the cluster after completion.

B.Use an AWS Glue ETL job with a schedule trigger to perform the transformation and write to S3.

C.Use AWS Lambda functions triggered by S3 events to transform each file incrementally.

D.Use Amazon Athena to run CTAS queries to convert and partition the data daily.

AnswerB

Glue ETL is serverless, can handle complex transformations, and scheduling is built-in.

Why this answer

Option B is correct because AWS Glue crawlers can catalog the data, and Glue ETL jobs are serverless, cost-effective, and can be scheduled. Option A is wrong because EMR requires cluster management. Option C is wrong because Lambda has execution time limits and is not ideal for large datasets. Option D is wrong because Athena is for querying, not ETL.

986

MCQhard

A company uses SageMaker Ground Truth to label images for object detection. After labeling, they notice that the bounding boxes are often misaligned with the objects. Which action should they take to improve label quality?

A.Use a pre-built annotation tool that enforces bounding box alignment

B.Use automated labeling with a pre-trained model

C.Increase the number of workers per task

D.Adjust the confidence threshold for the model

AnswerA

Tool constraints improve consistency.

Why this answer

Option A is correct because using a pre-built annotation tool reduces variability. Option B is wrong because automated labeling may not be accurate initially. Option C is wrong because increasing workers does not guarantee consistency. Option D is wrong because adjusting the confidence threshold is for post-processing, not labeling.

987

MCQeasy

A machine learning team is preparing a dataset for model training. The data is stored in an Amazon S3 bucket with objects that are each approximately 100 MB in size. The team wants to use Amazon SageMaker for training. To optimize training performance, which data format and storage configuration should be used?

A.Store data as RecordIO-Protobuf files and use SageMaker File input mode

B.Store data as RecordIO-Protobuf files and use SageMaker Pipe input mode

C.Store data as CSV files and use SageMaker Pipe input mode

D.Store data as CSV files and use SageMaker File input mode

AnswerB

Pipe mode streams data directly from S3, and RecordIO-Protobuf provides efficient binary format.

Why this answer

Option B is correct because SageMaker Pipe input mode streams data directly from S3, avoiding disk I/O, and RecordIO-Protobuf is an optimized binary format. Option A is wrong because File mode copies data to disk, increasing latency. Option C is wrong because CSV is not as efficient as binary.

Option D is wrong because File mode with CSV is not optimal.

988

Multi-Selecthard

Which THREE are common techniques for detecting outliers in a univariate dataset? (Select THREE.)

Select 3 answers

A.Cook's distance

B.DBSCAN clustering

C.Z-score

D.Interquartile range (IQR) method

E.Modified Z-score using median absolute deviation (MAD)

AnswersC, D, E

Z-score measures how many standard deviations an observation is from the mean.

Why this answer

Options C, D, and E are correct. Option A is wrong because Cook's distance is for regression diagnostics. Option B is wrong because DBSCAN is a multivariate clustering method.
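
The three univariate methods can be sketched with the standard library as follows. The thresholds (3.0 for the Z-score, 1.5 for the IQR multiplier, 3.5 with the 0.6745 constant for the modified Z-score) are common conventions rather than fixed rules, and the sample data is an assumption for illustration.

```python
from statistics import mean, median, pstdev, quantiles


def zscore_outliers(xs, threshold=3.0):
    """Flag points more than `threshold` standard deviations from the mean."""
    mu, sigma = mean(xs), pstdev(xs)
    return [x for x in xs if sigma > 0 and abs(x - mu) / sigma > threshold]


def iqr_outliers(xs, k=1.5):
    """Flag points outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(xs, n=4)
    iqr = q3 - q1
    return [x for x in xs if x < q1 - k * iqr or x > q3 + k * iqr]


def modified_zscore_outliers(xs, threshold=3.5):
    """Flag points using the median and the median absolute deviation (MAD)."""
    med = median(xs)
    mad = median(abs(x - med) for x in xs)
    return [x for x in xs if mad > 0 and 0.6745 * abs(x - med) / mad > threshold]


# Hypothetical sample: the integers 10..28 plus one extreme value.
data = list(range(10, 29)) + [100]
```

On this sample all three functions flag only the value 100. With very small samples the plain Z-score can fail to flag an extreme point, because the outlier inflates the standard deviation it is measured against; the median-based variants are less affected.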

989

Multi-Selecteasy

A data scientist is evaluating a binary classification model. The model's confusion matrix shows: True Positives=80, False Positives=20, True Negatives=900, False Negatives=0. Which THREE metrics can be calculated from this confusion matrix? (Choose three.)

Select 3 answers

A.Precision

B.Recall

C.AUC-ROC

D.Accuracy

E.Root Mean Squared Error (RMSE)

AnswersA, B, D

Precision = TP/(TP+FP).

Why this answer

Precision is calculated as TP/(TP+FP) = 80/(80+20) = 0.80. This metric measures the proportion of positive identifications that were actually correct, which is directly derivable from the confusion matrix values.

Exam trap

The trap here is that candidates often assume AUC-ROC can be derived from a single confusion matrix, but it actually requires the full distribution of predicted probabilities to plot the ROC curve and calculate the area under it.
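
As a worked check, the three derivable metrics can be computed directly from the counts given in the question. The variable names are chosen for this sketch only.

```python
# Counts taken from the confusion matrix in the question.
tp, fp, tn, fn = 80, 20, 900, 0

precision = tp / (tp + fp)                   # 80 / 100  = 0.80
recall = tp / (tp + fn)                      # 80 / 80   = 1.00
accuracy = (tp + tn) / (tp + fp + tn + fn)   # 980 / 1000 = 0.98
```

AUC-ROC and RMSE do not appear here because neither can be obtained from these four counts alone.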

990

MCQhard

A company uses Amazon SageMaker to train a model for fraud detection. The training data is highly imbalanced. The data scientist uses SMOTE to oversample the minority class. However, the model still has poor recall on the minority class. Which additional technique should the data scientist consider?

A.One-vs-rest encoding

B.Use class weights in the loss function

C.L1 regularization

D.Principal component analysis (PCA)

AnswerB

Class weights penalize minority errors more.

Why this answer

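Cost-sensitive learning assigns a higher penalty to misclassifications of the minority class, which addresses the imbalance directly in the loss function. Option A (one-vs-rest encoding) is for multi-class problems, Option C (L1 regularization) is a regularization technique that also performs feature selection, and Option D (PCA) is for dimensionality reduction.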
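
One way to picture class weighting is the sketch below. It assumes the common "balanced" heuristic, weight = n_samples / (n_classes * class_count), and a weighted binary log loss; the 990/10 label split and both function names are illustrative, not taken from the question.

```python
import math
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class inversely to its frequency (the 'balanced' heuristic)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * count) for cls, count in counts.items()}


def weighted_log_loss(y_true, p_pred, weights):
    """Mean binary cross-entropy where each sample is scaled by its class weight."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, 1e-12), 1 - 1e-12)  # clip to avoid log(0)
        total += -weights[y] * math.log(p if y == 1 else 1 - p)
    return total / len(y_true)


# Hypothetical imbalanced labels: 990 legitimate (0) and 10 fraudulent (1).
labels = [0] * 990 + [1] * 10
weights = balanced_class_weights(labels)
```

With these weights a missed fraud case (class 1) costs roughly 100 times as much as an equally confident mistake on class 0, which pushes the model toward higher minority-class recall.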

991

Multi-Selecteasy

A data scientist is working with a dataset that contains both numerical and categorical features. The target variable is continuous. Which TWO EDA techniques should the scientist use to understand relationships between features and the target?

Select 2 answers

A.Generate a confusion matrix for the target variable.

B.Compute the silhouette score for each feature.

C.Create scatter plots of numerical features against the target variable.

D.Use box plots to compare target distribution across categorical feature categories.

E.Plot a histogram of the target variable.

AnswersC, D

Scatter plots reveal linear and nonlinear relationships with the target.

Why this answer

Scatter plots for numerical features against a continuous target and box plots for categorical features against a continuous target are standard. Option A (confusion matrix) is for classification. Option B (silhouette score) is for clustering. Option E (histogram of target) is univariate, so it shows nothing about feature relationships.

992

MCQhard

Refer to the exhibit. A SageMaker endpoint is returning 5xx errors. The logs show the above error. Which change will most likely resolve the issue?

A.Reduce the batch size in the inference script

B.Enable Auto Scaling on the endpoint

C.Compress the model artifact

D.Use a larger instance type with more memory

AnswerD

More memory solves OutOfMemoryError.

Why this answer

Option D is correct because increasing memory (e.g., upgrading the instance type) addresses the OutOfMemoryError. Option A is wrong because batch size is not relevant for inference (single request). Option B is wrong because Auto Scaling does not change instance memory. Option C is wrong because the error is memory, not model file size.

993

MCQeasy

Refer to the exhibit. A data scientist runs the AWS CLI command to create a SageMaker training job. The training job fails because the input data is not accessible. Which step should the data scientist take to fix the issue?

A.Attach an IAM policy to SageMakerRole that grants s3:GetObject on the bucket

B.Add a VpcConfig to the training job

C.Modify the bucket policy to allow s3:GetObject for any principal

D.Increase VolumeSizeInGB to 50

AnswerA

The role needs explicit S3 read permissions.

Why this answer

Option A is correct because the IAM role must have s3:GetObject permission for the S3 bucket. Option B is wrong because VPC configuration is not the issue. Option C is wrong because the bucket policy is a separate control, and allowing any principal would expose the data publicly. Option D is wrong because VolumeSizeInGB is for local storage, not S3.

994

Multi-Selectmedium

A data scientist is using SageMaker to train a model and wants to track experiments, including hyperparameters and metrics. Which TWO actions should the scientist take to set up experiment tracking? (Choose TWO.)

Select 2 answers

A.Use the SageMaker Experiments Python SDK to create an experiment and log runs.

B.Enable SageMaker Model Monitor to track training metrics.

C.Configure CloudWatch Logs to store experiment data.

D.Create a trial component in the experiment to log hyperparameters and metrics.

E.Enable SageMaker Studio to automatically capture experiments.

AnswersA, D

Directly supports experiment tracking.

Why this answer

Option A is correct because the SageMaker Experiments Python SDK provides the primary interface for creating and managing experiments, allowing the data scientist to log runs, hyperparameters, and metrics in a structured way. This SDK directly integrates with SageMaker training jobs and notebook executions to capture experiment metadata.

Exam trap

The exam often tests the distinction between monitoring (Model Monitor) and experiment tracking (Experiments SDK), and the trap here is that candidates confuse CloudWatch Logs or Model Monitor as valid tools for structured experiment metadata capture when they are not designed for that purpose.

995

Multi-Selectmedium

Which TWO statements about handling missing data during EDA are correct? (Select TWO.)

Select 2 answers

A.Dropping columns with >50% missing values is always recommended.

B.Mean imputation preserves the variance of the original distribution.

C.If data are missing completely at random (MCAR), listwise deletion yields unbiased estimates.

D.Multiple imputation (MICE) is always the safest method regardless of missing data mechanism.

E.Imputing with the median is more robust to outliers than imputing with the mean.

AnswersC, E

Under MCAR, missingness is independent of data, so deletion is unbiased.

Why this answer

Options C and E are correct. Option A is wrong because dropping a column is not always appropriate; a mostly missing column can still carry useful signal. Option B is wrong because mean imputation reduces variance. Option D is wrong because MICE is a multivariate imputation method, but it is not necessarily the safest choice under every missing data mechanism.

996

MCQeasy

A data scientist is using Amazon SageMaker to train a model using a built-in algorithm. The training job is taking a long time, and the data scientist wants to improve performance by using a larger instance type with more vCPUs. The training job is currently using an ml.m5.large instance. The data scientist changes the instance type to ml.m5.4xlarge and resubmits the training job. However, the training time does not decrease significantly. What is the MOST likely reason?

A.The algorithm is single-threaded and cannot use multiple vCPUs.

B.The built-in algorithm is not designed to scale with additional vCPUs.

C.The training job is I/O bound, and increasing vCPUs does not help.

D.The training dataset is too small to benefit from more vCPUs.

AnswerB

Some algorithms are not parallelized and do not benefit from more vCPUs.

Why this answer

The built-in algorithm may not be able to use additional vCPUs effectively if its implementation is not parallelized, so Option B is correct. Option A is incorrect because the algorithm is not inherently single-threaded; that depends on the implementation. Option C is less likely because nothing in the scenario points to an I/O bottleneck. Option D is incorrect because data size alone does not prevent parallelism.

997

MCQhard

A data scientist is performing EDA on a high-dimensional dataset with 500 features. They want to visualize the data in 2D to check for clusters. They first apply PCA and get a 2D projection that shows no clear structure. They suspect that the data lies on a non-linear manifold. Which of the following techniques should they try next?

A.Use Independent Component Analysis (ICA).

B.Use Linear Discriminant Analysis (LDA).

C.Apply PCA again with more components.

D.Use t-distributed Stochastic Neighbor Embedding (t-SNE).

AnswerD

t-SNE is a non-linear technique that preserves local structure for visualization.

Why this answer

Option D is correct because t-SNE is designed for non-linear dimensionality reduction and visualization. Option A is wrong because ICA separates independent components and is not intended for visualization. Option B is wrong because LDA is supervised. Option C is wrong because PCA is linear, so adding components does not capture a non-linear manifold.

998

MCQmedium

A company uses AWS Glue to run ETL jobs that process data from Amazon RDS for PostgreSQL and load it into Amazon Redshift. The Glue job runs nightly and takes 6 hours to complete. The Redshift cluster is a single dc2.large node. The team needs to reduce the load time to under 3 hours. The data volume is 200 GB per night. The team is considering using Amazon Redshift Spectrum to query data directly from S3 instead of loading it. However, the data transformation logic is complex and requires multiple joins and aggregations that are currently performed in Glue. Which approach should the team recommend to meet the time requirement?

A.Use Redshift Spectrum to create external tables and run the transformations directly in Redshift, bypassing the Glue job.

B.Increase the Redshift cluster to a multi-node cluster with dc2.8xlarge nodes to improve COPY and query performance.

C.Split the Glue job into multiple parallel jobs that each load a portion of the data into separate Redshift tables, then use UNION ALL views.

D.Stage the data in S3 in Parquet format and use a COPY command with the PARQUET option to load data faster.

AnswerB

More nodes increase parallelism for loading and any post-load transformations.

Why this answer

Option B is correct because increasing the node size or number of nodes provides more compute resources for the COPY command and any subsequent processing. Using a larger node type like dc2.8xlarge or adding nodes increases parallelism. Option A is wrong because Redshift Spectrum does not replace the transformation logic; the complex transformations would still need to be run, possibly in Glue or Redshift. Option C is wrong because splitting the load does not reduce total time if the bottleneck is compute on the single-node cluster. Option D is wrong because staging data in S3 does not reduce the transformation time.

999

MCQeasy

A data scientist is building a text classification model. The dataset contains 10,000 documents, each labeled with one of 5 categories. Which algorithm is most suitable for this task?

A.Principal Component Analysis (PCA)

B.Naive Bayes

C.Linear regression

D.k-means clustering

AnswerB

Naive Bayes is effective for text classification and small datasets.

Why this answer

Naive Bayes is highly suitable for text classification because it models the probability of each category given the document's word features using Bayes' theorem with a strong independence assumption. It performs well on high-dimensional sparse data like bag-of-words or TF-IDF representations, and it is particularly effective when the number of documents (10,000) is moderate relative to the vocabulary size, as it requires relatively little training data to estimate parameters.

Exam trap

The exam often tests the distinction between supervised and unsupervised learning, leading candidates to mistakenly choose k-means clustering (an unsupervised method) for a labeled classification task, or to confuse PCA with a classification algorithm because it is used for feature reduction before modeling.

How to eliminate wrong answers

Option A is wrong because Principal Component Analysis (PCA) is an unsupervised dimensionality reduction technique that finds orthogonal components maximizing variance; it does not perform classification and ignores the category labels entirely. Option C is wrong because linear regression predicts a continuous numeric output, not a discrete categorical label; applying it to multiclass classification would require inappropriate thresholding and violates the assumption of normally distributed errors. Option D is wrong because k-means clustering is an unsupervised algorithm that partitions data into clusters based on distance, without using label information; it cannot assign documents to predefined categories and requires post-hoc mapping of clusters to labels.

1000

MCQhard

A data scientist is using Amazon SageMaker Autopilot to automatically build a model. The dataset contains a mix of numerical and categorical features. After the experiment completes, Autopilot provides several candidate pipelines. Which pipeline is MOST likely to be ranked highest by Autopilot?

A.The pipeline with the lowest validation loss

B.The pipeline with the simplest model (e.g., linear classifier)

C.The pipeline with the fastest training time

D.The pipeline with the lowest training loss

AnswerA

Autopilot ranks candidates by validation performance.

Why this answer

Amazon SageMaker Autopilot ranks candidate pipelines by their objective metric on the validation dataset, which is typically the validation loss (e.g., cross-entropy for classification or mean squared error for regression). The pipeline with the lowest validation loss generalizes best to unseen data, making it the highest-ranked candidate. Autopilot uses hold-out validation or cross-validation to compute this metric, ensuring the ranking reflects out-of-sample performance rather than overfitting to the training set.

Exam trap

The trap here is that candidates often confuse training loss with validation loss, mistakenly thinking that a lower training loss indicates a better model, but Autopilot explicitly ranks by validation performance to prevent overfitting.

How to eliminate wrong answers

Option B is wrong because Autopilot does not prioritize model simplicity; it optimizes for predictive performance, and a more complex model (e.g., ensemble or gradient-boosted tree) often achieves lower validation loss than a linear classifier. Option C is wrong because training time is not a ranking criterion; Autopilot focuses on accuracy, not computational speed, and a faster pipeline may sacrifice performance. Option D is wrong because training loss is an in-sample metric that can be misleadingly low due to overfitting; Autopilot uses validation loss to avoid this bias and ensure generalization.

1001

MCQmedium

A data scientist is exploring a dataset and finds that the variance of a feature is 0. What should be done with this feature?

A.Remove the feature from the dataset

B.Create interaction terms with other features

C.Apply Min-Max scaling to normalize the feature

D.Impute missing values using the mean

AnswerA

Constant feature provides no predictive power.

Why this answer

Option A is correct because zero variance means the feature is constant and provides no information for modeling; it should be removed. Option B is wrong because an interaction term with a constant is only a scaled copy of the other feature and adds no information. Option C is wrong because scaling cannot add information to a constant feature, and Min-Max scaling is undefined when the maximum equals the minimum. Option D is wrong because imputation is for missing values, not constant ones.
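
A minimal sketch of the removal step, assuming the dataset is held as a dictionary that maps feature names to lists of values. The function name and the sample columns are hypothetical.

```python
from statistics import pvariance


def drop_constant_features(columns):
    """Return a copy of `columns` without features whose variance is zero."""
    return {
        name: values
        for name, values in columns.items()
        if pvariance(values) > 0
    }


# Hypothetical dataset: "country_code" is constant and carries no signal.
features = {
    "age": [30, 41, 25],
    "country_code": [1, 1, 1],
}
kept = drop_constant_features(features)
```

After the call, `kept` contains only the `age` column.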

1002

MCQmedium

A company has deployed a model on SageMaker for real-time inference. The endpoint is experiencing high latency during traffic spikes. Which action should the company take to reduce latency?

A.Use a larger instance type for the endpoint

B.Attach SageMaker Elastic Inference to the endpoint

C.Enable SageMaker endpoint auto-scaling

D.Use SageMaker Neo to compile the model

E.Switch to SageMaker batch transform

AnswerC

Auto-scaling adds instances during spikes, reducing latency.

Why this answer

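Option C is correct because endpoint auto-scaling adds instances during traffic spikes, which reduces latency. Option A (larger instance type) may help but is not cost-effective, since the extra capacity sits idle outside of spikes. Option E (batch transform) is for offline batch inference, not real-time requests.

Option B (Elastic Inference) accelerates inference but does not add capacity for spikes. Option D (SageMaker Neo) compiles the model for its target hardware, which can speed up individual requests but also does not add capacity.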

Full explanation →

1003

MCQmedium

A machine learning engineer is performing exploratory data analysis on a dataset containing customer transactions. They notice that the target variable is highly imbalanced: 99% of samples belong to class 0 and 1% to class 1. Which technique should they use to address this imbalance before training a classification model?

A.Train the model on the raw data without any modification.

B.Apply SMOTE to generate synthetic samples for the minority class.

C.Use accuracy as the evaluation metric and train on the raw data.

D.Under-sample the majority class to match the minority class size.

AnswerB

SMOTE creates synthetic minority samples, helping balance the dataset.

Why this answer

Option B is correct because SMOTE generates synthetic samples for the minority class, which is effective for imbalanced datasets. Option C is wrong because accuracy is a misleading metric on imbalanced data: a model that always predicts class 0 would score 99%. Option D is wrong because under-sampling the majority class down to the minority size discards roughly 98% of the dataset.

Option A is wrong because using raw data without handling the imbalance typically leads to poor minority-class performance.

Full explanation →

1004

Multi-Selecteasy

A company wants to monitor SageMaker endpoints for data drift. Which TWO services can be used together to detect and alert on drift?

Select 2 answers

A.SageMaker Data Wrangler

B.SageMaker Model Monitor

C.AWS CodePipeline

D.Amazon CloudWatch Alarms

E.Amazon CloudWatch Logs

AnswersB, D

Model Monitor detects drift by analyzing captured endpoint traffic on a schedule.

Why this answer

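SageMaker Model Monitor detects drift, and CloudWatch Alarms can send alerts on the metrics it emits, so Options B and D are correct. Option A (Data Wrangler) is for data preparation, Option C (CodePipeline) is for CI/CD pipelines, and Option E (CloudWatch Logs) stores logs but does not alert on its own.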

Full explanation →

1005

MCQhard

A machine learning engineer is tuning hyperparameters for a gradient boosting model using Amazon SageMaker Automatic Model Tuning. The objective metric is validation accuracy. After several tuning jobs, the best accuracy achieved is 0.85, but the engineer suspects the model is overfitting. Which hyperparameter adjustment is most likely to reduce overfitting?

A.Increase the regularization parameter (e.g., lambda or alpha)

B.Increase the maximum depth of trees

C.Increase the subsample ratio

D.Increase the learning rate

AnswerA

Regularization penalizes large weights, reducing overfitting.

Why this answer

Option A is correct because increasing the regularization parameter (e.g., lambda or alpha in XGBoost) penalizes model complexity and reduces overfitting. Option D is wrong because a higher learning rate lets each tree fit the training data more aggressively, which tends to worsen overfitting. Option B is wrong because increasing max depth increases model complexity, leading to overfitting.

Option C is wrong because it is decreasing the subsample ratio that can reduce overfitting; increasing it shows each tree more of the training data and removes the randomness that acts as a regularizer.

Full explanation →

1006

Multi-Selecthard

A company is using AWS Glue to catalog data stored in Amazon S3. The data is partitioned by year, month, day, and hour. The company runs hourly ETL jobs that add new partitions. The Glue crawler is scheduled to run every hour to update the Data Catalog. However, the crawler is taking longer than expected and is not completing before the next crawler run starts. Which TWO actions could the company take to resolve this issue?

Select 2 answers

A.Increase the throughput of the crawler by configuring the 'Schema updates' option

B.Enable partition indexing on the table to speed up the crawler

C.Decrease the crawler schedule frequency to every 2 hours to avoid overlapping runs

D.Use multiple crawlers, each configured to crawl a different path (e.g., one for year=2023, one for year=2024)

E.Increase the number of crawler instances by configuring the 'Crawler queue' to process multiple partitions in parallel

AnswersD, E

Splitting the crawl scope across crawlers allows parallel execution.

Why this answer

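Options D and E are correct. Splitting the crawl scope across multiple crawlers lets separate paths be crawled in parallel, and processing multiple partitions in parallel shortens each run.

Option A is wrong because the 'Schema updates' option controls how detected schema changes are applied to the Data Catalog, not crawler throughput. Option B is wrong because partition indexing speeds up queries from engines such as Athena and Redshift Spectrum, not crawler performance. Option C is wrong because decreasing the schedule frequency leaves more new partitions for each run to process, which increases the backlog.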

Full explanation →

1007

Multi-Selectmedium

Which TWO of the following are valid techniques for handling missing values in a dataset for machine learning?

Select 2 answers

A.Replace missing values with the maximum value of the feature

B.Remove rows with missing values

C.Replace missing values with random noise

D.Convert missing values to the string 'missing'

E.Replace missing values with the mean of the feature

AnswersB, E

Dropping rows is acceptable if the data is missing at random and only a few rows are affected.

Why this answer

Option E is correct because mean imputation is a common technique. Option B is correct because dropping rows with missing values is valid. Option A is wrong because using the maximum value introduces bias.

Option C is wrong because adding random noise is not standard. Option D is wrong because converting to a string is not appropriate for numerical features.
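
The two valid techniques can be sketched in a few lines, assuming missing values are represented as `None`. The helper names `impute_mean` and `drop_incomplete_rows` and the sample values are hypothetical.

```python
from statistics import mean

def impute_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

def drop_incomplete_rows(rows):
    """Keep only rows that contain no None entries."""
    return [row for row in rows if all(v is not None for v in row)]

# Hypothetical feature column with one missing value.
incomes = [40.0, None, 60.0, 50.0]
imputed = impute_mean(incomes)

# Hypothetical (id, income) rows; the row with the missing income is dropped.
rows = [(1, 40.0), (2, None), (3, 60.0)]
complete = drop_incomplete_rows(rows)
```

Mean imputation keeps every row at the cost of shrinking the feature's variance, while dropping rows keeps the observed distribution intact but loses data.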

Full explanation →

1008

Multi-Selectmedium

A company is building a data pipeline that uses Amazon Kinesis Data Streams to ingest real-time events. The pipeline then uses AWS Lambda to process the events and store results in Amazon DynamoDB. The company wants to ensure that the Lambda function can process all events without data loss and without duplicating processing. Which TWO configuration steps should the company take?

Select 2 answers

A.Increase the data retention period of the Kinesis stream to 7 days to allow reprocessing

B.Set the Lambda function's batch window to a small value (e.g., 1 second) to reduce processing latency

C.Enable the 'iterator age' metric in Amazon CloudWatch to monitor consumer lag

D.Use a single shard for the Kinesis stream to ensure order and avoid parallel processing issues

E.Configure the Lambda function to disable retries on failure to avoid duplicate processing

AnswersB, C

Small batch window ensures timely processing.

Why this answer

Options B and C are correct. Setting the Lambda batch window to a small value ensures low latency, and enabling the iterator age metric helps monitor lag. Option A is wrong because increasing the retention period does not prevent duplication.

Option D is wrong because using a single shard limits throughput. Option E is wrong because disabling retries could cause data loss.

Full explanation →

1009

MCQeasy

A data analyst is examining a scatter plot of two variables and notices a strong positive correlation. Which of the following is a valid conclusion?

A.The relationship is linear

B.One variable causes the other

C.The two variables are related, but causation cannot be inferred

D.The relationship can be used to accurately predict one variable from the other

AnswerC

Correlation does not imply causation.

Why this answer

Option C is correct because correlation indicates a relationship but does not imply causation. Option B is wrong because correlation does not imply causation. Option D is wrong because correlation alone does not provide a prediction model.

Option A is wrong because a strong positive correlation does not guarantee linearity; a monotonic but non-linear relationship can also produce it.
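
The point about linearity can be checked numerically. The sketch below computes the Pearson coefficient by hand for a hypothetical cubic relationship, so the data and the `pearson_r` helper are illustrative assumptions only.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# y = x^3 is clearly non-linear, yet strongly and positively correlated with x.
xs = list(range(11))
ys = [x ** 3 for x in xs]
r = pearson_r(xs, ys)
print(round(r, 3))
```

The coefficient comes out high but below 1, which is why a strong correlation on its own says nothing about the shape of the relationship.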

Full explanation →

1010

MCQeasy

Refer to the exhibit. A data scientist creates a SageMaker notebook instance using this Terraform configuration. The notebook fails to start. The logs indicate 'The IAM role does not have the necessary permissions'. Which addition to the IAM role policy is MOST likely needed?

A.cloudwatch:PutMetricData

B.s3:GetObject on the notebook bucket

C.sagemaker:CreatePresignedNotebookInstanceUrl

D.sagemaker:CreateTrainingJob

AnswerC

Required for notebook access.

Why this answer

The SageMaker notebook instance requires the `sagemaker:CreatePresignedNotebookInstanceUrl` permission to generate a presigned URL, which is used to access the notebook's Jupyter interface. Without this permission, the notebook fails to start because the IAM role cannot create the necessary URL for the user to connect, as indicated by the 'The IAM role does not have the necessary permissions' log error.

Exam trap

AWS often tests the specific permission required for notebook instance access, and the trap here is that candidates confuse general SageMaker permissions (like training or S3 access) with the precise `CreatePresignedNotebookInstanceUrl` action needed for the notebook to start.

How to eliminate wrong answers

Option A is wrong because `cloudwatch:PutMetricData` is used for publishing custom metrics to CloudWatch, which is not required for starting a SageMaker notebook instance; the notebook startup process does not depend on CloudWatch permissions. Option B is wrong because `s3:GetObject` on the notebook bucket is typically needed for accessing data or artifacts, but the notebook instance itself does not require S3 read access to start; the startup failure is due to missing permissions for generating the presigned URL, not S3 access. Option D is wrong because `sagemaker:CreateTrainingJob` is a permission for launching training jobs, which is unrelated to the notebook instance lifecycle; the notebook startup does not involve creating training jobs.

Full explanation →

1011

MCQmedium

An e-commerce company wants to build a recommendation system. They have user-item interaction data (clicks, purchases) and user demographic data. The goal is to recommend items that a user is likely to purchase. Which approach should be used?

A.Linear regression on user and item features.

B.Collaborative filtering using matrix factorization.

C.Factorization Machines using user-item interactions and user features.

D.Content-based filtering using item features.

AnswerC

Handles sparse data and side features effectively.

Why this answer

Option C is correct because Factorization Machines are designed for high-dimensional sparse data and can handle both user-item interactions and side features. Option B is wrong because collaborative filtering does not naturally incorporate user demographic features. Option D is wrong because content-based filtering relies on item features, which are not part of the available data.

Option A is wrong because linear regression is not suitable for implicit feedback.

Full explanation →

1012

MCQhard

A data engineering team is designing a data lake on Amazon S3. Raw data is ingested in JSON format and must be partitioned by year, month, and day. The team expects high query performance for recent data but infrequent queries for older data. The data is immutable. Which storage tier configuration minimizes costs while meeting performance requirements?

A.Store all data in S3 Standard, then move to S3 Glacier after 30 days using a lifecycle policy

B.Store recent partitions in S3 Standard, older partitions in S3 One Zone-IA

C.Keep all data in S3 Standard because query performance is critical

D.Use S3 Intelligent-Tiering for the entire data lake

AnswerD

Intelligent-Tiering automatically moves data between access tiers based on usage, optimizing cost without retrieval delays.

Why this answer

Using S3 Intelligent-Tiering for the entire data lake automatically optimizes costs by moving data between frequent and infrequent access tiers based on usage patterns, without performance impact. Option A moves all data to S3 Glacier after 30 days, so queries against older data would face retrieval delays. Option C keeps all data in S3 Standard, which is cost-inefficient for older data.

Option B places older partitions in S3 One Zone-IA, which is not cost-optimal for infrequent queries and stores data in a single Availability Zone, raising durability concerns.

Full explanation →

1013

Multi-Selecteasy

A data scientist is performing feature engineering for a machine learning model. The dataset contains categorical features with high cardinality. Which THREE techniques are appropriate for encoding high-cardinality categorical features?

Select 3 answers

A.Target encoding

B.Binary encoding

C.Label encoding

D.Count encoding

E.One-hot encoding with pruning of rare categories

AnswersA, D, E

Replaces category with target mean.

Why this answer

Option A is correct because target encoding replaces categories with the mean target value. Option D is correct because count encoding replaces categories with frequency counts. Option E is correct because one-hot encoding can be used if the number of categories is manageable after pruning.

Option C is wrong because label encoding implies an ordinal relationship, which is not ideal for nominal high-cardinality features. Option B is not among the keyed answers: binary encoding is also a workable option, but the question asks for THREE, and target, count, and pruned one-hot encoding are the most common choices.
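
Count and target encoding are simple enough to sketch without a library. The helpers and the toy `cities` / `purchased` data below are hypothetical; in practice the target means would be computed on training folds only, to avoid leaking the label.

```python
from collections import Counter, defaultdict

def count_encode(categories):
    """Replace each category with how often it occurs."""
    counts = Counter(categories)
    return [counts[c] for c in categories]

def target_encode(categories, targets):
    """Replace each category with the mean target value of that category."""
    counts = Counter(categories)
    sums = defaultdict(float)
    for c, t in zip(categories, targets):
        sums[c] += t
    means = {c: sums[c] / counts[c] for c in counts}
    return [means[c] for c in categories]

# Hypothetical high-cardinality feature (shortened) and binary target.
cities = ["NYC", "SF", "NYC", "LA", "NYC", "SF"]
purchased = [1, 0, 1, 0, 0, 1]

count_encoded = count_encode(cities)
target_encoded = target_encode(cities, purchased)
```

Both encodings produce a single numeric column regardless of how many distinct categories exist, which is what makes them attractive for high cardinality.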

Full explanation →

1014

MCQeasy

A data scientist wants to understand the distribution of a continuous feature before training a model. Which visualization is most appropriate?

A.Scatter plot

B.Box plot

C.Histogram

D.Bar chart

AnswerC

Histograms display the frequency distribution of a continuous variable.

Why this answer

A histogram is the standard tool for showing the distribution of a single continuous variable. Option A is wrong because scatter plots compare two variables. Option D is wrong because bar charts are for categorical data.

Option B is wrong because box plots show summary statistics, not the full distribution shape.

Full explanation →

1015

Multi-Selectmedium

A team is building a regression model to predict house prices. They observe that the model performs well on training data but poorly on validation data. Which THREE actions can help reduce overfitting? (Choose THREE.)

Select 3 answers

A.Reduce model complexity by selecting fewer features

B.Increase regularization strength (e.g., L1, L2)

C.Collect more training data if possible

D.Increase the maximum depth of decision trees

E.Add more interaction features

AnswersA, B, C

Simpler models generalize better.

Why this answer

Option B (increase regularization) penalizes large coefficients. Option A (reduce model complexity) means using fewer features or a simpler algorithm. Option C (collect more training data) helps generalization.

Option D (increase tree depth) increases overfitting. Option E (add interaction features) adds more features and is unlikely to reduce overfitting.

Full explanation →

1016

MCQeasy

A data scientist is analyzing a dataset with 10,000 rows and 50 columns. The target variable is binary. Which technique is most appropriate for identifying the most important features for predicting the target?

A.Use t-SNE to reduce dimensionality and inspect clusters

B.Run K-means clustering and examine cluster centroids

C.Train a Random Forest classifier and use feature_importances_

D.Apply PCA and select components with highest variance

AnswerC

Random Forest provides feature importance scores based on impurity reduction.

Why this answer

Option C is correct because Random Forest feature importance is a well-known method for ranking features in classification tasks. Option D is wrong because PCA is unsupervised and does not use the target. Option B is wrong because K-means is clustering, not feature selection.

Option A is wrong because t-SNE is for visualization, not feature importance.

Full explanation →

1017

MCQeasy

Refer to the exhibit. A data scientist checks the status of a SageMaker endpoint and sees the output above. What does this indicate?

A.The endpoint has failed

B.The endpoint is running at full capacity

C.The endpoint is out of service

D.The endpoint is scaling up to meet desired capacity

AnswerD

Current is less than desired, so scaling up.

Why this answer

Option D is correct because the endpoint is InService but the current instance count (2) is less than the desired count (5), indicating scaling is in progress. Option C is wrong because the status is InService, not OutOfService. Option B is wrong because the endpoint is running below its desired capacity, not at full capacity.

Option A is wrong because the endpoint has not failed.
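
The same reading of the output can be expressed as a small check. This sketch assumes a response shaped like the SageMaker `DescribeEndpoint` output, with `EndpointStatus` and per-variant `CurrentInstanceCount` / `DesiredInstanceCount` fields; the `sample` dict is a hypothetical stand-in for the exhibit, and `describe_scaling_state` is an illustrative helper.

```python
def describe_scaling_state(endpoint_description):
    """Summarize an endpoint description as a short scaling state string."""
    status = endpoint_description["EndpointStatus"]
    if status != "InService":
        return status
    for variant in endpoint_description["ProductionVariants"]:
        current = variant["CurrentInstanceCount"]
        desired = variant["DesiredInstanceCount"]
        if current < desired:
            return "scaling up"
        if current > desired:
            return "scaling down"
    return "at desired capacity"

# Hypothetical response mirroring the counts quoted in the explanation.
sample = {
    "EndpointStatus": "InService",
    "ProductionVariants": [
        {
            "VariantName": "AllTraffic",
            "CurrentInstanceCount": 2,
            "DesiredInstanceCount": 5,
        }
    ],
}
state = describe_scaling_state(sample)
```

With a current count of 2 and a desired count of 5, the helper reports that the endpoint is scaling up.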

Full explanation →

1018

Multi-Selectmedium

Which THREE techniques are commonly used for feature engineering in exploratory data analysis? (Select THREE.)

Select 3 answers

A.Extracting date/time components like day of week or hour.

B.Using principal component analysis (PCA) to create new features.

C.Applying one-hot encoding to numerical features.

D.Creating interaction features between variables.

E.Binning continuous variables into discrete intervals.

AnswersA, D, E

Temporal features often reveal patterns.

Why this answer

Option A is correct because extracting date/time components such as day of week, hour, or month from a timestamp is a standard feature engineering technique. It transforms a single datetime column into multiple categorical or cyclical features that can reveal temporal patterns like weekly seasonality or peak hours, which are often critical for time-series models.

Exam trap

AWS often tests the distinction between feature engineering (creating new features from existing data) and dimensionality reduction (PCA) or encoding (one-hot encoding), leading candidates to mistakenly select PCA as a feature engineering technique when it is actually a preprocessing step for reducing feature space.
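
A short sketch of date/time extraction, binning, and a product interaction, using only the standard library. The timestamp, bin edges, and helper names are hypothetical choices for illustration.

```python
from bisect import bisect_right
from datetime import datetime

def datetime_features(ts):
    """Split one timestamp into separate calendar features."""
    return {
        "day_of_week": ts.weekday(),      # Monday is 0, Sunday is 6
        "hour": ts.hour,
        "month": ts.month,
        "is_weekend": ts.weekday() >= 5,
    }

def bin_index(value, edges):
    """Return the index of the interval that `value` falls into."""
    return bisect_right(edges, value)

order_time = datetime(2024, 3, 16, 14, 30)    # hypothetical timestamp
features = datetime_features(order_time)

age_bin = bin_index(42, edges=[18, 35, 60])   # hypothetical bin edges
interaction = features["hour"] * age_bin      # simple product interaction
```

Each step derives a new column from existing ones, which is what separates feature engineering from steps that only re-encode or compress the feature space.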

Full explanation →

1019

MCQeasy

A data scientist trains a convolutional neural network (CNN) for image classification. The training loss decreases steadily, but the validation loss starts increasing after 10 epochs. Which technique should the data scientist use to address this problem?

A.Add more data augmentation to the training set.

B.Use early stopping to halt training when validation loss stops decreasing.

C.Increase the number of training epochs.

D.Add more convolutional layers to increase model capacity.

E.Increase the learning rate.

AnswerB

Early stopping prevents overfitting by stopping at the optimal point.

Why this answer

Option B is correct because early stopping halts training when validation loss stops improving, preventing overfitting. Option A (more data augmentation) can help generalization but does not directly address the overfitting already observed. Option C (more epochs) would extend training further past the point where validation loss began to rise.

Option E (increase learning rate) may cause divergence. Option D (more convolutional layers) increases model complexity, likely worsening overfitting.
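
Early stopping is usually implemented with a patience counter. The sketch below is one common formulation, not the behavior of any specific framework; `early_stopping_epoch` and the validation losses are hypothetical.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return the 1-based epoch at which training would stop, or None."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None

# Hypothetical validation losses: improving until epoch 4, then rising.
val_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.56, 0.61]
stop_epoch = early_stopping_epoch(val_losses, patience=2)
```

Training frameworks typically also restore the weights from the best epoch (epoch 4 in this toy run) rather than keeping the weights from the epoch where training stopped.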

Full explanation →

1020

Multi-Selecthard

A data scientist is exploring a dataset with mixed data types (numeric, categorical, text). The dataset has 5 million rows. The scientist wants to understand the relationships between variables and identify potential data quality issues. Which THREE tools are suitable for this analysis?

Select 3 answers

A.AWS Glue DataBrew

B.AWS Data Pipeline

C.Amazon SageMaker Data Wrangler

D.Amazon Athena

E.Amazon Kinesis Data Analytics

AnswersA, C, D

Data profiling and visualization.

Why this answer

Options A, C, and D are correct. AWS Glue DataBrew can profile data, visualize distributions, and detect anomalies. Amazon SageMaker Data Wrangler provides interactive data preparation and visualization.

Amazon Athena can be used to run SQL queries for data quality checks. Option B is wrong because AWS Data Pipeline is for workflow orchestration, not EDA. Option E is wrong because Amazon Kinesis Data Analytics is for streaming data, not batch EDA.

Full explanation →

1021

MCQhard

A company is building a recommendation system using collaborative filtering on Amazon SageMaker. The dataset contains user-item interactions with a long-tail distribution: a few items have millions of interactions, while most items have very few. The model currently uses matrix factorization with ALS. The recall@20 metric is low for niche items. Which modification would most likely improve recall for long-tail items?

A.Increase the regularization parameter to prevent overfitting

B.Add explicit features like item category and user demographics

C.Increase the number of latent factors in the matrix

D.Use implicit feedback with confidence weighting to downweight popular items

AnswerD

Confidence weighting reduces the influence of overly popular items, allowing the model to learn patterns for niche items.

Why this answer

Implicit feedback models can incorporate confidence weights that downweight popular items, helping the model focus on less frequent items. Adding explicit features would not directly address the long-tail. Increasing the number of factors might help but could also overfit.

Regularization is already present; adjusting it might not target the issue specifically.

Full explanation →

1022

MCQhard

A company is using SageMaker to host a model that performs real-time fraud detection. The model receives high request volumes with occasional spikes. The company wants to ensure that the endpoint can handle spikes without throttling while minimizing cost. Which scaling strategy should be used?

A.Use a target tracking scaling policy with a target value of 70% for the SageMakerVariantInvocationsPerInstance metric.

B.Use a simple scaling policy with a step adjustment based on the InvocationsPerInstance metric.

C.Manually adjust the instance count based on monitoring dashboards.

D.Use a scheduled scaling action to add instances during peak hours.

AnswerA

Automatically scales based on utilization.

Why this answer

A target tracking scaling policy with the SageMakerVariantInvocationsPerInstance metric is the correct choice because it automatically adjusts the instance count to maintain a target utilization (e.g., 70%), handling spikes without manual intervention while minimizing cost by scaling down during low traffic. This is the recommended approach for real-time endpoints with variable traffic, as it aligns with AWS best practices for dynamic scaling.

Exam trap

The trap here is that candidates often confuse simple scaling (step adjustments) with target tracking, assuming any metric-based policy works, but target tracking is specifically designed for maintaining a utilization target and is the only option that handles irregular spikes without manual or scheduled intervention.

How to eliminate wrong answers

Option B is wrong because simple scaling policies with step adjustments require predefined thresholds and cooldown periods, which can lead to over-provisioning or under-provisioning during sudden spikes, lacking the smooth, proportional response of target tracking. Option C is wrong because manually adjusting instance count based on dashboards is reactive, error-prone, and cannot handle rapid spikes without causing throttling or waste, defeating the goal of cost minimization. Option D is wrong because scheduled scaling only works for predictable traffic patterns, not for occasional spikes that occur at irregular times, leading to either throttling during unscheduled surges or unnecessary cost during off-peak hours.
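
A target tracking policy for an endpoint variant is configured through Application Auto Scaling. The sketch below only builds the request parameters one would pass to the `register_scalable_target` and `put_scaling_policy` calls of the boto3 `application-autoscaling` client; it makes no AWS call. The endpoint name, variant name, capacity limits, cooldowns, and policy name are hypothetical, and the target value is an average invocations-per-instance figure rather than a percentage.

```python
def target_tracking_requests(endpoint_name, variant_name, target_value,
                             min_capacity, max_capacity):
    """Build request parameters for a SageMaker variant target tracking policy."""
    resource_id = f"endpoint/{endpoint_name}/variant/{variant_name}"
    dimension = "sagemaker:variant:DesiredInstanceCount"
    register = {
        "ServiceNamespace": "sagemaker",
        "ResourceId": resource_id,
        "ScalableDimension": dimension,
        "MinCapacity": min_capacity,
        "MaxCapacity": max_capacity,
    }
    policy = {
        "PolicyName": f"{variant_name}-invocations-target",  # hypothetical name
        "ServiceNamespace": "sagemaker",
        "ResourceId": resource_id,
        "ScalableDimension": dimension,
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            "TargetValue": float(target_value),
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance",
            },
            "ScaleInCooldown": 300,   # assumed cooldowns, in seconds
            "ScaleOutCooldown": 60,
        },
    }
    return register, policy

# Hypothetical endpoint and variant names.
register, policy = target_tracking_requests(
    "fraud-endpoint", "AllTraffic", target_value=70,
    min_capacity=2, max_capacity=10,
)
```

A short scale-out cooldown paired with a longer scale-in cooldown is a common choice for spiky traffic: capacity is added quickly and released cautiously.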

Full explanation →

1023

Multi-Selecthard

Which THREE techniques help reduce overfitting in a neural network? (Select THREE.)

Select 3 answers

A.Dropout

B.L2 Regularization

C.Increasing the number of layers

D.Using a larger batch size

E.Early Stopping

AnswersA, B, E

Dropout is a regularization technique that reduces overfitting.

Why this answer

Dropout randomly drops units during training, L2 regularization penalizes large weights, and early stopping halts training when validation error increases. Data augmentation can also help but is not listed. Batch normalization may help but primarily for training stability.

Full explanation →

1024

Multi-Selectmedium

A data scientist is training a binary classification model to predict customer churn. The dataset has 10,000 samples with 500 churners (5% positive class). Which TWO techniques should the scientist use to address the class imbalance? (Choose TWO.)

Select 2 answers

A.Use SMOTE to oversample the minority class

B.Tune the decision threshold after training

C.Randomly undersample the majority class to match minority size

D.Oversample the minority class by duplicating existing samples

E.Set class_weight='balanced' in the classifier

AnswersA, E

SMOTE creates synthetic samples to balance classes.

Why this answer

Option A (SMOTE) generates synthetic samples for the minority class. Option E (class_weight='balanced') adjusts loss function weights. Option C (undersampling the majority class) can be used but is not always preferred; Option D (oversampling by duplicating samples) may cause overfitting; Option B (threshold tuning) is a post-training step.
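
The `balanced` heuristic in scikit-learn weights each class by `n_samples / (n_classes * class_count)`. The sketch below reproduces that formula in plain Python for the churn dataset described in the question; `balanced_class_weights` is a hypothetical helper, not the library function.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {
        cls: n_samples / (n_classes * count)
        for cls, count in counts.items()
    }

# 10,000 samples with 500 churners (label 1), as in the question.
labels = [1] * 500 + [0] * 9500
weights = balanced_class_weights(labels)
print(weights)
```

Each churner ends up weighted 19 times as heavily as a non-churner, which offsets the 19:1 class ratio in the loss function.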

Full explanation →

1025

Multi-Selectmedium

A data engineer needs to transform and move 2 TB of data from an Amazon RDS for PostgreSQL instance to Amazon S3 daily. The transformation includes filtering, joining with data in S3, and aggregating. Which AWS services can be used together to accomplish this with minimal operational overhead? (Choose THREE.)

Select 3 answers

A.Amazon EMR

B.Amazon Redshift

C.Amazon S3

D.AWS Glue Data Catalog

E.AWS Glue

AnswersC, D, E

Target storage for transformed data.

Why this answer

Option E (Glue) is the ETL service, Option C (S3) is the target, and Option D (Glue Data Catalog) manages metadata. Option A (EMR) adds operational overhead, and Option B (Redshift) is a data warehouse that is not needed here.

Full explanation →

1026

MCQeasy

A data scientist is analyzing a dataset with 100 features and wants to identify which features are most correlated with the target variable. Which AWS service is most appropriate for this task?

A.Amazon QuickSight

B.Amazon Athena

C.AWS Glue DataBrew

D.Amazon SageMaker Data Wrangler

AnswerD

Data Wrangler provides data analysis and feature correlation within SageMaker Studio.

Why this answer

Amazon SageMaker Data Wrangler provides built-in data analysis and visualization capabilities, including correlation analysis, making it suitable for this task. Option A (Amazon QuickSight) is a BI tool for dashboards, not embedded data wrangling. Option C (AWS Glue DataBrew) is a data preparation service in the AWS Glue family rather than a SageMaker Studio feature.

Option B (Amazon Athena) is for querying data in S3.

Full explanation →

1027

Multi-Selecteasy

Which TWO actions are best practices for securing a SageMaker notebook instance? (Select TWO.)

Select 2 answers

A.Disable direct internet access for the notebook instance.

B.Enable root access for users to install packages.

C.Launch the notebook instance in a private subnet in a VPC.

D.Store data in the notebook's local storage for performance.

E.Use a shared IAM user for all data scientists.

AnswersA, C

Disabling internet access prevents data exfiltration.

Why this answer

Best practices include using VPC to isolate the notebook, enabling encryption at rest, using IAM roles with least privilege, and disabling direct internet access. Direct internet access should be disabled to prevent data exfiltration. Root access should be disabled for notebook instances.

Full explanation →

1028

MCQhard

A data scientist is trying to create a training job named 'test-model' using an IAM role with the attached policy. The creation fails with an AccessDenied error. What is the most likely cause?

A.The Resource is set to '*' and should be specific.

B.The Deny statement uses 'StringNotEquals' which should be 'StringEquals'.

C.The IAM role does not have permission to assume the SageMaker execution role.

D.The Deny statement uses a wildcard '*' in the condition value, which is not supported for StringNotEquals.

AnswerD

Wildcards are not supported in StringNotEquals conditions, causing unexpected denial.

Why this answer

The Deny statement uses 'StringNotEquals' with a wildcard '*' in the condition value, which the 'StringNotEquals' operator does not support. 'StringNotEquals' performs exact string matching and treats '*' as a literal character; wildcard matching requires 'StringNotLike'. Because a job name such as 'test-model' can never be exactly equal to a pattern string containing a literal '*', the condition always evaluates to true, the Deny statement applies, and the request fails with AccessDenied.

Exam trap

The trap here is that candidates may assume 'StringNotEquals' supports wildcards like 'StringNotLike' does, or they may focus on the Resource wildcard (Option A) as the obvious cause, missing the subtle condition operator mismatch.

How to eliminate wrong answers

Option A is wrong because setting the Resource to '*' is generally acceptable for service-linked roles or broad permissions, and the error is specifically about an AccessDenied due to a policy condition issue, not resource specificity. Option B is wrong because 'StringNotEquals' is a valid condition operator; the issue is not the operator itself but the use of a wildcard in its value, which is unsupported. Option C is wrong because the IAM role's ability to assume the SageMaker execution role is a separate permission (sts:AssumeRole) and not directly related to the training job creation failure caused by the Deny statement's condition syntax.

Full explanation →

1029

MCQmedium

A data engineer needs to transfer 50 TB of historical data from an on-premises Hadoop cluster to Amazon S3. The on-premises network has a 100 Mbps connection to AWS. The transfer must be completed within one week. Which approach should the engineer use?

A.Use AWS Snowball Edge device to physically transfer the data.

B.Use Amazon S3 Transfer Acceleration.

C.Use AWS DataSync to transfer the data over the network.

D.Use multiple concurrent AWS CLI copy commands over VPN.

AnswerA

Snowball Edge can handle large data volumes without network limitations.

Why this answer

AWS Snowball Edge is a physical device that moves large data volumes without depending on the network. Transferring 50 TB over a 100 Mbps link would take about 46 days even at full line rate, far exceeding the one-week requirement. Option C is wrong because AWS DataSync still uses the network. Option B is wrong because S3 Transfer Acceleration optimizes the route to S3 but cannot exceed the 100 Mbps link.

Option D is wrong because VPN is not designed for bulk data transfer, and concurrent CLI copies are still limited by the same 100 Mbps link.
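
The network estimate can be checked with simple arithmetic. The sketch assumes decimal units (1 TB = 10^12 bytes, 1 Mbps = 10^6 bits per second) and a fully saturated link with no protocol overhead; `transfer_days` is a hypothetical helper.

```python
def transfer_days(terabytes, megabits_per_second):
    """Days needed to move `terabytes` over a link at the given line rate."""
    bits = terabytes * 1e12 * 8                     # decimal terabytes to bits
    seconds = bits / (megabits_per_second * 1e6)    # line rate in bits/second
    return seconds / 86_400                         # seconds per day

days = transfer_days(50, 100)
print(f"{days:.1f} days")
```

Even this best-case figure is well over a month, so no network-based option can meet a one-week deadline on this link.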

Full explanation →

1030

MCQhard

A machine learning engineer is tuning a gradient boosting model using SageMaker Hyperparameter Tuning. The objective is to minimize MAE. The tuning job uses 20 training jobs. After 10 jobs, the best objective value is 5.2. Which action should the engineer take to potentially improve the result?

A.Set early stopping to avoid overfitting.

B.Change the objective metric to RMSE.

C.Increase the total number of training jobs to 50.

D.Switch the tuning strategy from Bayesian to Random search.

AnswerC

More jobs allow broader exploration and may find a better configuration.

Why this answer

Option C is correct because increasing the total number of training jobs from 20 to 50 gives the Bayesian optimization algorithm more opportunities to explore the hyperparameter space and exploit promising regions. With only 10 jobs completed, the tuning job may not have converged to the global minimum of MAE, and additional jobs can refine the search, especially since Bayesian search builds a probabilistic model that improves with more observations.

Exam trap

The trap here is that candidates mistakenly think early stopping (Option A) applies to the tuning job itself rather than to individual training jobs, or they assume changing the metric (Option B) will indirectly improve MAE, when in fact the tuning job's objective must directly match the business metric.

How to eliminate wrong answers

Option A is wrong because early stopping is a technique to halt training of a single model when validation performance stops improving, not a mechanism to improve the tuning job's best objective value; it prevents overfitting per job but does not help the hyperparameter search find a better configuration. Option B is wrong because changing the objective metric to RMSE would optimize for a different loss function, which contradicts the stated goal of minimizing MAE and could lead to a model that performs worse on the actual target metric. Option D is wrong because switching from Bayesian to Random search would discard the information already gathered from the first 10 jobs, likely reducing sample efficiency and making it harder to find a better result within the remaining budget.

Full explanation →

1031

Multi-Selectmedium

A data scientist is training a random forest model for a binary classification task. The dataset has 100,000 samples and 500 features. The model is overfitting. Which TWO actions are MOST likely to reduce overfitting?

Select 2 answers

A.Increase the number of trees in the forest

B.Reduce the maximum depth of each tree

C.Increase the number of features considered at each split

D.Use all features for each tree

E.Increase the minimum number of samples required to split an internal node

AnswersB, E

Shorter trees are simpler and less likely to overfit.

Why this answer

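Reducing the maximum depth of each tree (Option B) limits the complexity of individual trees, preventing them from memorizing noise and specific patterns in the training data. This is a standard regularization technique for random forests that directly combats overfitting by controlling the variance of the model. Increasing the minimum number of samples required to split an internal node (Option E) has a similar effect: nodes with few samples are not split further, so the trees stay simpler.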

Exam trap

AWS often tests the misconception that adding more trees always reduces overfitting, but the trap here is that without controlling tree complexity (depth or split criteria), more trees can still produce an overfit ensemble, especially when individual trees are allowed to grow unchecked.

Full explanation →

1032

MCQeasy

A machine learning engineer is deploying a model that was trained on a large dataset stored in Amazon S3. The model needs to be retrained daily with new data. Which approach is the MOST cost-effective for storing the training data while allowing quick access for retraining?

A.Store all data in S3 Standard

B.Use S3 Glacier Deep Archive

C.Use S3 Intelligent-Tiering

D.Use S3 One Zone-IA

AnswerC

Intelligent-Tiering automatically optimizes costs for data with changing access patterns.

Why this answer

Option C is correct because S3 Intelligent-Tiering automatically moves data between access tiers to optimize costs, and it provides low-latency access. Option A is wrong because S3 Standard is more expensive for data that is not frequently accessed. Option B is wrong because S3 Glacier Deep Archive is for long-term archival, not for data that needs to be accessed daily.

Option D is wrong because S3 One Zone-IA is cheaper but may not be suitable for critical data and still costs more than Intelligent-Tiering for varying access patterns.

Full explanation →

1033

Multi-Selectmedium

A data scientist is using Amazon SageMaker to train a model and wants to track experiments, including parameters and metrics. Which THREE actions should be taken? (Choose three.)

Select 3 answers

A.Use SageMaker Studio to manually record experiments.

B.Use Amazon CloudWatch Logs to store experiment data.

C.Create an experiment in SageMaker Experiments.

D.Use the SageMaker SDK to log parameters and metrics in the training script.

E.Use the SageMaker SDK to create a trial and trial component.

AnswersC, D, E

Experiments organize runs.

Why this answer

Options C, D, and E are correct. SageMaker Experiments tracks parameters and metrics: creating an experiment sets up the tracking, creating a trial and trial component organizes each run, and using the SageMaker SDK in the training script logs the parameters and metrics. Option B is wrong because CloudWatch is for monitoring and is not designed for experiment tracking.

Option A is wrong because SageMaker Studio is the interface, not the tracking mechanism itself.

Full explanation →

1034

MCQhard

A machine learning team is using SageMaker Processing jobs to run feature engineering on large datasets. The job takes a long time to complete. Which change would most likely reduce the processing time?

A.Increase the number of instances in the processing cluster

B.Switch to local mode to avoid network overhead

C.Change the processing script from Python to PySpark

D.Use a larger instance type, e.g., from r5.xlarge to r5.24xlarge

AnswerA

More instances allow parallel processing, reducing overall time.

Why this answer

Option A is correct: Increasing the instance count allows distributed processing, reducing time. Option D (larger instance) helps but is less scalable. Option B (local mode) is for testing, not production.

Option C (changing framework) may not improve performance.

Full explanation →

1035

MCQmedium

A data scientist is using Amazon SageMaker to train a linear regression model. The training job fails with the error: 'AlgorithmError: Input data has NaN values'. Which step should the data scientist take to resolve this issue?

A.Convert the data to a sparse format

B.Switch to a different algorithm that handles missing values

C.Impute missing values or remove rows with NaN values

D.Increase the number of training instances

AnswerC

Handling missing values by imputation or removal resolves the NaN error.

Why this answer

The error indicates NaN values in the input data. The correct action (Option C) is to handle missing values before training, either by imputing them or by removing the affected rows. Option D is wrong because increasing the instance count does not fix data issues.

Option B is wrong because the error is data-related, not algorithm-related. Option A is wrong because the issue is NaN values, not data format.
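The two remedies named in the answer, removing rows with NaN values and imputing them, can be sketched in plain Python. This is an illustrative sketch only: the helper names and the sample rows are hypothetical, mean imputation is just one possible strategy, and the code assumes every column has at least one non-NaN value.

```python
import math

def drop_nan_rows(rows):
    """Keep only rows in which no value is NaN."""
    return [r for r in rows if not any(math.isnan(v) for v in r)]

def impute_column_means(rows):
    """Replace each NaN with the mean of the non-NaN values in its column."""
    n_cols = len(rows[0])
    means = []
    for c in range(n_cols):
        present = [r[c] for r in rows if not math.isnan(r[c])]
        means.append(sum(present) / len(present))
    return [[means[c] if math.isnan(v) else v for c, v in enumerate(r)]
            for r in rows]

# Hypothetical feature rows with two missing entries.
rows = [
    [1.0, 10.0],
    [float("nan"), 20.0],
    [3.0, float("nan")],
]

dropped = drop_nan_rows(rows)
imputed = impute_column_means(rows)
print(dropped)
print(imputed)
```

Dropping rows is simpler but discards data; imputation keeps every row at the cost of introducing estimated values.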

Full explanation →

1036

MCQmedium

A company uses SageMaker to host a model for real-time predictions. The model is updated weekly. To minimize downtime during model updates, what should the company do?

A.Create a new endpoint configuration with the new model and update the endpoint to use the new configuration

B.Create a second endpoint with the new model and use an Application Load Balancer to route traffic

C.Update the existing endpoint configuration with the new model URL

D.Delete the existing endpoint and create a new one with the updated model

AnswerA

SageMaker supports blue/green deployment by updating endpoint to new configuration, minimizing downtime.

Why this answer

Option A is correct: Creating a new endpoint configuration with the new model and updating the endpoint to use it lets SageMaker perform a blue/green deployment, which minimizes downtime. Option D (delete and recreate) causes downtime. Option C (update the existing endpoint configuration directly) causes brief downtime.

Option B (multiple endpoints with a load balancer) works but is more complex than SageMaker's built-in blue/green.

Full explanation →

1037

MCQhard

A data scientist is training a deep learning model on a large dataset using Amazon SageMaker. The training job is taking too long. The scientist notices that GPU utilization is low and data loading is the bottleneck. Which action should the scientist take to improve training performance?

A.Increase the number of training instances

B.Use Pipe mode for the training data channel

C.Change the instance type to a CPU instance

D.Reduce the batch size

AnswerB

Pipe mode streams data, reducing I/O bottleneck.

Why this answer

Option B is correct. Low GPU utilization with a data loading bottleneck indicates that data is not being fed to the GPU fast enough. Pipe mode streams data directly from S3 instead of downloading it to local storage first, reducing I/O overhead.

Option A is wrong because adding instances does not address the data loading bottleneck when each GPU is already underutilized. Option C is wrong because a CPU instance would be slower.

Option D is wrong because a smaller batch size would reduce GPU utilization further.

Full explanation →

1038

MCQmedium

A data scientist is training a deep learning model using TensorFlow on Amazon SageMaker. The training job uses a single GPU instance but the GPU utilization is low. Which action is MOST likely to improve GPU utilization?

A.Increase the batch size

B.Use a smaller instance type

C.Add more features

D.Decrease the number of epochs

AnswerA

Larger batch size better utilizes GPU.

Why this answer

Increasing the batch size allows the GPU to process more data in parallel per training step, which keeps the GPU compute units busier and reduces idle time. In TensorFlow on SageMaker, a small batch size can cause the GPU to finish computation quickly and then wait for the next batch to be loaded, leading to low utilization. This is the most direct way to improve GPU throughput without changing the instance or model architecture.

Exam trap

The trap here is that candidates confuse low GPU utilization with overfitting or model complexity, leading them to choose options like adding features or reducing epochs, when the real issue is underutilization of parallel compute resources due to insufficient batch size.

How to eliminate wrong answers

Option B is wrong because using a smaller instance type would reduce GPU compute capacity, likely worsening utilization and increasing training time. Option C is wrong because adding more features increases the input dimensionality, which may increase computation per sample but does not address the root cause of low GPU utilization (insufficient parallelism). Option D is wrong because decreasing the number of epochs reduces total training time but does not affect how efficiently the GPU is used during each step; utilization per step remains unchanged.

Full explanation →

1039

Multi-Selectmedium

A company is using Amazon Kinesis Data Streams with 10 shards to ingest clickstream data. Each record is approximately 50 KB. The data is consumed by a Lambda function that writes to DynamoDB. The Lambda function is experiencing throttling errors. Which TWO actions should the data engineer take to resolve the issue? (Choose TWO.)

Select 2 answers

A.Increase the record size to 1 MB to reduce the number of records

B.Switch to Kinesis Data Firehose instead of Data Streams

C.Request a limit increase for the Lambda function's concurrent execution limit

D.Increase the number of shards in the Kinesis stream

E.Increase the batch size in the Lambda event source mapping

AnswersC, E

This directly alleviates throttling by allowing more concurrent executions.

Why this answer

Option D (increase shards) increases throughput but may not solve Lambda throttling. Option C (increase the Lambda concurrency limit) directly addresses throttling. Option E (increase batch size) reduces the number of Lambda invocations.

Option B (use Firehose) changes the architecture. Option A (increase record size) is irrelevant. The correct answers are C and E: C raises the ceiling on concurrent Lambda executions, and E lets each invocation process more records, so fewer invocations are needed.
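The effect of a larger batch size on invocation count is simple arithmetic, sketched below. The backlog figure and batch sizes are hypothetical, and the helper assumes every invocation receives a full batch except possibly the last; real event source mappings are also bounded by payload size and batching windows, which this sketch ignores.

```python
import math

def invocations_for_backlog(backlog_records, batch_size):
    """Invocations needed to drain a backlog on one shard, assuming
    full batches except possibly the last one."""
    return math.ceil(backlog_records / batch_size)

backlog = 1200  # hypothetical number of records waiting on one shard

small = invocations_for_backlog(backlog, batch_size=10)
large = invocations_for_backlog(backlog, batch_size=100)
print(small, large)
```

Under these assumptions, raising the batch size tenfold cuts the number of invocations tenfold, which lowers pressure on the concurrency limit.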

Full explanation →

1040

MCQhard

A company is using Amazon SageMaker Ground Truth to create a labeled dataset for object detection. The labeling job is taking longer than expected. The team notices that many workers are spending a lot of time on images with no objects. Which labeling strategy should they use to reduce costs and time?

A.Use a private workforce instead of public.

B.Create a pre-labeling task where workers only identify if an object exists, then send only positive images for full labeling.

C.Use automated data labeling with a pre-trained model to filter empty images.

D.Increase the number of workers per dataset object.

AnswerB

This two-stage approach reduces work on empty images.

Why this answer

Workers are spending most of their time on images that contain no objects, so the most direct fix is the two-stage approach in Option B: a quick first task in which workers only indicate whether an image contains an object, followed by full object detection labeling on the positive images only. Ground Truth also supports automated data labeling (Option C), which can reduce the number of images sent to human workers.

Options A and D do not reduce the time spent on empty images, and adding workers per dataset object (Option D) increases cost.

Full explanation →

1041

Multi-Selectmedium

Which TWO actions can reduce inference latency for a SageMaker real-time endpoint? (Choose 2.)

Select 2 answers

A.Choose a larger instance type with more compute capacity.

B.Add more instances behind the endpoint.

C.Use batch transform instead.

D.Compile the model using SageMaker Neo.

E.Switch to asynchronous inference.

AnswersA, D

More compute reduces per-request latency.

Why this answer

Using a larger instance with more CPU/memory (Option A) and compiling the model with SageMaker Neo (Option D) both reduce latency. Option B (adding instances) increases throughput but not per-request latency. Option E (asynchronous inference) is for a different use case.

Option C (batch transform) is for offline inference.

Full explanation →

1042

Multi-Selectmedium

A company runs a data lake on Amazon S3 with AWS Glue for ETL. The data science team needs to train machine learning models on historical data, but they are concerned about data quality issues such as missing values, duplicates, and outliers. The team wants to build a data quality monitoring solution that automatically detects anomalies and alerts the data engineering team. Which THREE steps should the team take to implement this solution? (Choose THREE.)

Select 3 answers

A.Use AWS Glue DataBrew to create data quality rules that check for missing values, duplicates, and outliers, and schedule them to run regularly.

B.Use Amazon Kinesis Data Analytics to continuously monitor streaming data for data quality issues.

C.Create Amazon CloudWatch alarms based on the data quality metrics and trigger Amazon SNS notifications when thresholds are breached.

D.Implement the Deequ library on Amazon EMR to compute data quality metrics and store them in Amazon CloudWatch.

E.Use Amazon SageMaker Processing jobs to run custom data quality scripts and store results in SageMaker Experiments.

AnswersA, C, D

DataBrew has built-in data quality functionalities.

Why this answer

Option A is correct because AWS Glue DataBrew provides built-in data quality checks and profiling. Option D is correct because Deequ is an open-source library that can run on Amazon EMR or Glue to compute data quality metrics. Option C is correct because CloudWatch alarms can be set on custom metrics to send alerts via SNS.

Option E is incorrect because SageMaker is for model training, not data quality monitoring. Option B is incorrect because Kinesis Data Analytics is for real-time streaming analytics, not batch data quality checks.

Full explanation →

1043

MCQhard

A data scientist is training a deep learning model on Amazon SageMaker using a large dataset stored in S3. The training job is taking too long due to high I/O latency waiting for data to be downloaded from S3. Which action would MOST effectively reduce the I/O latency?

A.Use File mode for the training channel

B.Increase the number of training instances

C.Use Pipe mode for the training channel

D.Use Amazon SageMaker Elastic Inference

AnswerC

Pipe mode streams data directly from S3, reducing disk I/O and latency.

Why this answer

Pipe mode streams data directly from S3 into the training algorithm without writing to disk, eliminating the I/O latency caused by downloading files to local storage. This is the most effective solution because the bottleneck is the data transfer from S3, and Pipe mode removes that wait by feeding data on the fly, so training can start before the full dataset has been transferred.

Exam trap

The trap here is that candidates confuse File mode (which downloads fully) with Pipe mode (which streams), or mistakenly think adding more instances (Option B) solves a per-instance I/O bottleneck, when in fact it does not address the root cause of S3 download latency.

How to eliminate wrong answers

Option A is wrong because File mode downloads the entire dataset to the training instance's local disk before training starts, which actually increases I/O latency due to the full download overhead. Option B is wrong because increasing the number of training instances does not reduce per-instance I/O latency; it distributes the workload but each instance still suffers from the same S3 download bottleneck. Option D is wrong because Amazon SageMaker Elastic Inference accelerates model inference, not training data loading, so it has no effect on I/O latency during training.

Full explanation →

1044

MCQmedium

A company is building a recommendation system using Amazon SageMaker. The data is stored in a large S3 bucket with millions of small CSV files. The team wants to train a factorization machines model. Which data ingestion strategy will be MOST efficient?

A.Use a SageMaker Processing job with a Spark container to read the files and write a single RecordIO file.

B.Use Amazon Athena to query the data and output to a single CSV.

C.Point the training job directly to the S3 bucket containing the CSV files.

D.Use SageMaker Data Wrangler to create a data flow and export to a training dataset.

AnswerA

Spark can efficiently combine many small files into a single format optimized for training.

Why this answer

Using SageMaker Processing with Spark (Option A) can efficiently read many small files and convert them to a single RecordIO file, which is optimal for SageMaker training. Option C (direct training) would be slow due to many small files. Option B (Athena) is for SQL queries, not data conversion.

Option D (Data Wrangler) is for smaller datasets and manual analysis.

Full explanation →

1045

MCQeasy

A company wants to serve predictions from a model using a REST API with low latency. Which SageMaker deployment option is most appropriate?

A.SageMaker Notebook instance

B.SageMaker real-time endpoint

C.SageMaker Processing job

D.SageMaker Batch Transform

AnswerB

Real-time endpoints provide low-latency REST API.

Why this answer

Option B is correct because SageMaker real-time endpoints provide a low-latency REST API. Option A is wrong because a SageMaker Notebook instance is for development. Option D is wrong because SageMaker Batch Transform is for offline predictions.

Option C is wrong because SageMaker Processing is for data processing.

Full explanation →

1046

MCQmedium

A data engineer needs to automate the transformation of CSV files to Parquet format as soon as they are uploaded to an S3 bucket. The transformed files should be stored in another S3 bucket. Which solution is the most cost-effective and requires the least maintenance?

A.Configure an S3 event notification to invoke a Lambda function.

B.Configure an S3 event notification to invoke an AWS Glue job.

C.Run an Amazon EMR cluster continuously to watch for new files.

D.Set up an EC2 instance with a cron job to poll the S3 bucket.

AnswerA

Lambda is serverless, pay-per-execution, ideal for this use case.

Why this answer

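Option A is correct because an S3 event notification that invokes a Lambda function is serverless and cost-effective. Option B is less cost-effective because Glue jobs have a minimum billing of 1 minute. Option C is wrong because a continuously running EMR cluster is overkill.

Option D is wrong because an EC2 instance requires ongoing management.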

Full explanation →

1047

MCQmedium

A company is building a recommendation system for an e-commerce platform. The system needs to suggest products to users based on past purchases and browsing history. Which approach would be most appropriate for this use case?

A.Content-based filtering using product descriptions

B.K-means clustering of users based on demographics

C.Collaborative filtering using past user-item interactions

D.Matrix factorization on user-item ratings

AnswerC

Collaborative filtering leverages user behavior patterns to make recommendations.

Why this answer

Collaborative filtering is the most appropriate approach because it leverages past user-item interactions (e.g., purchases, clicks) to identify patterns and recommend items that similar users have liked. This method directly captures user behavior and preferences without requiring explicit product metadata, making it ideal for e-commerce recommendation systems where implicit feedback is abundant.

Exam trap

AWS often tests the distinction between collaborative filtering and matrix factorization, where candidates mistakenly choose matrix factorization (Option D) because it is a popular technique, but the question's emphasis on 'past purchases and browsing history' (implicit feedback) makes collaborative filtering the more direct and practical choice, as matrix factorization typically requires explicit ratings or careful adaptation for implicit data.

How to eliminate wrong answers

Option A is wrong because content-based filtering relies solely on product descriptions or features, which ignores the collaborative signal from other users' behavior and fails to capture serendipitous recommendations or cross-category preferences. Option B is wrong because K-means clustering based on demographics groups users by static attributes (e.g., age, location), which does not model dynamic purchase behavior or item preferences, leading to poor recommendation accuracy. Option D is wrong because matrix factorization on user-item ratings assumes explicit numerical ratings (e.g., 1-5 stars), which are often sparse or unavailable in e-commerce; it also requires a dense rating matrix and cannot directly handle implicit feedback like browsing history without additional preprocessing.
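The idea of recommending items that similar users have interacted with can be sketched in a few lines. This is a toy user-based variant that uses Jaccard similarity over sets of interacted items; the users, items, and the `recommend` helper are hypothetical, and production systems use far more scalable methods.

```python
from collections import Counter

def recommend(interactions, user, k=2):
    """Rank items the user has not seen by the summed Jaccard similarity
    of the other users who interacted with them."""
    seen = interactions[user]
    scores = Counter()
    for other, items in interactions.items():
        if other == user:
            continue
        union = seen | items
        if not union:
            continue
        similarity = len(seen & items) / len(union)
        for item in items - seen:
            scores[item] += similarity
    # Keep only items with a positive score; break ties alphabetically.
    ranked = sorted(
        ((item, s) for item, s in scores.items() if s > 0),
        key=lambda pair: (-pair[1], pair[0]),
    )
    return [item for item, _ in ranked[:k]]

# Hypothetical implicit feedback (purchases or clicks) per user.
interactions = {
    "alice": {"laptop", "mouse"},
    "bob": {"laptop", "mouse", "keyboard"},
    "carol": {"mouse", "monitor"},
    "dave": {"blender"},
}

recs = recommend(interactions, "alice")
print(recs)
```

Here "keyboard" ranks first because bob's history overlaps most with alice's, and "blender" is never suggested because dave shares no items with her.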

Full explanation →

1048

MCQhard

A financial services company is deploying a machine learning model for credit risk assessment. The model must have an inference latency under 200ms and must be able to handle up to 1000 transactions per second (TPS). The company wants to minimize costs. The model is a gradient boosting model implemented in XGBoost. Which SageMaker deployment option should the team choose?

A.Use SageMaker Batch Transform to process transactions in batches.

B.Use SageMaker asynchronous inference for queued requests.

C.Deploy the model on a SageMaker real-time endpoint with multiple instances behind a load balancer.

D.Use SageMaker Serverless Inference for automatic scaling.

AnswerC

Real-time endpoints provide sub-second latency and can scale to 1000 TPS.

Why this answer

SageMaker real-time endpoints (Option C) are designed for low-latency, high-throughput inference. They can scale horizontally to handle TPS requirements. SageMaker Batch Transform (Option A) is for offline processing.

SageMaker Serverless Inference (Option D) has cold starts and may not meet latency requirements under high load. SageMaker asynchronous inference (Option B) is for near-real-time workloads but has higher latency.
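Horizontal scaling to a TPS target can be estimated with back-of-the-envelope arithmetic. The sketch below is an assumption-laden illustration: the per-instance throughput of 250 TPS and the 70% headroom factor are hypothetical values that would in practice come from load testing the model on the chosen instance type.

```python
import math

def instances_needed(target_tps, per_instance_tps, headroom=0.7):
    """Instances required so that each one runs at no more than
    `headroom` of its measured capacity."""
    return math.ceil(target_tps / (per_instance_tps * headroom))

# Hypothetical load-test result: one instance sustains 250 TPS
# while staying under the 200 ms latency requirement.
n = instances_needed(target_tps=1000, per_instance_tps=250)
print(n)
```

Leaving headroom trades some cost for latency stability, since instances running near full capacity tend to see queueing delays.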

Full explanation →

1049

Multi-Selecthard

A data scientist is performing EDA on a dataset with 10 million rows. The dataset has a column 'income' with outliers. The data scientist wants to detect and handle outliers. Which THREE approaches are appropriate?

Select 3 answers

A.Calculate z-scores and flag values beyond 3 standard deviations

B.Apply min-max scaling to the column

C.Convert the column to one-hot encoding

D.Visualize the distribution with box plots

E.Use the interquartile range (IQR) to identify outliers

AnswersA, D, E

Z-score is a common method.

Why this answer

The IQR method, z-scores, and visualizations such as box plots are standard for outlier detection. Option B is wrong because min-max scaling does not handle outliers. Option C is wrong because one-hot encoding is for categorical data.
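The z-score and IQR checks can be sketched with the Python standard library. The income values below are made up for illustration, and the helpers assume the column is not constant (a zero standard deviation would cause a division by zero).

```python
import statistics

def zscore_outliers(values, threshold=3.0):
    """Flag values more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [v for v in values if abs(v - mean) / std > threshold]

def iqr_outliers(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical 'income' column in thousands, with one extreme value.
income = [40, 42, 45, 47, 48, 50, 50, 52, 53, 55,
          55, 57, 58, 60, 60, 62, 63, 65, 68, 400]

print(zscore_outliers(income))
print(iqr_outliers(income))
```

Both checks flag only the value 400 in this sample. On heavily skewed columns the two methods can disagree, because the mean and standard deviation are themselves pulled by the outliers while quartiles are not.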

Full explanation →

1050

Multi-Selectmedium

A data scientist is training a binary classifier using a large dataset with class imbalance (90% negative, 10% positive). After training a logistic regression model, the F1 score is low but accuracy is high. Which TWO actions should the data scientist take to improve model performance? (Choose 2.)

Select 2 answers

A.Switch to evaluation metrics such as F1 score or AUC-ROC instead of accuracy.

B.Apply feature scaling to ensure all features contribute equally.

C.Add more features to the model to improve its capacity.

D.Resample the training data using techniques like SMOTE to balance the classes.

E.Increase the regularization parameter to reduce overfitting.

AnswersA, D

Correct: Metrics like F1 are robust to class imbalance.

Why this answer

Option D (resample the training data) and Option A (use a different evaluation metric) are correct because class imbalance causes the model to be biased toward the majority class, leading to high accuracy but poor F1. Resampling (e.g., SMOTE) balances the classes, and using F1 or AUC-ROC focuses on minority class performance. Option B (feature scaling) is a general preprocessing step but doesn't directly address imbalance.

Option E (increase regularization) might reduce overfitting but doesn't target imbalance. Option C (add more features) may not help if the model is already biased.
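A small worked example shows how accuracy can look strong while F1 stays low at a 90/10 class split. The predictions are invented for illustration: a hypothetical classifier that finds only 2 of the 10 positives and raises no false alarms.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1_score(y_true, y_pred):
    """Harmonic mean of precision and recall for the positive class (1)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# 10 positives followed by 90 negatives (the 10% / 90% split).
y_true = [1] * 10 + [0] * 90
# Hypothetical predictions: 2 positives found, 8 missed, no false positives.
y_pred = [1] * 2 + [0] * 8 + [0] * 90

acc = accuracy(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
print(f"accuracy={acc:.2f}  f1={f1:.2f}")
```

Accuracy is 0.92 because the 90 negatives dominate the count, while F1 is about 0.33 because recall on the positive class is only 0.2.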

Full explanation →
