AWS Certified Machine Learning Specialty MLS-C01 Questions 1276–1350 | Page 18/24

1276

MCQeasy

A company wants to perform automated hyperparameter tuning for a model. Which Amazon SageMaker feature should be used?

A.Amazon SageMaker Clarify

B.Amazon SageMaker Ground Truth

C.Amazon SageMaker Debugger

D.Amazon SageMaker automatic model tuning

AnswerD

Purpose-built for hyperparameter optimization.

Why this answer

Option D is correct because SageMaker automatic model tuning (hyperparameter tuning) automates hyperparameter optimization. Option A is wrong because Clarify is for bias detection and explainability. Option B is wrong because Ground Truth is for data labeling. Option C is wrong because Debugger is for monitoring and debugging training jobs.

1277

Multi-Selectmedium

A data scientist is performing EDA on a dataset with a binary target variable. Which THREE techniques can help assess the relationship between a continuous feature and the target?

Select 3 answers

A.Scatter plot against another continuous feature

B.KDE plot grouped by target

C.Histogram colored by target

D.Bar chart of feature values

E.Box plot grouped by target

AnswersB, C, E

KDE plots show smoothed density per class.

Why this answer

Box plots (comparing distributions for each class), histograms (overlaid or side-by-side), and KDE plots (smoothed probability density) are all effective for visualizing the relationship between a continuous feature and a binary target. Option A (scatter plot against another continuous feature) relates two continuous variables and does not involve the target. Option D (bar chart) is for categorical features.

1278

MCQhard

A company is building a sentiment analysis model using Amazon SageMaker BlazingText. The training data consists of 100,000 product reviews. The data scientist wants to use the Word2Vec algorithm to generate word embeddings. Which configuration is required to use the continuous bag-of-words (CBOW) architecture?

A.Set the mode parameter to 'supervised'.

B.Set the mode parameter to 'batch_skipgram'.

C.Set the mode parameter to 'cbow'.

D.Set the mode parameter to 'skipgram'.

AnswerC

The 'cbow' mode enables the continuous bag-of-words architecture in BlazingText.

Why this answer

In BlazingText, the 'mode' parameter controls the training objective. Setting 'mode' to 'cbow' enables the continuous bag-of-words architecture. 'skipgram' is for skip-gram. 'batch_skipgram' is for large-scale skip-gram. 'supervised' is for text classification.

1279

MCQeasy

A company runs a nightly AWS Glue ETL job that processes data from an Amazon Redshift cluster and writes to Amazon S3. The job fails intermittently with 'ERROR: cannot execute INSERT in a read-only transaction'. What is the most likely cause?

A.The IAM role used by Glue does not have permissions to insert into Redshift

B.The JDBC driver version is incompatible with the Redshift cluster

C.The Redshift cluster is in a read-only state due to a failover or maintenance

D.The Glue job's connection pool is exhausted

AnswerC

During failover, the secondary cluster may be read-only, causing this error.

Why this answer

This error occurs when the Glue job issues an INSERT inside a read-only transaction, often because the Redshift cluster is in a read-only state due to a failover or maintenance. Option A (IAM permissions) would give an access denied error. Option B (incompatible JDBC driver) would give a different error. Option D (exhausted connection pool) would give a different error.

1280

MCQhard

A data scientist is building a binary classification model to predict customer churn. The dataset is highly imbalanced, with only 5% of customers churning. The scientist evaluates several models using accuracy, precision, recall, and F1 score. Which metric is most appropriate for comparing model performance in this scenario?

A.Accuracy

B.F1 score

C.Precision

D.Recall

AnswerB

F1 score balances precision and recall, making it suitable for imbalanced datasets.

Why this answer

F1 score is the harmonic mean of precision and recall and is suitable for imbalanced datasets where accuracy can be misleading. Accuracy would be high even if the model predicts no churn ever (95% accuracy). Precision and recall each consider only one aspect, but F1 balances both.
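The gap between accuracy and F1 can be checked with a few lines of arithmetic. The sketch below uses hypothetical data (1,000 customers, 5% churn) and a hypothetical second model; the helper name `classification_metrics` is an illustration, not part of any library.

```python
def classification_metrics(y_true, y_pred):
    """Return accuracy, precision, recall, and F1 for binary labels (1 = churn)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical dataset: 50 churners followed by 950 non-churners.
y_true = [1] * 50 + [0] * 950

# Baseline that never predicts churn.
baseline = classification_metrics(y_true, [0] * 1000)

# Hypothetical model: finds 30 of 50 churners and raises 20 false alarms.
y_pred = [1] * 30 + [0] * 20 + [1] * 20 + [0] * 930
model = classification_metrics(y_true, y_pred)

print("baseline:", baseline)
print("model:   ", model)
```

The two candidates are close on accuracy, while F1 separates them clearly, which is the reason F1 is preferred for comparison here.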

1281

MCQmedium

A data scientist is exploring a dataset with a large number of features. The scientist suspects that some features are redundant because they are highly correlated with each other. Which technique should the scientist use during EDA to identify and remove such redundant features?

A.Chi-square test

B.Principal Component Analysis (PCA)

C.Correlation matrix heatmap

D.Variance Inflation Factor (VIF)

AnswerD

VIF measures how much the variance of a regression coefficient is inflated due to multicollinearity.

Why this answer

Option D is correct because VIF quantifies multicollinearity by measuring how much the variance of a coefficient is inflated due to correlation with other features. Option A is wrong because the chi-square test is for categorical features. Option B is wrong because PCA creates new features rather than identifying redundant ones. Option C is wrong because a correlation matrix shows only pairwise correlations, whereas VIF also captures a feature's correlation with combinations of the other features.

1282

Multi-Selecteasy

A data scientist is building a binary classification model to predict customer churn. The dataset has 10,000 samples with 500 churners (positive class). Which TWO techniques should be used to address the class imbalance? (Choose 2.)

Select 2 answers

A.Use a higher learning rate during training

B.Use L1 regularization on the model

C.Use random undersampling of the majority class

D.Use SMOTE to generate synthetic samples for the minority class

E.Use principal component analysis (PCA) to reduce dimensionality

AnswersC, D

Undersampling reduces majority class samples, balancing the dataset.

Why this answer

SMOTE and undersampling are standard techniques for handling class imbalance.
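Both techniques can be sketched in plain Python. This is a simplified illustration, not the reference SMOTE implementation: the two-feature Gaussian data is made up, and `random_undersample` and `smote_like` are hypothetical helper names. In practice a library such as imbalanced-learn would normally be used.

```python
import random


def random_undersample(majority, minority, rng):
    """Keep every minority row plus an equally sized random subset of majority rows."""
    return rng.sample(majority, len(minority)) + list(minority)


def smote_like(minority, n_new, k, rng):
    """Create n_new synthetic minority points by interpolating between a
    minority point and one of its k nearest minority neighbors."""
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Sort the other minority points by squared Euclidean distance to base.
        neighbors = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # position along the segment from base to neighbor
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbor)))
    return synthetic


rng = random.Random(42)
# Made-up data matching the question's class counts: 9,500 vs 500.
majority = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(9500)]
minority = [(rng.gauss(2, 1), rng.gauss(2, 1)) for _ in range(500)]

balanced = random_undersample(majority, minority, rng)
synthetic = smote_like(minority, n_new=100, k=5, rng=rng)

print(len(balanced), "rows after undersampling")
print(len(synthetic), "synthetic minority rows")
```

Undersampling discards majority rows, while the SMOTE-style step adds new minority rows; the two are often combined.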

1283

MCQeasy

A data scientist is training a binary classification model on a dataset where the positive class represents only 1% of the data. The model's accuracy is 99%, but the recall for the positive class is 0%. Which metric should the scientist use to evaluate the model's performance effectively?

A.Area under the ROC curve (ROC AUC)

B.Area under the Precision-Recall curve (PR AUC)

C.Accuracy

D.F1 score

AnswerB

PR AUC is robust to class imbalance.

Why this answer

In a highly imbalanced dataset where the positive class is only 1%, accuracy is misleading because a model can achieve 99% accuracy by simply predicting the negative class for all samples, resulting in 0% recall for the positive class. The Area under the Precision-Recall curve (PR AUC) is the correct metric because it focuses on the performance of the positive class by evaluating the trade-off between precision and recall, making it sensitive to changes in the minority class. Unlike ROC AUC, which can be overly optimistic in imbalanced settings due to the large number of true negatives, PR AUC provides a more realistic assessment of model performance for rare events.
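A small numeric sketch can make the difference concrete. The scores below are invented for illustration (10 positives among 1,000 samples), and PR AUC is approximated by average precision, a common step-wise summary of the precision-recall curve; both helper functions are hypothetical names written for this example.

```python
def roc_auc(y_true, scores):
    """Probability that a random positive outranks a random negative (ties count half)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(y_true, scores):
    """Mean of the precision values measured at the rank of each positive."""
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank
    return total / hits


# Invented scores: 990 negatives spread over [0, 0.9), 10 positives mostly near the top.
neg_scores = [0.9 * i / 990 for i in range(990)]
pos_scores = [0.95, 0.93, 0.88, 0.87, 0.86, 0.85, 0.84, 0.83, 0.82, 0.81]
y_true = [0] * 990 + [1] * 10
scores = neg_scores + pos_scores

print("ROC AUC:          ", round(roc_auc(y_true, scores), 3))
print("Average precision:", round(average_precision(y_true, scores), 3))
```

With these scores the ranking looks strong by ROC AUC, yet most of the top-ranked predictions are false positives, which only the precision-recall view exposes.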

Exam trap

The trap here is that candidates often choose ROC AUC (Option A) because it is a common default metric, but they fail to recognize that in severe class imbalance, ROC AUC can be artificially inflated by the dominance of true negatives, whereas PR AUC is the correct choice for evaluating minority class performance.

How to eliminate wrong answers

Option A is wrong because ROC AUC evaluates the trade-off between true positive rate and false positive rate, and in highly imbalanced datasets with a large number of true negatives, it can remain high even when the model fails to identify positive samples, giving a false sense of good performance. Option C is wrong because accuracy is a global metric that counts overall correct predictions; in this 1% positive class scenario, a model that always predicts the negative class achieves 99% accuracy but has 0% recall, making it completely useless for detecting the positive class. Option D is wrong because the F1 score, while better than accuracy, is a single threshold-dependent metric that can be misleading if the model's precision is high but recall is zero (F1 would be 0), and it does not capture performance across all thresholds like PR AUC does.

1284

MCQeasy

A data scientist is using Amazon SageMaker to train a model and wants to automatically stop the training job if the loss does not improve for a certain number of epochs. Which SageMaker feature can be used for this purpose?

A.SageMaker Experiments

B.Custom early stopping callback in the training script

C.SageMaker Automatic Model Tuning

D.SageMaker Debugger

AnswerB

Implementing a custom callback that stops training when loss stagnates is the most direct method.

Why this answer

The 'StoppingCondition' parameter in the training job definition only caps total runtime (MaxRuntimeInSeconds); it does not react to the loss. To stop when the loss has not improved for a set number of epochs, the data scientist should implement a custom early stopping callback in the training script.
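A patience-based early stopping check can be sketched as follows. This is a framework-agnostic illustration, assuming the training script computes one loss value per epoch; the class name `EarlyStopping` and the loss values are hypothetical.

```python
class EarlyStopping:
    """Signal a stop when the loss has not improved for `patience` consecutive epochs."""

    def __init__(self, patience, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta  # smallest decrease that counts as an improvement
        self.best_loss = float("inf")
        self.epochs_without_improvement = 0

    def should_stop(self, loss):
        if loss < self.best_loss - self.min_delta:
            self.best_loss = loss
            self.epochs_without_improvement = 0
        else:
            self.epochs_without_improvement += 1
        return self.epochs_without_improvement >= self.patience


# Hypothetical per-epoch validation losses; a real script would compute these.
losses = [0.90, 0.70, 0.60, 0.61, 0.60, 0.62, 0.59, 0.58]

stopper = EarlyStopping(patience=3)
stopped_at = None
for epoch, loss in enumerate(losses, start=1):
    if stopper.should_stop(loss):
        stopped_at = epoch
        break

print("stopped at epoch", stopped_at, "with best loss", stopper.best_loss)
```

Deep learning frameworks ship equivalent callbacks; the hand-rolled version above only shows the logic the training script needs.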

1285

MCQhard

A team is building a data pipeline to process terabytes of log data daily using Amazon EMR. The data arrives in 5-minute windows and must be available for querying within 30 minutes. The data is originally in gzip-compressed CSV files. Which approach will minimize processing time and cost?

A.Use Amazon EMR with Spark to convert data to Parquet and use on-demand instances.

B.Use Amazon EMR with Spark to convert data to Parquet and store in S3, using spot instances for task nodes.

C.Use AWS Glue to convert data to gzip-compressed CSV and query with Athena.

D.Use Amazon EMR with Hive to transform data to compressed CSV and store in S3.

AnswerB

Parquet reduces scan size, spot instances reduce cost.

Why this answer

Option B is correct because converting gzip-compressed CSV to Parquet reduces storage size and improves query performance due to columnar storage and predicate pushdown. Using spot instances for task nodes significantly lowers compute cost, while the 30-minute SLA is achievable with Spark on EMR processing 5-minute windows of data.

Exam trap

The trap here is that candidates may overlook the cost savings of spot instances for transient, fault-tolerant workloads, or assume that any compression (like gzip CSV) is sufficient for performance, ignoring the benefits of columnar formats like Parquet for analytical queries.

How to eliminate wrong answers

Option A is wrong because using on-demand instances for task nodes increases cost unnecessarily; spot instances are suitable for fault-tolerant, transient workloads like data transformation. Option C is wrong because AWS Glue is not optimized for high-volume, low-latency ETL on terabytes of daily log data, and converting to gzip-compressed CSV does not improve query performance over Parquet. Option D is wrong because Hive on EMR is slower than Spark for large-scale data processing, and storing as compressed CSV does not provide the performance benefits of columnar formats like Parquet.

1286

MCQeasy

A company has deployed a real-time inference endpoint using SageMaker for a fraud detection model. The model uses a Random Forest classifier. The endpoint serves predictions, but the latency is too high: p99 latency is 500 ms, and the requirement is under 200 ms. The team has already optimized the instance type to the maximum allowed by their budget. Which option will MOST effectively reduce latency while maintaining acceptable accuracy?

A.Switch to a linear model like Logistic Regression

B.Reduce the number of trees in the Random Forest model

C.Enable SageMaker's batch transform

D.Add more instances to the endpoint

AnswerB

Fewer trees mean faster inference, though accuracy may drop slightly; it's a direct latency reduction.

Why this answer

Reducing the number of trees directly reduces inference time of Random Forest, but may degrade accuracy. However, it's the most direct method for latency reduction while keeping the model type. Switching to Logistic Regression may reduce latency but likely reduces accuracy significantly.

Batch transform is not suitable for real-time. Adding instances helps throughput but not per-request latency.

1287

Multi-Selecthard

A company is deploying a machine learning model using Amazon SageMaker. To reduce costs, they want to use SageMaker Managed Spot Training. Which THREE conditions must be met for the training job to use spot instances? (Choose THREE.)

Select 3 answers

A.The model must be deployed to a serverless endpoint

B.The training job must be able to handle interruptions gracefully

C.The chosen instance type must be available in the spot market

D.The training script must save checkpoints to an S3 bucket periodically

E.The training job must be configured to run in a VPC

AnswersB, C, D

Spot instances can be reclaimed; the job must be fault-tolerant.

Why this answer

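Options B, C, and D are correct. The training job must tolerate interruptions (B), the chosen instance type must be available as spot capacity (C), and the training script must save checkpoints to S3 (D) so that work is not lost when an instance is reclaimed. Option E is not required: spot training can run in a VPC, but a VPC is not a prerequisite. Option A is not relevant because Managed Spot Training applies to training, not inference.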
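The checkpoint-and-resume pattern that makes a job interruption-tolerant can be sketched as below. This is a minimal illustration, not SageMaker's own code: it writes to a temporary directory so it runs anywhere, whereas in a SageMaker training container the script would typically write to the local checkpoint directory that SageMaker syncs to the configured S3 location. The `train` function and the "training" arithmetic are placeholders.

```python
import json
import os
import tempfile


def train(checkpoint_dir, total_epochs, interrupt_after=None):
    """Run a toy training loop that resumes from the last saved checkpoint."""
    path = os.path.join(checkpoint_dir, "state.json")
    state = {"epoch": 0, "weight": 0.0}
    if os.path.exists(path):
        # A previous run was interrupted: pick up where it left off.
        with open(path) as f:
            state = json.load(f)
    while state["epoch"] < total_epochs:
        state["weight"] += 0.5  # placeholder for one epoch of real training
        state["epoch"] += 1
        with open(path, "w") as f:
            json.dump(state, f)  # checkpoint after every epoch
        if interrupt_after is not None and state["epoch"] == interrupt_after:
            return state  # simulate the spot instance being reclaimed
    return state


with tempfile.TemporaryDirectory() as ckpt_dir:
    first = train(ckpt_dir, total_epochs=5, interrupt_after=2)
    resumed = train(ckpt_dir, total_epochs=5)

print("interrupted at epoch", first["epoch"])
print("finished at epoch", resumed["epoch"], "with weight", resumed["weight"])
```

The second call does not repeat the first two epochs, which is the behavior a spot-interrupted job needs when it restarts.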

1288

Multi-Selectmedium

Which TWO metrics are appropriate for evaluating a binary classification model when the cost of false negatives is high?

Select 2 answers

A.Accuracy

B.AUC-ROC

C.Recall

D.F1 score

E.Precision

AnswersC, D

Recall measures the proportion of actual positives correctly identified.

Why this answer

When false negatives are costly, the goal is to minimize them, so recall (the true positive rate) is the key metric because it directly measures how many actual positives are missed. F1 score is also appropriate because it combines recall with precision, guarding against a model that reaches high recall only by flagging everything as positive. Option E (precision) on its own says nothing about false negatives. Option B (AUC-ROC) is a general measure of ranking quality, and Option A (accuracy) can be misleading. Recall (C) and F1 score (D) are therefore the appropriate choices.

1289

Multi-Selecthard

A company is using Amazon Redshift for data warehousing. The data engineering team observes that query performance degrades over time due to data skew. Which three strategies should the team implement to improve performance?

Select 3 answers

A.Choose appropriate distribution keys based on join and group-by columns.

B.Increase the number of nodes in the Redshift cluster.

C.Run VACUUM and ANALYZE commands regularly.

D.Define appropriate sort keys to minimize the number of blocks scanned.

E.Drop unused indexes on large tables.

AnswersA, C, D

Good distribution keys reduce data movement and improve performance.

Why this answer

Options A, C, and D are correct. Choosing appropriate distribution keys spreads rows evenly and reduces data movement. Regular VACUUM and ANALYZE reclaims space and updates statistics. Sort keys improve query performance by reducing the number of blocks scanned. Option B (increasing node count) may help but is costly and not a targeted fix for skew. Option E (dropping indexes) is not applicable because Redshift has no indexes.

1290

MCQhard

Refer to the exhibit. A data scientist is setting up an IAM policy for EDA on a data lake. The scientist needs to run exploratory SQL queries using Amazon Athena and save results to a new S3 bucket. What is a critical missing permission in this policy?

A.s3:ListBucket on the output bucket

B.s3:PutObject on the output S3 bucket

C.glue:GetDatabase

D.athena:StopQueryExecution

AnswerB

Athena needs to write query results to an S3 bucket.

Why this answer

Option B is correct because Athena writes query results to an S3 bucket, which requires s3:PutObject permission on the output bucket. Option A is wrong because s3:ListBucket is already included in the policy. Option C is wrong because the policy already includes the Glue catalog permission it needs (glue:GetTable). Option D is wrong because the policy already includes the necessary Athena permissions.

1291

MCQhard

An ML team is using SageMaker Autopilot to automatically build a binary classification model. The dataset has 500,000 rows and 200 columns, with a severe class imbalance (1% positive). Which configuration should the team set to address the imbalance?

A.Specify the 'objective' as 'F1' or 'AUC' to optimize for imbalanced data.

B.Set the 'problem_type' to 'MulticlassClassification' to handle imbalance.

C.Use the 'AutoML' job with 'EnsembleMode' and 'SMOTE' sampling.

D.Configure the data split to use stratified sampling based on the target.

AnswerA

F1 and AUC are better metrics for imbalanced classification.

Why this answer

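SageMaker Autopilot's AutoML job allows specifying an objective metric such as F1 or AUC, which suits imbalanced data (Option A). Option B is wrong because the target is binary, so switching the problem type to multiclass does not address imbalance. Option C is wrong because SMOTE sampling is not supported by Autopilot. Option D is wrong because stratified splitting only preserves the class ratio across splits; it does not change what the model optimizes.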

1292

MCQhard

A machine learning engineer is using SageMaker to train an XGBoost model on a dataset with a severe class imbalance (1:1000). The goal is to maximize recall on the minority class. Which hyperparameter tuning strategy is MOST appropriate?

A.Set max_delta_step to a high value

B.Increase subsample ratio to 1.0

C.Set scale_pos_weight to the ratio of negative to positive samples

D.Set objective to 'binary:logistic' and tune max_depth

AnswerC

This parameter adjusts the weight of the minority class, improving recall.

Why this answer

XGBoost's 'scale_pos_weight' parameter can be set to the ratio of negative to positive instances (here roughly 1,000) to help the model focus on the minority class. Adjusting max_delta_step or subsample may help but is secondary. Setting objective to 'binary:logistic' is the usual choice for binary classification and does not by itself address imbalance.
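The ratio itself is simple to derive from the training labels. The sketch below assumes a hypothetical label list with a 1:1000 imbalance; the resulting dictionary only illustrates where the value would go and is not a complete hyperparameter set.

```python
def scale_pos_weight(labels):
    """Return (number of negatives) / (number of positives) for binary labels."""
    positives = sum(1 for y in labels if y == 1)
    negatives = len(labels) - positives
    if positives == 0:
        raise ValueError("no positive samples; the ratio is undefined")
    return negatives / positives


# Hypothetical labels with a 1:1000 class imbalance.
labels = [1] * 10 + [0] * 10_000

hyperparameters = {
    "objective": "binary:logistic",
    "scale_pos_weight": scale_pos_weight(labels),
}
print(hyperparameters)
```

The computed ratio is a starting point; it is commonly tuned around that value rather than treated as fixed.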

1293

Drag & Dropmedium

Drag and drop the steps to evaluate a trained model using SageMaker Model Monitor in the correct order.

Why this order

Model Monitor requires, in order: enabling data capture on the endpoint, creating a baseline, scheduling monitoring jobs, and reviewing the resulting reports.

1294

MCQmedium

Refer to the exhibit. A data scientist is configuring SageMaker Model Monitor for data quality checks. The configuration above is used. What is the purpose of the `ProbabilityThresholdAttribute` set to "0.5"?

A.It filters the input data to only include predictions above the threshold

B.It specifies the threshold for sampling data for monitoring

C.It sets the threshold for the accuracy metric

D.It defines the probability threshold used to convert model output to binary predictions for monitoring

AnswerD

This threshold is used to compute predicted labels for monitoring purposes.

Why this answer

In Model Monitor, `ProbabilityThresholdAttribute` is used for binary classification models to define the threshold for converting probabilities to predicted labels, so Option D is correct. The resulting labels feed the baseline and are used to monitor drift in the prediction distribution. The attribute does not set the threshold for endpoint inference; that is done in the model. Option A is wrong because it does not filter input data. Option B is wrong because it does not sample data. Option C is wrong because it does not define the metric.

1295

MCQhard

A data scientist is building a recommender system using collaborative filtering. The dataset is sparse (99% missing values). Which algorithm is best suited?

A.Random Forest

B.K-Nearest Neighbors

C.Matrix Factorization (e.g., SVD)

D.Hidden Markov Model

AnswerC

Matrix factorization works well on sparse data.

Why this answer

Matrix factorization (e.g., SVD) is best suited for sparse collaborative filtering because it learns latent factors that capture underlying user-item interactions, effectively handling the 99% missing values by generalizing patterns rather than relying on explicit pairwise similarities. Unlike memory-based methods, it decomposes the sparse user-item matrix into lower-dimensional representations, enabling accurate predictions even when most entries are unobserved.
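The idea of learning latent factors from only the observed entries can be sketched with stochastic gradient descent. This is a toy illustration under stated assumptions: the ratings are invented, the hyperparameters are arbitrary, and the SGD-based factorization shown here is a common variant used for recommenders rather than a literal SVD.

```python
import random


def predict(P, Q, u, i):
    """Predicted rating: dot product of user and item latent vectors."""
    return sum(pu * qi for pu, qi in zip(P[u], Q[i]))


def rmse(ratings, P, Q):
    """Root mean squared error over the observed ratings only."""
    return (sum((r - predict(P, Q, u, i)) ** 2 for u, i, r in ratings) / len(ratings)) ** 0.5


def init_factors(n_rows, n_factors, rng):
    return [[rng.uniform(0.1, 0.5) for _ in range(n_factors)] for _ in range(n_rows)]


def factorize(ratings, P, Q, lr=0.02, reg=0.02, epochs=500):
    """Update P and Q in place using SGD on the observed (user, item, rating) triples."""
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - predict(P, Q, u, i)
            for f in range(len(P[u])):
                pu, qi = P[u][f], Q[i][f]
                P[u][f] += lr * (err * qi - reg * pu)
                Q[i][f] += lr * (err * pu - reg * qi)


# Invented sparse data: 8 observed ratings in a 4 x 4 user-item matrix.
ratings = [(0, 0, 5), (0, 1, 3), (1, 0, 4), (1, 3, 1),
           (2, 1, 1), (2, 2, 5), (3, 2, 4), (3, 3, 2)]

rng = random.Random(0)
P = init_factors(4, 2, rng)  # user factors
Q = init_factors(4, 2, rng)  # item factors

initial_rmse = rmse(ratings, P, Q)
factorize(ratings, P, Q)
trained_rmse = rmse(ratings, P, Q)

print("RMSE before:", round(initial_rmse, 3), "after:", round(trained_rmse, 3))
print("prediction for unobserved (user 0, item 2):", round(predict(P, Q, 0, 2), 2))
```

Missing entries are never touched during training, yet the learned factors still produce a prediction for any user-item pair.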

Exam trap

The exam often tests the misconception that K-Nearest Neighbors (KNN) is the default for collaborative filtering, but the trap here is that extreme sparsity (99% missing) makes pairwise similarity calculations unreliable, whereas matrix factorization explicitly models latent factors to overcome data sparsity.

How to eliminate wrong answers

Option A is wrong because Random Forest is a supervised ensemble method that requires a dense feature matrix and cannot inherently handle missing values in a collaborative filtering context; it would fail to leverage the implicit feedback structure of the sparse user-item matrix. Option B is wrong because K-Nearest Neighbors (KNN) is a memory-based collaborative filtering approach that computes similarities between users or items, but with 99% missing values, pairwise distances become unreliable and the algorithm suffers from poor scalability and the 'curse of dimensionality'. Option D is wrong because Hidden Markov Model (HMM) is designed for sequential or temporal data with hidden states, not for static user-item interaction matrices; it does not model the latent factor structure needed for collaborative filtering in sparse settings.

1296

Multi-Selecthard

Which TWO SageMaker features can be used to perform hyperparameter optimization? (Choose 2)

Select 2 answers

A.SageMaker Debugger

B.SageMaker Pipelines

C.SageMaker Model Monitor

D.SageMaker automatic model tuning

E.SageMaker Experiments

AnswersD, E

This is the built-in hyperparameter tuning service.

Why this answer

Option D (SageMaker automatic model tuning) is the built-in hyperparameter tuning feature and the only option that launches tuning jobs directly. Option E (SageMaker Experiments) does not perform tuning by itself; it tracks and compares the trials that make up a hyperparameter search, which is why it is accepted as the second answer. Option A (SageMaker Debugger) and Option C (SageMaker Model Monitor) are not for HPO. Option B (SageMaker Pipelines) can orchestrate a tuning step but is not a direct tuning feature.

1297

MCQeasy

During training of a SageMaker built-in object detection algorithm, the loss is not decreasing after several epochs. Which troubleshooting step should be taken first?

A.Increase the mini-batch size

B.Add more classes to the dataset

C.Check whether the learning rate is appropriate

D.Increase the number of epochs

AnswerC

Learning rate is a critical hyperparameter; incorrect value often causes loss not to decrease.

Why this answer

When the loss is not decreasing during training of a SageMaker built-in object detection algorithm, the most common cause is an inappropriate learning rate. A learning rate that is too high can cause the loss to oscillate or diverge, while one that is too low can cause the loss to plateau. Checking and adjusting the learning rate is the first troubleshooting step because it directly controls the step size of gradient updates and is a fundamental hyperparameter in optimization.
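The effect of the learning rate can be seen on a toy problem. The sketch below runs plain gradient descent on the one-dimensional loss f(w) = w^2, which stands in for a real training loss purely as an assumption for illustration; the three learning rates are arbitrary.

```python
def final_loss(learning_rate, steps=20, w=10.0):
    """Run gradient descent on f(w) = w**2 and return the loss after `steps` updates."""
    for _ in range(steps):
        gradient = 2.0 * w  # derivative of w**2
        w -= learning_rate * gradient
    return w * w


for lr in (1.1, 0.0001, 0.1):
    print(f"learning rate {lr}: loss after 20 steps = {final_loss(lr):.4g}")
```

Starting from a loss of 100, the largest rate makes the loss grow, the smallest leaves it almost unchanged, and the middle one drives it close to zero, which mirrors the diverge, plateau, and converge cases described above.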

Exam trap

The trap here is that candidates often assume increasing the number of epochs (Option D) will always reduce loss, but they fail to recognize that a plateauing loss is typically a sign of a hyperparameter issue like learning rate, not insufficient training time.

How to eliminate wrong answers

Option A is wrong because increasing the mini-batch size typically stabilizes gradient estimates but does not directly address a plateauing loss; it can even slow convergence if the batch size becomes too large. Option B is wrong because adding more classes to the dataset increases task complexity and would likely worsen the loss, not help it decrease. Option D is wrong because increasing the number of epochs does not fix the underlying optimization issue; if the loss is not decreasing due to a poor learning rate, more epochs will simply continue the same ineffective training.

1298

MCQhard

A company wants to serve a scikit-learn model via SageMaker. The inference code requires a custom preprocessing step that is not in the default scikit-learn container. What is the simplest way to deploy?

A.Create a custom Docker image extending the SageMaker scikit-learn container

B.Package the code in a Lambda layer and use SageMaker hosting

C.Use SageMaker Batch Transform with a custom processing script

D.Use SageMaker Neo to compile the model and add preprocessing

AnswerA

Extending the container with the custom preprocessing is straightforward and supported.

Why this answer

Option A is correct: extending the SageMaker scikit-learn container with a Dockerfile is the simplest way to add the custom preprocessing. Option B (Lambda layer) is a packaging mechanism for AWS Lambda, not for SageMaker hosting containers. Option C (SageMaker Batch Transform) is for batch, not real-time, inference. Option D (SageMaker Neo) optimizes models for target hardware and does not add custom code.

1299

MCQmedium

A company is using Amazon SageMaker to deploy a real-time inference endpoint for a computer vision model. The endpoint receives bursts of traffic with up to 500 requests per second, but the load is unpredictable. Which scaling strategy is MOST cost-effective while maintaining low latency?

A.Manually provision enough instances to handle peak load

B.Use provisioned concurrency on SageMaker Serverless Inference

C.Use a multi-model endpoint to reduce the number of instances

D.Configure automatic scaling with a target tracking policy and add a buffer to handle bursts

AnswerD

Autoscaling with a target tracking policy adjusts instances based on demand, and a buffer helps absorb sudden spikes.

Why this answer

Option D is correct because a target tracking policy lets SageMaker add instances as load increases and remove them as it falls, and a buffer helps absorb sudden spikes. Option A (manually provisioning for peak load) is not cost-effective for unpredictable traffic. Option B (provisioned concurrency on Serverless Inference) keeps capacity warm and billed whether or not requests arrive, so it is hard to size cost-effectively for unpredictable bursts. Option C (multi-model endpoints) is for serving multiple models, not for scaling.
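A target tracking setup with a buffer might be expressed as the request parameters below. This is a sketch under assumptions: the endpoint name, variant name, capacities, cooldowns, and the load-test figure are all hypothetical, and the dictionaries are only built and printed here. With boto3 installed and credentials configured, they would be passed to the Application Auto Scaling client as shown in the trailing comment.

```python
# Hypothetical names; replace with a real endpoint and production variant.
ENDPOINT_NAME = "my-endpoint"
VARIANT_NAME = "AllTraffic"
resource_id = f"endpoint/{ENDPOINT_NAME}/variant/{VARIANT_NAME}"

scalable_target = {
    "ServiceNamespace": "sagemaker",
    "ResourceId": resource_id,
    "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
    "MinCapacity": 2,   # assumed floor so a burst never hits a single instance
    "MaxCapacity": 10,  # assumed ceiling to bound cost
}

# Assumed result of a load test: invocations per minute one instance can sustain.
measured_capacity_per_instance = 100
buffer_fraction = 0.3  # scale out before instances are fully loaded
target_value = measured_capacity_per_instance * (1 - buffer_fraction)

scaling_policy = {
    "PolicyName": "invocations-target-tracking",
    "ServiceNamespace": "sagemaker",
    "ResourceId": resource_id,
    "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
    "PolicyType": "TargetTrackingScaling",
    "TargetTrackingScalingPolicyConfiguration": {
        "TargetValue": target_value,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
        "ScaleOutCooldown": 60,   # react quickly to bursts
        "ScaleInCooldown": 300,   # release capacity more cautiously
    },
}

print(scalable_target)
print(scaling_policy)

# Applying the configuration (not executed in this sketch):
#   client = boto3.client("application-autoscaling")
#   client.register_scalable_target(**scalable_target)
#   client.put_scaling_policy(**scaling_policy)
```

Setting the target below the measured per-instance capacity is one way to implement the buffer: new instances are requested while existing ones still have headroom.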

1300

Multi-Selecthard

A company needs to build a data lake on AWS for analytics. The data includes structured, semi-structured, and unstructured data. The solution must support schema-on-read, provide fine-grained access control, and be cost-effective for storing rarely accessed data. Which THREE services should be used? (Choose THREE)

Select 3 answers

A.AWS Glue Data Catalog for schema-on-read.

B.Amazon Redshift for data warehousing.

C.Amazon S3 as the primary storage layer.

D.Amazon EMR for data processing.

E.S3 Lifecycle policies to transition data to Glacier.

AnswersA, C, E

Glue enables schema-on-read for analytics.

Why this answer

AWS Glue Data Catalog is correct because it provides a centralized metadata repository that enables schema-on-read for data stored in Amazon S3. It allows you to define table schemas and partitions without transforming the underlying data, so analytics tools like Amazon Athena and Amazon EMR can query the data with the schema applied at read time.

Exam trap

The trap here is that candidates often confuse Amazon Redshift as a data lake storage layer due to its analytics capabilities, but it is a data warehouse with schema-on-write and higher costs for infrequently accessed data, making it unsuitable for the described requirements.

1301

MCQmedium

A data scientist is using Amazon SageMaker to train a linear regression model. After training, the scientist notices that the model has a high bias. What is the most likely cause?

A.The training dataset has too many features

B.The model is too complex and overfits the data

C.The regularization parameter is too high

D.The model is too simple and underfits the data

AnswerD

Linear regression can underfit if relationship is nonlinear.

Why this answer

High bias is typically caused by the model being too simple to capture patterns in the data. Option A is wrong because too many features would increase variance, not bias. Option B is wrong because an overly complex model overfits, which shows up as high variance rather than high bias. Option C is less likely: heavy regularization can also push a model toward underfitting, but nothing in the scenario indicates the regularization parameter was raised.

1302

MCQhard

A data scientist is analyzing a dataset with a large number of categorical features. The target variable is binary. Which technique should the scientist use to assess the relationship between each categorical feature and the target?

A.ANOVA

B.Point-biserial correlation

C.Cramér's V

D.Chi-square test of independence

AnswerD

Chi-square tests association between two categorical variables.

Why this answer

The chi-square test of independence is appropriate for testing association between categorical features and a binary target. ANOVA is for continuous target. Mutual information measures dependency but is not a hypothesis test.

Point-biserial correlation is for continuous and binary. Cramér's V is a measure of association after chi-square.
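The chi-square statistic, and the Cramér's V derived from it, can be computed directly from a contingency table. The counts below are invented (a hypothetical plan-type feature against a churn target), and the helper names are illustrative; in practice a statistics library would also supply the p-value.

```python
def chi_square_statistic(table):
    """Return (chi-square statistic, degrees of freedom) for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the feature and target were independent.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof


def cramers_v(table):
    """Effect size in [0, 1] derived from the chi-square statistic."""
    stat, _ = chi_square_statistic(table)
    total = sum(sum(row) for row in table)
    return (stat / (total * min(len(table) - 1, len(table[0]) - 1))) ** 0.5


# Invented counts: rows = plan type (basic, premium), columns = (stayed, churned).
table = [[30, 20],
         [45, 5]]

stat, dof = chi_square_statistic(table)
print("chi-square:", stat, "degrees of freedom:", dof)
print("Cramér's V:", round(cramers_v(table), 3))
```

For one degree of freedom the 5% critical value is about 3.84, so a statistic above that would lead to rejecting independence between the feature and the target.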

1303

MCQhard

A company deploys a SageMaker endpoint for real-time inference. After a week, the response latency increases from 50 ms to 500 ms. CPU utilization is at 30%. What is the most likely cause?

A.The model has a memory leak

B.The instance type is underpowered for the inference load

C.The inference code makes a call to a downstream service that is throttling requests

D.The SageMaker endpoint is experiencing a network outage

AnswerC

Downstream throttling can increase latency without high CPU on the endpoint.

Why this answer

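Increased latency with low CPU utilization suggests a bottleneck outside the endpoint, often throttling by a downstream service such as a database, so Option C is correct. Option A (memory leak) would show up as rising memory usage. Option B (underpowered instance) would show high CPU utilization. Option D (network outage) would cause errors rather than slower responses.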

1304

MCQhard

Refer to the exhibit. A data scientist is reviewing CloudWatch logs for a SageMaker real-time endpoint. The log shows that a prediction took 15 ms. The endpoint is configured with an ml.c5.large instance and the model is a small scikit-learn model. The latency requirement is under 10 ms. Which action would most likely reduce the latency?

A.Use a larger instance type

B.Add more instances to the endpoint

C.Change the model to a TensorFlow model

D.Enable SageMaker Batch Transform

E.Increase the batch size for inference

AnswerA

More CPU power reduces latency.

Why this answer

Option A is correct because a larger instance (for example, ml.c5.xlarge) provides more CPU resources, reducing inference time. Option B (add more instances) increases throughput but does not reduce per-request latency. Option C (switch to a TensorFlow model) is wrong because the framework is not likely the issue. Option D (enable SageMaker Batch Transform) is for offline inference. Option E (increase batch size) is not applicable to real-time single requests.

1305

MCQmedium

Refer to the exhibit. A data scientist is trying to run a SageMaker training job using a script that reads data from the S3 bucket 'my-bucket' and writes the model artifact to the same bucket. The training job fails with an access denied error. What is the likely cause?

A.The IAM role does not have permission to write to the S3 bucket for the model artifact

B.The IAM role does not have sagemaker:CreateModel permission

C.The IAM role does not have s3:ListBucket permission

D.The IAM role does not have ec2:DescribeInstances permission

AnswerA

The policy only allows PutObject on training-data/*, but the model artifact might be saved to a different prefix (e.g., output/).

Why this answer

The training job fails with an access denied error because the IAM role used by SageMaker lacks the s3:PutObject permission (or equivalent write access) for the S3 bucket 'my-bucket'. While the script reads data from the bucket, writing the model artifact requires explicit write permissions on the same bucket. Without this, SageMaker cannot upload the model artifact, causing the job to fail.

Exam trap

The trap here is that candidates may focus on the read operation (data input) and overlook the write operation (model artifact output), or confuse S3 permissions with SageMaker-specific API actions like CreateModel.

How to eliminate wrong answers

Option B is wrong because sagemaker:CreateModel is a permission for creating a SageMaker model resource after training, not for writing to S3 during the training job; the error occurs during training, not model creation. Option C is wrong because s3:ListBucket is a read permission for listing objects, and the job already reads data successfully (the error is on write), so lack of ListBucket would cause a different error (e.g., 403 on list). Option D is wrong because ec2:DescribeInstances is unrelated to S3 access; it is used for managing EC2 instances, not for SageMaker training jobs writing to S3.
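The permission split described above can be sketched as a policy document. This is a minimal illustration, not the policy from the exhibit: the helper name `build_training_s3_policy` is hypothetical, and the `training-data/` and `output/` prefixes are assumptions taken from the answer hint.

```python
import json


def build_training_s3_policy(bucket: str, input_prefix: str, output_prefix: str) -> dict:
    """Build an IAM policy document for a training role (hypothetical helper).

    Grants list access on the bucket, read access on the input prefix,
    and write access on the output prefix where model artifacts land.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ListBucket",
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [f"arn:aws:s3:::{bucket}"],
            },
            {
                "Sid": "ReadTrainingData",
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": [f"arn:aws:s3:::{bucket}/{input_prefix}*"],
            },
            {
                # Without this statement the artifact upload is denied.
                "Sid": "WriteModelArtifacts",
                "Effect": "Allow",
                "Action": ["s3:PutObject"],
                "Resource": [f"arn:aws:s3:::{bucket}/{output_prefix}*"],
            },
        ],
    }


# Assumed prefixes for illustration only.
policy = build_training_s3_policy("my-bucket", "training-data/", "output/")
print(json.dumps(policy, indent=2))
```

A policy that allows `s3:PutObject` only on the input prefix would let the job read its data and still fail when the artifact is uploaded.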

Full explanation →

1306

MCQhard

A machine learning engineer is using Amazon SageMaker to train a deep learning model. The training job is failing with a 'ResourceLimitExceeded' error. The engineer checks the account limits and sees that the current limit for the instance type is 2, and they are already using 2 instances for other jobs. Which approach would resolve the issue MOST cost-effectively?

A.Request a service limit increase for the current instance type

B.Use a different instance type that is available and has sufficient capacity

C.Use a managed spot training instead of on-demand

D.Stop the other training jobs to free up resources

AnswerB

Different instance types have separate limits and may be available immediately.

Why this answer

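Option B is correct because a different instance type often has its own separate limit and may be available immediately. Option A (request a limit increase) is not immediate because the request must be approved first. Option C (managed spot training) may not resolve the error if the limit is account-wide.

Option D (stop the other training jobs) frees capacity but sacrifices the work already in progress.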

Full explanation →

1307

MCQeasy

A company wants to store semi-structured data from IoT sensors in a cost-effective manner for occasional querying. The data is not updated once written. Which Amazon S3 storage class is the most cost-effective for this use case?

A.S3 Standard

B.S3 One Zone-Infrequent Access

C.S3 Intelligent-Tiering

D.S3 Glacier Deep Archive

AnswerD

Correct: Deep Archive is the lowest cost for rarely accessed data with long retrieval times.

Why this answer

S3 Glacier Deep Archive is the lowest-cost storage class for rarely accessed data, with standard retrieval times of up to 12 hours. Option D is correct. Option A (S3 Standard) is for frequent access.

Option C (S3 Intelligent-Tiering) incurs monitoring costs. Option B (S3 One Zone-IA) is for infrequent access but has a minimum storage duration charge.

Full explanation →

1308

MCQhard

A company runs a real-time fraud detection system using Amazon Kinesis Data Streams with 100 shards. Data is consumed by a custom Java application running on Amazon EC2 instances in an Auto Scaling group. The application processes records and writes results to a DynamoDB table. Over the past month, the application has experienced intermittent slowdowns and the DynamoDB write capacity has been fully utilized during peak hours. The team wants to improve throughput without losing the ability to reprocess failed records. The application currently uses the Kinesis Client Library (KCL) with DynamoDB as the lease table. The team is considering the following changes: A. Increase the number of EC2 instances to match the number of shards. B. Switch to using AWS Lambda as the consumer to handle scaling automatically. C. Increase the write capacity of the DynamoDB lease table to handle more workers. D. Use enhanced fan-out to have each consumer receive its own 2 MB/second shard throughput. Which change should the team implement first to address the issue?

A.Increase the write capacity of the DynamoDB lease table to handle more workers.

B.Use enhanced fan-out to have each consumer receive its own 2 MB/second shard throughput.

C.Switch to using AWS Lambda as the consumer to handle scaling automatically.

D.Increase the number of EC2 instances to match the number of shards.

AnswerB

Enhanced fan-out gives dedicated throughput per consumer.

Why this answer

The primary bottleneck is DynamoDB write capacity being fully utilized during peak hours. Enhanced fan-out (option B) provides each consumer with a dedicated 2 MB/second read throughput per shard, eliminating the need for consumers to contend for the shared 2 MB/second per shard. This reduces the load on the DynamoDB lease table because workers no longer need to poll for records, which in turn lowers the write operations to the lease table and alleviates the DynamoDB write capacity issue.

Exam trap

The trap here is that candidates assume increasing DynamoDB write capacity (option A) is the direct fix for write capacity exhaustion, but they miss that enhanced fan-out reduces the underlying cause of those writes by eliminating polling-based contention.

How to eliminate wrong answers

Option D is wrong because increasing EC2 instances to match shards does not address the DynamoDB write capacity bottleneck; it may even increase lease table writes due to more workers contending for leases. Option A is wrong because increasing the write capacity of the DynamoDB lease table treats a symptom (high write load from KCL workers) rather than the root cause (contention for shard throughput); enhanced fan-out reduces the need for frequent lease updates. Option C is wrong because switching to AWS Lambda does not inherently solve the DynamoDB write capacity issue; the function would still write its results to the same DynamoDB table, so the write pressure would persist.

Full explanation →

1309

Multi-Selectmedium

A data scientist is building a binary classifier to predict customer churn. The dataset is highly imbalanced (5% churn). Which TWO techniques can help improve the model's ability to detect churn?

Select 2 answers

A.Downsample the majority class to balance the dataset

B.Use Synthetic Minority Over-sampling Technique (SMOTE)

C.Use class weights in the loss function to penalize misclassifications of the minority class

D.Use accuracy as the evaluation metric

E.Increase the model complexity by adding more layers

AnswersB, C

SMOTE generates synthetic samples for the minority class.

Why this answer

Correct options: B (Synthetic Minority Over-sampling Technique) and C (Use class weights in the loss function). SMOTE generates synthetic samples for the minority class, and class weights penalize misclassifications of the minority class more heavily. Option A (Downsampling the majority class) can be used but may discard data; not as effective as SMOTE.

Option D (Use accuracy as evaluation metric) is misleading. Option E (Increase model complexity) may cause overfitting.
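The class-weight idea can be illustrated with a small calculation. This is a sketch using the common "balanced" heuristic (weight = n_samples / (n_classes * class_count)); the function name is hypothetical and the 5%/95% split mirrors the question.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return a weight per class, inversely proportional to class frequency."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# 5% churn (label 1) and 95% non-churn (label 0), as in the question.
labels = [1] * 5 + [0] * 95
weights = balanced_class_weights(labels)
print(weights)
```

With these weights, misclassifying a churn example costs the loss function far more than misclassifying a non-churn example, which pushes the model to pay attention to the minority class.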

Full explanation →

1310

MCQeasy

A company is using Amazon SageMaker to train a model. The training data is stored in an S3 bucket. The data scientist wants to use the Pipe mode for training to stream data directly from S3 instead of downloading it first. Which of the following is a prerequisite for using Pipe mode?

A.The training data must be compressed using Gzip.

B.The S3 bucket must have public read access.

C.The training data must be stored as a single large file.

D.The training data must be in RecordIO-protobuf or TFRecord format.

AnswerD

Pipe mode streams data line by line; RecordIO and TFRecord are supported.

Why this answer

Option D is correct because Pipe mode requires the training data to be in a format that supports streaming, such as RecordIO-protobuf or TFRecord. Option B is wrong because SageMaker reads from S3 through its IAM execution role, so the bucket does not need public read access. Option C is wrong because Pipe mode does not require data to be in a single file.

Option A is wrong because Pipe mode does not require the data to be Gzip-compressed.

Full explanation →

1311

MCQhard

A machine learning team is building a model to predict customer churn. The dataset has 20 features and 50,000 rows. After initial EDA, they notice that the target variable 'churn' is highly imbalanced (5% churn, 95% non-churn). Which EDA step should the team prioritize to address this imbalance before model training?

A.Remove outliers in the majority class to balance the dataset.

B.Analyze the distribution of each feature separately for churn and non-churn groups.

C.Perform stratified cross-validation to ensure balanced folds.

D.Apply Principal Component Analysis (PCA) to reduce noise.

AnswerB

This helps identify which features differentiate the classes and informs whether resampling or cost-sensitive methods are needed.

Why this answer

Option B is correct because understanding the distribution of features across churn and non-churn classes helps identify which features drive churn. Option D is wrong because PCA reduces dimensionality but does not address imbalance. Option C is wrong because cross-validation is a modeling step, not EDA.

Option A is wrong because removing outliers does not correct the class imbalance.

Full explanation →

1312

MCQeasy

A machine learning engineer is deploying a model using Amazon SageMaker and wants to automatically scale the endpoint based on the number of incoming requests. Which scaling policy should be used?

A.Step scaling

B.Scheduled scaling

C.Target tracking scaling

D.Simple scaling

AnswerC

Target tracking automatically adjusts capacity based on a target metric.

Why this answer

SageMaker endpoints support Application Auto Scaling, which can use a target tracking scaling policy based on a metric like InvocationsPerInstance. Simple scaling and step scaling are also possible but target tracking is simpler. Scheduled scaling is for predictable traffic.

Option C: Target tracking scaling is correct. Option D: Simple scaling requires manual thresholds. Option A: Step scaling is more complex.

Option B: Scheduled scaling is for predictable patterns.
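A target tracking policy for an endpoint variant might be described as follows. This is a sketch that only builds the request parameters; the helper name, policy name, cooldown values, and target value are assumptions for illustration, and nothing here calls AWS.

```python
def target_tracking_policy(endpoint_name: str, variant_name: str, target_invocations: float) -> dict:
    """Build Application Auto Scaling parameters for a SageMaker variant (hypothetical helper)."""
    return {
        "PolicyName": f"{endpoint_name}-invocations-target-tracking",  # assumed name
        "ServiceNamespace": "sagemaker",
        "ResourceId": f"endpoint/{endpoint_name}/variant/{variant_name}",
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            # Desired average invocations per instance per minute.
            "TargetValue": float(target_invocations),
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
            },
            "ScaleInCooldown": 300,   # assumed cooldowns, in seconds
            "ScaleOutCooldown": 60,
        },
    }


params = target_tracking_policy("my-endpoint", "AllTraffic", 100)
print(params["ResourceId"])
```

In practice these parameters would be passed to the `put_scaling_policy` call of the boto3 `application-autoscaling` client, after the variant has been registered as a scalable target with `register_scalable_target`.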

Full explanation →

1313

Multi-Selecteasy

Which TWO of the following are benefits of using SageMaker Managed Spot Training? (Select TWO.)

Select 2 answers

A.No need to checkpoint the model

B.Potential for significant cost savings

C.Faster training times

D.Guaranteed instance availability

E.Lower training cost compared to on-demand instances

AnswersB, E

Savings can be up to 90%.

Why this answer

Options B and E are correct. Managed Spot Training uses spare EC2 capacity, which can yield significant cost savings (B) and a lower training cost than on-demand instances (E). Option A is wrong because spot instances can be interrupted, so checkpointing is needed to resume training.

Option C is wrong because spot training may take longer due to interruptions. Option D is wrong because spot instances are not guaranteed to be available.

Full explanation →

1314

MCQeasy

A data scientist wants to evaluate the performance of a multiclass classification model. The model outputs probabilities for 10 classes. Which metric is most appropriate for evaluating the model's ranking performance across all classes?

A.F1 score (macro-averaged)

B.Accuracy

C.Mean Absolute Error

D.Log loss (cross-entropy)

E.ROC AUC (one-vs-rest macro-averaged)

AnswerD

Log loss directly measures the quality of probability predictions for multiclass problems.

Why this answer

Option D is correct because log loss measures the performance of a classification model whose predictions are probability values, and it penalizes confident misclassifications heavily. Option B (Accuracy) ignores probability calibration. Option E (ROC AUC) is defined for binary classification and must be averaged over one-vs-rest splits for multiclass problems.

Option A (F1 score) is computed per class from hard predictions, not probabilities. Option C (Mean Absolute Error) is for regression.
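Multiclass log loss is the average negative log of the probability assigned to the true class. A minimal sketch with three classes instead of ten, using made-up probabilities:

```python
import math


def multiclass_log_loss(y_true, y_prob, eps=1e-15):
    """Average negative log-probability assigned to the true class.

    y_true: list of integer class indices.
    y_prob: list of per-class probability lists, one per example.
    """
    total = 0.0
    for label, probs in zip(y_true, y_prob):
        # Clip to avoid log(0) when a model outputs exactly 0 or 1.
        p = min(max(probs[label], eps), 1 - eps)
        total += -math.log(p)
    return total / len(y_true)


y_true = [0, 1, 2]
confident = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.2, 0.2, 0.6]]
hesitant = [[0.4, 0.3, 0.3], [0.3, 0.4, 0.3], [0.3, 0.3, 0.4]]

loss_confident = multiclass_log_loss(y_true, confident)
loss_hesitant = multiclass_log_loss(y_true, hesitant)
print(loss_confident, loss_hesitant)
```

Both sets of predictions have the same accuracy, since the true class always has the highest probability. Log loss is lower for the first set because it assigns more probability to the correct class.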

Full explanation →

1315

MCQeasy

A machine learning engineer needs to deploy a model that requires custom inference code with dependencies. Which SageMaker deployment option should be used?

A.Use a SageMaker notebook instance as an endpoint.

B.Create a custom Docker container and deploy to SageMaker endpoint.

C.Use a built-in SageMaker algorithm.

D.Use a SageMaker batch transform job.

AnswerB

Custom container provides flexibility for custom code and dependencies.

Why this answer

Option B is correct because a custom container allows full control over dependencies and inference code. Option C is incorrect because a built-in algorithm may not support custom code. Option A is incorrect because a SageMaker notebook instance is for development, not deployment.

Option D is incorrect because a batch transform job is for batch inference, not real-time.

Full explanation →

1316

MCQeasy

During EDA, a data scientist finds that a feature has a skewness value of 2.5. What does this indicate about the data distribution?

A.The distribution is right-skewed

B.The distribution is symmetric

C.The distribution is left-skewed

D.The distribution has no outliers

AnswerA

Positive skewness indicates a long right tail.

Why this answer

A skewness > 1 indicates a highly right-skewed (positive skew) distribution, meaning the tail extends to the right. Option C is wrong because left skew is negative. Option B is wrong because symmetric distributions have skewness near 0.

Option D is wrong because skewness describes shape, not the presence of outliers specifically.
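The sign convention can be checked with a short calculation. This sketch uses the population (moment-based) skewness formula, the third central moment divided by the 1.5 power of the second central moment; the sample values are made up.

```python
def skewness(values):
    """Population skewness: third central moment over variance ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


right_tailed = [1, 1, 1, 2, 2, 3, 10]       # long tail to the right
symmetric = [1, 2, 3, 4, 5]
left_tailed = [-v for v in right_tailed]    # mirror image: long tail to the left

print(skewness(right_tailed), skewness(symmetric), skewness(left_tailed))
```

The right-tailed sample gives a positive value, the symmetric sample gives zero, and the mirrored sample gives a negative value.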

Full explanation →

1317

Multi-Selecthard

A company uses AWS Glue to run ETL jobs on a daily basis. The jobs read from Amazon RDS and write to Amazon S3. The data volume has grown, and the jobs are taking longer to complete. The team wants to optimize the jobs for cost and performance. Which combination of techniques should the team implement? (Choose THREE.)

Select 3 answers

A.Use a larger Glue worker type, such as G.2X, for more memory per worker.

B.Enable job bookmarks to process only new data since the last run.

C.Increase the number of partitions in the output S3 data to improve parallelism.

D.Increase the maximum number of DPUs for the job to 100.

E.Use pushdown predicates in the JDBC connection to filter data at the source.

AnswersA, B, E

Larger workers provide more resources per task, improving performance.

Why this answer

Options A, B, and E are correct. Using job bookmarks enables incremental processing, reducing the amount of data read. Using pushdown predicates filters data at the source, reducing data transfer.

Using a larger worker type (e.g., G.2X) increases memory and CPU per worker, improving performance. Option C is wrong because adding more partitions to the output does not speed up the job. Option D is wrong because increasing the number of DPUs increases cost linearly and may not be as effective as using larger workers.

Full explanation →

1318

MCQhard

Refer to the exhibit. A data engineer examines the output of 'aws glue get-job-run' for a failed job. The job run state is FAILED, but ErrorMessage is empty. The job ran for 3600 seconds (1 hour) before failing. What is the MOST likely cause of the failure?

A.The JDBC connection to the source database timed out.

B.The IAM role does not have sufficient permissions to access S3.

C.The job ran out of memory due to insufficient DPU allocation.

D.The Python script has a syntax error.

AnswerC

Out-of-memory errors may not always produce a detailed error message in the job run output.

Why this answer

Option C is correct. The job ran for 3600 seconds (60 minutes), far short of the 2880-minute (48-hour) timeout shown in the exhibit, so it did not time out. Glue jobs often fail with an empty error message when they hit resource constraints such as memory, and the MaxCapacity of 10 DPUs may be insufficient for the data size.

Option B is wrong because a permissions problem would show an Access Denied error. Option D is wrong because Python errors would appear in the logs. Option A is wrong because a connection timeout would produce an error message.

Full explanation →

1319

MCQmedium

A data scientist is training a model using Amazon SageMaker and wants to automatically stop training when the model stops improving. Which feature should be used?

A.Use SageMaker Debugger to monitor the loss metric.

B.Configure a CloudWatch alarm on the training job's CPU utilization.

C.Use SageMaker Hyperparameter Tuning with random search.

D.Enable early stopping in the training job configuration.

AnswerD

Stops training if improvement plateaus.

Why this answer

Option D is correct because early stopping automatically halts a training job when the model's objective metric (e.g., loss or accuracy) ceases to improve over a specified number of steps or epochs, saving compute time and avoiding overfitting. In SageMaker this is typically enabled through an algorithm hyperparameter (for example, `early_stopping_rounds` in the built-in XGBoost algorithm) or through the `TrainingJobEarlyStoppingType` setting of a hyperparameter tuning job. The training job's `StoppingCondition` parameter is different: it only caps runtime (for example, `MaxRuntimeInSeconds`) and does not monitor metrics.

Exam trap

The trap here is that candidates confuse SageMaker Debugger's monitoring capabilities with automatic stopping, but Debugger only provides hooks for custom actions (e.g., via rules like `LossNotDecreasing`) and does not halt training without additional configuration, whereas early stopping is a direct, built-in feature.

How to eliminate wrong answers

Option A is wrong because SageMaker Debugger is designed for debugging and profiling training jobs (e.g., capturing tensors, monitoring system bottlenecks), not for automatically stopping training based on metric stagnation; it can emit alerts but does not stop training without additional configuration. Option B is wrong because a CloudWatch alarm on CPU utilization monitors infrastructure health (e.g., resource exhaustion), not model performance metrics like loss or accuracy, so it cannot determine when the model stops improving. Option C is wrong because SageMaker Hyperparameter Tuning with random search is a strategy for exploring hyperparameter combinations to find optimal values, not a mechanism to stop an individual training job early; early stopping can be used within a tuning job, but it is a separate setting and not part of the search strategy.
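The general idea behind metric-based early stopping can be sketched with a patience counter. This is a generic illustration, not SageMaker's implementation; the function name, `patience`, and `min_delta` are hypothetical parameters and the loss values are made up.

```python
def early_stopping_epoch(losses, patience=3, min_delta=0.0):
    """Return the epoch index at which training would stop, or None.

    Training stops once the loss has failed to improve on the best value
    by more than min_delta for `patience` consecutive epochs.
    """
    best = float("inf")
    stale_epochs = 0
    for epoch, loss in enumerate(losses):
        if loss < best - min_delta:
            best = loss
            stale_epochs = 0
        else:
            stale_epochs += 1
            if stale_epochs >= patience:
                return epoch
    return None


validation_losses = [1.0, 0.8, 0.7, 0.71, 0.72, 0.70, 0.69]
stop_at = early_stopping_epoch(validation_losses, patience=3)
print(stop_at)
```

Here the best loss (0.7) is reached at epoch 2, and the next three epochs fail to beat it, so training stops at epoch 5 and the final epoch never runs.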

Full explanation →

1320

MCQhard

A machine learning team is deploying a real-time inference endpoint on Amazon SageMaker for a model that requires low latency (<100 ms). The model is a PyTorch model with custom pre- and post-processing logic. The team uses a SageMaker Model with a custom inference container. After deployment, they observe that the endpoint takes over 500 ms for the first request, but subsequent requests are fast (~50 ms). What is the MOST likely cause?

A.The instance type is too small to handle the model size.

B.The model is too large and exceeds the instance memory.

C.The container has a cold start delay because the model needs to be loaded into memory from Amazon S3 on the first request.

D.The endpoint is not configured with auto-scaling.

AnswerC

Cold start occurs when no idle instances are available; model loading from S3 adds latency.

Why this answer

Option C is correct because the first request often suffers from cold start latency due to container initialization and model loading. Option A is wrong because the issue is transient and not related to instance type. Option B is wrong because subsequent requests are fast, which shows that the model fits in instance memory.

Option D is wrong because the issue is not about auto-scaling but about initialization.

Full explanation →

1321

MCQeasy

A machine learning team needs to deploy a model that makes real-time predictions with latency under 100 ms. The model is a deep neural network with 500 MB of parameters. Which AWS service should they use?

A.AWS Glue

B.AWS Lambda with a container image

C.Amazon SageMaker real-time endpoint

D.Amazon EMR

AnswerC

SageMaker real-time endpoints provide low-latency inference for large models.

Why this answer

Amazon SageMaker real-time endpoints are designed for low-latency inference, so option C is correct. Option B is wrong because AWS Lambda adds cold start latency when loading large models (the 250 MB limit applies to .zip deployment packages; container images can be up to 10 GB). Option D is wrong because Amazon EMR is for big data processing, not real-time inference.

Option A is wrong because AWS Glue is for ETL jobs.

Full explanation →

1322

MCQhard

A machine learning team is deploying a real-time inference endpoint for a recommendation model using Amazon SageMaker. The model takes a long time to load (several minutes) due to its size (5 GB). Which deployment strategy minimizes the cold start latency?

A.Use a single instance with a large memory size

B.Use Multi-Model Endpoints to keep the model loaded between invocations

C.Use SageMaker Serverless Inference

D.Use a larger instance type with more vCPUs

AnswerB

Multi-Model Endpoints allow models to stay loaded in memory, reducing cold start.

Why this answer

Option B is correct because Multi-Model Endpoints are designed to reduce cold start by keeping models loaded in memory between invocations, although the first on-demand load of a large model may still be slow. Option D is wrong because a larger instance with more vCPUs does not reduce the time needed to load the model.

Option C is wrong because Serverless Inference has cold starts. Option A is wrong because a single instance with larger memory doesn't reduce load time significantly.

Full explanation →

1323

MCQeasy

A data engineer is tasked with building a pipeline to process streaming data from IoT devices. The devices send data in JSON format every second. The pipeline must aggregate data in 5-minute windows and store the results in Amazon S3. The engineer needs to handle late-arriving data (up to 1 hour) and ensure exactly-once semantics. Which combination of AWS services should they use?

A.Amazon Kinesis Data Streams for ingestion, Amazon Kinesis Data Analytics for windowed aggregations, and Amazon Kinesis Data Firehose to write to Amazon S3.

B.Amazon Kinesis Data Streams for ingestion, AWS Glue Streaming ETL for aggregation, and Amazon S3 for storage.

C.Amazon SQS for ingestion, AWS Lambda for aggregation, and Amazon S3 for storage.

D.Amazon Kinesis Data Streams for ingestion, Amazon Kinesis Data Firehose for transformation, and Amazon S3 for storage.

AnswerA

Kinesis Data Analytics supports windowed aggregations and exactly-once processing; Firehose delivers to S3 with minimal overhead.

Why this answer

Option A is correct because Kinesis Data Analytics supports windowed aggregations, can handle late data via watermarking, and provides exactly-once processing when used with Kinesis Data Streams. Option D is wrong because Kinesis Data Firehose does not allow custom windowed aggregations. Option B is wrong because Glue Streaming ETL is oriented toward micro-batch processing.

Option C is wrong because Lambda does not have built-in support for exactly-once semantics for streaming applications.
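The windowing logic in the question (5-minute tumbling windows, up to 1 hour of allowed lateness) can be illustrated in plain Python. This is a conceptual sketch only, not the Kinesis Data Analytics API; the function name and the event tuples are made up, and times are in seconds.

```python
from collections import defaultdict

WINDOW_SECONDS = 300             # 5-minute tumbling windows
ALLOWED_LATENESS_SECONDS = 3600  # accept events up to 1 hour late


def aggregate_by_window(events):
    """Sum values per window, keyed by window start time.

    events: iterable of (event_time, arrival_time, value) tuples.
    Events arriving later than the allowed lateness are dropped and counted.
    """
    sums = defaultdict(float)
    dropped = 0
    for event_time, arrival_time, value in events:
        if arrival_time - event_time > ALLOWED_LATENESS_SECONDS:
            dropped += 1
            continue
        # Assign the event to its window by event time, not arrival time.
        window_start = event_time - event_time % WINDOW_SECONDS
        sums[window_start] += value
    return dict(sums), dropped


events = [
    (0, 1, 1.0),
    (299, 300, 2.0),
    (300, 301, 3.0),
    (10, 3000, 4.0),   # late, but within the allowed hour
    (20, 4000, 5.0),   # more than an hour late, so dropped
]
window_sums, dropped_count = aggregate_by_window(events)
print(window_sums, dropped_count)
```

Grouping by event time rather than arrival time is what allows a late record to be counted in the window it belongs to.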

Full explanation →

1324

Multi-Selectmedium

A data scientist is training a deep learning model for object detection using Amazon SageMaker. The training job is using a single GPU instance and is taking too long. Which THREE actions can reduce training time? (Choose THREE.)

Select 3 answers

A.Use a CPU instance instead of GPU

B.Enable mixed precision training with FP16

C.Use a GPU instance with more GPUs, such as p3.16xlarge

D.Reduce the batch size

E.Use distributed training across multiple instances

AnswersB, C, E

Mixed precision uses half-precision floats, speeding up computation and reducing memory usage.

Why this answer

Option B is correct because enabling mixed precision training with FP16 reduces memory usage and accelerates computation by using half-precision floating-point numbers where possible, which is particularly effective on NVIDIA GPUs with Tensor Cores (e.g., V100, A100). This can nearly double throughput for deep learning models without sacrificing model accuracy, as critical operations still use FP32 precision.

Exam trap

The trap here is that candidates often confuse reducing batch size with speeding up training, but in practice, smaller batches increase the number of gradient updates and can lead to longer wall-clock time, especially on GPU instances where larger batches better utilize parallel hardware.

Full explanation →

1325

Multi-Selectmedium

Which TWO options are valid ways to reduce inference latency for a model deployed on a SageMaker real-time endpoint? (Select TWO.)

Select 2 answers

A.Use SageMaker batch transform instead of real-time endpoint

B.Deploy the model to multiple instances behind a load balancer

C.Enable SageMaker Neo to compile the model for the target instance

D.Use a GPU instance type for the endpoint

E.Increase the endpoint's invocation timeout

AnswersC, D

Neo optimizes model for faster inference.

Why this answer

Using a GPU instance (option D) can accelerate model computations. Enabling SageMaker Neo compilation (option C) optimizes the model for the target hardware. Option E is wrong because increasing the timeout does not reduce latency.

Option A is wrong because batch transform is not real-time. Option B is wrong because increased instance count does not directly reduce latency per request.

Full explanation →

1326

MCQhard

A data scientist is working on a predictive maintenance project for a manufacturing company. Sensor data is collected every second from 100 machines and stored in an Amazon S3 bucket as Parquet files, partitioned by machine_id and date. The dataset is massive (10 TB) and contains over 2000 features per machine. The data scientist needs to perform exploratory data analysis to identify which features are most predictive of machine failure. They have access to Amazon SageMaker Studio with a SageMaker Data Wrangler flow. The initial data exploration is taking too long due to the volume of data. The data scientist wants to speed up the analysis without losing accuracy in feature selection. Which course of action is most appropriate?

A.Switch to using Amazon EMR with Spark to perform distributed feature selection on the full dataset

B.Reduce the data to a single partition by concatenating all files and use only one machine's data

C.Use SageMaker Data Wrangler to create a stratified sample by machine_id and date, then analyze the sample

D.Use Amazon Athena to query a random sample of rows from the dataset

AnswerC

Correct: Stratified sampling preserves distribution of key variables and reduces data size.

Why this answer

Option C is correct because using SageMaker Data Wrangler's sampling capabilities allows faster exploration while preserving statistical properties for feature selection. Option B is wrong because reducing to a single partition loses time series context. Option A is wrong because running distributed feature selection on the full dataset with Amazon EMR adds infrastructure and cost when a representative sample is sufficient.

Option D is wrong because a random sample of rows may break time series ordering.
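Stratified sampling draws the same fraction from every group so that no machine or date disappears from the sample. A minimal sketch in plain Python, not Data Wrangler's implementation; the row layout, function name, and 20% fraction are assumptions for illustration.

```python
import random
from collections import defaultdict


def stratified_sample(rows, key, fraction, seed=0):
    """Sample the same fraction from each stratum defined by key(row)."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for row in rows:
        strata[key(row)].append(row)
    sample = []
    for members in strata.values():
        # Keep at least one row per stratum so no group is lost.
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample


# Toy dataset: 3 machines x 2 dates x 10 readings each.
rows = [
    {"machine_id": m, "date": d, "reading": i}
    for m in ("m1", "m2", "m3")
    for d in ("2024-01-01", "2024-01-02")
    for i in range(10)
]
sample = stratified_sample(rows, key=lambda r: (r["machine_id"], r["date"]), fraction=0.2)
print(len(sample))
```

Every (machine_id, date) combination contributes rows to the sample, which a plain random sample of rows does not guarantee.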

Full explanation →

1327

MCQmedium

A data engineer runs a SQL query on Amazon Athena to explore a dataset stored in S3 as CSV. The query returns zero rows for a column that should have numeric values. Which step should the engineer take to diagnose the issue?

A.Verify that the S3 bucket has encryption enabled.

B.Run an AWS Glue crawler to update the table schema.

C.Add a partition to the table for the date column.

D.Check the table schema in AWS Glue Data Catalog to ensure the column data type is correct.

AnswerD

Incorrect data type can cause Athena to return null values.

Why this answer

Option D is correct because checking the schema and the column data type can reveal issues such as unquoted commas or a wrong format. Option A is wrong because the issue is likely with data types, not encryption. Option C is wrong because adding a partition won't fix data type issues.

Option B is wrong because re-running a crawler does not fix data types that were inferred incorrectly.

Full explanation →

1328

MCQmedium

A data scientist is deploying a regression model in Amazon SageMaker that predicts housing prices. The model shows high bias (underfitting). Which action is most likely to reduce bias?

A.Reduce the amount of training data

B.Increase regularization strength

C.Use a simpler model

D.Add more features or increase model complexity

AnswerD

More complex models can capture patterns better.

Why this answer

High bias (underfitting) means the model is too simple to capture the underlying patterns in the data. Adding more features or increasing model complexity (e.g., using polynomial features, deeper trees, or a more flexible algorithm) directly addresses underfitting by giving the model greater capacity to learn from the data. In Amazon SageMaker, this could involve using a more complex built-in algorithm like XGBoost with deeper trees or adding feature engineering transformations in a processing job.

Exam trap

The trap here is that candidates often confuse bias with variance and incorrectly choose regularization or simpler models, which are solutions for overfitting (high variance), not underfitting (high bias).

How to eliminate wrong answers

Option A is wrong because reducing the amount of training data would exacerbate underfitting by providing even less information for the model to learn from. Option B is wrong because increasing regularization strength penalizes model complexity further, which would increase bias and worsen underfitting. Option C is wrong because using a simpler model would reduce capacity even more, directly increasing bias rather than reducing it.

Full explanation →

1329

Multi-Selecteasy

A data scientist is evaluating a regression model. Which TWO metrics are appropriate for evaluating regression performance?

Select 2 answers

A.Root Mean Squared Error (RMSE)

B.F1 score

C.Area Under the ROC Curve (AUC)

D.R-squared

E.Precision

AnswersA, D

RMSE measures average prediction error.

Why this answer

Correct options: A (Root Mean Squared Error) and D (R-squared). RMSE measures average error magnitude, and R-squared measures the proportion of variance explained. Option B (F1 score) is for classification.

Option E (Precision) is for classification. Option C (AUC) is for classification.
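Both regression metrics follow directly from the residuals. A minimal sketch with made-up values:

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error: square root of the mean squared residual."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(y_true))


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 9.0]
print(rmse(y_true, y_pred), r_squared(y_true, y_pred))
```

RMSE is expressed in the units of the target, while R-squared is unitless and compares the model against always predicting the mean.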

Full explanation →

1330

MCQhard

A machine learning team is using SageMaker to train a model with a custom Docker container. The training script runs locally but fails on SageMaker with a 'Permission denied' error when writing to /opt/ml/model. What is the likely cause?

A.The container's user does not have write permission to /opt/ml/model

B.The Docker image is too large

C.The training script is trying to read from /opt/ml/input/data instead of /opt/ml/input/data/training

D.The training data is not in the correct S3 bucket

AnswerA

SageMaker mounts /opt/ml/model as a volume; the user must have write access.

Why this answer

SageMaker expects the container to write the model artifact to /opt/ml/model. If the user in the container lacks write permissions, the write fails. Option A is correct.

Option B is unrelated to file permissions. Option D would cause different errors. Option C is about reading data.

1331

MCQeasy

Refer to the exhibit. A data engineer has deployed this CloudFormation template. The Glue job 'my-etl-job' reads from the S3 bucket 'my-data-lake-bucket' and writes transformed data to another bucket. After 30 days, the data engineer notices that the Glue job fails with 'Input data not found' errors. What is the most likely cause?

A.The temporary directory 'my-temp-dir' is being cleaned up by the lifecycle configuration.

B.The script location 's3://my-scripts/etl.py' is being deleted by the lifecycle rule.

C.The job bookmark option 'job-bookmark-enable' is causing the job to skip newly arriving data.

D.The lifecycle configuration deletes objects from the bucket after 30 days, removing the input data.

AnswerD

The ExpirationInDays: 30 rule deletes objects older than 30 days, which may include input data.

Why this answer

Option D is correct. The lifecycle rule expires objects after 30 days, removing the input data that the Glue job expects. Option C is incorrect because job bookmarks track state but do not cause data loss.

Option A is incorrect because the temp directory is separate. Option B is incorrect because the script location is not affected by the lifecycle rule.

1332

Multi-Selectmedium

A company uses SageMaker to run training jobs on a schedule. The training data is stored in an S3 bucket that receives new data every hour. Which TWO approaches can the company use to trigger a training job when new data arrives?

Select 2 answers

A.Use an SQS queue to buffer S3 events and poll from a training instance

B.Set up an Amazon EventBridge rule that triggers on S3 Object Created events and targets a Lambda function

C.Use AWS Step Functions with a scheduled execution

D.Use CloudWatch Logs to monitor S3 access logs and trigger a Lambda function

E.Configure an S3 event notification to invoke a Lambda function that starts the training job

AnswersB, E

EventBridge can capture S3 events and invoke Lambda.

Why this answer

Options B and E are correct. Option B: an EventBridge rule can match S3 Object Created events and target a Lambda function that starts the training job. Option E: an S3 event notification can invoke a Lambda function directly.

Option A is wrong because SQS is not directly integrated with SageMaker. Option C is wrong because a scheduled Step Functions execution runs on a timer and would still need a trigger to react to new data. Option D is wrong because CloudWatch Logs are for logs, not events.
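
The Lambda side of Option E could look roughly like the sketch below. It assumes the standard S3 event notification payload (an EventBridge event has a different shape), and `BASE_TRAINING_REQUEST`, the `retrain-` job name prefix, and the `training` channel name are hypothetical placeholders.

```python
import time
from urllib.parse import unquote_plus

# Hypothetical template: a real request also needs AlgorithmSpecification,
# RoleArn, OutputDataConfig, ResourceConfig, and StoppingCondition.
BASE_TRAINING_REQUEST = {}


def parse_s3_event(event):
    """Return (bucket, key) pairs from an S3 event notification payload."""
    pairs = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        pairs.append((bucket, key))
    return pairs


def build_request(bucket, key, now=None):
    """Copy the template and point the input channel at the new object's prefix."""
    prefix = key.rsplit("/", 1)[0] if "/" in key else ""
    timestamp = int(now if now is not None else time.time())
    request = dict(BASE_TRAINING_REQUEST)
    request["TrainingJobName"] = f"retrain-{timestamp}"
    request["InputDataConfig"] = [{
        "ChannelName": "training",
        "DataSource": {"S3DataSource": {
            "S3DataType": "S3Prefix",
            "S3Uri": f"s3://{bucket}/{prefix}",
            "S3DataDistributionType": "FullyReplicated",
        }},
    }]
    return request


def lambda_handler(event, context):
    import boto3  # lazy import so the helpers above run without AWS installed

    client = boto3.client("sagemaker")
    for bucket, key in parse_s3_event(event):
        client.create_training_job(**build_request(bucket, key))
```

With hourly uploads, a second-resolution job name is probably unique enough; a busier bucket would need a stronger naming scheme.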

1333

MCQmedium

During EDA, a data scientist finds that a feature has a skewed distribution. They want to apply a log transformation to make it more Gaussian-like. Which Amazon SageMaker feature is most appropriate for this transformation?

A.Amazon SageMaker Data Wrangler

B.Amazon SageMaker Clarify

C.Amazon SageMaker JumpStart

D.Amazon SageMaker Ground Truth

AnswerA

Data Wrangler provides a visual interface for transformations like log scaling.

Why this answer

Option A is correct because SageMaker Data Wrangler provides a visual interface to apply transformations like log scaling without writing code. Option D is wrong because SageMaker Ground Truth is for labeling, not transformation. Option C is wrong because SageMaker JumpStart is for pre-built models.

Option B is wrong because SageMaker Clarify is for bias detection and explainability.
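
Data Wrangler applies the transform visually, but the underlying arithmetic is simple. The sketch below is plain Python, not Data Wrangler code, and uses made-up right-skewed values to show a log transform pulling in the long tail.

```python
import math


def skewness(values):
    """Population skewness: third central moment over the 1.5 power of the variance."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical feature with a long right tail.
raw = [1, 2, 2, 3, 3, 4, 5, 8, 20, 100]

# log1p computes log(1 + x), which stays defined when a value is zero.
logged = [math.log1p(v) for v in raw]

print(skewness(raw), skewness(logged))
```

The transformed values should show a noticeably smaller positive skew than the raw ones.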

1334

MCQeasy

A company uses Amazon Redshift for data warehousing. The data engineering team needs to load data from multiple S3 buckets into Redshift daily. Each bucket contains files in different formats (CSV, JSON, Parquet). Which AWS service is BEST suited to automate this ingestion process?

A.Amazon EMR with Apache Spark

B.AWS Data Pipeline

C.AWS Database Migration Service (DMS)

D.AWS Glue

AnswerD

Glue provides crawlers for schema discovery and ETL jobs for loading into Redshift.

Why this answer

Option D is correct. AWS Glue can crawl the S3 buckets to discover schema and run ETL jobs to transform and load data into Redshift. Option C (DMS) is for database migration.

Option A (EMR) requires more management. Option B (Data Pipeline) is less flexible.

1335

MCQmedium

A company is using Amazon SageMaker to deploy a model for real-time inference. The endpoint uses an ml.c5.xlarge instance. The company wants to reduce costs without affecting performance. The current traffic pattern shows a daily peak of 500 requests per second for 2 hours, and the rest of the day sees fewer than 50 requests per second. The model has a cold start time of about 30 seconds. What should the company do?

A.Switch to a serverless inference endpoint.

B.Configure an auto scaling policy that scales down during low traffic and keep a minimum of 1 instance.

C.Use a single ml.c5.xlarge instance and rely on it.

D.Use SageMaker Batch Transform for all predictions.

AnswerB

Auto scaling reduces instances during low traffic, and minimum instance prevents cold starts.

Why this answer

Option B is correct because adding a scaling policy to scale down during low traffic reduces cost, and keeping a minimum of one instance ensures low latency during low traffic without cold starts. Option A is wrong because serverless endpoints have cold starts and may not handle 500 TPS. Option D is wrong because Batch Transform is not for real-time inference.

Option C is wrong because one instance during peak may cause latency.
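
A scaling setup of this kind might be sketched as follows, using the Application Auto Scaling calls `register_scalable_target` and `put_scaling_policy`. The endpoint name, variant name, capacity bounds, target value, and cooldowns are all hypothetical and would need tuning against real traffic.

```python
def scaling_requests(endpoint_name, variant_name, min_instances=1,
                     max_instances=4, target_invocations_per_instance=100.0):
    """Build the two request payloads for target-tracking scaling on a variant."""
    common = {
        "ServiceNamespace": "sagemaker",
        "ResourceId": f"endpoint/{endpoint_name}/variant/{variant_name}",
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
    }
    # MinCapacity of 1 keeps one warm instance during the quiet hours.
    target = dict(common, MinCapacity=min_instances, MaxCapacity=max_instances)
    policy = dict(
        common,
        PolicyName=f"{endpoint_name}-invocations-target",
        PolicyType="TargetTrackingScaling",
        TargetTrackingScalingPolicyConfiguration={
            "TargetValue": target_invocations_per_instance,
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
            },
            "ScaleInCooldown": 300,   # seconds to wait before removing instances
            "ScaleOutCooldown": 60,   # seconds to wait before adding more
        },
    )
    return target, policy


def apply_scaling(autoscaling_client, target, policy):
    """Send both requests; expects boto3.client('application-autoscaling')."""
    autoscaling_client.register_scalable_target(**target)
    autoscaling_client.put_scaling_policy(**policy)


target, policy = scaling_requests("churn-endpoint", "AllTraffic")
```

Separating payload construction from the API calls keeps the configuration easy to inspect before anything is sent to AWS.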

1336

Multi-Selecthard

Which THREE factors should be considered when choosing between Amazon Kinesis Data Streams and Amazon Kinesis Data Firehose for a real-time data ingestion pipeline? (Choose 3.)

Select 3 answers

A.Ability to compress data before delivery

B.Ability to encrypt data at rest

C.Need for custom data processing using AWS Lambda

D.Data retention requirements

E.Latency requirements for data delivery to S3

AnswersC, D, E

Kinesis Data Streams supports custom processing with Lambda, Firehose has limited transformation.

Why this answer

Kinesis Data Streams provides custom processing with shard-level throughput and retention up to 365 days. Firehose automatically delivers to destinations like S3, Redshift, and Elasticsearch with near-real-time latency. Options C, D, and E are correct because custom processing, retention, and delivery latency all differ between the two services. Option B is wrong because both support encryption.

Option A is wrong because this explanation does not treat compression before delivery as a deciding factor; both services can handle streaming data, and Firehose is simply the simpler choice for delivery to S3.

1337

MCQmedium

A machine learning team is deploying a model using Amazon SageMaker. They need to automatically retrain the model every week with new data and update the endpoint without downtime. Which approach should they use?

A.Use SageMaker Ground Truth to label new data and trigger retraining

B.Use SageMaker batch transform to periodically generate predictions and replace the model

C.Use AWS Lambda to trigger retraining on a schedule and deploy a new endpoint

D.Use SageMaker automatic model tuning with a schedule and update the endpoint using CreateEndpointConfig and UpdateEndpoint

E.Use SageMaker Pipelines to automate retraining and deploy a new endpoint with blue/green deployment

AnswerD

This allows retraining and zero-downtime update.

Why this answer

Option D is correct because SageMaker automatic model tuning (hyperparameter tuning jobs) can be scheduled, and updating the endpoint with a new model can be done with CreateEndpointConfig and UpdateEndpoint for zero-downtime deployment. Option C (Lambda + retraining) is possible but not the most integrated. Option E (SageMaker Pipelines) can orchestrate retraining, but an endpoint update may still be needed.

Option B (batch transform) is for inference, not retraining. Option A (SageMaker Ground Truth) is for labeling.
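
The CreateEndpointConfig and UpdateEndpoint sequence can be sketched as below. The model, config, and endpoint names are hypothetical, and `RecordingClient` is only a stand-in so the flow can be dry-run without AWS; in practice the client would be `boto3.client("sagemaker")`.

```python
def roll_endpoint(sm_client, endpoint_name, model_name, config_name,
                  instance_type="ml.c5.xlarge", instance_count=1):
    """Create a config for the new model, then repoint the live endpoint at it."""
    sm_client.create_endpoint_config(
        EndpointConfigName=config_name,
        ProductionVariants=[{
            "VariantName": "AllTraffic",
            "ModelName": model_name,
            "InstanceType": instance_type,
            "InitialInstanceCount": instance_count,
        }],
    )
    # The endpoint name stays the same, so callers do not need to change.
    sm_client.update_endpoint(
        EndpointName=endpoint_name,
        EndpointConfigName=config_name,
    )


class RecordingClient:
    """Dry-run stand-in that records calls instead of contacting AWS."""

    def __init__(self):
        self.calls = []

    def create_endpoint_config(self, **kwargs):
        self.calls.append(("create_endpoint_config", kwargs))

    def update_endpoint(self, **kwargs):
        self.calls.append(("update_endpoint", kwargs))


client = RecordingClient()
roll_endpoint(client, "churn-endpoint", "churn-model-week-42", "churn-config-week-42")
```

Each weekly retrain would use a fresh config name, since an existing endpoint configuration cannot be edited in place.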

1338

MCQmedium

A company is using Amazon SageMaker to train a model on a dataset that is updated daily. The data is stored in an S3 bucket. The training pipeline uses AWS Step Functions to orchestrate data preprocessing and model training. The preprocessing step uses a SageMaker Processing job that reads data from S3, cleans it, and writes the output back to S3. The team notices that the training step often fails due to insufficient disk space on the processing instance. Which change should the team make to resolve this issue without increasing cost?

A.Enable automatic scaling for the processing job.

B.Use AWS Batch instead of SageMaker Processing.

C.Use a larger instance type with more memory.

D.Configure the processing job to use local instance store (SSD) for scratch space.

AnswerD

Local instance store provides additional disk space without additional cost.

Why this answer

Option D is correct because using local instance store provides more disk space at no extra cost compared to EBS. Option C is wrong because using a different instance type may increase cost. Option A is wrong because automatic scaling addresses compute capacity, not the disk space available on a processing instance.

Option B is wrong because the team already uses SageMaker Processing, which is appropriate.

1339

MCQhard

A team is building a data lake on Amazon S3 and using AWS Glue to catalog data. They notice that Glue crawlers are taking too long to update the catalog for a large dataset with millions of small files. Which approach will MOST improve crawler performance?

A.Increase the frequency of the crawler runs.

B.Consolidate the small files into larger files (e.g., 100 MB each).

C.Partition the data by date in S3.

D.Use a custom classifier to parse the data.

AnswerB

Fewer, larger files reduce overhead and crawler scan time.

Why this answer

Option B is correct because consolidating small files into larger files reduces the number of objects the crawler must scan. Option A is wrong because increasing crawler frequency doesn't reduce scan time. Option D is wrong because using a custom classifier doesn't reduce scan time.

Option C is wrong because partitioning helps but still leaves many files per partition.
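
As a rough illustration of the consolidation idea, the sketch below only plans which small files to merge into each larger output; it does not read or write S3. The file names, the 1 MiB size, and the 100 MiB target are hypothetical.

```python
def plan_compaction(file_sizes, target_bytes=100 * 1024 * 1024):
    """Group (name, size) pairs into batches that stay at or under target_bytes."""
    batches, current, current_size = [], [], 0
    for name, size in file_sizes:
        # Start a new batch once adding this file would overshoot the target.
        if current and current_size + size > target_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        batches.append(current)
    return batches


# 250 hypothetical files of 1 MiB each.
small_files = [(f"part-{i:05d}.json", 1024 * 1024) for i in range(250)]
batches = plan_compaction(small_files)

print(len(small_files), "objects become", len(batches), "objects")
```

The crawler then lists a handful of objects instead of hundreds, which is where the time saving comes from.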

1340

MCQhard

A data scientist is analyzing clickstream data from a website. The data is stored in Amazon S3 as JSON files, each containing nested arrays. The scientist needs to flatten the nested structures and compute user session durations. Which approach is most efficient for this EDA task?

A.Use Amazon EMR with Apache Spark to process the data.

B.Use Amazon Athena with JSON SerDe to query the data and compute session duration with SQL.

C.Use AWS Glue DataBrew to flatten the JSON and create new columns for session duration.

D.Use Amazon QuickSight to visualize the raw data without flattening.

AnswerC

DataBrew is built for data preparation and can handle nested JSON visually.

Why this answer

AWS Glue DataBrew provides a visual interface to flatten nested JSON and compute derived metrics like session duration without writing code, so Option C is correct. Option B (Athena with JSON SerDe) can query but requires SQL that handles arrays. Option A (EMR with Spark) is more complex.

Option D (QuickSight) is visualization only.
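
For readers who want to see what "flatten and compute session duration" means in practice, here is a small plain-Python sketch of the same two steps. It is not DataBrew code, and the record layout (`user`, `session`, `events`, `ts`) is an assumed example schema.

```python
import json
from datetime import datetime

# Hypothetical clickstream records, one JSON document per line.
raw_records = [
    '{"user": "u1", "session": "s1", "events": ['
    '{"type": "view", "ts": "2024-01-01T10:00:00"}, '
    '{"type": "click", "ts": "2024-01-01T10:05:30"}]}',
    '{"user": "u2", "session": "s2", "events": ['
    '{"type": "view", "ts": "2024-01-01T11:00:00"}]}',
]


def flatten(record):
    """Emit one flat row per element of the nested 'events' array."""
    return [
        {"user": record["user"], "session": record["session"],
         "type": event["type"], "ts": datetime.fromisoformat(event["ts"])}
        for event in record["events"]
    ]


def session_durations(rows):
    """Seconds between the first and last event of each (user, session)."""
    bounds = {}
    for row in rows:
        key = (row["user"], row["session"])
        first, last = bounds.get(key, (row["ts"], row["ts"]))
        bounds[key] = (min(first, row["ts"]), max(last, row["ts"]))
    return {key: (last - first).total_seconds()
            for key, (first, last) in bounds.items()}


rows = [row for line in raw_records for row in flatten(json.loads(line))]
durations = session_durations(rows)
```

A session with a single event gets a duration of zero under this definition, which may or may not match how the business defines a session.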

1341

MCQeasy

A company is using Amazon SageMaker to train an XGBoost model for predicting customer churn. The training data is stored in an S3 bucket as CSV files. The data scientist runs a hyperparameter tuning job with 50 training jobs. The tuning job completes, but the best model's accuracy on the holdout set is lower than expected. The data scientist suspects that the hyperparameter ranges are too narrow. Which corrective action is most appropriate?

A.Increase the number of training jobs in the tuning job

B.Switch to a different algorithm like Random Forest

C.Expand the hyperparameter ranges for key parameters such as 'max_depth', 'learning_rate', and 'subsample'

D.Change the tuning strategy from random search to Bayesian optimization

AnswerC

Wider ranges allow the tuning job to explore more of the hyperparameter space, potentially finding better configurations.

Why this answer

Option C is correct because the data scientist suspects the hyperparameter ranges are too narrow, which directly limits the model's ability to find an optimal configuration. Expanding ranges for key XGBoost parameters like 'max_depth', 'learning_rate', and 'subsample' allows the tuning job to explore a broader space of model complexities and regularization levels, potentially improving accuracy on the holdout set. This is the most direct fix for the stated problem, as it addresses the root cause rather than increasing job count or changing the search strategy.

Exam trap

The trap here is that candidates often confuse 'more training jobs' (Option A) with 'broader search space', failing to recognize that increasing jobs only refines sampling within existing bounds, not expands them.

How to eliminate wrong answers

Option A is wrong because increasing the number of training jobs does not address the core issue of narrow hyperparameter ranges; it only samples the same limited space more densely, which may not yield a better model if the true optimum lies outside the current bounds. Option B is wrong because switching to a different algorithm like Random Forest is an unnecessary and drastic change; the problem is explicitly about hyperparameter ranges, not algorithm suitability, and XGBoost is a strong choice for tabular churn data. Option D is wrong because changing from random search to Bayesian optimization improves sampling efficiency but does not expand the search space; if the ranges are too narrow, even a more intelligent search cannot find a better configuration outside those bounds.
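
Expanding the ranges amounts to changing the `ParameterRanges` section of the tuning job configuration. The sketch below builds that section as a plain dictionary; the narrow and expanded bounds are invented for illustration, and `eta` is used as the XGBoost name for the learning rate.

```python
def range_entry(name, bounds, scaling="Auto"):
    """One range entry; the tuning API expects the bounds as strings."""
    return {"Name": name, "MinValue": str(bounds[0]),
            "MaxValue": str(bounds[1]), "ScalingType": scaling}


def parameter_ranges(max_depth, eta, subsample):
    """Build ParameterRanges from (min, max) tuples for three XGBoost parameters."""
    return {
        "IntegerParameterRanges": [range_entry("max_depth", max_depth)],
        "ContinuousParameterRanges": [
            # A log scale spreads samples evenly across orders of magnitude.
            range_entry("eta", eta, scaling="Logarithmic"),
            range_entry("subsample", subsample),
        ],
    }


# Hypothetical original ranges that may have excluded the optimum.
narrow = parameter_ranges(max_depth=(4, 6), eta=(0.1, 0.2), subsample=(0.8, 0.9))

# Hypothetical wider ranges for the follow-up tuning job.
expanded = parameter_ranges(max_depth=(2, 12), eta=(0.01, 0.5), subsample=(0.5, 1.0))
```

Wider ranges usually call for more training jobs as well, since the same 50 samples now cover a larger space.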

1342

MCQeasy

A company is streaming clickstream data from a website to Amazon Kinesis Data Streams. The data is consumed by a Lambda function that enriches each record with geolocation information before writing to an S3 bucket. Recently, the Lambda function has been failing with throttling errors. What is the MOST likely cause?

A.The Lambda function's payload size exceeds the 6 MB limit

B.The Lambda function's concurrent execution limit has been reached

C.The Lambda function's reserved concurrency is set too high

D.The Kinesis stream has exceeded the default shard limit of 500

AnswerB

Lambda throttles when the number of concurrent executions exceeds the account limit.

Why this answer

Option B is correct because the default Lambda concurrent execution limit (1000) can be reached if the stream has many shards or high throughput. Option D (shard limit) is a Kinesis limit, not a Lambda limit. Option C (reserved concurrency set too high) would not cause throttling; a concurrency setting that is too low would.

Option A (payload size) would cause a different error.

1343

MCQhard

A company is using SageMaker to host a model for real-time inference. They notice that the endpoint's latency increases over time. The model is stateless and the inference code does not log any errors. What is the MOST likely cause?

A.Memory leak in the inference container

B.Gradual increase in request payload size

C.Endpoint auto scaling is adding new instances

D.Model is accumulating state from previous requests

AnswerA

Memory leaks cause slowdown over time.

Why this answer

Memory leaks cause gradual performance degradation. Option A is correct. Option B is wrong because nothing in the scenario indicates that request payloads are growing.

Option C is wrong because auto scaling would add instances, not degrade existing ones. Option D is wrong because the model is stateless.

1344

MCQhard

A company is using Amazon SageMaker to host a model for real-time inference. The model is a large ensemble of 10 XGBoost models, each 2 GB. The endpoint uses a single ml.c5.18xlarge instance. The inference latency is high (average 2 seconds). Which change would most effectively reduce latency?

A.Use SageMaker Multi-Model Endpoints to serve each model independently

B.Switch to a GPU instance type

C.Add more instances behind a load balancer

D.Use SageMaker Batch Transform instead of real-time endpoint

AnswerA

Multi-Model Endpoints reduce serialization overhead by loading models on demand.

Why this answer

Option A is correct because serialization/deserialization of large models is a bottleneck; SageMaker Multi-Model Endpoints can reduce overhead by loading only the requested model. Option B (GPU) may not help if the bottleneck is CPU. Option D (batch transform) is for offline inference.

Option C (more instances) helps throughput but not per-request latency.

1345

MCQeasy

A data scientist is using Amazon SageMaker to train a linear regression model. The training data has 10 features, and the scientist wants to interpret the model's coefficients. Which algorithm should they use?

A.Amazon SageMaker XGBoost

B.Amazon SageMaker K-Means

C.Amazon SageMaker Factorization Machines

D.Amazon SageMaker Linear Learner

AnswerD

Produces linear coefficients for interpretation.

Why this answer

Option D is correct because Linear Learner provides interpretable coefficients. Option A is wrong because XGBoost is tree-based and less interpretable. Option B is wrong because K-Means is unsupervised.

Option C is wrong because Factorization Machines are for high-dimensional sparse data.

1346

MCQhard

A company stores customer transaction data in Amazon S3. A data scientist needs to perform exploratory data analysis using Amazon SageMaker. The dataset is 500 GB in CSV format. Which approach is most cost-effective and time-efficient for initial data profiling?

A.Use Amazon S3 Select to sample rows directly from S3

B.Load the entire dataset into a SageMaker notebook instance and use pandas

C.Convert the data to Parquet format and then use Athena to query

D.Use AWS Glue ETL to transform the data and then analyze in Athena

AnswerA

S3 Select allows efficient querying of a subset without full data movement.

Why this answer

Option A is correct because Amazon S3 Select can query a subset of rows from S3 without loading the entire dataset, enabling quick profiling. Option B is wrong because loading the full dataset is expensive and slow. Option D is wrong because Glue ETL processes the full dataset.

Option C is wrong because converting to Parquet adds overhead for initial profiling.

1347

MCQmedium

A data scientist is performing EDA on a dataset containing customer transaction records. The dataset includes columns: 'transaction_id', 'customer_id', 'transaction_amount', 'transaction_date', and 'product_category'. The data scientist wants to check for duplicate transactions and identify any suspicious patterns, such as multiple transactions from the same customer on the same day with the same amount. The dataset has 5 million rows. The data scientist is using a SageMaker Studio notebook with an ml.t3.medium instance. The data is stored in S3. What is the most efficient way to perform this analysis?

A.Use a SageMaker Spark processing job with PySpark to aggregate and detect duplicates.

B.Use Amazon Athena to run SQL queries to find duplicates.

C.Load the entire dataset into a pandas DataFrame and use groupby operations.

D.Use AWS Glue DataBrew to create a profile and manually inspect.

AnswerA

Spark can handle large data efficiently.

Why this answer

Using Spark on SageMaker allows distributed processing of large data, so Option A is correct. Option C is wrong because pandas may run out of memory. Option B is wrong because Athena requires SQL queries and external setup.

Option D is wrong because DataBrew is for profiling but not custom duplicate analysis.
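
The duplicate check itself is a group-and-count over three columns. The sketch below shows that logic in plain Python on a few made-up rows; it is not PySpark, where the same grouping would be expressed on a DataFrame and distributed across the processing job's instances.

```python
from collections import Counter

# Hypothetical sample rows using the column names from the question.
transactions = [
    {"transaction_id": "t1", "customer_id": "c1", "transaction_amount": 25.0,
     "transaction_date": "2024-03-01", "product_category": "books"},
    {"transaction_id": "t2", "customer_id": "c1", "transaction_amount": 25.0,
     "transaction_date": "2024-03-01", "product_category": "books"},
    {"transaction_id": "t3", "customer_id": "c1", "transaction_amount": 40.0,
     "transaction_date": "2024-03-01", "product_category": "games"},
    {"transaction_id": "t4", "customer_id": "c2", "transaction_amount": 25.0,
     "transaction_date": "2024-03-01", "product_category": "books"},
]


def suspicious_groups(rows, min_count=2):
    """Count rows per (customer, date, amount) and keep groups seen at least min_count times."""
    counts = Counter(
        (r["customer_id"], r["transaction_date"], r["transaction_amount"])
        for r in rows
    )
    return {key: n for key, n in counts.items() if n >= min_count}


flagged = suspicious_groups(transactions)
```

Grouping on the business key rather than on `transaction_id` matters here, because repeated charges usually carry distinct transaction IDs.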

1348

MCQmedium

A company uses an XGBoost model to predict equipment failures. The model has high precision but low recall. The business impact of a false negative is very high (missing a failure). Which action would MOST effectively increase recall while keeping precision reasonably high?

A.Increase the regularization parameter lambda

B.Set the objective to 'reg:squarederror'

C.Decrease the probability threshold for the positive class

D.Increase the number of boosting rounds

AnswerC

Lower threshold increases recall but may reduce precision.

Why this answer

Decreasing the probability threshold for the positive class means the model will classify a case as a failure at a lower predicted probability, which captures more true positives (increases recall). However, this also allows more false positives, so precision may drop, but the trade-off is acceptable given the high cost of false negatives. This is a standard post-training threshold adjustment for imbalanced classification problems and requires no retraining.

Exam trap

The exam often tests the misconception that increasing boosting rounds or regularization directly improves recall, when in fact the probability threshold is the primary lever for trading off precision and recall after training.

How to eliminate wrong answers

Option A is wrong because increasing the regularization parameter lambda (L2 regularization) reduces model complexity and can lead to underfitting, which typically decreases both precision and recall, not selectively increase recall. Option B is wrong because setting the objective to 'reg:squarederror' treats the problem as regression, not classification, so the model outputs continuous values without a probability threshold, making it unsuitable for recall-focused binary classification. Option D is wrong because increasing the number of boosting rounds can lead to overfitting, which may increase variance and actually degrade recall on unseen data, and does not directly control the trade-off between precision and recall.
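
The effect of moving the threshold can be checked with a few lines of plain Python. The labels and scores below are invented to mimic a high-precision, low-recall starting point; they are not taken from the question.

```python
def precision_recall(labels, scores, threshold):
    """Precision and recall when scores at or above threshold count as positive."""
    predicted = [s >= threshold for s in scores]
    tp = sum(1 for y, p in zip(labels, predicted) if y == 1 and p)
    fp = sum(1 for y, p in zip(labels, predicted) if y == 0 and p)
    fn = sum(1 for y, p in zip(labels, predicted) if y == 1 and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical ground truth (1 = failure) and predicted failure probabilities.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.7, 0.45, 0.35, 0.4, 0.32, 0.2, 0.1]

for threshold in (0.5, 0.3):
    print(threshold, precision_recall(labels, scores, threshold))
```

On this toy data, dropping the threshold from 0.5 to 0.3 catches every failure at the price of two false alarms, which is the trade-off the explanation describes.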

1349

MCQmedium

A machine learning engineer is deploying a real-time inference endpoint using Amazon SageMaker. The model is a large deep learning model that requires low latency (under 100 ms) and high throughput (1000 requests per second). Which SageMaker deployment option is MOST suitable?

A.Deploy the model on a single endpoint with automatic scaling based on CPU utilization.

B.Use SageMaker Serverless Inference with provisioned concurrency.

C.Use SageMaker Inference Recommender to find the optimal instance type and endpoint configuration.

D.Use a multi-model endpoint to load multiple copies of the model on the same instance.

AnswerC

Inference Recommender runs load tests and suggests the best instance and configuration to meet latency and throughput targets.

Why this answer

Option C is correct because SageMaker Inference Recommender provides automated testing and recommendations for instance type and configuration to meet latency and throughput requirements. Option D is wrong because Multi-Model Endpoints are designed for multiple small models, not optimized for a single large model's throughput. Option B is wrong because Serverless Inference has a maximum concurrency and may not achieve 1000 TPS with low latency.

Option A is wrong because a single endpoint may not handle the load; Auto Scaling helps but does not guarantee optimal instance choice.

1350

MCQhard

A machine learning engineer is using AWS Step Functions to orchestrate a SageMaker training job followed by a Lambda function for post-processing. The training job completes successfully, but the Lambda function fails with a timeout error. What is the MOST likely cause?

A.The Lambda function's IAM role lacks permissions to access the training output

B.The Lambda function execution time exceeds the maximum timeout limit

C.The Step Functions state machine has a misconfigured retry policy

D.The SageMaker training job output data is too large for Lambda to process

AnswerB

Lambda timeout is 15 minutes max.

Why this answer

Lambda has a maximum execution timeout of 15 minutes. If post-processing takes longer, it will time out. Option B is correct.

Option A is wrong because the failure is a timeout, not a permission issue. Option C is wrong because Step Functions itself is not the cause. Option D is wrong because the training job completed successfully and the reported error is a timeout, not a size limit.
