AWS Certified Machine Learning Specialty (MLS-C01): Questions 151-225

1755 questions total · 24 pages · All types, answers revealed

Page 3 of 24

151
MCQ · medium

A company uses AWS Glue ETL jobs to process data from multiple sources. The job fails with the error: 'An error occurred while calling o123.pyWriteDynamicFrame. Insufficient memory.' The job runs on a G.1X worker type with 10 workers. What should be changed to resolve this error?

A. Increase the number of workers to 20.
B. Enable the Spark UI to monitor the job.
C. Change the worker type to G.2X.
D. Reduce the number of partitions in the DynamicFrame.
Answer: A

More workers increase parallelism and reduce memory pressure per worker.

Why this answer

The error 'Insufficient memory' in AWS Glue ETL jobs typically indicates that the total memory across all executors is insufficient for the data being processed. Increasing the number of workers from 10 to 20 doubles the total memory and compute capacity available, allowing the job to handle larger datasets without running out of memory. This is the most direct and effective fix for a memory exhaustion error when using the G.1X worker type.

Exam trap

The trap here is that candidates often confuse 'insufficient memory' with a per-worker memory limit and choose to upgrade the worker type (G.2X), but the error is about total cluster memory, which is more effectively addressed by increasing the number of workers.

How to eliminate wrong answers

Option B is wrong because enabling the Spark UI only provides monitoring and debugging capabilities; it does not allocate additional memory or resolve the underlying memory shortage. Option C is wrong because changing the worker type to G.2X doubles the memory per worker (from 16 GB to 32 GB), but the error is about total memory insufficiency, and increasing the number of workers (option A) is a more scalable and cost-effective approach that directly addresses the error without requiring a change in worker type. Option D is wrong because reducing the number of partitions in the DynamicFrame would actually increase the data size per partition, potentially worsening memory pressure on individual executors, not resolving the overall memory shortage.
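
The memory figures quoted above can be compared with a little arithmetic. The sketch below is only an illustration: it uses the 16 GB and 32 GB per-worker values from the explanation and ignores the driver and any runtime overhead, which is a simplifying assumption. The helper name `cluster_memory` is hypothetical.

```python
# Memory per worker in GB, as quoted in the explanation above.
WORKER_MEMORY_GB = {"G.1X": 16, "G.2X": 32}


def cluster_memory(worker_type: str, num_workers: int) -> dict:
    """Return per-worker and total memory (GB) for a job configuration.

    Simplification: every worker is counted, with no driver or overhead.
    """
    per_worker = WORKER_MEMORY_GB[worker_type]
    return {"per_worker_gb": per_worker, "total_gb": per_worker * num_workers}


current = cluster_memory("G.1X", 10)   # the failing configuration
option_a = cluster_memory("G.1X", 20)  # double the number of workers
option_c = cluster_memory("G.2X", 10)  # double the memory per worker

for name, cfg in [("current", current), ("option A", option_a), ("option C", option_c)]:
    print(f"{name}: {cfg['per_worker_gb']} GB per worker, {cfg['total_gb']} GB total")
```

Under these assumptions options A and C reach the same total, and differ only in how much memory each individual worker has.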

152
MCQ · hard

A data pipeline uses Amazon Kinesis Data Streams to ingest event data. The data is consumed by an AWS Lambda function, which writes to Amazon DynamoDB. The Lambda function is experiencing throttling errors, and the DynamoDB write capacity is underutilized. The events must be processed in order per shard. Which solution most effectively addresses the throttling?

A. Use an SQS FIFO queue between Kinesis and Lambda to buffer
B. Increase the Lambda function's reserved concurrency
C. Increase the write capacity units of the DynamoDB table
D. Increase the number of shards in the Kinesis Data Stream
Answer: D

More shards allow more parallel Lambda invocations, reducing throttling.

Why this answer

Adding more shards to the Kinesis stream increases the number of concurrent Lambda invocations, spreading the load. Option A (using an SQS FIFO queue) would decouple the stages but may cause duplicates. Option B (increasing Lambda reserved concurrency) alone may not help if Lambda is throttled due to concurrency limits; adding shards is more effective. Option C (increasing DynamoDB write capacity) does not address Lambda throttling, and the table's write capacity is already underutilized.

153
MCQ · hard

A team is deploying a model for fraud detection. The dataset is highly imbalanced (99% legitimate, 1% fraudulent). They trained a logistic regression model and achieved 99% accuracy on the test set. However, the model fails to detect most fraud cases. Which metric should the team focus on to evaluate the model?

A. Mean squared error
B. Precision
C. Recall
D. Accuracy
Answer: C

Recall measures the proportion of actual fraud cases correctly identified.

Why this answer

Accuracy is misleading for imbalanced datasets. Recall (true positive rate) measures how well the model detects fraud. Option A is wrong because mean squared error is a regression metric. Option B is wrong because precision may be high while recall is low. Option D is wrong because accuracy is already high but misleading.
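
A small worked example may make the gap between accuracy and recall concrete. The sketch below uses made-up labels with the same 99% / 1% split as the question and a trivial model that always predicts "legitimate"; the data and the helper `confusion_counts` are illustrative assumptions, not part of the exam material.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 = fraud)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


# 1,000 synthetic transactions: 1% fraud, 99% legitimate.
y_true = [1] * 10 + [0] * 990
# A degenerate model that never predicts fraud.
y_pred = [0] * 1000

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
recall = tp / (tp + fn)

print(f"accuracy={accuracy:.2f} recall={recall:.2f}")
```

The always-negative model reaches 99% accuracy while missing every fraud case, which is the failure that recall exposes.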

154
MCQ · medium

A company's machine learning model is overfitting to the training data. The data scientist has already tried reducing the model complexity and adding regularization, but the model still overfits. Which technique could the data scientist use to further reduce overfitting?

A. Use data augmentation to increase the training dataset size
B. Decrease the batch size
C. Increase the number of training epochs
D. Increase the learning rate
Answer: A

Data augmentation creates more training examples, which helps the model generalize better and reduces overfitting.

Why this answer

Data augmentation artificially increases the size and diversity of the training dataset by applying transformations (e.g., rotations, flips, noise injection) to existing samples. This exposes the model to more varied examples, reducing its tendency to memorize noise and improving generalization — directly countering overfitting when other methods have failed.

Exam trap

AWS often tests the misconception that hyperparameter tuning (e.g., batch size, learning rate) is a primary cure for overfitting, when in fact these parameters primarily affect optimization dynamics, not the fundamental data scarcity or memorization issue that data augmentation directly addresses.

How to eliminate wrong answers

Option B is wrong because decreasing the batch size introduces noisier gradient estimates, which can sometimes act as a mild regularizer, but it does not fundamentally address overfitting caused by insufficient or repetitive training data. Option C is wrong because increasing the number of training epochs typically worsens overfitting by allowing the model more iterations to memorize the training set. Option D is wrong because increasing the learning rate can destabilize training (e.g., divergence) and does not reduce overfitting; it may even cause the model to skip over generalizable minima.
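
The transformations mentioned above (flips, noise injection) can be sketched with NumPy alone. This is a minimal illustration on synthetic grayscale images, not a production augmentation pipeline; the function name `augment` and the `noise_std` value are assumptions chosen for the example.

```python
import numpy as np


def augment(images: np.ndarray, rng: np.random.Generator, noise_std: float = 0.05) -> np.ndarray:
    """Return the original batch plus horizontally flipped and noise-injected copies.

    `images` has shape (batch, height, width) with values in [0, 1].
    """
    flipped = images[:, :, ::-1]  # reverse the width axis
    noise = rng.normal(0.0, noise_std, size=images.shape)
    noisy = np.clip(images + noise, 0.0, 1.0)  # keep pixel values in range
    return np.concatenate([images, flipped, noisy], axis=0)


rng = np.random.default_rng(0)
batch = rng.random((8, 32, 32))  # 8 synthetic images
augmented = augment(batch, rng)
print(batch.shape, "->", augmented.shape)
```

In a real training loop the labels would need to be repeated in the same order, and the transformations should be ones that do not change the correct label.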

155
MCQ · medium

A machine learning engineer is deploying a model to an Amazon SageMaker endpoint for real-time inference. The model requires a preprocessing step that involves tokenizing text and converting it to a numerical format. To minimize latency, where should the preprocessing logic be implemented?

A. Inside the SageMaker inference container using the inference.py script
B. Using Amazon SageMaker batch transform
C. As a separate AWS Lambda function called before the endpoint
D. On the client side before sending the request
Answer: A

Including preprocessing in the container reduces latency by processing data locally.

Why this answer

To minimize latency, it's best to include the preprocessing logic inside the inference container that serves the model. This avoids additional network calls to separate preprocessing services.
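
As a rough sketch of what in-container preprocessing might look like, the example below follows the `input_fn` / `predict_fn` handler convention used by several SageMaker framework containers; the exact contract depends on the container, so treat the signatures as an assumption. The vocabulary, the whitespace tokenizer, and the stand-in model are hypothetical.

```python
import json

# Hypothetical vocabulary; a real script would load it from the model artifacts.
VOCAB = {"<unk>": 0, "great": 1, "product": 2, "terrible": 3}


def tokenize(text: str) -> list:
    """Lowercase, split on whitespace, and map tokens to integer ids."""
    return [VOCAB.get(token, VOCAB["<unk>"]) for token in text.lower().split()]


def input_fn(request_body: str, content_type: str = "application/json") -> list:
    """Deserialize the request and tokenize it inside the serving process."""
    if content_type != "application/json":
        raise ValueError(f"Unsupported content type: {content_type}")
    payload = json.loads(request_body)
    return tokenize(payload["text"])


def predict_fn(token_ids: list, model):
    """Run the already-loaded model on the preprocessed input."""
    return model(token_ids)


def dummy_model(token_ids: list) -> dict:
    """Stand-in for a real model, used only for this local smoke test."""
    return {"num_tokens": len(token_ids)}


token_ids = input_fn(json.dumps({"text": "Great product"}))
result = predict_fn(token_ids, dummy_model)
print(token_ids, result)
```

Because tokenization happens in the same process as the model call, no extra network hop is added between preprocessing and inference.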

156
MCQ · medium

A data scientist is using SageMaker to train a deep learning model for image classification. The training job is taking too long. Which approach can reduce training time?

A. Use SageMaker's distributed data parallelism
B. Use SageMaker Neo to compile the model
C. Increase the number of epochs
D. Use a smaller image size
Answer: A

Distributed training speeds up training by parallelizing across GPUs.

Why this answer

SageMaker's distributed data parallelism splits the training data across multiple GPUs or instances, allowing each worker to process a different subset of the data simultaneously. This reduces the wall-clock time per epoch by parallelizing the computation, which directly addresses the 'taking too long' issue for deep learning image classification models.

Exam trap

AWS often tests the distinction between training acceleration (distributed data parallelism) and inference optimization (Neo), leading candidates to mistakenly choose Neo for training speed improvements.

How to eliminate wrong answers

Option B is wrong because SageMaker Neo compiles trained models for optimized inference on target hardware, not for speeding up training. Option C is wrong because increasing the number of epochs increases training time, the opposite of what is needed. Option D is wrong because using a smaller image size reduces model accuracy and may not significantly reduce training time if the model architecture and batch size remain unchanged; it is a data preprocessing choice, not a training acceleration technique.

157
MCQ · hard

A company's ML pipeline uses AWS Step Functions to orchestrate data preprocessing, training, and evaluation. The training step occasionally fails due to a transient error. What is the most robust way to handle this without manual intervention?

A. Implement a retry policy with exponential backoff on the training step in the state machine
B. Configure a CloudWatch alarm to notify the team when the step fails
C. Use a parallel state to run multiple training instances simultaneously
D. Use a custom Lambda function to catch the error and restart the training step
Answer: A

Step Functions supports retry policies for transient errors.

Why this answer

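Retry with exponential backoff handles transient errors without manual intervention. Option B is wrong because an alarm only alerts the team and does not fix the failure. Option C is wrong because running multiple training instances in parallel wastes resources and does not retry a failed step. Option D is wrong because a separate Lambda function adds complexity, and a restart wastes resources that the built-in retry policy avoids.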
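
For illustration, a retry policy on a task state might look like the sketch below, written as a Python dictionary that serializes to Amazon States Language JSON. The `Retry` fields (`ErrorEquals`, `IntervalSeconds`, `MaxAttempts`, `BackoffRate`) are standard, but the specific values, the `Evaluate` state name, and the omission of the task's `Parameters` are assumptions made to keep the example short.

```python
import json

# Sketch of the training state only; a real state would also carry the
# training job parameters. "Evaluate" is a hypothetical next state.
training_state = {
    "Type": "Task",
    "Resource": "arn:aws:states:::sagemaker:createTrainingJob.sync",
    "Retry": [
        {
            "ErrorEquals": ["States.TaskFailed"],
            "IntervalSeconds": 30,  # wait before the first retry
            "MaxAttempts": 3,       # number of retries
            "BackoffRate": 2.0,     # multiplier applied after each retry
        }
    ],
    "Next": "Evaluate",
}


def retry_delays(retrier: dict) -> list:
    """Return the wait (seconds) before each retry under exponential backoff."""
    return [
        retrier["IntervalSeconds"] * retrier["BackoffRate"] ** attempt
        for attempt in range(retrier["MaxAttempts"])
    ]


delays = retry_delays(training_state["Retry"][0])
print(json.dumps(training_state, indent=2))
print("retry delays (s):", delays)
```

With these values the step is retried after 30, 60, and 120 seconds before the execution is allowed to fail.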

158
MCQ · medium

A data engineering team is building a real-time clickstream analytics pipeline on AWS. They need to ingest millions of events per second from mobile apps and websites, process them with low latency, and store the results in Amazon S3 for downstream analysis. Which combination of AWS services should the team use to minimize operational overhead while meeting these requirements?

A. Use Amazon MQ to ingest streaming data, AWS Lambda to process each message, and save output to Amazon S3.
B. Use Amazon Kinesis Data Streams to ingest data, Amazon EMR to process with Spark Streaming, and save output to Amazon S3.
C. Use Amazon Kinesis Data Streams for ingestion, Amazon Kinesis Data Analytics for real-time processing, and Amazon Kinesis Data Firehose to deliver results to Amazon S3.
D. Use AWS Glue to ingest data into Amazon RDS, then use AWS Glue ETL jobs to transform and load into Amazon S3.
Answer: C

This combination provides serverless, low-latency ingestion, processing, and delivery with minimal operational overhead.

Why this answer

Option C is correct because Amazon Kinesis Data Streams can handle high-throughput ingestion, Kinesis Data Analytics processes streaming data with low latency, and Kinesis Data Firehose delivers processed data to S3 with minimal overhead. Option A is wrong because Amazon MQ is a message broker for standard messaging protocols, not optimized for real-time analytics. Option B is wrong because Amazon EMR is a managed Hadoop cluster that requires more operational overhead. Option D is wrong because AWS Glue is a batch ETL service, not suitable for real-time processing.

159
MCQ · hard

A data scientist uses SageMaker Autopilot to automatically build a binary classification model. The dataset has 50 features and 100,000 rows. After the experiment, Autopilot provides multiple candidate models. Which candidate should the data scientist select to minimize inference latency for real-time predictions?

A. The model with the smallest memory footprint
B. The model with the highest validation accuracy
C. The model with the lowest validation loss
D. The model that is a linear learner
Answer: D

Linear models are fast for inference due to simple computations.

Why this answer

SageMaker Autopilot explores various algorithms including linear models, tree-based ensembles, and neural networks. For real-time inference with low latency, simpler models like linear or logistic regression or shallow decision trees are preferred. XGBoost with many trees or deep neural networks increase latency.

160
MCQ · hard

A team has deployed a SageMaker endpoint for a sentiment analysis model. The model was trained on text data from social media. After deployment, the team notices that the model's accuracy has dropped significantly after 3 months. Which action should the team take to detect and address this issue?

A. Use SageMaker A/B testing to compare with a new model.
B. Enable SageMaker Model Monitor to detect data drift and trigger a retraining pipeline.
C. Re-deploy the model using the same training script.
D. Create a CloudWatch alarm on invocation errors.
Answer: B

Model Monitor can detect drift and trigger automated retraining.

Why this answer

Setting up SageMaker Model Monitor to detect drift and triggering a retraining pipeline (Option B) automates detection and correction. Option A (A/B testing) helps compare models but does not detect drift automatically. Option C (re-deploying with the same training script) does not address the root cause. Option D (a CloudWatch alarm on invocation errors) only monitors errors, not accuracy.

161
Multi-Select · easy

A data engineer needs to collect and analyze log data from multiple EC2 instances in real-time. The solution should be serverless and scalable. Which TWO AWS services should be used?

Select 2 answers
A. Amazon Kinesis Data Firehose
B. Amazon EMR
C. Amazon Athena
D. Amazon OpenSearch Service
E. Amazon S3
Answers: A, D

Firehose can ingest streaming data.

Why this answer

Amazon Kinesis Data Firehose is the correct choice because it is a fully managed, serverless service that can capture, transform, and load streaming log data from EC2 instances into destinations like Amazon S3 or Amazon OpenSearch Service in near real-time, with no infrastructure to manage. It automatically scales to handle high-throughput data streams, making it ideal for real-time log analytics.

Exam trap

The trap here is that candidates often choose Amazon S3 alone for storage, forgetting that a real-time ingestion layer like Kinesis Data Firehose is required to collect and stream the data from EC2 instances into a queryable destination.

162
MCQ · hard

A data scientist is using Amazon SageMaker Studio notebooks for EDA. They want to share a reproducible report that includes code, visualizations, and narrative text with their team. Which approach should they use?

A. Save the notebook as an .ipynb file and share it via Amazon S3.
B. Use Amazon SageMaker Clarify to generate an EDA report.
C. Export the results to Amazon QuickSight and create a dashboard.
D. Use Amazon SageMaker Autopilot to generate a report.
Answer: A

Notebooks combine code, output, and narrative.

Why this answer

Option A is correct because a Jupyter notebook (in Studio) contains code, output, and markdown, and can be shared as a file. Option B (SageMaker Clarify) is for bias detection; Option C (QuickSight dashboard) is for interactive dashboards; Option D (SageMaker Autopilot) is for automated ML.

163
MCQ · easy

A data scientist needs to perform exploratory data analysis on a 100 GB CSV file stored in Amazon S3. The data is not sensitive. The scientist wants to use SQL queries to filter and aggregate the data without setting up a server or moving the data. Which service should be used?

A. AWS Glue
B. Amazon EMR
C. Amazon Athena
D. Amazon Redshift Spectrum
Answer: C

Athena is serverless and allows SQL queries on S3 data.

Why this answer

Option C is correct because Amazon Athena is a serverless query service that allows SQL queries directly on data in S3. Option A is wrong because Glue is for ETL, not ad-hoc querying. Option B is wrong because EMR requires a cluster. Option D is wrong because Redshift Spectrum requires a Redshift cluster.

164
MCQ · hard

A team is training a deep learning model using TensorFlow on a single GPU instance in SageMaker. The GPU utilization is below 30%. Which change will MOST improve GPU utilization?

A. Reduce the number of epochs
B. Increase the batch size
C. Use SageMaker Distributed Training with multiple GPUs
D. Switch to a CPU instance
Answer: B

Larger batches keep the GPU busy with more data per iteration.

Why this answer

Increasing the batch size makes more efficient use of GPU memory and parallel processing, improving utilization.

165
MCQ · medium

A data scientist is using Amazon SageMaker Autopilot to automatically build a binary classification model. The dataset has 50 features and 100,000 rows. After the experiment completes, the best candidate model achieves an F1 score of 0.85 on the validation set. However, when deployed to a real-time endpoint, the model's F1 score drops to 0.72 on production data. The data distributions between training and production are similar. What is the MOST likely cause of the performance drop?

A. Concept drift occurred between training and production.
B. The production data contains missing values that were not present in training.
C. The inference endpoint uses a different instance type than training.
D. The Autopilot pipeline used features that are not available at inference time (data leakage).
Answer: D

If Autopilot used future information or features derived from the target, the validation score would be inflated.

Why this answer

Option D is correct. Data leakage during Autopilot's feature engineering can lead to overly optimistic validation scores. Option A is wrong because similar distributions suggest no drift. Option B is wrong because Autopilot handles missing values. Option C is wrong because the inference instance type typically does not affect accuracy.

166
MCQ · medium

A company is building a binary classifier to predict equipment failure. The dataset has 99% negative (no failure) and 1% positive (failure) examples. The data scientist uses a random forest model with default settings. The model achieves 99% accuracy on the test set but fails to identify any actual failures. Which metric should the data scientist use to evaluate the model?

A. RMSE
B. R-squared
C. Recall
D. Precision
Answer: C

Recall measures the proportion of actual positives correctly identified, which is critical for imbalanced data.

Why this answer

Recall (sensitivity) measures the proportion of actual positive cases correctly identified. With 99% negative examples, a model can achieve 99% accuracy by simply predicting 'no failure' for all instances, but this yields 0% recall for the failure class. Since the goal is to detect rare failures, recall is the appropriate metric to evaluate the model's ability to find positive cases.

Exam trap

The trap here is that candidates see 99% accuracy and assume the model is performing well, failing to recognize that accuracy is a poor metric for imbalanced datasets, and they overlook recall as the metric that reveals the model's inability to detect the minority class.

How to eliminate wrong answers

Option A is wrong because RMSE (Root Mean Squared Error) is a regression metric used for continuous target variables, not for binary classification problems. Option B is wrong because R-squared measures the proportion of variance explained in a regression model, which is meaningless for evaluating a binary classifier's ability to detect failures. Option D is wrong because precision measures the proportion of predicted positives that are actually positive; while useful, it does not capture the model's failure to identify any actual failures (the model has 0% recall, but precision would be undefined or 0/0 if no positives are predicted).

167
MCQ · hard

An organization stores sensitive customer data in S3. A data pipeline uses AWS Glue to transform the data and load it into Amazon Redshift. The security team requires that data be encrypted at rest in S3 and in transit between S3 and Glue, and between Glue and Redshift. Which configuration meets these requirements?

A. Use S3 client-side encryption, and use VPC Peering between Glue and Redshift.
B. Use S3 default encryption with SSE-KMS, and use Network Load Balancer for Redshift.
C. Enable S3 server-side encryption with SSE-S3, and use SSL for both Glue connections.
D. Enable S3 default encryption with SSE-KMS, use a VPC endpoint for S3, and configure Glue to use SSL for Redshift connection.
Answer: D

SSE-KMS encrypts at rest, VPC endpoint uses AWS network, SSL encrypts connection to Redshift.

Why this answer

Option D is correct because it ensures encryption at rest in S3 via SSE-KMS, encrypts data in transit between S3 and Glue by using a VPC endpoint (which enforces HTTPS/TLS), and encrypts data in transit between Glue and Redshift by configuring SSL for the Redshift connection. SSE-KMS provides envelope encryption with a customer-managed key, while the VPC endpoint and SSL satisfy the in-transit encryption requirements.

Exam trap

The trap here is that candidates often assume VPC Peering alone provides encryption in transit, but it only provides network isolation without encryption, and they may overlook that SSL must be explicitly configured for the Glue-to-Redshift connection.

How to eliminate wrong answers

Option A is wrong because client-side encryption does not guarantee server-side encryption at rest in S3 (the security team requires encryption at rest in S3, which is typically satisfied by server-side encryption), and VPC Peering alone does not enforce encryption in transit between Glue and Redshift (it only provides network connectivity, not TLS/SSL). Option B is wrong because a Network Load Balancer (NLB) for Redshift does not inherently encrypt traffic between Glue and Redshift; NLB operates at Layer 4 and does not terminate TLS unless explicitly configured with a TLS listener, which is not mentioned. Option C is wrong because SSE-S3 encrypts data at rest but does not provide encryption in transit between S3 and Glue (SSL must be explicitly enabled for the Glue connection to S3, and the option does not specify SSL for the S3-to-Glue leg).

168
MCQ · hard

A company uses SageMaker to train a model each night. The training data is stored in an S3 bucket with SSE-S3 encryption. The training job fails with an access denied error. Which configuration is needed?

A. Configure the training job to run in a VPC with S3 VPC Endpoint
B. Create an IAM role with S3 read access and assign it to the SageMaker training job
C. Enable SSE-KMS on the S3 bucket
D. Add a bucket policy allowing s3:GetObject for all principals
Answer: B

SageMaker needs a role with permissions to read S3 data.

Why this answer

Option B is correct because SageMaker needs an IAM role with S3 permissions and the role must be passed to the training job. Option A is wrong because running in a VPC is not related to the access denied error or to encryption. Option C is wrong because KMS is not used with SSE-S3. Option D is wrong because a bucket policy alone is insufficient without the SageMaker role, and allowing all principals would expose the data.

169
Multi-Select · hard

A company is using Amazon SageMaker to deploy a model for real-time inference. The model takes 200 ms to respond, but the requirement is 100 ms. Which THREE actions could reduce latency? (Choose THREE.)

Select 3 answers
A. Use a larger instance with more compute capacity
B. Prune the model to remove unnecessary weights
C. Switch to a CPU-based instance
D. Use SageMaker Neo to compile the model for the target instance
E. Increase the batch size for inference
Answers: A, B, D

More powerful instances reduce inference time.

Why this answer

Using a more powerful instance reduces compute time. Model pruning reduces model size and computation. SageMaker Neo optimizes models for target hardware. Option C is wrong because CPU instances are typically slower than GPU for deep learning. Option E is wrong because increasing batch size increases latency.

170
MCQ · medium

A company is training a deep learning model on a large dataset using Amazon SageMaker. The training script uses TensorFlow and requires GPUs. The training job is failing with an out-of-memory error. Which configuration change should be made to resolve this issue?

A. Use a larger instance type with more GPU memory.
B. Increase the number of instances in the training job.
C. Switch to using spot instances to reduce cost.
D. Enable distributed training across multiple instances.
Answer: A

Larger instance types have more GPU memory, resolving the OOM error.

Why this answer

The training job is failing with an out-of-memory error, which indicates that the model or batch size exceeds the GPU memory capacity of the current instance. Using a larger instance type with more GPU memory directly addresses this by providing additional VRAM, allowing the model to fit in memory and the training to proceed without failure.

Exam trap

The trap here is that candidates confuse horizontal scaling (adding instances) with vertical scaling (increasing instance size), assuming that more instances will magically fix a per-GPU memory limit, when in fact distributed training requires the model to fit on each GPU unless model parallelism is explicitly implemented.

How to eliminate wrong answers

Option B is wrong because increasing the number of instances does not increase the GPU memory available to a single training process; it only distributes the workload across multiple machines, which does not resolve a local out-of-memory error on each GPU. Option C is wrong because switching to spot instances reduces cost but does not change the instance's hardware specifications, so the GPU memory remains the same and the out-of-memory error persists. Option D is wrong because enabling distributed training across multiple instances partitions the data or model across GPUs, but if the model itself does not fit on a single GPU, you would need model parallelism or a larger instance; simply distributing the workload does not increase per-GPU memory and may still result in out-of-memory errors on each GPU.

171
MCQ · medium

A data scientist needs to choose an algorithm for a regression problem with 50 features and 1 million training examples. The model must be interpretable and the training data fits in memory. Which algorithm is most appropriate?

A. Principal Component Analysis (PCA)
B. Linear regression
C. XGBoost
D. k-Nearest Neighbors
Answer: B

Linear regression is interpretable and efficient for large datasets.

Why this answer

Option B is correct because linear regression is interpretable, scales well to large datasets, and is suitable for regression. Option A is wrong because PCA is dimensionality reduction, not a regression algorithm. Option C is wrong because XGBoost is less interpretable. Option D is wrong because k-NN is computationally expensive at inference and not interpretable.

172
MCQ · easy

A company is using Amazon DynamoDB to store sensor data. The data is exported to Amazon S3 using DynamoDB Streams and AWS Lambda for long-term archival. Recently, the Lambda function has been failing due to 'ProvisionedThroughputExceededException' on the DynamoDB stream. What is the most likely cause?

A. The Lambda function is processing records too slowly, causing the stream to throttle.
B. The DynamoDB stream is disabled.
C. The DynamoDB table's write capacity is too low.
D. The Lambda function does not have enough memory allocated.
Answer: A

Correct: Slow processing can lead to throttling; increasing batch size or concurrency can help.

Why this answer

The error indicates that the DynamoDB stream's read throughput is being exceeded. The Lambda function's event source mapping reads from the stream, and if it processes records too slowly or there are too many shards, the reads can be throttled. Increasing the batch size reduces the number of reads per second. Option B is wrong because a disabled stream would stop the export entirely rather than throttle it. Option C is wrong because the table's write capacity does not affect stream reads. Option D is wrong because adding Lambda memory may not help with read throttling.

173
MCQ · easy

A data scientist loads a large dataset from Amazon S3 into a pandas DataFrame using a SageMaker notebook. The dataset contains a mix of numeric and categorical features. The data scientist wants to quickly check for missing values. Which pandas function is most appropriate?

A. df.info()
B. df.describe()
C. df.shape
D. df.isnull().sum()
Answer: D

This returns the sum of null values per column.

Why this answer

Option D is correct because df.isnull().sum() returns the count of missing values per column. Option A is wrong because df.info() provides column data types and non-null counts, but not missing value counts directly. Option B is wrong because df.describe() only summarizes numeric columns by default. Option C is wrong because df.shape returns the dimensions, not missing values.
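
A quick illustration on a toy DataFrame shows what the per-column count looks like next to `df.shape`. The column names and values are made up for the example.

```python
import numpy as np
import pandas as pd

# Toy frame mixing numeric and categorical columns, with a few gaps.
df = pd.DataFrame(
    {
        "amount": [10.5, np.nan, 7.2, 3.3],
        "category": ["a", None, "b", "a"],
        "age": [34, 51, np.nan, np.nan],
    }
)

missing = df.isnull().sum()  # one count of missing values per column
print(missing)
print(df.shape)  # dimensions only; says nothing about missing values
```

The result is a Series indexed by column name, so it can be sorted or filtered to find the columns with the most gaps.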

174
Multi-Select · easy

A data scientist is exploring a dataset with categorical variables. Which TWO EDA techniques are appropriate for understanding the relationship between a categorical feature and a continuous target? (Choose TWO.)

Select 2 answers
A. Correlation matrix
B. Violin plots
C. Scatter plot with categorical variable on x-axis
D. Bar chart of category counts
E. Side-by-side box plots
Answers: B, E

Violin plots show density and distribution across categories.

Why this answer

Options B and E are correct. Box plots show the distribution of a continuous variable across categories. Violin plots combine a box plot with a density estimate. Option A is wrong because a correlation matrix is for numerical features. Option C is wrong because a scatter plot is for two continuous variables. Option D is wrong because a bar chart of counts shows frequency, not the relationship with the target.

175
MCQ · hard

A data scientist is building a fraud detection model using a dataset of 500,000 credit card transactions. The dataset contains 20 features, including transaction amount, merchant category, time since last transaction, and customer age. The target variable 'is_fraud' has 0.1% positive examples. Initial EDA reveals that the transaction amount distribution is highly skewed with a long tail. Also, there are missing values in the 'customer_age' field (5% missing). The data scientist needs to prepare the data for training a binary classifier. Which combination of preprocessing steps should the data scientist apply to address these issues and improve model performance? (Select TWO.)

A. Use SMOTE to generate synthetic samples of the minority class.
B. Apply standard scaling to all numerical features.
C. Apply log transformation to the transaction amount to reduce skewness.
D. Impute missing values in customer_age with the mean of the non-missing values.
E. Drop the transaction amount feature because of its skewness.
Answers: C, D

Log transformation is effective for reducing right skewness and can make the distribution more Gaussian-like, which benefits many models.

Why this answer

Option C is correct because applying a log transformation to the highly skewed transaction amount reduces skewness and compresses the dynamic range, which helps many machine learning algorithms (especially those sensitive to feature scales like logistic regression or SVM) converge faster and perform better. This is a standard technique for handling right-skewed distributions without losing data.

Exam trap

The trap here is that candidates often confuse handling skewness with scaling—they may choose standard scaling (Option B) thinking it addresses skewness, but standard scaling only centers and scales the data, not corrects the shape of the distribution.

How to eliminate wrong answers

Option A is wrong because SMOTE (Synthetic Minority Oversampling Technique) generates synthetic samples for the minority class, but with only 0.1% fraud cases (500 out of 500,000), SMOTE would create an extremely large synthetic dataset that risks overfitting and does not address the skewed transaction amount or missing values. Option B is wrong because standard scaling (z-score normalization) is not appropriate for highly skewed features like transaction amount; scaling after log transformation would be valid, but applying standard scaling directly to a skewed distribution does not reduce skewness and can still leave the feature non-Gaussian, harming model performance. Option E is wrong because dropping the transaction amount feature due to skewness discards valuable predictive information; skewness can be corrected via transformation (e.g., log) rather than deletion, which would reduce model accuracy.
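
The two selected steps can be sketched on synthetic data. The example below is a minimal illustration, not the question's dataset: the amounts are drawn from a log-normal distribution to mimic a long right tail, about 5% of ages are blanked out at random, and the column name `transaction_amount` is an assumed name for the amount feature.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 1_000

# Synthetic stand-in for the transactions table.
df = pd.DataFrame(
    {
        "transaction_amount": rng.lognormal(mean=3.0, sigma=1.5, size=n),
        "customer_age": rng.integers(18, 80, size=n).astype(float),
    }
)
df.loc[rng.random(n) < 0.05, "customer_age"] = np.nan  # roughly 5% missing

n_missing = int(df["customer_age"].isnull().sum())

# Option C: log transform (log1p also handles zero amounts safely).
df["log_amount"] = np.log1p(df["transaction_amount"])

# Option D: fill gaps with the mean of the non-missing ages.
age_mean = df["customer_age"].mean()
df["customer_age"] = df["customer_age"].fillna(age_mean)

raw_skew = df["transaction_amount"].skew()
log_skew = df["log_amount"].skew()
print(f"missing ages filled: {n_missing}")
print(f"skew before: {raw_skew:.2f}, after log1p: {log_skew:.2f}")
```

In a real pipeline the imputation mean would be computed on the training split only and then reused for validation and production data.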

176
Multi-Select · medium

A company is deploying a machine learning model for real-time fraud detection using Amazon SageMaker. The model must have a p99 inference latency under 50ms. Which TWO actions should the ML team take to meet the latency requirement?

Select 2 answers
A.Use a multi-model endpoint to reduce cold starts.
B.Use SageMaker Neo to compile and optimize the model for the target instance type.
C.Use SageMaker Batch Transform for near-real-time inference.
D.Configure automatic scaling to add instances based on CPU utilization.
E.Select a GPU instance type such as ml.g4dn.xlarge.
AnswersB, E

Neo optimizes the model to run faster on specific hardware.

Why this answer

Using SageMaker Neo (B) optimizes the model for the target hardware, reducing latency. Using GPU instances (E) can accelerate inference for compute-intensive models. SageMaker Batch Transform (C) is for offline inference.

Automatic scaling (D) handles throughput, not latency. A multi-model endpoint (A) can help when hosting many models but does not reduce single-model latency.

177
Multi-Selecteasy

A data scientist is training a linear regression model and wants to check for multicollinearity among the features. Which TWO methods can be used to detect multicollinearity? (Choose TWO.)

Select 2 answers
A.Examine the R-squared value of the model
B.Compute the correlation matrix between features
C.Check the p-values of the coefficients
D.Calculate Variance Inflation Factor (VIF) for each feature
E.Plot the residuals vs. fitted values
AnswersB, D

High pairwise correlations between features (e.g., >0.8) suggest multicollinearity.

Why this answer

Option B is correct because computing the correlation matrix between features directly reveals pairwise linear relationships. High correlation coefficients (e.g., >0.8 or <-0.8) between two predictors indicate potential multicollinearity, which can destabilize coefficient estimates in linear regression.

Exam trap

AWS often tests the distinction between diagnosing model fit (R-squared, residual plots) and diagnosing predictor multicollinearity, leading candidates to mistakenly choose methods that evaluate model performance rather than feature interdependence.
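
Both detection methods can be sketched in a few lines. This is a minimal illustration with made-up feature values. It uses the two-predictor special case, where the R-squared from regressing one predictor on the other equals the squared Pearson correlation, so VIF = 1 / (1 - r^2). With more predictors, each VIF requires a full regression of that feature on all the others.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def vif_two_predictors(x1, x2):
    """VIF = 1 / (1 - R^2); with exactly two predictors, R^2 equals r^2."""
    r = pearson(x1, x2)
    return 1.0 / (1.0 - r ** 2)


# Illustrative predictor columns (assumed values).
x1 = [1, 2, 3, 4, 5]
x2 = [2, 4, 5, 4, 5]

r = pearson(x1, x2)
vif = vif_two_predictors(x1, x2)
print(f"correlation: {r:.3f}, VIF: {vif:.2f}")
```

A common rule of thumb treats VIF values above roughly 5 to 10 as a sign of problematic multicollinearity; the exact cutoff is a judgment call.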

178
MCQhard

A company uses Amazon Kinesis Data Streams with a shard count of 5. The data producer sends 1000 records per second, each 1 KB in size. The consumer application reads from the stream using the Kinesis Client Library (KCL) and processes records. The consumer is experiencing high latency and falling behind. What is the most effective way to improve consumer throughput?

A.Switch to Kinesis Data Analytics for processing.
B.Use enhanced fan-out to dedicate read throughput to the consumer.
C.Increase the record size to 5 KB.
D.Increase the number of shards in the stream.
AnswerD

Correct: More shards provide more read capacity and allow parallel processing.

Why this answer

Increasing the number of shards increases the read throughput and allows more consumers to read in parallel. Option D (increase the number of shards) is correct. Option C (increase record size) is irrelevant.

Option A (use Kinesis Data Analytics) adds another service. Option B (use enhanced fan-out) is for multiple consumers, not for a single consumer falling behind.
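
The capacity arithmetic behind this scenario can be sketched as follows. The per-shard figures are the published Kinesis Data Streams limits for shared-throughput consumers (1 MB/s or 1,000 records/s in, 2 MB/s out); the function name and the comparison against 10 shards are illustrative assumptions. Because the KCL runs one record processor per shard, the shard count also caps how many processors can work in parallel.

```python
# Published per-shard limits (shared-throughput consumers).
WRITE_MB_PER_SHARD = 1.0
WRITE_RECORDS_PER_SHARD = 1000
READ_MB_PER_SHARD = 2.0


def stream_capacity(shards):
    """Aggregate stream limits for a given shard count."""
    return {
        "write_mb_s": shards * WRITE_MB_PER_SHARD,
        "write_records_s": shards * WRITE_RECORDS_PER_SHARD,
        "read_mb_s": shards * READ_MB_PER_SHARD,
        # The KCL assigns one record processor per shard.
        "max_parallel_processors": shards,
    }


# Producer load from the question: 1000 records/s at 1 KB each.
producer_mb_s = 1000 * 1 / 1024  # about 0.98 MB/s

for shards in (5, 10):
    cap = stream_capacity(shards)
    print(shards, "shards:", cap, "producer MB/s:", round(producer_mb_s, 2))
```

Doubling the shard count doubles both the aggregate read limit and the number of record processors that can run concurrently.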

179
Multi-Selectmedium

Which TWO of the following are appropriate techniques for detecting outliers in a univariate continuous feature?

Select 2 answers
A.Apply a Random Forest classifier to predict outliers.
B.Use Z-score and flag values with absolute Z-score > 3.
C.Remove any value that is more than one standard deviation from the mean.
D.Use DBSCAN clustering with default parameters.
E.Use the interquartile range (IQR) and flag values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR.
AnswersB, E

Z-score >3 is a common outlier threshold.

Why this answer

The Z-score method (Option B) is a standard statistical technique for detecting outliers in a univariate continuous feature. It measures how many standard deviations a data point is from the mean, and flagging values with an absolute Z-score greater than 3 is a common threshold because, under a normal distribution, approximately 99.7% of data falls within three standard deviations, making points beyond this likely outliers.

Exam trap

AWS often tests the misconception that removing values more than one standard deviation from the mean is a valid outlier detection technique, when in fact it removes a large portion of normal data (roughly a third under a normal distribution) and is not a standard practice.
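
The Z-score and IQR rules can be sketched with the Python standard library. The data below is made up for illustration, and quartiles are computed with the default method of `statistics.quantiles`, so other quartile conventions may give slightly different fences.

```python
import statistics


def zscore_outliers(values, threshold=3.0):
    """Flag values whose absolute Z-score exceeds the threshold."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [v for v in values if abs(v - mean) / sd > threshold]


def iqr_outliers(values, k=1.5):
    """Flag values below Q1 - k*IQR or above Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Twenty ordinary readings plus one extreme value (assumed data).
data = [10, 11, 12, 13, 14] * 4 + [100]

print("Z-score outliers:", zscore_outliers(data))
print("IQR outliers:", iqr_outliers(data))
```

Note that the Z-score rule depends on the mean and standard deviation, which the outlier itself inflates; on very small samples a single extreme value may not reach a Z-score of 3 at all, which is one reason the IQR rule is often preferred for skewed data.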

180
Multi-Selecthard

A machine learning engineer is using SageMaker's built-in XGBoost algorithm for a multi-class classification problem. The training job completes but the model accuracy is low. Which THREE hyperparameters should the engineer tune to improve performance?

Select 3 answers
A.eta (learning rate)
B.num_round
C.subsample
D.max_depth
E.colsample_bytree
AnswersA, B, D

Learning rate controls contribution of each tree; tuning helps convergence.

Why this answer

The XGBoost hyperparameters 'num_round' (number of boosting rounds), 'eta' (learning rate), and 'max_depth' (tree depth) are key for improving accuracy. 'subsample' can help but is less direct, and 'colsample_bytree' controls feature subsampling. 'min_child_weight' (not listed among the options) is also important, but these three are the most commonly tuned.

181
Multi-Selectmedium

A company is using Amazon SageMaker to train a model. The training data includes sensitive personally identifiable information (PII). The company needs to ensure that the training data is protected and that the trained model does not inadvertently expose PII. Which TWO actions should the company take? (Choose TWO.)

Select 2 answers
A.Encrypt the training data in S3 using AWS KMS
B.Use server-side encryption with S3-managed keys
C.Use SageMaker's data processing to redact PII before training
D.Enable AWS CloudTrail to log all access to the data
E.Grant public read access to the training data for faster access
AnswersA, C

Encryption protects data at rest.

Why this answer

Options A and C are correct. A: Encrypt the data at rest using KMS. C: Use SageMaker's data processing to redact PII before training.

Option E (grant public access) is wrong. Option D (CloudTrail) is for auditing, not protection. Option B (server-side encryption with S3-managed keys) gives less control over keys than KMS but is still encryption; however, the question asks for TWO actions, and A is the best encryption choice, while C directly addresses PII exposure.

182
MCQmedium

A company is building a multiclass classification model using Amazon SageMaker. The dataset has 100 classes and is highly imbalanced. The model currently achieves high accuracy on the majority classes but poor performance on minority classes. Which technique should the data scientist use to improve minority class performance?

A.Apply random oversampling with replacement
B.Apply principal component analysis (PCA)
C.Use class weights to penalize misclassifications of minority classes
D.Remove samples from majority classes
AnswerC

Effectively balances the loss function.

Why this answer

Option C is correct because class weights penalize errors on minority classes more, improving their recall. Option D is wrong because removing samples from majority classes would lose data. Option A is wrong because random oversampling with replacement duplicates minority samples and may cause overfitting.

Option B is wrong because principal component analysis (PCA) is dimensionality reduction, not a technique for handling imbalance.
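
One common way to derive class weights is inverse class frequency. The sketch below uses the "balanced" heuristic, weight = n_samples / (n_classes * class_count), which is the formula scikit-learn documents for `class_weight="balanced"`. The labels are made up, and how the weights are passed to a training job depends on the algorithm, which is not shown here.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Illustrative imbalanced labels: 90 / 9 / 1 split across three classes.
labels = ["a"] * 90 + ["b"] * 9 + ["c"] * 1

weights = balanced_class_weights(labels)
for cls, weight in sorted(weights.items()):
    print(cls, round(weight, 3))
```

Rarer classes receive proportionally larger weights, so a misclassified minority sample contributes more to the loss than a misclassified majority sample.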

183
Multi-Selectmedium

Which THREE of the following are valid metrics for evaluating a regression model?

Select 3 answers
A.R-squared (R²)
B.F1 score
C.Mean Absolute Error (MAE)
D.Root Mean Squared Error (RMSE)
E.Accuracy
AnswersA, C, D

R-squared is a common regression metric.

Why this answer

R-squared (R²) is a valid regression metric that measures the proportion of variance in the dependent variable explained by the independent variables. It typically ranges from 0 to 1, with higher values indicating better fit (it can be negative on held-out data when the model fits worse than predicting the mean), and is commonly used alongside error metrics such as MAE and RMSE to assess model performance.

Exam trap

The trap here is that candidates confuse classification metrics (F1 score, Accuracy) with regression metrics, especially when the question asks for 'valid metrics' without specifying the model type, leading them to select metrics they are more familiar with from classification tasks.
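
The three regression metrics follow directly from their definitions. The sketch below is a plain-Python illustration on made-up targets and predictions.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return (MAE, RMSE, R-squared) for paired targets and predictions."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e ** 2 for e in errors) / n)
    mean_true = sum(y_true) / n
    ss_res = sum(e ** 2 for e in errors)
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1 - ss_res / ss_tot
    return mae, rmse, r2


# Illustrative values (assumed).
y_true = [3, 5, 7, 9]
y_pred = [2, 5, 8, 9]

mae, rmse, r2 = regression_metrics(y_true, y_pred)
print(f"MAE={mae:.3f} RMSE={rmse:.3f} R2={r2:.3f}")
```

MAE and RMSE are in the units of the target, while R² is unitless; RMSE penalizes large errors more heavily than MAE because the errors are squared before averaging.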

184
MCQeasy

A data engineer needs to transfer 50 TB of historical data from an on-premises Hadoop cluster to Amazon S3. The on-premises network has a 1 Gbps connection to AWS. The transfer must be completed within 10 days. What is the MOST efficient approach?

A.Use AWS Snowball Edge to physically ship the data.
B.Use Amazon S3 Transfer Acceleration to speed up the upload.
C.Use AWS DataSync over the existing network connection.
D.Set up a VPN connection and use multi-part upload directly to S3.
AnswerA

Snowball Edge provides high-speed local transfer and avoids network bottlenecks.

Why this answer

Option A is correct because AWS Snowball Edge can transfer large volumes faster than a network, especially with limited bandwidth. Option B is wrong because a 1 Gbps line would take roughly 5 days for 50 TB even at full line rate, and may be unreliable. Option D is wrong because 50 TB over VPN would be very slow.

Option C is wrong because AWS DataSync is network-based and limited by bandwidth.
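
The transfer-time estimate can be checked with a back-of-the-envelope calculation. The sketch assumes decimal terabytes (10^12 bytes) and a dedicated link; the 80% utilization figure is an assumption for illustration, not a value from the question.

```python
def transfer_days(terabytes, gbps, utilization=1.0):
    """Days needed to move `terabytes` over a `gbps` link at a given utilization."""
    bits = terabytes * 1e12 * 8
    seconds = bits / (gbps * 1e9 * utilization)
    return seconds / 86400


full_rate = transfer_days(50, 1)         # ideal, fully saturated link
realistic = transfer_days(50, 1, 0.8)    # assumed 80% sustained utilization

print(f"at 100%: {full_rate:.2f} days, at 80%: {realistic:.2f} days")
```

At full line rate the transfer needs about 4.6 days; at the assumed 80% sustained utilization it needs about 5.8 days, and any competing traffic on the link stretches that further.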

185
Multi-Selecteasy

A data scientist is performing exploratory data analysis on a dataset with missing values. Which TWO approaches are appropriate for handling missing data in a way that retains as much data as possible?

Select 2 answers
A.Use a model that handles missing values natively, such as XGBoost
B.Replace missing values with a constant like -999
C.Impute missing values with the median
D.Drop columns with high missing percentage
E.Drop rows with any missing value
AnswersA, C

Correct: Some algorithms can handle missing values without imputation.

Why this answer

Correct options: A and C. Option C (imputation) retains all rows and fills missing values. Option A (use algorithms that handle missing values) avoids deletion.

Option E is wrong because dropping rows reduces data. Option D is wrong because dropping columns loses features. Option B is wrong because replacing with a constant may bias the model.

186
MCQmedium

A data engineer needs to transform a large dataset stored in Amazon S3 using Apache Spark. The engineer wants to minimize costs and avoid managing infrastructure. Which AWS service should be used?

A.Amazon Athena
B.Amazon SageMaker
C.Amazon EMR
D.AWS Glue
AnswerD

Glue is serverless and ideal for cost-sensitive, intermittent ETL jobs.

Why this answer

AWS Glue is a serverless Spark ETL service that can process large datasets stored in S3 without requiring infrastructure management.

187
MCQhard

A data engineer uses the IAM policy above for an AWS Lambda function that processes data in S3 and triggers an AWS Glue job. The Lambda function is unable to start the Glue job. What is the most likely cause?

A.The Glue job name in the resource ARN is misspelled.
B.The policy does not allow s3:PutObject on the bucket.
C.The policy does not include iam:PassRole permission.
D.The policy does not include s3:GetObject on the bucket.
AnswerC

To start a Glue job, Lambda must pass an execution role; iam:PassRole is required.

Why this answer

Option C is correct. The policy includes 'glue:StartJobRun', scoped to the specific job, so the action itself is allowed. However, the Lambda function's execution role may not have permission to pass the IAM role to Glue (iam:PassRole).

Option D is wrong because the policy has GetObject on the bucket. Option A is wrong because the Glue job name is correct.

Option B is wrong because S3 PutObject is allowed.

188
MCQmedium

A company is deploying a machine learning model using AWS Lambda for real-time inference. The model is a large ensemble model that takes approximately 500 MB of memory. The Lambda function is configured with 1024 MB of memory and a timeout of 15 seconds. The company observes that the function frequently times out during inference. The company wants to keep using Lambda for its serverless benefits. Which solution should the company implement to reduce inference time?

A.Increase the Lambda function memory to 3008 MB to provide more CPU resources.
B.Deploy the model on Amazon SageMaker hosting instead of Lambda.
C.Use AWS Step Functions to invoke the Lambda function asynchronously.
D.Use Amazon ElastiCache to cache model predictions and reduce computation.
AnswerA

Correct: More memory allocates more CPU, reducing inference time.

Why this answer

Lambda has a maximum memory of 10,240 MB and a maximum timeout of 15 minutes. Increasing memory to 3008 MB gives more CPU power and reduces inference time. Option A is correct.

Option D (ElastiCache) adds latency and cost. Option C (Step Functions) adds orchestration overhead. Option B (SageMaker) moves away from serverless, which the company wants to keep.

189
Multi-Selecthard

A data scientist is using SageMaker to train a deep learning model. The training job runs on a single GPU instance and is taking too long. Which THREE actions can the data scientist take to reduce training time? (Choose three.)

Select 3 answers
A.Increase the size of the EBS volume attached to the instance.
B.Increase the number of instances but keep the same total data.
C.Switch to Pipe input mode to reduce I/O waiting time.
D.Use a larger GPU instance type, such as p3.16xlarge.
E.Use distributed training across multiple GPU instances.
AnswersC, D, E

Streams data directly, reducing I/O bottleneck.

Why this answer

Option C is correct because Pipe input mode streams training data directly from Amazon S3 to the GPU instance, reducing I/O waiting time compared to the default File mode, which downloads the entire dataset to the EBS volume first. This minimizes the time the GPU spends idle waiting for data, especially for large datasets that cannot fit entirely in memory.

Exam trap

The trap here is that candidates may confuse increasing instance count (Option B) with distributed training (Option E), not realizing that simply adding instances without distributed training code and configuration does not parallelize the workload and can even increase overhead.

190
Multi-Selectmedium

A data scientist is building a regression model to predict housing prices. The dataset includes numerical features such as square footage, number of bedrooms, and year built, as well as categorical features such as neighborhood and roof type. Which TWO preprocessing steps are most important to apply before training a linear regression model?

Select 2 answers
A.Apply principal component analysis (PCA) for dimensionality reduction
B.One-hot encode categorical features
C.Remove outliers using IQR
D.Add interaction terms between all features
E.Normalize or standardize numerical features
AnswersB, E

One-hot encoding converts categorical variables into numerical form suitable for linear regression.

Why this answer

Linear regression requires numerical features and is sensitive to feature scales. Encoding categorical variables as numerical is necessary, and scaling numerical features ensures that no single feature dominates the model.
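
The two preprocessing steps can be sketched without any ML library. The feature names and values below are illustrative assumptions; in practice a library encoder and scaler fitted on the training split only would normally be used.

```python
import statistics


def one_hot(values):
    """Return the sorted category list and one indicator row per input value."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows


def standardize(values):
    """Z-score scaling: zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]


# Illustrative housing features (assumed values).
neighborhoods = ["north", "south", "north"]
square_feet = [1000, 2000, 3000]

categories, encoded = one_hot(neighborhoods)
scaled = standardize(square_feet)

print(categories, encoded)
print([round(s, 3) for s in scaled])
```

For a linear model with an intercept, one indicator column per categorical feature is usually dropped to avoid perfect collinearity among the dummy columns.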

191
MCQhard

Refer to the exhibit. A data engineer runs an Athena query and gets a failure. What is the most likely cause?

A.The query result location uses a bucket with default encryption enabled.
B.The SQL query syntax is incorrect.
C.The IAM role used does not have permissions to write to S3.
D.The output S3 bucket specified in the query result configuration already exists.
AnswerD

Athena expects the output bucket to be created by the service; if it already exists, it may cause this error.

Why this answer

Option D is correct. The error message says 'Bucket my-query-results should not exist', which indicates that the output location bucket must not exist beforehand; Athena creates it if it doesn't exist. Option C is incorrect because the error does not mention permissions.

Option B is incorrect because the error is about the bucket existing, not the query syntax. Option A is incorrect because the error does not mention encryption.

192
Multi-Selecthard

Which TWO of the following are appropriate methods for handling missing data in a dataset?

Select 2 answers
A.Dropping features with more than 50% missing values
B.Mean imputation for all features
C.Multiple imputation
D.Using algorithms that handle missing values internally (e.g., XGBoost)
E.Listwise deletion (removing rows with missing values)
AnswersC, D

Multiple imputation accounts for uncertainty by creating multiple datasets.

Why this answer

Multiple imputation and using algorithms that handle missing values (e.g., XGBoost) are valid. Listwise deletion reduces sample size. Mean imputation may bias distributions.

Dropping features with many missing values may lose information.

193
MCQhard

A data scientist is building a recommendation system for an e-commerce platform. The dataset contains user interactions (clicks, purchases) and item metadata. The scientist wants to use matrix factorization. Which algorithm should be used?

A.SageMaker Image Classification
B.SageMaker BlazingText
C.SageMaker XGBoost
D.SageMaker Factorization Machines
AnswerD

Factorization Machines are designed for recommendation and matrix factorization.

Why this answer

Option D is correct because SageMaker provides the Factorization Machines algorithm for recommendation and matrix factorization. Option C is wrong because XGBoost is for gradient boosting, not matrix factorization. Option B is wrong because BlazingText is for text.

Option A is wrong because Image Classification is for images.

194
MCQmedium

A company is using Amazon SageMaker to train a deep learning model on a large dataset. The training job is taking too long. The team wants to reduce training time without changing the model architecture. Which action should they take?

A.Increase the learning rate by a factor of 10
B.Use SageMaker's distributed training with multiple instances
C.Reduce the number of epochs
D.Reduce the batch size
AnswerB

Distributed training parallelizes the workload.

Why this answer

SageMaker's distributed training with multiple instances splits the dataset and model computations across several machines, enabling parallel processing that significantly reduces wall-clock training time. This approach leverages data parallelism or model parallelism without altering the model architecture, directly addressing the need for faster training.

Exam trap

AWS often tests the misconception that simply adjusting hyperparameters like learning rate or batch size can solve performance issues, when the correct answer is to leverage distributed computing resources that SageMaker provides natively.

How to eliminate wrong answers

Option A is wrong because increasing the learning rate by a factor of 10 can cause the optimizer to overshoot minima, leading to divergence or unstable training, and does not guarantee reduced training time without risking model quality. Option C is wrong because reducing the number of epochs directly reduces the amount of training iterations, which may lower model accuracy or prevent convergence, and is not a valid method to reduce training time while preserving model performance. Option D is wrong because reducing the batch size typically increases the number of weight updates per epoch and can slow down training due to less efficient hardware utilization and increased communication overhead, especially on GPUs.

195
MCQeasy

A data scientist is training a binary classification model on a dataset with a severe class imbalance (95% negative, 5% positive). The model achieves 95% accuracy but only correctly identifies 10% of the positive class. Which metric should the data scientist use to evaluate model performance?

A.Log loss
B.F1 score
C.Accuracy
D.Area under the ROC curve (AUC)
AnswerB

F1 score balances precision and recall, making it suitable for imbalanced datasets where the minority class is important.

Why this answer

The F1 score is the harmonic mean of precision and recall, making it robust to class imbalance. With 95% accuracy but only 10% recall on the positive class, the model is essentially a trivial classifier that predicts the majority class. F1 score captures both false positives and false negatives, providing a balanced view of performance on the minority class.

Exam trap

The trap here is that candidates see high accuracy and assume the model is good, but AWS tests the understanding that accuracy is meaningless for imbalanced datasets, and that AUC can be misleadingly high even when minority class recall is poor.

How to eliminate wrong answers

Option A is wrong because log loss measures the probabilistic confidence of predictions and can be misleading when class imbalance is severe, as it is dominated by the majority class. Option C is wrong because accuracy is misleading in imbalanced datasets; a model predicting all negatives achieves 95% accuracy without learning anything about the positive class. Option D is wrong because AUC measures the model's ability to rank positive instances higher than negative ones, but it can still be high even when recall on the positive class is low, as it aggregates performance across all thresholds.
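
The gap between accuracy and F1 in this scenario can be made concrete. The counts below are hypothetical, chosen only to match the figures in the question (95% negative, 5% positive, 95% accuracy, 10% of positives identified) on an assumed test set of 1,000 samples.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return (accuracy, precision, recall, F1) from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1


# Hypothetical counts: 50 positives (5 found, 45 missed), 950 negatives.
accuracy, precision, recall, f1 = classification_metrics(tp=5, fp=5, fn=45, tn=945)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.3f}")
```

With these counts accuracy is 0.95 while F1 is only about 0.17, which is why F1 exposes the weakness on the positive class that accuracy hides.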

196
MCQeasy

A data scientist is training a neural network on a GPU instance in Amazon SageMaker. The training job fails with an 'OutOfMemoryError'. Which action should the data scientist take to resolve this issue?

A.Enable automatic hyperparameter tuning.
B.Switch to distributed training across multiple instances.
C.Use a smaller instance type with less GPU memory.
D.Reduce the batch size in the training script.
AnswerD

Smaller batch size reduces memory footprint.

Why this answer

Option D is correct because reducing the batch size reduces memory usage per iteration. Option C is wrong because using a smaller instance type would have less memory. Option A is wrong because hyperparameter tuning does not directly reduce memory.

Option B is wrong because distributed training does not by itself reduce memory usage per node.

197
MCQmedium

A machine learning team is building a real-time inference pipeline using Amazon SageMaker. The input data is located in an S3 bucket, and the team needs to transform the data before inference using a custom Python script. The transformation should run on a serverless infrastructure and must be triggered automatically when new data arrives in S3. Which combination of services should the team use?

A.Use AWS Lambda functions triggered by S3 events to run the transformation, then invoke a SageMaker endpoint.
B.Use AWS Glue jobs triggered by S3 events.
C.Use Amazon SageMaker Processing jobs triggered by S3 events.
D.Use Amazon Kinesis Data Firehose to transform data and deliver to SageMaker.
AnswerA

Lambda provides serverless compute triggered by S3 events, and can call SageMaker endpoints.

Why this answer

Option A is correct because S3 events can trigger Lambda, which runs the custom script, and the output can be sent to a SageMaker endpoint for inference. Option C is wrong because SageMaker Processing is not serverless (it runs on EC2 instances). Option B is wrong because Glue is for ETL, not real-time inference.

Option D is wrong because Kinesis Data Firehose is for streaming ingestion, not suitable for S3-triggered batch processing.
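
A minimal sketch of the S3-triggered Lambda pattern is shown below. The `transform` function, the `ENDPOINT_NAME` environment variable, and the `text/csv` content type are hypothetical placeholders. The boto3 calls used (`get_object` on the S3 client and `invoke_endpoint` on the `sagemaker-runtime` client) are standard, but error handling, IAM permissions, and payload-size limits are not covered.

```python
import json
import os
import urllib.parse


def extract_s3_objects(event):
    """Return (bucket, key) pairs from an S3 event notification payload."""
    pairs = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        pairs.append((bucket, key))
    return pairs


def transform(raw_bytes):
    """Hypothetical stand-in for the team's custom transformation."""
    return raw_bytes.decode("utf-8").strip().lower()


def lambda_handler(event, context):
    # boto3 ships with the Lambda Python runtime; it is imported here so the
    # helpers above can be tried locally without the AWS SDK installed.
    import boto3

    s3 = boto3.client("s3")
    runtime = boto3.client("sagemaker-runtime")
    endpoint_name = os.environ["ENDPOINT_NAME"]  # assumed environment variable

    predictions = []
    for bucket, key in extract_s3_objects(event):
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        response = runtime.invoke_endpoint(
            EndpointName=endpoint_name,
            ContentType="text/csv",  # assumed payload format
            Body=transform(body).encode("utf-8"),
        )
        predictions.append(response["Body"].read().decode("utf-8"))
    return {"statusCode": 200, "body": json.dumps(predictions)}
```

The helpers can be exercised locally with a hand-built event, for example `extract_s3_objects({"Records": [{"s3": {"bucket": {"name": "b"}, "object": {"key": "a%2Fb.csv"}}}]})`.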

198
MCQhard

A healthcare company is building a model to predict patient readmission within 30 days. They have structured electronic health records (EHR) data with 200 features. The data includes missing values, categorical variables with high cardinality (e.g., diagnosis codes), and a severe class imbalance (5% readmission). They need to deploy a model on SageMaker that is interpretable and achieves high recall for the positive class. Which combination of techniques should they use?

A.Use XGBoost with SMOTE, feature selection via SHAP, and deploy as a SageMaker endpoint
B.Use logistic regression with one-hot encoding and random undersampling
C.Use PCA for dimensionality reduction, then train a linear SVM with class weights
D.Use a deep neural network with embeddings for categorical variables and oversample the minority class
AnswerA

XGBoost handles missing values, SMOTE addresses imbalance, SHAP provides interpretability.

Why this answer

XGBoost with SMOTE and SHAP balances interpretability and performance.

199
Multi-Selecthard

A machine learning engineer is analyzing a dataset with a large number of features (p >> n). The engineer suspects that many features are irrelevant. Which THREE methods are suitable for feature selection during exploratory data analysis? (Choose THREE.)

Select 3 answers
A.Fit a Lasso regression model and select features with non-zero coefficients
B.Remove features with variance below a threshold (e.g., <0.01)
C.Remove features with high pairwise correlation (e.g., >0.95)
D.Calculate mutual information between each feature and the target, and keep top k features
E.Apply Principal Component Analysis (PCA) and select top components
AnswersB, C, D

Low-variance features provide little information and can be removed.

Why this answer

Option C is correct because correlation-based feature selection removes highly correlated features. Option D is correct because mutual information measures relevance to the target. Option B is correct because a variance threshold removes low-variance features.

Option E is wrong because PCA is a dimensionality reduction technique, not a feature selection method. Option A is wrong because Lasso regression is a modeling technique, not typically used in EDA.
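
Two of the filter methods, the variance threshold and the pairwise-correlation filter, can be sketched in plain Python. The thresholds mirror the examples in the options, the feature columns are made up, and the greedy keep-first rule for correlated pairs is one simple assumption among several possible tie-breaking choices. Mutual information is not shown.

```python
import math
import statistics


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def select_features(features, var_threshold=0.01, corr_threshold=0.95):
    """Drop low-variance columns, then drop columns correlated with a kept one."""
    # Step 1: remove near-constant columns (this also avoids dividing by zero).
    kept = {name: col for name, col in features.items()
            if statistics.pvariance(col) >= var_threshold}
    # Step 2: greedily keep a column only if it is not highly correlated
    # with any column that has already been selected.
    selected = []
    for name, col in kept.items():
        if all(abs(pearson(col, kept[other])) <= corr_threshold
               for other in selected):
            selected.append(name)
    return selected


# Illustrative columns: "b" duplicates "a" up to scale, "c" is constant.
features = {
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10],
    "c": [1, 1, 1, 1, 1],
    "d": [5, 1, 4, 2, 3],
}

print(select_features(features))
```

Both filters keep original columns rather than building new ones, which is the distinction from PCA drawn in the explanation.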

200
Multi-Selectmedium

A data scientist is using Amazon SageMaker to train a model. The training job uses a custom Docker image stored in Amazon ECR. The training job fails with an error 'CannotPullContainerError'. Which TWO actions should the data scientist take to resolve this issue? (Choose TWO.)

Select 2 answers
A.Use a public Docker image instead of a custom one
B.Confirm that the image tag exists in the ECR repository
C.Verify that the IAM role used for training has permissions to pull from ECR
D.Increase the training job timeout
E.Ensure the training instance has internet access
AnswersB, C

A missing tag causes CannotPullContainerError.

Why this answer

Options B and C are correct because a missing image tag and missing ECR pull permissions on the training role are common causes of this error. Option D is unrelated to pulling the image. Option E might be needed in some network setups but does not directly address the error.

Option A is irrelevant to the error.

201
MCQeasy

A machine learning engineer needs to deploy a model that requires low latency (under 10 ms) for real-time inference. The model is a small ensemble of decision trees. Which Amazon SageMaker endpoint configuration is MOST appropriate?

A.Batch transform
B.Training job
C.Real-time endpoint
D.Multi-model endpoint
AnswerC

Real-time endpoints provide low latency.

Why this answer

Real-time endpoints in Amazon SageMaker are designed for low-latency inference (typically under 10 ms) and are the correct choice for deploying a small ensemble of decision trees that needs to respond to individual prediction requests in real time. They keep the model loaded and ready, providing a persistent HTTPS endpoint that can serve predictions with minimal overhead.

Exam trap

The trap here is that candidates often confuse Multi-model endpoints with real-time endpoints, assuming they offer the same low-latency guarantees, but Multi-model endpoints trade off latency for cost efficiency by loading models on demand, which can introduce delays that violate strict latency requirements.

How to eliminate wrong answers

Option A is wrong because Batch Transform is an asynchronous, offline inference service that processes large datasets in batches and does not provide a real-time endpoint; it can take minutes to hours to complete and is unsuitable for sub-10 ms latency. Option B is wrong because a Training job is used to train a model, not to host it for inference; it runs a training algorithm on input data and produces model artifacts, but does not expose an endpoint for serving predictions. Option D is wrong because a Multi-model endpoint is designed to host multiple models on a single endpoint to reduce costs, but it introduces additional latency due to model loading and unloading on demand, making it less suitable for the strict under-10 ms requirement compared to a dedicated real-time endpoint.

202
MCQhard

A data scientist is training a deep learning model on Amazon SageMaker using a custom TensorFlow container. The training job fails with an OutOfMemory error. The instance type is ml.p3.2xlarge with 16 GB GPU memory and 61 GB system memory. The model uses mixed precision training. Which step should the data scientist take to resolve the issue without changing the instance type?

A.Reduce the batch size
B.Use gradient accumulation to simulate a larger batch size
C.Use model parallelism across multiple GPUs
D.Enable automatic mixed precision (AMP)
E.Increase the instance type to ml.p3.8xlarge
AnswerA

Smaller batch size reduces memory usage.

Why this answer

Option A is correct because reducing the batch size lowers memory usage during training. Option E (increase instance type) increases cost, may not be necessary, and changes the instance type, which the question rules out. Option B (use gradient accumulation) simulates larger batch sizes without increasing memory footprint, but does not reduce memory usage per step.

Option D (enable automatic mixed precision) is already in use. Option C (use model parallelism) is complex and may not be applicable for a single model that fits in memory once the batch size is reduced.

203
MCQeasy

A data scientist is using Amazon SageMaker to train a model, but the training job fails with an 'Out of memory' error. The instance type is ml.p3.2xlarge. Which action should the data scientist take to resolve the issue?

A.Use Pipe input mode.
B.Increase the number of instances.
C.Reduce the mini-batch size in the training script.
D.Use a Spot instance.
AnswerC

Reducing batch size reduces memory consumption.

Why this answer

The 'Out of memory' error on a single ml.p3.2xlarge instance indicates that the GPU memory is insufficient for the current workload. Reducing the mini-batch size directly decreases the memory footprint per training step, allowing the model to fit within the available GPU memory without changing the instance type or incurring additional costs.

Exam trap

The trap here is that candidates confuse storage-related issues (disk space, data loading) with compute memory (GPU RAM), leading them to select Pipe input mode or Spot instances, which do not address the fundamental memory constraint.

How to eliminate wrong answers

Option A is wrong because Pipe input mode streams data directly from Amazon S3 without downloading it to the local disk, which reduces disk storage requirements but does not affect GPU memory consumption during training. Option B is wrong because increasing the number of instances distributes the workload across multiple machines but does not reduce the per-instance memory usage; each instance still faces the same GPU memory constraint. Option D is wrong because using a Spot instance provides cost savings but does not change the hardware specifications or memory capacity of the ml.p3.2xlarge instance, so the out-of-memory error would persist.

204
MCQeasy

A company is using Amazon SageMaker to deploy a machine learning model that predicts equipment failure. The model is a binary classifier that outputs a probability. The company wants to set a threshold such that the model correctly identifies 95% of actual failures (recall >= 0.95). The model's precision at the current threshold of 0.5 is 0.7. The data scientist evaluates the model on a test set and obtains the following confusion matrix at threshold 0.5: TP=95, FN=5, FP=40, TN=860. The total actual positives are 100. Which threshold adjustment should the data scientist make to achieve the recall goal?

A.Decrease the threshold to 0.1
B.Increase the threshold to 0.7
C.Keep the threshold at 0.5
D.Decrease the threshold to 0.3
AnswerC

Recall is already 95%, meeting the requirement.

Why this answer

Option C is correct. Recall is TP / (TP + FN) = 95 / (95 + 5) = 0.95, so the model already meets the recall >= 0.95 requirement at the current threshold of 0.5 and no adjustment is needed.

Decreasing the threshold to 0.3 or 0.1 (Options D and A) classifies more instances as positive, which can raise recall further but typically adds false positives and lowers precision below the current 0.7 without being required. Increasing the threshold to 0.7 (Option B) classifies fewer instances as positive, which increases false negatives and risks dropping recall below 0.95.
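The recall and precision figures in this question can be checked directly from the confusion matrix. The snippet below is a minimal sketch; the function and variable names are illustrative and not part of the original question.

```python
def recall(tp: int, fn: int) -> float:
    """Fraction of actual positives that the model identifies."""
    return tp / (tp + fn)


def precision(tp: int, fp: int) -> float:
    """Fraction of predicted positives that are actually positive."""
    return tp / (tp + fp)


# Confusion matrix from the question stem at threshold 0.5.
tp, fn, fp, tn = 95, 5, 40, 860

current_recall = recall(tp, fn)
current_precision = precision(tp, fp)

# The requirement is recall >= 0.95.
meets_goal = current_recall >= 0.95

print(f"recall={current_recall:.2f} precision={current_precision:.2f} meets_goal={meets_goal}")
```

Because the recall goal is already met, any threshold change would only trade precision against recall that is not needed.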

205
MCQhard

A company runs a real-time analytics platform that ingests IoT sensor data from millions of devices. The data is sent to Amazon Kinesis Data Streams with 16 shards. A custom Java application using the Kinesis Client Library (KCL) processes the data and writes aggregated results to Amazon DynamoDB. The application runs on a fleet of EC2 instances in an Auto Scaling group. Recently, the team noticed that some records are being processed multiple times, resulting in duplicate entries in DynamoDB. The application uses the DynamoDB PutItem API to write records. The team needs to eliminate duplicates without significantly increasing latency. Which solution should the team implement?

A.Enable DynamoDB auto scaling to increase write capacity and reduce throttling, which causes retries and duplicates.
B.Use DynamoDB TransactWriteItems with a condition check that the record's Kinesis sequence number does not already exist in the table.
C.Place an Amazon SQS FIFO queue between the KCL application and DynamoDB to deduplicate messages.
D.Modify the application to use DynamoDB BatchWriteItem instead of PutItem to reduce the number of write requests.
AnswerB

This ensures exactly-once semantics by atomically checking and writing.

Why this answer

Option B is correct because a DynamoDB transaction with a condition check on the Kinesis sequence number writes each record only if that sequence number is not already in the table, so a reprocessed record is rejected instead of creating a duplicate. Option A is wrong because the KCL delivers records at least once, so a record can be reprocessed (for example, after a worker failure or restart) regardless of write capacity; auto scaling does not make writes idempotent. Option C is wrong because adding a FIFO queue adds latency and complexity without guaranteeing exactly-once processing in the consumer.

Option D is wrong because BatchWriteItem does not support condition expressions; it reduces the number of write requests but still writes a reprocessed record again.
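A conditional transactional write of this kind can be sketched as a request payload. The example below only builds the parameters and does not call AWS. The table name, the attribute names `sequence_number` and `payload`, and the assumption that `sequence_number` is the table's partition key are all hypothetical.

```python
def build_idempotent_put(table_name: str, sequence_number: str, payload: str) -> dict:
    """Build TransactWriteItems parameters that write a record only once.

    Assumes (hypothetically) that `sequence_number` is the partition key,
    so attribute_not_exists() fails when the record was already written.
    """
    return {
        "TransactItems": [
            {
                "Put": {
                    "TableName": table_name,
                    "Item": {
                        "sequence_number": {"S": sequence_number},
                        "payload": {"S": payload},
                    },
                    # Reject the write if an item with this key already exists.
                    "ConditionExpression": "attribute_not_exists(sequence_number)",
                }
            }
        ]
    }


request = build_idempotent_put("sensor-aggregates", "4959033827", "avg_temp=21.4")
print(request["TransactItems"][0]["Put"]["ConditionExpression"])
```

With boto3, a payload like this would be passed as `client.transact_write_items(**request)`; a duplicate would then surface as a cancelled transaction that the application can safely ignore.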

206
MCQhard

A machine learning team is building a fraud detection model. The dataset is highly imbalanced (99.9% legitimate, 0.1% fraudulent). Which EDA technique is most important to apply before modeling?

A.Normalize all numerical features to have zero mean and unit variance.
B.Remove outliers from the dataset using the IQR method.
C.Create a stratified train-test split to preserve the class distribution.
D.Perform correlation analysis to remove highly correlated features.
AnswerC

Ensures the rare class appears in both training and test sets.

Why this answer

Stratified sampling ensures the rare class is represented in both the train and test splits, preserving the imbalanced ratio for evaluation. Option A is wrong because normalization does not address imbalance. Option D is wrong because correlation analysis is not specific to imbalance.

Option B is wrong because removing outliers could eliminate fraud cases.
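A stratified split can be illustrated without any ML library. The following is a minimal sketch, assuming a 20% test fraction and a toy label list with 0.1% positives; the function name and parameters are illustrative.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Split indices so each class keeps roughly the same share in both sets."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # At least one test sample per class; a class with a single sample
        # would therefore end up only in the test set.
        n_test = max(1, round(len(indices) * test_fraction))
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy dataset: 10 fraudulent (1) and 9,990 legitimate (0) transactions.
labels = [1] * 10 + [0] * 9990
train_idx, test_idx = stratified_split(labels)

train_fraud = sum(labels[i] for i in train_idx)
test_fraud = sum(labels[i] for i in test_idx)
print(f"train fraud={train_fraud}, test fraud={test_fraud}, test size={len(test_idx)}")
```

A purely random split of the same data could easily leave the test set with no fraud cases at all, which would make recall on the rare class impossible to measure.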

207
MCQeasy

A data scientist needs to analyze a dataset stored in Amazon S3 as CSV files. The dataset contains 100 columns, and the data scientist wants to quickly understand the distribution of each column, including missing values, data types, and basic statistics. Which AWS service is best suited for this task?

A.AWS Glue DataBrew
B.Amazon SageMaker Data Wrangler
C.Amazon QuickSight
D.Amazon Athena
AnswerA

DataBrew profiles a dataset visually without requiring code.

Why this answer

Option A is correct because AWS Glue DataBrew provides visual data profiling and preparation without writing code. Option D is wrong because Amazon Athena is an interactive query service, not a profiling tool. Option C is wrong because Amazon QuickSight is for visualization, not data profiling.

Option B is wrong because SageMaker Data Wrangler is for feature engineering within SageMaker, but DataBrew is simpler for initial exploration.

208
Multi-Selectmedium

Which THREE of the following are valid ways to deploy a model using SageMaker? (Select THREE.)

Select 3 answers
A.Deploy to AWS Lambda
B.Deploy to a SageMaker batch transform job
C.Deploy to a SageMaker asynchronous endpoint
D.Deploy to a SageMaker real-time endpoint
E.Deploy to Amazon EC2 directly
AnswersB, C, D

Batch transform processes large batches of data asynchronously.

Why this answer

Options B, C, and D are correct. SageMaker can deploy to a batch transform job (B), an asynchronous endpoint (C), or a real-time endpoint (D). Option A is wrong because SageMaker does not deploy directly to Lambda.

Option E is wrong because SageMaker does not deploy to EC2 directly without containerization.

209
MCQeasy

A company is using Amazon SageMaker to deploy a model. The model is a large ensemble that requires 8 GB of memory. The company wants to minimize endpoint cost. Which instance type should they choose?

A.ml.c5.xlarge
B.ml.m5.large
C.ml.t2.medium
D.ml.c5.large
E.ml.c5.2xlarge
AnswerA

8 GB memory, minimum required.

Why this answer

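Option A is correct according to the answer key because ml.c5.xlarge has 8 GB of memory, sufficient for the model. Option D (ml.c5.large) has 4 GB, not enough. Option E (ml.c5.2xlarge) has 16 GB, over-provisioned.

Option B (ml.m5.large) also has 8 GB; the answer key prefers ml.c5.xlarge, but ml.m5.large typically has a lower hourly price, so verify current SageMaker pricing before relying on this distinction. Option C (ml.t2.medium) has 4 GB.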

210
MCQmedium

A data scientist is using Amazon SageMaker to train an XGBoost model for a regression problem. The training data contains missing values in some features. Which approach should the data scientist use to handle missing values in XGBoost?

A.Use K-nearest neighbors imputation
B.Leave missing values as-is; XGBoost handles them natively
C.Remove all rows with missing values
D.Impute missing values with the mean of the column
AnswerB

XGBoost can handle missing values by learning the optimal direction to split.

Why this answer

XGBoost has a built-in mechanism to handle missing values natively by learning the best direction to split on missing values during training. For each split, XGBoost assigns missing values to the left or right child node based on which direction minimizes the loss function, making explicit imputation unnecessary for this algorithm.

Exam trap

The trap here is that candidates often default to common imputation techniques (like mean imputation or row removal) without recognizing that XGBoost has a built-in, algorithm-specific method for handling missing values, which is a key differentiator tested in the MLS-C01 exam.

How to eliminate wrong answers

Option A is wrong because K-nearest neighbors imputation is a data preprocessing technique that introduces computational overhead and potential bias, and it is not needed since XGBoost handles missing values internally. Option C is wrong because removing all rows with missing values can lead to significant data loss and reduced model performance, especially when missingness is not completely at random. Option D is wrong because imputing missing values with the mean of the column can distort the underlying distribution and reduce variance, which may degrade model accuracy, and it is unnecessary given XGBoost's native missing value handling.

211
MCQmedium

Refer to the exhibit. A developer has this IAM policy attached to an IAM role used by SageMaker. When attempting to create an endpoint, the operation fails with an access denied error. What is the MOST likely cause?

A.The policy is missing ecr:DescribeRepositories.
B.The policy is missing s3:ListBucket on the model bucket.
C.The policy is missing sagemaker:DescribeEndpoint.
D.The policy is missing sagemaker:InvokeEndpoint.
AnswerB

SageMaker needs to list the bucket to access model artifacts.

Why this answer

The policy grants permissions to create the model, endpoint config, and endpoint. It does not include sagemaker:InvokeEndpoint (Option D), but that permission is needed only to call an endpoint, not to create one.

The correct answer is that the policy is missing s3:ListBucket (Option B) because SageMaker needs to list objects in the bucket when accessing the model artifacts. Option A (ecr:DescribeRepositories) is not needed. Option C (sagemaker:DescribeEndpoint) is not required for creation.

212
Multi-Selecthard

Which TWO statements about handling categorical variables in exploratory data analysis are correct? (Select TWO.)

Select 2 answers
A.When a categorical feature has high cardinality, consider grouping rare categories.
B.Target encoding always avoids data leakage.
C.One-hot encoding creates binary columns for each category.
D.Label encoding is suitable for nominal categorical variables.
E.Categorical variables should always be dropped if they have many unique values.
AnswersA, C

Grouping reduces dimensionality and overfitting.

Why this answer

Option A is correct because high-cardinality categorical features can lead to overfitting and sparse representations. Grouping rare categories into a single 'Other' bucket reduces dimensionality and noise, improving model generalization without losing significant predictive signal. Option C is correct because one-hot encoding represents each category as its own binary column.

Exam trap

The exam often tests the misconception that label encoding is safe for nominal data, when in fact it imposes an ordinal relationship that can distort model performance.

213
MCQmedium

A data engineer is building a data pipeline that aggregates customer transaction data. The engineer notices that some transactions have duplicate entries due to a system error. Which approach should the engineer use to identify and remove duplicates based on a unique transaction ID?

A.Sort the data by transaction ID and then check consecutive rows for equality
B.Use fuzzy matching to find similar transaction IDs
C.Group by all columns and aggregate with sum
D.Use the drop_duplicates method on the transaction ID column
AnswerD

drop_duplicates removes exact duplicate rows based on specified columns.

Why this answer

Option D is correct because dropping duplicates based on the transaction ID is straightforward and efficient. Option C is wrong because grouping by all columns and aggregating may lose information. Option B is wrong because fuzzy matching is for approximate matches, not exact duplicates.

Option A is wrong because sorting and then checking consecutive rows for equality is more complex than needed.
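Deduplication on a key column can be sketched in plain Python. The example below keeps the first row seen for each transaction ID, which mirrors what pandas does with `df.drop_duplicates(subset=["transaction_id"], keep="first")`. The column names and sample rows are illustrative assumptions.

```python
def drop_duplicate_transactions(rows, key="transaction_id"):
    """Keep only the first row seen for each value of `key`."""
    seen = set()
    unique_rows = []
    for row in rows:
        if row[key] not in seen:
            seen.add(row[key])
            unique_rows.append(row)
    return unique_rows


transactions = [
    {"transaction_id": "T1", "amount": 25.0},
    {"transaction_id": "T2", "amount": 40.0},
    {"transaction_id": "T1", "amount": 25.0},  # duplicate caused by the system error
]

deduplicated = drop_duplicate_transactions(transactions)
print([row["transaction_id"] for row in deduplicated])
```

This runs in a single pass and preserves the original row order, which is why a key-based drop is simpler than sorting and comparing neighbours.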

214
MCQhard

A data scientist is setting up an IAM role for an Amazon SageMaker training job. The policy shown is attached to the role. The training job fails with an access denied error when trying to read the training data from s3://my-bucket/training/data.csv. What is the most likely reason?

A.The bucket policy on my-bucket denies access to the IAM role
B.The training job is using a different IAM role
C.The IAM policy is missing the s3:ListBucket permission
D.The IAM role does not have permission to access the bucket location
AnswerA

The IAM policy allows GetObject, but the bucket policy may have a deny rule that overrides the allow.

Why this answer

Option A is correct because the IAM policy shown grants s3:GetObject access to the bucket and object, but the training job still fails with an access denied error. The most likely cause is that the bucket policy on my-bucket explicitly denies access to the IAM role, overriding the IAM policy's allow. In AWS, an explicit deny in a resource-based policy (bucket policy) takes precedence over any allow in an identity-based policy (IAM role policy), causing the access denied error despite the IAM policy appearing sufficient.

Exam trap

The trap here is that candidates often assume the IAM policy alone determines access, overlooking that resource-based policies (like S3 bucket policies) can override IAM permissions with explicit denies, especially when the bucket policy is not shown in the question.

How to eliminate wrong answers

Option B is wrong because the question states the IAM role is set up for the training job, and the policy shown is attached to that role; nothing in the stem suggests a different role is in use. Option C is wrong because s3:ListBucket is not required to read a specific object when the full object key is known; the s3:GetObject permission on the object ARN is sufficient for reading the data.csv file. Option D is wrong because the IAM policy explicitly grants s3:GetObject on the object ARN, so the role does have permission to access the bucket location; the failure is due to an external deny from the bucket policy, not a lack of permission in the IAM policy.

215
MCQhard

A data scientist is building a time series forecasting model for daily sales data. The data exhibits strong seasonality with a weekly pattern and a yearly trend. The scientist wants to use Amazon SageMaker's built-in algorithm. Which algorithm is most appropriate?

A.Amazon SageMaker DeepAR
B.Linear Learner
C.K-Means
D.XGBoost
AnswerA

DeepAR is a built-in algorithm for time series forecasting that handles seasonality and trends.

Why this answer

DeepAR is designed for time series forecasting with seasonality and trend. Option D is wrong because XGBoost is a tree-based model for tabular data, not specialized for time series. Option C is wrong because K-Means is clustering.

Option B is wrong because Linear Learner can model trends but not seasonality natively.

216
Multi-Selectmedium

A data scientist is performing exploratory data analysis on a dataset with 100 features. They want to identify which features are most correlated with the target variable. Which THREE methods are appropriate for this task?

Select 3 answers
A.Pearson correlation coefficient
B.Variance threshold
C.One-hot encoding
D.Feature importance from a random forest
E.Mutual information
AnswersA, D, E

Measures linear correlation between each feature and the target.

Why this answer

Pearson correlation captures linear relationships. Mutual information captures non-linear dependencies. Feature importance from a tree-based model provides a ranking.

The correct choices are therefore A, D, and E. Option B is wrong because a variance threshold looks only at each feature's own variance and ignores the target. Option C is wrong because one-hot encoding is a transformation of categorical features, not a measure of association with the target.
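Ranking features by their Pearson correlation with the target can be sketched as follows. The feature names and values are made-up toy data, and the helper is a plain implementation of the standard formula rather than any library's API.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


# Hypothetical features and target.
features = {
    "f_linear": [1, 2, 3, 4, 5],
    "f_inverse": [5, 4, 3, 2, 1],
    "f_noise": [2, 1, 2, 1, 2],
}
target = [2, 4, 6, 8, 10]

# Rank by absolute correlation so strong negative relationships are not missed.
ranked = sorted(features, key=lambda name: abs(pearson(features[name], target)), reverse=True)
print(ranked)
```

Pearson only captures linear relationships, which is why mutual information and tree-based feature importance are useful complements.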

217
MCQmedium

A machine learning team is using SageMaker to train a model. The training data is stored in an S3 bucket encrypted with AWS KMS. The training job fails with an 'AccessDenied' error. Which IAM permission is MOST likely missing from the SageMaker execution role?

A.s3:GetObject
B.s3:ListBucket
C.kms:Decrypt
D.kms:GenerateDataKey
AnswerC

To read encrypted objects, SageMaker needs kms:Decrypt permission.

Why this answer

SageMaker needs kms:Decrypt permission to read encrypted data from S3. The s3:GetObject permission is also needed, but the error specifically for encrypted data often points to missing KMS permissions. s3:ListBucket is for listing, not reading. kms:GenerateDataKey is for writing. kms:Encrypt is for writing encrypted data.

218
Multi-Selectmedium

Which TWO of the following are best practices for hyperparameter tuning using Amazon SageMaker? (Choose 2)

Select 2 answers
A.Use grid search to exhaustively explore all combinations.
B.Use early stopping to terminate poorly performing training jobs.
C.Include all algorithm hyperparameters in the tuning job.
D.Use a larger training dataset to improve tuning results.
E.Use automatic model tuning with Bayesian optimization.
AnswersB, E

Early stopping avoids wasted resources.

Why this answer

Option B is correct because early stopping saves time and cost. Option E is correct because automatic tuning with Bayesian optimization is efficient. Option A is wrong because exhaustive grid search is less efficient.

Option C is wrong because tuning all parameters may be unnecessary. Option D is wrong because more training data does not improve tuning efficiency.

219
MCQhard

A data scientist is training a deep learning model for image classification using TensorFlow on Amazon SageMaker. The model trains slowly, and the GPU utilization is below 20%. Which action will MOST effectively increase GPU utilization and reduce training time?

A.Reduce the training dataset size.
B.Increase the batch size to better saturate the GPU.
C.Switch to a CPU-only instance.
D.Decrease the batch size to reduce memory pressure.
AnswerB

Larger batches improve GPU utilization and throughput.

Why this answer

Option B is correct because increasing batch size provides more work per GPU step, improving utilization. Option D is wrong because decreasing batch size reduces utilization. Option C is wrong because switching to CPU would be slower.

Option A is wrong because reducing the dataset does not raise GPU utilization and increases the risk of underfitting.

220
Multi-Selectmedium

A company is deploying a machine learning model using Amazon SageMaker. The model needs to be updated frequently with new data. Which TWO approaches can be used to update the model without downtime? (Choose TWO.)

Select 2 answers
A.Delete the existing endpoint and create a new one with the updated model.
B.Directly update the model artifact in the existing endpoint configuration.
C.Use SageMaker A/B testing to gradually shift traffic to the new model variant.
D.Stop the endpoint, update the model, and restart the endpoint.
E.Use a blue/green deployment by deploying the new model on a separate endpoint and then updating the DNS record.
AnswersC, E

A/B testing with production variants allows traffic shifting without downtime.

Why this answer

Options C and E are correct because A/B testing with production variants shifts traffic gradually to the new model, and a blue/green deployment keeps the existing endpoint serving requests until traffic is switched to the new one. Option A is wrong because deleting and recreating the endpoint causes downtime. Option B is wrong because updating the model artifact while the endpoint is active is not supported without redeployment.

Option D is wrong because stopping the endpoint causes downtime.

221
MCQmedium

A data scientist needs to process a large dataset (100 TB) for training a machine learning model. The data is stored in Amazon S3. Which approach is most cost-effective and efficient for data processing?

A.Use AWS Glue ETL jobs.
B.Use Amazon EMR with Apache Spark.
C.Use Amazon Athena to run SQL queries.
D.Use Amazon SageMaker Processing with a single large instance.
AnswerB

Distributed processing is efficient for large data.

Why this answer

Option B is correct because Amazon EMR with Spark is designed for large-scale data processing and is cost-effective. Option C is wrong because Athena is for querying, not complex processing. Option A is wrong because Glue is for ETL but may be slower for large data.

Option D is wrong because SageMaker Processing with a single instance may not handle 100 TB efficiently.

222
Multi-Selecthard

A data engineering team is designing a streaming data pipeline that ingests 10,000 events per second. Each event is 2 KB. The pipeline must process events with a latency of less than 1 second. The team is considering using Amazon Kinesis Data Streams with 10 shards. Which TWO additional configurations should the team implement to meet the latency requirement? (Choose two.)

Select 2 answers
A.Use record aggregation to reduce the number of records.
B.Configure Kinesis Data Firehose to deliver data to Amazon S3.
C.Increase the number of shards to 100.
D.Enable auto-scaling of shards based on throughput.
E.Use enhanced fan-out for consumers.
AnswersD, E

Auto-scaling ensures that the stream has enough shards to handle peak load without manual intervention.

Why this answer

Correct options: D and E. Using enhanced fan-out gives each consumer dedicated read throughput from the stream, reducing latency. Auto-scaling shards ensures sufficient capacity as load varies.

Option C (increase shards to 100) is over-provisioning and increases cost. Option A (record aggregation) is for reducing PUT costs but doesn't affect read latency. Option B (Firehose delivery to S3) is not relevant to latency.
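The capacity side of this question can be checked with simple arithmetic. The sketch below assumes the commonly documented per-shard write limits of 1 MB/s and 1,000 records/s; those constants should be verified against the current Kinesis Data Streams quotas, and the function name is illustrative.

```python
import math

# Assumed per-shard write limits (verify against current service quotas).
SHARD_WRITE_MB_PER_SEC = 1.0
SHARD_WRITE_RECORDS_PER_SEC = 1000


def required_shards(events_per_sec: int, event_size_kb: float) -> int:
    """Minimum shard count needed to satisfy both write limits."""
    ingest_mb_per_sec = events_per_sec * event_size_kb / 1024
    by_bandwidth = math.ceil(ingest_mb_per_sec / SHARD_WRITE_MB_PER_SEC)
    by_records = math.ceil(events_per_sec / SHARD_WRITE_RECORDS_PER_SEC)
    return max(by_bandwidth, by_records)


# Workload from the question: 10,000 events per second at 2 KB each.
needed = required_shards(events_per_sec=10_000, event_size_kb=2)
provisioned = 10
print(f"needed={needed}, provisioned={provisioned}, sufficient={provisioned >= needed}")
```

Under these assumptions the workload is bandwidth-bound rather than record-bound, so the 10 provisioned shards fall short and shard scaling is needed, while 100 shards would be far more than required.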

223
MCQeasy

A data scientist is exploring a dataset with many features and wants to detect multicollinearity. Which technique should the scientist use?

A.Calculate the Variance Inflation Factor (VIF) for each feature.
B.Compute the Pearson correlation matrix between features.
C.Perform ANOVA on each feature against the target.
D.Create pairwise scatter plots of all features.
AnswerA

VIF measures how much the variance of a regression coefficient is inflated due to collinearity.

Why this answer

Variance Inflation Factor (VIF) is a standard metric for detecting multicollinearity. Option D (pairwise scatter plots) can hint but not quantify. Option B (Pearson correlation matrix) shows pairwise linear correlation but not multicollinearity among multiple variables.

Option C (ANOVA) is for comparing means.

224
MCQmedium

A company uses Amazon SageMaker to train a deep learning model on a GPU instance. The training job is taking too long. Which action would MOST likely reduce training time?

A.Reduce the mini-batch size
B.Use distributed data parallelism across multiple smaller instances
C.Use a larger GPU instance type, such as p3.16xlarge
D.Reduce the number of epochs
AnswerC

More powerful GPU accelerates training.

Why this answer

Option C is correct because using a larger GPU instance like p3.16xlarge provides significantly more GPU memory, CUDA cores, and memory bandwidth, which allows for larger batch sizes and more efficient parallel processing of matrix operations. This directly reduces training time for deep learning models by enabling faster forward and backward passes through the network, especially when the model is large enough to fully utilize the additional GPU resources.

Exam trap

The trap here is that candidates often confuse reducing mini-batch size (Option A) with improving training speed, but in GPU-accelerated deep learning, larger batch sizes better utilize GPU parallelism and reduce the number of iterations, making a larger instance the more effective solution.

How to eliminate wrong answers

Option A is wrong because reducing the mini-batch size typically increases the number of weight updates per epoch and can lead to noisier gradients, which often increases training time due to more frequent synchronization and less efficient GPU utilization. Option B is wrong because distributed data parallelism across multiple smaller instances introduces communication overhead (e.g., gradient synchronization via AllReduce) that can outweigh the benefits for a single GPU-bound training job, especially if the model does not fit in the smaller instances' memory. Option D is wrong because reducing the number of epochs directly reduces the amount of training performed, but it does not address the underlying performance bottleneck of the training process and may result in underfitting or incomplete convergence.

225
MCQeasy

A company wants to use Amazon SageMaker to host a model that was trained using a custom algorithm. The model artifact is stored in Amazon S3. The company wants to ensure that the endpoint can automatically scale based on the number of incoming requests. Which configuration should the company use?

A.Create a SageMaker multi-model endpoint with automatic scaling.
B.Create a SageMaker real-time endpoint and configure automatic scaling using a target tracking policy.
C.Use SageMaker Serverless Inference which scales automatically.
D.Use SageMaker Batch Transform with a scheduled job.
AnswerB

Real-time endpoints with auto-scaling adjust instance count based on load.

Why this answer

SageMaker real-time endpoints support automatic scaling using Application Auto Scaling, which can adjust the number of instances based on metrics like request count. Multi-model endpoints (A) are for serving multiple models. Batch Transform (D) is for offline inference.

Serverless Inference (C) scales automatically but has limitations; the question asks for endpoint scaling, and real-time endpoints with auto-scaling is the standard approach.
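A target tracking configuration for an endpoint variant can be sketched as two Application Auto Scaling request payloads. The example only builds the parameters and does not call AWS; the endpoint name, variant name, capacity bounds, and target value are hypothetical.

```python
def build_scaling_requests(endpoint_name, variant_name, min_instances, max_instances, target_invocations):
    """Build parameters for registering a scalable target and a target tracking policy."""
    resource_id = f"endpoint/{endpoint_name}/variant/{variant_name}"
    dimension = "sagemaker:variant:DesiredInstanceCount"

    scalable_target = {
        "ServiceNamespace": "sagemaker",
        "ResourceId": resource_id,
        "ScalableDimension": dimension,
        "MinCapacity": min_instances,
        "MaxCapacity": max_instances,
    }
    scaling_policy = {
        "PolicyName": f"{endpoint_name}-invocations-target",
        "ServiceNamespace": "sagemaker",
        "ResourceId": resource_id,
        "ScalableDimension": dimension,
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            # Target average invocations per instance per minute.
            "TargetValue": float(target_invocations),
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
            },
        },
    }
    return scalable_target, scaling_policy


target, policy = build_scaling_requests("custom-model", "AllTraffic", 1, 4, 100)
print(target["ResourceId"], policy["PolicyType"])
```

With boto3, these payloads would be passed to the `application-autoscaling` client as `register_scalable_target(**target)` followed by `put_scaling_policy(**policy)`.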
