AWS Certified Machine Learning Specialty MLS-C01 Questions 226–300 | Page 4/24

226

MCQeasy

A data scientist needs to perform hyperparameter optimization for a gradient boosting model. Which built-in Amazon SageMaker feature should they use?

A.Amazon SageMaker Automatic Model Tuning

B.Amazon SageMaker Clarify

C.Amazon SageMaker Debugger

D.Amazon SageMaker Neo

AnswerA

Performs hyperparameter optimization.

Why this answer

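SageMaker Automatic Model Tuning (Option A) performs hyperparameter optimization by running multiple training jobs over ranges of hyperparameter values and keeping the best-performing combination. Option B is wrong because SageMaker Clarify is for bias detection and explainability.

Option C is wrong because SageMaker Debugger is for monitoring and debugging training jobs. Option D is wrong because SageMaker Neo is for model compilation.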


227

Multi-Selectmedium

A data scientist is training a model using SageMaker and wants to automatically stop training when the model stops improving. Which TWO options can be used?

Select 2 answers

A.AWS Step Functions

B.Built-in early stopping in XGBoost

C.SageMaker Debugger

D.CloudWatch Alarms

E.SageMaker Model Monitor

AnswersB, C

Native early stopping support.

Why this answer

Built-in early stopping in XGBoost (Option B) is correct because XGBoost natively supports an `early_stopping_rounds` parameter that halts training when the validation metric stops improving for a specified number of rounds. SageMaker Debugger (Option C) is correct because it can monitor training metrics in real time and trigger a stop action via a built-in or custom rule (e.g., `VanishingGradient` or `LossNotDecreasing`) when the model stops improving, using the built-in `StopTraining` action, which calls the SageMaker `StopTrainingJob` API.
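The patience logic behind `early_stopping_rounds` can be sketched in plain Python. This is a minimal illustration of the idea, not XGBoost's implementation; the function name and the list of validation losses are hypothetical.

```python
def early_stopping_round(val_losses, patience):
    """Return (best_round, stop_round) for a sequence of validation losses.

    Training stops at the first round where the loss has not improved
    for `patience` consecutive rounds. If that never happens, the stop
    round is simply the last round.
    """
    best_loss = float("inf")
    best_round = -1
    for round_idx, loss in enumerate(val_losses):
        if loss < best_loss:
            # New best validation loss: remember where it occurred.
            best_loss = loss
            best_round = round_idx
        elif round_idx - best_round >= patience:
            # No improvement for `patience` rounds: stop here.
            return best_round, round_idx
    return best_round, len(val_losses) - 1


# Hypothetical validation losses, one per boosting round.
losses = [0.90, 0.70, 0.60, 0.61, 0.62, 0.63, 0.50]
best, stopped = early_stopping_round(losses, patience=3)
print(best, stopped)
```

Note that the final loss of 0.50 is never seen: training halts before reaching it, which is the usual trade-off of a small patience value.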

Exam trap

The trap here is that candidates may confuse SageMaker Model Monitor (post-deployment monitoring) with SageMaker Debugger (training-time monitoring), or assume CloudWatch Alarms can directly implement early stopping logic when they are only for threshold-based alerts on emitted metrics.


228

MCQmedium

A company is building a fraud detection model that must achieve low false positive rates. The dataset is highly imbalanced (0.1% positive class). Which metric is most appropriate for model evaluation?

A.RMSE

B.Accuracy

C.Area under the Precision-Recall curve

D.R-squared

AnswerC

Best for imbalanced datasets.

Why this answer

In highly imbalanced datasets (0.1% positive class), the Precision-Recall curve focuses on the performance of the positive class, which is the minority class of interest. Area under the Precision-Recall curve (AUPRC) is insensitive to the large number of true negatives, making it a robust metric for evaluating models where false positives must be minimized. Unlike ROC-AUC, which can be overly optimistic in severe imbalance, AUPRC directly reflects the trade-off between precision and recall for the rare positive class.

Exam trap

AWS often tests the misconception that ROC-AUC is always the best metric for imbalanced classification, but the trap here is that ROC-AUC can be overly optimistic because it considers true negatives, whereas Precision-Recall AUC focuses solely on the positive class and is the correct choice when false positives must be minimized.

How to eliminate wrong answers

Option A is wrong because RMSE (Root Mean Square Error) is a regression metric that measures the average magnitude of errors between continuous values, and is not suitable for binary classification or imbalanced fraud detection. Option B is wrong because Accuracy is misleading in highly imbalanced datasets; a model that predicts the majority class for all instances would achieve 99.9% accuracy but fail to detect any fraud. Option D is wrong because R-squared is a regression metric that measures the proportion of variance explained by the model, and has no relevance to binary classification or precision-recall evaluation.
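The accuracy trap described for Option B can be checked with a few lines of arithmetic. The sketch below uses a made-up dataset of 1,000 transactions with a single fraud case (0.1% positives) and a model that always predicts the majority class; the helper name is hypothetical.

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, and recall for binary labels (1 = fraud)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    # By convention here, precision and recall are 0.0 when undefined.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


# 1 fraudulent transaction out of 1,000 (0.1% positive class).
y_true = [1] + [0] * 999
# A useless model that always predicts "not fraud".
y_pred = [0] * 1000

accuracy, precision, recall = classification_metrics(y_true, y_pred)
print(accuracy, precision, recall)
```

Accuracy looks excellent while recall on the fraud class is zero, which is why precision-recall based metrics are preferred here.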


229

Multi-Selectmedium

A data scientist is using SageMaker to build a model for fraud detection. The dataset is highly imbalanced. Which THREE techniques should be applied to address class imbalance?

Select 3 answers

A.Train the model only on the majority class.

B.Use accuracy as the evaluation metric.

C.Apply SMOTE to generate synthetic samples of the minority class.

D.Use class weights in the loss function.

E.Undersample the majority class.

AnswersC, D, E

SMOTE creates synthetic examples to balance classes.

Why this answer

Options C, D, and E are correct. Oversampling the minority class (e.g., SMOTE) and undersampling the majority class help balance the dataset. Using class weights in the loss function penalizes misclassifications of the minority class more heavily.

Option B is incorrect because using accuracy as the metric can be misleading for imbalanced datasets; precision-recall or AUC is better. Option A is incorrect because training only on the majority class would ignore the minority class entirely.
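One common way to derive class weights is inverse class frequency. The sketch below is an illustration under that assumption, using the heuristic `n_samples / (n_classes * class_count)`; the function name and the 990/10 label split are hypothetical.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


# Hypothetical labels: 990 legitimate (0) and 10 fraudulent (1) records.
labels = [0] * 990 + [1] * 10
weights = balanced_class_weights(labels)
print(weights)
```

With these weights, an error on a fraud example contributes roughly 99 times more to a weighted loss than an error on a legitimate one.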


230

Multi-Selectmedium

A data scientist is performing EDA on a dataset with mixed data types (numerical, categorical, text). The dataset is stored in S3. Which TWO AWS services can be used to directly perform statistical summaries and visualizations without writing custom code?

Select 2 answers

A.Amazon SageMaker Studio

B.AWS Glue DataBrew

C.Amazon Athena

D.Amazon SageMaker Data Wrangler

E.Amazon QuickSight

AnswersD, E

Data Wrangler offers visual data analysis and built-in visualizations.

Why this answer

Options D and E are correct. SageMaker Data Wrangler provides a visual interface for data preparation and analysis with built-in transforms and visualizations. QuickSight is a BI service that can connect to S3 data and create dashboards with statistical summaries.

Option C is wrong because Athena is primarily a SQL query engine, not a visualization tool. Option B is wrong because Glue DataBrew is for data preparation but requires some configuration. Option A is wrong because SageMaker Studio is an IDE, not a direct analysis service.


231

MCQmedium

A data scientist is training a deep learning model on Amazon SageMaker using the built-in Object Detection algorithm. The training job is failing with a 'ResourceLimitExceeded' error when trying to launch multiple GPU instances. Which of the following is the MOST likely cause?

A.The training script has a syntax error.

B.The dataset is too large for the selected instance type.

C.The account has reached the limit for the number of GPU instances in the current AWS Region.

D.The S3 bucket containing the training data has insufficient permissions.

AnswerC

ResourceLimitExceeded indicates service limit reached; contact AWS to increase limits.

Why this answer

Option C is correct because the 'ResourceLimitExceeded' error typically indicates that the requested instance type or count exceeds the account's service limit for SageMaker training instances in that Region. Option B is wrong because SageMaker does not enforce a limit on the total dataset size. Option D is wrong because S3 bucket permissions would cause a different error (e.g., AccessDenied).

Option A is wrong because a broken training script would cause a different error (e.g., a Python SyntaxError or ModuleNotFoundError in the job logs).


232

MCQhard

A company is deploying a model for real-time inference with SageMaker. The endpoint receives spiky traffic, with occasional bursts of 10x normal load. Which scaling policy is MOST cost-effective while maintaining availability?

A.Provision a large instance type that can handle the peak load at all times.

B.Manually scale the endpoint based on historical traffic patterns.

C.Use a combination of scheduled scaling for predictable peaks and simple scaling for additional bursts.

D.Use a target tracking scaling policy based on average latency.

AnswerC

Scheduled scaling handles known patterns, while simple scaling provides reactive capacity for bursts.

Why this answer

Option C is correct because it combines scheduled scaling for predictable traffic patterns (e.g., known peak hours) with simple scaling to handle unexpected bursts, ensuring availability during 10x load spikes without over-provisioning. This hybrid approach is more cost-effective than always-on large instances, as it dynamically adjusts capacity only when needed, aligning with SageMaker's automatic scaling capabilities.

Exam trap

The trap here is that candidates often assume target tracking (Option D) is always optimal for cost, but it fails for spiky traffic because it reacts to post-burst metrics like latency, not preemptively scaling for sudden load changes.

How to eliminate wrong answers

Option A is wrong because provisioning a large instance type to handle peak load at all times leads to significant cost waste during low-traffic periods, as you pay for unused capacity continuously. Option B is wrong because manual scaling based on historical patterns cannot react quickly enough to sudden 10x bursts, risking latency or downtime during unpredictable spikes. Option D is wrong because target tracking based on average latency is reactive and may cause slow scaling, as latency increases only after the burst has already impacted performance, potentially leading to dropped requests or throttling.


233

MCQhard

A data scientist is building a recommendation system using matrix factorization. The dataset has 1 million users and 100,000 items, with a sparse user-item interaction matrix. The scientist wants to minimize training time on Amazon SageMaker. Which algorithm would be most appropriate?

A.Linear Learner

B.Factorization Machines

C.K-Means

D.XGBoost

AnswerB

Built for recommendation systems with sparse data.

Why this answer

Factorization Machines (B) are specifically designed for sparse, high-dimensional datasets like the user-item interaction matrix in recommendation systems. They extend matrix factorization by modeling pairwise feature interactions, which is ideal for collaborative filtering tasks. On Amazon SageMaker, the built-in Factorization Machines algorithm is optimized for sparse data and can train efficiently on 1 million users and 100,000 items, minimizing training time compared to general-purpose algorithms.
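To make "modeling pairwise feature interactions" concrete, here is a small sketch of the second-order factorization machine prediction, y = w0 + sum_i w_i x_i + sum_{i<j} <v_i, v_j> x_i x_j. It is not SageMaker's implementation; all weights and the four-feature input are made-up values. The second function uses the standard algebraic rewrite that avoids the explicit double loop over feature pairs.

```python
def fm_predict_naive(x, w0, w, V):
    """Second-order FM prediction with an explicit loop over feature pairs."""
    n = len(x)
    linear = w0 + sum(w[i] * x[i] for i in range(n))
    pairwise = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            dot = sum(vi * vj for vi, vj in zip(V[i], V[j]))
            pairwise += dot * x[i] * x[j]
    return linear + pairwise


def fm_predict_fast(x, w0, w, V):
    """Same prediction in O(k * n) using 0.5 * ((sum v x)^2 - sum (v x)^2)."""
    n = len(x)
    k = len(V[0])
    linear = w0 + sum(w[i] * x[i] for i in range(n))
    pairwise = 0.0
    for f in range(k):
        s = sum(V[i][f] * x[i] for i in range(n))
        s_sq = sum((V[i][f] * x[i]) ** 2 for i in range(n))
        pairwise += 0.5 * (s * s - s_sq)
    return linear + pairwise


# Hypothetical one-hot input: features 0 and 2 are active
# (think "user A" and "item X" in a sparse interaction row).
x = [1.0, 0.0, 1.0, 0.0]
w0 = 0.1
w = [0.2, -0.1, 0.3, 0.0]
V = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]  # latent factors, k = 2

print(fm_predict_naive(x, w0, w, V), fm_predict_fast(x, w0, w, V))
```

Because inactive features contribute zero to every sum, the cost of a prediction scales with the number of non-zero entries, which is what makes the model a good fit for sparse interaction data.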

Exam trap

The trap here is that candidates often choose XGBoost (D) because of its popularity and strong performance on tabular data, but they overlook that it requires dense feature engineering and is not optimized for the sparse, high-cardinality interaction matrices typical in recommendation systems.

How to eliminate wrong answers

Option A (Linear Learner) is wrong because it models only linear relationships and cannot capture the complex pairwise interactions between users and items that are essential for recommendation systems; it also does not handle sparse categorical features efficiently. Option C (K-Means) is wrong because it is an unsupervised clustering algorithm that groups similar data points, not a supervised or matrix factorization method for predicting user-item interactions; it cannot generate personalized recommendations from a sparse interaction matrix. Option D (XGBoost) is wrong because it is a tree-based ensemble method that requires dense feature engineering and is not designed for sparse matrix factorization; it would be computationally expensive and less effective on high-cardinality categorical features like user and item IDs.


234

Multi-Selecthard

A data engineer is analyzing a large dataset stored in Amazon S3 using AWS Glue and Amazon Athena. They notice that queries against a table with many small files are slow. Which TWO actions can improve query performance?

Select 2 answers

A.Use Athena's automatic compression

B.Increase the number of Glue DPUs

C.Convert files to Apache Parquet format

D.Decrease the number of partitions

E.Use a larger number of partitions

AnswersC, E

Columnar storage reduces I/O and improves compression.

Why this answer

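Converting the files to Apache Parquet (Option C) improves performance because columnar storage reduces I/O and compresses better, and the conversion job is also an opportunity to compact many small files into fewer, larger ones, which reduces per-file overhead. Partitioning the data (Option E) limits the amount of data scanned when queries filter on the partition column. Option B is wrong because Glue DPUs speed up Glue ETL jobs, not Athena queries, and Option D is wrong because fewer partitions generally means more data scanned per query.

The correct pair is C and E.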


235

MCQeasy

A data scientist is building a time series forecasting model for monthly sales data. The scientist has observed that the sales data shows a clear upward trend and a seasonal pattern that repeats every 12 months. Which algorithm would be most appropriate for this task?

A.ARIMA

B.Random Forest

C.k-means clustering

D.Linear regression with time-based features

AnswerA

ARIMA (or SARIMA) directly models trend and seasonality in time series data.

Why this answer

ARIMA (Autoregressive Integrated Moving Average) is specifically designed for time series forecasting and can handle both trend and seasonality through its parameters: the 'I' (differencing) removes trend, and seasonal ARIMA (SARIMA) extends it with seasonal differencing and seasonal AR/MA terms to capture the 12-month repeating pattern. This makes it the most appropriate choice for monthly sales data with a clear upward trend and annual seasonality.
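The effect of the two differencing steps can be seen on a toy series. The sketch below builds hypothetical monthly sales with a linear trend plus a fixed 12-month pattern, then applies seasonal differencing (lag 12) followed by regular differencing (lag 1). It only illustrates differencing, not a full SARIMA fit; the numbers are invented.

```python
def difference(series, lag=1):
    """Return series[t] - series[t - lag] for every t >= lag."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]


# Hypothetical 12-month seasonal pattern and a trend of +2 units per month.
season = [5, 3, 0, -2, -4, -5, -3, 0, 2, 4, 6, 8]
sales = [100 + 2 * t + season[t % 12] for t in range(36)]

# Seasonal differencing cancels the repeating pattern and leaves the
# year-over-year growth implied by the trend (2 units * 12 months).
seasonal_diff = difference(sales, lag=12)

# Regular differencing then removes what is left of the trend.
stationary = difference(seasonal_diff, lag=1)

print(seasonal_diff[:3], stationary[:3])
```

On real data the result would be noisy rather than constant, and the remaining autocorrelation is what the AR and MA terms model.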

Exam trap

The trap here is that candidates often choose Linear regression with time-based features (Option D) because they think adding a time index and month dummies is sufficient, but they overlook that ARIMA is purpose-built for time series with autocorrelation and seasonality, while linear regression violates the independence assumption and cannot model the stochastic seasonal patterns without extensive feature engineering.

How to eliminate wrong answers

Option B (Random Forest) is wrong because it is a tree-based ensemble method for regression or classification that does not inherently model temporal dependencies or seasonality; it treats each time point as an independent feature, ignoring the sequential nature and autocorrelation of time series data. Option C (k-means clustering) is wrong because it is an unsupervised clustering algorithm used to partition data into groups based on similarity, not for forecasting future values in a time series. Option D (Linear regression with time-based features) is wrong because while it can model a linear trend by including a time index feature, it cannot capture the complex autocorrelation structure and seasonal patterns without manually engineering lagged variables and seasonal dummies, and it assumes independent errors, which is violated in time series data.


236

MCQmedium

A company has a dataset with a timestamp column and multiple numerical metrics. They want to identify seasonality and trends. Which AWS service is best suited for this analysis?

A.Amazon SageMaker Canvas

B.Amazon CloudWatch

C.Amazon QuickSight

D.Amazon Athena

AnswerC

QuickSight offers time series analysis and forecasting capabilities.

Why this answer

Amazon QuickSight (Option C) provides built-in time series visualization and forecasting. Option A is wrong because SageMaker Canvas is for building ML models without code. Option D is wrong because Athena is for querying data with SQL.

Option B is wrong because CloudWatch is for monitoring AWS resources.


237

Matchingmedium

Match each AWS security service to its function in ML.


Concepts

Matches

Manage access to AWS resources

Encryption key management

Audit API calls

Isolate network resources

Discover and protect sensitive data

Why these pairings

Security services are important for compliance in ML.


238

MCQmedium

A data scientist is working with a dataset containing customer transaction records stored in Amazon S3 as CSV files. The dataset has 500 columns and 2 million rows. The scientist wants to perform EDA to understand data types, missing values, and summary statistics for each column. They need to do this quickly and without writing custom code. The scientist has access to AWS Glue DataBrew and Amazon SageMaker Data Wrangler. Which approach should the scientist take?

A.Use Amazon SageMaker Data Wrangler to import the data and generate a report

B.Use Amazon Athena to run SELECT statements on each column

C.Use AWS Glue DataBrew to create a profile job that outputs data quality reports

D.Use AWS Glue ETL jobs with PySpark to compute statistics

AnswerC

DataBrew's profile job automatically computes statistics and detects missing values.

Why this answer

AWS Glue DataBrew (Option C) provides a visual interface for data profiling and can handle large datasets without writing code. Its profile job automatically detects data types, missing values, and summary statistics. Option A is wrong because SageMaker Data Wrangler requires more manual setup.

Option B is wrong because Athena requires SQL queries and is not a profiling tool. Option D is wrong because Glue ETL jobs require writing code.


239

Drag & Dropmedium

Drag and drop the steps to deploy a model as a SageMaker endpoint for real-time inference in the correct order.


Steps

Order

Why this order

Deployment requires model creation, endpoint configuration, endpoint creation, and testing.


240

MCQmedium

A DevOps engineer runs the CloudWatch Logs Insights query shown above on the log group for an ML training job. The result shows a spike in ERROR messages at a specific hour. What should the engineer do next to identify the root cause?

A.Modify the query to display the actual @message for the hour with the spike.

B.Remove the filter on ERROR to see all messages.

C.Change the bin to 5m to see more detailed spikes.

D.Increase the limit to 50 to see more hours.

AnswerA

Directly see error details.

Why this answer

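Option A is correct because examining the actual error messages during that hour can reveal the cause. Option C is wrong because the spike has already been located with 1h bins, and bin(5m) only adds granularity without showing what the errors say. Option B is wrong because removing the ERROR filter adds unrelated messages and makes the errors harder to find.

Option D is wrong because the spike is already identified, so showing more hours adds nothing.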


241

Multi-Selecthard

A data engineer is performing exploratory data analysis on a dataset with 1 million rows and 50 features. The engineer wants to identify missing values and outliers. Which THREE approaches should the engineer use? (Choose three.)

Select 3 answers

A.Create a correlation heatmap of all features

B.Use a DataFrame.info() method to see non-null counts

C.Plot box plots for all features simultaneously

D.Use a missingno matrix to visualize missing data patterns

E.Use a DataFrame.describe() to view summary statistics

AnswersB, D, E

info() shows non-null counts and data types.

Why this answer

Options B, D, and E are correct. B gives non-null counts per column, D provides a visual summary of missing-data patterns, and E gives summary statistics (min, max, quartiles) that expose outliers. Option C is wrong because box plots are for continuous variables, not for all features.

Option A is wrong because a correlation heatmap does not show missing values or outliers.
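The two checks can be sketched without any third-party library. The example below counts missing entries and flags outliers with the 1.5 x IQR rule, which is the same convention a box plot uses for its whiskers; the IQR rule is an illustrative choice here, not something the question prescribes, and the column values are made up.

```python
import statistics


def missing_count(values):
    """Count entries recorded as None (the analogue of a null cell)."""
    return sum(1 for v in values if v is None)


def iqr_outliers(values):
    """Return non-missing values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    present = sorted(v for v in values if v is not None)
    q1, _median, q3 = statistics.quantiles(present, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in present if v < low or v > high]


# Hypothetical numeric column with two missing cells and one extreme value.
column = [10, 12, None, 11, 13, 12, None, 11, 100]

print(missing_count(column), iqr_outliers(column))
```

In pandas, `DataFrame.info()` and `DataFrame.describe()` surface the same information (non-null counts and quartiles) for every column at once.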


242

MCQmedium

A machine learning engineer is monitoring a deployed model on SageMaker and notices that the prediction latency is increasing over time. The model is a linear regression with a small number of features. Which is the MOST likely cause?

A.The number of features is too large

B.The CPU utilization is too low

C.The model is overfitting to recent data

D.The inference code has a memory leak

AnswerD

Memory leaks cause gradual performance degradation and increased latency.

Why this answer

Memory leak or accumulation of model artifacts in inference code can cause latency growth over time.


243

MCQmedium

A data scientist uses SageMaker Studio to run EDA on a dataset with 500 features. The goal is to reduce dimensionality before modeling. Which EDA technique should the data scientist use to understand the variance explained by each feature?

A.Histogram of the target variable

B.Scree plot of principal components

C.Heatmap of feature correlations

D.Box plot of each feature

AnswerB

Scree plot displays variance explained by each component.

Why this answer

A scree plot from PCA shows the eigenvalues or variance explained by each principal component, helping decide how many components to retain. Option C is wrong because a heatmap of correlations shows pairwise relationships, not variance. Option A is wrong because a histogram shows the distribution of a single variable.

Option D is wrong because a box plot shows summary statistics.
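The numbers behind a scree plot are simple ratios of eigenvalues. The sketch below assumes the eigenvalues of the covariance matrix are already available (the five values are invented) and shows how the explained-variance ratios and a cumulative threshold lead to a component count; the 95% threshold is an arbitrary illustrative choice.

```python
def explained_variance_ratio(eigenvalues):
    """Fraction of total variance carried by each principal component."""
    total = sum(eigenvalues)
    return [ev / total for ev in eigenvalues]


def components_for_threshold(eigenvalues, threshold):
    """Smallest number of leading components whose cumulative ratio
    reaches the threshold."""
    cumulative = 0.0
    ratios = explained_variance_ratio(eigenvalues)
    for k, ratio in enumerate(ratios, start=1):
        cumulative += ratio
        if cumulative >= threshold:
            return k
    return len(eigenvalues)


# Hypothetical eigenvalues, sorted in descending order.
eigenvalues = [5.0, 2.5, 1.5, 0.6, 0.4]

ratios = explained_variance_ratio(eigenvalues)
n_keep = components_for_threshold(eigenvalues, threshold=0.95)
print(ratios, n_keep)
```

Plotting `ratios` against the component index gives the scree plot; the "elbow" where the curve flattens is the usual visual cue for where to cut.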


244

MCQhard

A financial services company uses Amazon Kinesis Data Streams with 50 shards to ingest real-time stock trade data. The data is consumed by a custom Java application running on Amazon EC2 instances. Recently, the application has been experiencing high latency, and CloudWatch metrics show that the average iterator age is increasing. The application uses the Kinesis Client Library (KCL) with DynamoDB for lease tracking. The EC2 instances are in an Auto Scaling group with a minimum of 2 and maximum of 10 instances, and the current CPU utilization is below 50%. The team wants to reduce latency without increasing costs significantly. What should they do?

A.Increase the provisioned read capacity of the DynamoDB lease table

B.Enable enhanced fan-out on the Kinesis stream

C.Increase the number of shards in the Kinesis stream

D.Increase the maximum size of the Auto Scaling group and set a scaling policy based on iterator age

AnswerD

Adding more consumers reduces iterator age.

Why this answer

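Increasing the number of consumers (EC2 instances) allows the KCL workers to process the 50 shards in parallel, reducing iterator age; scaling on iterator age adds instances only when the consumers fall behind. Option C is wrong because adding shards increases stream cost and does not help when the consumers, rather than the stream, are falling behind. Option A is wrong because DynamoDB provisioned throughput is not the bottleneck.

Option B is wrong because enhanced fan-out gives each registered consumer application dedicated read throughput, which helps when several applications compete for the same shards, not when a single application has too few workers.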


245

MCQhard

A team is training a deep learning model on SageMaker using a custom Docker container. The training job fails with 'OutOfMemoryError'. The instance type is ml.p3.2xlarge with 61 GB memory. Which change should increase available memory?

A.Reduce the batch size to use less memory.

B.Use SageMaker distributed data parallelism to distribute the model across multiple instances.

C.Set the 'shm-size' parameter in the SageMaker training container to a larger value.

D.Mount an Amazon FSx for Lustre file system to offload data.

AnswerC

Increasing shared memory (/dev/shm) can resolve OutOfMemory errors in deep learning frameworks.

Why this answer

The 'OutOfMemoryError' in a SageMaker training container often stems from insufficient shared memory (/dev/shm) for data-loading workers, especially with PyTorch or TensorFlow dataloaders that use multiprocessing. Increasing the 'shm-size' parameter allocates more shared memory to the container, resolving the error without altering the model or instance type.

Exam trap

AWS often tests the misconception that 'OutOfMemoryError' always refers to GPU memory, leading candidates to choose batch size reduction, when in SageMaker containers it frequently indicates insufficient shared memory for data-loading processes.

How to eliminate wrong answers

Option A is wrong because reducing the batch size decreases GPU memory usage, not the shared memory (/dev/shm) that causes the 'OutOfMemoryError' in this context. Option B is wrong because SageMaker distributed data parallelism splits the model or data across multiple instances, which does not increase the memory available to a single container; it adds complexity without addressing the shared memory limit. Option D is wrong because mounting an Amazon FSx for Lustre file system offloads storage, not memory; it does not increase the container's shared memory or RAM capacity.


246

MCQmedium

A data scientist creates a model resource in SageMaker using the JSON configuration in the exhibit. When creating an endpoint, the deployment fails with an error 'ModelError: Cannot find inference code'. What is the MOST likely cause?

A.The model.tar.gz file is missing the model weights

B.The ECR image does not exist

C.The inference container environment does not specify SAGEMAKER_PROGRAM

D.The training container does not have the SAGEMAKER_PROGRAM variable

AnswerC

The inference container needs the SAGEMAKER_PROGRAM variable to point to the inference script.

Why this answer

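Option C is correct because the inference container's Environment is empty; it should define SAGEMAKER_PROGRAM for the inference script. Option A is unlikely because the error concerns inference code, and the model weights are not necessarily missing. Option B would cause a different error.

Option D is wrong because the training container's SAGEMAKER_PROGRAM variable is not required for inference.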


247

MCQmedium

A company is using AWS Glue Data Catalog as the metadata store for their data lake. They have multiple AWS accounts and want to share the catalog across accounts. Which feature should they use?

A.Amazon Athena Federated Query

B.AWS Lake Formation

C.AWS Resource Access Manager (RAM)

D.Amazon S3 Cross-Region Replication

AnswerC

RAM allows sharing Glue Data Catalog across accounts.

Why this answer

AWS Resource Access Manager (RAM) enables you to share AWS Glue Data Catalog databases and tables across multiple AWS accounts without needing to copy metadata. This allows a centralized catalog to be consumed by different accounts for querying and ETL operations, maintaining a single source of truth for the data lake.

Exam trap

The trap here is that candidates often confuse AWS Lake Formation's cross-account access capabilities with the actual sharing mechanism, but Lake Formation relies on AWS RAM to enable the sharing of Data Catalog resources.

How to eliminate wrong answers

Option A is wrong because Amazon Athena Federated Query allows querying data from external sources (e.g., CloudWatch, DynamoDB) using connectors, but it does not share the Glue Data Catalog across accounts. Option B is wrong because AWS Lake Formation provides fine-grained access control and data lake management, but cross-account catalog sharing is implemented via AWS RAM, not directly by Lake Formation (though Lake Formation can use RAM for sharing). Option D is wrong because Amazon S3 Cross-Region Replication replicates objects between S3 buckets in different regions, but it does not share the Glue Data Catalog metadata store across accounts.


248

MCQhard

A financial services company uses Amazon SageMaker to train a model for credit risk prediction. The dataset contains 500 features and 1 million records. The target variable is binary with 20% default rate. The data scientist uses a gradient boosting algorithm (XGBoost) with default hyperparameters. After training, the model achieves 95% accuracy, but the precision for the default class is only 30%, and recall is 15%. The business requires at least 50% recall and 40% precision for the default class. The data scientist tries to adjust the decision threshold, but this does not simultaneously meet both targets. The scientist suspects that the model is not learning the default patterns well. The company also has a large dataset of unlabeled transactions that could be used. Which action should the data scientist take to improve the model?

A.Apply PCA to reduce dimensionality and noise.

B.Use the unlabeled data for semi-supervised learning with pseudo-labeling.

C.Increase the learning rate to accelerate convergence.

D.Reduce the number of features using feature selection to simplify the model.

AnswerB

Pseudo-labeling leverages unlabeled data to improve minority class detection.

Why this answer

Option B is correct because using the unlabeled data for pseudo-labeling can improve the model when the labeled dataset is imbalanced and the model is struggling to learn the minority class. Option D is wrong because reducing the number of features may not help if the features are relevant. Option C is wrong because increasing the learning rate may cause overfitting or divergence.

Option A is wrong because PCA may discard valuable information.


249

MCQeasy

A data scientist needs to query a 2 TB dataset stored in Amazon S3 using Amazon Athena. The data is in CSV format and is used for exploratory analysis. Queries are currently slow and expensive. Which action will improve query performance and reduce cost?

A.Convert the data to JSON format to improve compression.

B.Increase the number of workers in the Athena query engine.

C.Convert the data to Parquet format and partition by a commonly filtered column.

D.Create a composite index on the data using Athena's index feature.

AnswerC

Parquet reduces data scanned due to columnar storage, and partitioning limits scan range.

Why this answer

Option C is correct because converting CSV to Parquet reduces scan size and query cost, and partitioning further limits data scanned. Option B is wrong because Athena is serverless, so increasing workers is not applicable. Option A is wrong because converting to JSON may increase data size and cost.

Option D is wrong because Athena does not use indexes.
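The cost argument can be made concrete with rough arithmetic. The sketch below assumes Athena bills in proportion to bytes scanned; the price per TB, the 4x Parquet size reduction, the 5-of-50 column selection, and the 12 equally sized monthly partitions are all hypothetical inputs chosen for illustration, not figures from the question or from AWS pricing.

```python
TB = 1024 ** 4  # bytes per tebibyte


def scan_cost_usd(bytes_scanned, price_per_tb_usd):
    """Cost of a query that is billed in proportion to bytes scanned."""
    return bytes_scanned / TB * price_per_tb_usd


PRICE_PER_TB_USD = 5.0  # assumed price, for illustration only

# Baseline: a query over row-oriented CSV has to read the full 2 TB.
csv_bytes = 2 * TB
csv_cost = scan_cost_usd(csv_bytes, PRICE_PER_TB_USD)

# Assumed effects of the optimization (all hypothetical):
parquet_bytes = csv_bytes / 4    # Parquet plus compression: 4x smaller
column_fraction = 5 / 50         # query touches 5 of 50 equal-width columns
partition_fraction = 1 / 12      # filter matches 1 of 12 monthly partitions

optimized_bytes = parquet_bytes * column_fraction * partition_fraction
optimized_cost = scan_cost_usd(optimized_bytes, PRICE_PER_TB_USD)

print(round(csv_cost, 4), round(optimized_cost, 4))
```

Under these assumptions the three effects multiply, so the scanned volume drops by a factor of 480 compared with the full CSV scan.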


250

MCQmedium

A data scientist is deploying a PyTorch model to Amazon SageMaker for real-time inference. The model runs on a large instance but inference latency is too high. Which action is MOST likely to reduce latency without sacrificing accuracy?

A.Compile the model using SageMaker Neo

B.Switch from a GPU instance to a CPU instance

C.Quantize the model weights from FP32 to INT8

D.Deploy the model to a multi-model endpoint

AnswerA

Neo optimizes the model for the target hardware, reducing latency without retraining or accuracy loss.

Why this answer

Option A is correct because SageMaker Neo optimizes trained models for target hardware, reducing latency without retraining. Option C may reduce latency but could affect accuracy. Option B changes the instance type but does not optimize the model.

Option D changes the endpoint type, not latency.


251

Multi-Selectmedium

Which TWO options are valid ways to reduce the amount of data scanned by Amazon Athena queries, thereby reducing cost?

Select 2 answers

A.Use columnar storage formats like Parquet or ORC

B.Use LIMIT clause in SQL queries

C.Convert data to CSV format

D.Create materialized views in Athena

E.Partition the data by a frequently filtered column

AnswersA, E

Columnar formats allow reading only required columns.

Why this answer

Partitioning allows Athena to skip entire partitions. Using columnar formats like Parquet reduces the amount of data read per column. Converting to CSV increases data scanned.

Materialized views don't reduce scan. Limiting row count does not reduce scan of underlying data.


252

Multi-Selecteasy

Which TWO actions can help reduce overfitting when training a model on SageMaker? (Choose TWO.)

Select 2 answers

A.Reduce the amount of training data

B.Use early stopping based on validation error

C.Increase the maximum depth of the trees in XGBoost

D.Increase the number of training epochs

E.Add L1 regularization to the loss function

AnswersB, E

Early stopping halts training when validation error stops improving, preventing overfitting.

Why this answer

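Early stopping (Option B) monitors the validation error during training and halts the process when the error stops improving, preventing the model from learning noise in the training data. This directly reduces overfitting by ensuring the model does not continue to fit to spurious patterns after generalization has peaked. L1 regularization (Option E) is also correct: it adds a penalty proportional to the absolute value of the weights to the loss, which shrinks uninformative weights toward zero and limits model complexity.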

Exam trap

AWS often tests the misconception that adding more data or increasing model complexity (like tree depth or epochs) always improves performance, when in fact these actions typically worsen overfitting without proper validation or regularization.


253

MCQhard

A company uses SageMaker to deploy a model for real-time inference. The model is a large ensemble that requires 8 GB of memory and has high latency. The team wants to reduce latency without increasing cost. Which strategy is most effective?

A.Use a larger instance type with more memory.

B.Deploy the model on multiple instances behind a load balancer.

C.Use SageMaker Neo to compile the model for the target instance.

D.Switch from real-time inference to batch transform.

AnswerC

Neo optimizes model for faster inference without additional cost.

Why this answer

Option C is correct because SageMaker Neo optimizes trained models for target hardware, reducing latency and memory footprint. Option A is wrong because using a larger instance increases cost. Option D is wrong because batch transform is for offline inference, not real-time.

Option B is wrong because multiple instances increase cost.

Full explanation →

254

MCQmedium

A data scientist is using Amazon SageMaker to train a model. The training data is stored in an S3 bucket encrypted with AWS KMS. During training, the job fails with an access denied error. What is the MOST likely cause?

A.The training instance type does not support encryption

B.The training data is not in the same region as the SageMaker notebook

C.The S3 bucket policy does not allow SageMaker to list objects

D.The SageMaker execution role lacks kms:Decrypt permission for the KMS key

AnswerD

SageMaker needs KMS decrypt permissions to read encrypted data from S3.

Why this answer

SageMaker needs permission to use the KMS key to decrypt the data; the execution role must have kms:Decrypt permissions.
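As a rough sketch, an identity-based policy granting the permission described above could be built as follows. The key ARN, account ID, and statement ID are hypothetical placeholders, and the KMS key policy must also permit the role (or delegate access control to IAM) for the grant to take effect.

```python
import json

# Hypothetical key ARN; replace with the ARN of the key that encrypts the bucket.
KMS_KEY_ARN = "arn:aws:kms:us-east-1:111122223333:key/EXAMPLE-KEY-ID"

# Policy to attach to the SageMaker execution role.
kms_decrypt_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowDecryptTrainingData",  # hypothetical statement ID
            "Effect": "Allow",
            "Action": ["kms:Decrypt"],
            "Resource": KMS_KEY_ARN,
        }
    ],
}

policy_json = json.dumps(kms_decrypt_policy, indent=2)
print(policy_json)
```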

Full explanation →

255

MCQmedium

A machine learning engineer is trying to deploy a model using a SageMaker endpoint but receives an access denied error. The IAM policy attached to the role is shown in the exhibit. What is the MOST likely cause of the error?

A.The policy does not include sagemaker:CreateEndpoint.

B.The policy does not specify resource ARNs.

C.The policy does not include sagemaker:InvokeEndpoint.

D.The policy does not include iam:PassRole.

AnswerD

SageMaker requires iam:PassRole to use the execution role.

Why this answer

The error occurs because the IAM principal performing the deployment lacks the iam:PassRole permission on the SageMaker execution role. SageMaker assumes that execution role to access the resources it needs (e.g., model artifacts in S3) during endpoint deployment, and the caller must be allowed to pass the role to the service. Without this permission, the request fails with an access denied error even if the SageMaker actions themselves are allowed.
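A policy statement granting iam:PassRole might look like the following sketch. The role ARN, account ID, and statement ID are hypothetical; the `iam:PassedToService` condition is an optional restriction so the role can only be passed to SageMaker.

```python
import json

# Hypothetical execution role ARN that the deployed model will run as.
EXECUTION_ROLE_ARN = "arn:aws:iam::111122223333:role/ExampleSageMakerExecutionRole"

pass_role_statement = {
    "Sid": "AllowPassingExecutionRoleToSageMaker",  # hypothetical statement ID
    "Effect": "Allow",
    "Action": "iam:PassRole",
    "Resource": EXECUTION_ROLE_ARN,
    # Restrict passing the role to the SageMaker service only.
    "Condition": {
        "StringEquals": {"iam:PassedToService": "sagemaker.amazonaws.com"}
    },
}

print(json.dumps(pass_role_statement, indent=2))
```

Scoping `Resource` to a single role, rather than `*`, keeps the principal from passing more privileged roles to the service.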

Exam trap

The trap here is that candidates often focus on missing SageMaker-specific actions (like CreateEndpoint or InvokeEndpoint) rather than recognizing that the fundamental issue is the missing iam:PassRole permission, which is a common prerequisite for any AWS service that needs to assume a role.

How to eliminate wrong answers

Option A is wrong because sagemaker:CreateEndpoint is not required for deploying a model to an existing endpoint; the error occurs during the deployment step where the role is passed, not during endpoint creation. Option B is wrong because the policy does specify resource ARNs (e.g., 'Resource': '*'), so the absence of ARNs is not the issue. Option C is wrong because sagemaker:InvokeEndpoint is used for invoking the endpoint after deployment, not for the deployment itself, and the error occurs before invocation.

Full explanation →

256

MCQmedium

An IAM policy attached to a SageMaker notebook role is shown. The data engineer tries to run an Athena query on a table in the 'my_database' Glue database. The query fails with an access denied error. What is the MOST likely cause?

A.The policy does not allow s3:PutObject on the query results location.

B.The policy does not allow glue:GetTable on the specific database.

C.The policy does not allow athena:StartQueryExecution on the Athena workgroup.

D.The policy does not allow s3:ListBucket on the bucket.

AnswerC

Athena requires workgroup-level permissions for StartQueryExecution, which the policy does not grant for the workgroup in use.

Why this answer

Option C is correct because the policy does not grant permission to the Athena workgroup. Athena requires workgroup permissions for StartQueryExecution. Option A is wrong because the policy allows GetObject.

Option B is wrong because the policy allows GetTable and GetDatabase. Option D is wrong because S3 actions are allowed on the bucket.

Full explanation →

257

MCQhard

A data scientist trains a neural network using TensorFlow on SageMaker. The training job fails with a 'CUDA out of memory' error. What is the most likely cause and solution?

A.The dataset is too large. Use SageMaker Pipe mode.

B.The model is too large for the GPU. Use a smaller batch size.

C.The training script has a bug. Use SageMaker Debugger.

D.The instance type is insufficient. Use distributed training across multiple instances.

AnswerB

Reducing batch size decreases memory usage.

Why this answer

CUDA out of memory indicates that the GPU memory is insufficient for the batch size or model size. Reducing the batch size is a common fix. Switching to CPU is not ideal for deep learning.

Increasing the number of instances may help but requires distributed training setup. Upgrading to a larger instance type is another option, but reducing batch size is simpler.

Full explanation →

258

MCQeasy

A company uses SageMaker to host a real-time inference endpoint. The endpoint is receiving a large number of requests, but the latency is higher than expected. The data scientist observes that the CPU utilization is low but memory utilization is high. Which action should be taken to reduce latency?

A.Switch to an instance type with more memory or optimize the model to reduce memory footprint.

B.Enable VPC traffic mirroring to diagnose network issues.

C.Use an instance type with more vCPUs.

D.Increase the number of instances in the endpoint.

AnswerA

Addresses memory bottleneck.

Why this answer

Option A is correct because high memory utilization suggests the model is memory-bound; increasing instance memory or using a model with lower memory footprint can reduce latency. Option C is wrong because CPU utilization is low, so more CPU cores won't help. Option D is wrong because increasing instance count can help throughput but not necessarily latency per request; it may also increase cost.

Option B is wrong because the issue is memory, not network.

Full explanation →

259

MCQeasy

A data engineer needs to transfer 50 TB of data from an on-premises HDFS cluster to Amazon S3. The data must be encrypted in transit and at rest. The on-premises network has a 1 Gbps connection to AWS. The transfer must complete within 5 days. Which solution is MOST cost-effective and meets the requirements?

A.Use S3 Transfer Acceleration to upload the data directly from HDFS to S3.

B.Use AWS DataSync with a DataSync agent installed on-premises to transfer the data to S3.

C.Order an AWS Snowball Edge device and copy the data to it, then ship it back.

D.Use AWS Glue to read from HDFS and write to S3 in a continuous ETL job.

AnswerB

DataSync can transfer over network with encryption and is optimized for speed.

Why this answer

Option B is correct. With 1 Gbps, the maximum theoretical transfer in 5 days is about 54 TB (1 Gbps = 0.125 GB/s, 0.125 * 86400 * 5 = 54000 GB = 54 TB), so moving 50 TB over the network is feasible.

AWS DataSync can transfer data from HDFS via a private endpoint using the DataSync agent, with encryption in transit (TLS) and at rest (S3 SSE). Option A is wrong because S3 Transfer Acceleration only speeds up uploads over public internet, not from HDFS directly. Option C is wrong because Snowball Edge adds device and shipping cost for a volume that can fit in the time window over the existing link.

Option D is wrong because AWS Glue is for ETL, not data transfer.
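The feasibility arithmetic in this explanation can be checked with a short script. It uses only the figures from the question and assumes decimal units (1 TB = 1000 GB) and a fully utilized link, which real transfers rarely achieve.

```python
LINK_GBPS = 1   # link speed in gigabits per second (from the question)
DAYS = 5        # transfer window in days (from the question)
DATA_TB = 50    # data volume in terabytes (from the question)

gb_per_second = LINK_GBPS / 8            # 8 bits per byte -> 0.125 GB/s
max_gb = gb_per_second * 86_400 * DAYS   # 86,400 seconds per day
max_tb = max_gb / 1000                   # decimal terabytes

# Fraction of the theoretical link capacity the transfer must sustain.
required_utilization = DATA_TB / max_tb

print(f"Theoretical maximum: {max_tb:.0f} TB")
print(f"Required sustained utilization: {required_utilization:.1%}")
```

The transfer fits, but only if it sustains roughly 93% of the link for the full five days, so the margin is thin.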

Full explanation →

260

Multi-Selecthard

Which THREE of the following are best practices when performing exploratory data analysis on a dataset with both numerical and categorical features?

Select 3 answers

A.Check the proportion of missing values for each feature.

B.Compute pairwise correlation coefficients between numerical features.

C.Encode all categorical features using label encoding for simplicity.

D.Include all categorical features with high cardinality as-is in the model.

E.Visualize the distribution of numerical features using histograms and box plots.

AnswersA, B, E

Missing value analysis is a key EDA step.

Why this answer

Option A is correct because checking the proportion of missing values for each feature is a fundamental step in exploratory data analysis (EDA). It helps identify data quality issues, such as systematic missingness, which can bias downstream modeling and inform decisions about imputation strategies or feature exclusion.

Exam trap

The trap here is that candidates may assume label encoding is harmless for categorical features, but it imposes an artificial order that can distort model behavior, especially in linear and distance-based models that treat the integer codes as magnitudes.

Full explanation →

261

MCQeasy

A company is using Amazon SageMaker to train a model and wants to track hyperparameter tuning jobs. Which AWS service is BEST suited to store and query metadata such as tuning job configurations and results?

A.Amazon CloudWatch Logs

B.Amazon S3 with Amazon Athena

C.Amazon SageMaker Experiments

D.Amazon DynamoDB

AnswerC

SageMaker Experiments is the native solution for tracking tuning jobs and their results.

Why this answer

Amazon SageMaker Experiments is the native service to track experiments, trials, and components. SageMaker automatically logs hyperparameters and metrics. Amazon DynamoDB could be used but requires custom integration.

CloudWatch Logs stores logs, not structured metadata. Athena queries data in S3 but is not optimized for real-time tracking. S3 is storage, not a queryable metadata store.

Full explanation →

262

Multi-Selectmedium

A data scientist is analyzing a dataset with 100 features and 10,000 observations. The target variable is binary (0/1). Initial exploratory data analysis reveals that many features have missing values, high correlation with each other, and non-normal distributions. The data scientist wants to identify the most important features for predicting the target while reducing dimensionality. Which TWO actions should the data scientist take? (Choose two.)

Select 2 answers

A.Use chi-squared test to rank features by p-value.

B.Apply Principal Component Analysis (PCA) to reduce dimensionality.

C.Perform a t-test for each feature to compare means between classes.

D.Calculate Pearson correlation coefficients between features and target.

E.Compute mutual information between each feature and the target.

AnswersB, E

PCA reduces dimensionality by creating uncorrelated components, handling multicollinearity.

Why this answer

B is correct because Principal Component Analysis (PCA) is a dimensionality reduction technique that transforms correlated features into a set of linearly uncorrelated principal components, effectively handling high correlation and reducing the feature space. It does not require normality assumptions and can work with missing values after imputation, making it suitable for this dataset.

Exam trap

AWS often tests the misconception that correlation-based methods (like Pearson or chi-squared) are sufficient for feature selection in high-dimensional, non-normal data, when in fact they fail due to assumptions about linearity and distribution.

Full explanation →

263

MCQeasy

During EDA, a data scientist notices that a feature has a high proportion of missing values (e.g., 70%). The feature is continuous and expected to be important based on domain knowledge. What is the best approach to handle this?

A.Remove the feature entirely to avoid bias.

B.Create a binary indicator for missingness and impute the continuous values with the median.

C.Impute missing values with -1 since it is out of range.

D.Drop all rows with missing values in that feature.

AnswerB

This captures both the pattern of missingness and the distribution.

Why this answer

Option B is correct because it preserves the predictive signal from the feature while accounting for the pattern of missingness. Creating a binary indicator allows the model to learn whether missingness itself is informative, and median imputation is robust to outliers for a continuous feature. This approach avoids the bias of dropping the feature entirely and is more principled than arbitrary out-of-range imputation.
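The indicator-plus-median approach can be sketched with the standard library alone. The helper name `add_indicator_and_impute` and the sample values are hypothetical; in practice the median should be computed on the training split only and reused for validation and test data.

```python
from statistics import median


def add_indicator_and_impute(values):
    """Return (imputed_values, missing_flags) for a list with None gaps."""
    observed = [v for v in values if v is not None]
    fill = median(observed)  # robust to the outlier at 90.0 below
    missing_flags = [1 if v is None else 0 for v in values]
    imputed = [fill if v is None else v for v in values]
    return imputed, missing_flags


# Hypothetical continuous feature with 70% of entries missing (7 of 10).
feature = [None, 12.0, None, None, 15.0, None, None, 90.0, None, None]
imputed, flags = add_indicator_and_impute(feature)
print(imputed)
print(flags)
```

The model then receives two columns: the imputed values and the 0/1 flag, so it can learn from the missingness pattern itself.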

Exam trap

The trap here is that candidates often choose to drop the feature or rows without considering that missingness can be a meaningful signal, and that a binary indicator combined with robust imputation is a standard technique for high-missingness continuous features.

How to eliminate wrong answers

Option A is wrong because removing a feature with 70% missing values discards potentially important domain-driven signal, and the missingness itself may be informative. Option C is wrong because imputing with -1 (an arbitrary out-of-range value) can distort the feature's distribution and introduce a false signal that the model may misinterpret as a valid numeric relationship. Option D is wrong because dropping all rows with missing values in that feature would discard 70% of the dataset, leading to severe sample size reduction and potential selection bias.

Full explanation →

264

MCQeasy

A data analyst is exploring a dataset and wants to identify outliers in a numerical feature. Which visualization technique is most effective for detecting outliers?

A.Line chart

B.Histogram

C.Scatter plot

D.Box plot

AnswerD

Box plots display outliers as individual points outside the whiskers.

Why this answer

Option D is correct because a box plot explicitly shows quartiles and potential outliers as points beyond the whiskers. Option B is wrong because a histogram shows distribution but outliers may be in low-frequency bins. Option C is wrong because a scatter plot shows relationship between two variables, not univariate outliers.

Option A is wrong because a line chart is for time series.
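Box plot whiskers are commonly drawn at 1.5 times the interquartile range (IQR) beyond the quartiles, and points outside them are flagged as outliers. A minimal sketch of that rule follows; the function name `iqr_outliers` and the data are hypothetical, and plotting libraries may compute quartiles slightly differently.

```python
from statistics import quantiles


def iqr_outliers(values, k=1.5):
    """Return the values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# Hypothetical numerical feature with one extreme value.
data = [10, 12, 12, 13, 12, 11, 14, 13, 15, 102]
outliers = iqr_outliers(data)
print(outliers)
```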

Full explanation →

265

MCQmedium

A data engineering team needs to ingest streaming data from thousands of IoT devices into a data lake on Amazon S3 for near-real-time analytics. The data must be partitioned by device ID and timestamp, and the team must minimize data loss during ingestion failures. Which solution is MOST appropriate?

A.Use Amazon Kinesis Data Streams with a Lambda function that writes to S3.

B.Use Amazon Kinesis Data Firehose to write directly to S3 with dynamic partitioning.

C.Use Amazon S3 Transfer Acceleration with direct uploads from devices.

D.Use AWS Lambda to receive data via API Gateway and write to S3.

AnswerB

Firehose provides automatic partitioning, retries, and near-real-time delivery to S3.

Why this answer

Option B is correct because Kinesis Data Firehose can directly write to S3 with partitioning and automatic retries, minimizing data loss. Option A is wrong because Kinesis Data Streams requires a separate consumer to write to S3, adding complexity. Option D is wrong because Lambda has a 15-minute limit and may lose data if the function fails.

Option C is wrong because S3 Transfer Acceleration is for speeding up uploads, not for streaming ingestion.

Full explanation →

266

MCQmedium

A company uses Amazon EMR to run Spark jobs on a cluster with 10 core nodes of type r5.xlarge. The jobs are I/O intensive and read large amounts of data from S3. The team notices high network throughput but low CPU utilization. Which configuration change would improve job performance at the same cost?

A.Change the instance type to m5.xlarge (general purpose) to balance resources.

B.Increase the number of core nodes to 20.

C.Replace the core nodes with r5d.xlarge instances that have local SSDs.

D.Use spot instances for the core nodes to save cost and reinvest in more nodes.

AnswerC

Local SSDs provide high I/O for caching, reducing network traffic.

Why this answer

Option C is correct because r5d instances include local NVMe SSDs, which can be used for caching intermediate data, reducing network I/O and improving performance for I/O intensive jobs. Option B is wrong because increasing core nodes increases cost. Option D is wrong because using spot instances reduces cost but not performance.

Option A is wrong because moving to m5 instances (general purpose) may not improve I/O.

Full explanation →

267

Multi-Selectmedium

A machine learning engineer is analyzing a dataset with 500 features and suspects multicollinearity. Which TWO techniques can help identify and address multicollinearity during exploratory data analysis? (Choose TWO.)

Select 2 answers

A.Apply t-SNE for visualization

B.Apply Principal Component Analysis (PCA)

C.Calculate Variance Inflation Factor (VIF) for each feature

D.Generate a correlation matrix heatmap

E.Use Lasso regression to select features

AnswersC, D

A VIF above roughly 5 to 10 indicates multicollinearity.

Why this answer

Variance Inflation Factor (VIF) measures how much the variance of a regression coefficient is inflated due to multicollinearity. Correlation matrix heatmap shows pairwise correlations. PCA reduces dimensionality but does not directly identify multicollinearity.

Lasso regression addresses it via regularization but is a modeling step. t-SNE is for visualization of high-dimensional data.
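VIF for a feature is 1 / (1 - R²), where R² comes from regressing that feature on the others. The sketch below covers only the two-predictor case, where R² equals the squared Pearson correlation; the function names and data are hypothetical, and a real 500-feature dataset would need a full regression per feature.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def vif_two_predictors(x1, x2):
    """VIF = 1 / (1 - R^2); with two predictors, R^2 equals r^2."""
    r = pearson(x1, x2)
    return 1.0 / (1.0 - r ** 2)


x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [1.1, 1.9, 3.2, 3.9, 5.1, 5.8]   # nearly a copy of x1
x3 = [2.0, -1.0, 3.0, 0.0, 1.0, 2.5]  # only weakly related to x1

print(vif_two_predictors(x1, x2))  # far above the 5-10 rule of thumb
print(vif_two_predictors(x1, x3))  # close to 1, no concern
```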

Full explanation →

268

Multi-Selecteasy

A data engineer is building a data pipeline using AWS Glue. The pipeline reads data from Amazon S3, transforms it, and writes it back to S3 in a different format. The engineer needs to handle schema evolution (new columns added over time). Which TWO features of AWS Glue can help manage schema evolution?

Select 2 answers

A.AWS Glue Data Catalog

B.AWS Glue DynamicFrame

C.AWS Lake Formation

D.Amazon Athena

E.Amazon S3 object tags

AnswersA, B

Data Catalog stores schema and can be updated as schema evolves.

Why this answer

Options A and B are correct. The Glue Data Catalog can store schema and update it as new columns are added. DynamicFrame in Glue ETL can handle schema changes automatically by allowing optional fields.

Option C is wrong because AWS Lake Formation is for data lake security, not schema evolution. Option D is wrong because Amazon Athena is a query engine, not a schema evolution tool. Option E is wrong because S3 object tags are not for schema management.

Full explanation →

269

MCQhard

A data scientist is building a training dataset from data stored in Amazon S3. The data consists of JSON files each containing a 'timestamp' field. The scientist wants to use AWS Glue to catalog the data and enable querying via Amazon Athena. However, Athena queries are returning zero results for time-range filters. What is the most likely cause?

A.The AWS Glue crawler does not have permissions to read the S3 bucket.

B.Athena cannot query nested JSON objects.

C.The JSON files are not in the correct format for Athena.

D.The 'timestamp' field is not defined as a partition column in the Glue table.

AnswerD

Without partitioning, Athena scans all data and time-range filters still work; the zero results described in the question could be due to incorrect partition pruning.

Why this answer

Athena uses the partition columns derived from the Glue catalog. If the timestamp column is not used as a partition key, queries that filter on it will scan all data. Option D (timestamp is not a partition column) is correct because Glue can partition the table by date, but only if the user sets this up.

Option C (wrong file format) is unlikely if JSON is supported. Option B (Athena cannot query nested JSON) is false; Athena supports JSON. Option A (insufficient permissions) would cause a different error.

Full explanation →

270

MCQmedium

A company runs a real-time fraud detection system using Amazon SageMaker. The model is deployed as a SageMaker endpoint and receives predictions within milliseconds. Recently, the model's accuracy has degraded due to data drift. The data scientists want to monitor the model's performance continuously. What is the most effective way to detect data drift?

A.Store all incoming requests in Amazon S3 and use Amazon Athena to run periodic SQL queries for drift detection

B.Set up Amazon CloudWatch anomaly detection on the endpoint's invocation count and latency metrics

C.Enable Amazon SageMaker Model Monitor to capture inference data and compare it against a baseline dataset

D.Use Amazon CloudWatch Logs Insights to analyze inference logs and set custom alarms

AnswerC

Model Monitor compares captured inference data against a baseline to detect drift.

Why this answer

Option C is correct because SageMaker Model Monitor can automatically detect data drift by comparing incoming data against a baseline. Option D is wrong because CloudWatch Logs Insights can query logs but not automatically detect drift. Option A is wrong because storing requests in S3 and using Athena is batch-oriented and not automated.

Option B is wrong because CloudWatch anomaly detection is generic and not specialized for ML model drift.

Full explanation →

271

Multi-Selecteasy

Which TWO actions are appropriate when dealing with outliers in a dataset during exploratory data analysis? (Select TWO.)

Select 2 answers

A.Replace the mean with the median for numerical features.

B.Apply log transformation to reduce the impact of extreme values.

C.Remove all outliers without further investigation.

D.Use visualization techniques like box plots to identify outliers.

E.Assume outliers are errors and delete them.

AnswersB, D

Log transformation can compress skewed distributions and reduce outlier influence.

Why this answer

Option B is correct because applying a log transformation compresses the range of the data, reducing the influence of extreme values without removing them. This is a common technique in exploratory data analysis for right-skewed distributions, as it can make the data more normally distributed and improve the performance of models that assume normality.
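The compression effect of a log transformation is easy to see on a small example. The values below are hypothetical; `math.log1p` computes log(1 + v), which is a common choice because it is defined at zero.

```python
import math

values = [1, 2, 3, 5, 8, 1000]            # hypothetical right-skewed feature
logged = [math.log1p(v) for v in values]  # log(1 + v) handles zeros safely

raw_ratio = max(values) / min(values)     # spread before the transform
log_ratio = max(logged) / min(logged)     # spread after the transform

print(f"max/min before: {raw_ratio:.0f}")
print(f"max/min after:  {log_ratio:.1f}")
```

The extreme value stays in the dataset and keeps its rank, but its distance from the other points shrinks by about two orders of magnitude.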

Exam trap

AWS often tests the distinction between data transformation techniques (like log transformation) and data removal or replacement strategies, trapping candidates who think that simply changing a summary statistic (mean to median) or deleting outliers without investigation is a proper handling method.

Full explanation →

272

Multi-Selectmedium

A data scientist is training a deep neural network on Amazon SageMaker. The training is taking a long time and the data scientist wants to speed it up. Which THREE actions can help reduce training time?

Select 3 answers

A.Use GPU instances instead of CPU instances

B.Use distributed training across multiple instances

C.Use Pipe mode to stream data from S3

D.Increase the batch size

E.Use a smaller instance type

AnswersA, B, C

GPUs accelerate deep learning computations.

Why this answer

GPU instances (e.g., P3, P4d) are optimized for the massively parallel matrix operations required by deep neural networks, providing orders-of-magnitude faster computation than CPU instances for training tasks. By offloading tensor operations to GPU cores, the training time is significantly reduced, especially for large models and datasets.

Exam trap

AWS often tests the misconception that increasing batch size always speeds up training, but candidates overlook the memory constraints and potential negative impact on model accuracy, while also confusing smaller instance types as a cost-saving measure that inadvertently slows training.

Full explanation →

273

MCQmedium

A data engineer ingests streaming data into Amazon Kinesis Data Streams. The data science team needs to analyze the data using Amazon SageMaker notebooks. What is the most efficient way to provide access to the stream data for ad-hoc exploration?

A.Create an AWS Lambda function to transform and write data to DynamoDB, then query DynamoDB from the notebook.

B.Configure a Kinesis Firehose delivery stream to deliver data to an S3 bucket, then query the data from the notebook using Athena.

C.Install the Kinesis Agent on the SageMaker notebook instance and configure it to write data to a local file.

D.Use the Kinesis connector for Spark to read data directly from the stream into a Spark DataFrame in the notebook.

AnswerD

Direct, real-time access for ad-hoc exploration.

Why this answer

Using the Kinesis connector for Spark in a SageMaker notebook (Option D) allows reading from the stream directly. Option B is wrong because S3 ingestion adds latency and additional steps. Option C is wrong because Kinesis Agent is for data producers, not consumers.

Option A is wrong because Lambda transformation is not needed for exploration.

Full explanation →

274

MCQeasy

A startup is deploying a machine learning model for real-time recommendation on Amazon SageMaker. The model is a TensorFlow model (1 GB) and the endpoint uses a single ml.c5.2xlarge instance. The inference latency is currently 500 ms per request. The startup expects traffic to increase 10x in the next month. They want to maintain latency under 500 ms. What is the most cost-effective solution?

A.Use SageMaker Batch Transform to process requests in batches

B.Switch to a GPU instance type for faster inference

C.Set up auto-scaling for the endpoint based on average latency or request count

D.Upgrade to a larger CPU instance type, such as ml.c5.4xlarge

AnswerC

Auto-scaling adds capacity dynamically, handling traffic spikes cost-effectively.

Why this answer

Option C is correct because auto-scaling adds instances only when needed, handling increased traffic while keeping latency low. Option D (larger instance) is more expensive and may not be needed. Option B (GPU) is overkill and costly.

Option A (batch) is not real-time.

Full explanation →

275

MCQmedium

A data scientist is analyzing a dataset with 50 features and 10,000 samples. After generating a correlation matrix, they notice several pairs of features have correlation coefficients above 0.95. What should the data scientist do to prepare the data for linear regression?

A.Apply PCA to reduce dimensionality to 10 components.

B.Remove one feature from each highly correlated pair.

C.Drop all features with correlation above 0.95.

D.Standardize all features using StandardScaler.

AnswerB

Reduces multicollinearity while retaining most information.

Why this answer

Option B is correct because high correlation between features indicates multicollinearity, which can destabilize linear regression coefficients. Removing one feature from each highly correlated pair reduces redundancy. Option C is wrong because dropping all correlated features may discard useful information.

Option D is wrong because standardizing does not address multicollinearity. Option A is wrong because PCA creates new features that lose interpretability.
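Removing one feature from each highly correlated pair can be sketched as a greedy pass over the columns. The function names, feature names, and values are hypothetical; which member of a pair to keep is a modeling decision, and this sketch simply keeps the first one it encounters.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def drop_correlated(features, threshold=0.95):
    """Keep a feature only if |r| <= threshold with every feature kept so far."""
    kept = []
    for name, column in features.items():
        if all(abs(pearson(column, features[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


features = {
    "height_cm": [150.0, 160.0, 170.0, 180.0, 190.0],
    "height_in": [59.1, 63.0, 66.9, 70.9, 74.8],  # near-duplicate of height_cm
    "age": [34.0, 22.0, 45.0, 29.0, 38.0],
}
kept = drop_correlated(features)
print(kept)
```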

Full explanation →

276

MCQeasy

A data scientist is training a binary classification model on an imbalanced dataset where the positive class accounts for 5% of the data. The model achieves 95% accuracy but has a recall of only 10% for the positive class. Which metric should the data scientist primarily use to evaluate model performance?

A.RMSE

B.F1 Score

C.Accuracy

D.AUC-ROC

AnswerB

F1 score considers both precision and recall.

Why this answer

The F1 Score is the harmonic mean of precision and recall, making it ideal for imbalanced datasets where accuracy is misleading. With 95% accuracy but only 10% recall, the model is simply predicting the majority class (negative) almost always, so F1 Score captures the trade-off between false positives and false negatives better than accuracy or AUC-ROC.
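A worked example makes the gap between accuracy and F1 concrete. The dataset size of 1,000 is an assumption chosen for round numbers; the class balance, accuracy, and recall come from the question, and the remaining confusion-matrix cells follow from them.

```python
# Hypothetical dataset of 1,000 samples consistent with the question:
# 5% positives, 95% accuracy, 10% recall on the positive class.
total, positives = 1000, 50
tp = 5                          # 10% recall: 5 of 50 positives found
fn = positives - tp             # 45 positives missed
tn = 950 - tp                   # 95% accuracy: 950 correct predictions overall
fp = (total - positives) - tn   # remaining negatives are false positives

accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
```

Under these assumptions the F1 score is about 0.17, which exposes the weak positive-class performance that 95% accuracy hides.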

Exam trap

AWS often tests the misconception that high accuracy always indicates good model performance, especially on imbalanced datasets, leading candidates to overlook metrics like F1 Score that account for class distribution.

How to eliminate wrong answers

Option A is wrong because RMSE (Root Mean Squared Error) is a regression metric that measures the square root of the average squared differences between predicted and actual values, not applicable to binary classification. Option C is wrong because accuracy is misleading on imbalanced datasets; a model that always predicts the negative class achieves 95% accuracy but fails to identify the positive class (5% prevalence), as seen with 10% recall. Option D is wrong because AUC-ROC can be overly optimistic on highly imbalanced data; it measures the area under the ROC curve (TPR vs FPR), but with only 5% positives, the FPR remains low even if the model rarely predicts positive, giving a falsely high score.

Full explanation →

277

MCQeasy

A company is using Amazon SageMaker to train a model and wants to automatically retrain the model every week using new data. Which AWS service should be used to orchestrate the retraining pipeline?

A.Amazon CloudWatch Events

B.AWS Lambda

C.AWS Step Functions

D.AWS Data Pipeline

AnswerC

Step Functions can orchestrate multiple SageMaker API calls and handle retries.

Why this answer

AWS Step Functions is the recommended service for orchestrating complex workflows, including SageMaker training jobs, model creation, and deployment. Lambda is for short functions. CloudWatch Events can trigger on a schedule but not orchestrate a pipeline.

Data Pipeline is older and less flexible. S3 Events trigger on object creation but not scheduling.

Full explanation →

278

MCQeasy

A data scientist is using SageMaker to train a model. The training data is stored in an S3 bucket in a different AWS account. What is required to allow SageMaker to access the data?

A.Configure the SageMaker execution role with a policy that grants cross-account access to the S3 bucket.

B.Set up VPC peering between the two accounts.

C.Create a SageMaker notebook instance in the same account as the S3 bucket.

D.Launch the training job from a SageMaker notebook in the account containing the S3 bucket.

AnswerA

The IAM role used by SageMaker must have permissions to access the S3 bucket in the other account.

Why this answer

Option A is correct because SageMaker uses an IAM execution role to access resources. To allow cross-account access to an S3 bucket, the SageMaker execution role must have an IAM policy that grants s3:GetObject and s3:ListBucket permissions for the bucket, and the S3 bucket policy must also grant cross-account access to that role. This is the standard AWS mechanism for cross-account resource access.
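The two policies involved can be sketched as follows. The bucket name, account IDs, and role ARN are hypothetical placeholders; the point is that the same S3 permissions appear twice, once on the role and once on the bucket with the role named as principal.

```python
import json

# Hypothetical names: a bucket owned by the data account and a SageMaker
# execution role in the training account (111122223333).
BUCKET = "example-training-data-bucket"
ROLE_ARN = "arn:aws:iam::111122223333:role/ExampleSageMakerExecutionRole"

read_objects = {"Effect": "Allow", "Action": "s3:GetObject",
                "Resource": f"arn:aws:s3:::{BUCKET}/*"}
list_bucket = {"Effect": "Allow", "Action": "s3:ListBucket",
               "Resource": f"arn:aws:s3:::{BUCKET}"}

# Identity policy attached to the execution role (training account).
role_policy = {"Version": "2012-10-17",
               "Statement": [read_objects, list_bucket]}

# Bucket policy attached to the bucket (data account): same permissions,
# but each statement names the execution role as the principal.
bucket_policy = {
    "Version": "2012-10-17",
    "Statement": [dict(stmt, Principal={"AWS": ROLE_ARN})
                  for stmt in role_policy["Statement"]],
}

print(json.dumps(bucket_policy, indent=2))
```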

Exam trap

The trap here is that candidates often confuse network-level solutions (VPC peering) with IAM-based access control, or assume that running the job from the same account as the data automatically grants access, ignoring that the SageMaker execution role is the key security boundary.

How to eliminate wrong answers

Option B is wrong because VPC peering is used for network connectivity between VPCs, not for granting IAM-based data access permissions; SageMaker accesses S3 via AWS APIs, not through VPC peering. Option C is wrong because creating a SageMaker notebook instance in the same account as the S3 bucket does not resolve the cross-account access issue; the training job still runs in the original account and requires proper IAM permissions. Option D is wrong because launching the training job from a notebook in the account containing the S3 bucket does not change the fact that the training job runs under the execution role of the original account; cross-account access must be explicitly configured via IAM policies.

Full explanation →

279

Multi-Selectmedium

A company is deploying a machine learning model using Amazon SageMaker. The model needs to be updated frequently. Which THREE practices should the company implement for model versioning and deployment?

Select 3 answers

A.Use AWS CodePipeline to automate the training and deployment pipeline.

B.Use the SageMaker Model Registry to catalog model versions.

C.Manually update the endpoint configuration each time.

D.Store all training datasets in a single S3 bucket without versioning.

E.Deploy new model versions using canary deployments with SageMaker endpoints.

AnswersA, B, E

CodePipeline automates CI/CD.

Why this answer

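Option A is correct because AWS CodePipeline automates the training and deployment pipeline. Option B is correct because the SageMaker Model Registry catalogs model versions. Option E is correct because canary deployments shift a small share of traffic to the new version before a full rollout.

Option C is wrong because manually updating the endpoint configuration is error-prone and does not scale with frequent updates. Option D is wrong because storing training datasets without versioning makes it impossible to trace which data produced which model version.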

Full explanation →

280

MCQeasy

A team is training a binary classifier and obtains a confusion matrix with 100 true positives, 10 false positives, 20 false negatives, and 200 true negatives. What is the precision of the model?

A.0.91

B.0.87

C.0.94

D.0.83

AnswerA

Precision = 100/(100+10)=0.91.

Why this answer

Precision is calculated as TP / (TP + FP). With 100 true positives and 10 false positives, precision = 100 / (100 + 10) = 100 / 110 ≈ 0.909, which rounds to 0.91. This metric measures how many of the positive predictions were actually correct.

Exam trap

The trap here is that candidates often confuse precision with recall or accuracy, especially when the numbers are close, leading them to pick 0.83 (recall) or miscalculate the denominator.

How to eliminate wrong answers

Option B (0.87) is wrong because it corresponds to the F1 score (2 × 0.909 × 0.833 / (0.909 + 0.833) ≈ 0.870), not precision. Option C (0.94) is wrong because it does not match any standard metric for this matrix; accuracy is (TP + TN) / total = 300/330 ≈ 0.909 and specificity is TN / (TN + FP) = 200/210 ≈ 0.952. Option D (0.83) is wrong because it represents recall (TP / (TP + FN) = 100/120 ≈ 0.833), not precision.
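
The metrics involved in this question can be checked directly from the four confusion-matrix counts. A minimal sketch in plain Python (the variable names are illustrative):

```python
# Confusion-matrix counts from the question
tp, fp, fn, tn = 100, 10, 20, 200

precision = tp / (tp + fp)                     # correct positives among predicted positives
recall = tp / (tp + fn)                        # correct positives among actual positives
accuracy = (tp + tn) / (tp + fp + fn + tn)     # correct predictions among all predictions
specificity = tn / (tn + fp)                   # correct negatives among actual negatives
f1 = 2 * precision * recall / (precision + recall)

for name, value in [
    ("precision", precision),
    ("recall", recall),
    ("accuracy", accuracy),
    ("specificity", specificity),
    ("f1", f1),
]:
    print(f"{name:<12}{value:.3f}")
```

With these counts, accuracy happens to equal precision (both 300/330 and 100/110 reduce to 10/11), so the two cannot be told apart from the answer options alone.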

Full explanation →

281

MCQhard

A data scientist is training a deep learning model for object detection. The training loss decreases rapidly in the first few epochs but then plateaus at a high value. The validation loss starts increasing after a few epochs. Which adjustment is MOST likely to improve generalization?

A.Add more convolutional layers

B.Use more aggressive data augmentation

C.Increase the learning rate

D.Implement early stopping with a patience parameter

AnswerD

Early stopping prevents overfitting by terminating training when validation loss degrades.

Why this answer

The key signal is the validation loss rising after a few epochs while training continues, which indicates that the model has started to overfit. Early stopping with a patience parameter halts training once validation performance has failed to improve for a set number of epochs, so the model is kept near its best validation loss instead of memorizing noise. This addresses the overfitting directly without altering the model architecture or data distribution.

Exam trap

AWS often tests the distinction between underfitting and overfitting symptoms, and candidates may mistakenly choose data augmentation (Option B) as a universal fix, but the plateauing training loss and rising validation loss specifically indicate overfitting, where early stopping is the most direct remedy.

How to eliminate wrong answers

Option A is wrong because adding more convolutional layers increases model capacity, which would exacerbate overfitting and likely worsen the validation loss increase. Option B is wrong because more aggressive data augmentation could help reduce overfitting, but the question asks for the adjustment most likely to improve generalization given the specific symptoms; early stopping is a more direct and immediate fix for the observed plateau and divergence, whereas augmentation might not address the core issue of training too long. Option C is wrong because increasing the learning rate would cause the loss to oscillate or diverge, not improve generalization, and the training loss is already plateauing, indicating the optimizer is near a minimum.
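
Early stopping with patience can be sketched in a few lines. This is a framework-independent illustration, assuming per-epoch validation losses are already available as a list; the function name and the `min_delta` parameter are hypothetical, not part of any particular library.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return (stop_epoch, best_epoch) for a sequence of per-epoch validation losses.

    Training stops once the loss has failed to improve by more than
    min_delta for `patience` consecutive epochs.
    """
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0

    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            # New best: remember it and reset the patience counter
            best_loss = loss
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch

    # Ran out of epochs without exhausting patience
    return len(val_losses) - 1, best_epoch


# Validation loss improves, then rises, as in the question
val_losses = [0.90, 0.70, 0.62, 0.60, 0.63, 0.66, 0.71, 0.78]
stop_epoch, best_epoch = early_stopping_epoch(val_losses, patience=3)
print(f"stop at epoch {stop_epoch}, restore weights from epoch {best_epoch}")
```

In practice the weights from `best_epoch` are restored after stopping, which is why frameworks usually pair early stopping with checkpointing.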

Full explanation →

282

MCQmedium

A data scientist is deploying a machine learning model using SageMaker and wants to automate the retraining pipeline. The training data is updated daily in an S3 bucket. Which combination of AWS services should the data scientist use to trigger a new training job when new data arrives?

A.Amazon SQS queue to store S3 events and a cron job to poll and start training

B.Use SageMaker Pipelines with a schedule to check for new data every hour

C.Amazon S3 event notification to directly start a SageMaker training job

D.Amazon CloudWatch Events to run an AWS Step Functions state machine that starts a SageMaker training job

E.Amazon CloudWatch Events to invoke an AWS Lambda function that starts a SageMaker training job

AnswerE

CloudWatch Events can capture S3 events and invoke Lambda to start training.

Why this answer

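Option E is correct because CloudWatch Events (now Amazon EventBridge) can capture S3 object events and invoke a Lambda function, which starts a SageMaker training job. Option A is wrong because an SQS queue polled by a cron job adds delay and operational overhead. Option B is wrong because an hourly schedule is polling, not an event-driven trigger.

Option C is wrong because S3 event notifications cannot start a SageMaker training job directly; they need an intermediary such as Lambda. Option D is wrong because Step Functions adds orchestration that a single-step trigger does not need.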
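
A Lambda function for this trigger could look roughly like the sketch below. It assumes the event arrives in the EventBridge "Object Created" shape (`detail.bucket.name`, `detail.object.key`); the role ARN, image URI, output location, and instance settings are hypothetical placeholders. The optional `sagemaker_client` argument exists only so the handler can be exercised without AWS credentials.

```python
import time

# Hypothetical placeholders; replace with real values for an actual deployment
ROLE_ARN = "arn:aws:iam::111122223333:role/ExampleSageMakerExecutionRole"
IMAGE_URI = "111122223333.dkr.ecr.us-east-1.amazonaws.com/example-training:latest"
OUTPUT_URI = "s3://example-model-artifacts/output/"


def build_training_request(bucket, key):
    """Build the keyword arguments for the SageMaker CreateTrainingJob call."""
    return {
        "TrainingJobName": f"retrain-{int(time.time())}",
        "AlgorithmSpecification": {
            "TrainingImage": IMAGE_URI,
            "TrainingInputMode": "File",
        },
        "RoleArn": ROLE_ARN,
        "InputDataConfig": [
            {
                "ChannelName": "train",
                "DataSource": {
                    "S3DataSource": {
                        "S3DataType": "S3Prefix",
                        "S3Uri": f"s3://{bucket}/{key}",
                        "S3DataDistributionType": "FullyReplicated",
                    }
                },
            }
        ],
        "OutputDataConfig": {"S3OutputPath": OUTPUT_URI},
        "ResourceConfig": {
            "InstanceType": "ml.m5.xlarge",
            "InstanceCount": 1,
            "VolumeSizeInGB": 30,
        },
        "StoppingCondition": {"MaxRuntimeInSeconds": 3600},
    }


def lambda_handler(event, context, sagemaker_client=None):
    """Start a training job for the S3 object named in the incoming event."""
    if sagemaker_client is None:
        import boto3  # available by default in the Lambda Python runtime

        sagemaker_client = boto3.client("sagemaker")

    detail = event["detail"]
    request = build_training_request(detail["bucket"]["name"], detail["object"]["key"])
    sagemaker_client.create_training_job(**request)
    return request["TrainingJobName"]
```

Keeping the request construction in its own function makes the trigger logic easy to test separately from the AWS call.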

Full explanation →

283

MCQeasy

A data scientist is deploying a model using Amazon SageMaker for real-time inference. The model is memory-intensive and requires a GPU. Which instance type should be selected for the endpoint?

A.i3.2xlarge

B.c5.2xlarge

C.r5.2xlarge

D.p3.2xlarge

AnswerD

GPU instance suitable for memory-intensive models.

Why this answer

The p3.2xlarge instance is correct because it provides a GPU (NVIDIA Tesla V100) with high memory bandwidth, which is essential for memory-intensive deep learning models requiring GPU acceleration for real-time inference. SageMaker endpoints for GPU-based models must use instance types from the P or G families, as CPU-only instances like i3, c5, or r5 lack the parallel processing capabilities needed for efficient GPU inference.

Exam trap

AWS often tests the distinction between CPU-only instance families (c5, r5, i3) and GPU-accelerated families (p3, g4dn), where candidates mistakenly assume that high RAM (r5) or high compute (c5) can substitute for a GPU, ignoring the fundamental hardware requirement for GPU-based inference.

How to eliminate wrong answers

Option A (i3.2xlarge) is wrong because it is a storage-optimized instance with NVMe SSD storage, designed for high I/O workloads, not for GPU-accelerated inference. Option B (c5.2xlarge) is wrong because it is a compute-optimized instance with only CPUs, lacking a GPU, which is explicitly required for the memory-intensive model. Option C (r5.2xlarge) is wrong because it is a memory-optimized instance with high RAM but no GPU, making it unsuitable for GPU-dependent inference tasks.

Full explanation →

284

Multi-Selectmedium

A company uses SageMaker to train a model. The training job is taking too long and the data scientist wants to speed it up. Which THREE strategies should the data scientist consider? (Select THREE.)

Select 3 answers

A.Reduce the number of training epochs

B.Use a GPU instance type like ml.p3.2xlarge

C.Use distributed training with multiple instances

D.Use Pipe input mode to stream data from S3

E.Increase the batch size in the training script

AnswersB, C, D

GPUs accelerate training for deep learning.

Why this answer

Options B, C, and D are correct. B: Using a GPU instance accelerates compute. C: Distributed training reduces time.

D: Using Pipe mode reduces I/O time. A (reducing epochs) may reduce accuracy. E (increasing batch size) can help but may cause memory issues.

Full explanation →

285

MCQhard

A company stores sensitive customer data in an S3 bucket. The security team requires that all data be encrypted at rest with a key that is automatically rotated every year. Which solution meets these requirements with the least operational overhead?

A.Use SSE-KMS with a customer-managed key and automatic rotation

B.Use SSE-C (customer-provided keys)

C.Use SSE-S3 (Amazon S3-managed keys)

D.Use SSE-KMS with a customer-managed key and manual rotation

AnswerC

SSE-S3 automatically rotates keys and requires no customer management.

Why this answer

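SSE-S3 encrypts objects with keys that Amazon S3 manages and rotates automatically, so there is nothing for the customer to configure. SSE-KMS with a customer-managed key and automatic rotation (Option A) also meets the rotation requirement but adds KMS key management. SSE-C (Option B) requires the customer to supply, store, and rotate the keys themselves. SSE-KMS with manual rotation (Option D) adds the most overhead.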

Full explanation →

286

Multi-Selecthard

A data science team is training a large deep learning model using Amazon SageMaker. The training job is taking a long time because the model has many layers and the dataset is large. The team wants to reduce training time by distributing the training across multiple GPUs on a single instance, as well as across multiple instances. Which TWO actions should the team take? (Choose two.)

Select 2 answers

A.Use SageMaker's distributed data parallelism (SMDDP) library to shard the model across GPUs.

B.Configure the training job to use SageMaker's model parallelism (SMP) library for pipeline or tensor parallelism.

C.Use SageMaker's managed training with a single instance containing multiple GPUs and enable data parallelism.

D.Use Horovod for data parallelism across multiple instances.

E.Set the instance type to a single GPU instance and rely on automatic model parallelism.

AnswersB, D

SMP allows splitting the model across multiple GPUs and instances, reducing memory footprint per GPU and enabling training of large models that would otherwise not fit. This complements data parallelism.

Why this answer

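Option B is correct because the SageMaker model parallelism (SMP) library is specifically designed to split large deep learning models across multiple GPUs using pipeline or tensor parallelism. This allows the team to train models that are too large to fit on a single GPU and to reduce training time by parallelizing computation across devices within and across instances. Option D is correct because Horovod provides data parallelism across GPUs and instances: each worker holds a model replica, processes a different shard of each batch, and averages gradients with the other workers.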

Exam trap

The trap here is that candidates often confuse data parallelism (which shards data) with model parallelism (which shards the model), and assume that simply using multiple GPUs on a single instance automatically distributes the model, when in fact explicit model parallelism libraries like SMP are required for large models that do not fit in GPU memory.

Full explanation →

287

MCQeasy

A data engineering team needs to ingest streaming data from thousands of IoT devices into Amazon S3 for near-real-time analytics. The data arrives in bursts and must be processed with minimal latency. Which AWS service is most appropriate for the ingestion layer?

A.Amazon Kinesis Data Streams

B.Amazon Kinesis Data Firehose

C.Amazon SQS

D.Amazon S3

AnswerA

Kinesis Data Streams provides low-latency, real-time data ingestion.

Why this answer

Amazon Kinesis Data Streams is designed for real-time data streaming with low latency and can handle high throughput from many sources. Option B is wrong because Kinesis Data Firehose buffers streaming data before loading it into destinations, which adds latency. Option C is wrong because SQS is for decoupled messaging, not streaming.

Option D is wrong because S3 itself is not a streaming ingestion service.

Full explanation →

288

MCQeasy

A data scientist is building a regression model to predict house prices. The dataset has 10 features, and the model shows high variance with a low bias. Which technique should the data scientist use to reduce variance?

A.Apply L2 regularization to the model.

B.Increase the depth of decision trees in the ensemble.

C.Add more features to the model.

D.Reduce the amount of training data.

AnswerA

L2 regularization reduces variance by penalizing large coefficients.

Why this answer

L2 regularization (Ridge regression) penalizes large coefficients by adding a squared magnitude term to the loss function, which shrinks the model's weights and reduces variance without substantially increasing bias. This directly addresses the high-variance, low-bias symptom, making the model less sensitive to fluctuations in the training data.

Exam trap

AWS often tests the misconception that adding features always helps. High variance is best addressed by regularization or a simpler model, not by increasing complexity or reducing the training data.

How to eliminate wrong answers

Option B is wrong because increasing the depth of decision trees in an ensemble (e.g., random forest or gradient boosting) increases model complexity, which typically raises variance and worsens overfitting, not reduces it. Option C is wrong because adding more features increases the dimensionality and capacity of the model, which tends to increase variance further, especially when the current model already shows high variance. Option D is wrong because reducing the amount of training data generally increases variance (the model becomes more sensitive to the specific sample) and can also increase bias due to insufficient learning, which is the opposite of the desired effect.
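
The shrinking effect of the L2 penalty is easiest to see in the one-feature case. The sketch below assumes a no-intercept model y ≈ w·x fitted by minimizing Σ(y − w·x)² + λ·w², whose closed-form solution is w = Σxy / (Σx² + λ); the data values are made up for illustration.

```python
def ridge_slope(xs, ys, lam):
    """Closed-form slope for one-feature, no-intercept ridge regression."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    # The penalty lam is added to the denominator, pulling the slope toward zero
    return sxy / (sxx + lam)


# Made-up data lying roughly on y = 2x
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

for lam in (0.0, 1.0, 10.0):
    print(f"lambda={lam:>4}: slope={ridge_slope(xs, ys, lam):.3f}")
```

With λ = 0 this is ordinary least squares; larger λ values give smaller coefficients, which makes the fit less sensitive to the particular training sample.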

Full explanation →

289

MCQmedium

A company is using Amazon SageMaker to deploy a model that predicts customer churn. The model was trained using a linear learner algorithm. During inference, the endpoint returns predictions that are always 0.5 (the probability of churn). What is the most likely cause?

A.The dataset is highly imbalanced, and the model is predicting the majority class

B.The model was trained with too few epochs

C.The input features are not normalized

D.The learning rate is set too high, causing the model to converge to the mean prediction

AnswerD

A high learning rate can cause the model to overshoot and settle at the mean of the target variable.

Why this answer

If the model always outputs 0.5, it suggests that the model is not learning and is stuck at the prior probability. This often happens when the learning rate is too high (causing divergence) or too low (causing slow convergence) so that the model does not update weights. The other options would cause different symptoms: data imbalance might bias towards 0 or 1, not exactly 0.5; feature scaling issues typically cause NaN or poor convergence; insufficient epochs might not converge but not necessarily give exactly 0.5.

Full explanation →

290

Multi-Selectmedium

A data scientist is using Amazon SageMaker to train a linear regression model. The training data contains outliers. Which THREE techniques can mitigate the impact of outliers?

Select 3 answers

A.Remove observations with outlier values from the dataset.

B.Increase the number of layers in the model.

C.Standardize the features to have mean zero and unit variance.

D.Apply winsorization to the feature values.

E.Use a loss function that is robust to outliers, such as Huber loss.

AnswersA, D, E

Direct removal eliminates outlier impact.

Why this answer

Option A is correct because removing observations with outlier values directly eliminates data points that can disproportionately influence the linear regression coefficients, leading to a more stable and representative model. In Amazon SageMaker, this can be done during data preprocessing using built-in algorithms or custom scripts in a SageMaker Processing job.

Exam trap

AWS often tests the misconception that feature scaling (standardization) alone can handle outliers, but scaling does not reduce the leverage of extreme values; it only changes their numeric range.
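
Two of the techniques named in the options can be sketched in plain Python. The clipping bounds and the `delta` threshold are assumptions chosen for illustration; winsorization is usually done at chosen percentiles, which are taken here as precomputed inputs.

```python
def huber_loss(residual, delta=1.0):
    """Huber loss: quadratic for small residuals, linear for large ones."""
    magnitude = abs(residual)
    if magnitude <= delta:
        return 0.5 * residual ** 2
    # Linear growth limits the influence of outliers on the fit
    return delta * (magnitude - 0.5 * delta)


def winsorize(values, lower, upper):
    """Clip values to [lower, upper] instead of dropping them."""
    return [min(max(v, lower), upper) for v in values]


# A residual of 10 contributes 50.0 under squared error (0.5 * r^2) but far less under Huber
print(huber_loss(0.5), huber_loss(10.0))
print(winsorize([1, 50, 1000], lower=5, upper=100))
```

Unlike removal, winsorization keeps every observation, and unlike standardization, both techniques actually limit how far an extreme value can pull the fit.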

Full explanation →

291

MCQeasy

The exhibit shows a data quality report for a column named 'age'. Which potential data issue should be investigated further?

A.The mean and median are significantly different

B.The minimum age of 0 and maximum age of 120 may be outliers

C.The missing value rate of 2.3% is too high

D.The number of unique values (85) is too high

AnswerB

Age 0 and 120 are likely data errors.

Why this answer

Option B is correct because ages of 0 and 120 are likely data entry errors and should be investigated. Option A is wrong because the mean and median are close. Option C is wrong because 2.3% missing is relatively low and may be acceptable.

Option D is wrong because 85 unique values for age is reasonable.

Full explanation →

292

MCQhard

During exploratory data analysis, a data scientist notices that the correlation matrix of features shows many pairs with absolute correlation > 0.95. The dataset includes both numerical and categorical variables. Which technique is most appropriate to reduce multicollinearity while preserving the most information?

A.Apply Principal Component Analysis (PCA) to the features.

B.Use only one-hot encoded categorical features.

C.Apply L1 regularization during model training.

D.Remove one feature from each highly correlated pair.

AnswerA

PCA reduces dimensionality and decorrelates features.

Why this answer

Option A is correct because PCA is a dimensionality reduction technique that handles multicollinearity by creating orthogonal components. Option D (remove one feature from each pair) is ad hoc and discards information; Option C (L1 regularization) applies during modeling, not EDA; Option B (use only one-hot encoded features) loses information.

Full explanation →

293

MCQeasy

A data engineer needs to load data from a MySQL database to Amazon S3 daily. The database is 500 GB and the load window is 2 hours. The data must be extracted without impacting the source database performance. Which AWS service should be used to perform the extraction?

A.AWS Glue ETL job using a JDBC connection to read the full table.

B.AWS Database Migration Service (AWS DMS) with a full-load task to S3.

C.Amazon Athena with the MySQL federated query connector.

D.Amazon EMR with a Spark job reading from MySQL via JDBC.

AnswerB

DMS is designed for minimal impact migration and can load data directly to S3.

Why this answer

Option B is correct because AWS Database Migration Service (DMS) can perform continuous or scheduled full-load and change-data-capture tasks from MySQL to S3 with little impact on source performance when the task settings and replication instance are chosen appropriately. Option A is wrong because AWS Glue can connect to JDBC sources but may cause higher overhead on the source; DMS is purpose-built. Option C is wrong because Amazon Athena is a query service, not an extraction tool.

Option D is wrong because Amazon EMR is for big data processing, not direct extraction from MySQL to S3.

Full explanation →

294

MCQeasy

A data scientist is training a linear regression model and notices that the model performs well on training data but poorly on validation data. Which technique should be applied to reduce overfitting?

A.Apply L2 regularization (Ridge)

B.Increase the number of epochs

C.Add more features

D.Remove training examples

AnswerA

Regularization reduces overfitting by penalizing large coefficients.

Why this answer

L2 regularization (Ridge) adds a penalty term proportional to the square of the magnitude of the coefficients to the loss function. This shrinks the weights toward zero, reducing the model's sensitivity to individual features and preventing it from fitting noise in the training data, which directly addresses overfitting.

Exam trap

The trap here is that candidates often confuse regularization with techniques that increase model capacity (like adding features or more training iterations), not realizing that overfitting requires reducing complexity, not increasing it.

How to eliminate wrong answers

Option B is wrong because increasing the number of epochs (training iterations) typically allows the model to fit the training data even more closely, worsening overfitting rather than reducing it. Option C is wrong because adding more features increases model complexity and the risk of capturing noise, which exacerbates overfitting. Option D is wrong because removing training examples reduces the amount of data available for learning, which can increase variance and make overfitting more likely, not less.

Full explanation →

295

MCQeasy

A data scientist is evaluating a regression model. The RMSE on the training set is 2.5, and on the test set is 2.7. The R² on the test set is 0.98. What does this indicate?

A.The model has high bias

B.The model generalizes well with no severe overfitting

C.The model is underfitting because R² is too high

D.The model is overfitting because RMSE is lower on training data

AnswerB

Small difference in RMSE and high test R² indicate good generalization.

Why this answer

The training RMSE (2.5) and test RMSE (2.7) are close, and the test R² of 0.98 is high, indicating good generalization. A slightly higher error on the test set is normal; only a large gap between the two would suggest severe overfitting.
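
Both metrics in this question are simple to compute by hand. A minimal sketch in plain Python, using made-up values rather than the question's data:

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(y_true))


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Made-up example values
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [2.0, 2.0, 3.0, 3.0]
print(f"RMSE={rmse(y_true, y_pred):.3f}  R2={r_squared(y_true, y_pred):.3f}")
```

Computing the same two functions on the training and test sets and comparing the results is the check the question describes.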

Full explanation →

296

Multi-Selectmedium

A data scientist is training a deep learning model on Amazon SageMaker and wants to reduce the training time. Which TWO actions would help achieve this?

Select 2 answers

A.Enable data augmentation.

B.Use distributed training across multiple instances.

C.Use SageMaker Automatic Model Tuning.

D.Use a GPU-based instance type.

E.Use SageMaker Managed Spot Training.

AnswersB, D

Distributed training parallelizes computation, reducing wall-clock time.

Why this answer

Distributed training across multiple instances (Option B) reduces training time by parallelizing the workload across multiple compute nodes, leveraging data parallelism or model parallelism to process larger batches or model partitions simultaneously. This is particularly effective for deep learning models where the dataset or model size exceeds the capacity of a single instance, as it scales throughput linearly with the number of instances under ideal conditions.

Exam trap

AWS often tests the distinction between cost-saving techniques (like Spot Training) and performance-enhancing techniques (like distributed training or GPU instances), leading candidates to mistakenly select Spot Training as a way to reduce training time when it only reduces cost.

Full explanation →

297

MCQhard

A data scientist needs to run ad-hoc SQL queries on a large dataset stored in Amazon S3 (Parquet format, 2 TB). The queries are interactive and require sub-second response times. Which service should they use?

A.Amazon Redshift Spectrum

B.Amazon QuickSight

C.Amazon EMR with Spark SQL

D.Amazon Athena

AnswerD

Athena is serverless and optimized for interactive queries on S3 data.

Why this answer

Amazon Athena is serverless, can query data in S3 using SQL, and provides fast interactive query performance, especially with Parquet. Option A is wrong because Redshift Spectrum requires a Redshift cluster, which is heavier for ad-hoc use. Option B is wrong because QuickSight is for visualization.

Option C is wrong because EMR requires cluster setup.

Full explanation →

298

MCQmedium

A data scientist is using SageMaker Debugger to monitor a training job. The training loss is not decreasing as expected. Which Debugger feature can help identify the issue?

A.Automatic hyperparameter tuning

B.Saving tensors every step

C.Deploying a model endpoint for real-time monitoring

D.Built-in rules to detect training anomalies

AnswerD

Rules like vanishing gradient can pinpoint issues.

Why this answer

Option D is correct because Debugger's built-in rules can detect vanishing gradients, overfitting, and similar training anomalies. Option A is incorrect because Debugger does not automatically tune hyperparameters. Option B is incorrect because saving tensors does not identify the issue automatically.

Option C is incorrect because Debugger is not for model deployment.

Full explanation →

299

Multi-Selecteasy

A machine learning team is using Amazon SageMaker to train a model. The training job uses spot instances to reduce cost. However, the training job is frequently interrupted. Which TWO actions can help mitigate the impact of spot interruptions? (Choose TWO.)

Select 2 answers

A.Increase the number of training instances.

B.Use a larger instance type that is less likely to be interrupted.

C.Use managed spot training with SageMaker's 'ManagedSpotTraining' parameter set to True.

D.Enable checkpointing to save intermediate results to Amazon S3.

E.Switch to on-demand instances.

AnswersC, D

Managed spot training handles interruptions.

Why this answer

Options C and D are correct. Option C: Managed spot training (the `EnableManagedSpotTraining` field of the CreateTrainingJob API) lets SageMaker handle interruptions and resume the job when capacity returns. Option D: Checkpointing to Amazon S3 allows training to resume from the last saved state.

Option A is wrong because more instances increase cost and interruption risk. Option B is wrong because larger instances are more expensive and not guaranteed to be less interrupted. Option E is wrong because on-demand instances cost more, giving up the savings that motivated using Spot.
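
The two mitigations map to a handful of fields in the CreateTrainingJob request. The sketch below builds only those fields; the checkpoint bucket is a hypothetical placeholder, the helper name is illustrative, and the rest of the request (algorithm, role, input and output configuration) is assumed to be supplied elsewhere.

```python
def spot_training_settings(checkpoint_s3_uri, max_runtime_s=3600, max_wait_s=7200):
    """Return the CreateTrainingJob fields that enable managed spot training with checkpoints."""
    if max_wait_s < max_runtime_s:
        # The wait time covers both training time and time spent waiting for Spot capacity
        raise ValueError("max_wait_s must be at least max_runtime_s")

    return {
        "EnableManagedSpotTraining": True,
        "CheckpointConfig": {
            "S3Uri": checkpoint_s3_uri,            # where checkpoints are synced
            "LocalPath": "/opt/ml/checkpoints",    # where the training script writes them
        },
        "StoppingCondition": {
            "MaxRuntimeInSeconds": max_runtime_s,
            "MaxWaitTimeInSeconds": max_wait_s,
        },
    }


# Hypothetical checkpoint location
settings = spot_training_settings("s3://example-checkpoints/job-1/")
print(settings)
```

The training script still has to write checkpoints to the local path and reload the latest one on startup; the configuration alone only handles syncing them to S3.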

Full explanation →

300

MCQeasy

A data scientist is training a linear regression model. After training, the model has a high bias and low variance. Which technique should the data scientist use to reduce bias?

A.Decrease the model complexity

B.Add more relevant features

C.Apply L2 regularization (Ridge)

D.Reduce the amount of training data

AnswerB

Adding features increases model complexity and can reduce bias.

Why this answer

High bias indicates the model is underfitting the data, meaning it is too simple to capture the underlying patterns. Adding more relevant features increases model complexity, allowing it to learn more from the data and reduce bias. This directly addresses the underfitting issue without increasing variance excessively, provided the features are meaningful.

Exam trap

AWS often tests the bias-variance tradeoff by presenting regularization as a solution for high bias, but candidates must remember that regularization (L1/L2) primarily reduces variance, not bias, and can actually increase bias if applied too strongly.

How to eliminate wrong answers

Option A is wrong because decreasing model complexity (e.g., using fewer features or a simpler algorithm) would further increase bias, worsening the underfitting problem. Option C is wrong because L2 regularization (Ridge) adds a penalty on large coefficients, which reduces variance but can increase bias by shrinking coefficients toward zero, making the model simpler. Option D is wrong because reducing the amount of training data typically increases variance and can also increase bias if the remaining data is not representative, but it does not directly reduce bias and may harm generalization.

Full explanation →
