Knowledge + Practice

AWS Certified Machine Learning - Specialty (MLS-C01): Questions 1126–1200

1755 questions total · 24 pages · All types, answers revealed

1126

MCQ · hard

An e-commerce company uses a linear regression model to predict customer lifetime value (LTV). The model shows high variance on the test set, with training RMSE much lower than test RMSE. Which of the following is the MOST effective approach to reduce overfitting?

A.Apply L2 regularization (Ridge regression)

B.Use a polynomial kernel in a support vector regressor

C.Add more features, including interaction terms

D.Increase training data size by duplicating existing samples

Answer: A

L2 regularization shrinks coefficients and reduces variance.

Why this answer

High variance (low training RMSE, high test RMSE) indicates overfitting. L2 regularization (Ridge regression) adds a penalty proportional to the square of the coefficients, shrinking them toward zero without eliminating them, which reduces model complexity and improves generalization. This directly addresses overfitting by constraining the model's sensitivity to noise in the training data.

Exam trap

The exam often tests the misconception that adding more data always reduces overfitting. The trap here is that duplicating existing samples (Option D) provides no new, diverse examples and therefore fails to address the root cause of high variance.

How to eliminate wrong answers

Option B is wrong because using a polynomial kernel in a support vector regressor increases model complexity by mapping data into a higher-dimensional space, which would exacerbate overfitting rather than reduce it. Option C is wrong because adding more features, including interaction terms, further increases model complexity and variance, making overfitting worse. Option D is wrong because duplicating existing samples does not introduce new information; it artificially inflates the weight of existing patterns, which can actually increase overfitting by reinforcing noise in the training data.
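
The shrinkage effect described above can be seen in a minimal sketch. It assumes a single feature and no intercept, for which the ridge slope has the closed form sum(x*y) / (sum(x*x) + lambda); the data values and the function name `ridge_slope` are made up for illustration.

```python
def ridge_slope(xs, ys, lam):
    """Slope of a one-feature, no-intercept ridge fit: sum(x*y) / (sum(x*x) + lam)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

# Illustrative data only (not from the question).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

ols_slope = ridge_slope(xs, ys, lam=0.0)      # lam = 0 is ordinary least squares
shrunk_slope = ridge_slope(xs, ys, lam=10.0)  # the L2 penalty pulls the slope toward zero

print(ols_slope, shrunk_slope)
```

A larger lambda shrinks the coefficient further but never sets it exactly to zero, which is the behavior the explanation attributes to L2 regularization.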

1127

Drag & Drop · medium

Drag and drop the steps to create a data processing job using Amazon SageMaker Processing in the correct order.

Why this order

A SageMaker Processing job is built in this order: create the processing script, upload the input data, configure the job, run it, and verify the output.

1128

MCQ · medium

A company uses Amazon DynamoDB as the primary data store for a real-time application. The data science team wants to analyze the data using Amazon Athena. What is the most efficient way to make the DynamoDB data available for Athena queries?

A.Use AWS Glue to extract data from DynamoDB and load into S3 on a schedule.

B.Use Amazon Redshift Spectrum to query DynamoDB directly.

C.Use DynamoDB Streams to invoke an AWS Lambda function that writes data to Amazon S3 in Parquet format. Then query the data in S3 using Athena.

D.Use Amazon EMR to read directly from DynamoDB and run Hive queries.

Answer: C

This provides a decoupled, cost-effective solution for analytics.

Why this answer

Option C is correct because DynamoDB Streams can trigger a Lambda function that writes each change to Amazon S3, and Athena can query the data in S3. Option A is wrong because AWS Glue can extract the data, but a scheduled extract adds latency. Option B is wrong because Redshift Spectrum queries data in S3, not DynamoDB directly. Option D is wrong because Amazon EMR can query DynamoDB directly but is more complex.
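
As a rough sketch of the Lambda step, the function below turns a DynamoDB Streams event into plain rows ready to be written to S3. It assumes the stream is configured to include new images; the attribute names in the usage example are hypothetical, and the actual S3 write (for example, encoding to Parquet with a separate library) is deliberately left out.

```python
def plain_value(attr):
    """Convert one DynamoDB-typed attribute, e.g. {"N": "42"}, to a Python value."""
    type_tag, raw = next(iter(attr.items()))
    if type_tag == "S":
        return raw
    if type_tag == "N":
        try:
            return int(raw)
        except ValueError:
            return float(raw)
    if type_tag == "BOOL":
        return raw
    if type_tag == "NULL":
        return None
    if type_tag == "L":
        return [plain_value(v) for v in raw]
    if type_tag == "M":
        return {k: plain_value(v) for k, v in raw.items()}
    raise ValueError(f"unsupported DynamoDB type tag: {type_tag}")

def rows_from_stream_event(event):
    """Collect the new image of every inserted or modified item as a plain dict."""
    rows = []
    for record in event.get("Records", []):
        if record.get("eventName") == "REMOVE":
            continue  # deletions carry no new image
        image = record["dynamodb"].get("NewImage")
        if image:
            rows.append({k: plain_value(v) for k, v in image.items()})
    return rows
```

In a real handler, the returned rows would be batched and written to an S3 prefix that an Athena table points to.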

1129

Multi-Select · hard

A data scientist is training a deep learning model using Amazon SageMaker. The training loss is decreasing, but the validation loss starts increasing after 10 epochs. The model is overfitting. Which TWO actions should the data scientist take to reduce overfitting? (Choose 2.)

Select 2 answers

A.Increase the number of layers

B.Remove L2 regularization

C.Increase the number of training steps

D.Add dropout layers

E.Add early stopping based on validation loss

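Answers: D, E

Dropout regularizes by randomly dropping neurons; early stopping halts training before validation loss degrades further.

Why this answer

Option D is correct because dropout layers randomly deactivate a fraction of neurons during training, which forces the network to learn more robust features and reduces co-adaptation, a common cause of overfitting. This technique is particularly effective in deep learning models trained on SageMaker, where large architectures can quickly memorize training data. Option E is correct because early stopping ends training once validation loss stops improving, before the model starts to memorize the training data.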

Exam trap

The trap here is that candidates often confuse regularization techniques that reduce overfitting (dropout, L2, early stopping) with actions that increase model capacity (more layers, more steps), leading them to select options that would worsen the problem.
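
Early stopping on validation loss can be sketched in a few lines. This is a framework-independent illustration; the `patience` and `min_delta` names follow common convention but are assumptions here, not SageMaker parameters.

```python
def early_stopping_epoch(val_losses, patience=2, min_delta=0.0):
    """Return the 0-based epoch at which training would stop, or None if it never stops."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:
            best = loss                      # validation loss improved
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1  # no improvement this epoch
            if epochs_without_improvement >= patience:
                return epoch
    return None

# Validation loss improves, then rises: the overfitting pattern in the question.
stop_at = early_stopping_epoch([0.9, 0.7, 0.6, 0.55, 0.58, 0.61, 0.70], patience=2)
```

In practice the weights from the best epoch are usually restored after stopping.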

1130

MCQ · medium

A company uses Amazon SageMaker to train a model for detecting fraudulent transactions. The dataset is highly imbalanced (99.9% legitimate, 0.1% fraudulent). Which approach is most effective to address this imbalance?

A.Use class weights in the loss function

B.Apply SMOTE to generate synthetic samples

C.Random oversampling of the minority class

D.Collect more data for the minority class

Answer: B

SMOTE generates synthetic samples to balance the dataset.

Why this answer

Option B is correct because SMOTE generates synthetic samples for the minority class, effectively balancing the dataset. Option A is wrong because using class weights directly in the loss function is also valid, but SMOTE is often more effective. Option C is wrong because random oversampling duplicates existing minority samples and may cause overfitting. Option D is wrong because collecting more data is good but may not be feasible.
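
The core idea behind SMOTE is interpolation between minority samples. The sketch below shows only that step; a full SMOTE implementation also chooses the neighbor among the k nearest minority-class neighbors, which is omitted here. The feature values are invented for illustration.

```python
import random

def smote_like_sample(point, neighbor, rng):
    """Return a synthetic point on the line segment between two minority samples."""
    gap = rng.random()  # uniform in [0, 1)
    return [p + gap * (n - p) for p, n in zip(point, neighbor)]

rng = random.Random(0)    # fixed seed so the sketch is reproducible
fraud_a = [120.0, 3.0]    # hypothetical features: amount, transactions per hour
fraud_b = [200.0, 5.0]
synthetic = smote_like_sample(fraud_a, fraud_b, rng)
```

Unlike random oversampling, the synthetic point is new rather than an exact copy, which is why the explanation treats SMOTE as less prone to overfitting.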

1131

MCQ · medium

A company wants to monitor a deployed model for data drift. Which AWS service should they use?

A.Amazon SageMaker Ground Truth

B.Amazon CloudWatch Logs

C.Amazon SageMaker Clarify

D.Amazon SageMaker Model Monitor

Answer: D

Model Monitor checks for data and model drift.

Why this answer

Option D is correct because SageMaker Model Monitor is designed for drift detection. Option A is wrong because SageMaker Ground Truth is for labeling. Option B is wrong because CloudWatch Logs collects logs and does not detect drift. Option C is wrong because SageMaker Clarify is for bias detection.

1132

MCQ · medium

A data scientist uses Amazon QuickSight to visualize a dataset and observes that a numerical feature has a skewness of 2.5 and a kurtosis of 8. Which transformation should they apply to make the distribution more normal?

A.Standardize the feature using Z-score normalization.

B.Apply a Box-Cox transformation with lambda=0.5.

C.Apply Min-Max scaling to the range [0,1].

D.Apply a log transformation.

Answer: D

Log transformation reduces right skewness.

Why this answer

Option D is correct because a skewness of 2.5 indicates strong right skew, and a log transformation is commonly used to reduce skewness. Option A is incorrect because standardization rescales the feature without changing the shape of its distribution. Option C is incorrect because Min-Max scaling does not change skewness either. Option B is the closest alternative: Box-Cox is a more general transformation that requires positive data, and with lambda=0.5 it behaves like a square-root transform, which is milder than log. In the context of this question, log is the simplest and most direct answer.
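
To make the effect concrete, the sketch below computes the moment-based skewness of a small right-skewed sample before and after a log transform. The sample values are invented; any strictly positive, right-skewed data would behave similarly.

```python
import math

def skewness(values):
    """Moment-based skewness: third central moment divided by variance ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

raw = [1, 2, 2, 3, 3, 4, 5, 8, 20, 100]   # long right tail
logged = [math.log(v) for v in raw]        # requires strictly positive values

print(skewness(raw), skewness(logged))
```

The log compresses the long right tail, so the second value printed is noticeably smaller than the first.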

1133

MCQ · hard

A company is using AWS Glue to run ETL jobs that transform data for machine learning. The jobs are failing with 'Out of Memory' errors. The data size is growing, and the company needs a cost-effective solution. Which approach should be taken?

A.Switch to Spark on Amazon EMR.

B.Increase the number of workers in the job configuration.

C.Optimize the job by filtering data earlier.

D.Use a larger worker type like G.2X.

Answer: B

Adding workers increases parallelism and aggregate memory, so each worker handles less data.

Why this answer

Increasing the number of workers in the AWS Glue job configuration distributes the data processing load across more Spark executors, directly addressing the 'Out of Memory' error by providing more aggregate memory without changing the worker type. This is a cost-effective approach because it scales horizontally, often at a lower cost than moving to a larger worker type, and it leverages the existing Glue infrastructure without migrating to EMR.

Exam trap

The trap here is that candidates often assume 'Out of Memory' errors must be solved by increasing memory per worker (vertical scaling) or by switching to a more powerful service, but the most cost-effective and direct solution in AWS Glue is to increase the number of workers (horizontal scaling) to distribute the memory load.

How to eliminate wrong answers

Option A is wrong because switching to Spark on Amazon EMR would require significant architectural changes and operational overhead, and it is not inherently more cost-effective than adjusting Glue worker count for the same memory issue. Option C is wrong because filtering data earlier is a best practice for performance optimization but does not directly resolve an 'Out of Memory' error caused by insufficient total memory across workers; it reduces data volume but may not prevent memory exhaustion if the cluster is undersized. Option D is wrong because using a larger worker type like G.2X increases memory per worker but is typically more expensive than adding more workers of the same type, and it may not be the most cost-effective horizontal scaling solution for growing data.

1134

Multi-Select · easy

A team wants to move data from an on-premises Oracle database to Amazon S3 for analytics. The pipeline must run daily and handle incremental updates. Which THREE services should they use together? (Choose three.)

Select 3 answers

A.Amazon SageMaker

B.Amazon S3

C.Amazon Athena

D.AWS Database Migration Service (DMS)

E.AWS Glue

Answers: B, D, E

S3 is the target data lake storage.

Why this answer

Options B, D, and E are correct. AWS DMS can continuously replicate changes from Oracle to S3, AWS Glue can transform the data, and S3 is the target. Option A is wrong because SageMaker is for ML, not data ingestion. Option C is wrong because Athena is a query engine, not a data movement service.

1135

MCQ · medium

A team deployed a SageMaker endpoint with the configuration shown in the exhibit. During a traffic spike, the endpoint becomes unresponsive. Which change to the endpoint configuration would best improve availability?

A.Reduce the initial instance count to 0 and use on-demand invocation

B.Add a second production variant with the same model

C.Configure auto-scaling for the endpoint

D.Change the instance type to ml.m5.xlarge

Answer: C

Auto-scaling dynamically adds instances during traffic spikes, improving availability.

Why this answer

Option C is correct because adding auto-scaling allows the endpoint to adjust its instance count based on load. Option A is wrong because reducing the instance count worsens availability. Option B is wrong because multiple production variants are meant for A/B testing and do not improve availability. Option D is wrong because changing the instance type may not handle spikes if there is still only one instance.

1136

Multi-Select · medium

A data scientist is training a gradient boosting model using SageMaker's built-in XGBoost algorithm. The dataset has missing values in several features. Which TWO actions should the data scientist take to handle missing values effectively? (Choose two.)

Select 2 answers

A.Impute missing values with the median of each feature using a preprocessing step.

B.Use one-hot encoding to create binary columns indicating missingness.

C.Remove all rows with missing values from the training dataset.

D.Apply PCA to reduce dimensionality and ignore missing values.

E.Set the 'missing' parameter in XGBoost to a specific value (e.g., 0) and let the algorithm learn the best imputation.

Answers: A, E

Median imputation is a robust method that preserves data.

Why this answer

Options A and E are correct. XGBoost can learn the best direction to send missing values at each split when the missing parameter identifies them. Alternatively, imputing with a statistic such as the median is a standard approach. Option B (one-hot encoding) is for categorical features. Option C (removing rows) loses data. Option D (PCA) is for dimensionality reduction.

1137

MCQ · medium

A company uses Amazon Athena to analyze data stored in S3. The data is in CSV format and is partitioned by year/month/day. Queries that filter on a specific day are slow. The team wants to improve query performance without changing the data format. Which action should the team take?

A.Use a larger number of small files to increase parallelism.

B.Increase the number of partitions by adding hour and minute levels.

C.Convert the data to Parquet format.

D.Ensure that the S3 folder structure follows the Hive-style partition naming convention (e.g., year=2023/month=01/day=01).

Answer: D

Hive-style partitions allow Athena to prune partitions effectively.

Why this answer

Option D is correct because partition pruning is most effective when the partition columns are in a Hive-style format (e.g., year=2023/month=01/day=01). If the folders are not in that format, Athena may not prune partitions properly. Option A is wrong because using a large number of small files hurts performance. Option B is wrong because adding hour and minute partition levels would further slow queries. Option C is wrong because converting to Parquet changes the data format, which the team wants to avoid.
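
A small helper shows what the Hive-style layout looks like in practice. The bucket name and base path are placeholders, and the function name is made up for this sketch.

```python
from datetime import date

def hive_partition_prefix(base, day):
    """Build a Hive-style S3 prefix such as .../year=2023/month=01/day=01/."""
    return f"{base}/year={day.year:04d}/month={day.month:02d}/day={day.day:02d}/"

# "example-bucket" is a placeholder, not a real bucket.
prefix = hive_partition_prefix("s3://example-bucket/events", date(2023, 1, 1))
print(prefix)
```

With keys laid out this way, a filter on year, month, and day lets Athena read only the matching prefix instead of scanning the whole table.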

1138

Multi-Select · medium

Which TWO metrics are suitable for evaluating a regression model? (Select TWO.)

Select 2 answers

A.Accuracy

B.Root Mean Squared Error (RMSE)

C.R-squared

D.F1-score

E.Precision

Answers: B, C

RMSE measures average prediction error in regression; R-squared measures the share of variance explained.

Why this answer

Root Mean Squared Error (RMSE) is a standard metric for regression models because it measures the average magnitude of prediction errors in the same units as the target variable. It penalizes larger errors more heavily due to squaring, making it sensitive to outliers and useful for comparing model performance. R-squared is also suitable because it reports the proportion of variance in the target that the model explains.

Exam trap

The exam often tests the distinction between classification and regression metrics. The trap here is that candidates mistakenly apply classification metrics like Accuracy, F1-score, or Precision to regression problems because they are familiar with them from other contexts.
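
Both regression metrics are easy to compute by hand. The sketch below uses made-up predictions to show the standard definitions.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Illustrative values only.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 9.0]
print(rmse(y_true, y_pred), r_squared(y_true, y_pred))
```

Neither metric requires class labels, which is what separates them from Accuracy, Precision, and F1-score.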

1139

MCQ · medium

A company is using SageMaker to train a model, but the training job fails with an out-of-memory error. Which action should the data scientist take to resolve this issue?

A.Use a larger instance type for training

B.Decrease the batch size

C.Increase the learning rate

D.Increase the number of layers

Answer: B

Smaller batches use less memory.

Why this answer

Decreasing the batch size reduces the memory footprint per training step, directly addressing the out-of-memory (OOM) error. In SageMaker, the training instance's GPU or CPU memory is shared between model parameters, activations, and the batch data; a smaller batch size lowers the peak memory usage, allowing the training job to complete without exceeding the instance's memory limit.

Exam trap

The trap here is that candidates often default to scaling up infrastructure (larger instance) instead of optimizing hyperparameters like batch size, which is a more immediate and cost-effective fix for OOM errors in SageMaker.

How to eliminate wrong answers

Option A is wrong because using a larger instance type may resolve the OOM error but is not the most efficient or cost-effective first step; it increases costs and does not address the root cause of memory bloat. Option C is wrong because increasing the learning rate does not affect memory usage; it changes the step size in gradient descent and can lead to divergence or instability. Option D is wrong because increasing the number of layers adds more parameters and activations, which increases memory consumption and would worsen the OOM error.

1140

MCQ · medium

A data scientist is working with a dataset that contains both numerical and categorical features. During EDA, they want to understand the relationship between a categorical feature with 10 unique values and the target variable. Which visualization is most appropriate?

A.Heatmap

B.Box plot

C.Histogram

D.Scatter plot

Answer: B

Box plot shows target distribution across categories.

Why this answer

Option B is correct because a box plot shows the distribution of the target across categories. Option A is wrong because heatmaps typically show correlation between numerical features. Option C is wrong because a histogram shows the distribution of a single numerical feature. Option D is wrong because scatter plots are for two numerical features.

1141

MCQ · hard

A company is using SageMaker to train a deep learning model with TensorFlow. The training job is running on an ml.p3.16xlarge instance. The data scientist wants to maximize GPU utilization. Which configuration should be used?

A.Use a single GPU and increase the number of epochs.

B.Use a CPU-only instance for training and then deploy on GPU.

C.Use File mode input and a small batch size.

D.Use Pipe mode or Fast File mode with a large batch size that fits in GPU memory.

Answer: D

Pipe mode streams data efficiently; large batch size maximizes GPU compute.

Why this answer

Option D is correct because SageMaker's Pipe mode and Fast File mode reduce I/O bottlenecks, and for high GPU utilization the data pipeline must keep the GPUs busy. Option A (a single GPU) wastes the remaining GPUs on the instance. Option B (a CPU-only instance) is counterproductive. Option C (File mode with a small batch size) may underutilize the GPUs.

1142

MCQ · hard

A machine learning engineer is deploying a model on SageMaker and needs to ensure that the endpoint can handle a sudden spike in traffic. The engineer expects traffic to increase by 10x during a promotional event. Which scaling strategy should be used?

A.Use a single large instance type instead of multiple smaller instances.

B.Manually increase the instance count before the event.

C.Use only dynamic scaling based on the average latency metric.

D.Use scheduled scaling to add instances before the event, combined with dynamic scaling for the remaining duration.

Answer: D

Scheduled scaling pre-warms the endpoint to handle the spike.

Why this answer

Option D is correct because scheduled scaling can add instances before the expected spike, while dynamic scaling handles variation during the rest of the event. Option A is wrong because a single large instance provides no redundancy and may not handle the spike. Option B is wrong because manual scaling requires human intervention. Option C is wrong because dynamic scaling alone may not react quickly enough to a sudden 10x spike.
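
The combination can be sketched as two request payloads for Application Auto Scaling, which is the service that scales SageMaker endpoint variants. The function below only builds the dictionaries; it assumes they would be passed to the boto3 `application-autoscaling` client's `put_scheduled_action` and `put_scaling_policy` calls after the variant has been registered as a scalable target. The endpoint name, capacities, action and policy names, and target value are all hypothetical.

```python
def scaling_requests(endpoint, variant, event_min, event_max, event_start_utc):
    """Build (scheduled_action, target_tracking_policy) keyword arguments."""
    common = {
        "ServiceNamespace": "sagemaker",
        "ResourceId": f"endpoint/{endpoint}/variant/{variant}",
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
    }
    # Scheduled action: raise the capacity floor shortly before the event starts.
    scheduled = dict(
        common,
        ScheduledActionName="pre-warm-for-promotion",   # hypothetical name
        Schedule=f"at({event_start_utc})",
        ScalableTargetAction={"MinCapacity": event_min, "MaxCapacity": event_max},
    )
    # Dynamic policy: track invocations per instance for the rest of the event.
    dynamic = dict(
        common,
        PolicyName="invocations-target-tracking",       # hypothetical name
        PolicyType="TargetTrackingScaling",
        TargetTrackingScalingPolicyConfiguration={
            "TargetValue": 1000.0,                      # assumed target, tune per model
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
            },
        },
    )
    return scheduled, dynamic

scheduled, dynamic = scaling_requests(
    "promo-endpoint", "AllTraffic", event_min=10, event_max=40,
    event_start_utc="2030-01-01T08:45:00",
)
```

Keeping the payloads separate from the API calls makes the intended capacity plan easy to review before anything is applied.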

1143

MCQ · hard

A company operates a real-time fraud detection system using an Amazon SageMaker endpoint. The model is a gradient boosting model trained on historical transaction data. The endpoint is deployed on an ml.c5.2xlarge instance with auto-scaling enabled based on average latency. Recently, during a flash sale event, the endpoint started returning HTTP 503 errors. The CloudWatch metrics show that the CPU utilization is at 70%, and the average latency has increased from 50 ms to 200 ms. The auto-scaling policy is configured to add one instance when average latency exceeds 100 ms for 5 consecutive minutes, and remove one instance when latency drops below 50 ms for 5 minutes. The current number of instances is 2. The flash sale lasted 30 minutes. What should the company do to prevent this issue in future flash sales?

A.Enable request throttling to drop excess requests

B.Change the instance type to ml.c5.4xlarge to handle higher load

C.Pre-warm the endpoint by setting a minimum number of instances that can handle the expected peak load before the flash sale

D.Change the model to a simpler model with lower latency

Answer: C

This ensures capacity is available from the start.

Why this answer

Option C is correct because the auto-scaling policy is reactive and too slow (5-minute evaluation period) to handle rapid traffic spikes. Pre-warming the endpoint by increasing the number of instances before the flash sale ensures capacity is available. Option A (enable throttling) would still result in errors for the dropped requests. Option B (increase the instance size) may help but is more expensive and still reactive. Option D (use a simpler model) does not address the core issue.

1144

Multi-Select · easy

Which TWO services can be used to transform data in transit within a Kinesis Data Firehose delivery stream? (Choose 2)

Select 2 answers

A.AWS Lambda

B.Amazon Kinesis Data Analytics

C.Amazon Athena

D.Amazon S3

E.AWS Glue

Answers: A, E

Firehose can invoke a Lambda function to transform records.

Why this answer

AWS Lambda is correct because it can be invoked as a transformation function within a Kinesis Data Firehose delivery stream. When you enable data transformation, Firehose buffers incoming records and then calls a Lambda function you specify, passing batches of records for processing. The Lambda function can modify, enrich, filter, or reformat the data before Firehose continues delivering it to the destination. AWS Glue is the second answer because Firehose's record format conversion (for example, JSON to Parquet or ORC) relies on a schema defined in the AWS Glue Data Catalog.

Exam trap

The trap here is that candidates often confuse 'transformation' with 'analytics' and select Kinesis Data Analytics, not realizing that Firehose's built-in transformation feature is specifically powered by Lambda, not by a separate analytics engine.
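
A transformation function of this kind can be sketched as follows. The event and response shapes (base64-encoded `data`, a `recordId`, and a `result` of `Ok`) follow the Firehose transformation contract; the payload fields `amount_cents` and `amount_usd` are hypothetical.

```python
import base64
import json

def lambda_handler(event, context):
    """Enrich each Firehose record and return it in the expected response shape."""
    output = []
    for record in event["records"]:
        payload = json.loads(base64.b64decode(record["data"]))
        # Hypothetical enrichment: derive a dollar amount from a cents field.
        payload["amount_usd"] = round(payload["amount_cents"] / 100, 2)
        encoded = base64.b64encode((json.dumps(payload) + "\n").encode("utf-8"))
        output.append({
            "recordId": record["recordId"],   # must echo the incoming record ID
            "result": "Ok",                   # or "Dropped" / "ProcessingFailed"
            "data": encoded.decode("utf-8"),
        })
    return {"records": output}
```

Appending a newline to each record is a common convention so that downstream readers can split the delivered objects line by line.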

1145

Drag & Drop · medium

Drag and drop the steps to use Amazon SageMaker Debugger to debug a training job in the correct order.

Why this order

Debugger requires hook configuration, job setup with rules, execution, and analysis.

1146

MCQ · medium

A team is using Amazon SageMaker to train a deep learning model. The training job is taking too long, and they want to reduce training time without significant accuracy loss. They have already tried increasing the number of instances. Which technique should they consider next?

A.Increase L2 regularization

B.Reduce model complexity

C.Gradient accumulation

D.Early stopping

Answer: C

Gradient accumulation simulates larger batch sizes, improving convergence speed.

Why this answer

Option C is correct because gradient accumulation allows simulating larger batch sizes without increasing memory, often speeding up convergence. Option A is wrong because L2 regularization does not reduce training time. Option B is wrong because reducing model complexity may cause underfitting. Option D is wrong because early stopping may stop too early.

1147

Multi-Select · hard

A machine learning team is using Amazon SageMaker to train a model on a dataset stored in S3. The training job reads data from S3 using Pipe input mode, but the training is slow. The team wants to improve data throughput. Which THREE actions should they take?

Select 3 answers

A.Enable S3 Transfer Acceleration on the bucket.

B.Mount the S3 bucket using an S3 file system and use File mode with a larger instance type.

C.Use Amazon S3 VPC Gateway Endpoint to reduce data transfer costs and improve latency.

D.Use Amazon EFS as the data source for training.

E.Use Amazon ElastiCache to cache the training data.

Answers: B, C, D

File mode with high-bandwidth instances can improve throughput.

Why this answer

Using Amazon S3 with a VPC endpoint improves network performance. Using Amazon EFS as a data source can provide higher throughput for sequential access. Using SageMaker's File mode with a larger instance type with more network bandwidth can also improve throughput.

1148

MCQ · medium

A company is using AWS Glue to run ETL jobs that process data from an Amazon RDS for PostgreSQL database. The jobs are failing with connection timeouts. The security group for the RDS instance allows inbound traffic from the Glue job's security group. What is the most likely cause?

A.The security group inbound rule is incorrect

B.The VPC does not have an S3 VPC endpoint

C.RDS is in a different subnet

D.The IAM role does not have permission to access RDS

Answer: B

Glue jobs that run inside a VPC need an S3 VPC endpoint to reach Amazon S3.

Why this answer

When an AWS Glue job uses a connection inside a VPC, it needs an S3 VPC endpoint (or a NAT gateway) to reach Amazon S3, where its scripts and temporary data are stored. Without one, those calls cannot leave the VPC and the job times out. Option A is wrong because the security group is correctly configured. Option C is wrong because no subnet issue is mentioned. Option D is wrong because the issue is not about IAM permissions for RDS.

1149

MCQ · hard

A data scientist is analyzing a dataset with a large number of missing values in several columns. The dataset is stored in an Amazon S3 bucket and is about 5 TB in size. The scientist wants to understand the pattern of missingness (e.g., is it missing completely at random, missing at random, or not missing at random) before deciding on an imputation strategy. The scientist has access to AWS Glue DataBrew and Amazon SageMaker Studio. Which approach should the scientist take to best understand the missing data patterns?

A.Use Amazon SageMaker Data Wrangler to create a flow and analyze missingness visually

B.Use AWS Glue DataBrew's data quality and missing data reports

C.Use AWS Glue ETL jobs with PySpark to compute missingness statistics

D.Use Amazon Athena to run queries to find missing values per column

Answer: B

DataBrew's reports visualize missing data patterns and correlations.

Why this answer

Option B is correct because AWS Glue DataBrew provides a missing data report that includes patterns and correlations of missingness, such as heatmaps and bar charts. This helps determine the type of missingness. Option A is wrong because SageMaker Data Wrangler does not have built-in missingness pattern analysis. Option C is wrong because Glue ETL jobs require custom coding and are less efficient for exploratory analysis. Option D is wrong because Athena is for querying, not pattern analysis.

1150

Multi-Select · hard

Which THREE steps should be taken to secure a SageMaker notebook instance that accesses sensitive data? (Select THREE.)

Select 3 answers

A.Enable encryption at rest for the notebook's EBS volume

B.Grant root access to the notebook instance for flexibility

C.Place the notebook instance inside a VPC with no internet access

D.Allow direct internet access from the notebook for downloading packages

E.Use an IAM role with least privilege permissions for the notebook

Answers: A, C, E

Protects stored data.

Why this answer

Using a VPC with no internet access keeps traffic private. IAM roles enforce least privilege access. Encryption at rest protects data on the notebook. Option B is wrong because root access should be disabled, not enabled. Option D is wrong because direct internet access should be disabled for security.

1151

Multi-Select · hard

Which TWO of the following are best practices for exploratory data analysis when using Amazon SageMaker Data Wrangler? (Select TWO.)

Select 2 answers

A.Store all intermediate results in Amazon Athena for querying.

B.Use Data Wrangler's built-in data visualizations to explore feature distributions and relationships.

C.Use Amazon EMR to run Spark jobs for data profiling.

D.Always export the data to Amazon QuickSight for analysis before transformation.

E.Export the Data Wrangler flow as a Jupyter notebook to share with the team.

Answers: B, E

Built-in visualizations enable quick EDA.

Why this answer

Using Data Wrangler's built-in visualizations for quick analysis and exporting the flow as a Jupyter notebook for reproducibility are best practices. Option A (Athena) is for queries, not for storing intermediate pipeline results. Option C (EMR) is not needed. Option D (QuickSight) is a separate service and is not a required step before transformation.

1152

Multi-Select · hard

A machine learning team is analyzing a dataset with 10,000 rows and 200 features. They suspect data leakage due to time-based features. Which THREE EDA checks should they perform?

Select 3 answers

A.Plot distribution of each feature in training vs. test sets

B.Apply PCA and check if first two components separate train/test

C.Check whether the dataset is sorted by time and if any feature uses future information

D.Compare feature correlations with target in training and test sets

E.Perform k-means clustering on the whole dataset

Answers: A, C, D

Comparing train and test distributions, time ordering, and feature-target correlations can expose leakage.

Why this answer

Option A is correct because distribution differences between the training and test sets can indicate leakage (e.g., the training set contains future data). Option C is correct because checking the chronological order reveals whether any feature uses future information. Option D is correct because a feature whose correlation with the target differs between the training and test sets may indicate leakage. Option B is wrong because PCA is for dimensionality reduction, not leakage detection. Option E is wrong because clustering is not directly helpful for leakage detection.

1153

MCQ · hard

A data engineer runs the above CLI command and sees that the bucket contains many small Parquet files (1 MB each) under the prefix. When querying this data with Athena, the query performance is poor and costs are high. Which approach would MOST improve performance and reduce cost?

A.Convert the files to JSON format

B.Convert the files to CSV format

C.Consolidate the small files into fewer, larger Parquet files

D.Add more partitions by including hour in the prefix

Answer: C

Fewer, larger files reduce overhead and improve compression.

Why this answer

Option C is correct because consolidating small files into larger files (e.g., 100 MB) reduces the overhead of reading many small files in Athena. Option A is wrong because JSON, like CSV, is not columnar and would not improve performance. Option B is wrong because CSV is not columnar and would not improve performance. Option D is wrong because adding more partitions may create even more files.

1154

MCQ · hard

A data scientist is using Amazon SageMaker for hyperparameter tuning. The tuning job uses a Bayesian optimization strategy. After 10 training jobs, the objective metric (validation accuracy) has plateaued at 0.85. The data scientist wants to explore more diverse hyperparameter combinations. What should the data scientist do?

A.Decrease the exploration weight in the tuning job configuration.

B.Switch to random search strategy.

C.Increase the exploration weight in the tuning job configuration.

D.Increase the number of parallel training jobs.

Answer: C

Increasing exploration weight prompts the algorithm to try more diverse combinations.

Why this answer

In Bayesian optimization, the exploration weight controls the trade-off between exploring new hyperparameter regions and exploiting known good regions. Increasing this weight encourages the acquisition function to sample more diverse combinations, which can help escape a plateau. Option C is correct because it directly addresses the need for greater diversity in the search space.

Exam trap

The exam often tests the misconception that increasing parallel jobs or switching to random search is the best way to increase diversity, when in fact Bayesian optimization's exploration weight is the precise control for this purpose.

How to eliminate wrong answers

Option A is wrong because decreasing the exploration weight would make the tuning job more exploitative, focusing on known good regions and reducing diversity, which is the opposite of what is needed. Option B is wrong because switching to random search would abandon the benefits of Bayesian optimization's informed sampling, potentially wasting resources on random trials without leveraging prior results. Option D is wrong because increasing the number of parallel training jobs does not inherently increase exploration diversity; it only speeds up the tuning process but may lead to less informed decisions if the Bayesian model cannot keep up with parallel evaluations.

1155

Multi-Selectmedium

Which TWO actions are appropriate during exploratory data analysis when you discover that a categorical feature has 50 unique values (high cardinality)?

Select 2 answers

A.Group rare categories into a single 'Other' category.

B.Apply one-hot encoding to create 50 dummy variables.

C.Apply label encoding to assign integers to each category.

D.Drop the feature entirely.

E.Use feature hashing (hashing trick) to reduce dimensionality.

AnswersA, E

Reduces cardinality while keeping most information.

Why this answer

Options A and E are correct. A: Grouping rare categories into an 'Other' category reduces cardinality while preserving most of the information. E: Feature hashing transforms a high-cardinality categorical feature into a fixed-size vector.

Option B is incorrect because one-hot encoding creates 50 columns, which inflates dimensionality and sparsity. Option C is incorrect because label encoding implies an ordering that may not exist. Option D is incorrect because dropping the feature may lose important information.
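
As a rough sketch of the two recommended techniques, the snippet below groups rare categories into 'Other' and maps category strings to a fixed number of hash buckets. The function names, the `min_count` threshold, the bucket count, and the sample city values are all hypothetical choices for illustration.

```python
from collections import Counter
import hashlib

def group_rare(values, min_count):
    """Replace categories seen fewer than min_count times with 'Other'."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else "Other" for v in values]

def hash_bucket(value, n_buckets):
    """Map a category string to a stable bucket index (hashing trick)."""
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets

# Hypothetical city column with two frequent and three rare values.
cities = ["NYC"] * 5 + ["LA"] * 4 + ["Boise", "Reno", "Tulsa"]

grouped = group_rare(cities, min_count=2)
buckets = [hash_bucket(c, n_buckets=8) for c in cities]

print(sorted(set(grouped)))  # ['LA', 'NYC', 'Other']
print(len(set(buckets)) <= 8)
```

Grouping keeps the categories interpretable, while hashing caps the feature width regardless of how many categories appear, at the cost of occasional collisions.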

1156

MCQmedium

A data scientist is training a deep learning model for image classification using Amazon SageMaker. The training job is taking too long. The data scientist wants to speed up training by using distributed training across multiple GPUs. Which SageMaker feature or configuration should the data scientist use?

A.SageMaker Debugger

B.Model parallelism in SageMaker

C.SageMaker hyperparameter tuning

D.SageMaker Data Parallelism library

AnswerD

The SageMaker Data Parallelism library distributes data across multiple GPUs, reducing training time for large datasets.

Why this answer

Option D is correct because the SageMaker Data Parallelism library is specifically designed to distribute training across multiple GPUs by splitting the input data across workers, which reduces per-GPU computation time and accelerates training for deep learning models. This library uses optimized all-reduce algorithms (e.g., Ring AllReduce) to synchronize gradients efficiently, making it ideal for speeding up image classification tasks that are data-intensive.

Exam trap

The trap here is that candidates often confuse model parallelism (splitting the model) with data parallelism (splitting the data), and incorrectly choose model parallelism when the scenario clearly describes a training speed issue solvable by distributing data across GPUs.

How to eliminate wrong answers

Option A is wrong because SageMaker Debugger is a tool for monitoring and debugging training jobs (e.g., capturing tensors, detecting anomalies), not for distributing training across GPUs. Option B is wrong because model parallelism in SageMaker splits the model itself across devices, which is useful for models too large to fit on a single GPU, but the question asks to speed up training for a model that already fits on a single GPU, where data parallelism is the appropriate approach. Option C is wrong because SageMaker hyperparameter tuning automates the search for optimal hyperparameters (e.g., learning rate, batch size) but does not directly enable distributed training across multiple GPUs.

1157

MCQeasy

A company is using Amazon SageMaker to deploy a machine learning model for real-time inference. The model was trained using XGBoost and achieves high accuracy. However, during deployment, the endpoint returns a 'ModelError' when receiving input data. The input is a CSV string. What is the most likely cause?

A.The input data format does not match the model's expected format (e.g., CSV vs JSON)

B.The inference instance type is too small

C.The model is not properly loaded into memory

D.The model weights are corrupted during deployment

AnswerA

SageMaker inference endpoints require the input to be in the format expected by the model, e.g., CSV for XGBoost.

Why this answer

The most common cause of ModelError during inference is that the input format does not match what the model expects. XGBoost models typically expect CSV without headers. The serializer setting in SageMaker must be configured correctly.

If the model expects text/csv but the endpoint is configured as JSON, the error occurs. The other options are less likely: model weights are loaded correctly if the model deployed, and the instance type affects latency not errors.

1158

MCQhard

A data scientist is training a model using SageMaker's built-in XGBoost algorithm with a large dataset stored in CSV format. The training job is using File mode. The data scientist wants to reduce the time it takes to start training. Which approach would be most effective?

A.Increase the size of the EBS volume.

B.Convert the data to Parquet format.

C.Use Pipe mode for the input data channel.

D.Increase the number of training instances.

AnswerC

Pipe mode starts training immediately by streaming data.

Why this answer

Pipe mode streams data directly from Amazon S3 into the training container, eliminating the need to first download the entire dataset to the EBS volume. This reduces the startup time significantly because training can begin as soon as the first records arrive, rather than waiting for the full download to complete.

Exam trap

The trap here is that candidates often assume converting to a more efficient format like Parquet will speed up training startup, but in File mode the bottleneck is the download step, not the read efficiency, so Pipe mode directly addresses the root cause.

How to eliminate wrong answers

Option A is wrong because increasing the EBS volume size does not reduce the time to start training; it only provides more storage space, and the download time from S3 remains the same. Option B is wrong because converting to Parquet format improves read performance and reduces storage size, but the training job still uses File mode, which requires the full dataset to be downloaded to the EBS volume before training starts. Option D is wrong because increasing the number of training instances does not reduce the startup time; it distributes the training workload across more machines but still requires each instance to download the full dataset in File mode before training begins.

1159

MCQeasy

A company is using SageMaker to deploy a model for real-time inference. The model requires low latency, and the company wants to test the endpoint before production. Which approach should be used to validate endpoint performance?

A.Use CloudWatch Synthetics to create a canary.

B.Perform offline batch evaluation on a test dataset.

C.Deploy to production and monitor using CloudWatch.

D.Use SageMaker's built-in shadow testing or load testing features.

AnswerD

Allows traffic simulation and latency measurement.

Why this answer

Option D is correct because SageMaker provides built-in shadow testing and load testing that simulate traffic to test endpoint performance before production. Option A is wrong because a CloudWatch Synthetics canary monitors availability on a schedule; it is not meant for load testing an inference endpoint. Option B is wrong because offline batch evaluation measures model quality, not real-time inference latency.

Option C is wrong because waiting for production traffic does not allow pre-production validation.

1160

MCQmedium

A data scientist is trying to create a SageMaker training job but receives an access denied error. The IAM policy attached to the role is shown in the exhibit. What is the most likely cause of the error?

A.The policy does not allow s3:PutObject for the output location

B.The policy does not allow sagemaker:CreateTrainingJob

C.The policy has an explicit deny on s3:PutObject

D.The policy does not allow s3:GetObject on the output bucket

AnswerA

SageMaker needs to write model artifacts and output to S3, requiring s3:PutObject.

Why this answer

Option A is correct because the policy lacks permission to write output to S3: the s3:PutObject action is missing, and SageMaker needs it to write model artifacts to the output location. Option D is wrong because the policy allows s3:GetObject.

Option B is wrong because sagemaker:CreateTrainingJob is allowed. Option C is wrong because the policy contains no explicit deny on writes.

1161

MCQmedium

A team is using Amazon SageMaker to train a model and wants to automatically stop training when the model stops improving to save costs. Which SageMaker feature should they use?

A.SageMaker Experiments

B.SageMaker Debugger

C.SageMaker Managed Spot Training with early stopping

D.SageMaker Automatic Model Tuning

AnswerC

Spot training can use a stopping condition to halt when improvements cease.

Why this answer

The listed answer is Managed Spot Training with early stopping. Note, however, that built-in early stopping in SageMaker belongs to Automatic Model Tuning (hyperparameter tuning jobs), which can stop training jobs whose objective metric has stopped improving; Managed Spot Training lowers cost by running on spare capacity and does not itself detect a plateau. For a single training job, use a custom callback in the training script, a custom stopping condition driven by CloudWatch metrics, or SageMaker Debugger.

1162

Multi-Selecthard

A data scientist is using Amazon SageMaker to train a random forest model for a binary classification task. The dataset has 50 features and 10,000 samples. The model achieves high training accuracy but poor test accuracy. Which TWO actions should the scientist take to improve generalization?

Select 2 answers

A.Increase the max_samples parameter.

B.Reduce the max_depth of the trees.

C.Increase the max_features parameter.

D.Increase the number of trees (n_estimators).

E.Increase the min_samples_leaf parameter.

AnswersB, E

Reducing tree depth limits model complexity and helps prevent overfitting.

Why this answer

High training accuracy with poor test accuracy means the model is overfitting. Reducing max_depth (Option B) limits the complexity of each tree, and increasing min_samples_leaf (Option E) forces every leaf to cover more samples, which smooths the predictions. Both directly constrain how closely individual trees can fit noise in the training data.

Option D (increase n_estimators) is wrong as a direct fix: adding trees reduces the variance of the ensemble average and rarely hurts, but it does not constrain the complexity of the individual trees and it increases training time. Option A (increase max_samples) is wrong because giving each tree more of the training data makes the trees more alike and weakens the regularizing effect of bagging. Option C (increase max_features) is wrong because considering more features per split makes the trees more correlated, which tends to increase variance.

1163

MCQeasy

A data scientist has this IAM policy attached to an IAM role used by SageMaker. When trying to create a training job, the scientist gets an access denied error. The training data is in 's3://my-bucket/training-data/'. What is the most likely cause?

A.The bucket name is misspelled

B.The S3 resource ARN is incorrect

C.Missing s3:ListBucket permission

D.The sagemaker:CreateTrainingJob action is not allowed

AnswerC

SageMaker needs ListBucket permission to access objects.

Why this answer

Option C is correct. The policy allows 's3:GetObject' but not 's3:ListBucket', which SageMaker requires to list the objects under the training-data prefix. Option D is wrong because the sagemaker:CreateTrainingJob action is allowed.

Option B is wrong because the resource ARN is specified correctly. Option A is wrong because the bucket name and prefix are correct.
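
The exhibit policy is not reproduced here, so the following is only a sketch of what a policy granting both permissions could look like, built as a Python dictionary and serialized to JSON. The bucket name and prefix come from the question; the statement layout is an assumption. Note that s3:ListBucket is a bucket-level action and targets the bucket ARN, while s3:GetObject is an object-level action and targets object ARNs.

```python
import json

BUCKET = "my-bucket"        # bucket name taken from the question
PREFIX = "training-data/"   # prefix taken from the question

# Assumed statement layout; the real exhibit may differ.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            # Bucket-level action: the resource is the bucket itself.
            "Effect": "Allow",
            "Action": "s3:ListBucket",
            "Resource": f"arn:aws:s3:::{BUCKET}",
        },
        {
            # Object-level action: the resource is the objects under the prefix.
            "Effect": "Allow",
            "Action": "s3:GetObject",
            "Resource": f"arn:aws:s3:::{BUCKET}/{PREFIX}*",
        },
    ],
}

policy_json = json.dumps(policy, indent=2)
print(policy_json)
```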

1164

Multi-Selectmedium

Which TWO factors should be considered when choosing between Amazon SageMaker's real-time endpoints and serverless inference? (Select TWO.)

Select 2 answers

A.GPU requirement

B.Inference traffic pattern (intermittent vs steady)

C.Integration with AWS Lambda

D.Availability of built-in algorithms

E.Model size in GB

AnswersA, B

Serverless inference does not support GPU instances.

Why this answer

Serverless inference is ideal for intermittent workloads that can tolerate cold starts. Real-time endpoints are better for predictable, low-latency, and high-throughput requirements. GPU support is only available on real-time endpoints.

Memory limits apply to both but serverless has a maximum of 6 GB. Concurrent requests can be handled by both, but serverless scales to zero.

1165

MCQmedium

A data scientist is training a binary classification model on an imbalanced dataset where the positive class represents 5% of the data. The model achieves 95% accuracy but only identifies 10% of the actual positive cases. Which metric should the data scientist focus on to evaluate the model's performance on the positive class?

A.Precision

B.Recall

C.AUC-ROC

D.F1 score

AnswerB

Recall measures the proportion of actual positives correctly identified, which is the key issue.

Why this answer

Recall measures the proportion of actual positives correctly identified, which is critical for imbalanced datasets where accuracy is misleading. Option A is wrong because precision focuses on correctness of positive predictions, not coverage. Option D is wrong because F1 balances precision and recall but doesn't directly isolate the low recall.

Option C is wrong because AUC-ROC considers overall separability, not specifically recall of the positive class.

1166

Multi-Selecteasy

A data scientist is training a binary classifier using imbalanced data. Which TWO techniques can help improve model performance on the minority class? (Choose two.)

Select 2 answers

A.Undersample the majority class randomly.

B.Use accuracy as the evaluation metric.

C.Use the F1 score as the evaluation metric.

D.Oversample the minority class using SMOTE.

E.Apply L1 regularization to the model.

AnswersC, D

F1 score balances precision and recall.

Why this answer

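The F1 score is the harmonic mean of precision and recall, making it a robust evaluation metric for imbalanced datasets because it captures both false positives and false negatives. Unlike accuracy, which can be misleadingly high when the majority class dominates, the F1 score provides a balanced measure of model performance on the minority class. SMOTE (Option D) complements this by synthesizing new minority-class examples through interpolation between existing minority samples and their nearest neighbors, giving the model more minority signal to learn from.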

Exam trap

The exam often tests the misconception that random undersampling is always beneficial for imbalanced data, but candidates must recognize that it can discard useful majority class patterns and that SMOTE or other synthetic oversampling methods are preferred.

1167

MCQmedium

Refer to the exhibit. A data scientist ran a SageMaker training job and reviewed the logs. The training completed quickly, but the model performance is very poor. What is the most likely cause?

A.The model is overfitting to the training data.

B.There is data leakage from the test set into the training set.

C.The learning rate is too low, causing slow convergence.

D.The training dataset is too small for the model complexity.

AnswerD

A small dataset can be trained quickly but leads to poor generalization.

Why this answer

The training ran for only about 1 minute, which is far shorter than a typical training run, and the logs show a normal 'Training completed' status with no errors. A very small dataset explains both observations: every epoch finishes quickly, and the model sees too few examples to generalize, so Option D is the most likely cause. A job configured with too few epochs or an aggressive early stopping rule would produce similar logs, but neither appears among the options.

Option A is wrong because overfitting would show high training accuracy and does not explain the quick completion. Option B is wrong because data leakage inflates measured performance rather than depressing it. Option C is wrong because a learning rate that is too low slows convergence, but the job would still run for all configured epochs and would not finish unusually fast.

1168

Multi-Selectmedium

A data scientist is performing feature selection for a classification problem with 100 features. The data scientist wants to reduce overfitting and improve model interpretability. Which THREE methods are appropriate for feature selection? (Choose THREE.)

Select 3 answers

A.Principal Component Analysis (PCA)

B.Recursive Feature Elimination (RFE)

C.L1 regularization (Lasso)

D.Adding random noise to the features

E.Feature importance from a random forest model

AnswersB, C, E

RFE recursively removes the least important features based on model coefficients or feature importance.

Why this answer

Recursive Feature Elimination (RFE) is a wrapper method that recursively removes the least important features based on a model's feature weights or coefficients, training the model multiple times to identify the optimal subset. This directly reduces overfitting by eliminating irrelevant or redundant features and improves interpretability by keeping only the most predictive features.

Exam trap

AWS often tests the distinction between feature selection (keeping original features) and dimensionality reduction (creating new features), so candidates mistakenly choose PCA as a feature selection method when it is actually a feature extraction technique.

1169

MCQhard

A data scientist is analyzing a dataset with a binary target variable. The dataset is highly imbalanced (99% negative class). Which metric is most appropriate for evaluating the model's performance during exploratory data analysis?

A.Accuracy

B.Precision

C.F1 Score

D.Area Under the ROC Curve (AUC-ROC)

AnswerD

AUC-ROC is insensitive to class imbalance and provides a global measure of performance.

Why this answer

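The Area Under the ROC Curve (AUC-ROC) measures the trade-off between true positive rate and false positive rate across all classification thresholds, so it does not depend on a single cutoff. Accuracy is misleading here: predicting the negative class for every sample already scores 99%. The F1 score is useful but depends on the chosen threshold.

Precision and recall individually are less comprehensive, because each captures only one side of the error trade-off.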

1170

MCQhard

The exhibit shows an Athena query result from a table. What is the output of the query?

A.3, 3, 4

B.2, 3, 3

C.2, 4, 4

D.2, 4, 3

AnswerC

Correct counts: col2 non-null=2, rows=4, distinct col1=4.

Why this answer

Option C is correct. COUNT(col2) counts non-null values in col2: rows 1 and 3 have non-null values, so 2. COUNT(*) counts all rows: 4.

COUNT(DISTINCT col1) counts distinct col1 values: A, B, C, D = 4. Option A is wrong because COUNT(col2) is 2, not 3. Option B is wrong because COUNT(*) is 4, not 3.

Option D is wrong because COUNT(DISTINCT col1) is 4, not 3.
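
The exhibit table is not reproduced here, so the sketch below rebuilds a hypothetical four-row table consistent with the explanation (col1 values A to D, col2 non-null only in rows 1 and 3) and runs the three aggregates. SQLite stands in for Athena purely for illustration; the table name `t` and the col2 values are made up, and the NULL-handling rules for COUNT shown here are standard SQL behavior.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (col1 TEXT, col2 INTEGER)")

# Hypothetical rows: col2 is non-null only in rows 1 and 3.
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [("A", 10), ("B", None), ("C", 30), ("D", None)],
)

# COUNT(col2) skips NULLs, COUNT(*) counts rows, COUNT(DISTINCT) dedupes.
result = conn.execute(
    "SELECT COUNT(col2), COUNT(*), COUNT(DISTINCT col1) FROM t"
).fetchone()
conn.close()

print(result)  # (2, 4, 4)
```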

1171

MCQmedium

The exhibit shows the result of an Athena query. What does the value '5000' represent?

A.The total number of rows in the table

B.The number of rows where col1 is NULL

C.The number of rows where col1 is not NULL

D.The number of distinct values in col1

AnswerB

The query counts rows with NULL in col1.

Why this answer

Option B is correct because the query counts rows where col1 IS NULL, and the result '5000' is that count. Option D is wrong because the query is not selecting distinct values. Option C is wrong because the query counts NULL rows, not non-NULL.

Option A is wrong because total row count was not queried.

1172

Multi-Selectmedium

A data scientist is training a linear regression model and wants to handle multicollinearity among features. Which TWO actions are appropriate?

Select 2 answers

A.Add interaction terms between features

B.Use Ridge regression (L2 regularization)

C.Use Lasso regression (L1 regularization)

D.Remove one of the highly correlated features

E.Scale all features to have zero mean and unit variance

AnswersB, D

Ridge regression shrinks coefficients of correlated features, reducing their impact.

Why this answer

Ridge regression (L2) adds a penalty that can reduce the impact of correlated features. Removing one of the correlated features directly addresses multicollinearity. Lasso (L1) may also help but is less effective for groups of correlated features.

Scaling features does not remove collinearity. Adding interaction terms increases multicollinearity.

1173

MCQhard

A data scientist is exploring a dataset with 500 features and 10,000 samples. The data scientist computes the pairwise correlation matrix and finds that many features have correlations above 0.9. The data scientist wants to reduce the dataset to 50 features while preserving as much variance as possible. Which technique should be used?

A.Remove all but one feature from each group of highly correlated features.

B.Apply Principal Component Analysis (PCA) and keep the top 50 principal components.

C.Use Linear Discriminant Analysis (LDA) to project to 50 dimensions.

D.Use t-Distributed Stochastic Neighbor Embedding (t-SNE) to reduce to 50 dimensions.

AnswerB

PCA finds orthogonal directions of maximum variance and can reduce dimensionality effectively.

Why this answer

Principal Component Analysis (PCA) is the correct technique because it performs an orthogonal linear transformation that projects the original 500 features into a new coordinate system where the axes (principal components) are ordered by the variance they capture. By keeping the top 50 principal components, the data scientist retains the maximum possible variance in the reduced 50-dimensional space, directly addressing the goal of preserving variance while handling high multicollinearity.

Exam trap

The exam often tests the distinction between unsupervised variance-preserving techniques (PCA) and supervised or visualization-specific techniques (LDA, t-SNE), leading candidates to mistakenly choose LDA for dimensionality reduction without recognizing its supervised nature and dimension limit.

How to eliminate wrong answers

Option A is wrong because simply removing all but one feature from each group of highly correlated features is a heuristic that does not guarantee preserving maximum variance; it discards potentially useful information and does not leverage the correlation structure to create new, uncorrelated features. Option C is wrong because Linear Discriminant Analysis (LDA) is a supervised technique that requires class labels to maximize class separability, not variance preservation, and it can project to at most (number of classes - 1) dimensions, which is typically far fewer than 50. Option D is wrong because t-Distributed Stochastic Neighbor Embedding (t-SNE) is a non-linear, stochastic dimensionality reduction technique primarily used for visualization of high-dimensional data in 2 or 3 dimensions; it does not preserve global variance structure and is not suitable for reducing to 50 dimensions while retaining maximum variance.

1174

Matchingmedium

Match each ML model evaluation concept to its definition.

Concepts and matches

Overfitting: Model performs well on training data but poorly on unseen data

Underfitting: Model fails to capture underlying patterns in data

Bias: Error from wrong assumptions in the learning algorithm

Variance: Error from sensitivity to small fluctuations in training data

Bias-variance tradeoff: Balance between underfitting and overfitting

Why these pairings

These are fundamental concepts in model evaluation.

1175

MCQmedium

A company uses Amazon SageMaker to train a classification model. The training job fails with an error indicating that the algorithm requires a GPU but the instance type does not have one. The scientist used the built-in XGBoost algorithm. What should the scientist do to resolve the issue?

A.Choose a CPU instance type for the training job

B.Install a GPU-enabled version of XGBoost in the training container

C.Change the algorithm to a deep learning algorithm

D.Use a larger GPU instance type

AnswerA

XGBoost can run on CPU; use CPU instance.

Why this answer

XGBoost does not require a GPU; it can run on CPU. The error may be due to using a GPU-only algorithm version or a misconfiguration. The simplest solution is to choose a CPU instance type, so Option A is correct.

Option B is wrong because installing a GPU-enabled version of XGBoost is unnecessary. Option C is wrong because changing the algorithm is not needed. Option D is wrong because a larger GPU instance type does not address the underlying misconfiguration.

1176

MCQmedium

A machine learning team is deploying a model that performs real-time inference on streaming data from Amazon Kinesis Data Streams. The model requires sub-100ms latency. Which deployment option should the team choose?

A.Use Amazon SageMaker batch transform

B.Use Amazon SageMaker asynchronous inference

C.Deploy the model on an Amazon SageMaker real-time endpoint

D.Deploy a custom inference container on AWS Lambda

AnswerC

Real-time endpoints provide low-latency inference.

Why this answer

Amazon SageMaker real-time endpoints (Option C) provide low-latency inference suitable for sub-100ms requirements. Option A is wrong because SageMaker batch transform is for offline, batch predictions, not real-time inference. Option B is wrong because SageMaker asynchronous inference queues requests and is intended for near-real-time workloads with longer latencies.

Option D is wrong because a custom inference container on AWS Lambda is possible but may not meet sub-100ms consistently due to cold starts.

1177

MCQhard

A company uses Amazon SageMaker to train a regression model. After training, the data scientist notices that the training loss decreases but validation loss increases after a few epochs. Which EDA technique could have helped predict this behavior?

A.Create box plots of each feature to identify outliers

B.Plot learning curves showing training and validation loss over epochs

C.Generate residual plots to check heteroscedasticity

D.Plot confusion matrix on the validation set

AnswerB

Learning curves expose the gap between training and validation loss as it opens.

Why this answer

Option B is correct because plotting learning curves (training and validation loss vs. epochs) would show overfitting. Option D is wrong because a confusion matrix is for classification. Option A is wrong because box plots show outliers but not overfitting.

Option C is wrong because residual plots check linear regression assumptions such as heteroscedasticity, not overfitting across epochs.
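
The pattern a learning curve reveals can also be checked numerically. The sketch below finds the first epoch where validation loss rises while training loss keeps falling; the function name and the per-epoch loss values are hypothetical, made up for illustration rather than taken from any real training job.

```python
def first_divergence_epoch(train_loss, val_loss):
    """Return the first epoch (1-based) where validation loss rises while
    training loss still falls, or None if that never happens."""
    for i in range(1, len(train_loss)):
        train_falling = train_loss[i] < train_loss[i - 1]
        val_rising = val_loss[i] > val_loss[i - 1]
        if train_falling and val_rising:
            return i + 1
    return None

# Hypothetical per-epoch losses.
train_loss = [0.90, 0.60, 0.45, 0.35, 0.28, 0.22]
val_loss = [0.95, 0.70, 0.58, 0.60, 0.66, 0.73]

epoch = first_divergence_epoch(train_loss, val_loss)
print(epoch)  # 4
```

In practice a check like this is usually smoothed over several epochs, since a single noisy uptick in validation loss does not necessarily indicate overfitting.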

1178

Multi-Selecteasy

A data scientist is evaluating a classification model. The confusion matrix shows that the model has 50 true positives, 100 true negatives, 20 false positives, and 30 false negatives. Which TWO metrics can be calculated from this confusion matrix? (Choose two.)

Select 2 answers

A.R-squared

B.F1 score

C.Recall

D.Root mean squared error

E.Precision

AnswersC, E

Recall = TP/(TP+FN) can be directly calculated.

Why this answer

Options C and E are correct because recall and precision are computed directly from TP, FP, and FN: recall = TP/(TP+FN) = 50/80 = 0.625, and precision = TP/(TP+FP) = 50/70 ≈ 0.714. Option A is wrong because R-squared is a regression metric. Option D is wrong because RMSE is a regression metric.

Option B (F1 score) can also be derived from these counts, since it is the harmonic mean of precision and recall. The question asks for only TWO, and precision and recall are the metrics read most directly from the confusion matrix, with F1 requiring an extra step.
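
The arithmetic for the confusion matrix in the question can be worked through in a few lines. The counts come from the question; the variable names are arbitrary.

```python
tp, tn, fp, fn = 50, 100, 20, 30  # counts from the question

precision = tp / (tp + fp)   # 50 / 70
recall = tp / (tp + fn)      # 50 / 80
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(round(precision, 3))  # 0.714
print(round(recall, 3))     # 0.625
print(round(f1, 3))         # 0.667
print(round(accuracy, 3))   # 0.75
```

R-squared and RMSE cannot be computed this way because they need continuous predictions and targets, not class counts.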

1179

Multi-Selectmedium

A data scientist is using Amazon SageMaker to perform exploratory data analysis on a dataset with missing values and outliers. Which TWO actions should the scientist take to understand the data quality? (Choose TWO.)

Select 2 answers

A.Build a scatterplot matrix to visualize pairwise relationships

B.Use histograms to visualize the distribution of each numerical feature

C.Plot a confusion matrix to assess class separation

D.Create a correlation matrix to identify redundant features

E.Generate summary statistics using df.describe() in a SageMaker notebook

AnswersB, E

Histograms reveal outliers, skewness, and missing data patterns (e.g., zero counts).

Why this answer

Option E is correct because generating summary statistics helps identify missing counts and outliers via min/max. Option B is correct because visualizing distributions with histograms helps spot outliers and skewness. Option D is wrong because a correlation matrix does not directly show missing values or outliers.

Option C is wrong because a confusion matrix is for classification models, not for data exploration. Option A is wrong because a scatterplot matrix shows pairwise relationships, not missing values.

1180

MCQeasy

A data scientist needs to version control datasets used for machine learning experiments. Which AWS service should the data scientist use?

A.AWS Lake Formation

B.Amazon SageMaker Feature Store

C.Amazon SageMaker Model Registry

D.Amazon S3 with versioning enabled

AnswerD

S3 versioning provides dataset version control.

Why this answer

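Option D is correct because Amazon S3 with versioning enabled keeps every version of an object, so each dataset revision can be retrieved. Option A is wrong because AWS Lake Formation does not version datasets. Option B is wrong because SageMaker Feature Store is for storing and serving features.

Option C is wrong because SageMaker Model Registry versions models, not datasets.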

1181

Multi-Selectmedium

Which THREE techniques are commonly used to detect outliers in a dataset? (Select THREE.)

Select 3 answers

A.Interquartile range (IQR)

B.k-means clustering

C.Principal component analysis (PCA)

D.Z-score

E.Isolation Forest

AnswersA, D, E

IQR, Z-score, and Isolation Forest are common outlier detection techniques.

Why this answer

Options A, D, and E are correct. Z-score and IQR are standard statistical methods, and Isolation Forest is a machine learning algorithm for anomaly detection. Option C is wrong because PCA is primarily a dimensionality reduction technique; it can support outlier detection in some contexts, but that is not its common use.

Option B is wrong because k-means is a clustering algorithm, not specifically an outlier detection method.
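
The two statistical methods can be sketched with the Python standard library. The function names, thresholds, and sample values below are assumptions chosen for illustration; Isolation Forest is omitted because it needs a machine learning library.

```python
import statistics

def zscore_outliers(data, threshold=3.0):
    """Flag values whose distance from the mean exceeds threshold std devs."""
    mean = statistics.mean(data)
    std = statistics.pstdev(data)
    if std == 0:
        return []
    return [x for x in data if abs(x - mean) / std > threshold]

def iqr_outliers(data, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(data, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [x for x in data if x < low or x > high]

# Hypothetical sample with one extreme value.
values = [10, 11, 12, 12, 13, 13, 14, 15, 95]

# A threshold below the usual 3.0 is used here because very small samples
# limit how large a z-score can get.
print(zscore_outliers(values, threshold=2.5))  # [95]
print(iqr_outliers(values))                    # [95]
```

The IQR rule relies on quartiles and is therefore less affected by the extreme value itself, whereas the z-score uses a mean and standard deviation that the outlier inflates.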

1182

MCQhard

A data engineering team is designing a data lake on Amazon S3. The data is ingested from multiple sources in JSON, CSV, and Parquet formats. The team needs to make the data available for analysis using Amazon Athena and Amazon Redshift Spectrum. The team wants to minimize data transformation costs and storage overhead. Which data storage approach should the team use?

A.Load the data into Amazon Redshift cluster and then unload to S3 in Parquet

B.Store the data in its original format in S3 and use Athena to query directly

C.Store the data in its original format and use AWS Glue to convert to Parquet when queried

D.Convert all data to Apache Parquet before storing in S3

AnswerD

Parquet is columnar, reducing storage and improving query performance.

Why this answer

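Option D is correct because a columnar format such as Parquet reduces storage through compression and improves query performance in both Athena and Redshift Spectrum, which read only the columns a query needs. Option A is wrong because loading the data into an Amazon Redshift cluster and unloading it again adds a relational database step that is not a data lake approach. Option B is wrong because keeping raw JSON and CSV inflates storage and the amount of data scanned per query.

Option C is wrong because converting to Parquet at query time repeats the transformation work instead of paying for it once at ingestion.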

Full explanation →

1183

MCQeasy

A data scientist is using Amazon SageMaker to train a model. The training job is taking longer than expected. The data scientist notices that the GPU utilization is low. Which action would most likely improve GPU utilization?

A.Change to a CPU-based instance

B.Increase the batch size

C.Decrease the batch size

D.Use a larger instance type

E.Enable data augmentation

AnswerB

Larger batch sizes keep GPU busy.

Why this answer

Option B is correct because increasing the batch size allows the GPU to process more data per step, improving utilization. Option C (decrease the batch size) would reduce utilization further. Option D (use a larger instance type) may help but is more costly.

Option E (enable data augmentation) increases the amount of data, not GPU utilization. Option A (change to a CPU-based instance) would make training slower.

Full explanation →

1184

MCQeasy

A company is building a recommendation system for an e-commerce platform. The data includes user IDs and item IDs. Which SageMaker built-in algorithm is most appropriate?

A.BlazingText

B.XGBoost

C.Factorization Machines

D.Image Classification

AnswerC

Designed for recommendation.

Why this answer

Factorization Machines (FM) are specifically designed for recommendation tasks with sparse, high-dimensional categorical data like user IDs and item IDs. They model pairwise interactions between features (e.g., user-item interactions) using factorized parameters, making them highly effective for collaborative filtering and implicit feedback scenarios in e-commerce.

Exam trap

The trap here is that candidates often choose XGBoost (B) because it is a versatile algorithm, but they overlook that FM is purpose-built for sparse, high-dimensional interaction data and directly models pairwise feature interactions without manual feature engineering.

How to eliminate wrong answers

Option A is wrong because BlazingText is optimized for word embeddings and text classification, not for collaborative filtering or sparse user-item interaction matrices. Option B is wrong because XGBoost is a gradient boosting tree-based algorithm that struggles with extremely sparse, high-cardinality categorical features without extensive feature engineering, and it does not inherently model pairwise interactions like FM. Option D is wrong because Image Classification is designed for convolutional neural network tasks on pixel data, not for tabular or recommendation data with user and item IDs.

Full explanation →

1185

MCQhard

A data engineer is building a data pipeline that uses AWS Lambda to process records from an SQS queue and write results to an S3 bucket. The Lambda function processes each record individually and writes a separate file to S3. The team notices high latency and wants to reduce the number of S3 PUT requests to improve performance and reduce cost. Which approach should the data engineer take?

A.Use S3 multipart upload for each record to improve throughput.

B.Increase the Lambda function's memory allocation to improve processing speed.

C.Use S3 Batch Operations to process the records in batches.

D.Aggregate multiple records into a single file in a DynamoDB table, then periodically write the aggregated data to S3.

AnswerD

Aggregation reduces the number of S3 PUT requests by writing larger files less frequently.

Why this answer

Option D is correct because buffering records in a DynamoDB table and periodically writing the aggregated data to S3 as larger files reduces the number of PUT requests. Option B (increasing Lambda memory) does not reduce the S3 PUT count. Option C (S3 Batch Operations) acts on existing objects, not on incoming data.

Option A (multipart upload) is for large objects, not for many small objects.
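The effect of aggregation on request count can be illustrated with a small sketch. It only models the batching step: `batch_records` and `to_object_body` are hypothetical helper names, the batch size of 1000 is arbitrary, and the buffering layer (DynamoDB in the answer) and the actual S3 write are deliberately left out.

```python
from typing import Iterable, Iterator, List


def batch_records(records: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Group incoming records into lists of at most batch_size items."""
    batch: List[str] = []
    for record in records:
        batch.append(record)
        if len(batch) >= batch_size:
            yield batch
            batch = []
    if batch:  # flush the final, possibly smaller, batch
        yield batch


def to_object_body(batch: List[str]) -> str:
    """Join a batch into one newline-delimited body (one S3 object)."""
    return "\n".join(batch) + "\n"


# Hypothetical input: 2500 small JSON records.
records = [f'{{"id": {i}}}' for i in range(2500)]
bodies = [to_object_body(b) for b in batch_records(records, batch_size=1000)]

# One object per record would need len(records) PUT requests;
# aggregated, it needs only len(bodies).
print(len(records), "records ->", len(bodies), "objects")
```

In a real pipeline each body would correspond to a single PUT request, so the request count drops roughly by the batch size.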

Full explanation →

1186

MCQeasy

A data engineer needs to ingest streaming data from an on-premises Kafka cluster into Amazon S3 with minimal operational overhead. Which AWS service should be used to stream the data into S3 without managing servers?

A.Amazon Kinesis Data Streams

B.AWS Glue

C.Amazon Managed Streaming for Apache Kafka (Amazon MSK)

D.Amazon Kinesis Data Firehose

AnswerD

Kinesis Data Firehose can directly ingest streaming data and deliver to S3 without managing servers.

Why this answer

Amazon Kinesis Data Firehose is the correct service for loading streaming data into S3 without managing servers. Option A (Amazon Kinesis Data Streams) requires a consumer to process and deliver the data; Option B (AWS Glue) is for ETL jobs, not real-time streaming; Option C (Amazon MSK) is a managed Kafka service but still requires cluster management.

Full explanation →

1187

MCQhard

A company has a real-time inference endpoint on Amazon SageMaker that uses a custom container. The endpoint is experiencing high latency and occasional 502 errors. The logs from the container show that the model inference time is low, but the overall response time is high. Which step is MOST likely to reduce the latency?

A.Switch to batch transform to process requests in batches

B.Use a larger instance type for the endpoint

C.Optimize the model to reduce inference time

D.Increase the number of instances and enable auto-scaling

AnswerD

More instances can handle more concurrent requests, reducing queuing and latency.

Why this answer

Option D is correct because increasing the endpoint's instance count and enabling auto-scaling can distribute the load and reduce queuing delays. Option C is wrong because the inference time is already low, so optimizing the model further won't help much. Option B is wrong because increasing instance size may help but is less cost-effective than scaling out.

Option A is wrong because batch transform is for offline inference, not real-time.

Full explanation →

1188

MCQmedium

A model deployed on a SageMaker endpoint is producing predictions that are consistently biased against a certain demographic. Which step should the team take FIRST to address this issue?

A.Enable SageMaker Model Monitor to track prediction quality

B.Switch to a different algorithm that is less prone to bias

C.Use SageMaker Clarify to analyze bias in the training data and predictions

D.Retrain the model with balanced data

AnswerC

Clarify can detect and explain bias, guiding corrective actions.

Why this answer

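The first step is to analyze the data and model for bias, and SageMaker Clarify can detect and measure it. Option C is correct.

Option A is wrong as a first step because Model Monitor tracks prediction quality over time but does not by itself identify where the bias originates. Option B changes the model without knowing the cause. Option D is a remediation to apply after the analysis shows that the training data is the source of the bias.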

Full explanation →

1189

MCQhard

A team is using Amazon SageMaker Data Wrangler to perform exploratory data analysis on a large dataset stored in S3. The dataset contains missing values, outliers, and categorical variables with high cardinality. The team wants to understand data distributions and relationships before modeling. Which combination of Data Wrangler features should they use?

A.Generate a data quality report, view histograms, and create scatter plots for selected features.

B.Drop rows with missing values and visualize box plots for numerical features.

C.Use imputation to handle missing values and one-hot encoding for categorical features.

D.Generate a data quality report and a correlation heatmap.

AnswerA

Data quality report provides summary statistics and missing values; histograms and scatter plots show distributions and relationships.

Why this answer

Option A is correct because Data Wrangler's data quality report provides summary statistics and missing value analysis, histograms show distributions, and scatter plots reveal relationships between variables. Option D is incorrect because a correlation heatmap summarizes pairwise relationships but does not show the per-feature distributions the team wants to understand.

Option C is incorrect because imputation and one-hot encoding are transformations, not EDA steps. Option B is incorrect because dropping rows with missing values is part of data preparation, not initial EDA.

Full explanation →

1190

MCQhard

A data scientist is performing exploratory data analysis on a large dataset stored in Amazon S3 (100 GB, CSV format, 500 columns). The dataset contains customer transaction records with features such as transaction amount, timestamp, customer ID, and numerous categorical variables (e.g., product category, payment method, location). The scientist wants to understand the distribution of transaction amounts across different product categories and identify any outliers. They have an Amazon SageMaker notebook instance of type ml.t3.medium and are using pandas. However, when trying to load the entire dataset into a DataFrame using pd.read_csv('s3://bucket/data.csv'), the notebook crashes with a memory error. Additionally, the scientist suspects that some categorical columns have high cardinality (e.g., product category has thousands of unique values), and there are missing values in several columns. What is the MOST efficient approach to perform the EDA without modifying the original dataset or using additional AWS services?

A.Use the SageMaker SDK to launch a parallel processing job with PySpark and read the data into a Spark DataFrame, then compute statistics and visualize with matplotlib.

B.Use the S3 Select API to filter rows and columns before loading into pandas, reducing the data size; then use pandas for EDA.

C.Use SageMaker Data Wrangler to import the dataset, create a flow to handle missing values and reduce cardinality, and export a sample to the notebook for analysis.

D.Use pandas with chunksize parameter to iterate through the dataset in chunks, compute per-chunk statistics, and aggregate results; for high-cardinality columns, use value_counts() with dropna=False and then plot the top 20 categories.

AnswerD

Directly solves memory issue by chunking; handles high cardinality by limiting to top categories; no extra services needed.

Why this answer

Option D is correct because it directly addresses the memory issue by processing data in chunks and handles high-cardinality categorical columns by focusing on top categories, all within the existing pandas environment without additional services. Option A requires PySpark, which is not set up on the current instance and adds complexity. Option B, S3 Select, can reduce data size but cannot perform the aggregation needed (e.g., distribution across categories) without pulling all rows; it's more suitable for simple filtering.

Option C, SageMaker Data Wrangler, is a separate service that requires additional setup and is not the most efficient for an ad-hoc EDA; it also modifies the workflow.
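The chunked approach can be sketched as follows. This is an illustration under stated assumptions: the column names `product_category` and `amount`, the `<missing>` label, and the tiny in-memory CSV (standing in for the 100 GB file) are all hypothetical, and the chunk size of 2 is only for demonstration.

```python
import io

import pandas as pd

# Small in-memory stand-in for the large CSV; columns are hypothetical.
csv_text = """product_category,amount
books,10.0
books,20.0
toys,5.0
,7.0
toys,15.0
games,100.0
"""

summary = None
reader = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["product_category", "amount"],  # skip the columns not needed
    chunksize=2,  # in practice something like 1_000_000 rows
)
for chunk in reader:
    # Keep rows with a missing category visible instead of dropping them.
    chunk["product_category"] = chunk["product_category"].fillna("<missing>")
    part = chunk.groupby("product_category")["amount"].agg(["count", "sum", "max"])
    # Merge this chunk's partial statistics into the running summary.
    summary = part if summary is None else pd.concat([summary, part])
    summary = summary.groupby(level=0).agg(
        {"count": "sum", "sum": "sum", "max": "max"}
    )

summary["mean"] = summary["sum"] / summary["count"]
top = summary.sort_values("count", ascending=False).head(20)
print(top)
```

Counts, sums, and maxima combine exactly across chunks, which is why they are used here. Medians and other quantiles do not, so an exact IQR-based outlier check would need a second pass or an approximate method.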

Full explanation →

1191

MCQmedium

A company is building a binary classifier to detect fraudulent transactions. The dataset is highly imbalanced with only 0.1% positive cases. The data scientist uses logistic regression and obtains 99.9% accuracy on the test set. Which metric should the data scientist use to evaluate the model's performance?

A.ROC AUC

B.Precision-recall curve

C.Precision

D.F1 score

AnswerB

Precision-recall curves focus on the positive class and handle imbalance well.

Why this answer

With only 0.1% positive cases, accuracy is misleading because a model that always predicts 'not fraudulent' achieves 99.9% accuracy. The precision-recall curve focuses on the positive class and is robust to extreme class imbalance, showing the trade-off between precision and recall across thresholds. This makes it the best choice for evaluating a binary classifier on highly imbalanced fraud detection data.

Exam trap

The trap here is that candidates see 'ROC AUC' as a standard metric and forget that it can be inflated by a large number of true negatives in imbalanced datasets, making precision-recall the correct choice for evaluating rare event classifiers.

How to eliminate wrong answers

Option A is wrong because ROC AUC can be overly optimistic on highly imbalanced datasets; the area under the ROC curve is dominated by the large number of true negatives, masking poor performance on the rare positive class. Option C is wrong because precision alone is a single-point metric that does not capture the trade-off with recall, so it cannot fully evaluate model performance across different decision thresholds. Option D is wrong because the F1 score is a harmonic mean of precision and recall at a single threshold, which may not reflect the model's overall ability to rank positive cases; it is less informative than the full precision-recall curve for threshold selection in imbalanced settings.
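The accuracy claim is easy to verify with a short calculation. The sketch below uses a synthetic label vector with 0.1% positives and a hypothetical "always predict not fraudulent" classifier; the helper name `confusion` is an assumption for illustration.

```python
def confusion(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


# Synthetic data: 10 fraud cases in 10,000 transactions (0.1% positives).
y_true = [1] * 10 + [0] * 9990
always_negative = [0] * len(y_true)

tp, fp, fn, tn = confusion(y_true, always_negative)
accuracy = (tp + tn) / len(y_true)
recall = tp / (tp + fn) if (tp + fn) else 0.0
# Precision is undefined when nothing is predicted positive; report 0.0 here.
precision = tp / (tp + fp) if (tp + fp) else 0.0

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
```

The classifier catches no fraud at all, yet its accuracy matches the 99.9% figure from the question, which is why precision and recall are the quantities worth plotting.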

Full explanation →

1192

MCQhard

A data scientist creates the above IAM policy and attaches it to a role used by an Amazon SageMaker notebook instance. When trying to save a file to the S3 bucket, the operation fails. What is the missing permission?

A.kms:Decrypt

B.s3:ListBucket

C.kms:GenerateDataKey

D.s3:GetObject

AnswerC

If the bucket uses SSE-KMS, PutObject requires kms:GenerateDataKey to encrypt the object.

Why this answer

Option C is correct because the failure is most likely due to missing encryption permissions: if the bucket uses SSE-KMS, writing an object with s3:PutObject also requires kms:GenerateDataKey on the KMS key. Option B is wrong because s3:ListBucket is for listing objects.

Option A is wrong because kms:Decrypt is for reading encrypted objects. Option D is wrong because s3:GetObject is for reading.

Full explanation →

1193

MCQmedium

A company wants to deploy a machine learning model that requires GPU acceleration for inference. The model is small and can fit on a single GPU. Which SageMaker endpoint configuration is MOST cost-effective?

A.Use a ml.p3.16xlarge instance with 8 GPUs.

B.Use a SageMaker Serverless Inference endpoint.

C.Use a Multi-Model Endpoint on a ml.g4dn.xlarge instance.

D.Use a ml.p3.2xlarge instance with 1 GPU and enable automatic scaling.

AnswerD

A single GPU instance with scaling provides cost-effective real-time inference.

Why this answer

Option D is the most cost-effective because it uses a single-GPU ml.p3.2xlarge instance, which matches the requirement that the model fits on one GPU, and enables automatic scaling to handle variable traffic without over-provisioning. This avoids paying for unused GPU capacity while still providing the necessary GPU acceleration for inference.

Exam trap

The trap here is that candidates often choose a larger GPU instance (like A) thinking it provides better performance, or select Serverless Inference (B) assuming it supports all instance types, but the exam tests the specific constraint that GPU acceleration is required and that Serverless Inference is CPU-only.

How to eliminate wrong answers

Option A is wrong because a ml.p3.16xlarge instance with 8 GPUs is massively over-provisioned for a model that fits on a single GPU, leading to unnecessary cost. Option B is wrong because SageMaker Serverless Inference does not support GPU acceleration; it uses CPU-based compute, which would not meet the GPU requirement. Option C is wrong because a Multi-Model Endpoint on a ml.g4dn.xlarge instance, while cost-effective for hosting multiple models, uses a single GPU that must be shared among all loaded models, potentially causing contention and not being the most cost-effective for a single model that fits on one GPU.

Full explanation →

1194

MCQmedium

Refer to the exhibit. A data scientist is deploying a PyTorch model on a SageMaker endpoint. When the endpoint is invoked, the above error appears in CloudWatch logs. What is the MOST likely cause?

A.The endpoint instance type does not support the required CUDA version.

B.The endpoint instance does not have enough memory to load the model.

C.The input tensor shape does not match the model's expected input shape.

D.The model artifact was not properly saved or is missing from the S3 location.

AnswerD

If the model file is missing or corrupted, load_model returns None.

Why this answer

The error occurs in the model_fn function, which loads the model. The error 'NoneType' object has no attribute 'shape' suggests that the model object is None, meaning the model was not loaded correctly. The most likely cause is that the model file (model.tar.gz) does not contain the expected model artifact, or the model file is missing.

An incorrect input tensor shape (C) would cause errors during inference, not loading. Insufficient memory (B) would cause out-of-memory errors. An unsupported CUDA version on the instance type (A) would not cause a NoneType error.

Full explanation →

1195

MCQhard

A data scientist is building a binary classifier for loan default prediction. The cost of a false negative (missing a default) is 10 times higher than the cost of a false positive. Which evaluation metric is MOST appropriate?

A.Precision

B.F-beta score with beta=2

C.Accuracy

D.Area under the ROC curve

AnswerB

F-beta with beta>1 gives more weight to recall, minimizing costly false negatives.

Why this answer

The F-beta score with beta=2 is the most appropriate metric because it weights recall (sensitivity) higher than precision, which is critical when false negatives are 10 times more costly than false positives. With beta=2, recall is treated as twice as important as precision (beta enters the formula as beta^2 = 4), which pushes the metric in the direction of the asymmetric cost structure even though it does not encode the exact 10:1 ratio. This allows the model to be tuned to minimize missed defaults, even at the expense of more false alarms.

Exam trap

The trap here is that candidates often default to AUC-ROC as a 'balanced' metric without realizing it does not incorporate asymmetric error costs, leading them to overlook the F-beta score which is explicitly designed for such scenarios.

How to eliminate wrong answers

Option A is wrong because precision focuses only on the proportion of true positives among positive predictions, ignoring false negatives entirely, so it cannot account for the higher cost of missing defaults. Option C is wrong because accuracy treats all correct predictions equally and is misleading when classes are imbalanced, which is common in loan default prediction, and it does not incorporate the differential cost of errors. Option D is wrong because the area under the ROC curve (AUC-ROC) measures the model's ability to discriminate between classes across all thresholds but does not directly optimize for a specific cost ratio; it is a rank-based metric that does not penalize false negatives more heavily.
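A short calculation shows how beta shifts the ranking between two models. The formula is the standard F-beta definition; the two precision/recall pairs are made-up numbers for illustration, and `f_beta` is a hypothetical helper name.

```python
def f_beta(precision: float, recall: float, beta: float) -> float:
    """F-beta = (1 + beta^2) * P * R / (beta^2 * P + R)."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical models: one favors precision, the other favors recall.
high_precision = {"precision": 0.9, "recall": 0.5}
high_recall = {"precision": 0.5, "recall": 0.9}

for name, m in [("high_precision", high_precision), ("high_recall", high_recall)]:
    f1 = f_beta(m["precision"], m["recall"], beta=1)
    f2 = f_beta(m["precision"], m["recall"], beta=2)
    print(f"{name}: F1={f1:.3f} F2={f2:.3f}")
```

F1 cannot separate the two models because it treats precision and recall symmetrically, while F2 prefers the high-recall model, which is the one that misses fewer defaults.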

Full explanation →

1196

MCQeasy

A data scientist is building a time series forecasting model for monthly sales. The data shows strong seasonality with a yearly pattern. They plan to use Amazon Forecast. Which algorithm should they choose?

A.XGBoost

B.K-means clustering

C.DeepAR+

D.Linear regression

AnswerC

DeepAR+ is designed for time series with seasonality and trends.

Why this answer

Amazon Forecast's DeepAR+ is built for time series with seasonality.

Full explanation →

1197

Multi-Selecthard

Which THREE of the following are common causes of overfitting in machine learning models?

Select 3 answers

A.Using a complex model like a deep neural network on a small dataset

B.Model has too many parameters relative to the number of training samples

C.Having a large dataset with many samples

D.Training for too many epochs

E.Using regularization techniques

AnswersA, B, D

Complex models on small data overfit.

Why this answer

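Option A is correct because a complex model like a deep neural network has high capacity and can easily memorize noise and patterns specific to a small dataset, rather than learning generalizable features. With limited training samples, the model fails to capture the underlying data distribution, leading to poor performance on unseen data. Option B is correct for the same reason: too many parameters relative to the number of training samples gives the model enough freedom to fit noise. Option D is correct because training for too many epochs lets the model keep fitting the training set after validation performance has stopped improving.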

Exam trap

The exam often tests the misconception that more data or regularization causes overfitting, when in fact both are standard countermeasures; the trap is confusing correlation with causation in model training dynamics.

Full explanation →

1198

MCQeasy

A machine learning engineer is using Amazon SageMaker to train a model. The training dataset is 2 TB and is stored in Amazon S3. The engineer wants to reduce the training time by improving data loading performance. Which data ingestion mode should be used?

A.Pipe mode

B.Incremental mode

C.File mode

D.Fast file mode

AnswerA

Pipe mode streams data from S3 directly to the algorithm, reducing I/O wait time.

Why this answer

SageMaker Pipe mode streams data directly from S3 to the training algorithm, which can reduce training time by overlapping data loading and training, especially for large datasets.

Full explanation →

1199

MCQhard

A data scientist is examining a dataset for a binary classification problem. The target variable has a 1:1000 imbalance. Which technique should be used to assess model performance during exploratory data analysis?

A.Area under the Precision-Recall curve

B.F1 score

C.Area under the ROC curve

D.Cohen's kappa

AnswerA

PR AUC is sensitive to class imbalance and focuses on the positive class.

Why this answer

With a 1:1000 class imbalance, the positive class is extremely rare. The Area Under the Precision-Recall curve (AUPRC) focuses on the performance of the positive class and is sensitive to changes in precision and recall, making it a robust metric for imbalanced datasets. Unlike ROC AUC, which can be overly optimistic when negatives dominate, AUPRC provides a realistic assessment of model performance on the minority class.

Exam trap

The trap here is that candidates often default to ROC AUC as the universal metric for classification, not realizing that in extreme imbalance, ROC AUC can be misleadingly high because the false positive rate is diluted by the vast number of true negatives.

How to eliminate wrong answers

Option B (F1 score) is wrong because it is a threshold-dependent metric that evaluates a single point on the precision-recall curve, not the overall performance across all thresholds, and it can be misleading when comparing models without a fixed threshold. Option C (Area under the ROC curve) is wrong because ROC AUC is insensitive to class imbalance; it treats false positive rate (which is dominated by the majority class) equally, often yielding deceptively high scores even when the model fails to identify the minority class. Option D (Cohen's kappa) is wrong because it measures inter-rater agreement adjusted for chance, which is not a standard metric for binary classification model evaluation and does not specifically address the imbalance problem.

Full explanation →

1200

MCQhard

A data scientist is tuning a gradient boosting model using Amazon SageMaker's Automatic Model Tuning (hyperparameter optimization). The objective metric is validation:auc. After 50 training jobs, the best model still has a validation AUC of only 0.65. The scientist suspects overfitting because the training AUC is 0.99. Which hyperparameter configuration is MOST likely to reduce overfitting?

A.Increase lambda from 1 to 10

B.Increase num_round from 100 to 500

C.Increase max_depth from 6 to 12

D.Increase subsample from 0.5 to 1.0

AnswerA

Higher L2 regularization reduces overfitting by penalizing large weights.

Why this answer

Increasing lambda (L2 regularization) from 1 to 10 adds a stronger penalty on the magnitude of leaf weights in the gradient boosting model. This directly reduces overfitting by discouraging the model from fitting noise in the training data, which is consistent with the observed gap between training AUC (0.99) and validation AUC (0.65). In XGBoost, lambda controls the L2 regularization term on weights, and a higher value forces the model to be simpler and more generalizable.

Exam trap

The trap here is that candidates often assume increasing model complexity (e.g., more rounds, deeper trees) will improve performance, but the question explicitly describes overfitting, so the correct answer must reduce complexity or increase regularization, which is lambda.

How to eliminate wrong answers

Option B is wrong because increasing num_round (number of boosting rounds) from 100 to 500 would increase model complexity and training time, likely worsening overfitting by allowing the model to further memorize the training data. Option C is wrong because increasing max_depth from 6 to 12 allows trees to grow deeper, capturing more specific interactions and noise, which exacerbates overfitting rather than reducing it. Option D is wrong because increasing subsample from 0.5 to 1.0 means using all training data for each tree, removing the stochastic regularization effect that subsampling provides, which would reduce generalization and increase overfitting risk.

Full explanation →
