Knowledge + Practice

AWS Certified Machine Learning Specialty (MLS-C01): Questions 1051–1125

1755 questions total · 24 pages · All types, answers revealed


Page 15 of 24

1051

MCQhard

A machine learning engineer is analyzing a dataset that contains a categorical feature 'country' with 200 unique values. The target variable is binary. The engineer wants to use this feature in a linear model. Which encoding method should be applied during EDA to prepare the data for modeling, considering the high cardinality?

A.Target encoding with cross-validation

B.Label encoding

C.Frequency encoding

D.One-hot encoding

AnswerA

Target encoding captures the relationship with the target, and cross-validation prevents data leakage.

Why this answer

Option A is correct because target encoding (also called mean encoding) replaces each category with the mean of the target for that category, which keeps a high-cardinality feature as a single numeric column suitable for a linear model. Computing the encoding within cross-validation folds keeps the target from leaking into the feature and causing overfitting. Option D is wrong because one-hot encoding would create about 200 dummy variables (199 if one level is dropped), leading to high dimensionality. Option B is wrong because label encoding imposes an arbitrary ordinal relationship.

Option C is wrong because frequency encoding may not capture the relationship with the target.
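
As an illustration of the idea, here is a minimal sketch of out-of-fold target encoding in plain Python. The function name `oof_target_encode`, the round-robin fold assignment, and the additive `smoothing` term are assumptions made for this example; they are not part of the question or of any particular library.

```python
from collections import defaultdict

def oof_target_encode(categories, targets, n_folds=5, smoothing=10.0):
    """Encode each row using target statistics from the other folds only."""
    if smoothing <= 0:
        raise ValueError("smoothing must be positive")
    n = len(categories)
    encoded = [0.0] * n
    for fold in range(n_folds):
        sums = defaultdict(float)
        counts = defaultdict(int)
        train_total, train_count = 0.0, 0
        # Accumulate statistics from every row NOT in the held-out fold.
        for i in range(n):
            if i % n_folds == fold:
                continue
            sums[categories[i]] += targets[i]
            counts[categories[i]] += 1
            train_total += targets[i]
            train_count += 1
        prior = train_total / train_count if train_count else 0.0
        # Encode the held-out rows; unseen categories fall back to the prior.
        for i in range(fold, n, n_folds):
            cat = categories[i]
            encoded[i] = (sums[cat] + smoothing * prior) / (counts[cat] + smoothing)
    return encoded

countries = ["US", "US", "DE", "DE", "FR"]
churned = [1, 0, 1, 1, 0]
encoded = oof_target_encode(countries, churned, n_folds=5, smoothing=1.0)
```

Because each row's own target never contributes to its encoded value, the feature cannot simply memorize the label, which is the leakage the answer refers to.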


1052

Multi-Selectmedium

A data scientist is building a recommender system using Amazon SageMaker. The dataset contains user-item interactions with implicit feedback (clicks). Which THREE evaluation metrics are appropriate for this use case?

Select 3 answers

A.Root Mean Squared Error (RMSE)

B.Precision@k

C.Mean Average Precision (MAP)

D.Recall@k

E.Area Under the ROC Curve (AUC-ROC)

AnswersB, C, D

Precision@k measures relevance of top-k recommendations.

Why this answer

Precision@k is appropriate for implicit feedback (clicks) because it measures the proportion of relevant items among the top-k recommendations, focusing on the accuracy of the ranked list. Recall@k measures the proportion of a user's relevant items that appear in the top k, and Mean Average Precision (MAP) rewards placing relevant items near the top of the list, averaged across users. In recommender systems with implicit feedback, where only positive interactions are observed, these ranking metrics are standard because they evaluate the quality of the top recommendations without requiring explicit ratings.

Exam trap

The trap here is that candidates often treat regression metrics (RMSE) or binary classification metrics (AUC-ROC) as applicable to implicit feedback, not realizing that recommender systems with implicit feedback require ranking-based metrics that handle only positive observations and no explicit negative labels.
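
The ranking metrics named in the answer can be sketched in a few lines of plain Python. The function names and the toy item IDs below are illustrative assumptions; MAP is the mean of `average_precision` over all users.

```python
def precision_at_k(recommended, relevant, k):
    """Fraction of the top-k recommendations that are relevant."""
    hits = sum(1 for item in recommended[:k] if item in relevant)
    return hits / k

def recall_at_k(recommended, relevant, k):
    """Fraction of the relevant items that appear in the top k."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in recommended[:k] if item in relevant)
    return hits / len(relevant)

def average_precision(recommended, relevant):
    """Average of precision values at each rank where a relevant item appears."""
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, item in enumerate(recommended, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)

recommended = ["a", "b", "c", "d"]   # ranked list produced by the model
clicked = {"a", "c", "e"}            # items the user actually clicked
p2 = precision_at_k(recommended, clicked, k=2)
r2 = recall_at_k(recommended, clicked, k=2)
ap = average_precision(recommended, clicked)
```

None of these functions needs a rating value or an explicit negative label, only the set of items a user interacted with.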


1053

MCQeasy

A company wants to perform real-time analytics on streaming data from clickstreams. The data needs to be ingested, processed, and made available for querying within seconds. Which AWS service should be used for the processing step?

A.AWS Glue

B.Amazon Redshift

C.Amazon Kinesis Data Analytics

D.Amazon Athena

AnswerC

Kinesis Data Analytics processes streaming data in real-time.

Why this answer

Amazon Kinesis Data Analytics is the correct choice because it enables real-time processing and analysis of streaming data using SQL or Apache Flink. It can ingest data from Kinesis Data Streams or Kinesis Data Firehose, process it with sub-second latency, and output results to destinations like Kinesis Data Streams or Firehose for further querying, meeting the requirement for analytics within seconds.

Exam trap

The trap here is that candidates often confuse AWS Glue's streaming ETL capability (which still relies on Spark Structured Streaming with higher latency) with Kinesis Data Analytics' native real-time processing, or they assume Athena can query streaming data directly when it only queries data at rest in S3.

How to eliminate wrong answers

Option A is wrong because AWS Glue is a serverless ETL service designed for batch processing and data cataloging, not for real-time stream processing with sub-second latency. Option B is wrong because Amazon Redshift is a data warehouse optimized for analytical queries on large datasets, but it is not designed for real-time stream processing; it ingests data in batches or via streaming ingestion with higher latency. Option D is wrong because Amazon Athena is an interactive query service for analyzing data in Amazon S3 using SQL, but it operates on data at rest and cannot process streaming data in real time.


1054

MCQmedium

A data scientist is training a classification model on an imbalanced dataset where the positive class represents only 5% of the data. Which technique would BEST address the class imbalance without discarding data?

A.Use SMOTE to generate synthetic samples for the minority class

B.Randomly undersample the majority class

C.Adjust the decision threshold to 0.95

D.Randomly oversample the minority class with replacement

AnswerA

SMOTE creates synthetic samples, balancing the dataset without data loss.

Why this answer

SMOTE (Synthetic Minority Oversampling Technique) is the best choice because it generates synthetic samples for the minority class by interpolating between existing minority instances, effectively balancing the dataset without discarding any data. This avoids the information loss of undersampling and the overfitting risk of simple random oversampling, making it ideal for a 5% positive class scenario.

Exam trap

AWS often tests the distinction between data-level techniques (like SMOTE) and post-hoc adjustments (like threshold tuning), trapping candidates who think changing the threshold alone solves the imbalance without addressing the underlying data distribution.

How to eliminate wrong answers

Option B is wrong because randomly undersampling the majority class discards data, which can lead to loss of valuable information and reduced model performance, especially when the majority class contains important patterns. Option C is wrong because adjusting the decision threshold to 0.95 does not address class imbalance at the data level; it only changes the classification cutoff, which may improve precision but sharply reduces recall, and it does not fix the underlying skewed distribution. Option D is wrong because randomly oversampling the minority class with replacement duplicates existing samples, which can cause overfitting to the minority class and does not introduce new, diverse examples like SMOTE does.
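
To make the interpolation step concrete, here is a simplified sketch of how SMOTE-style synthetic points are produced. It is not the reference implementation from any library; `smote_like_samples`, its parameters, and the toy points are assumptions for illustration only.

```python
import random

def smote_like_samples(minority, n_new, k=2, seed=0):
    """Create synthetic minority points by interpolating toward near neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base_idx = rng.randrange(len(minority))
        base = minority[base_idx]
        # Rank the other minority points by squared Euclidean distance to the base.
        others = [p for j, p in enumerate(minority) if j != base_idx]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        # Pick a random position on the segment between base and neighbour.
        gap = rng.random()
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbour)))
    return synthetic

minority = [(1.0, 1.0), (2.0, 1.0), (1.0, 2.0), (2.0, 2.0)]
new_points = smote_like_samples(minority, n_new=6)
```

Each synthetic point lies on a line segment between two real minority samples, which is why SMOTE adds variety that plain duplication (option D) does not.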


1055

Multi-Selecthard

Which THREE factors should be considered when choosing between Amazon Kinesis Data Streams and Amazon Kinesis Data Firehose for a real-time data ingestion pipeline? (Choose three.)

Select 3 answers

A.The need for built-in data transformation and analytics.

B.The need for custom real-time processing logic using consumer applications.

C.The required end-to-end latency (seconds vs. minutes).

D.The need to manually manage shard capacity and scaling.

E.The requirement for exactly-once delivery semantics.

AnswersB, C, D

Correct: Data Streams supports custom consumers; Firehose does not.

Why this answer

Three key differences: Kinesis Data Streams supports custom processing with consumer applications (option B), has lower end-to-end latency (option C), and requires manual management of shard capacity and scaling (option D). Firehose is for simple delivery with minimal code, auto-scaling, and higher latency due to buffering. Option E (exactly-once delivery) is not guaranteed by either service.

Option A (built-in analytics) is not a feature of Firehose.


1056

Matchingmedium

Match each SageMaker feature to its description.


Concepts

Matches

Managed compute to train a model

Host a model for real-time inference

Run inference on a batch of data

Jupyter notebook for exploration

Run data processing scripts

Why these pairings

These are core components of SageMaker.


1057

MCQmedium

A data scientist is performing exploratory data analysis on a dataset with missing values. The dataset contains a column 'income' with 20% missing values. The income distribution is right-skewed. Which imputation method is most appropriate to preserve the skewness?

A.Impute with the mean income

B.Impute with the median income

C.Drop rows with missing income

D.Impute with the mode income

AnswerB

Median is robust to skewness and preserves the distribution shape.

Why this answer

Option B is correct because the median is robust to skewness and preserves the central tendency without distorting the distribution as much as the mean. Option A is wrong because the mean is sensitive to outliers: the long right tail pulls it above the typical income. Option D is wrong because the mode is intended for categorical data.

Option C is wrong because dropping rows reduces the sample size.
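
A small worked example shows why the two fill values differ so much on right-skewed data. The income figures below are invented for illustration.

```python
from statistics import mean, median

# Hypothetical right-skewed incomes; None marks a missing value.
incomes = [30_000, 35_000, 40_000, None, 45_000, 50_000, 1_000_000]
observed = [v for v in incomes if v is not None]

fill_mean = mean(observed)      # pulled upward by the single very large income
fill_median = median(observed)  # stays near the typical income

imputed = [fill_median if v is None else v for v in incomes]
```

Here the mean of the observed values is 200,000 while the median is 42,500, so mean imputation would place the missing row far above where most of the data sits.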


1058

MCQmedium

A company is building a data lake on Amazon S3. Raw data is ingested from multiple sources in different formats (CSV, JSON, Parquet). The data must be cataloged and made queryable using Amazon Athena. The data schema may evolve over time. Which approach minimizes manual effort and supports schema evolution?

A.Use Athena only, without a catalog, by directly querying files

B.Use Amazon EMR to process data and write to a Hive metastore

C.Use AWS Glue Crawlers to automatically create and update the Glue Data Catalog

D.Manually create tables in Athena using DDL statements

AnswerC

Crawlers automatically detect schema changes and update the catalog.

Why this answer

AWS Glue Crawlers automatically infer schemas and update them in the Glue Data Catalog, supporting schema evolution. Option D (manual table creation with DDL) is error-prone and not scalable. Option B (EMR with a Hive metastore) does not leverage the Glue Data Catalog.

Option A (Athena only) can query but does not manage schema evolution.


1059

MCQmedium

A company is using Amazon SageMaker to host a real-time inference endpoint for a natural language processing model. The endpoint is configured with an ml.m5.large instance. After deployment, the company observes that the inference latency is higher than expected, and the endpoint is experiencing CPU utilization near 100% during peak hours. The model is a PyTorch model that uses a transformer architecture. The company wants to reduce latency without increasing cost significantly. Which approach should the company take?

A.Configure the endpoint with Auto Scaling to add more instances during peak hours.

B.Switch to batch transform for inference.

C.Attach an Elastic Inference accelerator to the existing instance.

D.Change the endpoint instance type to ml.g4dn.xlarge to use GPU acceleration.

AnswerD

Correct: GPU instances accelerate transformer inference, reducing latency.

Why this answer

The issue is high CPU utilization causing latency. A GPU instance (ml.g4dn.xlarge) can accelerate inference for transformer models through parallel processing, reducing latency. Option D is correct.

Option C (Elastic Inference) may help but is less effective than a full GPU for transformer models; it also adds complexity. Option A (Auto Scaling) helps with traffic but does not reduce per-request latency. Option B (batch transform) is for offline inference, not real-time.


1060

Multi-Selecteasy

A data analyst is exploring a dataset with a binary target variable. Which TWO visualizations are most useful for understanding the relationship between a numerical feature and the target?

Select 2 answers

A.Pie chart of the feature

B.Bar chart of the feature

C.Histogram with overlaid target classes

D.Box plot grouped by target class

E.Scatter plot of the feature versus target

AnswersC, D

Shows how the feature distribution differs by class.

Why this answer

Options C and D are correct. C: A histogram with overlaid target classes shows how the feature distribution differs between classes. D: A box plot grouped by target class shows distribution differences.

Option E is incorrect because a scatter plot is for two numerical variables. Option B is incorrect because a bar chart is for categorical features. Option A is incorrect because a pie chart is for proportions, not relationships.


1061

MCQmedium

A company is building a recommendation system using matrix factorization. The training data contains user-item interactions. The model performs well on the training set but poorly on the test set. Which regularization technique should be applied to improve generalization?

A.Add L1 regularization to the user and item latent factors

B.Add L2 regularization to the user and item latent factors

C.Apply dropout to the latent factors during training

D.Use batch normalization on the factors

AnswerB

L2 regularization penalizes large factor values, reducing overfitting.

Why this answer

L2 regularization (weight decay) penalizes large values in the user and item latent factor matrices, which helps prevent overfitting by encouraging the model to learn smoother, more generalizable representations. This is the standard regularization technique used in matrix factorization for collaborative filtering, as it directly controls the magnitude of the latent vectors without inducing sparsity.

Exam trap

AWS often tests the distinction between L1 and L2 regularization in the context of matrix factorization, where candidates mistakenly choose L1 because they associate it with feature selection, but the correct choice for controlling latent factor magnitude and preventing overfitting is L2 regularization.

How to eliminate wrong answers

Option A is wrong because L1 regularization induces sparsity in the latent factors, which is not typically desired in matrix factorization—sparse factors can lose the dense, low-rank structure needed for capturing collaborative signals. Option C is wrong because dropout is a regularization technique designed for neural networks, not for standard matrix factorization models, and applying it to latent factors would disrupt the multiplicative interaction that defines the prediction. Option D is wrong because batch normalization normalizes activations within mini-batches to stabilize training in deep networks, but matrix factorization has no notion of mini-batch activations and batch normalization does not address overfitting from latent factor magnitudes.
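
The effect of the L2 penalty is easiest to see in a single stochastic gradient step for one observed user-item pair. This is a minimal sketch under the usual squared-error objective; the function name `sgd_step` and the values of `lr` and `reg` are assumptions chosen for the example.

```python
def sgd_step(p_u, q_i, rating, lr=0.01, reg=0.1):
    """One SGD update of user factors p_u and item factors q_i with an L2 penalty."""
    prediction = sum(p * q for p, q in zip(p_u, q_i))
    error = rating - prediction
    # The "- reg * factor" term is the L2 penalty: it pulls every factor toward zero.
    new_p = [p + lr * (error * q - reg * p) for p, q in zip(p_u, q_i)]
    new_q = [q + lr * (error * p - reg * q) for p, q in zip(p_u, q_i)]
    return new_p, new_q

# The prediction already equals the rating (1.0*0.5 + 2.0*0.5 = 1.5),
# so the only change in this step comes from the regularization term.
new_p, new_q = sgd_step([1.0, 2.0], [0.5, 0.5], rating=1.5)
```

Even with zero prediction error, every factor shrinks by a factor of `1 - lr * reg`, which is how L2 keeps latent vectors small without forcing any of them exactly to zero.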


1062

MCQhard

A machine learning engineer is deploying a model to an Amazon SageMaker endpoint. The model is a PyTorch model that requires a custom inference script. The engineer notices that the endpoint is returning 500 errors after deployment. Which step should the engineer take to debug the issue?

A.Redeploy the endpoint with a different instance type.

B.Check the CloudWatch metrics for the endpoint.

C.Modify the inference script and update the endpoint.

D.View the CloudWatch Logs for the endpoint.

AnswerD

Logs contain stack traces and error messages.

Why this answer

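Option D is correct because CloudWatch Logs contain the detailed error messages and stack traces emitted by the inference container. Option A is wrong because redeploying on a different instance type does not reveal why the deployed endpoint is failing. Option B is wrong because CloudWatch metrics show aggregate measurements, not the underlying error messages.

Option C is wrong because modifying the inference script before the cause is known would overwrite it without a diagnosis.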


1063

MCQeasy

A data engineer needs to transfer 50 TB of historical data from an on-premises Hadoop cluster to Amazon S3. The company has a 100 Mbps internet connection and the data must be transferred within 5 days. Which AWS service is best suited for this task?

A.AWS DataSync

B.AWS Snowball Edge

C.Amazon S3 Transfer Acceleration

D.AWS Storage Gateway

AnswerB

Snowball Edge is a physical device that can transfer 50 TB in a few days, bypassing network bandwidth limitations.

Why this answer

Option B is correct because AWS Snowball Edge is a physical device that is shipped to AWS, so the transfer is not limited by the 100 Mbps connection. Option A is wrong because AWS DataSync is designed for network transfers, and at 100 Mbps the transfer would take much longer than 5 days. Option D is wrong because AWS Storage Gateway is for ongoing hybrid cloud storage, not large-scale data migration.

Option C is wrong because S3 Transfer Acceleration speeds up internet transfers but is still limited by the 100 Mbps connection.
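
The bandwidth argument can be checked with a quick back-of-the-envelope calculation. The sketch below assumes decimal units (1 TB = 10^12 bytes) and a fully utilized link; the helper name `transfer_days` is hypothetical.

```python
def transfer_days(size_tb, link_mbps, efficiency=1.0):
    """Days needed to push size_tb terabytes through a link of link_mbps megabits/s."""
    bits = size_tb * 1e12 * 8                          # terabytes -> bits
    seconds = bits / (link_mbps * 1e6 * efficiency)    # bits / (bits per second)
    return seconds / 86_400                            # seconds -> days

days_needed = transfer_days(size_tb=50, link_mbps=100)
```

Under these assumptions the transfer takes roughly 46 days even at full line rate, about nine times the 5-day window, which rules out every network-based option.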


1064

MCQmedium

A company uses SageMaker to train a large language model. The training job is taking too long. The data scientist wants to use distributed training with data parallelism. Which SageMaker feature should be used?

A.SageMaker distributed training libraries

B.SageMaker Neo

C.SageMaker Processing

D.SageMaker Debugger

AnswerA

These libraries provide data and model parallelism.

Why this answer

Option A is correct because SageMaker's distributed training libraries support data parallelism. Option D is wrong because SageMaker Debugger is for monitoring and debugging, not distribution. Option C is wrong because SageMaker Processing is for data processing.

Option B is wrong because SageMaker Neo is for model optimization.


1065

MCQmedium

A company is using Amazon SageMaker to train a model on a large dataset stored in S3. The training job is taking a long time due to slow data loading. Which action can the data scientist take to reduce data loading time?

A.Use Pipe mode to stream data from S3.

B.Use File mode and copy data to Amazon EBS.

C.Use a larger instance type with more memory.

D.Enable data augmentation during training.

AnswerA

Pipe mode streams data directly, reducing load time.

Why this answer

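Option A is correct because Pipe mode streams data directly from S3 to the training container, so training can start without waiting for a full download. Option C is wrong because a larger instance with more memory doesn't change how the data is transferred. Option B is wrong because File mode copies the full dataset to the instance's storage before training starts, which may be slower.

Option D is wrong because data augmentation adds processing work and doesn't help with loading time.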


1066

MCQmedium

A team is using SageMaker to train a model using the built-in XGBoost algorithm. The training job is taking longer than expected. The team suspects that the data is not being loaded efficiently. Which data format should they use to minimize training time?

A.Pipe mode with CSV

B.File mode with Parquet

C.Pipe mode with RecordIO-Protobuf

D.File mode with CSV

AnswerC

Streaming with efficient binary format.

Why this answer

Option C is correct because Pipe mode streams data from S3 directly to the algorithm, reducing initialization time, and RecordIO-Protobuf is an efficient binary format. Option D is wrong because File mode downloads the data first. Option A is wrong because CSV is not as efficient as RecordIO for XGBoost.

Option B is wrong because Parquet is not natively supported by XGBoost in SageMaker.


1067

MCQmedium

A data scientist is training a deep learning model for image classification using Amazon SageMaker. The training job is taking too long. The data scientist notices that GPU utilization is low (around 30%). Which action is most likely to improve GPU utilization and reduce training time?

A.Increase the batch size

B.Use a smaller instance type

C.Increase the learning rate

D.Reduce the batch size

AnswerA

Larger batch size keeps GPU busy, improving utilization and reducing total training time if the data pipeline can keep up.

Why this answer

Low GPU utilization (around 30%) indicates that the GPU is spending too much time idle while waiting for data batches to be processed. Increasing the batch size allows each training step to process more samples per forward/backward pass, which increases computational load on the GPU and improves hardware utilization. This directly reduces the number of steps needed per epoch, thereby decreasing overall training time.

Exam trap

The trap here is that candidates often confuse low GPU utilization with a need to reduce batch size (thinking smaller batches speed up training), when in fact increasing batch size is the standard remedy to saturate GPU compute and reduce wall-clock time.

How to eliminate wrong answers

Option B is wrong because using a smaller instance type would reduce available GPU compute resources, likely further lowering utilization and increasing training time. Option C is wrong because increasing the learning rate does not directly affect GPU utilization; it changes the optimization dynamics and may cause divergence or instability without addressing the underutilization bottleneck. Option D is wrong because reducing the batch size would decrease the amount of work per GPU step, further lowering utilization and increasing the number of steps, which would worsen training time.


1068

MCQhard

Refer to the exhibit. A SageMaker training job using the built-in Linear Learner algorithm fails with 'Loss function returned NaN'. Which hyperparameter change is MOST likely to resolve this issue?

A.Increase the learning rate to 0.5

B.Increase mini_batch_size to 2000

C.Decrease epochs to 5

D.Reduce learning_rate to 0.01

AnswerD

Lower learning rate helps convergence.

Why this answer

The 'Loss function returned NaN' error in SageMaker's built-in Linear Learner algorithm typically occurs when the learning rate is too high, causing gradient updates to overshoot optimal parameters and diverge. Reducing the learning rate to 0.01 stabilizes training by ensuring smaller, more controlled weight updates, preventing numerical instability that leads to NaN loss.

Exam trap

AWS often tests the misconception that increasing the learning rate speeds up convergence, but the trap here is that a high learning rate causes divergence and NaN loss, so the correct fix is to reduce it, not increase it.

How to eliminate wrong answers

Option A is wrong because increasing the learning rate to 0.5 would exacerbate the instability, making NaN loss more likely rather than resolving it. Option B is wrong because increasing mini_batch_size to 2000 does not directly address the learning rate-induced divergence; while larger batches can reduce gradient variance, they do not fix the fundamental issue of an overly aggressive step size. Option C is wrong because decreasing epochs to 5 would simply truncate training without addressing the root cause—the loss would still be NaN from the first few iterations if the learning rate is too high.
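
The divergence described above can be reproduced with plain gradient descent on a one-parameter squared-error loss. This is a toy sketch, not Linear Learner itself; `fit_weight` and the chosen values of `x`, `y`, and the learning rates are assumptions for illustration.

```python
import math

def fit_weight(x, y, learning_rate, steps=200):
    """Minimise (w * x - y)^2 over w with fixed-step gradient descent."""
    w = 0.0
    for _ in range(steps):
        grad = 2 * x * (w * x - y)   # derivative of the squared error w.r.t. w
        w -= learning_rate * grad
    return w

# The optimum is w = y / x = 2.0.
w_high = fit_weight(x=10.0, y=20.0, learning_rate=0.5)    # step size far too large
w_low = fit_weight(x=10.0, y=20.0, learning_rate=0.001)   # small, stable step size

diverged = not math.isfinite(w_high)
```

With the large step, each update overshoots the optimum by a growing amount until the value overflows and is no longer a finite number, while the small step settles at the optimum.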


1069

MCQmedium

A data scientist is building a recommendation system using collaborative filtering. The dataset contains user-item interactions in a sparse matrix. The model will be trained on Amazon SageMaker using the built-in Factorization Machines algorithm. Which data format should the scientist use for the training data?

A.CSV format with all features as columns

B.JSON format with nested arrays

C.RecordIO-protobuf format with sparse features

D.Parquet format

AnswerC

RecordIO-protobuf is the recommended format for sparse data for Factorization Machines.

Why this answer

Amazon SageMaker's Factorization Machines algorithm expects input in the 'application/x-recordio-protobuf' format for sparse data, or in CSV format for dense data. For sparse data, Protobuf is recommended for performance.


1070

MCQeasy

A data scientist wants to understand the distribution of a categorical feature with 100 unique values. Which visualization is most appropriate?

A.Histogram

B.Bar chart

C.Scatter plot

D.Pie chart

AnswerB

Bar charts are ideal for displaying categorical frequencies.

Why this answer

A bar chart is the most appropriate visualization for displaying the distribution of a categorical feature with 100 unique values because it uses discrete bars to represent the frequency or proportion of each category. Unlike a histogram, which requires continuous numeric bins, a bar chart preserves the distinct categories and allows clear comparison of counts across all 100 levels.

Exam trap

AWS often tests the distinction between histograms (for continuous data) and bar charts (for categorical data), and candidates mistakenly choose histogram because they confuse 'distribution' with 'numeric distribution' without recognizing the categorical nature of the feature.

How to eliminate wrong answers

Option A is wrong because a histogram is designed for continuous numeric data and groups values into bins, which is inappropriate for categorical features and would obscure the distinct categories. Option C is wrong because a scatter plot is used to visualize the relationship between two continuous variables, not the distribution of a single categorical feature. Option D is wrong because a pie chart, while usable for categorical data, becomes unreadable and misleading with 100 unique values due to overlapping small slices and difficulty comparing proportions; bar charts are far superior for many categories.


1071

Multi-Selecteasy

Which TWO of the following are benefits of feature scaling for machine learning algorithms?

Select 2 answers

A.Eliminates the effect of outliers

B.Reduces the need for feature selection

C.Improves performance of decision tree algorithms

D.Faster convergence of gradient descent

E.Prevents features with larger magnitudes from dominating distance-based algorithms

AnswersD, E

Scaling ensures all features contribute equally to the gradient.

Why this answer

Option D is correct because feature scaling, typically via standardization (z-score) or min-max normalization, ensures that gradient descent converges faster. Without scaling, features with larger numerical ranges dominate the gradient updates, causing the algorithm to oscillate and require more iterations to reach the optimum. Scaling produces a more spherical contour of the loss function, allowing gradient descent to take more direct steps toward the minimum.

Exam trap

The trap here is that candidates often assume feature scaling universally improves all algorithms, but AWS specifically tests that tree-based models (like decision trees) are scale-invariant, making option C a common distractor.
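
Standardization (z-scoring), one of the scaling methods mentioned above, can be sketched with the standard library alone. The helper name `standardize` and the sample values are illustrative assumptions.

```python
from statistics import mean, pstdev

def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]

# Two features on very different scales.
incomes = [30_000.0, 60_000.0, 90_000.0]
ages = [25.0, 40.0, 55.0]

scaled_incomes = standardize(incomes)
scaled_ages = standardize(ages)
```

After scaling, both features span the same range of values, so neither dominates a Euclidean distance or a gradient step simply because of its units.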


1072

MCQhard

A company is building a binary classification model to predict customer churn. The dataset is highly imbalanced (95% non-churn, 5% churn). The data scientist uses SMOTE to oversample the minority class. After training a logistic regression model, the recall for the churn class is 0.80, but the precision is only 0.10. Which action would MOST likely improve precision without significantly harming recall?

A.Use random oversampling instead of SMOTE

B.Reduce the number of features in the model

C.Increase the classification threshold for the positive class

D.Decrease the classification threshold for the positive class

AnswerC

A higher threshold reduces false positives, improving precision, while likely still capturing many true positives.

Why this answer

Option C is correct because increasing the classification threshold reduces false positives, improving precision, while still retaining many true positives. Option D is wrong because decreasing the threshold would further reduce precision. Option A is wrong because using a different oversampling technique might not directly address the threshold issue.

Option B is wrong because reducing the model's complexity could reduce recall.
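
The threshold trade-off can be illustrated on a handful of made-up scores. The labels, scores, and the helper `precision_recall` below are assumptions for this example, not data from the question.

```python
def precision_recall(labels, scores, threshold):
    """Precision and recall for the positive class at a given score threshold."""
    tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
    fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
    fn = sum(1 for y, s in zip(labels, scores) if s < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

labels = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.7, 0.55, 0.52, 0.3, 0.2]

at_default = precision_recall(labels, scores, threshold=0.5)
at_raised = precision_recall(labels, scores, threshold=0.6)
```

In this toy data, raising the threshold from 0.5 to 0.6 removes two false positives while keeping all three true positives, so precision rises from 0.5 to 0.75 with recall unchanged. On real data recall usually drops somewhat, so the threshold is typically chosen from a precision-recall curve.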


1073

MCQhard

A data engineer is investigating a slow Athena query on a partitioned table. The table is partitioned by year, month, and day, and the data is stored in S3 with the prefix pattern 'raw/YYYY/MM/DD/'. The engineer runs the above CLI command and sees that there are many small files. Which action would most improve query performance?

A.Convert the data to columnar format like Parquet or ORC.

B.Use S3DistCp to coalesce files into fewer, larger files.

C.Increase the number of partitions in the Athena DDL.

D.Add more partitions to reduce the amount of data scanned per query.

AnswerB

Coalescing reduces the number of files, improving query performance.

Why this answer

Athena performs best with larger files. Consolidating small files into fewer, larger files (e.g., using S3DistCp or Glue ETL) reduces the overhead of reading many small files and improves query performance.


1074

Multi-Selecteasy

A company has a large number of small CSV files (hundreds of thousands) in an S3 bucket. A data engineer needs to run a SQL query on this data using Amazon Athena. The queries are currently slow and expensive. Which two actions will improve query performance and reduce cost?

Select 2 answers

A.Increase the S3 request rate per prefix to improve read throughput.

B.Compress the CSV files using gzip.

C.Partition the data by a commonly filtered column (e.g., date).

D.Increase the number of partitions by splitting files into smaller ones.

E.Convert the data to Parquet or ORC columnar format.

AnswersC, E

Partitioning limits the data scanned per query, improving performance and reducing cost.

Why this answer

Options C and E are correct. Partitioning the data reduces the amount of data scanned by Athena, improving performance and reducing cost. Converting to a columnar format (Parquet or ORC) further reduces scanned data and improves compression.

Option B (compression) helps but is less impactful than partitioning and columnar format. Option D (more partitions from smaller files) could seem to help, but even more small files would hurt performance. Option A (increasing S3 request rate) is not a direct action for Athena.


1075

Multi-Selecthard

A company is deploying a machine learning model to an Amazon SageMaker endpoint. The model receives requests with sensitive data that must be encrypted in transit and at rest. Additionally, the company needs to control access to the endpoint using AWS IAM. Which THREE steps should the company take to meet these requirements? (Choose THREE.)

Select 3 answers

A.Enable HTTPS for the endpoint

B.Configure the endpoint to use a VPC

C.Store the model artifacts in Amazon S3 with SSE-S3 encryption

D.Enable encryption at rest for the endpoint's ML storage volume

E.Attach an IAM policy to the endpoint to allow only authorized principals

AnswersA, D, E

HTTPS encrypts data in transit.

Why this answer

Options A, D, and E are correct. Enabling encryption in transit using HTTPS, enabling encryption at rest for the endpoint's attached storage, and attaching an IAM policy that allows only authorized principals to invoke the endpoint are necessary. Option B is wrong because VPC settings do not encrypt data at rest.

Option C is wrong because the model data in S3 should be encrypted with SSE-KMS, not SSE-S3, to meet stricter security requirements.


1076

MCQhard

A data scientist is training a deep learning model on Amazon SageMaker using a PyTorch estimator. The training job runs on a single ml.p3.2xlarge instance but is taking too long. The scientist wants to reduce training time by using distributed data parallelism across multiple GPUs. Which change to the training script and SageMaker estimator is required?

A.Add the SageMaker distributed data parallelism configuration in the estimator and modify the script to use the SageMaker distributed library.

B.Change the framework to TensorFlow and use tf.distribute.MirroredStrategy with instance_count=2.

C.Modify the script to use torch.nn.parallel.DistributedDataParallel and set instance_count to 2 in the estimator.

D.Modify the script to use torch.nn.DataParallel and keep instance_count as 1.

AnswerC

DDP is efficient for multi-node training.

Why this answer

Option C is correct because to achieve distributed data parallelism across multiple GPUs on multiple instances with PyTorch, you must modify the training script to use `torch.nn.parallel.DistributedDataParallel` (DDP), which handles gradient synchronization across nodes. Additionally, you must set `instance_count` to 2 (or more) in the SageMaker PyTorch estimator to launch multiple instances, each with its own GPU, enabling true multi-node distributed training.

Exam trap

AWS often tests the distinction between `DataParallel` (single-node, multi-GPU) and `DistributedDataParallel` (multi-node, multi-GPU), leading candidates to incorrectly choose `DataParallel` because they overlook the requirement for multiple instances.

How to eliminate wrong answers

Option A is wrong because the SageMaker distributed data parallelism library is a separate framework (SMDDP) that is not required for PyTorch DDP; using it would add unnecessary complexity and is not the standard approach for PyTorch users. Option B is wrong because switching to TensorFlow is unnecessary and introduces a framework change; the question specifies PyTorch, and `tf.distribute.MirroredStrategy` is for TensorFlow, not PyTorch. Option D is wrong because `torch.nn.DataParallel` only parallelizes within a single node (single instance) and does not support multi-instance distributed training; it also does not scale across multiple GPUs on different instances, so it would not reduce training time when using multiple instances.

Full explanation →

1077

Multi-Selectmedium

A data scientist is training a gradient boosting model using SageMaker. The model is overfitting to the training data. Which TWO actions can help reduce overfitting? (Choose 2)

Select 2 answers

A.Increase the number of boosting rounds

B.Increase the learning rate

C.Increase the minimum child weight

D.Reduce the maximum depth of trees

E.Use a larger training dataset

AnswersC, D

A higher min_child_weight requires more data in each child node before a split, reducing overfitting.

Why this answer

Increasing the learning rate actually worsens overfitting; increasing max_depth increases model complexity. Reducing max_depth and increasing min_child_weight both regularize the model.

Full explanation →

1078

MCQeasy

A company is deploying a PyTorch model on a SageMaker endpoint for real-time inference. The model is stored as a .pth file in an S3 bucket. The data scientist wants to use the SageMaker PyTorch inference toolkit. Which file is REQUIRED in the model artifacts to serve the model?

A.A file named model.tar.gz that contains the model and any dependencies.

B.A file named inference.py that defines the model loading and prediction logic.

C.A file named model.pth containing the model state dictionary.

D.A file named requirements.txt listing the dependencies.

AnswerC

The PyTorch inference toolkit loads model.pth by default.

Why this answer

Option C is correct because the SageMaker PyTorch inference toolkit expects a file named model.pth in the model artifacts. Option B is wrong because inference.py is optional and needed only for custom loading or prediction code. Option D is wrong because requirements.txt is optional.

Option A is wrong because model.tar.gz is the archive that holds the model artifacts in S3, not a file inside them.

Full explanation →

1079

MCQhard

A data scientist is building a regression model to predict house prices. The dataset contains features like number of bedrooms, square footage, and location. After training, the model has high variance. Which technique should the data scientist use to reduce variance without significantly increasing bias?

A.Use bagging

B.Increase the number of features

C.Apply L2 regularization

D.Use fewer training examples

AnswerC

L2 regularization penalizes large coefficients, reducing variance.

Why this answer

Option C is correct because L2 regularization (Ridge) penalizes large coefficients, reducing variance while keeping bias low. Option B is wrong because adding features increases model complexity, which increases variance. Option D is wrong because training on fewer examples tends to increase variance rather than reduce it.

Option A is wrong because bagging reduces variance but may not be appropriate for all cases; regularization is more direct.

Full explanation →

1080

MCQhard

Refer to the exhibit. A data scientist runs the AWS CLI command shown and gets the output. The scientist wants to create an Athena table over all log files in the 'logs/2023/' prefix, including files smaller than 1000 bytes. Which approach achieves this?

A.Create the table using LOCATION 's3://my-bucket/logs/2023/' which includes all files under that prefix.

B.Create the table and add a WHERE clause to include small files.

C.Ask the S3 team to remove the size restriction on the bucket.

D.Modify the CLI command to remove the size filter and re-run it before creating the table.

AnswerA

The table location covers all files regardless of size.

Why this answer

The CLI command returns only objects larger than 1000 bytes, but the scientist wants all files in the prefix. The Athena table definition should point to the entire prefix without size filtering. Option D is wrong because the command was run locally and its filter has no effect on Athena.

Option B is wrong because a WHERE clause in Athena filters rows only after the files are scanned; it does not control which files the table includes. Option C is wrong because the size filter comes from the CLI command, not from the bucket, so the scientist can create the table without any change to the bucket.

Full explanation →

1081

MCQeasy

A data scientist wants to query a dataset stored in Amazon S3 using standard SQL without provisioning any servers. The dataset is in CSV format and is updated daily. Which AWS service should be used?

A.Amazon Athena

B.Amazon Redshift

C.Amazon RDS

D.Amazon DynamoDB

AnswerA

Athena is serverless and supports SQL queries on S3 data.

Why this answer

Amazon Athena is a serverless interactive query service that allows querying data in S3 using standard SQL. Option B is wrong because Redshift is a data warehouse requiring provisioning. Option C is wrong because RDS is a relational database service.

Option D is wrong because DynamoDB is a NoSQL database.

Full explanation →

1082

MCQmedium

A data scientist is using SageMaker to train a model. The training job is failing with a 'ResourceLimitExceeded' error. Which action should be taken to resolve this issue?

A.Use a different AWS Region.

B.Request a service limit increase for the instance type.

C.Reduce the training dataset size.

D.Switch to a different instance type with lower resource requirements.

AnswerB

The error indicates the instance limit is reached; requesting an increase resolves it.

Why this answer

Option B is correct because the error indicates that the account's maximum number of instances for the instance type has been reached. Requesting a limit increase for the specific instance type is the appropriate resolution. Option A is incorrect because it does not address the limit issue.

Option D is incorrect because switching to a different instance type may not be desired and does not resolve the underlying limit. Option C is incorrect because the error is about resource limits, not data size.

Full explanation →

1083

MCQmedium

A company is building a recommendation system using matrix factorization. The dataset has 1 million users and 100,000 items. The data scientist trains a model using SageMaker's Factorization Machines algorithm. The model achieves a root mean squared error (RMSE) of 0.95 on the test set. However, the business requires RMSE below 0.90. The data scientist has already tuned hyperparameters like number of factors and learning rate. Which additional step should the data scientist take to improve RMSE?

A.Add side features such as user demographics and item categories

B.Increase the number of training iterations

C.Increase the number of factors to 1000

D.Use a linear regression model instead

AnswerA

Side features enrich the model and can improve accuracy.

Why this answer

Option A (add side features) provides more information to the model. Option C (more factors) may overfit. Option B (more iterations) may not help if training has already converged.

Option D (use linear regression) is less powerful because it cannot model user-item interactions.

Full explanation →

1084

MCQhard

A data scientist is training a deep learning model for image classification on Amazon SageMaker. The dataset consists of 10,000 images of size 224x224 pixels. The training job uses a single ml.p3.2xlarge instance. The data scientist notices that the GPU utilization is very low (~20%) and the training is slow. Which change would most likely improve GPU utilization?

A.Use gradient accumulation

B.Use a larger instance type with more GPUs

C.Increase the batch size

D.Increase the number of data loader workers to load data in parallel

AnswerD

More workers can load data faster, reducing idle GPU time.

Why this answer

Low GPU utilization often indicates that the data loading pipeline is bottlenecked. Increasing the number of data loader workers can improve data throughput to the GPU, keeping it busy.

Full explanation →

1085

MCQeasy

A data scientist needs to query a dataset stored as Parquet files in Amazon S3 using standard SQL without managing any infrastructure. Which service should they use?

A.Amazon Athena

B.Amazon QuickSight

C.AWS Glue

D.Amazon Redshift

AnswerA

Athena is serverless and supports SQL on S3.

Why this answer

Amazon Athena is serverless and allows SQL queries directly on S3 data. Redshift requires a cluster. Glue is for ETL.

QuickSight is for visualization.

Full explanation →

1086

MCQmedium

An ML team is analyzing a time series dataset of daily website traffic. They notice a pattern where traffic spikes every Sunday. Which EDA technique should they use to confirm this seasonality?

A.Plot the time series data with a line plot

B.Compute autocorrelation at different lags

C.Create a scatter plot of traffic vs. day of week

D.Plot a histogram of the traffic values

AnswerA

A line plot over time directly reveals seasonal patterns.

Why this answer

Option A is correct because a line plot over time clearly shows weekly patterns. Option D is wrong because histograms show the distribution of values, not time patterns. Option B is wrong because autocorrelation measures correlation with lagged values, but visual inspection is more direct for confirming seasonality.

Option C is wrong because scatter plots show relationships between two variables.
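As a numeric complement to the line plot, the lagged correlation mentioned above can be computed directly. The sketch below is only an illustration: the traffic numbers are made up, and `autocorrelation` is a hypothetical helper, not a library function. A weekly pattern should give a high value at lag 7 and a low value at lags that are not multiples of 7.

```python
def autocorrelation(series, lag):
    """Sample autocorrelation of a sequence at the given lag."""
    n = len(series)
    mean = sum(series) / n
    variance = sum((x - mean) ** 2 for x in series)
    covariance = sum(
        (series[t] - mean) * (series[t + lag] - mean) for t in range(n - lag)
    )
    return covariance / variance

# Four weeks of made-up daily traffic with a spike every seventh day (Sunday)
traffic = [100, 100, 100, 100, 100, 100, 300] * 4

print(round(autocorrelation(traffic, 7), 2))  # lag of one week
print(round(autocorrelation(traffic, 3), 2))  # lag that does not match the cycle
```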

Full explanation →

1087

MCQhard

A data engineer created a CloudFormation template for a Glue ETL job as shown. The job processes 500 GB of data and takes 90 minutes to complete. However, the job fails after 60 minutes. What is the MOST likely cause?

A.The IAM role does not have sufficient permissions.

B.The ScriptLocation S3 bucket is in a different region.

C.The Timeout property is set to 60 minutes, but the job requires more time.

D.The MaxRetries property is set to 0, so the job does not retry on failure.

AnswerC

The job is killed when it exceeds the timeout, causing failure.

Why this answer

Option C is correct because the Timeout is set to 60 minutes, but the job takes 90 minutes, so it is killed. Option B is wrong because the script location is correct. Option A is wrong because the role appears correctly configured.

Option D is wrong because MaxRetries is 0, but the job fails due to timeout, not due to retry policy.

Full explanation →

1088

MCQhard

A company runs an e-commerce platform on AWS. They have a SageMaker endpoint serving a product recommendation model. The model uses a custom container with a TensorFlow model. Recently, the endpoint has been returning high latency and occasional 504 errors during peak traffic. The data scientist observes that the model inference time is around 200 ms per request, but the endpoint is configured with a single ml.c5.large instance. The traffic spikes can reach 100 requests per second. The data scientist needs to reduce latency and eliminate 504 errors. Which course of action is most appropriate?

A.Use Amazon Elastic Inference to attach an EI accelerator to the endpoint instance

B.Configure the SageMaker endpoint with Application Auto Scaling to scale out based on the 'InvocationsPerInstance' metric, and use a larger instance type such as ml.c5.xlarge

C.Switch to a multi-model endpoint to serve multiple models on the same instance

D.Replace the SageMaker endpoint with an AWS Lambda function that loads the model from S3 and returns predictions

AnswerB

Auto scaling adds instances to handle load; a larger instance reduces per-request latency.

Why this answer

Option B is correct because the endpoint is bottlenecked by both instance size and concurrency. With a single ml.c5.large instance facing up to 100 requests per second and a 200 ms inference time, the instance can process only about 5 requests per second when requests are served one at a time (1000 ms / 200 ms = 5 requests per second per instance). Application Auto Scaling based on the 'InvocationsPerInstance' metric will add instances during traffic spikes, while upgrading to ml.c5.xlarge doubles compute capacity per instance, reducing latency and eliminating 504 errors caused by request queue overflow.
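The capacity arithmetic in this explanation can be sketched in a few lines. This is a rough estimate only: it assumes each instance serves requests strictly one at a time, the same simplification the explanation makes, and `instances_needed` is an illustrative name rather than any AWS API.

```python
import math

def instances_needed(peak_rps: float, latency_ms: float) -> int:
    """Estimate instances needed if each one serves a single request at a time."""
    per_instance_rps = 1000.0 / latency_ms  # 200 ms per request -> 5 requests/second
    return math.ceil(peak_rps / per_instance_rps)

print(instances_needed(peak_rps=100, latency_ms=200))  # 20
```

Under that assumption the peak load needs about 20 instances of the original size, which is why scaling out matters more here than any single-instance tweak.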

Exam trap

The trap here is that candidates often confuse performance bottlenecks with model optimization or cost-saving strategies, and incorrectly choose Elastic Inference or multi-model endpoints, which address different problems (GPU acceleration or multi-model hosting) rather than the core issue of insufficient compute capacity and lack of auto scaling.

How to eliminate wrong answers

Option A is wrong because Amazon Elastic Inference attaches a GPU accelerator for deep learning inference, but the ml.c5.large instance is CPU-based and the bottleneck is compute capacity per instance, not GPU acceleration; EI does not address the concurrency or scaling issue. Option C is wrong because a multi-model endpoint is designed to host multiple models on a single instance to reduce hosting costs, not to reduce latency or handle high traffic spikes; it does not increase the compute capacity or scaling of the endpoint. Option D is wrong because AWS Lambda has limited memory (up to 10 GB), loading a TensorFlow model from S3 adds cold start latency, and sustaining 100 requests per second would require heavy concurrency management, making it unsuitable for real-time inference at this scale.

Full explanation →

1089

MCQeasy

A data scientist is training a random forest model. During hyperparameter tuning, which parameter is MOST effective at reducing overfitting?

A.Increase the number of trees

B.Increase the number of features considered per split

C.Decrease the maximum depth of each tree

D.Increase the maximum depth of each tree

AnswerC

Shallow trees generalize better.

Why this answer

Decreasing the maximum depth of each tree limits the complexity of individual trees, preventing them from memorizing noise and outliers in the training data. This directly reduces overfitting by enforcing simpler decision boundaries, which is a core regularization technique for ensemble methods like Random Forest.

Exam trap

AWS often tests the misconception that adding more trees always reduces overfitting, but the trap is that while more trees stabilize predictions, they do not address the root cause of overfitting from overly complex individual trees.

How to eliminate wrong answers

Option A is wrong because increasing the number of trees generally improves model stability and reduces variance without significantly increasing overfitting; it can even help generalization. Option B is wrong because increasing the number of features considered per split makes the trees more correlated, which reduces diversity and can increase overfitting rather than reduce it. Option D is wrong because increasing the maximum depth of each tree allows trees to grow deeper, capturing more specific patterns and noise, which exacerbates overfitting.

Full explanation →

1090

MCQmedium

A company is using AWS Glue to run ETL jobs that process data from Amazon RDS to Amazon S3. The ETL jobs are failing intermittently with write timeout errors when writing to S3. The company wants to implement a retry mechanism for transient errors. What should the company do?

A.Configure the AWS Glue job to retry on failure by setting the 'Max retries' parameter

B.Increase the size of the Amazon EBS volumes attached to the Glue job

C.Use Amazon CloudWatch to monitor the job and manually restart on failure

D.Place the failed job messages in an Amazon SQS queue and reprocess them

AnswerA

Glue jobs can automatically retry up to a specified number of times.

Why this answer

Option A is correct because configuring job retry in AWS Glue automatically retries failed jobs. Option C is wrong because CloudWatch monitoring with a manual restart is not an automatic retry. Option D is wrong because SQS is not integrated with Glue jobs.

Option B is wrong because increasing EBS volume size does not address S3 write errors.

Full explanation →

1091

Multi-Selecteasy

Which TWO SageMaker features can be used to monitor and debug training jobs? (Choose 2.)

Select 2 answers

A.SageMaker Debugger

B.SageMaker Model Monitor

C.SageMaker Ground Truth

D.Amazon CloudWatch Logs

E.SageMaker Clarify

AnswersA, D

Debugger captures real-time training metrics and tensors.

Why this answer

SageMaker Debugger (Option A) captures tensors and metrics during training. CloudWatch Logs (Option D) provide training script logs. Options B (Model Monitor), E (Clarify), and C (Ground Truth) are for inference monitoring, explainability, and labeling respectively.

Full explanation →

1092

MCQeasy

A data engineer is querying the AWS Glue Data Catalog table shown in the exhibit. The engineer runs an Athena query: SELECT * FROM transactions WHERE year=2023. The query returns results quickly. However, a subsequent query: SELECT * FROM transactions WHERE amount > 100 takes a long time. What is the most likely reason for the performance difference?

A.The data is compressed, and the first query benefits from compression.

B.The first query uses a partition column (year), allowing partition pruning, while the second query does not.

C.The data is stored in Parquet format, which is optimized for columnar access.

D.The second query is not optimized because it uses 'SELECT *'.

AnswerB

Partition pruning reduces data scanned.

Why this answer

Option B is correct because the table is partitioned by year and month. The first query filters on a partition column (year), so Athena prunes partitions and scans only the relevant data. The second query filters on a non-partition column (amount), so Athena scans all partitions.

Option C is wrong because the data format is text (CSV), not Parquet. Option A is wrong because compression is not mentioned. Option D is wrong because both queries use SELECT *, so it cannot explain the difference; the second query is slow because it does not filter on a partition column.

Full explanation →

1093

MCQmedium

A data scientist runs a SageMaker notebook and uses pandas to explore a dataset. The dataset contains 500,000 rows and 20 columns, including a 'timestamp' column. After loading the data into a DataFrame, the memory usage is unexpectedly high. What is the most likely cause?

A.The DataFrame created an index column on the timestamp field, doubling memory usage.

B.The default data types inferred by pandas are unnecessarily large for the actual data ranges.

C.The DataFrame only loaded a sample of the data, but the sample size was too large.

D.The CSV file was compressed, and pandas inflated it in memory.

AnswerB

Pandas uses int64/float64 by default, which can be optimized by downcasting.

Why this answer

Option B is correct because pandas reads numeric columns as int64 or float64 by default, which uses 8 bytes per value. For 500,000 rows, each numeric column takes about 4 MB even when its values would fit in a much smaller type. Option A is wrong because pandas does not automatically build an index from a data column.

Option C is wrong because pandas loads all rows by default, not a sample. Option D is wrong because CSV files are plain text and do not have compression overhead.
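The size difference between default and downcast types is simple arithmetic. The sketch below uses the standard byte widths of the NumPy dtypes that pandas relies on; `column_megabytes` is a hypothetical helper, and the figures ignore index and object-column overhead.

```python
ROWS = 500_000
BYTES_PER_VALUE = {"int64": 8, "float64": 8, "int32": 4, "float32": 4, "int8": 1}

def column_megabytes(dtype: str, rows: int = ROWS) -> float:
    """Raw size of one numeric column in megabytes (10**6 bytes)."""
    return rows * BYTES_PER_VALUE[dtype] / 1_000_000

print(column_megabytes("int64"))  # 4.0
print(column_megabytes("int8"))   # 0.5
```

In pandas itself, `DataFrame.memory_usage(deep=True)` reports the actual footprint, and `pd.to_numeric(..., downcast=...)` or `astype` can apply the smaller types.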

Full explanation →

1094

MCQhard

A team is analyzing a dataset with many categorical features that have high cardinality (e.g., ZIP code, user ID). They want to explore relationships between these features and a continuous target variable. Which approach is most appropriate for visualizing these relationships without overwhelming the viewer?

A.Group categories into top K levels and use a box plot for each group.

B.Compute a correlation matrix using Pearson correlation.

C.Create a scatter plot with each category as a different color.

D.Use a heatmap to show pairwise chi-square statistics.

AnswerA

Aggregating categories makes the plot interpretable.

Why this answer

Option A is correct because aggregating the categories into groups (e.g., top 10 levels) and then using a bar plot or box plot makes the visualization interpretable. Option C is wrong because a scatter plot with one color per category is not suitable for high-cardinality categorical data. Option D is wrong because a heatmap of pairwise chi-square tests is for categorical-categorical relationships, not categorical-continuous.

Option B is wrong because a correlation matrix with Pearson correlation is for numerical data.
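Grouping into the top K levels can be sketched as follows. This is a minimal illustration with made-up ZIP codes; `group_top_k` and the "Other" label are assumptions chosen for the example, and the grouped values would then feed a box plot of the target per group.

```python
from collections import Counter

def group_top_k(values, k, other_label="Other"):
    """Keep the k most frequent categories and map the rest to a single label."""
    top = {category for category, _ in Counter(values).most_common(k)}
    return [v if v in top else other_label for v in values]

zip_codes = ["10001", "10001", "94105", "94105", "94105", "60601", "73301"]
print(group_top_k(zip_codes, k=2))
# ['10001', '10001', '94105', '94105', '94105', 'Other', 'Other']
```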

Full explanation →

1095

Multi-Selectmedium

A data scientist is training a random forest classifier on Amazon SageMaker and wants to reduce overfitting. Which TWO actions should the scientist take? (Choose TWO.)

Select 2 answers

A.Increase the number of features considered per split

B.Increase the maximum depth of trees

C.Decrease the number of trees

D.Limit the maximum depth of trees

E.Increase the number of trees

AnswersD, E

Shallow trees reduce overfitting.

Why this answer

Increasing the number of trees reduces variance, and limiting the maximum depth prevents overfitting. Option B is wrong because increasing max depth increases overfitting. Option C is wrong because reducing the number of trees increases variance.

Option A is wrong because increasing the number of features increases tree correlation and may increase overfitting.

Full explanation →

1096

MCQhard

A machine learning team is using Amazon SageMaker to train a model with a custom algorithm packaged in a Docker container. The training job fails with the error 'Error: Unable to locate sagemaker-training toolkit.' What is the MOST likely cause?

A.The container does not have internet access to download dependencies

B.The instance type is incompatible with the container

C.The training role does not have permissions to access the container repository

D.The container does not include the SageMaker Training Toolkit

AnswerD

The toolkit must be installed in the Docker image.

Why this answer

SageMaker requires the container to include the SageMaker Training Toolkit (or the common library) for integration, so Option D is correct. Option A is wrong because the error is about the toolkit, not the network.

Option C is wrong because the error clearly names the missing toolkit, not a repository permission failure. Option B is wrong because the error concerns the contents of the custom container, not the instance type.

Full explanation →

1097

MCQmedium

An AWS Glue ETL job failed with the error 'Insufficient memory allocated for the job'. The job run details show AllocatedCapacity: 5, WorkerType: Standard, NumberOfWorkers: 5. Which change should be made to resolve the issue?

A.Delete and recreate the job with a different name

B.Increase the job timeout to 3600 minutes

C.Increase the number of workers to 10

D.Change the worker type to G.2X

AnswerC

More workers increase total memory and compute capacity.

Why this answer

Option C is correct because the error indicates insufficient memory; increasing the number of workers (DPUs) provides more memory. Option D is wrong because, although the Standard worker type has 16 GB of memory and G.2X has 32 GB, the error is about allocated capacity, not worker type. Option B is wrong because job timeout is not the issue.

Option A is wrong because recreating the job under a new name won't fix the resource allocation.

Full explanation →

1098

Multi-Selecteasy

A company needs to transfer 10 TB of data from an on-premises data center to Amazon S3. The network bandwidth is limited to 100 Mbps, and the transfer must complete within 5 days. Which option is viable?

Select 1 answer

A.Use S3 Transfer Acceleration to speed up the transfer

B.Use S3 Multipart Upload to upload files in parallel

C.Use AWS Snowball Edge device to ship the data

D.Use AWS DataSync over the existing internet connection

E.Use AWS Direct Connect to establish a dedicated network connection

AnswersC

Snowball Edge is ideal for large data volumes over slow networks; physical shipping is faster.

Why this answer

AWS Snowball Edge is a physical device for large data transfers over slow networks. AWS DataSync (Option D) works over the existing connection, but at 100 Mbps the transfer would take more than 9 days, exceeding the 5-day window. Option A (S3 Transfer Acceleration) uses the existing internet connection, which is still limited to 100 Mbps.

Option E (Direct Connect) would require setup time and may not meet the window. Option B (S3 Multipart Upload) parallelizes uploads but cannot exceed the bandwidth of the slow link.
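The transfer-time estimate can be checked with a short calculation. The sketch assumes decimal units (1 TB = 10^12 bytes), a fully saturated link, and no protocol overhead, so real transfers would be slower; `transfer_days` is an illustrative helper name.

```python
def transfer_days(terabytes: float, megabits_per_second: float) -> float:
    """Days needed to move the data at full line rate (decimal units, no overhead)."""
    bits = terabytes * 1e12 * 8
    seconds = bits / (megabits_per_second * 1e6)
    return seconds / 86_400

print(round(transfer_days(10, 100), 2))  # 9.26
```

Even this best case is well over the 5-day window, which is why any option that relies on the existing 100 Mbps link is ruled out.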

Full explanation →

1099

MCQeasy

A machine learning engineer needs to deploy a real-time inference endpoint for a model that requires GPU acceleration for low latency. Which AWS service should be used?

A.Amazon SageMaker real-time endpoint

B.Amazon SageMaker batch transform

C.Amazon EC2 with auto scaling

D.AWS Lambda with GPU

AnswerA

SageMaker real-time endpoints support GPU instances and provide low-latency inference.

Why this answer

Amazon SageMaker provides real-time endpoints that support GPU instances for low-latency inference. AWS Lambda does not support GPU, and batch transform is for offline processing rather than real-time requests. EC2 would require manual management.

Full explanation →

1100

Multi-Selecthard

A machine learning team is building a real-time inference pipeline using Amazon SageMaker. The team has multiple models that need to be served, but usage patterns are unpredictable and traffic spikes occur several times a day. The team wants to minimize costs while maintaining low latency. Which THREE actions should the team take?

Select 3 answers

A.Enable provisioned concurrency on the endpoint to reduce cold starts.

B.Use SageMaker inference with Spot Instances to reduce cost.

C.Use a SageMaker multi-model endpoint to serve multiple models on the same instance.

D.Configure automatic scaling on the endpoint to handle traffic spikes.

E.Use SageMaker Batch Transform for all inference requests.

AnswersB, C, D

Spot Instances are cheaper but can be interrupted; for cost savings, sometimes acceptable.

Why this answer

SageMaker multi-model endpoints (C) allow serving multiple models on a single instance, reducing cost. SageMaker automatic scaling (D) adjusts capacity based on demand, handling spikes. Using Spot Instances (B) for inference can reduce cost but may cause interruptions; for real-time, On-Demand is safer.

Provisioned concurrency (A) applies to Lambda and serverless inference, not to instance-backed real-time endpoints. Batch Transform (E) is for offline inference.

Full explanation →

1101

MCQmedium

A company is using AWS Glue to run ETL jobs that process data in an S3 data lake. The jobs are failing with out-of-memory errors when processing large files. Which configuration change should be made to resolve this issue?

A.Change the worker type to G.1X

B.Increase the number of DPUs allocated to the job

C.Partition the input data into smaller files

D.Enable job bookmark to process only new data

AnswerB

More DPUs provide more memory and compute resources.

Why this answer

Option B is correct because increasing the number of DPUs (Data Processing Units) allocates more memory and compute to the Glue job. Option C is wrong because partitioning does not address memory per task. Option D is wrong because job bookmarks are for incremental processing, not memory. Option A is wrong because the G.1X worker type has limited memory; G.2X or higher is better, but increasing DPUs on the default worker type is simpler.

Full explanation →

1102

MCQhard

Refer to the exhibit. A SageMaker training job failed with the error shown. What is the most likely cause of this error?

A.The input data contains missing or invalid values

B.The training algorithm is not compatible with the data type

C.The training instance type is not powerful enough

D.The training script has a syntax error

AnswerA

NaN or infinity in data cause this error.

Why this answer

The error indicates that the input data contains NaN or infinite values. This is a data quality issue. The algorithm expects clean numeric values.

The algorithm itself is fine; the training script may have a bug but the error specifically points to input data.
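A quick pre-training check for such values might look like the sketch below. It is an illustration only: `find_non_finite` is a hypothetical helper, and the sample rows are made up.

```python
import math

def find_non_finite(rows):
    """Return (row_index, column_index) pairs whose value is NaN or infinite."""
    return [
        (i, j)
        for i, row in enumerate(rows)
        for j, value in enumerate(row)
        if not math.isfinite(value)
    ]

data = [[1.0, 2.5], [float("nan"), 3.0], [4.0, float("inf")]]
print(find_non_finite(data))  # [(1, 0), (2, 1)]
```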

Full explanation →

1103

MCQmedium

A data scientist is trying to launch a SageMaker training job using an IAM role with the above policy. The training job fails with an access denied error. What is the MOST likely reason?

A.The policy does not include s3:ListBucket permission

B.The sagemaker:StopTrainingJob action is not required

C.The sagemaker:CreateTrainingJob action is not allowed for the specific instance type

D.The S3 bucket ARN should not include the /* suffix

AnswerA

SageMaker needs ListBucket to access objects in the bucket.

Why this answer

The policy grants s3:GetObject and s3:PutObject permissions on the S3 bucket ARN with a /* suffix, but SageMaker training jobs also require s3:ListBucket permission at the bucket level (without the /*) to enumerate objects and validate paths during job creation. Without s3:ListBucket, the training job fails with an access denied error even though read/write permissions are present.
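The difference between the two resource forms is easier to see side by side. The sketch below builds a policy document as a Python dictionary; the bucket name is hypothetical, and the statements show only the S3 part of what a training role would need.

```python
import json

BUCKET = "example-training-bucket"  # hypothetical bucket name

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {   # bucket-level action: the resource is the bucket ARN itself
            "Effect": "Allow",
            "Action": ["s3:ListBucket"],
            "Resource": [f"arn:aws:s3:::{BUCKET}"],
        },
        {   # object-level actions: the resource is every key in the bucket
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:PutObject"],
            "Resource": [f"arn:aws:s3:::{BUCKET}/*"],
        },
    ],
}

print(json.dumps(policy, indent=2))
```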

Exam trap

AWS often tests the subtle distinction between bucket-level actions (s3:ListBucket) and object-level actions (s3:GetObject, s3:PutObject), where candidates mistakenly assume object permissions are sufficient for SageMaker training jobs.

How to eliminate wrong answers

Option B is wrong because sagemaker:StopTrainingJob is an unrelated action that is not required for launching a training job; the error occurs during job creation, not stopping. Option C is wrong because the policy does not specify any instance type restrictions, and SageMaker IAM policies do not typically deny CreateTrainingJob based on instance type unless explicitly scoped with a condition key like sagemaker:InstanceTypes. Option D is wrong because the /* suffix is correctly used to grant object-level permissions on all objects within the bucket; the issue is the missing bucket-level s3:ListBucket permission, not the suffix itself.

Full explanation →

1104

MCQmedium

A machine learning team is deploying a real-time inference endpoint for a fraud detection model using Amazon SageMaker. The model requires low latency (<100 ms) and the team expects a steady stream of requests with occasional spikes. Which instance type and deployment strategy should they use to minimize cost while meeting latency requirements?

A.Use ml.p3 instances with a multi-model endpoint.

B.Use AWS Lambda with a container image for serverless inference.

C.Use ml.m5 instances with a production variant and auto-scaling.

D.Use ml.c5 instances with a single endpoint and provisioned concurrency.

AnswerD

Compute-optimized instances and provisioned concurrency meet latency and cost goals.

Why this answer

Option D is correct because ml.c5 instances are compute-optimized for low-latency inference, and provisioned concurrency pre-warms the endpoint to handle steady traffic with spikes without cold starts, meeting the <100 ms requirement cost-effectively. This combination avoids over-provisioning while ensuring consistent performance.

Exam trap

The trap here is that candidates often choose auto-scaling (Option C) thinking it handles spikes cost-effectively, but they overlook the latency penalty of scaling up during a spike, which can exceed 100 ms, whereas provisioned concurrency (Option D) pre-warms capacity to meet the latency requirement.

How to eliminate wrong answers

Option A is wrong because ml.p3 instances are GPU-based and designed for deep learning training, not cost-effective for real-time inference on a fraud detection model that likely uses tree-based or linear models; multi-model endpoints add overhead that can increase latency. Option B is wrong because AWS Lambda has a maximum execution timeout of 15 minutes and cold starts can exceed 100 ms, making it unsuitable for sub-100 ms real-time inference with occasional spikes. Option C is wrong because ml.m5 instances are general-purpose and may not provide the compute-optimized performance needed for low latency; using a production variant with auto-scaling can introduce scaling latency during spikes, potentially violating the 100 ms requirement.

Full explanation →

1105

MCQeasy

A data scientist is performing EDA on a dataset that contains customer demographics and purchase history. The dataset has a column 'age' with some values that are negative or unreasonably high (e.g., 200). The scientist wants to identify and handle these outliers. The scientist is using a SageMaker notebook with pandas. Which approach should the scientist take to effectively handle these outliers?

A.Apply standard scaling to the 'age' column

B.Impute the outlier values with the mean of the column

C.Define reasonable bounds based on domain knowledge and filter or cap the outliers

D.Remove the 'age' column entirely

AnswerC

Domain knowledge provides logical bounds to handle outliers appropriately.

Why this answer

Using domain knowledge to define a valid age range (e.g., 0-120) and filtering out or capping values outside it is the most appropriate approach. Option A is wrong because standard scaling does not handle outliers; the scaled values are still distorted by them. Option B is wrong because imputing with the mean distorts the distribution, and the mean itself is pulled by extreme outliers.

Option D is wrong because removing the entire column loses information.
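The bounds-based approach can be sketched as follows. This is a minimal illustration in plain Python, assuming hypothetical bounds of 0 and 120 and made-up sample values; in a pandas notebook the same idea maps to `Series.between` for filtering and `Series.clip` for capping.

```python
# Hypothetical bounds chosen from domain knowledge; adjust for the real dataset.
MIN_AGE, MAX_AGE = 0, 120

# Made-up sample values, including one negative and one implausibly high age.
ages = [34, -5, 51, 200, 27, 119]

# Option 1: filter, dropping values outside the valid range.
filtered = [a for a in ages if MIN_AGE <= a <= MAX_AGE]

# Option 2: cap (clip) each value to the nearest bound.
capped = [min(max(a, MIN_AGE), MAX_AGE) for a in ages]

print(filtered)  # [34, 51, 27, 119]
print(capped)    # [34, 0, 51, 120, 27, 119]
```

Filtering discards the affected rows, while capping keeps them at the cost of introducing artificial values at the bounds; which one fits depends on how many rows are affected.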

Full explanation →

1106

MCQhard

A company runs a critical ETL job using AWS Glue that writes to an Amazon Redshift cluster. The job occasionally fails due to insufficient disk space on the Redshift cluster. How can the company automate the process to prevent this failure?

A.Use a CloudWatch alarm to trigger a Lambda function that resizes the cluster.

B.Use RA3 node types with managed storage.

C.Increase the number of slices in the Redshift cluster.

D.Reserve additional nodes for the Redshift cluster.

AnswerA

This automates scaling based on disk usage.

Why this answer

Using Amazon CloudWatch to monitor disk space and automatically resize the cluster is the best automated solution. Reserving nodes does not address space. Using RA3 nodes with managed storage is a good proactive step but does not automate resizing.

The correct answer is to monitor and auto-resize.

Full explanation →

1107

MCQeasy

A machine learning engineer is training a linear regression model on a dataset with 50 features. After training, the model achieves high accuracy on the training set but poor accuracy on the test set. Which technique should the engineer use to address this issue?

A.Train a deeper neural network with more layers

B.Add more features through feature engineering

C.Apply L1 or L2 regularization

D.Increase the size of the training dataset

AnswerC

Regularization penalizes large coefficients and reduces overfitting.

Why this answer

The model exhibits overfitting: high training accuracy but poor test accuracy. L1 (Lasso) or L2 (Ridge) regularization penalizes large coefficients, reducing model complexity and improving generalization. This directly addresses the variance problem without requiring more data or features.

Exam trap

AWS often tests the distinction between overfitting and underfitting, and the trap here is that candidates may think adding more data (Option D) is the universal fix for overfitting, when in fact regularization is the most direct and efficient solution for a model with high variance.

How to eliminate wrong answers

Option A is wrong because training a deeper neural network would increase model capacity and likely worsen overfitting, not fix it. Option B is wrong because adding more features through feature engineering would increase dimensionality and exacerbate overfitting, not reduce it. Option D is wrong because increasing the training dataset size can help reduce overfitting, but it is not the most direct or practical fix; regularization is a more immediate and targeted technique for this specific symptom.
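The shrinking effect of L2 regularization can be seen in a small sketch. It assumes the simplest possible case, a one-feature ridge regression without an intercept, where the slope has the closed form w = sum(x*y) / (sum(x*x) + lambda); the data points and lambda values are made up for illustration.

```python
def ridge_slope(xs, ys, lam):
    """Closed-form slope for one-feature ridge regression without an intercept."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    # The penalty term lam is added to the denominator, pulling the slope toward 0.
    return sxy / (sxx + lam)

# Made-up data that roughly follows y = 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

for lam in (0.0, 1.0, 10.0):
    print(lam, round(ridge_slope(xs, ys, lam), 3))
```

A larger lambda yields a smaller coefficient, which is how the penalty trades a little bias for lower variance.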

Full explanation →

1108

MCQmedium

An IAM policy is attached to a data engineering role that writes to an S3 bucket. The policy is shown in the exhibit. What is the effect of this policy?

A.The role can write objects with any encryption, but reading is restricted to SSE-KMS only

B.The role can read and write any object without encryption restrictions

C.The role can only read objects; writing is always denied

D.The role must use SSE-KMS when writing objects; reading is allowed only if the object is encrypted with SSE-KMS

AnswerD

The Allow statement grants GetObject only when SSE-KMS is specified, and the Deny statement enforces SSE-KMS for PutObject.

Why this answer

Option D is correct because the first statement allows GetObject and PutObject only when SSE-KMS is used; the second statement denies PutObject if SSE-KMS is not used, effectively enforcing SSE-KMS for PutObject. Option A is wrong because writes without SSE-KMS are denied. Option B is wrong because both statements condition access on SSE-KMS.

Option C is wrong because the policy allows PutObject with SSE-KMS.

Full explanation →

1109

MCQhard

A data scientist is tuning a linear regression model and observes that the model has high bias and low variance. Which action is most likely to improve model performance?

A.Reduce the number of features

B.Increase regularization

C.Add more features

D.Reduce the amount of training data

AnswerC

Increases complexity, reducing bias.

Why this answer

High bias and low variance indicate underfitting, meaning the model is too simple to capture the underlying patterns in the data. Adding more features increases model complexity, allowing it to learn more relevant relationships and reduce bias. This directly addresses the core issue of underfitting in linear regression.

Exam trap

AWS often tests the bias-variance tradeoff by presenting high bias (underfitting) and high variance (overfitting) scenarios, and the trap here is that candidates mistakenly choose to increase regularization or reduce features, which are remedies for overfitting, not underfitting.

How to eliminate wrong answers

Option A is wrong because reducing the number of features further simplifies the model, which would increase bias and worsen underfitting. Option B is wrong because increasing regularization penalizes model coefficients more heavily, reducing complexity and increasing bias, which is the opposite of what is needed. Option D is wrong because reducing the amount of training data typically increases variance (overfitting risk) and does not address the high bias problem; it may also degrade the model's ability to learn generalizable patterns.

Full explanation →

1110

MCQmedium

A data scientist is analyzing application logs in JSON format. Based on the exhibit, which EDA insight is most valuable for troubleshooting?

A.There is a recurring NullPointerException error.

B.All logs occurred at the same timestamp.

C.There is a connection timeout issue.

D.Most logs are at WARN level.

AnswerA

Three out of four logs are the same error, indicating a pattern.

Why this answer

Option A is correct because the repeated NullPointerException suggests a recurring issue that needs immediate attention. Option B is wrong because there are multiple timestamps. Option C is wrong because the connection timeout appears only once.

Option D is wrong because WARN level appears only once.

Full explanation →

1111

MCQhard

During EDA, a data scientist plots the distribution of a numeric feature and observes that it is right-skewed. The feature will be used as input to a linear model. Which transformation should the data scientist apply?

A.Square transformation

B.Log transformation

C.One-hot encoding

D.Standardization (Z-score)

AnswerB

Log transformation compresses the tail and reduces right skewness.

Why this answer

A right-skewed distribution indicates that the feature has a long tail on the right, which can violate the linear model assumption of normally distributed errors. The log transformation compresses the high values and expands the low values, making the distribution more symmetric and stabilizing variance, which improves linear model performance.

Exam trap

AWS often tests the misconception that standardization or scaling fixes skewness, but candidates must remember that only shape-altering transformations like log or Box-Cox address non-normality; rescaling alone does not.

How to eliminate wrong answers

Option A is wrong because a square transformation amplifies skewness by increasing the spread of high values, making the distribution even more right-skewed. Option C is wrong because one-hot encoding is used for categorical features, not for transforming the distribution of numeric features. Option D is wrong because standardization (Z-score) centers and scales the data but does not change the shape of the distribution, so it does not address skewness.
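The effect of a log transformation on skewness can be checked numerically. The sketch below uses only the standard library; the `incomes` values are a made-up right-skewed sample, and `log1p` (log of 1 + x) is used as an assumption so that zeros would not break the transform.

```python
import math
import statistics

def skewness(values):
    """Population skewness: mean of the cubed standardized deviations."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return sum(((v - mean) / std) ** 3 for v in values) / len(values)

# Hypothetical right-skewed feature with a long tail of large values.
incomes = [20, 22, 25, 27, 30, 32, 35, 40, 60, 150, 400]

# Log transformation compresses large values far more than small ones.
logged = [math.log1p(v) for v in incomes]

print(round(skewness(incomes), 2), round(skewness(logged), 2))
```

The transformed feature is still somewhat right-skewed in this sample, but noticeably less so than the raw values.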

Full explanation →

1112

Multi-Selectmedium

A data scientist is performing EDA on a dataset with mixed data types (numerical and categorical). Which TWO visualizations are most appropriate for understanding the distribution of categorical features?

Select 2 answers

A.Histogram

B.Box plot

C.Pie chart

D.Scatter plot

E.Bar chart

AnswersC, E

Pie charts show proportions of categories.

Why this answer

A bar chart shows the count of each category, and a pie chart shows the proportion. Both are suitable for categorical data. Options A, B, and D are for numerical data.

Full explanation →

1113

MCQeasy

A data analyst is exploring a dataset and notices that the target variable has a Poisson distribution. Which type of model is most appropriate for this target?

A.Poisson regression

B.Linear regression

C.Cox proportional hazards model

D.Logistic regression

AnswerA

Poisson regression models count data with Poisson distribution.

Why this answer

Poisson regression is the correct choice because it is specifically designed for modeling count data where the target variable follows a Poisson distribution, which is characterized by non-negative integer values and a variance equal to the mean. This aligns directly with the data analyst's observation of a Poisson-distributed target, making Poisson regression the most appropriate generalized linear model (GLM) for this scenario.

Exam trap

The trap here is that candidates may confuse Poisson regression with logistic regression or linear regression, mistakenly applying a model for binary outcomes or continuous data to count data, without recognizing that the Poisson distribution's unique properties require a specialized GLM.

How to eliminate wrong answers

Option B is wrong because linear regression assumes a normally distributed target variable with constant variance, which is violated when the target follows a Poisson distribution (count data with variance equal to the mean). Option C is wrong because Cox proportional hazards model is a survival analysis technique for time-to-event data with censoring, not for modeling a Poisson-distributed count target. Option D is wrong because logistic regression models binary or ordinal outcomes using a logit link function, not count data with a Poisson distribution.

Full explanation →

1114

MCQeasy

A machine learning engineer needs to deploy a model that makes real-time predictions with latency under 100ms. The model is a small ensemble of decision trees. Which AWS service is MOST suitable?

A.Amazon EMR with Spark Streaming

B.AWS Glue

C.Amazon SageMaker endpoint

D.AWS Lambda with custom container

AnswerC

SageMaker endpoints are designed for real-time inference with low latency.

Why this answer

Amazon SageMaker provides real-time endpoints with low latency for model inference, and can host the ensemble as a single endpoint.

Full explanation →

1115

MCQmedium

A company uses AWS Glue to run ETL jobs that transform data from Amazon S3 to Amazon Redshift. The Glue job writes data to Redshift using the JDBC connection. Recently, the job has been failing with connection timeout errors when writing to Redshift. The Redshift cluster is a 2-node dc2.large cluster. The Glue job processes about 50 GB of data per run. The errors occur sporadically, and the job succeeds after a few retries. The data engineer needs to resolve the issue to prevent job failures. What should the engineer do?

A.Increase the Redshift cluster size to 4 nodes.

B.Modify the Glue job to write data to S3 first, then use the Redshift COPY command to load data.

C.Increase the Glue job timeout to 24 hours.

D.Use AWS Database Migration Service (DMS) to load data into Redshift.

AnswerB

Using COPY from S3 is the recommended approach for bulk loading into Redshift; it is faster and more reliable than JDBC writes.

Why this answer

Option B is correct. The Glue JDBC driver may not handle large data volumes efficiently; using the COPY command via S3 is optimized for bulk loads and is more reliable. Option C is wrong because increasing the timeout may mask the issue but does not resolve it.

Option A is wrong because the Redshift cluster may be undersized, but the issue is more likely the JDBC write path. Option D is wrong because AWS DMS is for database migration, not ETL jobs.

Full explanation →

1116

MCQhard

Refer to the exhibit. A company is using the Kinesis stream 'my-stream' with one shard. The producer is sending 1000 records per second, each 1 KB. The consumer is reading from the stream using the Kinesis Client Library (KCL). The consumer is able to process 500 records per second per shard. What is the most likely cause of the consumer falling behind?

A.The retention period is set to 24 hours, which is too short.

B.The stream uses KMS encryption, which adds latency.

C.The stream has only one shard, which limits the read throughput to 1 MB/s.

D.The consumer application is not using enhanced fan-out.

AnswerC

One shard supports up to 1 MB/s (1,000 records/s) of writes and 2 MB/s of reads. The producer's 1 MB/s (1000 records * 1 KB) is within those limits, but a single shard's consumer can only process 500 records/s, so it falls behind; more shards allow more parallel processing.

Why this answer

The stream has only 1 shard, which accepts writes of up to 1 MB/s (or 1,000 records per second) and serves reads of up to 2 MB/s. The consumer can process 500 records per second per shard, which is half the incoming rate, so it will fall behind. Increasing the number of shards lets the KCL run more record processors in parallel, raising total consumer throughput.

Option C (not enough shards) is correct. Option A (retention period) does not affect ingestion. Option B (encryption) may add overhead but is not the main issue.

Option D (enhanced fan-out) gives each consumer dedicated read throughput, but reading is not the bottleneck here; the root cause is shard count.
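The arithmetic behind the backlog can be written out as a short sketch. It assumes the rates stated in the question, one record processor per shard, and records spread evenly across shards; the variable names are illustrative only.

```python
import math

producer_rate = 1000  # records per second written to the stream
consumer_rate = 500   # records per second processed per shard

# Records per second by which the consumer falls behind with a single shard.
backlog_growth = producer_rate - consumer_rate

# Unprocessed records accumulated after one hour at these rates.
backlog_after_hour = backlog_growth * 3600

# Shards needed so aggregate processing capacity matches the incoming rate,
# assuming an even distribution of records across shards.
shards_needed = math.ceil(producer_rate / consumer_rate)

print(backlog_growth, backlog_after_hour, shards_needed)  # 500 1800000 2
```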

Full explanation →

1117

MCQhard

A data scientist is tuning hyperparameters for an XGBoost model on a large dataset using Amazon SageMaker. The training job is taking too long, and they want to speed up the tuning process. Which strategy is most effective?

A.Use Bayesian optimization

B.Use grid search with a fine-grained grid

C.Use random search with more iterations

D.Reduce the max depth of trees

AnswerA

Bayesian optimization is more efficient.

Why this answer

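Option A is correct because Bayesian optimization uses the results of earlier training jobs to choose the next hyperparameters, so it typically needs fewer jobs than grid or random search, which matters most on large datasets. Option B is wrong because grid search is exhaustive and slow. Option C is wrong because random search is faster than grid search but less sample-efficient, and adding iterations lengthens tuning.

Option D is wrong because reducing max depth may reduce accuracy.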

Full explanation →

1118

MCQmedium

A company uses Amazon DynamoDB as the primary data store for a real-time recommendation engine. The data engineering team needs to export a daily snapshot of the DynamoDB table to S3 for offline analytics. The table is large (10 TB) and has a high read/write throughput. Which method will export the data with the least impact on the production workload?

A.Use AWS Data Pipeline to export the DynamoDB table to S3.

B.Use DynamoDB Scan API with parallel scans to export data to S3.

C.Use the DynamoDB export to S3 feature available in the AWS Console or CLI.

D.Use AWS Glue ETL job with a DynamoDB connection to export data.

AnswerC

This feature exports data without consuming read capacity units, minimizing impact.

Why this answer

Option C is correct because DynamoDB's export to S3 feature reads from the table's continuous backups (point-in-time recovery) rather than the live table, so it consumes no read capacity units and does not affect production traffic. Option B (Scan) consumes RCUs. Option A (Data Pipeline) also uses scan.

Option D (Glue ETL) also consumes RCUs.

Full explanation →

1119

MCQeasy

A data pipeline uses AWS Glue to crawl an S3 bucket and create a table in the AWS Glue Data Catalog. The data is in Parquet format with partitions by date. After a new partition is added to S3, the crawler runs but the new partition is not reflected in the table. What is the most likely cause?

A.The crawler requires an AWS Lambda trigger to be configured for new partitions.

B.The Parquet schema in the new partition does not match the existing table schema.

C.The new partition folder does not follow the Hive-style partition naming convention expected by the crawler.

D.The S3 bucket has too many partitions, exceeding the Glue crawler limit.

AnswerC

Glue crawlers require partition folders to follow the key=value pattern to automatically detect partitions.

Why this answer

Option C is correct because the most typical reason a crawler runs but does not register a new partition is that the partition path is not in a recognized Hive-style format (e.g., date=2023-01-01/). Crawler settings can also play a role, for example 'Crawl new folders only' or a schema change policy set to ignore changes, but the path format is the most common issue.

Option A is wrong because the crawler does not need Lambda triggers. Option B is wrong because Glue can handle Parquet.

Option D is wrong because the crawler can handle a large number of partitions.
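A quick way to check whether a partition path follows the key=value convention is sketched below. `is_hive_style` is a hypothetical helper, not part of any AWS SDK, and it expects only the partition portion of the S3 prefix (the part after the table's root folder).

```python
def is_hive_style(prefix):
    """Return True if every folder in the partition prefix looks like key=value."""
    parts = [p for p in prefix.strip("/").split("/") if p]
    # Each folder needs an '=' with a non-empty key and a non-empty value.
    return bool(parts) and all(
        "=" in p and all(p.split("=", 1)) for p in parts
    )

print(is_hive_style("date=2023-01-01/"))     # True
print(is_hive_style("year=2023/month=01/"))  # True
print(is_hive_style("2023/01/01/"))          # False
```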

Full explanation →

1120

Multi-Selectmedium

A data scientist is training a classification model on a dataset with missing values in several features. The data scientist wants to use SageMaker to train the model. Which TWO approaches can the data scientist use to handle missing data within the SageMaker training pipeline? (Choose two.)

Select 2 answers

A.Use the SageMaker built-in XGBoost algorithm, which can handle missing values by default.

B.Use the SageMaker BlazingText algorithm, which automatically imputes missing values.

C.Use SageMaker Inference Pipeline to handle missing values at inference time.

D.Use SageMaker Processing to run a custom Python script that imputes missing values before training.

E.Use SageMaker PCA algorithm, which automatically handles missing values.

AnswersA, D

XGBoost has built-in support for missing values.

Why this answer

Option A is correct because the SageMaker built-in XGBoost algorithm has a built-in mechanism to handle missing values by default. It learns the best direction (left or right branch) to route missing values during training, so no explicit imputation is needed. This makes it a seamless choice for datasets with missing data within the SageMaker training pipeline.

Exam trap

The trap here is that candidates often assume all SageMaker built-in algorithms automatically handle missing values, but only XGBoost does; BlazingText and PCA require complete data, and Inference Pipeline is for serving, not training.

Full explanation →

1121

MCQeasy

A company wants to use Amazon SageMaker to train a model on a dataset stored in Amazon S3. The dataset is 100 GB and consists of millions of small JSON files. What should the data engineering team do to optimize training performance?

A.Combine the small JSON files into larger Parquet files using a Spark job on Amazon EMR.

B.Copy the data to an Amazon EBS volume attached to the training instance.

C.Use Amazon Athena to convert the data into a single CSV file.

D.Use S3 Select to filter data before training.

AnswerA

Parquet with larger files improves read efficiency and reduces overhead.

Why this answer

Option A is correct because combining small files into larger ones reduces S3 request overhead and improves I/O performance. Option C is wrong because Athena is not needed for training, and a single CSV file gives up the benefits of a columnar format. Option B is wrong because the EBS volume attached to a training instance does not persist beyond the job and is not shared across instances.

Option D is wrong because S3 Select is for server-side filtering, not for training performance.

Full explanation →

1122

MCQmedium

A company is building a data lake on Amazon S3. They need to enforce encryption at rest for all objects. Which combination of actions will achieve this? (Assume the bucket is versioned.)

A.Use AWS KMS with automatic key rotation

B.Enable S3 default encryption and set a bucket policy to deny PutObject without encryption headers

C.Enable S3 default encryption only

D.Enable S3 Block Public Access

AnswerB

This ensures all objects are encrypted.

Why this answer

Enabling default encryption on the bucket ensures all new objects are encrypted. Adding a bucket policy that denies PutObject without encryption headers prevents unencrypted uploads. The other options do not enforce encryption for all objects.
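A bucket policy of the kind described can be sketched as a Python dictionary and serialized to JSON. The bucket name is hypothetical, and this shows one common pattern (denying `s3:PutObject` when the `x-amz-server-side-encryption` header is absent) rather than the only valid policy.

```python
import json

BUCKET = "example-data-lake-bucket"  # hypothetical bucket name

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DenyUnencryptedUploads",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:PutObject",
            "Resource": f"arn:aws:s3:::{BUCKET}/*",
            # The Null condition is true when the encryption header is missing.
            "Condition": {
                "Null": {"s3:x-amz-server-side-encryption": "true"}
            },
        }
    ],
}

print(json.dumps(policy, indent=2))
```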

Full explanation →

1123

MCQmedium

A data scientist runs the above AWS CLI command and gets the output. The object size is 1 GB. They try to open the CSV file in Amazon Athena but get an error. What is the most likely cause?

A.The file format is not supported by Athena

B.The file exceeds the maximum CSV file size that Athena can query without partitioning

C.The file is not compressed with gzip

D.The file is too large for Athena to query at all

AnswerB

Athena has a 100 MB limit for CSV files when not partitioned.

Why this answer

Option B is correct because Athena has a 100 MB per file limit for CSV queries without partitioning. Option C is wrong because gzip compression is supported but not required. Option D is wrong because 1 GB is large but Athena can handle larger data with proper partitioning.

Option A is wrong because CSV is supported.

Full explanation →

1124

MCQmedium

A data scientist is performing exploratory data analysis on a dataset containing customer transactions. The dataset has a column 'transaction_date' with timestamps in string format. Which AWS service can be used to parse the timestamps and extract features like day of week and hour?

A.Amazon Athena

B.Amazon SageMaker Studio

C.AWS Glue

D.AWS Data Pipeline

AnswerC

AWS Glue provides built-in transformations for timestamp parsing and feature extraction.

Why this answer

Option C is correct because AWS Glue provides built-in transformations to parse timestamps and extract date/time features. Option A is wrong because Amazon Athena is a query service, not a transformation service. Option B is wrong because Amazon SageMaker Studio is an IDE, not a data transformation service.

Option D is wrong because AWS Data Pipeline is a workflow orchestration service, not a timestamp parsing tool.

Full explanation →

1125

Multi-Selecteasy

A data analyst is performing exploratory data analysis on a dataset and notices that there are outliers in several numerical columns. Which TWO methods can the analyst use to identify outliers?

Select 2 answers

A.Create a scatter plot matrix to visually inspect.

B.Calculate z-scores and flag any data points with |z| > 3.

C.Use a box plot to visualize the interquartile range (IQR) and identify points outside the whiskers.

D.Compare the mean and median of each column.

E.Plot a histogram and look for gaps.

AnswersB, C

Z-scores provide a statistical threshold for outliers.

Why this answer

Options B and C are correct. Box plots use the IQR to identify outliers as points more than 1.5*IQR beyond the quartiles. Z-scores identify outliers as points with |z| > 3 (assuming a normal distribution).

Option D is wrong because mean and median are measures of central tendency, not outlier detection. Option E is wrong because histograms show distribution shape but do not explicitly identify outliers. Option A is wrong because pairwise scatter plots may show outliers but are not a systematic method.
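Both rules can be sketched with the standard library alone. The sample `data` is made up, the thresholds (3 for z-scores, 1.5 for the IQR multiplier) are the conventional defaults mentioned above, and the quartiles come from `statistics.quantiles` with its default method, so other tools may compute slightly different quartile values.

```python
import statistics

def zscore_outliers(values, threshold=3.0):
    """Flag values whose absolute z-score exceeds the threshold."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [v for v in values if abs((v - mean) / std) > threshold]

def iqr_outliers(values, k=1.5):
    """Flag values more than k * IQR below Q1 or above Q3 (the box plot rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Made-up sample with one obvious outlier.
data = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11, 12, 10, 11, 13, 12, 95]

print(zscore_outliers(data))  # [95]
print(iqr_outliers(data))     # [95]
```

The z-score rule depends on the mean and standard deviation, which the outliers themselves inflate, so on small samples the IQR rule is often the more robust of the two.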

Full explanation →
