Knowledge + Practice

AWS Certified Machine Learning Specialty (MLS-C01): Questions 1–75

1755 questions total · 24 pages · All types, answers revealed

1

Multi-Select · hard

Which THREE factors should be considered when choosing between a parametric and a non-parametric machine learning model?

Select 3 answers

A.Non-parametric models are generally more flexible

B.Parametric models have lower bias than non-parametric models

C.Parametric models train faster than non-parametric models

D.Non-parametric models are less prone to overfitting

E.Parametric models typically require less training data

Answers: A, C, E

Non-parametric models can fit complex patterns.

Why this answer

Option A is correct because non-parametric models, such as k-nearest neighbors or decision trees, do not assume a fixed functional form for the data, allowing them to capture complex, non-linear relationships. This flexibility makes them well-suited for datasets where the underlying distribution is unknown or highly irregular, but it also increases the risk of overfitting if not properly regularized.

Exam trap

AWS often tests the misconception that non-parametric models are less prone to overfitting because they are 'simpler,' when in fact their flexibility makes them more susceptible to overfitting without careful tuning or large datasets.

2

MCQ · easy

A data scientist needs to perform feature scaling for a dataset containing numerical features with different units (e.g., age in years and income in dollars). Which scaling method is most appropriate when the algorithm assumes data is normally distributed?

A.Standardization (Z-score normalization)

B.Log transformation

C.Min-Max scaling

D.Robust scaling (using median and IQR)

Answer: A

Standardization centers data around zero with unit variance, suitable for normality assumptions.

Why this answer

Option A is correct because Standardization (Z-score normalization) transforms features to have mean 0 and standard deviation 1, which is suitable for algorithms assuming normality. Option C is wrong because Min-Max scaling only rescales values to a fixed range (typically 0 to 1) rather than centering them at zero with unit variance. Option D is wrong because Robust scaling is intended for data with outliers. Option B is wrong because Log transformation is for skewed data, not scaling.
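As a minimal sketch of what standardization does, the snippet below computes z-scores using only Python's standard library. The ages and incomes are made-up illustrative values, and the helper name `standardize` is hypothetical rather than part of any AWS API.

```python
import statistics

def standardize(values):
    """Return z-scores: (x - mean) / population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(x - mean) / std for x in values]

# Made-up sample data in very different units (years vs. dollars).
ages = [22, 35, 47, 51, 63]
incomes = [28_000, 52_000, 61_000, 75_000, 120_000]

z_ages = standardize(ages)
z_incomes = standardize(incomes)

# After standardization both features have mean ~0 and standard deviation ~1,
# so neither dominates purely because of its unit.
print(statistics.fmean(z_ages), statistics.pstdev(z_ages))
print(statistics.fmean(z_incomes), statistics.pstdev(z_incomes))
```

Note that standardization changes only location and scale; it does not by itself make a skewed feature normally distributed.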

3

MCQ · easy

A machine learning engineer is exploring a dataset with 500 features and 10,000 samples. To reduce dimensionality for visualization, which technique is most suitable if the goal is to preserve global data structure?

A.t-Distributed Stochastic Neighbor Embedding (t-SNE)

B.Locally Linear Embedding (LLE)

C.Principal Component Analysis (PCA)

D.Uniform Manifold Approximation and Projection (UMAP)

Answer: C

PCA preserves global variance (covariance structure).

Why this answer

PCA is the most suitable technique for preserving the global data structure when reducing dimensionality because it is a linear method that maximizes variance along orthogonal principal components, capturing the overall covariance structure of the 500 features. Unlike nonlinear methods, PCA ensures that the global relationships (e.g., distances between clusters) are retained, making it ideal for visualization of high-dimensional data where the goal is to see broad patterns.

Exam trap

AWS often tests the misconception that nonlinear methods like t-SNE or UMAP are always better for visualization, but the trap here is that they sacrifice global structure for local detail, making PCA the correct choice when the question explicitly states 'preserve global data structure.'

How to eliminate wrong answers

Option A is wrong because t-SNE is a nonlinear technique that focuses on preserving local neighborhoods and pairwise similarities, often distorting global structure (e.g., cluster sizes and distances) to create visually separable clusters. Option B is wrong because LLE is a nonlinear manifold learning method that preserves local linear relationships between neighbors, but it does not guarantee preservation of global structure and can fail with high-dimensional data (500 features) due to the curse of dimensionality. Option D is wrong because UMAP, while faster than t-SNE, is also a nonlinear technique designed to preserve local and some global structure but prioritizes topological connectivity over global variance, making it less suitable than PCA when the explicit goal is to maintain the overall data covariance and global distances.

4

Multi-Select · easy

Which TWO services can be used to serve machine learning models for real-time inference? (Select TWO.)

Select 2 answers

A.Amazon Rekognition

B.Amazon Athena

C.Amazon SageMaker

D.Amazon Comprehend

E.AWS Lambda

Answers: C, E

SageMaker endpoints are designed for real-time inference.

Why this answer

Amazon SageMaker provides real-time endpoints. AWS Lambda can be used with a custom container for inference. Amazon Rekognition is a specialized service for image analysis, not general model serving. Amazon Comprehend is an NLP service, and Amazon Athena is a query service.

5

MCQ · hard

A company uses Amazon EMR with Spark to process data daily. The job reads from S3 and writes to S3. Recently, the job started failing with 'S3AccessDenied' errors. The IAM role used by EMR has not changed. What is the MOST likely cause?

A.The EMR cluster's security group blocks outbound traffic

B.The S3 bucket policy was updated to deny access to the EMR role

C.The S3 bucket was deleted and recreated

D.The EMR service role was rotated

Answer: B

Bucket policies can deny access even if IAM allows.

Why this answer

S3 bucket policies can be changed independently and may block access. IAM policies are not the only factor; bucket policies also apply. The role itself hasn't changed, but the bucket policy might have been updated to deny access.

Other options are less likely because the role hasn't changed and the errors are access-related.

6

MCQ · medium

A data scientist is working on a regression problem with a dataset that contains outliers. The data scientist is choosing between mean squared error (MSE) and mean absolute error (MAE) as the loss function. Which loss function is more robust to outliers?

A.Both are equally robust to outliers.

B.MSE, because it penalizes large errors more heavily.

C.MAE, because it treats all errors equally.

D.Neither is robust; use Huber loss instead.

Answer: C

MAE is linear in errors, reducing the impact of outliers.

Why this answer

MAE is more robust to outliers because it uses the absolute difference between predicted and actual values, which does not disproportionately penalize large errors. In contrast, MSE squares the errors, causing outliers to have a much larger influence on the loss and model updates. This makes MAE less sensitive to extreme values in regression tasks.

Exam trap

AWS often tests the misconception that a loss function that penalizes errors more heavily is better for robustness, when in fact the opposite is true for outliers.

How to eliminate wrong answers

Option A is wrong because MSE and MAE handle outliers very differently due to their mathematical formulations, so they are not equally robust. Option B is wrong because MSE's heavier penalization of large errors actually makes it less robust to outliers, not more. Option D is wrong because while Huber loss is indeed more robust than both MSE and MAE in some cases, the question asks which of the two given loss functions is more robust, and MAE is the correct choice; Huber loss is not an option in the original comparison.
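The effect of a single outlier on each loss can be checked numerically. The sketch below uses made-up predictions and targets; only the last target differs between the two cases.

```python
def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Made-up values for illustration.
y_pred = [10.0, 12.0, 11.0, 13.0, 12.0]
y_clean = [11.0, 12.0, 10.0, 14.0, 12.0]
y_outlier = [11.0, 12.0, 10.0, 14.0, 62.0]  # last target is an outlier

for name, y_true in [("clean", y_clean), ("with outlier", y_outlier)]:
    print(name, "MSE:", mse(y_true, y_pred), "MAE:", mae(y_true, y_pred))

# How much each loss grows once the outlier is introduced.
mse_growth = mse(y_outlier, y_pred) / mse(y_clean, y_pred)
mae_growth = mae(y_outlier, y_pred) / mae(y_clean, y_pred)
print("MSE growth factor:", mse_growth)
print("MAE growth factor:", mae_growth)
```

Because the outlier's error is squared, MSE grows far more than MAE, which is why MAE is described as the more robust of the two.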

7

MCQ · easy

A data engineer is building a data pipeline that ingests streaming data from IoT devices. The data must be processed in near real-time and stored in Amazon S3 for further analysis. Which AWS service should be used to capture and process the streaming data before storing it in S3?

A.Use Amazon S3 with S3 Event Notifications to trigger AWS Lambda for processing.

B.Use AWS Glue to perform ETL on the streaming data.

C.Use Amazon Kinesis Data Streams to capture the data and Amazon Kinesis Data Firehose to deliver it to S3.

D.Use Amazon Simple Queue Service (SQS) to buffer the data and then process it with AWS Lambda.

Answer: C

Kinesis Data Streams ingests real-time data and Kinesis Data Firehose delivers it to S3.

Why this answer

Amazon Kinesis Data Streams is designed for real-time data ingestion and can be integrated with Lambda for processing. Kinesis Data Firehose can then load the data into S3. Option A is wrong because S3 is a storage service, not a streaming ingestion service.

Option B is wrong because AWS Glue is for ETL and cataloging, not real-time streaming. Option D is wrong because SQS is a message queue, not optimized for streaming analytics.

8

MCQ · hard

A company is running a data pipeline that uses Amazon EMR with Spark to process 100 TB of data daily. The pipeline must complete within 6 hours. Currently, it takes 8 hours. Which optimization will most likely reduce the runtime?

A.Consolidate small input files into fewer larger files

B.Enable EMR Managed Scaling

C.Increase the memory of each node by using r5 instances

D.Use Spot Instances for all core nodes

Answer: B

Managed Scaling dynamically adds resources to meet deadlines.

Why this answer

Option B is correct because enabling EMR Managed Scaling automatically adjusts cluster resources based on workload, which can reduce runtime. Option D is wrong because Spot Instances may be interrupted; Option C is wrong because more memory per node may not help if the bottleneck is parallelism; Option A is wrong because consolidating small files into fewer larger files can reduce overhead but may not be the main issue.

9

MCQ · easy

A data scientist needs to evaluate a binary classification model's performance. Which metric is most appropriate when the cost of false positives is very high?

A.Accuracy

B.F1 score

C.Recall

D.Precision

Answer: D

Minimizes false positives.

Why this answer

Precision is the most appropriate metric when the cost of false positives is very high because it measures the proportion of positive identifications that were actually correct. In binary classification, precision = TP / (TP + FP), so a high precision means very few false positives occur, directly minimizing the costly error type.

Exam trap

The trap here is that candidates often confuse precision with recall or F1 score, mistakenly thinking that minimizing false positives is best achieved by maximizing recall or a balanced metric, rather than directly optimizing precision.

How to eliminate wrong answers

Option A is wrong because accuracy considers both false positives and false negatives equally, and can be misleading when classes are imbalanced; it does not specifically penalize false positives. Option B is wrong because the F1 score is the harmonic mean of precision and recall, balancing both false positives and false negatives, so it does not prioritize minimizing false positives over false negatives. Option C is wrong because recall measures the proportion of actual positives correctly identified (TP / (TP + FN)), which focuses on avoiding false negatives, not false positives.
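The formulas above can be sketched directly from confusion-matrix counts. The counts below (`tp`, `fp`, `fn`) are hypothetical values chosen for illustration, not figures from the question.

```python
def precision(tp, fp):
    """Share of predicted positives that are truly positive: TP / (TP + FP)."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Share of actual positives that were found: TP / (TP + FN)."""
    return tp / (tp + fn)

def f1_score(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)

# Hypothetical counts: few false positives, many false negatives.
tp, fp, fn = 80, 5, 40

p = precision(tp, fp)
r = recall(tp, fn)
f1 = f1_score(p, r)
print("precision:", p)
print("recall:", r)
print("F1:", f1)
```

With these counts precision is high while recall is noticeably lower, which shows why a model tuned to avoid false positives should be judged on precision rather than on recall or F1.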

10

MCQ · medium

A company uses Amazon SageMaker to train a model. The training job fails with an 'OutOfMemory' error. The training data is stored in S3 and the instance type is ml.m5.xlarge. What is the most efficient way to resolve this issue?

A.Enable managed spot training

B.Reduce the batch size in the training script

C.Increase the number of instances using distributed training

D.Use a larger instance type, such as ml.m5.2xlarge

Answer: D

Larger instance provides more memory.

Why this answer

Option D is correct: a larger instance type provides more memory. Option C (increasing the instance count) does not increase memory per instance. Option B (reducing batch size) may help but is less efficient. Option A (enabling managed spot training) does not address memory.

11

MCQ · hard

A machine learning engineer is training a neural network on Amazon SageMaker using a custom Docker container. The training job fails with an error: 'CUDA out of memory.' The training instance is an ml.p3.2xlarge with 16 GB GPU memory. The model and data fit into memory when using batch size 32, but the engineer wants to maximize GPU utilization. Which approach should the engineer use to fix the out-of-memory error while maintaining efficient training?

A.Enable mixed precision training

B.Reduce batch size to 1

C.Use a CPU-only instance

D.Implement gradient accumulation with a larger effective batch size

Answer: D

Accumulates gradients over smaller batches to simulate larger batches.

Why this answer

Gradient accumulation allows the engineer to simulate a larger effective batch size by accumulating gradients over multiple forward/backward passes before performing an optimizer step. This keeps the per-step memory footprint low (avoiding CUDA out-of-memory) while maintaining training dynamics similar to a larger batch, thus maximizing GPU utilization without crashing.

Exam trap

The trap here is that candidates may think mixed precision training (Option A) is the direct solution for CUDA out-of-memory, but it only reduces memory per tensor, not the peak memory from batch size; gradient accumulation is the correct technique to handle large effective batches without exceeding GPU memory.

How to eliminate wrong answers

Option A is wrong because mixed precision training reduces memory usage by using float16 for most operations, but it does not directly address the out-of-memory error caused by batch size being too large; it can help but is not the primary fix for this specific error when the goal is to maintain efficient training with a larger effective batch. Option B is wrong because reducing batch size to 1 would drastically reduce GPU utilization and slow convergence, contradicting the goal of maximizing GPU utilization. Option C is wrong because switching to a CPU-only instance would eliminate the GPU entirely, making the training extremely slow and defeating the purpose of using a GPU-accelerated instance like ml.p3.2xlarge.
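To make the idea of gradient accumulation concrete, here is a framework-free sketch on a toy one-parameter model `y = w * x` with mean squared error. The data, the micro-batch size, and the learning rate are all made-up assumptions; a real training script would do the equivalent by delaying the optimizer step in its deep learning framework.

```python
def grad_mse(w, xs, ys):
    """Gradient of mean squared error with respect to w for the model y = w * x."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

# Made-up data: an "effective batch" of 8 samples.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1]
w = 0.5

# Gradient computed on the whole batch at once (may not fit in GPU memory).
full_grad = grad_mse(w, xs, ys)

# Gradient accumulated over equally sized micro-batches (each fits in memory).
micro_batch_size = 2
num_micro_batches = len(xs) // micro_batch_size
accumulated = 0.0
for i in range(num_micro_batches):
    start = i * micro_batch_size
    end = start + micro_batch_size
    # Divide by the number of micro-batches so the sum is an average.
    accumulated += grad_mse(w, xs[start:end], ys[start:end]) / num_micro_batches

# A single optimizer step is taken only after all micro-batches are processed.
learning_rate = 0.01
w_new = w - learning_rate * accumulated
print(full_grad, accumulated, w_new)
```

With equally sized micro-batches, the accumulated gradient matches the full-batch gradient up to floating-point rounding, while only one micro-batch needs to be held in memory at a time.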

12

MCQ · easy

A company uses Amazon SageMaker to train a linear regression model. The training data includes a feature 'age' with values ranging from 0 to 100. The model's loss is not converging. What is the MOST likely cause?

A.The features are not normalized.

B.There are outliers in the target variable.

C.The instance type is too small.

D.The learning rate is too high.

Answer: A

Unscaled features cause gradient descent to oscillate.

Why this answer

Feature scaling is crucial for linear models trained with gradient descent; the 'age' range (0 to 100) is large compared to other features if they are not scaled. Option C (instance type too small) is unlikely. Option B (outliers in the target) could cause issues, but scaling is more fundamental. Option D (learning rate too high) is possible, but scaling is a common first step.

13

MCQ · medium

A company has deployed a machine learning model on a SageMaker endpoint that serves predictions to a web application. The model uses a custom inference container that loads the model artifacts from an ECR repository. After updating the model with new training data, the data scientist creates a new model and updates the endpoint. However, some users report that they still get predictions from the old model. The data scientist confirms that the endpoint configuration points to the new model. What is the most likely cause?

A.The new model artifacts are not correctly uploaded to S3

B.The endpoint is behind a load balancer that is not updated

C.The inference container is cached and not pulling the new image

D.DNS caching on the client side is resolving to the old endpoint IP address

Answer: D

DNS caching can cause stale responses.

Why this answer

Option D is correct because DNS caching on the client side may resolve the endpoint's DNS name to an old IP address. Option A (incorrect model artifacts) would affect all users, not only some. Option B (load balancer) is wrong because a load balancer is not something the customer manages for a SageMaker endpoint. CloudFront caching could also be an issue if CloudFront were in front of the endpoint, but the question does not mention it.

14

MCQ · medium

A company has a dataset with a large number of missing values in several columns. The data scientist wants to impute missing values without introducing bias. Which approach should be used?

A.Remove rows with any missing values

B.Use iterative imputation (MICE) to model missing values

C.Replace missing values with the mode of each column

D.Replace missing values with the mean of each column

Answer: B

MICE uses relationships among variables to impute, reducing bias.

Why this answer

Option B is correct because MICE (Multiple Imputation by Chained Equations) models each variable with missing values as a function of the other variables, reducing bias. Option D is wrong because mean imputation shrinks variance and distorts relationships between variables. Option A is wrong because dropping rows loses data. Option C is wrong because mode imputation may introduce bias if missingness is not random.

15

MCQ · medium

A data scientist is using Amazon SageMaker to train a model with a custom Docker container. The training script reads data from an S3 bucket and writes the model artifact to an S3 bucket. The training job fails with a 'NoSuchKey' error. What is the MOST likely cause?

A.The training script is not compatible with the Docker image.

B.The training data path specified in the input data channel is incorrect.

C.The Docker image is not available in Amazon ECR.

D.The SageMaker execution role does not have s3:GetObject permission.

Answer: B

NoSuchKey means the S3 key does not exist.

Why this answer

Option B is correct because the error indicates that the specified key does not exist in the S3 bucket. Option C is wrong because a missing ECR image would cause a different error. Option D is wrong because insufficient permissions would cause AccessDenied. Option A is wrong because the training script and image are not the issue.

16

MCQ · hard

A machine learning team is working with a dataset containing high-dimensional sparse features, such as text data represented as bag-of-words. The team wants to reduce dimensionality while preserving the structure of the sparse matrix. Which technique is most appropriate for this scenario?

A.t-distributed Stochastic Neighbor Embedding (t-SNE).

B.Truncated Singular Value Decomposition (SVD).

C.Linear Discriminant Analysis (LDA).

D.Principal Component Analysis (PCA) using the covariance matrix.

Answer: B

Truncated SVD works efficiently on sparse matrices.

Why this answer

Option B is correct because Truncated SVD (e.g., scikit-learn's TruncatedSVD) is designed for sparse matrices and preserves variance. Option D is wrong because PCA using the covariance matrix requires a dense matrix and is computationally expensive for sparse data. Option A is wrong because t-SNE is for visualization, not for general dimensionality reduction that preserves structure. Option C is wrong because LDA is a supervised method and requires labels.

17

MCQ · easy

A company wants to use Amazon SageMaker to train a deep learning model using a custom TensorFlow script. The data is stored in an S3 bucket. Which SageMaker API operation should be used to launch the training job?

A.CreateHyperParameterTuningJob

B.CreateEndpoint

C.CreateTransformJob

D.CreateTrainingJob

Answer: D

CreateTrainingJob starts a training job.

Why this answer

The correct API operation to launch a training job in Amazon SageMaker is CreateTrainingJob. This operation specifies the training algorithm (or custom script), resource configuration (instance type and count), input data configuration (pointing to S3), and output location for the model artifacts. It directly initiates the training process on SageMaker-managed infrastructure.

Exam trap

AWS often tests the distinction between training, tuning, inference, and batch transform operations, so candidates mistakenly choose CreateHyperParameterTuningJob when the question only asks for a single training job, or choose CreateEndpoint when they confuse training with deployment.

How to eliminate wrong answers

Option A is wrong because CreateHyperParameterTuningJob is used to launch a hyperparameter tuning job, which runs multiple training jobs with different hyperparameter combinations to find the best model; it is not the direct API for a single training job. Option B is wrong because CreateEndpoint is used to deploy a trained model as a real-time inference endpoint, not to start training. Option C is wrong because CreateTransformJob is used for batch inference on an existing trained model, not for training a new model.
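The parts of a CreateTrainingJob request described above can be sketched as a plain Python dictionary. The job name, image URI, role ARN, bucket paths, and instance settings below are all placeholder assumptions; the snippet only builds and prints the request and does not call AWS.

```python
import json

# All identifiers below are hypothetical placeholders.
request = {
    "TrainingJobName": "example-tf-training-job",
    "AlgorithmSpecification": {
        # Container image holding the custom TensorFlow training script.
        "TrainingImage": "123456789012.dkr.ecr.us-east-1.amazonaws.com/example-tf:latest",
        "TrainingInputMode": "File",
    },
    "RoleArn": "arn:aws:iam::123456789012:role/ExampleSageMakerRole",
    "InputDataConfig": [
        {
            "ChannelName": "training",
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "S3Prefix",
                    "S3Uri": "s3://example-bucket/train/",
                    "S3DataDistributionType": "FullyReplicated",
                }
            },
        }
    ],
    "OutputDataConfig": {"S3OutputPath": "s3://example-bucket/output/"},
    "ResourceConfig": {
        "InstanceType": "ml.m5.xlarge",
        "InstanceCount": 1,
        "VolumeSizeInGB": 50,
    },
    "StoppingCondition": {"MaxRuntimeInSeconds": 3600},
}

print(json.dumps(request, indent=2))

# With AWS credentials configured, the job could then be launched with boto3:
#   import boto3
#   boto3.client("sagemaker").create_training_job(**request)
```

Keeping the request as data makes it easy to see which fields map to the algorithm, the resources, the S3 input, and the output location before anything is submitted.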

18

MCQ · hard

During exploratory data analysis, a data scientist discovers that a feature has a variance of 0.01, while other features have variances around 1.0. Which action should be taken?

A.Scale the feature to have unit variance.

B.Apply a log transformation to the feature.

C.Impute missing values in the feature.

D.Consider removing the feature or applying variance threshold.

Answer: D

Near-zero variance features are often uninformative.

Why this answer

Option D is correct because features with near-zero variance contribute little information and may cause numerical instability. Option C is wrong because low variance does not indicate missing values. Option A is wrong because scaling to unit variance amplifies noise. Option B is wrong because a log transformation does not increase variance meaningfully.
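A variance threshold can be sketched in a few lines of standard-library Python. The feature values and the cutoff of 0.05 are assumptions chosen for illustration; the right threshold depends on the dataset and on how the features are scaled.

```python
import statistics

def low_variance_features(columns, threshold):
    """Return names of features whose population variance is below threshold."""
    return [
        name
        for name, values in columns.items()
        if statistics.pvariance(values) < threshold
    ]

# Made-up feature columns; feature_a is nearly constant.
columns = {
    "feature_a": [0.9, 1.1, 1.0, 1.05, 0.95, 1.0],
    "feature_b": [-1.2, 0.4, 1.5, -0.8, 0.9, -0.6],
    "feature_c": [2.0, -0.5, 0.7, 1.9, -1.1, 0.3],
}

threshold = 0.05  # hypothetical cutoff
to_review = low_variance_features(columns, threshold)
print(to_review)
```

Flagged features are candidates for removal rather than automatic deletions, since a low-variance feature measured on a small scale can still carry signal.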

19

Multi-Select · hard

A machine learning engineer is designing an automated ML pipeline for training and deploying models. The pipeline must include data validation, model training, hyperparameter tuning, and model deployment. The engineer wants to use AWS services that integrate well and provide version control. Which THREE services should be combined to achieve this? (Choose THREE.)

Select 3 answers

A.AWS Glue

B.AWS Step Functions

C.AWS CodePipeline

D.Amazon EMR

E.Amazon SageMaker

Answers: B, C, E

AWS Step Functions orchestrates the ML pipeline steps.

Why this answer

Option B (AWS Step Functions) orchestrates the pipeline steps. Option E (Amazon SageMaker) provides training, tuning, and deployment. Option C (AWS CodePipeline) manages version control and automation of the entire CI/CD workflow. Option A (AWS Glue) is for ETL but not pipeline orchestration. Option D (Amazon EMR) is for big data processing, not ML pipeline management.

20

MCQ · medium

Refer to the exhibit. A data scientist is using an IAM role with this policy to run a SageMaker processing job that reads data from S3. The job fails with an access error. What is the most likely cause?

A.The policy does not allow sagemaker:CreateProcessingJob

B.The policy does not allow s3:PutObject

C.The policy does not allow s3:ListBucket

D.The policy does not allow s3:GetObject

Answer: C

ListBucket is required to list objects in the bucket.

Why this answer

The processing job needs both s3:GetObject and s3:ListBucket to read objects. The policy lacks s3:ListBucket. Option A is wrong because sagemaker:CreateProcessingJob is allowed. Option D is wrong because the policy allows s3:GetObject. Option B is wrong because s3:PutObject is not needed for reading.
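The exhibit itself is not reproduced on this page. As a hypothetical illustration only, the sketch below builds a policy that grants both read permissions; the bucket name `example-bucket` is a placeholder. The detail worth noticing is that s3:ListBucket applies to the bucket ARN, while s3:GetObject applies to the object ARNs under it.

```python
import json

bucket = "example-bucket"  # hypothetical bucket name

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            # Listing is a bucket-level action, so the resource is the bucket itself.
            "Effect": "Allow",
            "Action": "s3:ListBucket",
            "Resource": f"arn:aws:s3:::{bucket}",
        },
        {
            # Reading is an object-level action, so the resource is the objects.
            "Effect": "Allow",
            "Action": "s3:GetObject",
            "Resource": f"arn:aws:s3:::{bucket}/*",
        },
    ],
}

print(json.dumps(policy, indent=2))
```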

21

MCQ · easy

A data scientist is using Amazon SageMaker to train a model. The training job fails with an error 'Insufficient instance capacity'. Which action should the scientist take to resolve this?

A.Check IAM permissions

B.Retry with a different instance type or Availability Zone

C.Increase the number of instances

D.Reduce the training data size

Answer: B

Different instances/AZs may have available capacity.

Why this answer

Option B is correct because retrying with a different instance type or in a different Availability Zone often resolves capacity issues. Option A is wrong because the error is about capacity, not permissions. Option D is wrong because reducing data size doesn't affect capacity. Option C is wrong because increasing the instance count may exacerbate the problem.

22

MCQ · medium

A data scientist ran an XGBoost training job on Amazon SageMaker using a CSV dataset. The training job failed with the error shown. What is the most likely cause of this failure?

A.The dataset includes a header row

B.One of the rows in the CSV has an extra column

C.The delimiter used in the CSV is not a comma

D.The dataset contains missing values

Answer: B

The error message indicates that line 1 has 11 fields instead of the expected 10, meaning an extra column in that row.

Why this answer

XGBoost on SageMaker expects the CSV input to have a consistent number of columns across all rows. If one row contains an extra column, the parser will fail because it cannot map the additional field to the feature schema defined by the training job. This mismatch causes the 'Error: Number of columns does not match' or similar parsing error.

Exam trap

The trap here is that candidates often confuse a column count mismatch with missing values or delimiter issues, but the error message specifically points to inconsistent row lengths, not data quality or formatting problems.

How to eliminate wrong answers

Option A is wrong because a header row has the same number of fields as the data rows, so a header alone does not cause a column count mismatch error (SageMaker's built-in XGBoost does expect CSV training input without a header record, but that is a separate problem). Option C is wrong because SageMaker's XGBoost implementation defaults to comma delimiter, but if a different delimiter is used, the error would be about unrecognized delimiter or parsing, not a column count mismatch. Option D is wrong because XGBoost natively handles missing values (e.g., by treating them as NaN or using the `missing` parameter), and missing values do not cause a column count error.
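A column-count problem like this can be caught before a training job is submitted. The following is a minimal sketch using Python's `csv` module; the sample data and the helper name `find_inconsistent_rows` are made up for illustration.

```python
import csv
import io
from collections import Counter

def find_inconsistent_rows(csv_text):
    """Return (expected_field_count, [(line_number, field_count), ...]).

    The expected count is taken to be the most common field count in the file.
    """
    rows = list(csv.reader(io.StringIO(csv_text)))
    counts = Counter(len(row) for row in rows)
    expected = counts.most_common(1)[0][0]
    bad = [(i + 1, len(row)) for i, row in enumerate(rows) if len(row) != expected]
    return expected, bad

# Made-up sample: the second line has one extra field.
sample = "1,0.5,3.2,7\n0,1.1,2.8,4,9\n1,0.7,3.0,5\n"

expected, bad_rows = find_inconsistent_rows(sample)
print("expected fields:", expected)
print("inconsistent rows:", bad_rows)
```

Running a check of this kind locally on a sample of the dataset is usually cheaper than discovering the mismatch from a failed training job.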

23

MCQ · medium

A company uses Amazon SageMaker to deploy a model for real-time inference. The endpoint uses an ml.m5.large instance with automatic scaling based on CPU utilization. The team notices that during traffic spikes, the endpoint returns 5xx errors. What should the team do to improve the endpoint's availability?

A.Increase the instance type to ml.c5.2xlarge.

B.Reduce the scaling cooldown period.

C.Place an Application Load Balancer in front of the endpoint.

D.Use Amazon API Gateway to throttle requests.

Answer: A

Larger instance type provides more capacity.

Why this answer

The correct answer is A because upgrading the instance type from ml.m5.large to ml.c5.2xlarge provides more CPU and memory resources, which directly addresses the root cause of 5xx errors during traffic spikes — insufficient compute capacity to handle the request load. Automatic scaling based on CPU utilization may not react quickly enough to sudden spikes, leading to request queuing and timeouts that manifest as 5xx errors. A larger instance type increases the baseline throughput, reducing the likelihood of resource exhaustion before scaling can take effect.

Exam trap

The trap here is that candidates often confuse auto-scaling configuration (cooldown periods, thresholds) with raw capacity planning, assuming that tuning scaling parameters alone can handle sudden spikes, when in fact the instance must have enough headroom to survive the scaling latency.

How to eliminate wrong answers

Option B is wrong because reducing the scaling cooldown period would cause the auto-scaling policy to react more aggressively to short-lived CPU spikes, potentially leading to thrashing and unnecessary instance provisioning, but it does not address the immediate capacity shortfall during the spike itself — the endpoint still needs enough raw compute to handle the burst. Option C is wrong because placing an Application Load Balancer (ALB) in front of the SageMaker endpoint is not supported — SageMaker endpoints use a built-in load balancer (AWS-managed) and cannot have an external ALB inserted; ALBs distribute traffic to targets like EC2 instances, not SageMaker hosted endpoints. Option D is wrong because using Amazon API Gateway to throttle requests would intentionally drop or delay requests during spikes, which would not improve availability but instead cause client-side errors (429 Too Many Requests) and degrade user experience, whereas the goal is to serve all requests successfully.

24

MCQ · easy

A machine learning engineer is working on a regression problem to predict house prices. The dataset contains 500,000 rows and 20 features, including 'sqft_living', 'bedrooms', 'bathrooms', 'floors', 'waterfront', 'view', 'condition', 'grade', 'yr_built', 'zipcode', and 'lat'. After performing exploratory data analysis, the engineer notices that the 'sqft_living' feature has a right-skewed distribution with a long tail. The 'zipcode' feature is categorical with 70 unique values. The 'lat' feature is continuous. The engineer wants to prepare the data for a linear regression model. Which action should the engineer take to improve model performance?

A.Remove the 'sqft_living' feature because it violates the normality assumption.

B.Apply a log transformation to the 'sqft_living' feature.

C.One-hot encode the 'zipcode' feature to capture location effects.

D.Apply standard scaling (z-score) to the 'sqft_living' feature.

Answer: B

Log transformation reduces right skewness, making the distribution more symmetric.

Why this answer

Linear regression does not strictly require normally distributed features, but a strongly right-skewed feature like 'sqft_living' tends to have a nonlinear relationship with the target and a few extreme values that dominate the fit, which can lead to poor model performance. Applying a log transformation compresses the long tail, making the distribution more symmetric and helping the model learn a linear relationship between the feature and the target. This is a standard preprocessing step for skewed features in regression tasks.

Exam trap

AWS often tests the misconception that standard scaling (z-score) can fix skewness, when in reality it only normalizes the mean and variance without altering the shape of the distribution.

How to eliminate wrong answers

Option A is wrong because removing the 'sqft_living' feature outright discards valuable information; linear regression does not require strict normality of features, only that residuals are normally distributed, and skewness can be addressed via transformation. Option C is wrong because one-hot encoding 'zipcode' with 70 unique values would create 69 dummy variables, which is acceptable but not the most impactful action for improving model performance given the stated issue of skewness in 'sqft_living'. Option D is wrong because standard scaling (z-score) only centers and scales the data, which does not address right skewness; it would preserve the long tail and fail to make the distribution more normal.
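The difference between scaling and a log transformation can be checked on a small sample. The square-footage values below are made up, and `skewness` is a simple population-skewness helper written for this sketch.

```python
import math
import statistics

def skewness(values):
    """Population skewness: the mean of cubed z-scores."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return sum(((x - mean) / std) ** 3 for x in values) / len(values)

# Made-up right-skewed living areas with one very large house.
sqft = [800, 950, 1100, 1200, 1350, 1500, 1700, 2100, 3200, 7500]

# Z-score scaling shifts and rescales but keeps the shape of the distribution.
mean, std = statistics.fmean(sqft), statistics.pstdev(sqft)
scaled = [(x - mean) / std for x in sqft]

# Log transformation compresses the long right tail.
logged = [math.log(x) for x in sqft]

print("raw skewness:   ", skewness(sqft))
print("scaled skewness:", skewness(scaled))
print("log skewness:   ", skewness(logged))
```

On this sample the z-scored values keep essentially the same skewness as the raw values, while the log-transformed values are noticeably less skewed.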

25

MCQ · hard

A company uses a linear regression model to predict house prices. The model's R-squared is 0.95 on the training set but 0.60 on the test set. Which of the following is the most likely cause?

A.Overfitting

B.Underfitting

C.Multicollinearity among features

D.Data leakage

Answer: A

High training R-squared with much lower test R-squared is a classic sign of overfitting.

Why this answer

Option A is correct because such a large gap between training and test R-squared indicates overfitting. Option B is wrong because if the model were underfitting, both would be low. Option D is wrong because data leakage would likely cause a high test R-squared. Option C is wrong because multicollinearity affects both sets similarly.


26

MCQhard

A team is training a deep learning model on Amazon SageMaker using a large dataset stored in S3. The training job is taking a long time, and the team suspects that data loading is the bottleneck. The dataset consists of many small files (average size 10KB). Which change would most effectively reduce the I/O bottleneck?

A.Combine the small files into larger files (e.g., TFRecord format)

B.Use SageMaker Pipe mode instead of File mode

C.Increase the number of training instances

D.Use a P3 instance type for better GPU performance

AnswerA

Larger files reduce the number of S3 API calls and improve throughput.

Why this answer

Combining small files into larger ones (e.g., TFRecord or Parquet) reduces the number of S3 GET requests and improves throughput. Using Pipe mode reads data sequentially, which is less efficient for random access. Increasing instance count or using P3 instances addresses compute, not I/O.

Amazon EFS is not recommended for training jobs due to higher latency.
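
As a rough illustration of combining many small files into fewer large ones, the sketch below bundles them into tar shards using only the standard library. Tar is swapped in for TFRecord here to avoid a TensorFlow dependency; the function name, shard naming, and `files_per_shard` value are assumptions, and uploading the shards to S3 is left out.

```python
import tarfile
from pathlib import Path


def pack_into_shards(src_dir, out_dir, files_per_shard=1000):
    """Bundle the files under src_dir into tar shards written to out_dir."""
    # List the inputs first so the shard files are never picked up as inputs.
    files = sorted(p for p in Path(src_dir).rglob("*") if p.is_file())
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    shard_paths = []
    for start in range(0, len(files), files_per_shard):
        shard_path = out / f"shard-{start // files_per_shard:05d}.tar"
        with tarfile.open(shard_path, "w") as tar:
            for f in files[start:start + files_per_shard]:
                # Store paths relative to src_dir so shards are relocatable.
                tar.add(f, arcname=str(f.relative_to(src_dir)))
        shard_paths.append(shard_path)
    return shard_paths
```

With 10 KB inputs and 1,000 files per shard, each shard is roughly 10 MB, so a training job would issue about one S3 GET per thousand samples instead of one per sample.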


27

Multi-Selecthard

Which THREE actions can help reduce the inference latency of a SageMaker endpoint? (Choose three.)

Select 3 answers

A.Use a larger instance type with more CPU/GPU

B.Enable SageMaker Batch Transform to process predictions offline

C.Enable data compression to reduce payload size

D.Use a multi-model endpoint to share instances across models

E.Increase the number of instances in the endpoint

AnswersA, B, C

More compute power reduces per-request latency.

Why this answer

Options A, B, and C are correct. Using a larger instance with more compute power, enabling SageMaker Batch Transform for offline predictions, and enabling data compression reduce latency. Option D (multi-model endpoint) helps with memory but not latency.

Option E (more instances) improves throughput but not per-request latency.


28

MCQmedium

A financial services company is building a model to detect fraudulent credit card transactions. The dataset contains 1 million transactions, with only 0.1% labeled as fraud. The data scientist trains a logistic regression model on the raw dataset and obtains the following results on a held-out test set: accuracy = 99.8%, precision = 50%, recall = 60%, F1 = 0.545. The business requirement is to maximize recall while keeping precision above 80%. Which course of action should the data scientist take to improve the model?

A.Use random undersampling of the majority class to balance the dataset

B.Collect more historical transaction data and retrain the model

C.Train the model with class weights inversely proportional to class frequencies

D.Apply L2 regularization with a higher penalty to reduce overfitting

AnswerC

Class weights help the model focus on the minority class, often improving precision and recall.

Why this answer

The correct answer is C because assigning class weights inversely proportional to class frequencies penalizes misclassifications of the minority class (fraud) more heavily during training. This directly addresses the severe class imbalance (0.1% fraud) by forcing the logistic regression model to learn decision boundaries that improve recall, while the weight ratio can be tuned to maintain precision above 80%. Unlike naive resampling, this approach preserves the original data distribution and avoids information loss.
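
To make "inversely proportional to class frequencies" concrete, here is a minimal sketch using the heuristic n_samples / (n_classes * class_count), which is also what scikit-learn's `class_weight="balanced"` computes. The class counts follow from the question (1 million transactions, 0.1% fraud); the function names are illustrative.

```python
def balanced_class_weights(class_counts):
    """Weight each class by n_samples / (n_classes * class_count)."""
    total = sum(class_counts.values())
    n_classes = len(class_counts)
    return {label: total / (n_classes * count)
            for label, count in class_counts.items()}


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# 1,000,000 transactions with 0.1% fraud, as stated in the question.
counts = {"legit": 999_000, "fraud": 1_000}
weights = balanced_class_weights(counts)

print(weights)               # fraud gets a far larger weight than legit
print(f1_score(0.50, 0.60))  # reproduces the reported F1 of about 0.545
```

Each fraud example ends up weighted 999 times more than a legitimate one, which is exactly the ratio of the class counts.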

Exam trap

AWS often tests the misconception that resampling (undersampling or oversampling) is always the best first step for imbalance, when in fact cost-sensitive learning via class weights is often more effective and stable for linear models like logistic regression.

How to eliminate wrong answers

Option A is wrong because random undersampling of the majority class discards a large number of legitimate transactions, which can lead to loss of valuable patterns and increased variance, often degrading precision rather than maintaining it above 80%. Option B is wrong because simply collecting more historical data does not change the underlying class imbalance ratio; without addressing the imbalance, the model will still be biased toward the majority class and fail to improve recall. Option D is wrong because L2 regularization with a higher penalty reduces overfitting by shrinking coefficients, but it does not target class imbalance; it may actually worsen recall by further suppressing the already weak signal from the minority class.


29

MCQmedium

Refer to the exhibit. A data scientist is using AWS Glue ETL jobs to process data from a source database. The job logs show repeated timeout errors. Which EDA step should the scientist perform to diagnose the issue?

A.Test network connectivity from the Glue job to the source database using telnet.

B.Check the source database table sizes and row counts over time.

C.Switch the Glue ETL job type from Spark to Python shell to reduce overhead.

D.Increase the Glue job timeout to 600 seconds and rerun.

AnswerB

Identifies if data volume growth causes timeouts.

Why this answer

The timeout suggests the job is taking longer than the max 300 seconds. Analyzing source data volume trends helps determine if data size has increased, causing longer processing time. Option A is wrong because the error is about timeout, not connectivity.

Option C is wrong because changing job type does not address root cause. Option D is wrong because increasing timeout without understanding data growth is a temporary fix.


30

MCQmedium

You are deploying a PyTorch model to a SageMaker endpoint. The model is large (5 GB) and the endpoint is using an ml.c5.2xlarge instance. Inference latency is higher than required. Which change would most effectively reduce latency?

A.Reduce the batch size in the inference code

B.Decrease the number of model server workers

C.Enable SageMaker Elastic Inference

D.Use a GPU instance type such as ml.p3.2xlarge

AnswerD

GPU accelerates matrix operations in PyTorch.

Why this answer

Option D is correct because using a GPU instance accelerates PyTorch inference. Option A is wrong because smaller batch size may increase latency per request. Option B is wrong because reducing workers can increase latency.

Option C is wrong because Elastic Inference could help but may not be as effective as full GPU.


31

MCQmedium

A data scientist is analyzing a dataset of customer reviews for a retail company. The dataset contains text reviews, star ratings (1-5), and customer metadata. The scientist wants to perform sentiment analysis to classify reviews as positive or negative. During EDA, the scientist uses Amazon SageMaker Data Wrangler to visualize the distribution of star ratings and notices that 90% of reviews are 4 or 5 stars, while only 2% are 1 star. The scientist is concerned about class imbalance. Which approach should the scientist take to address the imbalance before modeling?

A.Downsample the majority class to create a balanced dataset.

B.Use random oversampling of the minority class to balance the dataset.

C.Use accuracy as the evaluation metric since it is simple.

D.Use the F1-score as the evaluation metric to account for imbalance.

AnswerD

F1-score balances precision and recall, appropriate for imbalanced classes.

Why this answer

Option D is correct because using F1-score as the evaluation metric accounts for class imbalance better than accuracy. Option A is wrong because downsampling the majority class loses valuable data. Option B is wrong because random oversampling can lead to overfitting and is not always best.

Option C is wrong because accuracy is misleading for imbalanced datasets.


32

MCQmedium

A company uses SageMaker built-in BlazingText algorithm for text classification. The model performance is poor on the validation set. The data consists of short documents (average 50 words). Which hyperparameter tuning strategy is most likely to improve performance?

A.Increase bucket size from 0 to 1000000

B.Increase vector dimension from 100 to 300

C.Increase minCount from 1 to 5

D.Decrease window size from 5 to 2

AnswerD

Smaller window size captures local context better for short documents.

Why this answer

BlazingText's default window size of 5 may be too large for short documents (average 50 words), causing the model to learn overly broad context that dilutes local semantic patterns. Decreasing the window size to 2 forces the model to focus on tighter word co-occurrences, which is more effective for short text classification where local n-gram signals are critical.

Exam trap

The trap here is that candidates often assume increasing model capacity (e.g., vector dimension) or filtering rare words (minCount) always helps, but for short documents, the hyperparameter controlling context granularity (window size) is the most impactful lever.

How to eliminate wrong answers

Option A is wrong because bucket size controls subword n-gram hashing for out-of-vocabulary words, not classification performance on short documents; increasing it from 0 to 1,000,000 would add computational overhead without addressing the core issue of context window. Option B is wrong because increasing vector dimension from 100 to 300 risks overfitting on small short-text datasets and does not fix the problem of overly broad context capture. Option C is wrong because increasing minCount from 1 to 5 discards rare but potentially discriminative words in short documents, further reducing signal in an already sparse dataset.


33

MCQhard

A data scientist is using Amazon SageMaker Data Wrangler to perform exploratory data analysis on a dataset. The dataset contains a feature 'age' with values ranging from 0 to 120. The data scientist wants to detect outliers. Which built-in transform in Data Wrangler is most appropriate for this task?

A.Handle Outliers

B.Scale and Normalize

C.Handle Missing

D.Encode Categorical

AnswerA

This transform includes outlier detection methods such as IQR and z-score.

Why this answer

Option A is correct because the 'Handle Outliers' transform provides methods like IQR and z-score to detect and handle outliers. Option B is wrong because 'Scale and Normalize' transforms data but does not detect outliers. Option C is wrong because 'Handle Missing' deals with missing values, not outliers.

Option D is wrong because 'Encode Categorical' is for categorical features.
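
For intuition about what an IQR-based outlier check does, the sketch below flags values outside the usual 1.5 × IQR fences. It is not Data Wrangler's implementation; the `ages` sample and the function name are made up for illustration.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical 'age' sample containing one implausibly large value.
ages = [22, 25, 27, 29, 31, 33, 35, 38, 41, 44, 47, 52, 120]
print(iqr_outliers(ages))
```

Whether a flagged value is an error or a genuine extreme still needs a domain judgment, which is why outlier detection belongs in EDA rather than in an automatic filter.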


34

MCQhard

A data scientist is performing exploratory data analysis on text data. They want to identify the most common terms and their frequencies. Which approach should they use?

A.Perform sentiment analysis on the text.

B.Apply Latent Dirichlet Allocation (LDA) to extract topics.

C.Create a bag-of-words matrix and compute term frequencies.

D.Use word2vec to generate word embeddings.

AnswerC

Bag-of-words directly provides term counts.

Why this answer

Option C is correct because a bag-of-words matrix directly provides term counts. A term frequency-inverse document frequency (TF-IDF) vectorizer can provide weighted frequencies, but the question asks for common terms and their frequencies, so simple counts are appropriate. Option A is wrong because sentiment analysis is not about frequency.

Option B is wrong because LDA is a topic model. Option D is wrong because word2vec produces embeddings, not frequencies.
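
A bag-of-words term count can be sketched with the standard library alone. The tokenizer below (lowercase, runs of letters, digits, and apostrophes) and the sample reviews are simplifying assumptions.

```python
import re
from collections import Counter


def term_frequencies(documents):
    """Count how often each lowercase token appears across all documents."""
    counts = Counter()
    for doc in documents:
        counts.update(re.findall(r"[a-z0-9']+", doc.lower()))
    return counts


# Hypothetical review snippets.
reviews = [
    "Great product, great price",
    "Terrible product",
    "Great service and fast shipping",
]
top_terms = term_frequencies(reviews).most_common(2)
print(top_terms)
```

In practice a stop-word filter is usually applied first, otherwise words like "and" or "the" dominate the top of the list.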


35

MCQmedium

A data scientist is working with a dataset containing customer transactions. The dataset has a column named 'transaction_date' with timestamp values. The scientist wants to create new features such as day of week, hour, and whether the transaction occurred on a weekend. Which AWS service provides built-in feature engineering capabilities for datetime columns?

A.Amazon SageMaker Data Wrangler

B.Amazon Athena

C.AWS Glue ETL

D.Amazon EMR

AnswerA

Data Wrangler has built-in datetime feature extraction.

Why this answer

Amazon SageMaker Data Wrangler includes built-in transformations for datetime features like extracting day, month, hour, etc. Option B (Amazon Athena) can extract parts but not as a feature engineering step. Option C (AWS Glue ETL) requires custom code.

Option D (Amazon EMR) requires more manual effort.
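
For reference, the features named in the question (day of week, hour, weekend flag) amount to the following when written by hand. This is a plain-Python sketch of the idea, not what Data Wrangler runs internally; the function name and sample timestamp are illustrative.

```python
from datetime import datetime


def datetime_features(timestamp):
    """Derive simple calendar features from an ISO 8601 timestamp string."""
    ts = datetime.fromisoformat(timestamp)
    return {
        "day_of_week": ts.weekday(),      # Monday = 0 ... Sunday = 6
        "hour": ts.hour,
        "is_weekend": ts.weekday() >= 5,  # Saturday or Sunday
    }


# Hypothetical 'transaction_date' value (a Saturday afternoon).
print(datetime_features("2024-03-09T14:30:00"))
```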


36

MCQmedium

Refer to the exhibit. A data scientist is unable to run an Amazon Athena query on data in `my-bucket`. The IAM policy shown is attached to the user. What is the most likely reason for the failure?

A.The ListBucket action is not granted.

B.Athena needs s3:PutObject permission to write results.

C.The data is encrypted with SSE-C.

D.The bucket does not exist.

AnswerB

Athena writes output to S3.

Why this answer

Option B is correct because Athena also requires `s3:PutObject` to write query results to an output location. Option A is wrong because ListBucket is already allowed; Option C is wrong because encryption is not involved; Option D is wrong because the bucket exists.


37

MCQmedium

A data scientist is exploring a dataset with many missing values. They want to understand the pattern of missingness before deciding on imputation. Which approach is most appropriate?

A.Compute the correlation matrix of the features with missing values.

B.Drop all rows with any missing values.

C.Impute all missing values with the mean of each column.

D.Visualize the missingness using a heatmap or bar chart.

AnswerD

Visualization helps identify patterns like monotonic or random missingness.

Why this answer

Option D is correct because a heatmap of missing values (using libraries like missingno) visually shows patterns. Option A (correlation matrix) does not show missingness patterns; Option B (drop rows) is premature; Option C (mean imputation) assumes MCAR.


38

MCQeasy

A data scientist is training a binary classification model on a dataset with 10,000 features. The model overfits severely. Which technique is MOST appropriate to reduce overfitting?

A.Apply L1 regularization (Lasso)

B.Use early stopping during training

C.Use PCA to reduce dimensionality

D.Increase the max depth of the model

AnswerA

L1 regularization penalizes the absolute size of coefficients, driving some to zero and reducing overfitting.

Why this answer

L1 regularization (Lasso) can shrink some feature coefficients to zero, effectively performing feature selection and reducing overfitting. Early stopping is more for iterative algorithms. PCA reduces dimensionality but may lose interpretability.

Increasing max depth would worsen overfitting.


39

MCQeasy

A data scientist is training a binary classification model using Amazon SageMaker's XGBoost. The dataset is highly imbalanced (99% negative class, 1% positive class). The data scientist wants to maximize the F1-score. Which parameter adjustment is most appropriate?

A.Set max_depth to 10

B.Set eta to 0.01

C.Set subsample to 0.5

D.Set scale_pos_weight to 99

AnswerD

scale_pos_weight adjusts the balance of positive and negative weights; a value of 99 (ratio of negatives to positives) helps the model focus on the minority class.

Why this answer

Setting scale_pos_weight to 99 is the most appropriate adjustment because it directly addresses class imbalance by assigning a higher weight to the minority (positive) class during training. In XGBoost, scale_pos_weight controls the balance of positive and negative weights, typically set as sum(negative instances) / sum(positive instances), which here is 99/1 = 99. This forces the model to penalize misclassifications of the positive class more heavily, thereby improving recall and F1-score.
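
The ratio can be computed directly from the training labels. This is a minimal sketch assuming binary labels encoded as 0 and 1; the helper name is hypothetical, and the result would then be passed as the `scale_pos_weight` hyperparameter.

```python
def scale_pos_weight(labels):
    """Return count(negative) / count(positive) for 0/1 labels."""
    positives = sum(1 for y in labels if y == 1)
    negatives = len(labels) - positives
    if positives == 0:
        raise ValueError("no positive examples in labels")
    return negatives / positives


# 99% negative, 1% positive, matching the question.
labels = [1] * 10 + [0] * 990
print(scale_pos_weight(labels))
```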

Exam trap

The trap here is that candidates may confuse hyperparameters that control model complexity (max_depth, subsample) or learning rate (eta) with those that directly handle class imbalance, missing that scale_pos_weight is the specific XGBoost parameter designed for this purpose.

How to eliminate wrong answers

Option A is wrong because increasing max_depth to 10 can lead to overfitting, especially on imbalanced data, and does not directly address class imbalance or optimize F1-score. Option B is wrong because setting eta (learning rate) to 0.01 reduces the step size for updates, which can slow convergence but does not specifically handle class imbalance or improve F1-score. Option C is wrong because setting subsample to 0.5 randomly samples 50% of the training data per tree, which may reduce overfitting but does not target the imbalance between positive and negative classes.


40

MCQeasy

Refer to the exhibit. A data scientist is evaluating a binary classification model for spam detection. The exhibit shows a single prediction instance. What is the model's prediction for this instance?

A.Ham

B.0.95

C.The model is unsure because probability is not 1.0

D.Spam

AnswerD

The 'predicted_label' is 'spam'.

Why this answer

The model's prediction is 'Spam' because the prediction instance shows a probability of 0.95 for the 'Spam' class, which exceeds the typical decision threshold of 0.5 used in binary classification. Since the probability for 'Spam' is higher than for 'Ham' (0.05), the model assigns the instance to the class with the highest probability, which is Spam.

Exam trap

AWS often tests the distinction between a model's probability output and its final class prediction, leading candidates to mistakenly select the probability value (0.95) as the prediction instead of the class label (Spam).

How to eliminate wrong answers

Option A is wrong because 'Ham' is the class with the lower probability (0.05), and the model predicts the class with the highest probability, not the lowest. Option B is wrong because 0.95 is the probability score for the Spam class, not the final class prediction; the model outputs a class label (Spam or Ham), not a probability value as the prediction. Option C is wrong because the model is not 'unsure' — in binary classification, a probability of 0.95 indicates high confidence for the Spam class, and the model always makes a deterministic prediction based on the decision threshold (typically 0.5), regardless of whether the probability is 1.0.


41

MCQmedium

A machine learning engineer is deploying a sentiment analysis model using Amazon SageMaker. The model is a BERT-based transformer that takes up to 512 tokens. The engineer notices that inference latency is high (over 500 ms per request) on a single ml.c5.xlarge instance. The application requires latency under 100 ms. The model has already been optimized using half-precision (FP16). Which action should the engineer take to reduce latency?

A.Use a GPU instance such as ml.g4dn.xlarge

B.Reduce the maximum sequence length to 128

C.Increase the batch size for inference requests

D.Use SageMaker Neo to compile the model for the target instance

AnswerA

GPUs accelerate transformer inference significantly.

Why this answer

Option A (use a GPU instance) accelerates inference for transformers. Option B (reduce max sequence length) may hurt accuracy. Option C (increase batch size) can help throughput but not latency for single requests.

Option D (use SageMaker Neo) is for compilation but may not achieve sub-100ms.


42

MCQmedium

A company is building a text classification model to categorize customer support tickets. The dataset is highly imbalanced with 95% of tickets belonging to 'General Inquiry' and 5% to 'Complaint'. The data scientist is using a random forest classifier. Which metric is most appropriate for evaluating model performance on the minority class?

A.Accuracy

B.F1-score for 'Complaint'

C.ROC AUC

D.Precision for 'Complaint'

AnswerB

F1-score balances precision and recall, making it suitable for imbalanced classification.

Why this answer

In a highly imbalanced dataset (95% General Inquiry, 5% Complaint), accuracy is misleading because a model that predicts 'General Inquiry' for every ticket would achieve 95% accuracy but completely fail on the minority class. The F1-score for 'Complaint' is the harmonic mean of precision and recall, providing a balanced evaluation of the model's ability to correctly identify complaints without being skewed by the majority class. For a random forest classifier, this metric directly addresses the minority class performance, which is the primary concern.
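
The 95% figure is easy to reproduce. The sketch below scores a naive model that always predicts 'General Inquiry' on a made-up sample of 100 tickets with the question's 95/5 split; returning 0.0 when there are no true positives is a convention assumed here.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_for_class(y_true, y_pred, positive):
    """F1-score treating `positive` as the class of interest."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0  # convention when precision or recall is zero or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# 5 complaints and 95 general inquiries; the model predicts the majority class only.
y_true = ["Complaint"] * 5 + ["General Inquiry"] * 95
y_pred = ["General Inquiry"] * 100

print(accuracy(y_true, y_pred))                   # high despite missing every complaint
print(f1_for_class(y_true, y_pred, "Complaint"))  # exposes the failure
```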

Exam trap

The trap here is that candidates often choose accuracy due to its simplicity, failing to recognize that on imbalanced datasets it is a deceptive metric, or they select ROC AUC because it is commonly used for binary classification, but it does not isolate minority class performance as effectively as the F1-score.

How to eliminate wrong answers

Option A is wrong because accuracy is inappropriate for imbalanced datasets; a naive model predicting only the majority class would achieve 95% accuracy, masking poor minority class performance. Option C is wrong because ROC AUC measures the trade-off between true positive rate and false positive rate across all thresholds, which can be overly optimistic on imbalanced data and does not directly reflect precision or recall for the minority class. Option D is wrong because precision for 'Complaint' alone ignores recall; a model could achieve high precision by making very few positive predictions (e.g., only when extremely confident), but miss most actual complaints, which is unacceptable for detecting customer complaints.


43

MCQmedium

A data scientist is training a binary classifier using logistic regression on a dataset that is highly imbalanced (95% negative class, 5% positive class). The model achieves 95% accuracy but only predicts the negative class. Which metric should the scientist use to evaluate the model's performance on the positive class?

A.Recall

B.Precision

C.F1 Score

D.Accuracy

AnswerB

Precision measures the proportion of positive predictions that are correct; because the model makes no positive predictions, it is undefined (0/0) and conventionally reported as 0 here, correctly indicating poor performance.

Why this answer

Precision measures the proportion of positive identifications that were actually correct, which is suitable for imbalanced datasets. Accuracy is misleading because the model predicts only the majority class. Recall would be high if the model predicted all positives, but here it predicts none.

F1 score is a harmonic mean of precision and recall, but precision is more directly relevant to the issue of false positives.


44

MCQmedium

A company has a time series forecasting problem with daily sales data. The data shows both trend and seasonality. Which Amazon SageMaker built-in algorithm is most suitable?

A.K-Means

B.Linear Learner

C.DeepAR

D.XGBoost

AnswerC

DeepAR is a built-in algorithm for time series forecasting that handles trend and seasonality.

Why this answer

DeepAR is a supervised learning algorithm for time series forecasting that explicitly models both trend and seasonality using recurrent neural networks (RNNs). It is designed to handle multiple related time series, incorporate additional features like holidays or promotions, and produce probabilistic forecasts, making it the most suitable choice for daily sales data with trend and seasonal patterns.

Exam trap

The trap here is that candidates often choose XGBoost (Option D) because it is a powerful general-purpose algorithm, but they overlook that DeepAR is specifically designed for time series forecasting with trend and seasonality, whereas XGBoost requires manual feature engineering (e.g., lag variables, rolling statistics) to capture temporal patterns and does not natively produce probabilistic forecasts.

How to eliminate wrong answers

Option A is wrong because K-Means is an unsupervised clustering algorithm used for grouping data points based on similarity, not for forecasting time series with trend and seasonality. Option B is wrong because Linear Learner is a linear regression or classification algorithm that assumes independence of observations and cannot inherently capture temporal dependencies like seasonality or trend without extensive feature engineering. Option D is wrong because XGBoost is a gradient boosting algorithm primarily for tabular data and classification/regression tasks; while it can be used for time series with lag features, it is not purpose-built for sequential forecasting and lacks native support for probabilistic outputs and seasonal patterns.


45

MCQhard

A data scientist runs a SageMaker training job that fails with the above error. The S3 bucket and object exist, and the IAM role has s3:GetObject permission. What is the MOST likely cause?

A.The S3 object was uploaded with incorrect checksum or is corrupted

B.The training instance does not have internet access

C.The S3 bucket has versioning enabled and the object version is not specified

D.The IAM role lacks kms:Decrypt permission for an encrypted S3 object

AnswerA

A corrupted file can cause size mismatch and zero-byte download.

Why this answer

The error indicates that SageMaker cannot read the S3 object, even though the bucket and object exist and the IAM role has s3:GetObject permission. The most likely cause is that the object was uploaded with an incorrect checksum or is corrupted, which prevents SageMaker from verifying the integrity of the data during the training job initialization. SageMaker uses ETag (MD5 checksum) validation when reading objects, and a mismatch triggers a failure.

Exam trap

The trap here is that candidates often assume permission or network issues are the root cause, but the error specifically points to a data integrity problem, which is a subtle but critical detail in SageMaker's S3 interaction.

How to eliminate wrong answers

Option B is wrong because SageMaker training jobs access S3 via AWS internal network endpoints, not the public internet; internet access is not required for S3 data retrieval. Option C is wrong because versioning does not require specifying an object version unless the object is explicitly requested by version ID; SageMaker defaults to the latest version if none is specified. Option D is wrong because the error message does not mention encryption or KMS; if the object were encrypted with SSE-KMS, the error would explicitly indicate a decryption permission failure, not a generic read failure.


46

Multi-Selecthard

A company is building a real-time anomaly detection system for network traffic logs. The logs are ingested via Amazon Kinesis Data Streams and processed with an Amazon SageMaker endpoint for inference. The team needs to ensure that the inference results are stored durably and can be replayed for model retraining. The system must handle at least 10,000 records per second with low latency. Which three AWS services should the team use to build this architecture? (Select THREE.)

Select 3 answers

A.AWS Glue ETL

B.Amazon Kinesis Data Streams

C.Amazon Kinesis Data Analytics for Apache Flink

D.Amazon Kinesis Data Firehose

E.Amazon SageMaker

AnswersB, C, E

Kinesis Data Streams provides the ingestion layer with low latency and high throughput.

Why this answer

Amazon Kinesis Data Streams is the correct ingestion layer because it provides durable, real-time data streaming with the ability to handle over 10,000 records per second. It acts as the source of truth for network traffic logs, enabling low-latency processing and replay for model retraining.

Exam trap

The trap here is that candidates often confuse Kinesis Data Firehose with Kinesis Data Streams, assuming Firehose's simplicity and S3 integration make it suitable for real-time inference, but Firehose lacks the record-level replay and low-latency processing required for this use case.


47

MCQhard

A company uses Amazon Redshift as a data warehouse. They need to load 50 TB of clickstream data from S3 into Redshift daily. The data arrives in 5-minute intervals as gzipped CSV files. The target table has a sort key and a distribution key. The load must complete within 2 hours. Which approach is MOST efficient?

A.Use AWS Glue to transform the data and write to Redshift using JDBC.

B.Use a staging table and then merge using a stored procedure.

C.Use a series of INSERT statements from a Lambda function.

D.Use the COPY command with a manifest file and gzip compression.

AnswerD

COPY is optimized for bulk loading from S3.

Why this answer

The COPY command is the most efficient way to load large volumes of data into Amazon Redshift because it uses the cluster's massively parallel processing (MPP) architecture to read data directly from S3 in parallel across all nodes. With a manifest file, you can specify multiple gzipped CSV files, and the gzip compression reduces network I/O and storage overhead. This approach can easily load 50 TB within 2 hours, especially when the target table has a sort key and distribution key, as COPY automatically leverages these for optimal data distribution and sorting during the load.
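
A COPY manifest is a small JSON document listing the S3 objects to load. The sketch below builds one; the bucket and key names are hypothetical, and the helper is illustrative rather than part of any AWS SDK.

```python
import json


def build_copy_manifest(s3_urls):
    """Return a Redshift COPY manifest (JSON text) for the given S3 URLs."""
    entries = [{"url": url, "mandatory": True} for url in s3_urls]
    return json.dumps({"entries": entries}, indent=2)


# Hypothetical 5-minute clickstream files.
urls = [
    "s3://example-clickstream/2024/03/09/1400.csv.gz",
    "s3://example-clickstream/2024/03/09/1405.csv.gz",
]
manifest = build_copy_manifest(urls)
print(manifest)
```

The manifest would be uploaded to S3 and referenced by a COPY statement that includes the `MANIFEST` and `GZIP` options. Setting `mandatory` to true makes the load fail if a listed file is missing, rather than silently skipping it.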

Exam trap

The trap here is that candidates may overcomplicate the solution by choosing AWS Glue or staging tables, not realizing that Redshift's COPY command is purpose-built for high-speed parallel ingestion from S3 and is the most efficient method for bulk data loads.

How to eliminate wrong answers

Option A is wrong because AWS Glue writing to Redshift via JDBC is a row-by-row or small-batch operation that cannot match the parallel throughput of the COPY command, and it would introduce unnecessary transformation overhead for already-structured CSV data. Option B is wrong because using a staging table and a stored procedure merge adds extra steps and complexity without improving load speed; the COPY command can directly load into the target table with proper sort and distribution keys, making a staging table redundant for this bulk load scenario. Option C is wrong because a series of INSERT statements from a Lambda function would be extremely slow and inefficient for 50 TB of data, as each INSERT is a single-row operation that cannot leverage Redshift's parallel processing, and Lambda has a 15-minute execution timeout that would require complex orchestration to handle the full load.


48

MCQhard

A data scientist is building a recommendation system for an e-commerce platform using Amazon SageMaker. The system needs to provide personalized product recommendations based on user purchase history and product metadata. The dataset contains 10 million users and 1 million products. Which algorithm should the data scientist use as the core of the recommendation engine?

A.Linear Learner

B.XGBoost

C.K-Means

D.Factorization Machines

AnswerD

Factorization Machines handle sparse data well and are designed for recommendation tasks.

Why this answer

Factorization Machines (FM) are specifically designed for high-dimensional sparse data like user-item interactions, making them ideal for recommendation systems with 10 million users and 1 million products. FM can capture pairwise feature interactions (e.g., user-product affinities) efficiently using factorized parameters, which scales well to large datasets and supports personalized recommendations from purchase history and metadata.
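
The "factorized parameters" can be made concrete with the second-order FM prediction, where the weight of each feature pair is the dot product of two latent vectors. The sketch below evaluates it with the standard linear-time identity; the tiny one-hot input and all weights are made-up values, and this is not SageMaker's implementation.

```python
def fm_predict(x, w0, w, v):
    """Second-order factorization machine prediction.

    x: feature values, w0: bias, w: linear weights,
    v: one latent vector (length k) per feature.
    """
    linear = sum(wi * xi for wi, xi in zip(w, x))
    pairwise = 0.0
    for f in range(len(v[0])):
        # Identity: sum_{i<j} v_if v_jf x_i x_j
        #   = 0.5 * ((sum_i v_if x_i)^2 - sum_i (v_if x_i)^2)
        s = sum(v[i][f] * x[i] for i in range(len(x)))
        s_sq = sum((v[i][f] * x[i]) ** 2 for i in range(len(x)))
        pairwise += 0.5 * (s * s - s_sq)
    return w0 + linear + pairwise


# One-hot encoding of (user 0, product 0) with 2 users and 2 products.
x = [1, 0, 1, 0]
w0 = 0.1
w = [0.2, 0.0, 0.3, 0.0]
v = [[1.0, 2.0], [0.0, 0.0], [3.0, 4.0], [0.0, 0.0]]
print(fm_predict(x, w0, w, v))
```

With one-hot user and product features, the pairwise term reduces to the dot product of that user's and that product's latent vectors, which is why FM behaves like matrix factorization on sparse interaction data.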

Exam trap

The trap here is that candidates often pick XGBoost (B) because it is a powerful general-purpose algorithm, but they overlook that it cannot efficiently handle the extreme sparsity and pairwise interaction learning required for large-scale recommendation systems, which Factorization Machines are purpose-built for.

How to eliminate wrong answers

Option A is wrong because Linear Learner is a supervised learning algorithm for regression or classification that assumes linear relationships and cannot model complex feature interactions (e.g., user-product pairs) inherent in recommendation systems. Option B is wrong because XGBoost is a tree-based ensemble method that struggles with high-dimensional sparse categorical data (e.g., user and product IDs) and does not naturally handle pairwise interaction learning without extensive feature engineering. Option C is wrong because K-Means is an unsupervised clustering algorithm that groups similar users or products but cannot produce personalized recommendations based on user-item interactions or predict ratings for unseen pairs.

Full explanation →

49

MCQhard

A company uses Amazon Kinesis Data Streams to ingest real-time clickstream data from a website. The data is consumed by a Lambda function that writes records to an S3 bucket. Recently, the number of shards was increased from 2 to 4 to handle higher throughput. After the change, the Lambda function started processing records with increased latency and some records were being written out of order. What is the MOST likely cause?

A.The S3 bucket is not configured with versioning, causing overwrites.

B.The Lambda function is reading from the oldest sequence number, causing high IteratorAgeSeconds.

C.The Lambda function’s reserved concurrency is too low for the increased shard count.

D.The partition key used by the producer does not ensure that related records go to the same shard after resharding.

AnswerD

After resharding, the mapping of partition keys to shards changes. If ordering matters, the partition key must be chosen to keep related records together.

Why this answer

Option D is correct because after resharding from 2 to 4 shards, the mapping of partition keys to shards changes. If the producer does not use a partition key that ensures related records (e.g., same user session) are routed to the same shard, records that were previously ordered within a shard may now be split across multiple shards. Since the Lambda consumer processes shards independently, records from the same logical sequence can arrive out of order, and the increased shard count can also cause higher latency if the consumer is not properly parallelized.

Exam trap

The trap here is that candidates often confuse increased shard count with a need for more concurrency (Option C), but the real issue is that resharding changes the partition-to-shard mapping, which can break ordering guarantees unless the producer explicitly handles the new hash range.

How to eliminate wrong answers

Option A is wrong because S3 versioning controls object overwrites and deletions, not the ordering or latency of records written by Lambda; out-of-order writes are caused by upstream data distribution, not S3 configuration. Option B is wrong because the starting position of a Kinesis trigger (for example LATEST or TRIM_HORIZON) is a setting on the event source mapping, not something that changes when shards are added; a high IteratorAgeSeconds would indicate a slow consumer and would not by itself explain out-of-order records. Option C is wrong because reserved concurrency limits the maximum number of concurrent Lambda executions, but the default unreserved concurrency is usually sufficient for 4 shards; low concurrency would cause throttling (e.g., 429 errors), not out-of-order processing.
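
How a partition key ends up on a shard can be sketched as follows. Kinesis hashes the partition key with MD5 into a 128-bit integer and routes the record to the shard whose hash key range contains it. The sketch assumes the hash space is split into equal ranges, which real streams do not guarantee, and the session keys are hypothetical.

```python
import hashlib

HASH_SPACE = 2 ** 128  # Kinesis hash keys are 128-bit integers

def hash_key(partition_key: str) -> int:
    """MD5 of the partition key, read as a 128-bit unsigned integer."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big")

def shard_index(partition_key: str, shard_count: int) -> int:
    """Index of the shard whose range contains the key, ASSUMING the hash
    space is divided into `shard_count` equal, contiguous ranges."""
    return hash_key(partition_key) * shard_count // HASH_SPACE

keys = ["session-a", "session-b", "session-c", "session-d"]  # hypothetical keys
before = {k: shard_index(k, 2) for k in keys}  # 2-shard layout
after = {k: shard_index(k, 4) for k in keys}   # 4-shard layout after splitting
```

A given key still maps to exactly one shard at any moment, but that shard is a different one after the split, so per-key order is only preserved if the records left in the parent shard are processed before those in its child shards.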

Full explanation →

50

MCQeasy

A data scientist needs to process a large volume of streaming data from IoT devices and store the results in Amazon S3 for further analysis. Which AWS service is most suitable for ingesting and processing this data in near real-time?

A.Amazon Redshift

B.AWS Glue

C.Amazon Kinesis Data Analytics

D.Amazon EMR

AnswerC

Kinesis Data Analytics processes streaming data in real-time.

Why this answer

Amazon Kinesis Data Analytics is designed for real-time processing of streaming data. AWS Glue is for batch ETL, Amazon EMR for big data processing, and Amazon Redshift for data warehousing.

Full explanation →

51

MCQeasy

A company is building a binary classification model to predict customer churn. The dataset has 10,000 samples with 500 churners (positive class). The data scientist trains a logistic regression model and obtains an accuracy of 95%. However, the model predicts all customers as non-churn. Which metric should the data scientist use to evaluate the model's performance?

A.AUC-ROC

B.F1-score

C.Accuracy

D.Confusion matrix

AnswerB

F1-score balances precision and recall; with all negatives predicted, recall is 0, so F1 is 0, clearly showing poor performance on churners.

Why this answer

The F1-score is the harmonic mean of precision and recall, making it more informative than accuracy under class imbalance. Since the model predicts all customers as non-churn (accuracy is 95% because 9,500 of the 10,000 customers are non-churners), precision for the positive class is undefined (there are no positive predictions) and recall is 0, so the F1-score, conventionally reported as 0 in this case, correctly reveals the model's failure to identify any churners.

Exam trap

AWS often tests the trap that candidates choose accuracy because it is high (95%), failing to recognize that accuracy is meaningless in imbalanced datasets when the model predicts only the majority class.

How to eliminate wrong answers

Option A is wrong because AUC-ROC can be misleading with severe class imbalance; a model that always predicts the majority class can still achieve a high AUC-ROC if the scores are well-separated, but it fails to capture the complete lack of positive predictions. Option C is wrong because accuracy is dominated by the majority class (95% non-churn) and gives a false sense of performance when the model never predicts the positive class. Option D is wrong because a confusion matrix is a visualization tool, not a single metric; while it would show zero true positives, the question asks for a metric to evaluate performance, and the confusion matrix itself is not a scalar metric.
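
The arithmetic behind this question can be checked directly. The counts come from the scenario (10,000 customers, 500 churners, every prediction "non-churn"); treating the undefined precision as 0.0 is a common convention and is an assumption here.

```python
# Confusion-matrix counts for the scenario in the question.
tp, fp = 0, 0        # the model never predicts "churn"
fn, tn = 500, 9500   # all 500 churners are missed

accuracy = (tp + tn) / (tp + fp + fn + tn)
recall = tp / (tp + fn)
# Precision is 0/0 (no positive predictions); use 0.0 by convention.
precision = tp / (tp + fp) if (tp + fp) else 0.0
denom = precision + recall
f1 = (2 * precision * recall / denom) if denom else 0.0

print(f"accuracy={accuracy:.2f} recall={recall:.2f} f1={f1:.2f}")
# prints: accuracy=0.95 recall=0.00 f1=0.00
```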

Full explanation →

52

Multi-Selectmedium

Which TWO actions can help reduce overfitting in a decision tree model? (Choose 2.)

Select 2 answers

A.Prune the tree after training

B.Increase the maximum depth of the tree

C.Set a minimum number of samples per leaf

D.Increase the number of features considered at each split

E.Use all training data without validation

AnswersA, C

Pruning removes branches that have little predictive power, reducing overfitting.

Why this answer

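Pruning the tree after training removes branches that have little predictive power, reducing overfitting by simplifying the model. This technique directly addresses the variance component of the bias-variance tradeoff, making the model generalize better to unseen data. Setting a minimum number of samples per leaf (option C) has a similar effect during training: it stops the tree from creating leaves that fit only a handful of examples.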

Exam trap

AWS often tests the misconception that increasing model complexity (e.g., deeper trees or more features) always improves accuracy, when in fact it increases overfitting; candidates may incorrectly select options that add complexity instead of regularization.

Full explanation →

53

MCQeasy

A data scientist is exploring a dataset and finds that the correlation between two features is 0.95. What should the data scientist do to address multicollinearity before training a linear regression model?

A.Remove one of the two features

B.Apply L2 regularization

C.Standardize the features

D.Apply Principal Component Analysis

AnswerA

Removing one feature eliminates the high correlation.

Why this answer

Option A is correct because removing one of the highly correlated features reduces multicollinearity. Regularization (B) like Ridge can help but does not remove multicollinearity. PCA (D) changes interpretability.

Scaling (C) does not affect correlation.
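
The claim that scaling leaves correlation unchanged can be verified with a small sketch. The two feature columns below are hypothetical values on very different scales.

```python
from statistics import mean, pstdev

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def standardize(xs):
    """Z-score scaling: subtract the mean, divide by the standard deviation."""
    m, s = mean(xs), pstdev(xs)
    return [(x - m) / s for x in xs]

# Hypothetical, strongly correlated features.
feature_a = [800.0, 950.0, 1100.0, 1400.0, 2000.0]
feature_b = [2.0, 2.0, 3.0, 4.0, 5.0]

r_raw = pearson(feature_a, feature_b)
r_scaled = pearson(standardize(feature_a), standardize(feature_b))
```

`r_raw` and `r_scaled` agree up to floating-point error, which is why standardization alone does not address multicollinearity.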

Full explanation →

54

MCQeasy

A company is training a large language model on Amazon SageMaker using a single GPU instance. The training is taking too long. Which change would most likely reduce training time?

A.Use a larger instance with multiple GPUs and enable distributed training

B.Increase the instance memory

C.Decrease the batch size

D.Store the training data in S3 Glacier for faster access

AnswerA

Distributed training across multiple GPUs reduces training time.

Why this answer

Using multiple GPUs in a distributed training job can parallelize work and reduce time. Option C is wrong because decreasing batch size often increases training time due to more updates. Option D is wrong because S3 Glacier storage classes are designed for archival, not for faster access, and archived objects generally have to be restored before they can be read.

Option B is wrong because increasing instance memory may not help if the bottleneck is compute.

Full explanation →

55

MCQmedium

A data scientist is analyzing server logs stored in Amazon CloudWatch Logs. The above snippet shows three log entries. They want to count the number of 500 errors per minute using CloudWatch Logs Insights. Which query should they use?

A.fields @timestamp, status | filter status = 500 | stats count() by bin(1m)

B.fields @timestamp, @message | filter @message like /ERROR/ | stats count() by bin(1m)

C.fields @timestamp, @message | filter @message like /500/ | sort @timestamp desc

D.fields @timestamp, @message | filter @message like /500/ | stats count() by bin(1m)

AnswerD

Correctly filters for 500 status code and counts per minute.

Why this answer

Option D is correct because the query filters messages containing 500 and counts the matches in one-minute bins. Option B is wrong because it filters on the text ERROR, not on the 500 status code. Option A is wrong because it filters on a status field that may not exist in the log events.

Option C is wrong because it sorts the matching events without aggregating them with stats.

Full explanation →

56

MCQmedium

A data scientist is exploring a dataset with 100 features. After generating pair plots, the scientist notices that many features have skewed distributions. Which transformation should the scientist apply to make the distributions more Gaussian-like for modeling?

A.Log transformation

B.Yeo-Johnson transformation

C.Standard scaling (z-score normalization)

D.Box-Cox transformation

AnswerB

Works for any real values.

Why this answer

Option B is correct because Yeo-Johnson can handle both positive and negative values. Option A is wrong because log transform only works for positive values. Option D is wrong because Box-Cox also requires strictly positive values.

Option C is wrong because standard scaling does not fix skewness.
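
For reference, the Yeo-Johnson transform of a single value can be written out as a short function. This is a sketch of the standard piecewise formula; the sample values and the choice of lambda = 0 are illustrative only, since in practice lambda is usually estimated from the data.

```python
import math

def yeo_johnson(y: float, lam: float) -> float:
    """Yeo-Johnson power transform of one value for a given lambda."""
    if y >= 0:
        if abs(lam) < 1e-12:
            return math.log1p(y)                  # lambda == 0
        return ((y + 1.0) ** lam - 1.0) / lam     # lambda != 0
    if abs(lam - 2.0) < 1e-12:
        return -math.log1p(-y)                    # lambda == 2
    return -(((1.0 - y) ** (2.0 - lam)) - 1.0) / (2.0 - lam)  # lambda != 2

# Hypothetical sample containing negative, zero, and large positive values.
values = [-3.0, -0.5, 0.0, 2.0, 50.0]
transformed = [yeo_johnson(v, 0.0) for v in values]
```

Unlike a plain log or Box-Cox transform, the function is defined for every real input, including zero and negatives.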

Full explanation →

57

MCQeasy

During EDA, a data scientist notices that a numeric feature 'age' has outliers beyond 3 standard deviations. What is the most appropriate first step?

A.Use the feature as-is in the model

B.Apply a log transformation to the feature

C.Remove all rows with outlier values

D.Investigate the source of the outliers

AnswerD

Understanding outliers guides proper handling.

Why this answer

Option D is correct because investigating the outliers helps determine if they are data errors or valid. Option C is wrong because removing without investigation may lose valuable data. Option B is wrong because a log transformation is a treatment, not a first step.

Option A is wrong because modeling with noise is not prudent.

Full explanation →

58

MCQeasy

A company is building a model to classify customer reviews as positive or negative. The dataset has 10,000 positive and 100 negative reviews. Which metric is most appropriate for evaluating model performance?

A.F1 score.

B.Accuracy.

C.Mean squared error.

D.AUC-ROC.

AnswerA

F1 score considers both precision and recall, good for imbalance.

Why this answer

Option A is correct because F1 score balances precision and recall, which is important for imbalanced datasets. Option B is wrong because accuracy can be misleading (e.g., about 99% by predicting every review as positive). Option D is wrong because AUC-ROC can be optimistic for imbalanced data.

Option C is wrong because mean squared error is for regression.

Full explanation →

59

MCQeasy

Refer to the exhibit. A SageMaker endpoint logs this error. What is the most likely cause?

A.The model is corrupted

B.There is a network connectivity issue

C.The input data type is incorrect

D.The input data has fewer features than the model expects

AnswerD

The error explicitly states shape mismatch: expected 10 features, got 8.

Why this answer

Option D is correct because the error indicates input shape mismatch. Option C is wrong because data type is not mentioned. Option A is wrong because model corruption would cause different errors.

Option B is wrong because network issues would not cause shape mismatch.

Full explanation →

60

MCQeasy

A machine learning team is developing a model to predict housing prices. They have a dataset with numerical features like square footage and number of bedrooms, and categorical features like neighborhood. Which preprocessing step is essential before training a linear regression model?

A.Normalize all numerical features to have zero mean and unit variance

B.Remove highly correlated features

C.One-hot encode categorical features

D.Apply Principal Component Analysis (PCA) to reduce dimensionality

AnswerC

Linear regression requires numerical input; one-hot encoding is needed for categorical variables.

Why this answer

One-hot encoding converts categorical features into binary columns, which linear regression requires because it accepts only numerical input. Option A is wrong because scaling is helpful but not essential; the categorical values must be encoded before the model can be fit at all. Option D is wrong because PCA reduces dimensionality but is optional.

Option B is wrong because removing correlated features is a form of feature selection and is not essential for every model.
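
A minimal sketch of one-hot encoding, written without any library so the mechanics are visible. The neighborhood values are hypothetical.

```python
def one_hot(values):
    """Return (categories, rows): one 0/1 column per distinct category."""
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    rows = []
    for v in values:
        row = [0] * len(categories)
        row[index[v]] = 1  # mark the column that matches this value
        rows.append(row)
    return categories, rows

neighborhoods = ["Downtown", "Suburb", "Downtown", "Rural"]  # hypothetical values
categories, encoded = one_hot(neighborhoods)
# categories -> ['Downtown', 'Rural', 'Suburb']
# encoded[0] -> [1, 0, 0]
```

When the regression includes an intercept, one of the resulting columns is often dropped to avoid perfectly collinear inputs.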

Full explanation →

61

Multi-Selecthard

A data engineering team uses AWS Glue to run ETL jobs. They notice that jobs are taking longer to complete as data volume grows. They want to optimize performance without increasing cost significantly. Which THREE strategies should they consider?

Select 3 answers

A.Remove partitioning from the output

B.Partition the input data in S3

C.Use Amazon EMR instead of Glue

D.Convert input data to columnar format (e.g., Parquet)

E.Increase the number of DPUs (workers)

AnswersB, D, E

Enables parallel processing.

Why this answer

Partitioning the input data allows Glue to process in parallel. Using columnar formats like Parquet reduces I/O. Increasing the number of DPUs (workers) improves parallelism but increases cost; however, it can be cost-effective if job duration decreases significantly.

Removing partitions would hurt performance. Switching to Amazon EMR is not necessary.

Full explanation →

62

MCQmedium

A company is using AWS Glue ETL jobs to process data stored in Amazon S3. The jobs currently run sequentially and take too long. The data engineer wants to reduce job duration without rewriting the code. Which action is most effective?

A.Change the underlying EC2 instance type to a compute-optimized instance

B.Increase the number of DPUs (Data Processing Units) for the job

C.Convert the data from CSV to Parquet format

D.Enable job bookmarks to skip already processed data

AnswerB

More DPUs allow parallel execution, reducing job duration.

Why this answer

Increasing the number of DPUs (Data Processing Units) allocated to the Glue job can parallelize the processing and reduce runtime without code changes. Option A (changing to a compute-optimized instance type) is not applicable because Glue uses DPUs, not EC2 instances. Option C (using a different data format) may help but is not a direct solution for parallelization.

Option D (enabling job bookmarks) helps with incremental processing but does not speed up the existing job.

Full explanation →

63

MCQeasy

A data analyst is investigating a dataset where the target variable is binary (0/1). The analyst wants to check for multicollinearity among the numerical features. Which statistical measure should the analyst use?

A.Variance Inflation Factor (VIF).

B.Mutual information between features and target.

C.Chi-square test of independence.

D.Pearson correlation coefficient between each pair of features.

AnswerA

VIF measures how much a feature is explained by other features.

Why this answer

Option A is correct because Variance Inflation Factor (VIF) quantifies how much a feature is correlated with other features. Option D is wrong because Pearson correlation only measures pairwise linear relationships, not multicollinearity among multiple features. Option C is wrong because chi-square is for categorical variables.

Option B is wrong because mutual information measures dependence but does not specifically detect multicollinearity.
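
The VIF formula itself is short. For feature j it is 1 / (1 - R_j^2), where R_j^2 comes from regressing feature j on all the other features. The sketch below uses the two-feature special case, in which R^2 is just the squared pairwise correlation; the value 0.95 is an illustrative input.

```python
def vif_from_r_squared(r_squared: float) -> float:
    """VIF_j = 1 / (1 - R_j^2)."""
    return 1.0 / (1.0 - r_squared)

# With only two features, R^2 equals the squared Pearson correlation.
r = 0.95  # illustrative correlation between the two features
vif = vif_from_r_squared(r ** 2)
print(round(vif, 2))  # prints 10.26
```

With more than two features, R_j^2 can be high even when no single pairwise correlation is, which is the case pairwise Pearson coefficients miss.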

Full explanation →

64

MCQmedium

A data scientist is performing EDA on a dataset with many features. They suspect some features are redundant due to high pairwise correlations. Which technique can help identify groups of correlated features?

A.Use t-SNE to visualize feature relationships

B.Apply PCA and examine the loadings

C.Compute mutual information between each feature and the target

D.Use chi-square test for each pair

E.Create a correlation matrix and visualize with a heatmap

AnswerE

A correlation matrix heatmap clearly shows correlated feature groups.

Why this answer

A correlation matrix with a heatmap visualizes pairwise correlations and helps identify groups of correlated features. Option B is wrong because PCA reduces dimensionality but does not show feature groups directly. Option C is wrong because mutual information measures dependency but not specifically linear correlation.

Option D is wrong because chi-square test is for categorical associations. Option A is wrong because t-SNE is for visualization of high-dimensional data, not for correlation analysis.

Full explanation →

65

Drag & Dropmedium

Drag and drop the steps to use Amazon SageMaker Feature Store for feature engineering in the correct order.

Why this order

Using Feature Store involves, in order: defining a feature group, ingesting features, querying them, training on them, and maintaining them.

Full explanation →

66

MCQhard

A financial services company is building a fraud detection model using Amazon SageMaker. The dataset has 10 million transactions, with 0.1% fraudulent. They train an XGBoost model with default hyperparameters. The model achieves 99.9% accuracy on the test set, but only catches 10% of actual fraud cases. The company wants to maximize the number of fraud cases caught while keeping the false positive rate below 5%. The data scientist has already tried adjusting the class weights and threshold, but the recall is still low. What should the data scientist do next?

A.Collect more data, especially fraudulent transactions, to balance the dataset

B.Use a different algorithm such as a balanced random forest or SMOTE with XGBoost

C.Apply PCA to reduce the number of features and prevent overfitting

D.Use a larger instance type to train for more epochs

AnswerB

Balanced random forest and SMOTE oversampling are designed to handle imbalanced datasets.

Why this answer

Option B is correct because the model is underfitting the minority class; XGBoost with default settings may not handle extreme imbalance well. Using a specialized algorithm like balanced random forest, or SMOTE oversampling combined with XGBoost, can improve recall. Option A (more data) may not help if the new data is also imbalanced.

Option C (PCA) reduces dimensionality but not imbalance. Option D (larger instance) does not improve model performance.

Full explanation →

67

MCQmedium

A data engineering team needs to ingest streaming data from thousands of IoT devices into Amazon S3 for near-real-time analytics. The solution must handle data that arrives in bursts and must be able to reprocess failed records automatically. Which combination of AWS services should the team use?

A.AWS Glue with Amazon S3

B.Amazon SQS with AWS Lambda

C.Amazon Kinesis Data Streams with AWS Lambda

D.Amazon DynamoDB Streams with AWS Lambda

AnswerC

Kinesis Data Streams can ingest bursty streaming data and retain it for replay; Lambda can process and load to S3.

Why this answer

Option C is correct because Amazon Kinesis Data Streams can ingest high-throughput streaming data and retain it for up to 365 days, allowing reprocessing. AWS Lambda can be used to transform and load the data into S3. Option B is wrong because Amazon SQS is not optimized for streaming ingestion and lacks replay capability.

Option A is wrong because AWS Glue is primarily an ETL service rather than a streaming ingestion service. Option D is wrong because Amazon DynamoDB Streams are tied to DynamoDB table changes, not direct IoT ingestion.

Full explanation →

68

MCQmedium

A company needs to perform complex transformations on large datasets stored in Amazon S3 using Apache Spark. They want to minimize operational overhead. Which AWS service should they use?

A.Amazon EMR

B.Amazon EC2 with manually configured Spark

C.Amazon Athena

D.AWS Glue

AnswerA

EMR provides managed Spark clusters for complex transformations.

Why this answer

Amazon EMR with Spark is a managed service that reduces operational overhead. AWS Glue is for simpler ETL, not complex Spark transformations. EC2 requires manual setup.

Athena is SQL-based, not Spark.

Full explanation →

69

MCQhard

A company uses AWS Glue crawlers to populate the AWS Glue Data Catalog from Amazon S3. The data is partitioned by year/month/day/hour. The crawler runs every hour and adds new partitions. However, the data engineer notices that the crawler is taking longer to run as the number of partitions grows, and sometimes it misses new partitions. What is the most cost-effective and reliable way to address this?

A.Enable the crawler's partition index feature.

B.Manually add new partitions using ALTER TABLE ADD PARTITION in Athena.

C.Use the Athena MSCK REPAIR TABLE command after the crawler runs.

D.Increase the crawler's schedule to run every 30 minutes.

AnswerA

Partition indexes allow the crawler to efficiently discover new partitions without scanning the entire dataset.

Why this answer

Option A is correct because enabling the crawler's partition index feature allows Glue to quickly find new partitions without re-scanning the entire table. Option D is wrong because running the crawler more often will not help if it is already missing partitions due to scanning overhead. Option B is wrong because manually adding partitions is error-prone and does not scale.

Option C is wrong because MSCK REPAIR TABLE is a manual Athena command that must be run after data arrives; it is not automated.

Full explanation →

70

MCQmedium

A company runs a daily ETL job that reads data from Amazon RDS, transforms it using AWS Glue, and writes the results to Amazon S3. The job started failing yesterday with the error: 'Rate exceeded'. What is the most likely cause and solution?

A.The Glue job is using too many DPUs; reduce the number of DPUs

B.The RDS database is overwhelmed by the number of connections; reduce the Glue job's parallelism or increase RDS instance size

C.The S3 bucket has reached its request rate limit; request a limit increase

D.Enable job bookmarks in the Glue job to process only new data

AnswerB

Rate exceeded errors often come from RDS when connection or IO limits are reached.

Why this answer

The 'Rate exceeded' error typically indicates that the job is exceeding the RDS database's maximum connections or IOPS limits. The best solution is to reduce the parallelism or increase the database capacity. Option C (requesting a higher S3 request limit) does not address the RDS issue.

Option A misattributes the error to the DPU count itself; the limit being hit is on the database side, and adding DPUs would only exacerbate it. Option D (enabling Glue job bookmarks) does not solve rate limiting.

Full explanation →

71

MCQmedium

A company uses Amazon Kinesis Data Streams to collect IoT sensor data. The stream has 4 shards. A consumer application reads from the stream using the Kinesis Client Library (KCL). The application processes records and stores them in Amazon DynamoDB. Recently, the data volume has increased, and the consumer is falling behind. Which action should the team take to increase the processing throughput?

A.Deploy additional consumer instances using the same application name.

B.Increase the write capacity of the DynamoDB table.

C.Increase the data retention period of the stream to 7 days.

D.Increase the number of shards in the Kinesis stream.

AnswerD

More shards provide more read throughput and allow more parallel consumers.

Why this answer

Option D is correct because increasing the number of shards increases the stream's throughput and allows more concurrent consumers. Option C is wrong because increasing the retention period does not affect throughput. Option A is wrong because KCL assigns each shard to a single worker, so once there is one worker per shard, additional workers sit idle unless shards are added.

Option B is wrong because increasing DynamoDB write capacity may reduce throttling but does not increase the consumer's reading throughput.

Full explanation →

72

MCQeasy

A data scientist is using Amazon SageMaker to train a model. The training job is using a large dataset stored in S3. Which data input mode provides the FASTEST data loading for training?

A.FastFile mode

B.Augmented manifest file

C.File mode

D.Pipe mode

AnswerD

Pipe mode streams data directly from S3.

Why this answer

Pipe mode is the expected answer because it streams data directly from S3 to the training algorithm via a FIFO pipe, bypassing disk writes. This avoids downloading the full dataset to local storage first, so training can begin as soon as data starts streaming, which matters most for large datasets.

Exam trap

The trap here is that candidates often confuse 'FastFile mode' (which also streams from S3 but exposes the data through a file-system interface) with 'Pipe mode' (which streams through a FIFO pipe), or they mistakenly think 'Augmented manifest file' is a data loading mode rather than a metadata file format.

How to eliminate wrong answers

Option A is wrong for this question because FastFile mode, although it avoids the up-front download by streaming S3 objects on demand behind a file-system interface, is not the option the question treats as fastest; its throughput depends on the access pattern and it tends to work best with sequential reads. Option B is wrong because an augmented manifest file is a metadata format for labeling jobs and data sources, not a data loading mode; it does not affect the speed of data transfer during training. Option C is wrong because File mode downloads the entire dataset from S3 to the training instance's storage volume before training starts, incurring significant download time and storage overhead, making it slower than streaming approaches.

Full explanation →

73

Multi-Selecteasy

A data analyst is performing EDA on a tabular dataset with 500 features. The goal is to reduce dimensionality before modeling. Which TWO techniques are appropriate for this task?

Select 2 answers

A.t-distributed Stochastic Neighbor Embedding (t-SNE).

B.Principal Component Analysis (PCA).

C.k-fold cross-validation to assess model performance.

D.Chi-square test for independence between features.

E.One-hot encoding of categorical variables.

AnswersA, B

t-SNE reduces dimensions for visualization.

Why this answer

PCA and t-SNE are both dimensionality reduction techniques. PCA is linear, t-SNE is nonlinear. Option D (Chi-square test) is for feature selection with categorical targets, not dimensionality reduction.

Option C (cross-validation) is for model evaluation. Option E (one-hot encoding) expands features.

Full explanation →

74

MCQmedium

A machine learning team is using SageMaker to train a model. They want to ensure that the training data is encrypted at rest in the S3 bucket and that the data is also encrypted during transit. Which configuration should they use?

A.Use client-side encryption and transfer data via HTTP

B.Use SSE-S3 encryption on the S3 bucket and enforce HTTPS

C.Use SSE-KMS encryption on the S3 bucket and disable HTTP

D.Use SSE-C encryption on the S3 bucket and HTTPS

E.Use no encryption on S3 but use HTTPS

AnswerB

SSE-S3 encrypts at rest; HTTPS encrypts in transit.

Why this answer

Option B is correct because SSE-S3 encrypts data at rest, and HTTPS ensures encryption in transit. Option C (SSE-KMS) also works but requires KMS keys. Option A (client-side encryption) is not managed by SageMaker and sends data over HTTP, which is not an encrypted channel.

Option D (SSE-C) requires customer-provided keys. Option E leaves the data unencrypted at rest.
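
One common way to enforce HTTPS is a bucket policy that denies any request not sent over TLS, using the `aws:SecureTransport` condition key. The sketch below only builds the policy document as JSON; the bucket name is hypothetical and nothing is applied to AWS.

```python
import json

BUCKET = "example-training-bucket"  # hypothetical bucket name

# Deny every S3 action on the bucket when the request is not made over TLS.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DenyInsecureTransport",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:*",
            "Resource": [
                f"arn:aws:s3:::{BUCKET}",
                f"arn:aws:s3:::{BUCKET}/*",
            ],
            "Condition": {"Bool": {"aws:SecureTransport": "false"}},
        }
    ],
}

policy_json = json.dumps(policy, indent=2)
```

Encryption at rest is configured separately, for example as the bucket's default encryption setting.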

Full explanation →

75

MCQhard

A company is building a machine learning model to detect anomalies in industrial sensor data. The data is time-series with seasonal patterns. The data scientist wants to use Amazon SageMaker to train a model. Which algorithm is most suitable for this task?

A.Random Cut Forest (RCF)

B.K-Means

C.DeepAR

D.XGBoost

AnswerA

RCF is a SageMaker built-in algorithm for anomaly detection, suitable for time-series data.

Why this answer

Random Cut Forest (RCF) is the most suitable algorithm because it is designed for unsupervised anomaly detection on streaming and time-series data. It works by constructing an ensemble of random trees that isolate anomalies based on how quickly a data point can be separated from the rest, making it effective for detecting outliers in sensor data with seasonal patterns without requiring labeled training data.

Exam trap

The trap here is that candidates often confuse unsupervised anomaly detection with supervised forecasting or classification, leading them to pick DeepAR or XGBoost, but RCF is the only algorithm among the options specifically built for unsupervised anomaly detection on streaming data without requiring labels.

How to eliminate wrong answers

Option B (K-Means) is wrong because it is a clustering algorithm that groups data into clusters based on distance, not specifically designed for anomaly detection; it requires specifying the number of clusters (k) and does not inherently handle time-series seasonality or detect anomalies as outliers. Option C (DeepAR) is wrong because it is a supervised forecasting algorithm for time-series that predicts future values based on historical patterns, not for detecting anomalies in existing data; it requires a target variable and is not suited for unsupervised anomaly detection. Option D (XGBoost) is wrong because it is a supervised gradient boosting algorithm used for regression and classification tasks, requiring labeled data and feature engineering; it does not natively handle unsupervised anomaly detection or time-series seasonality without extensive preprocessing and labeling.

Full explanation →
