AWS Certified Machine Learning Specialty (MLS-C01): Questions 76–150

1755 questions total · 24 pages · All types, answers revealed

Page 2 of 24

76
MCQhard

A data engineer is performing exploratory data analysis on a dataset stored in Amazon S3 using AWS Glue DataBrew. The dataset contains a column 'age' with missing values. DataBrew's profile shows that the column has 5% missing values, a mean of 45, and a standard deviation of 15. Which imputation strategy should the engineer recommend to minimize bias if the missing data is Missing at Random (MAR)?

A.Replace missing values with the mean (45)
B.Remove rows with missing 'age' values
C.Replace missing values with the median
D.Use multiple imputation to generate several plausible values and combine results
AnswerD

Multiple imputation preserves the natural variability and provides valid statistical inferences under MAR.

Why this answer

Option D is correct because multiple imputation provides unbiased estimates under MAR by accounting for the uncertainty in the imputed values. Option A is wrong because mean imputation reduces variance and can bias relationships between variables. Option C is wrong because median imputation is robust to outliers but is still single imputation.

Option B is wrong because dropping rows reduces sample size and may introduce bias if missingness is related to other variables.
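The claim that mean imputation reduces variance can be checked with a small sketch. The `ages` list below is made-up illustration data, not the dataset from the question, and the snippet shows only the single-imputation problem, not multiple imputation itself.

```python
from statistics import mean, pvariance

# Hypothetical 'age' values; None marks a missing entry.
ages = [30, 45, None, 60, 25, None, 50, 65, 40, 45]

observed = [a for a in ages if a is not None]
fill = mean(observed)

# Single (mean) imputation: every gap receives the same value.
imputed = [fill if a is None else a for a in ages]

var_observed = pvariance(observed)
var_imputed = pvariance(imputed)

print(f"variance of observed values:    {var_observed:.1f}")
print(f"variance after mean imputation: {var_imputed:.1f}")
```

Because every filled-in value sits exactly at the mean, the spread shrinks while the mean stays the same. Multiple imputation avoids this by drawing several plausible values per gap and pooling the results.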

77
MCQmedium

A data scientist wants to use AWS Step Functions to orchestrate a machine learning workflow including data preprocessing, training, and evaluation. Which SageMaker integration is best suited for this purpose?

A.Implement each step as an AWS Lambda function and call Step Functions
B.Use the SageMaker SDK with Step Functions service integrations
C.Use SageMaker Pipelines to define the workflow
D.Use AWS Batch to run the steps sequentially
AnswerB

Step Functions has built-in integrations for SageMaker training, processing, and endpoints.

Why this answer

Option B is correct: the SageMaker SDK with Step Functions service integrations allows direct orchestration of SageMaker jobs. Option C (SageMaker Pipelines) is a different orchestration tool. Option A (Lambda) adds complexity.

Option D (Batch) is for non-interactive jobs.

78
Multi-Selecthard

Which THREE measures can help reduce inference latency for a deep learning model deployed on SageMaker real-time endpoints? (Select THREE.)

Select 3 answers
A.Enable SageMaker Neo to compile the model.
B.Increase the batch size for inference.
C.Use GPU instances for inference.
D.Reduce the input data size (e.g., lower resolution images).
E.Use a multi-model endpoint to share the instance.
AnswersA, C, D

Neo optimizes models for target hardware, reducing latency.

Why this answer

To reduce latency, use GPU instances, enable model compilation with SageMaker Neo, and reduce input size. Multi-model endpoints share an instance across models but add latency when switching models. Increasing batch size usually increases latency per request, although it can improve throughput.

The three correct measures are: use GPU instances, enable SageMaker Neo, and reduce input data size.

79
MCQeasy

A company uses Amazon Redshift for its data warehouse. The data engineering team notices that queries are slow and wants to improve performance without changing the schema. Which action is most likely to improve query performance?

A.Decrease the number of nodes to reduce network overhead.
B.Disable compression on all tables to reduce CPU overhead.
C.Increase the number of nodes in the cluster.
D.Change the distribution style from AUTO to EVEN.
AnswerC

Adding nodes increases parallelism and improves query performance.

Why this answer

Option C is correct because increasing the number of nodes adds compute resources, improving parallel processing. Option A is wrong because decreasing the node count reduces resources. Option B is wrong because disabling compression increases storage and I/O.

Option D is wrong because changing the distribution style would alter the schema.

80
MCQmedium

A company uses Amazon Kinesis Data Firehose to deliver streaming data to an S3 bucket. The data is JSON and must be partitioned by year, month, and day. The delivery stream is configured with a buffer interval of 60 seconds and buffer size of 5 MB. The data producer sends about 1 MB per second. The data is arriving in S3 but the partitions are not being created as expected. What is the MOST likely reason?

A.The data is encrypted with AWS KMS and Firehose cannot write to encrypted buckets.
B.The delivery stream does not have dynamic partitioning enabled with the appropriate custom prefix.
C.The buffer interval is too short for the data volume, causing incomplete records.
D.The S3 bucket has versioning enabled, which prevents partitioning.
AnswerB

Without dynamic partitioning and the correct prefix, Firehose will not partition the data by year/month/day.

Why this answer

Option B is correct because Kinesis Data Firehose requires dynamic partitioning to be explicitly enabled and configured with a custom prefix (e.g., 'year=!{timestamp:yyyy}/month=!{timestamp:MM}/day=!{timestamp:dd}/') to automatically partition data by year, month, and day. Without this setting, Firehose writes all data to a single S3 prefix, ignoring the desired partition structure.

Exam trap

The trap here is that candidates assume simply setting a prefix with date-like placeholders (e.g., 'data/year=2025/') is enough, but Firehose requires explicit dynamic partitioning to be enabled and the prefix must use the correct !{timestamp:...} syntax for automatic date-based partitioning.

How to eliminate wrong answers

Option A is wrong because Firehose can write to KMS-encrypted S3 buckets when the correct IAM permissions and KMS key policies are in place; encryption does not prevent partitioning. Option C is wrong because a 60-second buffer interval is sufficient for 1 MB/s data (60 MB per interval), and Firehose buffers complete records, not partial ones. Option D is wrong because S3 versioning does not affect Firehose's ability to write partitioned data; versioning simply maintains multiple versions of objects.
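To see what the example prefix is meant to produce, the sketch below expands the `!{timestamp:...}` expressions locally. It is a toy simulation under stated assumptions, not Firehose code: `expand_prefix` is a hypothetical helper, and it handles only the three date patterns that appear in the explanation above.

```python
import re
from datetime import datetime, timezone

# Map the date patterns from the example prefix to strftime codes.
# Only yyyy, MM, and dd are handled in this sketch.
_PATTERNS = {"yyyy": "%Y", "MM": "%m", "dd": "%d"}

def expand_prefix(prefix, when):
    """Replace each !{timestamp:<pattern>} expression with the formatted date part."""
    def repl(match):
        return when.strftime(_PATTERNS[match.group(1)])
    return re.sub(r"!\{timestamp:(yyyy|MM|dd)\}", repl, prefix)

prefix = "year=!{timestamp:yyyy}/month=!{timestamp:MM}/day=!{timestamp:dd}/"
arrival = datetime(2025, 3, 7, tzinfo=timezone.utc)
print(expand_prefix(prefix, arrival))  # year=2025/month=03/day=07/
```

A literal prefix such as 'data/year=2025/' contains no expressions, so the same function returns it unchanged, which mirrors the exam trap described above.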

81
MCQmedium

A data scientist is using SageMaker Ground Truth to create a labeled dataset for object detection. After the labeling job completes, the scientist notices that the output manifest file contains incorrect labels. What is the most efficient way to correct these labels?

A.Create an incremental labeling job that includes only the mislabeled items.
B.Delete the labeling job and start over with a different set of workers.
C.Use the SageMaker console to edit the incorrect labels directly in the manifest file.
D.Create a new labeling job with the same dataset and manually verify all labels.
AnswerA

Efficiently corrects only errors.

Why this answer

Option A is correct because Ground Truth allows creating an incremental labeling job to correct only the mislabeled items, avoiding re-labeling all data. Option D is wrong because it would require re-labeling everything, which is inefficient. Option C is wrong because it does not address the labeling errors.

Option B is wrong because it would lose the correctly labeled data.

82
MCQhard

A financial services company is developing a fraud detection model using a highly imbalanced dataset where fraudulent transactions are only 0.1% of the data. The data scientist has trained a gradient boosting model that achieves 99.9% accuracy but only detects 20% of actual fraud cases. The business requirement is to detect at least 80% of fraud while minimizing false positives. The data scientist has access to SageMaker and can use any built-in algorithm or custom script. Which approach should the data scientist take to meet the business requirement?

A.Keep the model but adjust the classification threshold to increase recall.
B.Use random under-sampling of the majority class to balance the dataset and retrain the model.
C.Use Amazon SageMaker Random Cut Forest (RCF) algorithm for anomaly detection.
D.Use random oversampling of the minority class to balance the dataset and retrain the model.
AnswerC

RCF is designed for anomaly detection on highly imbalanced data and can detect fraud effectively.

Why this answer

Amazon SageMaker Random Cut Forest (RCF) is an unsupervised anomaly detection algorithm that is well-suited for highly imbalanced datasets like this one (0.1% fraud). Unlike supervised methods that struggle with extreme class imbalance, RCF isolates anomalies by measuring how many random cuts are needed to separate a point from the rest of the data, making it effective at detecting rare fraud cases without requiring balanced training data. This approach can meet the 80% fraud detection requirement while minimizing false positives by tuning the anomaly score threshold.

Exam trap

The trap here is that candidates assume a supervised model with threshold tuning (Option A) can solve the imbalance, but they overlook that the model's learned decision boundary is fundamentally biased, and unsupervised anomaly detection like RCF is specifically designed for such extreme imbalance scenarios.

How to eliminate wrong answers

Option A is wrong because simply adjusting the classification threshold on the existing gradient boosting model will increase recall but will also dramatically increase false positives, as the model was trained on imbalanced data and its decision boundary is already skewed toward the majority class. Option B is wrong because random under-sampling of the majority class discards a large amount of legitimate transaction data, which can lead to loss of valuable patterns and increase false positive rates, and it does not guarantee achieving 80% recall with minimal false positives. Option D is wrong because random oversampling of the minority class duplicates existing fraud examples, which can cause overfitting to those specific instances and reduce generalization, and it still relies on a supervised model that may not effectively learn the rare fraud patterns.

83
Multi-Selecteasy

Which TWO are common steps in exploratory data analysis?

Select 2 answers
A.Training a machine learning model.
B.Checking for missing values.
C.Visualizing distributions of features.
D.Deploying the model to production.
E.Tuning hyperparameters.
AnswersB, C

Missing value analysis is a key step.

84
Multi-Selectmedium

Which THREE actions should be taken to ensure data security when training a model using Amazon SageMaker with data stored in Amazon S3? (Choose 3.)

Select 3 answers
A.Use a VPC to isolate the SageMaker training job
B.Apply an S3 bucket policy that denies all access except from the SageMaker service
C.Attach an EBS volume for storing training data
D.Enable server-side encryption on the S3 bucket
E.Use an IAM role with least privilege permissions
AnswersA, D, E

Network isolation.

Why this answer

Encrypt data at rest (Option D), isolate the training job in a VPC (Option A), and use an IAM role with least-privilege permissions (Option E). Option C is wrong because SageMaker reads training data from S3, not from an attached EBS volume. Option B is wrong because a training job accesses S3 through its IAM execution role rather than as the SageMaker service itself, so access is controlled through that role.

85
MCQmedium

A data engineer is troubleshooting an AWS Glue job that reads from an S3 bucket and writes to another S3 bucket. The job fails with an 'Access Denied' error when trying to write to the output bucket. The IAM policy attached to the Glue service role is shown. What is the MOST likely cause of the failure?

A.The user who runs the job does not have S3 permissions
B.The Glue job role does not have permissions to start a job run
C.The output bucket is not listed in the Resource of the IAM policy
D.The S3 bucket policy denies access to the Glue service
AnswerC

The policy only allows PutObject on example-bucket, not the output bucket.

Why this answer

The policy only allows s3:GetObject and s3:PutObject on 'example-bucket', but the job is writing to a different bucket. The Glue job needs permissions on the output bucket as well. Option B is wrong because the job role has permissions for Glue actions.

Option D is wrong because the failure is explained by the IAM policy, whose S3 permissions cover only a specific bucket. Option A is wrong because the job role is separate from the user's role.

86
Multi-Selecteasy

A machine learning pipeline uses SageMaker Processing jobs for feature engineering. Which TWO are benefits of using SageMaker Processing over running a custom script on an EC2 instance?

Select 2 answers
A.Automatically manages the compute resources
B.Integrates with SageMaker Experiments for tracking
C.Provides a built-in VPC for network isolation
D.Allows use of custom Docker images from any registry
E.Supports multiple programming languages
AnswersA, B

SageMaker provisions and tears down resources.

Why this answer

Options A and B are correct. Option A: SageMaker Processing manages the infrastructure. Option B: it integrates with SageMaker Experiments.

Option C is wrong because an EC2 instance can also be placed in a VPC. Option D is wrong because an EC2 instance can also run custom Docker images. Option E is wrong because EC2 can also use various runtimes.

87
MCQmedium

A data scientist is assigned an IAM policy as shown. The data scientist attempts to create a SageMaker endpoint to deploy a model, but the request fails. What is the most likely reason?

A.The data scientist does not have permission to upload the model to S3.
B.The data scientist does not have permission to create a training job.
C.The data scientist does not have permission to create an endpoint.
D.The data scientist does not have permission to pass roles.
AnswerC

The policy does not grant sagemaker:CreateEndpoint.

Why this answer

The IAM policy shown does not include the `sagemaker:CreateEndpoint` action, which is required to create a SageMaker endpoint. Even if the data scientist has permissions for other SageMaker actions like `CreateModel` or `CreateEndpointConfig`, the absence of `CreateEndpoint` from the policy causes the request to fail with an access denied error, because any action that is not explicitly allowed is denied by default.

Exam trap

The trap here is that candidates assume that having permissions for model creation and configuration automatically implies permission for endpoint creation, but AWS requires each SageMaker API action to be explicitly listed in the IAM policy.

How to eliminate wrong answers

Option A is wrong because the policy includes `s3:PutObject` and `s3:GetObject` actions on the specified S3 bucket, so the data scientist has permission to upload the model to S3. Option B is wrong because the policy includes `sagemaker:CreateTrainingJob`, so the data scientist has permission to create a training job. Option D is wrong because the policy includes `iam:PassRole` on the specified role ARN, so the data scientist has permission to pass roles.
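The exhibit is not reproduced here, so the sketch below uses a hypothetical policy that matches the description above: it allows the S3, training, model, and PassRole actions but omits `sagemaker:CreateEndpoint`. The `is_allowed` function is a deliberately simplified stand-in for IAM evaluation (exact action names only, no wildcards, resources, or conditions).

```python
def is_allowed(statements, action):
    """Simplified evaluation: explicit Deny wins, then Allow, else implicit deny."""
    allowed = False  # nothing is permitted until a statement allows it
    for stmt in statements:
        if action in stmt["Action"]:
            if stmt["Effect"] == "Deny":
                return False
            if stmt["Effect"] == "Allow":
                allowed = True
    return allowed

# Hypothetical policy, assumed for illustration only.
policy = [
    {"Effect": "Allow", "Action": ["s3:GetObject", "s3:PutObject"]},
    {"Effect": "Allow", "Action": ["sagemaker:CreateTrainingJob",
                                   "sagemaker:CreateModel",
                                   "sagemaker:CreateEndpointConfig"]},
    {"Effect": "Allow", "Action": ["iam:PassRole"]},
]

for action in ("sagemaker:CreateModel", "sagemaker:CreateEndpoint"):
    print(action, "->", "allowed" if is_allowed(policy, action) else "denied")
```

In this toy model the endpoint request is denied simply because no statement allows it, which is the behavior the explanation relies on.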

88
Multi-Selectmedium

A data scientist needs to deploy a model with a custom inference container. Which THREE requirements must the container meet for SageMaker hosting?

Select 3 answers
A.Provide a training script at /opt/ml/input/data
B.Use the SageMaker Python SDK to load the model
C.Implement a /ping endpoint for health checks
D.Serve on port 8080
E.Implement a /invocations endpoint for predictions
AnswersC, D, E

SageMaker uses /ping to check container health.

Why this answer

SageMaker requires containers to implement the /invocations and /ping endpoints, and to serve on port 8080. Options C, D, and E are correct. Option A is for training, and Option B is optional.
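A minimal sketch of the three requirements, using only the Python standard library. The "model" here is a placeholder that counts input fields, and the names `route`, `Handler`, and `serve` are illustrative assumptions; a real container would load model artifacts and typically use a production web server.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def route(method, path, body=b""):
    """Return (status, payload) for a request; kept pure so it is easy to test."""
    if method == "GET" and path == "/ping":
        return 200, b""  # health check: a 200 response signals the container is up
    if method == "POST" and path == "/invocations":
        features = json.loads(body or b"{}")
        # Placeholder prediction: report how many input fields were received.
        return 200, json.dumps({"n_features": len(features)}).encode()
    return 404, b""

class Handler(BaseHTTPRequestHandler):
    def _respond(self, status, payload):
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

    def do_GET(self):
        self._respond(*route("GET", self.path))

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        self._respond(*route("POST", self.path, self.rfile.read(length)))

def serve(port=8080):
    """Listen on port 8080, the port named in the explanation above."""
    HTTPServer(("0.0.0.0", port), Handler).serve_forever()
```

Calling `serve()` starts the server; the routing logic is separated from the HTTP handler so it can be exercised without opening a socket.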

89
Multi-Selecthard

A data scientist is using SageMaker to train a model. The training job needs to access data in an S3 bucket in a different AWS account. The data scientist has set up proper S3 bucket policies and IAM roles. Which THREE steps are necessary to allow SageMaker to access the cross-account S3 bucket? (Select THREE.)

Select 3 answers
A.Configure the S3 bucket policy to grant access to the SageMaker execution role ARN from the training account
B.Create a VPC endpoint for S3 in the training account
C.Create an IAM role in the data account with permissions to read from the S3 bucket
D.Use an AWS KMS key to encrypt the data in transit
E.Configure the SageMaker execution role in the training account to assume the IAM role in the data account
AnswersA, C, E

Bucket policy must allow cross-account access.

Why this answer

Options A, C, and E are correct. A: The bucket policy must allow cross-account access. C: The role in the data account must have S3 read access.

E: The SageMaker execution role must be able to assume the cross-account role. B (VPC endpoint) is not required for cross-account access. D (KMS) applies only if the data is encrypted with KMS.

90
MCQhard

A data scientist needs to run a one-time training job on a 5 TB dataset stored in Amazon S3. The training algorithm requires random access to individual records. Which SageMaker input mode and data format combination would be MOST appropriate?

A.Use Pipe mode with Parquet format
B.Use Pipe mode with RecordIO-Protobuf format
C.Use File mode with RecordIO-Protobuf format
D.Use Pipe mode with CSV format
AnswerC

File mode downloads data to disk, allowing random access; Protobuf is efficient.

Why this answer

Option C is correct because File mode downloads the entire dataset to the instance's local disk, enabling random access to any record.

Options A, B, and D are wrong because they all use Pipe mode, which streams data sequentially and does not support random access, regardless of the data format.

91
MCQmedium

A data scientist is analyzing a time series dataset of daily website traffic. The scientist notices a strong weekly seasonality. To better understand the underlying patterns, which decomposition method should the scientist use to separate the trend, seasonal, and residual components?

A.Additive decomposition using moving averages.
B.Use STL (Seasonal and Trend decomposition using Loess).
C.Fit an ARIMA model and examine residuals.
D.Apply an ETS (Error, Trend, Seasonal) model.
AnswerB

STL is robust and flexible for any seasonality.

Why this answer

Option B is correct because STL decomposition is robust to outliers and can handle any seasonality period, making it suitable for daily data with weekly seasonality. Option A is wrong because classical decomposition assumes additive seasonality and is less robust. Option C is wrong because ARIMA is a forecasting model, not a decomposition method.

Option D is wrong because ETS is an exponential smoothing model, not primarily for decomposition.

92
MCQmedium

Refer to the exhibit. A data engineer is creating an IAM policy for an AWS Glue ETL job that reads encrypted objects from an S3 bucket, transforms them, and writes the results back to the same bucket. The bucket uses SSE-KMS encryption with the KMS key specified. The ETL job is failing with an "Access Denied" error when trying to write data. What is the likely cause?

A.The policy is missing the kms:Decrypt permission
B.The policy is missing the s3:PutObjectAcl permission
C.The policy is missing the s3:PutObject permission
D.The policy is missing the kms:Encrypt permission
AnswerD

Writing with SSE-KMS requires kms:Encrypt.

Why this answer

Option D is correct because the policy grants s3:PutObject, which is needed to write, and its KMS permissions include kms:Decrypt and kms:GenerateDataKey, but the job role must also have kms:Encrypt to write encrypted objects. Option C is wrong because the policy includes s3:PutObject.

Option A is wrong because the policy already includes kms:Decrypt, which is needed for reading. Option B is wrong because s3:PutObjectAcl is not required to write objects.

93
Multi-Selecteasy

Which TWO approaches are appropriate for handling missing categorical data during exploratory data analysis? (Choose two.)

Select 2 answers
A.Use one-hot encoding to represent missingness as a binary feature.
B.Impute with the mode (most frequent) of the column.
C.Treat missing values as a separate 'Unknown' category.
D.Drop all rows with missing values in that column.
E.Impute missing values with the mean of the column.
AnswersB, C

Mode is a simple imputation for categorical data.

Why this answer

Options B and C are correct. B: Using mode imputation is a simple baseline. C: Creating an 'Unknown' category preserves information about missingness.

A: One-hot encoding requires values first; D: Dropping rows may lose data; E: Mean is for numerical data.

94
MCQhard

A data scientist is exploring a dataset with 1,000 features and only 200 samples. The goal is to build a binary classifier. Which technique should be used first during exploratory data analysis to reduce dimensionality and avoid overfitting?

A.Compute pairwise correlations and remove highly correlated features.
B.Apply L1 regularization (Lasso) to select features.
C.Use t-SNE to visualize clusters and reduce dimensions.
D.Use principal component analysis (PCA) to reduce dimensions.
AnswerD

PCA reduces dimensionality while preserving variance.

Why this answer

Option D is correct because PCA is an unsupervised dimensionality reduction technique suitable for high-dimensional data with few samples. Option A is wrong because feature selection based on correlation may miss interactions. Option B is wrong because L1 regularization is model-dependent and not part of EDA.

Option C is wrong because t-SNE is for visualization, not feature reduction for modeling.

95
MCQmedium

A company is building a recommendation system for an e-commerce platform. They have user-item interaction data and want to use matrix factorization. However, the dataset is sparse (99% missing interactions). Which approach should the data scientist take to train the model effectively?

A.Impute missing values with zeros and use singular value decomposition (SVD)
B.Use alternating least squares (ALS) with implicit feedback and assign lower confidence to unobserved interactions
C.Remove all users and items with fewer than 10 interactions to reduce sparsity
D.Use item-based collaborative filtering with cosine similarity
AnswerB

ALS with implicit feedback naturally handles sparsity by weighting unobserved interactions.

Why this answer

Option B is correct because Alternating Least Squares (ALS) with implicit feedback is specifically designed to handle sparse implicit feedback datasets by assigning lower confidence to unobserved interactions (e.g., confidence = 1 + alpha * r_ui, where r_ui is 0 for unobserved). This avoids the pitfalls of treating missing values as zeros (which distorts the factorization) and scales well to 99% sparsity by leveraging weighted regularization.

Exam trap

The trap here is that candidates assume missing values must be imputed (e.g., with zeros) or that reducing sparsity by filtering is necessary, but ALS with implicit feedback is purpose-built for sparse implicit data without imputation.

How to eliminate wrong answers

Option A is wrong because imputing missing values with zeros and applying SVD treats all unobserved interactions as negative signals, which introduces bias and distorts the latent factor model; SVD also requires a dense matrix and fails on sparse data due to overfitting and computational inefficiency. Option C is wrong because removing users and items with fewer than 10 interactions reduces the dataset size but does not address the fundamental sparsity problem—it discards valuable cold-start data and can still leave a sparse matrix, while matrix factorization methods like ALS handle sparsity natively. Option D is wrong because item-based collaborative filtering with cosine similarity relies on pairwise item similarity computed from co-occurrence patterns, which becomes unreliable when 99% of interactions are missing (cosine similarity over sparse vectors yields unstable or zero similarity scores).
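The confidence weighting quoted above, confidence = 1 + alpha * r_ui, can be sketched in a few lines. The value `alpha=40.0` and the interaction counts are assumptions chosen for illustration; the snippet shows only the weighting step, not the ALS solver.

```python
def confidence(r_ui, alpha=40.0):
    """c_ui = 1 + alpha * r_ui; unobserved pairs (r_ui = 0) get the minimum weight, 1."""
    return 1.0 + alpha * r_ui

# Hypothetical interaction counts for one user across four items.
interactions = [0, 3, 0, 1]
weights = [confidence(r) for r in interactions]
print(weights)  # [1.0, 121.0, 1.0, 41.0]
```

Unobserved pairs still contribute to the loss, but with a weight of 1, so observed interactions dominate the fit instead of being drowned out by the 99% of missing entries.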

96
MCQmedium

A company is training a deep learning model on Amazon SageMaker. The training job is taking a long time and the data scientist suspects that the model is overfitting. Which of the following actions can help reduce overfitting and improve generalization?

A.Increase the batch size used during training.
B.Add dropout layers to the model architecture.
C.Increase the number of training epochs.
D.Remove regularization terms from the loss function.
AnswerB

Dropout is a regularization technique that helps prevent overfitting by randomly dropping neurons during training.

Why this answer

Adding dropout layers is a regularization technique that randomly drops neurons during training to prevent overfitting. Increasing the number of epochs (Option C) would likely worsen overfitting. Using a larger batch size (Option A) can sometimes help generalization but is not a direct regularization technique.

Removing regularization (Option D) would increase overfitting.
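As a rough illustration of what a dropout layer does, here is a framework-free sketch of inverted dropout, the variant most deep learning libraries implement. The function name and the 0.5 rate are assumptions for the example.

```python
import random

def dropout(activations, rate, training=True, rng=random):
    """Inverted dropout: zero each unit with probability `rate`, rescale the survivors."""
    if not training or rate == 0.0:
        return list(activations)  # inference: activations pass through unchanged
    keep = 1.0 - rate
    # Dividing by `keep` preserves the expected activation value.
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)  # seeded only to make the demo repeatable
hidden = [1.0] * 10
print(dropout(hidden, rate=0.5, rng=rng))          # mix of 0.0 and 2.0
print(dropout(hidden, rate=0.5, training=False))   # unchanged at inference time
```

Because a different random subset of units is silenced on every training step, the network cannot rely on any single neuron, which is the regularizing effect described above.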

97
MCQhard

A machine learning engineer is deploying a model using Amazon SageMaker. The model is a PyTorch model that performs real-time inference with low latency requirements. The engineer wants to use automatic scaling based on the number of concurrent requests. Which SageMaker feature should be used to achieve this?

A.Create an AWS Auto Scaling group for the SageMaker endpoint.
B.Enable Elastic Load Balancing for the endpoint.
C.Use Amazon SageMaker automatic scaling with a target tracking scaling policy.
D.Deploy the model behind Amazon API Gateway with a Lambda function.
AnswerC

This scales based on invocations per instance.

Why this answer

Option C is correct because SageMaker automatic scaling, using a target tracking scaling policy based on the SageMakerVariantInvocationsPerInstance metric, automatically adjusts the number of instances. Option B is wrong because you do not attach an Elastic Load Balancer to a SageMaker endpoint; the endpoint uses a built-in load balancer. Option A is wrong because SageMaker does not natively support AWS Auto Scaling groups for endpoints.

Option D is wrong because Amazon API Gateway is used for REST APIs, not for scaling SageMaker endpoints.
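A target tracking policy of this kind is registered through Application Auto Scaling. The sketch below builds the two request payloads and, separately, sends them with boto3. The endpoint name, variant name, capacity limits, and target value in the usage lines are placeholders, and `scaling_requests` and `apply_scaling` are illustrative helper names.

```python
def scaling_requests(endpoint_name, variant_name, min_instances, max_instances,
                     target_invocations):
    """Build kwargs for register_scalable_target and put_scaling_policy."""
    common = {
        "ServiceNamespace": "sagemaker",
        "ResourceId": f"endpoint/{endpoint_name}/variant/{variant_name}",
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
    }
    target = {**common, "MinCapacity": min_instances, "MaxCapacity": max_instances}
    policy = {
        **common,
        "PolicyName": f"{endpoint_name}-invocations-target",
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            "TargetValue": float(target_invocations),
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
            },
        },
    }
    return target, policy

def apply_scaling(target, policy):
    """Send both requests; requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder works without AWS access
    client = boto3.client("application-autoscaling")
    client.register_scalable_target(**target)
    client.put_scaling_policy(**policy)

# Placeholder values for illustration.
target, policy = scaling_requests("my-endpoint", "AllTraffic", 1, 4, 100)
print(policy["TargetTrackingScalingPolicyConfiguration"])
```

Keeping the payload construction separate from the API calls makes the configuration easy to inspect before anything is applied to a live endpoint.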

98
MCQeasy

A machine learning team is building a model to predict customer churn. They have a dataset with 10,000 samples and 50 features, including categorical variables with high cardinality (e.g., ZIP code). Which feature engineering technique is most appropriate to reduce dimensionality while preserving predictive information?

A.Principal Component Analysis (PCA)
B.One-hot encoding
C.Target encoding
D.Label encoding
AnswerC

Target encoding reduces dimensionality by replacing categories with target mean, preserving predictive information.

Why this answer

Target encoding replaces high-cardinality categories with the mean target value, reducing dimensionality while capturing predictive signal. Option B is wrong because one-hot encoding creates many sparse features, increasing dimensionality. Option A is wrong because PCA is applied to numerical features, not categorical.

Option D is wrong because label encoding imposes ordinality that may not exist.
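Target encoding as described above can be sketched without any libraries. The ZIP codes and churn labels are invented for the example, and falling back to the global mean for unseen categories is one common choice, assumed here rather than taken from the question.

```python
from collections import defaultdict

def fit_target_encoding(categories, targets):
    """Learn the mean target per category, plus a global mean for unseen categories."""
    sums, counts = defaultdict(float), defaultdict(int)
    for category, y in zip(categories, targets):
        sums[category] += y
        counts[category] += 1
    global_mean = sum(targets) / len(targets)
    return {c: sums[c] / counts[c] for c in sums}, global_mean

def transform(categories, mapping, global_mean):
    """Replace each category with its learned mean (or the global mean if unseen)."""
    return [mapping.get(c, global_mean) for c in categories]

# Hypothetical training data: ZIP code and whether the customer churned.
zips = ["98101", "98101", "10001", "10001", "10001", "60601"]
churned = [1, 0, 1, 1, 0, 0]

mapping, fallback = fit_target_encoding(zips, churned)
print(transform(["98101", "10001", "99999"], mapping, fallback))
```

The mapping should be fitted on training data only; computing it on rows that are later used for evaluation leaks the target into the feature.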

99
MCQmedium

A data scientist is training a model using Amazon SageMaker and wants to track hyperparameter tuning jobs, training jobs, and model metrics. The team also needs to compare experiments visually. Which AWS service should be used?

A.Amazon CloudWatch Logs
B.AWS Glue DataBrew
C.AWS Step Functions
D.Amazon SageMaker Experiments
AnswerD

SageMaker Experiments tracks and compares experiments.

Why this answer

Option D is correct because SageMaker Experiments is purpose-built for tracking and comparing ML experiments. Option B is wrong because AWS Glue DataBrew is for data preparation, not experiment tracking. Option A is wrong because CloudWatch Logs stores logs but does not provide experiment comparison.

Option C is wrong because Step Functions is for workflow orchestration.

100
Multi-Selecteasy

A data scientist is building a binary classifier and wants to evaluate model performance. Which THREE metrics are most commonly used?

Select 3 answers
A.Mean Absolute Error
B.RMSE
C.Precision
D.Recall
E.Accuracy
AnswersC, D, E

Common classification metric.

Why this answer

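Precision is a core metric for binary classifiers, measuring the proportion of true positive predictions among all positive predictions. It is especially important when the cost of false positives is high, such as in spam detection or fraud alert systems. Recall measures the proportion of actual positives that the model identifies, and accuracy measures the proportion of all predictions that are correct.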

Exam trap

AWS often tests the distinction between regression and classification metrics, and the trap here is that candidates mistakenly apply regression metrics like MAE or RMSE to binary classification problems.
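The three classification metrics follow directly from the confusion matrix counts. The counts in the usage line are made up for illustration.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute precision, recall, and accuracy from confusion matrix counts."""
    total = tp + fp + fn + tn
    return {
        "precision": tp / (tp + fp) if tp + fp else 0.0,  # of predicted positives, how many are right
        "recall": tp / (tp + fn) if tp + fn else 0.0,     # of actual positives, how many are found
        "accuracy": (tp + tn) / total if total else 0.0,  # of all predictions, how many are right
    }

# Hypothetical counts: 40 true positives, 10 false positives,
# 20 false negatives, 930 true negatives.
metrics = classification_metrics(tp=40, fp=10, fn=20, tn=930)
print(metrics)
```

With these counts accuracy is 0.97 while recall is only about 0.67, which shows why accuracy alone can look better than the model really is when positives are rare.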

101
MCQmedium

A deployed SageMaker endpoint is returning high latency. The model is a scikit-learn Random Forest. Which action is most likely to reduce latency?

A.Reduce the number of trees in the ensemble
B.Prune decision trees in the model
C.Increase the number of instances behind the endpoint
D.Switch to a GPU instance type
AnswerA

Fewer trees reduce computation time per inference.

Why this answer

Option A is correct because reducing the number of trees directly speeds up inference. Option C is wrong because increasing instance count may not reduce per-request latency. Option D is wrong because using a GPU may not help tree-based models.

Option B is wrong because pruning reduces accuracy and is not the best approach.

102
MCQeasy

A data scientist is training a linear regression model to predict house prices. The dataset includes features such as square footage, number of bedrooms, and location. After training, the model achieves an R² of 0.85 on the training set but only 0.60 on the test set. Which of the following is the MOST likely cause of this discrepancy?

A.The model is overfitting the training data
B.There is multicollinearity among the features
C.The model is underfitting the training data
D.There is data leakage between the training and test sets
AnswerA

Overfitting causes high training performance but poor generalization to test data.

Why this answer

A high R² on the training set (0.85) paired with a significantly lower R² on the test set (0.60) is a classic symptom of overfitting. The model has learned noise and specific patterns in the training data that do not generalize to unseen data, causing poor test performance. Regularization techniques like Lasso or Ridge, or reducing model complexity, would typically address this issue.

Exam trap

AWS often tests the distinction between overfitting and multicollinearity, where candidates mistakenly attribute a training-test R² gap to multicollinearity instead of recognizing it as a generalization failure.

How to eliminate wrong answers

Option B is wrong because multicollinearity inflates the variance of coefficient estimates but does not inherently cause a large gap between training and test R²; it affects interpretability and stability, not generalization performance directly. Option C is wrong because underfitting would result in low R² on both training and test sets (e.g., both below 0.60), not a high training R² with a much lower test R². Option D is wrong because data leakage would typically inflate both training and test R² artificially, making them both appear deceptively high, not creating a large discrepancy between them.

103
MCQhard

A data scientist is building a model to predict customer churn. The dataset contains categorical features with high cardinality (e.g., ZIP code, customer ID). Which encoding method is MOST suitable?

A.One-hot encoding
B.Label encoding
C.Hashing encoding
D.Target encoding
AnswerD

Target encoding captures information without expanding dimensionality.

Why this answer

Target encoding is most suitable for high-cardinality categorical features because it replaces each category with the mean of the target variable for that category, effectively capturing the predictive signal while keeping the feature space dense. This avoids the curse of dimensionality from one-hot encoding and the arbitrary ordinality of label encoding, which can mislead tree-based models.

Exam trap

The trap here is that candidates often choose one-hot encoding as the default for categorical data, failing to recognize that high cardinality makes it impractical, or they pick label encoding assuming it is safe for tree models, but it introduces false ordinality that can degrade performance.

How to eliminate wrong answers

Option A is wrong because one-hot encoding creates a binary column for each unique category, which with high cardinality (e.g., thousands of ZIP codes) leads to an extremely sparse feature matrix, causing memory issues and model overfitting. Option B is wrong because label encoding assigns arbitrary integer labels to categories, implying an ordinal relationship that does not exist, which can distort distance-based and tree-based models. Option C is wrong because hashing encoding maps categories to a fixed number of buckets via a hash function, which can cause collisions (different categories mapping to the same bucket) and loss of information, making it less reliable for churn prediction where each category's signal matters.
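
Worked example

The mean-of-target replacement described above can be sketched in a few lines. This is a minimal illustration in plain Python, not a production encoder: the function names, the toy ZIP codes, and the fallback to the global mean for unseen categories are all assumptions made for the example. In practice the mapping is usually fitted on training folds only, since fitting it on rows that are also scored leaks the target.

```python
from collections import defaultdict


def fit_target_encoding(categories, targets):
    """Map each category to the mean of the target over its rows."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for cat, y in zip(categories, targets):
        sums[cat] += y
        counts[cat] += 1
    global_mean = sum(targets) / len(targets)
    mapping = {cat: sums[cat] / counts[cat] for cat in sums}
    return mapping, global_mean


def apply_target_encoding(categories, mapping, global_mean):
    """Encode categories; unseen ones fall back to the global mean (an assumption)."""
    return [mapping.get(cat, global_mean) for cat in categories]


# Hypothetical toy data: ZIP code and churn label (1 = churned).
zip_codes = ["98101", "98101", "10001", "10001", "10001", "60601"]
churned = [1, 0, 1, 1, 0, 0]

mapping, global_mean = fit_target_encoding(zip_codes, churned)
encoded = apply_target_encoding(["98101", "10001", "99999"], mapping, global_mean)
```

Each ZIP code becomes a single numeric column regardless of how many distinct codes exist, which is the dimensionality argument the explanation relies on.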

104
MCQmedium

A machine learning team is preparing a large dataset for training. The dataset consists of 10,000 CSV files, each about 100 MB, stored in Amazon S3. The team wants to transform the data using AWS Glue ETL jobs. The transformation involves filtering rows, adding new columns, and joining with a small reference table (100 KB). The team is concerned about job performance and cost. They currently have a Glue job with 10 DPU (Data Processing Units) and it takes about 2 hours to complete. The team wants to reduce the runtime and cost. Which approach should they take?

A.Use Amazon Athena to transform the data.
B.Increase the number of DPUs to 100.
C.Use Amazon EMR with Spot Instances instead of AWS Glue.
D.Convert the CSV files to Parquet format and partition the data by a column.
AnswerD

Parquet reduces I/O and partitioning reduces data scanned.

Why this answer

Using a columnar format like Parquet and partitioning the data on a relevant column (e.g., date) can significantly reduce the amount of data scanned and improve performance. Additionally, optimizing the number of DPUs (e.g., using a larger number of DPUs for a shorter time) can reduce cost if the job is billed by DPU-hour.

105
Multi-Selectmedium

A machine learning engineer is deploying a model on Amazon SageMaker. Which TWO steps are required to create a SageMaker endpoint?

Select 2 answers
A.Create a SageMaker model
B.Submit a training job
C.Create a SageMaker pipeline
D.Create an endpoint configuration
E.Create a SageMaker notebook instance
AnswersA, D

Model must be registered first.

Why this answer

A is correct because creating a SageMaker model is the first required step to define the model artifacts, inference code, and container image that will be used for predictions. Without a model object, SageMaker has no executable artifact to deploy behind the endpoint.

Exam trap

The trap here is that candidates confuse the training job (Option B) as a prerequisite for deployment, but SageMaker allows deploying a pre-trained model without ever running a training job, so only the model creation and endpoint configuration are mandatory.

106
MCQhard

Refer to the exhibit. A training job failed with the error shown. What is the most likely cause?

A.The model architecture is incorrect
B.The training data contains missing values or outliers that cause numerical instability
C.The instance type does not have enough memory
D.The training job exceeded the maximum runtime
AnswerB

Error indicates NaN or infinity in input.

Why this answer

Option B is correct because the error explicitly states that the input contains NaN or infinity, indicating missing or invalid values in the training data. Option C is wrong because the error is about input values, not memory. Option D is wrong because the error is raised by the training script, not by a SageMaker runtime limit.

Option A is wrong because the error is about input data, not model architecture.

107
MCQeasy

A data scientist is using Amazon SageMaker to train a model with the built-in XGBoost algorithm. The dataset contains missing values. What is the default behavior of SageMaker XGBoost regarding missing values?

A.It raises an error and stops training
B.It imputes missing values with the column mean
C.It removes rows with missing values
D.It automatically learns the best direction (left or right) for missing values during training
AnswerD

XGBoost's sparsity-aware algorithm learns the optimal branch for missing values.

Why this answer

Option D is correct because XGBoost's sparsity-aware split finding learns, at each split, a default direction (left or right) for rows with missing values. Option B (mean imputation) is not the default; XGBoost handles missingness internally. Option C (removing rows) is not the default.

Option A (raising an error) is not the default.

108
Multi-Selecthard

A company uses Amazon Redshift to run analytics on sales data. The data is loaded daily from S3 using COPY commands. The team notices that the COPY command performance degrades over time due to table bloat. The team needs to maintain query performance and reduce storage costs. Which combination of maintenance operations should the team perform regularly? (Choose THREE.)

Select 3 answers
A.Run the UNLOAD command to export data to S3 and then reload.
B.Change the distribution style of the table to KEY.
C.Run the VACUUM command to reclaim space and re-sort rows.
D.Run a DEEP COPY to recreate the table with optimal physical storage.
E.Run the ANALYZE command to update table statistics.
AnswersC, D, E

VACUUM removes deleted rows and re-sorts data.

Why this answer

Options C, D, and E are correct. VACUUM reclaims space and re-sorts rows (if sort keys are defined). ANALYZE updates statistics for query planning.

DEEP COPY recreates the table to eliminate bloat completely. Option A is wrong because UNLOAD exports data; it is not a maintenance operation. Option B is wrong because changing the distribution style is a schema change, not regular maintenance.

109
MCQmedium

A data scientist is exploring a dataset stored as a single 2 GB object in S3. The scientist wants to read only a subset of the file (e.g., the first 1000 lines) to perform initial data inspection. Which approach should the scientist take to minimize data transfer and cost?

A.Use the AWS CLI to download the entire file and then use head to get the first lines.
B.Use S3 Select with a SQL query to retrieve the first 1000 rows.
C.Use the S3 Range header to read the first 1 MB of the file and parse lines.
D.Use Amazon Athena to query the file with LIMIT 1000.
AnswerB

S3 Select efficiently retrieves only the required subset.

Why this answer

Option B is correct because S3 Select retrieves a subset of the data using SQL-like queries, reducing data transfer. Option A is wrong because downloading the entire 2 GB file is inefficient and costly. Option C is wrong because a ranged GET (Range header) retrieves bytes, but the scientist wants lines, not bytes; it is possible but less convenient, since a fixed 1 MB range may hold more or fewer than 1000 lines and can cut the last line mid-record.

Option D is wrong because Athena scans the entire file; it is not efficient for a single large file.

110
MCQmedium

A data scientist is performing exploratory data analysis on a dataset containing customer transactions. The dataset has 1 million rows with 50 features, including numerical and categorical variables. The goal is to identify patterns and potential data quality issues before building a model. Which approach should the data scientist take to efficiently explore the data?

A.Use AWS Glue DataBrew to profile the dataset, view data quality reports, and visualize distributions.
B.Use Amazon Athena to run SQL queries and generate summary statistics.
C.Use Amazon SageMaker Data Wrangler to import the data and create a flow for feature engineering.
D.Use Amazon SageMaker Ground Truth to label the data and then analyze the labels.
AnswerA

DataBrew provides an interactive interface for data profiling, cleaning, and visualization, making it suitable for EDA.

Why this answer

AWS Glue DataBrew is purpose-built for visual data preparation and profiling without writing code. It can directly profile the 1-million-row dataset, automatically generate data quality reports (e.g., missing values, outliers, data types), and provide distribution visualizations for both numerical and categorical features, making it the most efficient choice for exploratory data analysis.

Exam trap

AWS often tests the distinction between tools for exploratory data analysis versus tools for data transformation or labeling, leading candidates to confuse SageMaker Data Wrangler (feature engineering) or Athena (SQL querying) with a dedicated profiling tool like DataBrew.

How to eliminate wrong answers

Option B is wrong because Amazon Athena is a serverless query engine for analyzing data in S3 using SQL, but it does not provide built-in profiling, data quality reports, or visualizations; it requires manual SQL queries to generate summary statistics, which is less efficient for exploratory analysis. Option C is wrong because Amazon SageMaker Data Wrangler is designed for importing, transforming, and creating feature engineering flows, not for initial data profiling and quality assessment; its primary purpose is preparing data for model training, not exploratory analysis. Option D is wrong because Amazon SageMaker Ground Truth is a data labeling service for creating labeled datasets, not for exploratory data analysis or profiling; using it to analyze labels would be an incorrect and inefficient use of the service.

111
MCQhard

A machine learning engineer is deploying a TensorFlow model to an Amazon SageMaker endpoint. The endpoint is behind an Application Load Balancer (ALB) for A/B testing. The engineer notices that the new variant is not receiving any traffic. What is the most likely cause?

A.The new variant's health checks are failing.
B.The ALB target group weight for the new variant is set to 0.
C.The model is not compatible with the ALB's protocol.
D.The ALB is not configured to route to SageMaker endpoints.
AnswerB

Weight of 0 means no traffic is sent.

Why this answer

Option B is correct because the ALB target group weight determines traffic distribution, and a weight of 0 sends no traffic to that variant. Option A is wrong because routing among healthy targets is based on weights, not health checks. Option D is wrong because the SageMaker endpoint is already the target.

Option C is wrong because a protocol mismatch would cause errors, not zero traffic.

112
MCQhard

A machine learning team is using Amazon SageMaker to train a model. The training data is stored in an S3 bucket. The team wants to ensure that the training job can access the data securely without using long-lived AWS credentials. Which approach should the team use?

A.Store AWS access keys in the training script
B.Use an S3 bucket policy that allows public access
C.Specify an IAM role in the SageMaker training job configuration
D.Create a new IAM user for each training job
AnswerC

SageMaker assumes the IAM role to access S3, providing temporary credentials and secure access.

Why this answer

Option C is correct because SageMaker training jobs can assume an IAM role specified in the job configuration to obtain temporary security credentials via AWS Security Token Service (STS). This allows the training job to access the S3 bucket securely without embedding long-lived AWS access keys in code or configuration files.

Exam trap

The trap here is that candidates may think embedding credentials in code (Option A) is acceptable for automation, but AWS services like SageMaker are designed to use IAM roles for temporary, scoped access, making long-lived credentials unnecessary and insecure.

How to eliminate wrong answers

Option A is wrong because storing AWS access keys in the training script violates security best practices by exposing long-lived credentials that could be compromised; SageMaker provides IAM roles to avoid this. Option B is wrong because making the S3 bucket publicly accessible would expose the training data to anyone on the internet, creating a severe security risk and violating data privacy requirements. Option D is wrong because creating a new IAM user for each training job is impractical and insecure, as it would require managing many long-lived credentials and does not leverage SageMaker's built-in IAM role-based access control.

113
Multi-Selecthard

A data scientist is building a binary classification model to predict loan default. The dataset is highly imbalanced (5% default, 95% non-default). Which TWO techniques should the data scientist use to address the class imbalance?

Select 2 answers
A.Undersample the majority class
B.Use RMSE as the evaluation metric
C.Oversample the minority class using SMOTE
D.Use accuracy as the evaluation metric
E.Use class weights in the loss function
AnswersC, E

SMOTE generates synthetic samples for the minority class, balancing the dataset.

Why this answer

Oversampling the minority class using SMOTE (Synthetic Minority Oversampling Technique) is correct because it generates synthetic samples for the minority class by interpolating between existing minority instances, rather than simply duplicating them. This helps balance the dataset without introducing exact copies, which can reduce overfitting and improve the model's ability to generalize to the minority class.

Exam trap

AWS often tests the misconception that accuracy is a valid metric for imbalanced datasets, or that undersampling is always preferable to oversampling, when in fact accuracy can be highly misleading and undersampling can discard critical data.
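
Worked example

The interpolation step that SMOTE relies on can be sketched as follows. This is a simplified, SMOTE-like illustration in plain Python rather than the reference algorithm or any library's implementation; the function name, the toy points, and the choice of k are assumptions made for the example.

```python
import random


def smote_like_sample(minority, k=2, rng=None):
    """Create one synthetic point by interpolating between a minority
    point and one of its k nearest minority neighbours."""
    rng = rng or random.Random(0)
    base = rng.choice(minority)
    # Rank the other minority points by squared Euclidean distance to base.
    others = [p for p in minority if p is not base]
    others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(p, base)))
    neighbour = rng.choice(others[:k])
    u = rng.random()  # interpolation factor in [0, 1)
    return tuple(a + u * (b - a) for a, b in zip(base, neighbour))


# Hypothetical minority-class feature vectors (e.g. defaulted loans).
minority = [(1.0, 1.0), (2.0, 1.5), (1.5, 2.0), (3.0, 3.0)]
synthetic = smote_like_sample(minority, k=2, rng=random.Random(42))
```

Because the new point lies on the segment between two real minority points, it is not an exact duplicate of either, which is the property the explanation above contrasts with plain oversampling.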

114
MCQmedium

A data scientist is exploring a dataset stored in an Amazon S3 bucket. The dataset contains both numerical and categorical features. The scientist wants to compute summary statistics (mean, median, standard deviation) for all numerical features and count the distinct values for categorical features. Which AWS service is most appropriate for this task with minimal coding?

A.Amazon Athena
B.AWS Glue ETL jobs
C.AWS Glue DataBrew
D.Amazon SageMaker Data Wrangler
E.Amazon EMR
AnswerC

DataBrew offers visual data profiling with summary statistics and distinct counts.

Why this answer

AWS Glue DataBrew provides a visual interface for data preparation and profiling, including summary statistics and distinct value counts, without writing code. Option D is wrong because Amazon SageMaker Data Wrangler requires integration with SageMaker and may require more setup. Option B is wrong because AWS Glue ETL jobs require coding in Python or Scala.

Option A is wrong because Amazon Athena requires SQL queries. Option E is wrong because Amazon EMR requires cluster management and coding.

115
MCQeasy

A machine learning engineer notices that the target variable in a regression dataset has a long-tailed distribution. Which visualization technique is most appropriate to assess the distribution before applying a log transformation?

A.Bar chart
B.Histogram with density curve
C.Box plot
D.Scatter plot
AnswerB

Histogram and density curve show the distribution shape, including long tails.

Why this answer

Option B is correct because a histogram or density plot clearly shows the shape and spread of the distribution, including long tails. Option C is incorrect because box plots show quartiles and outliers but not the full distribution shape. Option D is incorrect because scatter plots show relationships between two variables, not a univariate distribution.

Option A is incorrect because bar charts are for categorical data.

116
MCQhard

A data scientist is analyzing a dataset with a timestamp column and several numeric measurements. The goal is to detect seasonality and trends. Which AWS service can be used directly from SageMaker Studio to perform this analysis without writing code?

A.Amazon Forecast
B.Amazon SageMaker Data Wrangler
C.Amazon QuickSight ML Insights
D.AWS Glue DataBrew
AnswerB

Includes built-in time series analysis.

Why this answer

SageMaker Data Wrangler includes built-in time series analysis capabilities such as seasonality detection. Option C is wrong because QuickSight ML Insights requires data to be in QuickSight. Option D is wrong because Glue DataBrew is for data preparation but not specifically for time series analysis.

Option A is wrong because Forecast is for forecasting, not exploratory analysis.

117
MCQmedium

A data scientist is using SageMaker to train a model that requires access to a private S3 bucket in another account. The scientist has set up the correct IAM roles and bucket policies. However, the training job fails with an access denied error. What is the most likely cause?

A.The SageMaker execution role does not have s3:GetObject permission
B.The S3 objects are encrypted with SSE-KMS and the KMS key is not accessible
C.The training instance type does not support S3 access
D.The VPC used for training does not have a route to S3 (e.g., missing VPC endpoint or NAT)
AnswerD

If training is in a VPC without S3 access, the job cannot reach S3 despite correct IAM policies.

Why this answer

When a SageMaker training job runs inside a VPC (which is common for cross-account access), the VPC must have a route to S3, either through a VPC endpoint (Gateway or Interface) or a NAT gateway. Without such a route, the training instances cannot reach the private S3 bucket in the other account, even if IAM roles and bucket policies are correctly configured, resulting in an 'access denied' error because the network path is blocked.

Exam trap

The trap here is that candidates often assume 'access denied' always means an IAM or bucket policy issue, but in cross-account scenarios with VPCs, network misconfigurations (like missing VPC endpoints) are a common hidden cause that produces the same error message.

How to eliminate wrong answers

Option A is wrong because the question states that the correct IAM roles and bucket policies have been set up, so the execution role likely already has s3:GetObject permission; the error is not due to missing IAM permissions. Option B is wrong because while SSE-KMS can cause access issues, the question specifically says the scientist set up correct IAM roles and bucket policies, and there is no mention of KMS key configuration; the most likely cause in a cross-account scenario with VPC is network connectivity. Option C is wrong because all SageMaker training instance types support S3 access via the SageMaker-managed S3 client; there is no instance type that inherently lacks S3 connectivity.

118
Multi-Selecthard

A data scientist is training a binary classification model using Amazon SageMaker's built-in XGBoost algorithm. The dataset is highly imbalanced (95% negative class, 5% positive class). The model achieves high accuracy but poor recall on the positive class. Which TWO actions should the data scientist take to improve recall without significantly sacrificing precision?

Select 2 answers
A.Perform random undersampling of the majority class.
B.Set scale_pos_weight to the ratio of negative to positive samples.
C.Increase the max_depth hyperparameter.
D.Reduce the learning rate (eta) and increase num_round.
E.Use SMOTE to generate synthetic samples of the minority class.
AnswersB, E

This parameter assigns higher weight to the minority class, penalizing misclassifications more.

Why this answer

Options B and E are correct. Using scale_pos_weight adjusts the weight of the positive class, directly addressing imbalance. SMOTE oversamples the minority class to balance the dataset.

Option A is wrong because subsampling the majority class may lose information. Option C is wrong because increasing max_depth may overfit. Option D is wrong because reducing eta may slow convergence but not directly help imbalance.
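
Worked example

The negative-to-positive ratio used for scale_pos_weight can be computed directly from the labels. A minimal sketch, assuming labels are encoded as 0 (negative) and 1 (positive); the helper name and the toy label list are hypothetical, and the dictionary shows only the two hyperparameters relevant here rather than a complete training configuration.

```python
def scale_pos_weight(labels):
    """Return the ratio of negative to positive samples."""
    neg = sum(1 for y in labels if y == 0)
    pos = sum(1 for y in labels if y == 1)
    if pos == 0:
        raise ValueError("no positive samples in labels")
    return neg / pos


# Hypothetical labels matching the 95% / 5% split in the question.
labels = [0] * 950 + [1] * 50

params = {
    "objective": "binary:logistic",
    "scale_pos_weight": scale_pos_weight(labels),  # 950 / 50
}
```

For the split in the question the ratio is 19, so each positive example contributes to the gradient roughly as much as 19 negatives.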

119
MCQhard

A data scientist is training a binary classifier on a dataset with 1 million rows and 500 features. The model uses XGBoost and achieves an AUC of 0.95 on the training set but only 0.72 on the test set. The scientist suspects overfitting. Which combination of hyperparameter adjustments is most likely to improve generalization?

A.Increase 'max_depth' and decrease 'learning_rate'
B.Increase 'subsample' and decrease 'colsample_bytree'
C.Decrease 'max_depth' and increase 'min_child_weight'
D.Decrease 'gamma' and increase 'learning_rate'
AnswerC

Decreasing max_depth reduces tree complexity; increasing min_child_weight prevents overfitting by requiring more samples per leaf.

Why this answer

Option C is correct because decreasing 'max_depth' reduces the complexity of individual trees, preventing the model from learning overly specific patterns in the training data. Increasing 'min_child_weight' forces the algorithm to require a higher sum of instance weights (hessian) before further partitioning, which acts as a regularization mechanism that discourages splits on noisy or sparse data. Together, these adjustments directly combat overfitting in XGBoost by limiting tree depth and requiring more evidence for splits, which should narrow the gap between the training AUC of 0.95 and the test AUC of 0.72.

Exam trap

AWS often tests the misconception that increasing 'max_depth' or decreasing 'learning_rate' alone will fix overfitting, when in fact the correct approach is to reduce model complexity (decrease 'max_depth') and increase split regularization (increase 'min_child_weight').

How to eliminate wrong answers

Option A is wrong because increasing 'max_depth' makes trees deeper and more complex, which exacerbates overfitting, and decreasing 'learning_rate' alone does not compensate for the added depth; this combination would likely worsen the generalization gap. Option B is wrong because increasing 'subsample' (the fraction of rows sampled per tree) actually reduces randomness and can increase overfitting, while decreasing 'colsample_bytree' (the fraction of features sampled) adds some regularization but is insufficient to counterbalance the increased subsample; the net effect is ambiguous and not the most direct fix for overfitting. Option D is wrong because decreasing 'gamma' (the minimum loss reduction required for a split) allows more splits, increasing model complexity and overfitting, and increasing 'learning_rate' makes the model converge faster but with larger steps, which can also lead to overfitting; this combination moves in the wrong direction for regularization.

120
MCQhard

A data scientist is analyzing a dataset stored in Amazon S3 (100 GB, CSV format) using Amazon SageMaker Studio. The dataset contains 500 columns and 10 million rows. The data scientist wants to understand the distribution of each column, detect missing values, and identify outliers. However, the SageMaker Studio notebook instance runs out of memory when loading the entire dataset into a pandas DataFrame. The data scientist needs to complete the EDA efficiently without modifying the source data. What should the data scientist do?

A.Write a script that loads only a random 10% sample of rows to reduce memory usage.
B.Use AWS Glue ETL to transform the data into Parquet format and then load into pandas.
C.Launch a larger notebook instance with more memory (e.g., ml.r5.24xlarge) and reload the data.
D.Use Amazon SageMaker Data Wrangler to create a data flow that samples and profiles the data.
AnswerD

Data Wrangler can handle large datasets efficiently.

Why this answer

Using SageMaker Data Wrangler allows processing data in a distributed manner without loading everything into memory. Option C is wrong because increasing instance size may still not be enough and is costly. Option B is wrong because Glue ETL is more complex and not integrated with Studio.

Option A is wrong because a hand-written random sample may miss important data.

121
MCQeasy

During EDA, a data scientist finds that a categorical feature 'city' has 500 unique values but only 10 cities account for 90% of the data. What is a recommended way to handle the rare categories?

A.Group rare categories into a single 'Other' category.
B.Apply label encoding to all categories.
C.One-hot encode all 500 categories.
D.Drop all rows with rare categories.
AnswerA

Reduces cardinality and retains data.

Why this answer

Option A is correct because grouping rare categories into 'Other' reduces cardinality while keeping every row. Option C is wrong because one-hot encoding all 500 categories creates many sparse features. Option D is wrong because dropping the rows loses information.

Option B is wrong because label encoding the rare categories does not reduce cardinality.
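
Worked example

Grouping rare categories can be sketched as keeping the most frequent values until they cover a target share of the rows and relabelling the rest. This is a minimal sketch in plain Python; the function name, the 90% coverage default, and the toy city list are assumptions chosen to mirror the question.

```python
from collections import Counter


def group_rare(values, coverage=0.9, other_label="Other"):
    """Keep the most frequent categories until they cover `coverage`
    of the rows; relabel everything else as `other_label`."""
    counts = Counter(values)
    total = len(values)
    keep, covered = set(), 0
    for cat, n in counts.most_common():
        if covered / total >= coverage:
            break
        keep.add(cat)
        covered += n
    return [v if v in keep else other_label for v in values]


# Hypothetical data: two frequent cities cover 90% of 20 rows.
cities = ["Seattle"] * 12 + ["Austin"] * 6 + ["Boise"] + ["Tulsa"]
grouped = group_rare(cities)
```

No rows are dropped; only the labels of the infrequent cities change.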

122
MCQhard

A data scientist is performing exploratory data analysis on a dataset with mixed data types (numerical, categorical, text). The goal is to identify clusters of similar records. Which technique is most appropriate?

A.DBSCAN
B.Hierarchical clustering
C.K-means clustering
D.K-prototypes clustering
AnswerD

K-prototypes is designed for mixed numerical and categorical data.

Why this answer

K-prototypes extends k-means to handle mixed data by combining Euclidean distance for numerical features with Hamming (simple matching) distance for categorical features. K-means only works with numerical data. DBSCAN works on numerical data.

Hierarchical clustering typically uses numerical distance. Gower distance can be used but is less common in clustering algorithms.
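
Worked example

The combined distance can be sketched as a numerical term plus a weighted count of categorical mismatches. This is a minimal illustration, not a clustering implementation: the function name and sample records are hypothetical, and the use of squared Euclidean distance with a weight gamma on the categorical part is an assumption following the common formulation of k-prototypes.

```python
def mixed_dissimilarity(a_num, a_cat, b_num, b_cat, gamma=1.0):
    """Squared Euclidean distance on numerical features plus
    gamma times the number of mismatched categorical features."""
    numeric_part = sum((x - y) ** 2 for x, y in zip(a_num, b_num))
    categorical_part = sum(1 for x, y in zip(a_cat, b_cat) if x != y)
    return numeric_part + gamma * categorical_part


# Two hypothetical records: two numerical and two categorical features each.
d = mixed_dissimilarity(
    [1.0, 2.0], ["red", "US"],
    [2.0, 4.0], ["red", "CA"],
    gamma=0.5,
)
```

The weight gamma controls how much one categorical mismatch counts relative to numerical distance, so numerical features are usually scaled before this measure is applied.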

123
MCQmedium

A company is building a data pipeline that ingests data from on-premises databases into Amazon S3 using AWS Database Migration Service (AWS DMS). The company wants to capture continuous changes from the source database and replicate them to S3 in near-real time. Which AWS DMS configuration should the company use?

A.Create a full-load task to copy the existing data
B.Create a full-load plus CDC task with S3 target
C.Create a validation task to compare source and target
D.Create a CDC-only task with S3 as the target endpoint
AnswerD

CDC-only captures and replicates changes in near-real time.

Why this answer

Option D is correct because using a CDC-only task with S3 as the target endpoint replicates continuous changes to S3. Option A is wrong because a full-load task only migrates existing data. Option B is wrong because a full-load plus CDC task includes both, but the requirement is only changes.

Option C is wrong because a validation task is for data validation, not replication.

124
Multi-Selecteasy

A company wants to deploy a machine learning model on Amazon SageMaker and needs to monitor the model's performance in production. Which TWO AWS services can be used to set up monitoring?

Select 2 answers
A.Amazon CloudWatch
B.AWS X-Ray
C.Amazon Inspector
D.Amazon SageMaker Model Monitor
E.AWS Config
AnswersA, D

CloudWatch monitors endpoint metrics like latency and invocations.

Why this answer

Options A and D are correct. Amazon CloudWatch can monitor endpoint metrics, and SageMaker Model Monitor can detect data drift. Option E is wrong because AWS Config is for resource compliance, not model monitoring.

Option C is wrong because Amazon Inspector is for security assessment. Option B is wrong because AWS X-Ray is for tracing requests, not model performance.

125
Multi-Selectmedium

A data scientist is deploying a model to a SageMaker endpoint and needs to optimize for cost while maintaining low latency. Which TWO actions should the data scientist take?

Select 2 answers
A.Use a larger instance type
B.Deploy to a single instance
C.Switch to batch transform
D.Use SageMaker Serverless Inference
E.Enable Auto Scaling on the endpoint
AnswersD, E

Pay per inference, scales automatically, cost-effective.

Why this answer

Options D and E are correct. Option D: Serverless Inference scales automatically and is billed by usage. Option E: Auto Scaling adjusts instance count based on traffic, reducing cost during low demand.

Option A is wrong because larger instances increase cost. Option C is wrong because batch transform is not real-time. Option B is wrong because a single instance cannot absorb traffic spikes, which increases latency.

126
Multi-Selecteasy

A data analyst is using AWS Glue to catalog datasets for exploratory analysis. The analyst wants to understand the schema and data types. Which TWO tools can the analyst use to view the schema of a table in the AWS Glue Data Catalog? (Choose TWO.)

Select 2 answers
A.Amazon Athena
B.Amazon Redshift query editor
C.Amazon QuickSight
D.AWS Glue console
E.Amazon S3 console
AnswersA, D

Athena can query the Glue Data Catalog using SHOW CREATE TABLE or INFORMATION_SCHEMA.

Why this answer

Option A is correct because Athena can query the INFORMATION_SCHEMA to view table schemas. Option D is correct because the Glue console displays table schemas directly. Option C is wrong because QuickSight is a visualization tool, not for schema viewing.

Option E is wrong because S3 does not store schemas; it stores objects. Option B is wrong because the Redshift query editor returns data, not schemas from the Glue Data Catalog directly.

127
Multi-Selecthard

A company is using AWS Glue ETL jobs to transform data. The jobs are failing due to insufficient memory. The data processing involves complex joins and aggregations. Which THREE actions can improve job performance and reduce memory usage?

Select 3 answers
A.Filter and project data early in the transformation to reduce data volume
B.Decrease the number of DPUs allocated to the job
C.Repartition the data and use bucketing to reduce shuffle size
D.Increase the number of DPUs (workers) allocated to the job
E.Use a single node cluster to avoid shuffle overhead
AnswersA, C, D

Reduces memory footprint.

Why this answer

Option A is correct because filtering and projecting data early in the transformation reduces the volume of data that must be processed in subsequent operations like joins and aggregations. By using pushdown predicates and selecting only necessary columns, you minimize the data shuffled across the cluster, which directly reduces memory pressure and improves job performance in AWS Glue ETL.

Exam trap

The trap here is that candidates often assume reducing resources (Option B) or eliminating parallelism (Option E) will solve memory issues, when in fact these actions exacerbate the problem by increasing the data load per executor or removing the benefits of distributed processing.

128
MCQmedium

A data scientist is training a binary classifier on an imbalanced dataset where the positive class represents 1% of the data. The model is evaluated using accuracy, but the accuracy is 99% even though the model predicts all instances as negative. Which metric should the data scientist use to properly evaluate the model?

A.Root mean squared error (RMSE)
B.Mean squared error (MSE)
C.F1 score
D.Accuracy
AnswerC

F1 score combines precision and recall, providing a better measure for imbalanced classification.

Why this answer

The F1 score is the harmonic mean of precision and recall, making it robust to class imbalance. With 99% negative instances, accuracy is misleadingly high even if the model never predicts the positive class. F1 captures both false positives and false negatives, providing a balanced evaluation of the minority class performance.

Exam trap

The trap here is that candidates see 99% accuracy and assume the model is performing well, failing to recognize that accuracy is unreliable for imbalanced datasets, and they may incorrectly choose accuracy or a regression metric without considering the need for a precision-recall based metric like F1.

How to eliminate wrong answers

Option A is wrong because RMSE is a regression metric that measures the square root of the average squared differences between predicted and actual values, not suitable for binary classification evaluation. Option B is wrong because MSE is also a regression metric that penalizes larger errors quadratically, and it does not account for class imbalance or the confusion matrix structure. Option D is wrong because accuracy is dominated by the majority class in imbalanced datasets; a model predicting all negatives achieves 99% accuracy but fails to identify any positive instances, making it a misleading metric.
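
Worked example

The gap between accuracy and F1 for an all-negative predictor can be checked by hand. A minimal sketch in plain Python, assuming labels of 0 and 1; the function name and the 1,000-row toy dataset (1% positive, as in the question) are hypothetical, and undefined precision or recall is reported as 0.0 by convention here.

```python
def classification_report(y_true, y_pred):
    """Compute accuracy, precision, recall and F1 from 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# 1% positive class; the model predicts negative for every row.
y_true = [1] * 10 + [0] * 990
y_pred = [0] * 1000
report = classification_report(y_true, y_pred)
```

Accuracy comes out at 0.99 while recall and F1 are both 0, which is the mismatch the question is built around.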

129
MCQeasy

A company is using Amazon SageMaker to train a model. The training data is stored in an S3 bucket encrypted with AWS KMS. The SageMaker training role has the necessary permissions to decrypt the data. However, the training job fails with an access denied error. What is the most likely cause?

A.The S3 bucket policy does not grant access to the training role
B.The training image is not compatible with encrypted data
C.The training role does not have kms:Decrypt permission for the KMS key
D.CloudTrail logging is disabled
E.The training job is not in the same VPC as the S3 bucket
AnswerC

KMS requires explicit decrypt permission.

Why this answer

Option C is correct because SageMaker needs explicit permission (kms:Decrypt) to use the KMS key for decryption. Option A (S3 bucket policy) is less likely if the role has S3 access. Option E (VPC) is unrelated to KMS.

Option B (training image) is not relevant. Option D (CloudTrail) is for logging.

130
MCQhard

A machine learning engineer is deploying a model that predicts customer churn. The model outputs probabilities between 0 and 1. The business requires that at least 90% of customers flagged for churn actually churn (precision >= 0.9). Currently, the model's precision is 0.85 at the default threshold of 0.5. Which threshold adjustment should the engineer consider?

A.Decrease the threshold to 0.4
B.Decrease the threshold to 0.3
C.Increase the threshold to 0.7
D.Keep the threshold at 0.5
AnswerC

Higher threshold increases precision by requiring higher confidence for positive predictions.

Why this answer

Increasing the threshold to 0.7 raises the probability cutoff for classifying a customer as churning. This means only customers with a high predicted probability (strong model confidence) are flagged, which reduces false positives and increases precision. Since the current precision at 0.5 is 0.85 and the goal is ≥0.9, moving the threshold higher is the correct direction to achieve the required precision.

Exam trap

The trap here is that candidates often associate higher thresholds with lower recall and assume precision will drop, but in reality, increasing the threshold filters out low-confidence positives, which reduces false positives and increases precision.

How to eliminate wrong answers

Option A is wrong because decreasing the threshold to 0.4 would classify more customers as churn, including those with lower probabilities, which typically increases false positives and lowers precision further below 0.9. Option B is wrong because decreasing the threshold to 0.3 would have an even more extreme effect, flooding the flagged set with low-confidence predictions and worsening precision. Option D is wrong because keeping the threshold at 0.5 maintains the current precision of 0.85, which does not meet the business requirement of at least 0.9.
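
A small threshold sweep makes the direction of the effect concrete. The scores and labels below are made-up toy values, and `precision_at` is a hypothetical helper, so treat this as a sketch. Precision usually rises with the threshold but is not guaranteed to do so on every dataset, which is why the threshold is normally chosen from a validation-set sweep like this one.

```python
def precision_at(threshold, scores, labels):
    """Precision among the rows whose score is at or above the threshold."""
    flagged = [label for score, label in zip(scores, labels) if score >= threshold]
    if not flagged:
        return None  # nothing flagged, so precision is undefined
    return sum(flagged) / len(flagged)


# Hypothetical churn probabilities and true outcomes (1 = churned).
scores = [0.95, 0.90, 0.85, 0.80, 0.75, 0.72, 0.65, 0.60, 0.55, 0.52, 0.45, 0.35]
labels = [1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0]

for threshold in (0.3, 0.5, 0.7):
    print(threshold, precision_at(threshold, scores, labels))
```

On this toy data, raising the cutoff drops the low-confidence false positives first, so the flagged set becomes purer at the cost of flagging fewer customers.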

131
MCQeasy

A team is using SageMaker to train a model. They want to track hyperparameters, metrics, and model artifacts. Which SageMaker feature should they use?

A.SageMaker Pipelines
B.SageMaker Experiments
C.SageMaker Debugger
D.SageMaker Model Registry
AnswerB

Experiments track hyperparameters, metrics, and artifacts.

Why this answer

Option B is correct: SageMaker Experiments tracks parameters, metrics, and artifacts. Option D (Model Registry) manages model versions. Option A (Pipelines) orchestrates workflows.

Option C (Debugger) monitors training.

132
MCQmedium

A data scientist is working on a project to predict customer churn for a telecom company. The dataset includes 50,000 records with 20 features, including customer demographics, account information, and service usage. The data scientist uses Amazon SageMaker Studio and loads the data into a pandas DataFrame. During EDA, the data scientist notices that the target variable 'churn' has only 10% positive cases. Additionally, several features have missing values: 'income' has 5% missing, 'age' has 2% missing, and 'total_charges' has 1% missing. The data scientist also observes that 'income' is highly skewed with a long right tail, and 'age' is moderately skewed. The data scientist wants to handle missing values and prepare the data for modeling. Which course of action is most appropriate?

A.Impute 'income' with median, 'age' with median, 'total_charges' with median, and use SMOTE to handle class imbalance after splitting the data.
B.Remove all rows with any missing values, and use random oversampling to handle class imbalance.
C.Impute 'income' with mode, 'age' with mode, 'total_charges' with mode, and use SMOTE after splitting.
D.Impute all missing values with the mean of each column, and use stratified sampling to handle class imbalance.
AnswerA

Median is robust to skewness. SMOTE is appropriate for imbalance.

Why this answer

Option A is correct because median imputation is robust for skewed data, and for the target imbalance, SMOTE can be applied to the training set after splitting. Option D is wrong because mean imputation is sensitive to skewness, and stratified sampling preserves the class ratio rather than correcting it. Option B is wrong because removing rows with missing values would discard up to about 8% of the data, which is significant.

Option C is wrong because mode is for categorical data, not continuous.
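
The sensitivity of the mean to a long right tail is easy to see on a toy column. The income values and the `impute` helper below are hypothetical, included only as a sketch of why the median is the safer fill value for skewed features.

```python
import statistics

# Hypothetical right-skewed incomes: one very large value forms the long tail.
incomes = [32_000, 38_000, 41_000, 45_000, 52_000, 58_000, 61_000, 950_000]

mean_income = statistics.mean(incomes)
median_income = statistics.median(incomes)


def impute(values, fill):
    """Replace None entries with the given fill value."""
    return [fill if value is None else value for value in values]


with_missing = [32_000, None, 41_000, None, 950_000]
print("mean fill:  ", impute(with_missing, mean_income))
print("median fill:", impute(with_missing, median_income))
```

The mean is pulled far above every typical income by the single outlier, so mean-filled rows would look like high earners. The median stays in the middle of the bulk of the data.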

133
MCQhard

A data scientist is building a model to predict housing prices using a dataset with 100,000 records and 50 features. The features include 'sqft_living', 'sqft_lot', 'bedrooms', 'bathrooms', 'floors', 'waterfront', 'view', 'condition', 'grade', etc. The data scientist uses Amazon SageMaker Data Wrangler for EDA. Upon reviewing the data, the data scientist finds that 'sqft_living' has a correlation of 0.7 with 'sqft_above' (square footage above ground) and 0.6 with 'sqft_basement'. Also, 'grade' (overall grade of the house) is highly correlated with 'condition' (0.8). The target variable 'price' is right-skewed. The data scientist plans to use a linear regression model. Which set of actions should the data scientist take to improve model performance?

A.Apply standard scaling to all numeric features and use the data as is, since linear regression is robust to multicollinearity.
B.Remove all features that have correlation >0.5 with any other feature to eliminate multicollinearity, and apply standard scaling to all numeric features.
C.Apply principal component analysis (PCA) to all features to reduce dimensionality, and then fit linear regression on the principal components.
D.Apply log transformation to the target variable 'price' to reduce skewness, and remove either 'sqft_above' or 'sqft_living' and either 'grade' or 'condition' to handle multicollinearity.
AnswerD

Log transform addresses skewness; removing one of each pair reduces multicollinearity.

Why this answer

Option D is correct because log-transforming the target addresses skewness, and removing one feature from each highly correlated pair reduces multicollinearity. Option B is wrong because removing every feature with correlation above 0.5 may discard predictive power. Option A is wrong because standard scaling does not fix skewness or multicollinearity.

Option C is wrong because PCA on all features may lose interpretability and does not target the specific issues.

134
MCQhard

A data scientist is setting up an IAM policy for a SageMaker notebook instance that needs to read and write data in the 'training/' folder of an S3 bucket, and also list objects in the bucket. Does the policy satisfy the requirements?

A.Yes, the policy correctly grants the required permissions.
B.No, the policy must also include s3:DeleteObject for data cleaning.
C.No, the policy misses s3:GetObject for the bucket itself.
D.No, the condition on ListBucket is invalid.
AnswerA

The policy grants read/write on objects under training/ and list with prefix condition.

Why this answer

Option A is correct. The policy allows GetObject and PutObject on the training/ folder, and ListBucket with a condition restricting the prefix to training/*, which allows listing only that prefix. Option B is wrong because the requirements cover only reading, writing, and listing, so s3:DeleteObject is not needed.

Option D is wrong because the condition is valid. Option C is wrong because the policy is sufficient: s3:GetObject applies to objects, not to the bucket itself.

135
MCQeasy

A data scientist is training a binary classification model on an imbalanced dataset where the positive class is rare. The model currently achieves 95% accuracy but only 10% recall on the positive class. Which metric should the data scientist prioritize to improve model performance?

A.F1 score
B.AUC-ROC
C.Precision
D.Accuracy
AnswerA

F1 score combines precision and recall, making it suitable for imbalanced datasets where both false positives and false negatives are important.

Why this answer

The F1 score is the harmonic mean of precision and recall, making it the best single metric to optimize when the positive class is rare and both false positives and false negatives are costly. With 95% accuracy but only 10% recall, the model is likely predicting the majority class almost exclusively, so improving recall without sacrificing precision is critical — the F1 score directly balances this trade-off.

Exam trap

AWS often tests the misconception that accuracy is always the best metric, but the trap here is that on imbalanced datasets, accuracy is misleadingly high even when the model fails to detect the rare positive class, so candidates must recognize that F1 score (or precision-recall AUC) is the appropriate choice.

How to eliminate wrong answers

Option B (AUC-ROC) is wrong because AUC-ROC measures the model's ability to rank positive instances higher than negative ones across all thresholds, but it can be misleading on highly imbalanced datasets — a high AUC-ROC can still correspond to poor recall if the model scores all positives slightly above negatives but never predicts them. Option C (Precision) is wrong because optimizing precision alone would further reduce recall, making the model even less useful for detecting the rare positive class — precision focuses on minimizing false positives, not on capturing true positives. Option D (Accuracy) is wrong because accuracy is dominated by the majority class in imbalanced settings; a model that predicts the negative class for every instance can achieve 95% accuracy while having 0% recall, which is the exact problem described.

136
MCQeasy

A data scientist wants to deploy a PyTorch model for real-time inference with latency under 100 ms. Which AWS service is most suitable?

A.Amazon SageMaker real-time endpoint
B.Amazon SageMaker Processing
C.AWS Lambda with container image
D.Amazon SageMaker Batch Transform
AnswerA

Provides low-latency inference suitable for real-time applications.

Why this answer

Option A is correct because Amazon SageMaker real-time endpoints provide low-latency inference. Option D (SageMaker Batch Transform) is for batch predictions, not real-time. Option C (Lambda) has limited runtime and scalability for large models.

Option B (SageMaker Processing) is for data processing, not inference.

137
MCQmedium

A company is using SageMaker to deploy a real-time inference endpoint for a natural language processing model. The model receives input text and returns predictions. The data scientist notices that the endpoint latency increases significantly under load. Which design change would MOST effectively reduce latency?

A.Enable data capture for monitoring
B.Switch to batch transform for real-time predictions
C.Increase the number of instances behind the endpoint
D.Use an inference pipeline to combine preprocessing and model inference
AnswerD

Inference pipelines reduce network overhead between preprocessing and prediction.

Why this answer

Option D is correct because an inference pipeline in SageMaker allows you to chain preprocessing logic directly with the model inference within the same endpoint container. This eliminates the need for separate Lambda functions or client-side preprocessing, which reduces network round-trips and serialization overhead, thereby lowering latency under load.

Exam trap

The trap here is that candidates often assume scaling out (Option C) is the universal fix for latency, but the question specifically targets latency under load caused by preprocessing overhead, not throughput limits.

How to eliminate wrong answers

Option A is wrong because enabling data capture for monitoring adds additional I/O overhead and storage writes, which can increase latency rather than reduce it. Option B is wrong because batch transform is designed for offline, asynchronous predictions on large datasets, not for real-time inference; switching to it would break the real-time requirement and introduce significant latency due to job queuing. Option C is wrong because increasing the number of instances behind the endpoint improves throughput and availability but does not directly reduce per-request latency; it may even add slight overhead from load balancing.

138
MCQmedium

During exploratory data analysis, a data scientist notices that the distribution of a continuous feature is heavily right-skewed. Which transformation should be applied to make the distribution more symmetric for linear regression?

A.Standardization (z-score)
B.One-hot encoding
C.Min-max scaling
D.Log transformation
AnswerD

Log transformation reduces right skewness.

Why this answer

Option D is correct because log transformation compresses large values and is commonly used for right-skewed data. Option C is wrong because min-max scaling does not change distribution shape. Option A is wrong because standardization does not fix skewness.

Option B is wrong because one-hot encoding is for categorical features.
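
To see the effect numerically, the sketch below computes the sample skewness (third standardized moment) before and after a log transform. The feature values are invented for illustration, and `log1p` is used here as one common variant that also tolerates zeros.

```python
import math


def skewness(values):
    """Population skewness: third central moment over variance to the power 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical right-skewed feature with a long tail.
raw = [1, 2, 2, 3, 3, 4, 5, 8, 20, 100]
logged = [math.log1p(v) for v in raw]

print("skewness before:", round(skewness(raw), 2))
print("skewness after: ", round(skewness(logged), 2))
```

Applying the same `skewness` function to a min-max scaled or standardized copy of `raw` gives the same value as the raw data, because those are linear rescalings that leave the shape unchanged.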

139
MCQmedium

A company is using Amazon Kinesis Data Streams to ingest real-time clickstream data. The data is consumed by a Kinesis Data Analytics application that runs SQL queries. The application has been failing intermittently with 'ProvisionedThroughputExceededException' errors. Which action should be taken to resolve this issue?

A.Disable error logging in the Kinesis Data Analytics application.
B.Increase the record size in the Kinesis data stream.
C.Switch from Kinesis Data Analytics to Kinesis Data Firehose.
D.Increase the number of shards in the Kinesis data stream.
AnswerD

More shards increase read throughput capacity.

Why this answer

The error indicates that the shard's read throughput limit (5 transactions/second per shard) is being exceeded. Increasing the number of shards increases the total throughput. Option D (increase shard count) is the correct solution.

Option B (increase record size) could worsen the problem. Option C (use Kinesis Firehose) changes the architecture but does not address the shard throughput. Option A (disable error logging) does not solve the underlying issue.

140
MCQhard

A data scientist ran a hyperparameter tuning job for an XGBoost model. The tuning job completed, but the best validation RMSE is 2.34. The data scientist believes the model can perform better. Based on the exhibit, which change to the tuning strategy is most likely to improve the model's performance?

A.Use random search instead of Bayesian optimization
B.Change the objective to binary:logistic
C.Increase the maximum value of eta to 1.0
D.Increase the static num_round hyperparameter to 500
AnswerD

The tuning job fixed num_round to 100; increasing it allows more boosting rounds, which can improve model performance.

Why this answer

Option D is correct because increasing the static `num_round` hyperparameter to 500 allows the model to train for more boosting rounds, which can reduce underfitting and lower the RMSE further. The current best validation RMSE of 2.34 suggests the model may not have converged, and additional rounds can help the XGBoost model learn more complex patterns, provided overfitting is monitored with early stopping.

Exam trap

The trap here is that candidates may think increasing `eta` to 1.0 accelerates learning, but they overlook that a high learning rate without sufficient boosting rounds or regularization often causes the model to overshoot the optimal solution, degrading RMSE.

How to eliminate wrong answers

Option A is wrong because random search is less efficient than Bayesian optimization for hyperparameter tuning, as it does not learn from previous trials to focus on promising regions, so switching to random search would likely degrade performance. Option B is wrong because changing the objective to binary:logistic is for binary classification tasks, but the RMSE metric indicates a regression problem, so this would be a fundamental mismatch. Option C is wrong because increasing the maximum value of `eta` (learning rate) to 1.0 would make the model take overly large steps during training, likely causing divergence or poor convergence, which would worsen RMSE rather than improve it.

141
Multi-Selectmedium

Which TWO actions can help reduce inference latency for a SageMaker endpoint?

Select 2 answers
A.Switch to batch transform
B.Use SageMaker Neo to optimize the model
C.Use a larger instance type
D.Enable SageMaker Endpoint Cache
E.Use a multi-model endpoint
AnswersB, D

Neo optimizes models for target hardware, reducing latency.

Why this answer

Options B and D are correct. Option B optimizes the model for the target hardware, and Option D enables caching. Option C is wrong because larger instances may not reduce latency.

Option A is wrong because batch transforms are for batch, not real-time. Option E is wrong because multi-model endpoints are for hosting multiple models, not latency reduction.

142
MCQeasy

A data engineering team needs to orchestrate a complex workflow that involves multiple AWS Glue jobs, Lambda functions, and S3 operations. The workflow must run on a schedule and allow monitoring of each step. Which AWS service should they use?

A.Amazon Simple Workflow Service (SWF)
B.AWS Step Functions
C.AWS Data Pipeline
D.Amazon CloudWatch Events
AnswerB

Step Functions provides state machines to orchestrate multi-step workflows.

Why this answer

Option B is correct because AWS Step Functions is a serverless orchestration service that can coordinate multiple AWS services and visualizes workflows. Option D is wrong because CloudWatch Events can trigger based on events but not orchestrate. Option C is wrong because Data Pipeline is for data-driven workflows but less flexible.

Option A is wrong because Simple Workflow Service is a legacy service.

143
MCQmedium

A data scientist is analyzing a dataset with a skewed target variable for a regression problem. During EDA, the scientist wants to transform the target variable to approximate a normal distribution. Which transformation should the scientist apply first?

A.Quantile transformation
B.Min-Max scaling
C.Log transformation
D.Box-Cox transformation
AnswerD

Box-Cox automatically finds the best power transformation to achieve normality.

Why this answer

Option D is correct because Box-Cox is designed to make data more normally distributed; note that it requires strictly positive values. Option C is wrong because log is a special case of Box-Cox but less general. Option B is wrong because scaling does not change distribution shape.

Option A is wrong because quantile transformation can be used but may overfit; Box-Cox is parametric and often preferred.
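
The relationship between Box-Cox and the log transform can be sketched directly from the standard definition: (y^λ - 1) / λ for λ ≠ 0, and ln(y) for λ = 0. The function below is a minimal illustration with λ passed in by hand. In practice λ is estimated from the data (for example by maximum likelihood, which is what `scipy.stats.boxcox` does when no λ is supplied).

```python
import math


def box_cox(y, lam):
    """Box-Cox transform of a single strictly positive value."""
    if y <= 0:
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return math.log(y)  # the log transform is the lambda = 0 case
    return (y ** lam - 1) / lam


# Hypothetical skewed target values.
prices = [50.0, 80.0, 120.0, 400.0, 2500.0]
print([round(box_cox(p, 0), 3) for p in prices])    # same as the natural log
print([round(box_cox(p, 0.5), 3) for p in prices])  # square-root-like transform
```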

144
MCQhard

A data scientist is training an LSTM model for time series forecasting using Amazon SageMaker. The model is overfitting. Which action is LEAST likely to reduce overfitting?

A.Add dropout layers
B.Increase the number of LSTM layers
C.Reduce the number of hidden units
D.Use early stopping
AnswerB

Increases complexity, likely overfits more.

Why this answer

Option B is correct because increasing the number of LSTM layers typically increases model complexity, which can worsen overfitting. Option A is wrong because dropout helps. Option C is wrong because reducing hidden units reduces complexity.

Option D is wrong because early stopping prevents overfitting.

145
MCQmedium

A data engineer is using Amazon SageMaker Data Wrangler to perform exploratory data analysis on a large dataset stored in S3. The analysis reveals high cardinality in a categorical feature with over 1 million unique values. What is the best approach to handle this before training a model?

A.Apply one-hot encoding.
B.Use label encoding to convert categories to integers.
C.Drop the high-cardinality feature.
D.Use target encoding based on the mean of the target variable per category.
AnswerD

Target encoding reduces cardinality and captures target relationship.

Why this answer

Option D is correct because target encoding is effective for high cardinality. Option A is wrong because one-hot encoding would create too many columns. Option B is wrong because label encoding may introduce ordinal relationships.

Option C is wrong because dropping the feature may lose important information.
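
As a rough sketch of what target encoding does, the code below replaces each category with the mean of the target for that category and falls back to the global mean for unseen categories. The function names and data are hypothetical. Real pipelines usually fit the mapping on the training split only and add smoothing for rare categories, to limit target leakage.

```python
from collections import defaultdict


def fit_target_encoding(categories, targets):
    """Return ({category: mean target}, global mean) learned from training rows."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for category, target in zip(categories, targets):
        sums[category] += target
        counts[category] += 1
    mapping = {category: sums[category] / counts[category] for category in sums}
    global_mean = sum(targets) / len(targets)
    return mapping, global_mean


def apply_target_encoding(categories, mapping, global_mean):
    """Encode categories; unseen values fall back to the global mean."""
    return [mapping.get(category, global_mean) for category in categories]


# Hypothetical training data: a categorical ID column and a binary target.
train_ids = ["a", "a", "b", "b", "b", "c"]
train_target = [1, 0, 1, 1, 0, 1]

mapping, global_mean = fit_target_encoding(train_ids, train_target)
print(apply_target_encoding(["a", "b", "unseen"], mapping, global_mean))
```

However many unique values the column has, the encoded output is a single numeric column, which is what makes the approach practical at a cardinality of a million.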

146
MCQmedium

A data scientist runs the AWS CLI command shown in the exhibit to list objects larger than 100 KB in an S3 bucket. The data scientist wants to understand the size distribution of these files. What is the most significant limitation of this approach for EDA?

A.The command only returns objects larger than 100 KB, not equal to.
B.The command may return incomplete results if there are more than 1000 objects.
C.The command uses the wrong query syntax and will fail.
D.The command does not return the file names, only sizes.
AnswerB

S3 list-objects returns up to 1000 objects per call; pagination is required for more.

Why this answer

Option B is correct because the CLI command may not return all objects if there are more than 1000, as S3 list-objects paginates by default. Option D is wrong because the command returns size and key, which are sufficient. Option A is wrong because the command does include objects exactly 100 KB.

Option C is wrong because the query syntax is correct for filtering.

147
MCQmedium

A data scientist is using Amazon SageMaker Data Wrangler to explore a dataset. They notice that a feature has a very high correlation (0.95) with the target variable. What should they do to avoid overfitting?

A.Use L2 regularization in the model
B.Apply PCA to reduce dimensionality
C.Standardize the feature using StandardScaler
D.Remove the feature from the dataset
AnswerD

High correlation with target can indicate data leakage; removing is safest.

Why this answer

Option D is correct because a feature with correlation 0.95 may be directly derived from the target or leaking information; removing it prevents overfitting. Option B is wrong because PCA is for dimensionality reduction but may still include the leak. Option A is wrong because regularization helps but may not fully address leakage.

Option C is wrong because scaling does not address correlation.

148
MCQmedium

A data scientist is reviewing the training logs from a SageMaker training job. The model's loss decreases steadily and accuracy increases. However, when the model is evaluated on a holdout test set, the accuracy is only 0.65. Which issue does this behavior suggest?

A.The model is overfitting to the training data.
B.The learning rate is too high.
C.The model is underfitting the training data.
D.The training data is corrupted.
AnswerA

High training accuracy but low test accuracy is classic overfitting.

Why this answer

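Option A is correct because steadily improving training metrics combined with low test accuracy (0.65) indicate overfitting. Option B is wrong because the training loss decreases steadily and accuracy improves, which a learning rate that is too high would typically prevent. Option D is wrong because it's not a data issue.

Option C is wrong because it's not underfitting: the model fits the training data well.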

149
MCQeasy

A data scientist is training a regression model on a dataset with 50 features. After training a linear regression model, the model achieves an R-squared of 0.85 on the training set but only 0.55 on the test set. Which technique is most likely to reduce the generalization error?

A.Add more features
B.Remove highly correlated features
C.Increase the polynomial degree of the model
D.Apply L2 regularization (Ridge regression)
AnswerD

L2 regularization shrinks coefficients, reducing variance and improving test performance.

Why this answer

The model exhibits high variance (overfitting): high training R² (0.85) but much lower test R² (0.55). L2 regularization (Ridge regression) shrinks coefficients toward zero, reducing model complexity and penalizing large weights, which directly combats overfitting and improves generalization to unseen data.

Exam trap

AWS often tests the distinction between overfitting (high variance) and underfitting (high bias), and candidates mistakenly choose feature removal or polynomial adjustment when regularization is the direct fix for variance-dominated error.

How to eliminate wrong answers

Option A is wrong because adding more features would increase model complexity and likely worsen overfitting, not reduce generalization error. Option B is wrong because removing highly correlated features addresses multicollinearity, which inflates coefficient variance but is not the primary cause of the large train-test gap (overfitting) seen here. Option C is wrong because increasing the polynomial degree would further increase model flexibility and exacerbate overfitting, leading to even lower test performance.
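
The shrinkage effect of L2 regularization can be shown with the closed-form ridge solution for a single feature without an intercept, w = Σxy / (Σx² + α), which minimizes Σ(y - wx)² + αw². The data points and the name `ridge_slope` are illustrative assumptions. A real 50-feature model would use a library implementation, with α chosen by cross-validation.

```python
def ridge_slope(xs, ys, alpha):
    """Closed-form ridge coefficient for one feature and no intercept."""
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    return sum_xy / (sum_xx + alpha)


# Hypothetical training points that lie roughly on y = 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

for alpha in (0.0, 1.0, 10.0, 100.0):
    print(alpha, round(ridge_slope(xs, ys, alpha), 4))
```

With α = 0 this is ordinary least squares. As α grows, the coefficient is pulled toward zero, which trades a little bias for lower variance.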

150
MCQmedium

A data scientist is training a multiclass classification model to categorize support tickets into 50 categories. The dataset has 100,000 labeled tickets. The scientist uses a random forest classifier with 100 trees. The model achieves 90% accuracy on the test set, but the F1-score for some rare categories is below 0.1. The scientist wants to improve performance on rare categories without significantly reducing overall accuracy. Which approach should the scientist try?

A.Increase the maximum depth of trees
B.Reduce the number of trees to 50 to prevent overfitting
C.Switch to a one-vs-rest logistic regression model
D.Use class_weight='balanced' or compute custom class weights
AnswerD

Class weights penalize misclassifications of rare classes more heavily.

Why this answer

Option D (use class weights) helps the model focus on rare classes. Option B (reduce the number of trees) may hurt overall performance. Option C (use one-vs-rest logistic regression) may not handle rare classes well.

Option A (increase max_depth) could overfit.
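
For intuition about what a "balanced" weighting does, the sketch below applies the commonly used heuristic n_samples / (n_classes × class_count), which is the formula scikit-learn documents for class_weight='balanced'. The ticket categories and counts are made up for illustration.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


# Hypothetical ticket labels: one dominant category and two rare ones.
labels = ["billing"] * 90 + ["outage"] * 9 + ["legal"] * 1

for label, weight in sorted(balanced_class_weights(labels).items()):
    print(f"{label}: {weight:.2f}")
```

Rare categories receive proportionally larger weights, so misclassifying one of their tickets costs the model more during training than misclassifying a ticket from the dominant category.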
