Knowledge + Practice

AWS Certified Machine Learning Specialty (MLS-C01): Questions 1501–1575

1755 questions total · 24 pages · All types, answers revealed



1501

MCQ · hard

A data scientist is training a gradient boosting model on a large dataset (100 GB) stored in Amazon S3. The training job uses a SageMaker built-in XGBoost algorithm with a single ml.p3.2xlarge instance. The job fails with a memory error. Which solution should the data scientist adopt to resolve the memory issue?

A. Reduce the number of features by using PCA before training.

B. Use SageMaker Pipe mode to stream data directly from S3 instead of downloading it.

C. Increase the number of training instances to use distributed training with XGBoost.

D. Use the SageMaker BlazingText algorithm with negative sampling.

E. Switch to the SageMaker Linear Learner algorithm, which requires less memory.

Answer: C

Distributed training splits data across instances, reducing memory per instance.

Why this answer

Option C is correct because increasing the number of instances allows distributed training, which shards the data and reduces per-instance memory pressure. Option A is wrong because PCA reduces the number of features but may degrade model quality. Option B is wrong because Pipe mode only changes how the data reaches the container (streamed rather than downloaded first); XGBoost still has to hold the training data in memory.

Option D is wrong because BlazingText is a different algorithm with different memory characteristics, and XGBoost can already handle large data with distributed training. Option E is wrong because Linear Learner is a different algorithm and is not necessarily more memory-efficient.
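As a rough sizing aid for the scenario above, the following sketch estimates how many instances are needed for the dataset to fit in aggregate memory. The 61 GiB figure for ml.p3.2xlarge and the 80% usable fraction are assumptions for illustration, GB and GiB are treated as interchangeable, and `min_instances` is a hypothetical helper rather than part of any SageMaker API.

```python
import math


def min_instances(dataset_gb: float,
                  memory_per_instance_gb: float,
                  usable_fraction: float = 0.8) -> int:
    """Smallest instance count whose combined usable memory holds the dataset.

    usable_fraction is an assumed allowance for the OS, the container,
    and the algorithm's own working memory.
    """
    usable_gb = memory_per_instance_gb * usable_fraction
    return math.ceil(dataset_gb / usable_gb)


# Scenario from the question: 100 GB of data, assumed 61 GB RAM per instance.
print(min_instances(100, 61))
```

The in-memory representation of a dataset can be larger than its size on S3, so an estimate like this is a lower bound rather than a guarantee.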


1502

MCQ · medium

A data scientist ran a SageMaker training job that failed with the error shown. The training script expects the data in '/opt/ml/input/data/training/train.csv'. What is the most likely issue?

A. The hyperparameter 'sagemaker_program' is misspelled

B. The training script has a bug in reading the file

C. The channel name should be 'train' instead of 'training'

D. The S3 data path should point to the exact file, not the folder

Answer: D

SageMaker copies the prefix content into the channel directory; if train.csv is not at the root of that prefix, the path is wrong.

Why this answer

Option D is correct. The S3Uri points to a prefix (folder), but the script expects a specific file name. SageMaker copies the contents of the S3 prefix to the channel directory; if the file is not directly under that prefix, it won't be found.

Option A is wrong because the error is about the file, not hyperparameters. Option B is wrong because the script reads the file correctly as long as the file exists at the expected path. Option C is wrong because the channel name 'training' matches the path the script expects.
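The path mismatch described above can be illustrated with a simplified model of File-mode input. It assumes that objects under the configured S3 prefix land in `/opt/ml/input/data/<channel>/` with their path relative to the prefix preserved. The `local_path` helper and the S3 keys are hypothetical and only mimic that behaviour.

```python
import posixpath

CHANNEL_ROOT = "/opt/ml/input/data"


def local_path(channel: str, s3_prefix: str, s3_key: str) -> str:
    """Where an object would appear inside the container (simplified model)."""
    if not s3_key.startswith(s3_prefix):
        raise ValueError("key is not under the configured prefix")
    # Path of the object relative to the prefix; if the prefix is the full
    # object key, fall back to the bare file name.
    relative = s3_key[len(s3_prefix):].lstrip("/") or posixpath.basename(s3_key)
    return posixpath.join(CHANNEL_ROOT, channel, relative)


expected = "/opt/ml/input/data/training/train.csv"
key = "datasets/churn/2024/train.csv"  # hypothetical object key

too_broad = local_path("training", "datasets/churn/", key)
exact_folder = local_path("training", "datasets/churn/2024/", key)
exact_file = local_path("training", key, key)

print(too_broad)     # the file sits in a subdirectory the script does not look in
print(exact_folder)
print(exact_file)
```

In this model, only a data path that ends at the file itself or at its immediate folder puts train.csv where the script looks for it.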


1503

MCQ · easy

A machine learning engineer is deploying a model to SageMaker for real-time inference. The model is a TensorFlow SavedModel. Which SageMaker capability should be used to create an endpoint?

A. SageMaker hosting with TensorFlow Serving container

B. SageMaker Pipelines

C. SageMaker Model Monitor

D. SageMaker Ground Truth

Answer: A

SageMaker supports TensorFlow Serving for model deployment.

Why this answer

Option A is correct because SageMaker provides managed TensorFlow Serving containers. Option B is wrong because SageMaker Pipelines is for ML workflows. Option C is wrong because SageMaker Model Monitor is for monitoring.

Option D is wrong because SageMaker Ground Truth is for labeling data.


1504

MCQ · hard

A data engineer is performing exploratory data analysis on a large dataset stored in Amazon S3 (10 TB in CSV format). The dataset has 2000 columns and 50 million rows. The engineer needs to compute summary statistics (mean, median, standard deviation) for each numeric column and identify missing values. Which approach is MOST cost-effective and time-efficient?

A. Use Amazon Redshift Spectrum to query the data directly from S3.

B. Load the data into Amazon SageMaker Data Wrangler and compute statistics interactively.

C. Convert the data to Apache Parquet format, then use Amazon Athena to run SQL queries for statistics.

D. Use AWS Glue ETL to compute statistics and write results to S3.

Answer: C

Parquet reduces data scanned, and Athena is cost-effective for ad-hoc queries.

Why this answer

Converting the CSV data to a columnar format such as Parquet and then querying it with Amazon Athena reduces the amount of data scanned, which lowers query costs and improves performance. Option A (Redshift Spectrum) requires setting up a Redshift cluster, which is overkill. Option B (SageMaker Data Wrangler) may struggle with 2000 columns.

Option D (AWS Glue ETL) is more expensive and slower for simple statistics.


1505

Multi-Select · medium

Which TWO are appropriate techniques for detecting outliers in a dataset during exploratory data analysis?

Select 2 answers

A. Z-score method (assuming normal distribution)

B. One-hot encoding

C. Principal component analysis (PCA)

D. t-SNE

E. Interquartile range (IQR) method

Answers: A, E

The Z-score method flags points that lie many standard deviations from the mean. The IQR method flags points below Q1 − 1.5·IQR or above Q3 + 1.5·IQR.
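Both techniques can be sketched with the standard library alone. The threshold of 3 standard deviations and the 1.5 multiplier are common conventions rather than values taken from the question, and the data is made up for illustration.

```python
import statistics


def zscore_outliers(values, threshold=3.0):
    """Values whose distance from the mean exceeds `threshold` standard deviations."""
    mean = statistics.fmean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return []
    return [v for v in values if abs(v - mean) / stdev > threshold]


def iqr_outliers(values, k=1.5):
    """Values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical sample: 20 ordinary readings plus one extreme value.
data = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11] * 2 + [95]

print(zscore_outliers(data))
print(iqr_outliers(data))
```

The Z-score version is sensitive to the outliers themselves, because they inflate the mean and standard deviation. This is one reason the IQR method is often preferred for skewed data.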


1506

Multi-Select · easy

A data scientist is using Amazon SageMaker to train a model. The training data is stored in an S3 bucket encrypted with AWS KMS. Which TWO actions are necessary to allow SageMaker to access the data?

Select 2 answers

A. Ensure the SageMaker execution role has s3:GetObject permission.

B. Enable S3 Transfer Acceleration.

C. Set up a VPC endpoint for S3.

D. Add a bucket policy allowing SageMaker access.

E. Grant the SageMaker execution role kms:Decrypt permission.

Answers: A, E

The execution role must be able to read the objects (s3:GetObject) and decrypt them (kms:Decrypt).

Why this answer

Option A is correct because the execution role needs permission to read the objects. Option E is correct because SageMaker needs permission to decrypt data encrypted with the KMS key. Option B is wrong because S3 Transfer Acceleration is not required.

Option C is wrong because a VPC endpoint is not required. Option D is wrong because a bucket policy is a separate mechanism and is not needed when the execution role already grants access.
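The two required permissions can be written as an identity policy for the execution role, shown here as a Python dictionary. This is a minimal sketch: the bucket name, account ID, and key ARN are hypothetical placeholders, and real setups often also need s3:ListBucket and a KMS key policy that permits the role.

```python
import json

BUCKET = "example-training-bucket"  # hypothetical bucket name
KMS_KEY_ARN = "arn:aws:kms:us-east-1:111122223333:key/EXAMPLE-KEY-ID"  # hypothetical

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            # Lets the role read the training objects.
            "Sid": "ReadTrainingData",
            "Effect": "Allow",
            "Action": ["s3:GetObject"],
            "Resource": f"arn:aws:s3:::{BUCKET}/*",
        },
        {
            # Lets the role decrypt objects encrypted with the KMS key.
            "Sid": "DecryptTrainingData",
            "Effect": "Allow",
            "Action": ["kms:Decrypt"],
            "Resource": KMS_KEY_ARN,
        },
    ],
}

print(json.dumps(policy, indent=2))
```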


1507

Multi-Select · hard

A data scientist is using Amazon SageMaker Debugger to monitor training. Which THREE types of issues can Debugger monitor?

Select 3 answers

A. Hardware failures

B. Poor weight initialization

C. Data drift

D. Overfitting

E. Vanishing gradients

Answers: B, D, E

Debugger's built-in rules can detect poor weight initialization, overfitting, and vanishing gradients.

Why this answer

Amazon SageMaker Debugger can monitor training for poor weight initialization by analyzing tensors and gradients during the training process. It uses built-in rules to detect if weights are initialized with values that are too large or too small, which can lead to slow convergence or failure to learn. This is a core capability of Debugger's real-time monitoring of model parameters.

Exam trap

The trap here is that candidates confuse SageMaker Debugger (which monitors training metrics like gradients and weights) with SageMaker Model Monitor (which monitors inference data for drift and bias), leading them to incorrectly select data drift as a Debugger capability.


1508

MCQ · hard

A data science team uses Amazon SageMaker to train models on a dataset stored in Amazon S3. The dataset is 2 TB and is accessed by multiple training jobs. The team notices that training jobs are slow due to high S3 GET request latency. Which solution would provide the fastest and most cost-effective data access?

A. Place all training instances in a Cluster Placement Group

B. Enable S3 Transfer Acceleration on the bucket

C. Mount an Amazon FSx for Lustre file system integrated with the S3 bucket

D. Use Elastic Fabric Adapter (EFA) for training instances

Answer: C

FSx for Lustre provides a high-performance file system that can read data from S3 with low latency.

Why this answer

Using Amazon FSx for Lustre as a managed Lustre file system with S3 integration provides high-throughput, low-latency access for training jobs. Option A (Cluster Placement Group) reduces network latency between instances but does not improve S3 access. Option B (S3 Transfer Acceleration) improves upload speed, not GET latency for large datasets.

Option D (Elastic Fabric Adapter) improves inter-node communication, not data access from S3.


1509

Multi-Select · hard

A company is using Amazon Kinesis Data Streams to ingest real-time clickstream data. The data must be processed and stored in S3 in near real-time. Which THREE services can be used together to achieve this?

Select 3 answers

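A. Amazon Kinesis Data Analytics

B. Amazon Kinesis Data Firehose

C. AWS Glue ETL

D. AWS Lambda

E. Amazon EMR

Answers: A, B, D

Kinesis Data Analytics processes the stream in real time, Lambda can transform records, and Kinesis Data Firehose delivers the results to S3.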

Why this answer

Amazon Kinesis Data Analytics is correct because it can process streaming clickstream data in real-time using SQL or Apache Flink, enabling transformations, aggregations, and filtering before the data is delivered downstream. It integrates directly with Kinesis Data Streams as a source and can output processed records to Kinesis Data Firehose for storage in Amazon S3, achieving near real-time processing and storage.

Exam trap

The trap here is that candidates often assume AWS Glue ETL can handle real-time streaming because it supports Spark Streaming, but Glue ETL jobs are fundamentally batch-oriented and not designed for continuous, low-latency ingestion from Kinesis Data Streams into S3.


1510

MCQ · easy

A data engineer needs to store streaming data from thousands of IoT devices for real-time analytics. Which AWS service is most suitable for ingesting and storing this data for subsequent processing by Amazon Kinesis Data Analytics?

A. Amazon Kinesis Data Streams

B. Amazon S3

C. Amazon RDS

D. Amazon DynamoDB

Answer: A

Kinesis Data Streams ingests and stores streaming data in real-time for analytics.

Why this answer

Option A is correct because Kinesis Data Streams is designed for real-time data ingestion and storage, and integrates with Kinesis Data Analytics. Option B is wrong because S3 is not real-time. Option C is wrong because RDS is a relational database not suited for high-velocity streaming. Option D is wrong because DynamoDB is a NoSQL database, not a streaming service.


1511

Multi-Select · hard

A company is using Amazon SageMaker to tune hyperparameters for a gradient boosting model. The objective is to minimize root mean squared error (RMSE). The data scientist wants to explore the hyperparameter space efficiently. Which THREE hyperparameter tuning strategies should the data scientist consider? (Choose 3.)

Select 3 answers

A. Bayesian optimization

B. Random search

C. Grid search

D. Manual search

E. Hyperband

Answers: A, B, E

Bayesian optimization uses a probabilistic model to guide the search.

Why this answer

Bayesian optimization is correct because it builds a probabilistic model of the objective function (RMSE) and uses an acquisition function to select the next hyperparameter combination to evaluate. This approach is sample-efficient, making it ideal for expensive-to-evaluate models like gradient boosting, as it balances exploration and exploitation to find optimal hyperparameters with fewer trials.

Exam trap

The trap here is that candidates often assume grid search is the most thorough strategy, but in practice it is inefficient for high-dimensional spaces, while SageMaker explicitly supports Bayesian optimization, random search, and Hyperband as the three built-in tuning strategies.


1512

Multi-Select · medium

A data scientist is training a neural network for image classification. The training loss is not decreasing significantly, and the validation loss is high. Which TWO actions should the scientist take to address potential vanishing gradients?

Select 2 answers

A. Increase the learning rate

B. Use ReLU activation functions in hidden layers

C. Switch activation functions from ReLU to sigmoid

D. Add batch normalization layers

E. Remove dropout layers

Answers: B, D

ReLU does not saturate for positive inputs, reducing vanishing gradient risk.

Why this answer

ReLU activation functions help mitigate vanishing gradients because they output a constant gradient of 1 for positive inputs, preventing the gradient from shrinking as it propagates backward through many layers. This avoids the exponential decay of gradients that occurs with saturating activations like sigmoid or tanh, enabling effective training of deep networks.

Exam trap

The trap here is that candidates may confuse vanishing gradients with overfitting or learning rate issues, leading them to choose options like increasing the learning rate or removing dropout, which do not address the fundamental gradient propagation problem.
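The gradient-shrinking effect behind answer B can be seen numerically. The sketch below multiplies the activation derivatives across 20 layers while ignoring the weight matrices, which is a deliberate simplification. The depth and pre-activation value are arbitrary illustrative choices.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x: float) -> float:
    s = sigmoid(x)
    return s * (1.0 - s)  # never larger than 0.25


def relu_grad(x: float) -> float:
    return 1.0 if x > 0 else 0.0


def backprop_factor(grad_fn, pre_activations) -> float:
    """Product of activation derivatives along one path through the network."""
    factor = 1.0
    for z in pre_activations:
        factor *= grad_fn(z)
    return factor


depth = 20
pre_activations = [0.5] * depth  # hypothetical positive pre-activations

sigmoid_factor = backprop_factor(sigmoid_grad, pre_activations)
relu_factor = backprop_factor(relu_grad, pre_activations)

print(f"sigmoid: {sigmoid_factor:.3e}")
print(f"relu:    {relu_factor:.3e}")
```

With sigmoid, the factor collapses toward zero after a few layers. With ReLU, it stays at 1 for positive inputs.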


1513

MCQ · medium

In exploratory data analysis, a data scientist notices that the distribution of a feature 'income' is heavily right-skewed. Which transformation is most appropriate to reduce skewness?

A. Standardization (z-score).

B. Square transformation.

C. Min-max scaling.

D. Log transformation.

Answer: D

Log transformation reduces right skew.

Why this answer

Log transformation is the most appropriate technique to reduce right skewness in a feature like 'income' because it compresses the long tail of high values while expanding the lower end, making the distribution more symmetric. This is particularly effective for income data, which often follows a log-normal distribution, and is a standard preprocessing step in machine learning to improve model performance.

Exam trap

The trap here is that candidates confuse scaling techniques (which change range or variance) with transformations that alter distribution shape, leading them to pick standardization or min-max scaling as a fix for skewness.

How to eliminate wrong answers

Option A is wrong because standardization (z-score) centers and scales the data to have mean 0 and standard deviation 1, but it does not change the shape of the distribution, so skewness remains. Option B is wrong because a square transformation amplifies larger values even more, which would worsen right skewness rather than reduce it. Option C is wrong because min-max scaling linearly rescales the data to a fixed range (e.g., [0,1]), which preserves the original distribution shape and does not address skewness.
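The effect of each option on skewness can be checked on a small sample. The incomes below are hypothetical and double at each step, so their logarithms are evenly spaced. `skewness` is a plain population-moment estimate written for this illustration.

```python
import math
import statistics


def skewness(values) -> float:
    """Population skewness: mean of cubed standardized deviations."""
    mean = statistics.fmean(values)
    stdev = statistics.pstdev(values)
    return sum(((v - mean) / stdev) ** 3 for v in values) / len(values)


# Hypothetical right-skewed incomes.
incomes = [10_000, 20_000, 40_000, 80_000, 160_000, 320_000, 640_000]

mean, stdev = statistics.fmean(incomes), statistics.pstdev(incomes)
standardized = [(v - mean) / stdev for v in incomes]   # option A
squared = [v ** 2 for v in incomes]                    # option B
logged = [math.log(v) for v in incomes]                # option D

print(f"raw:          {skewness(incomes):.3f}")
print(f"standardized: {skewness(standardized):.3f}")
print(f"squared:      {skewness(squared):.3f}")
print(f"log:          {skewness(logged):.3f}")
```

Standardization leaves the skewness unchanged, squaring increases it, and the log transform brings it close to zero for this sample.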


1514

MCQ · medium

A company is building a data pipeline using AWS Glue to transform data from Amazon RDS to Amazon S3. The pipeline runs daily and processes about 500 GB of data. The team notices that the job is taking longer than expected. Which change would MOST improve the job performance?

A. Disable job bookmarking

B. Increase the number of DPUs for the Glue job

C. Upgrade the RDS instance to a larger class

D. Use smaller file sizes in S3 output

Answer: B

More DPUs provide more parallelism and can speed up the job.

Why this answer

Increasing the number of DPUs (Data Processing Units) allocated to the Glue job can improve parallelism and reduce execution time. Option A is wrong because disabling job bookmarking is not a performance improvement. Option C is wrong because the RDS instance may not be the bottleneck.

Option D is wrong because using smaller output files could increase overhead.


1515

MCQ · medium

A company runs a daily batch ETL job using AWS Glue that reads from Amazon RDS (MySQL), transforms the data, and writes to Amazon Redshift. The job takes 6 hours and processes 500 GB of data. Management wants to reduce the runtime. Which action would be MOST effective?

A. Increase the node size of the Redshift cluster

B. Use the Redshift COPY command to load data directly from RDS

C. Use Amazon RDS with Provisioned IOPS SSD storage

D. Increase the number of DPUs allocated to the Glue job

Answer: D

More DPUs allow Glue to process data in parallel, reducing overall runtime.

Why this answer

Option D is correct. Increasing the number of DPUs (Data Processing Units) in the Glue job can parallelize the workload and reduce runtime. Option A (increase Redshift node size) helps only at the write stage.

Option B (using the COPY command) is a Redshift load mechanism and does not speed up the Glue transformation. Option C (Provisioned IOPS SSD on RDS) does not address the bottleneck.


1516

MCQ · hard

A data scientist is analyzing a dataset for a binary classification problem. The dataset has 10,000 samples and 200 features. After splitting into training (80%) and test (20%), the data scientist trains a decision tree classifier and achieves 100% accuracy on the training set but only 55% on the test set. Which step should the data scientist take first to address this issue?

A. Use cross-validation to evaluate model performance

B. Collect more training data

C. Add more features to the model

D. Prune the decision tree to reduce complexity

Answer: D

Pruning reduces the tree's complexity, which directly counters overfitting.

Why this answer

Option D is correct because the large discrepancy between training and test accuracy indicates overfitting, and pruning the decision tree (e.g., limiting max_depth) reduces overfitting. Option A is wrong because cross-validation is a technique to evaluate model performance but does not directly fix overfitting; pruning does. Option B is wrong because more data may help but is not the first step; also, data is limited.

Option C is wrong because more features may worsen overfitting.


1517

MCQ · hard

A data scientist is using Amazon SageMaker's built-in BlazingText algorithm for word2vec embeddings. The dataset is a corpus of 10 million documents. After training, the data scientist observes that the learned embeddings do not capture semantic similarity well (e.g., 'king' and 'queen' are not close). Which hyperparameter adjustment is most likely to improve the quality of embeddings?

A. Increase the vector dimensionality

B. Decrease the window size

C. Decrease the number of negative samples

D. Increase the learning rate

Answer: A

Higher dimensionality allows embeddings to capture more fine-grained semantic relationships.

Why this answer

Increasing the vector dimensionality allows the model to capture more nuanced semantic relationships and co-occurrence patterns in the data. With 10 million documents, a low dimensionality (for example, 100) may be insufficient to encode the rich contextual information, so raising it (e.g., to 300 or 500) gives the model more capacity to learn high-quality embeddings where words like 'king' and 'queen' become closer in vector space.

Exam trap

The trap here is that candidates often confuse 'window size' with 'context size' and assume decreasing it helps with similarity, but in reality, a larger window captures broader topical relationships, while a smaller window captures syntactic patterns; for semantic similarity, a moderate to large window is needed.

How to eliminate wrong answers

Option B is wrong because decreasing the window size reduces the context window, making the model focus on very local word co-occurrences, which actually harms the capture of broader semantic similarity like 'king' and 'queen'. Option C is wrong because decreasing the number of negative samples reduces the discriminative training signal, making it harder for the model to separate similar from dissimilar words, thus degrading embedding quality. Option D is wrong because increasing the learning rate can cause the optimization to overshoot minima or diverge, leading to unstable training and poor embeddings; the default learning rate in BlazingText is already tuned for convergence.


1518

MCQ · medium

A data scientist is training a neural network for time series forecasting. The training loss decreases initially but then starts to increase after 20 epochs. Which action should the scientist take to address this?

A. Increase the dropout rate

B. Increase the learning rate

C. Implement early stopping based on validation loss

D. Add more layers to the network

Answer: C

Early stopping halts training when validation loss stops improving, preventing overfitting.

Why this answer

Option C is correct because early stopping monitors validation loss and stops training when it starts increasing, preventing overfitting. Option B is wrong because increasing the learning rate may cause divergence. Option D is wrong because adding more layers could worsen overfitting.

Option A is wrong because dropout helps with overfitting in general, but the issue here is that the loss starts increasing after a certain epoch; early stopping addresses that directly.
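Early stopping with a patience counter can be sketched in a few lines. The loss values, the patience of 3, and the helper name `early_stopping_epoch` are illustrative assumptions, not taken from any particular framework.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return (epoch at which training stops, epoch with the best loss).

    Training stops once `patience` consecutive epochs fail to improve the
    best validation loss by more than `min_delta`.
    """
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch
    # Patience was never exhausted: training ran for all epochs.
    return len(val_losses) - 1, best_epoch


# Hypothetical validation losses: improving, then degrading.
losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.56, 0.61, 0.70]
stopped_at, best_epoch = early_stopping_epoch(losses, patience=3)
print(stopped_at, best_epoch)
```

In practice, the model weights from `best_epoch` are restored after stopping, so the degraded later epochs are discarded.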


1519

Multi-Select · easy

A data scientist is using Amazon SageMaker to build a custom training algorithm. The algorithm requires a specific library that is not included in the default SageMaker containers. The scientist wants to create a custom container that includes this library. Which TWO steps are required? (Choose TWO.)

Select 2 answers

A. Upload the Docker image to an Amazon S3 bucket

B. Create an AWS Lambda layer with the library

C. Build a Docker image with the required library

D. Register the container in the SageMaker Model Registry

E. Push the Docker image to Amazon ECR

Answers: C, E

Docker is used to create custom containers, and SageMaker pulls container images from Amazon ECR.

Why this answer

Options C and E are correct. The custom container must be built using Docker, and it must be pushed to Amazon ECR to be used by SageMaker. Option A is wrong because the container image is stored in ECR, not S3.

Option B is wrong because AWS Lambda is for serverless functions, not for running training containers. Option D is wrong because the container does not need to be registered in SageMaker Model Registry.


1520

MCQ · medium

An IAM policy is attached to a SageMaker notebook instance. The data scientist wants to use the notebook to train a model using data from S3 bucket 'my-bucket'. However, the training job fails with an access denied error. What is the MOST likely cause?

A. The notebook instance role does not have iam:PassRole permission to pass the SageMaker execution role

B. The sagemaker:CreateTrainingJob permission is not allowed on the specific resource

C. The S3 bucket resource ARN is incorrectly formatted

D. The s3:GetObject permission is missing for the bucket

Answer: A

SageMaker needs the notebook role to pass an execution role to training jobs.

Why this answer

Option A is correct because the notebook instance role needs permission to pass the execution role to SageMaker (iam:PassRole). The policy allows S3 and SageMaker actions but not PassRole. Option D is incorrect because GetObject and PutObject are allowed.

Option B is incorrect because SageMaker actions are allowed. Option C is incorrect because the resource ARN is correct.
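The missing permission could be added with a statement along the following lines, shown as a Python dictionary. This is a sketch under assumptions: the role ARN and account ID are hypothetical, and the condition restricting the pass to SageMaker is a common hardening choice rather than something stated in the question.

```python
import json

# Hypothetical ARN of the execution role that training jobs will assume.
EXECUTION_ROLE_ARN = "arn:aws:iam::111122223333:role/ExampleSageMakerExecutionRole"

pass_role_statement = {
    "Sid": "AllowPassingExecutionRoleToSageMaker",
    "Effect": "Allow",
    "Action": "iam:PassRole",
    "Resource": EXECUTION_ROLE_ARN,
    # Only allow the role to be passed to the SageMaker service.
    "Condition": {
        "StringEquals": {"iam:PassedToService": "sagemaker.amazonaws.com"}
    },
}

print(json.dumps(pass_role_statement, indent=2))
```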


1521

Multi-Select · hard

A data scientist is performing EDA on a dataset with 1 million rows and 50 features. The dataset includes a column 'user_id' with unique identifiers, a column 'event_date' with timestamps, and other columns. Which TWO actions should the data scientist take to understand data quality issues?

Select 2 answers

A. Analyze missing value patterns across columns

B. Check for duplicate rows based on 'user_id' and 'event_date'

C. Drop the 'user_id' column to reduce dimensionality

D. Use PCA to reduce dimensions and visualize

E. Train a random forest model to identify feature importance

Answers: A, B

Missing value analysis is key for data quality.

Why this answer

Checking for duplicate rows and analyzing missing values are fundamental steps in EDA. Option C is wrong because dropping user_id before analysis may lose information. Option E is wrong because training a model is not part of EDA.

Option D is wrong because PCA is not for data quality.
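Both checks can be sketched without any third-party library. The rows below are hypothetical, and at the scale in the question a data-frame library or a SQL engine would be the more practical tool.

```python
from collections import Counter

# Hypothetical sample rows; None marks a missing value.
rows = [
    {"user_id": "u1", "event_date": "2024-01-01", "amount": 10.0},
    {"user_id": "u1", "event_date": "2024-01-01", "amount": 10.0},
    {"user_id": "u2", "event_date": "2024-01-02", "amount": None},
    {"user_id": "u3", "event_date": None, "amount": 7.5},
]


def duplicate_keys(rows, key_cols):
    """Key combinations that occur more than once, with their counts."""
    counts = Counter(tuple(row[col] for col in key_cols) for row in rows)
    return {key: n for key, n in counts.items() if n > 1}


def missing_counts(rows):
    """Number of missing (None) values per column."""
    counts = Counter()
    for row in rows:
        for col, value in row.items():
            if value is None:
                counts[col] += 1
    return dict(counts)


print(duplicate_keys(rows, ["user_id", "event_date"]))
print(missing_counts(rows))
```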


1522

MCQ · medium

A data pipeline uses Amazon Kinesis Data Streams to ingest clickstream data. The data is consumed by an AWS Lambda function that transforms and writes to Amazon DynamoDB. The Lambda function is throttled during traffic spikes, causing data to be reprocessed. Which solution should the team implement to handle the throttling without losing data?

A. Use Amazon SQS as an intermediate buffer between Kinesis and Lambda.

B. Increase the number of shards in the Kinesis stream and configure a dead-letter queue (DLQ) for the Lambda function.

C. Enable DynamoDB auto scaling to handle writes.

D. Reduce the batch size in the Lambda event source mapping.

Answer: B

More shards increase parallelism; DLQ captures failures for reprocessing.

Why this answer

Option B is correct because increasing the number of shards increases the streaming capacity, and using a DLQ captures failed records for later reprocessing. Option C is wrong because DynamoDB auto scaling does not address Lambda throttling. Option D is wrong because reducing batch size may increase processing overhead.

Option A is wrong because SQS is not needed, as Kinesis already buffers data.


1523

MCQ · easy

A data scientist needs to understand the distribution of a continuous variable in a large dataset stored in Amazon S3. Which AWS service is most appropriate for quickly generating summary statistics and visualizations?

A. AWS Glue

B. Amazon Athena

C. Amazon QuickSight

D. Amazon SageMaker Studio

Answer: C

QuickSight can connect directly to S3 and create interactive dashboards with summary statistics.

Why this answer

Option C is correct because Amazon QuickSight integrates with S3 and can generate histograms and summary statistics without needing to move data. Option D is wrong because SageMaker is for model building, not quick ad-hoc analysis. Option B is wrong because Athena is for querying, not visualization.

Option A is wrong because Glue is for ETL, not analysis.


1524

MCQ · hard

A company is using Amazon SageMaker to deploy a model for real-time inference. The model endpoint is behind an Application Load Balancer (ALB) for A/B testing. The data scientist notices that the endpoint is returning HTTP 503 errors intermittently. The CloudWatch metrics show that the endpoint's Invocations metric is within limits, but the ModelLatency metric has high variance. What is the most likely cause?

A. The model container is using a custom inference code that has a bug.

B. The ALB health check is misconfigured and marking instances unhealthy.

C. The endpoint instance type does not have enough memory for the model.

D. The endpoint is configured with too few instances; increase the instance count.

Answer: C

Insufficient memory can cause the model to fail to respond, leading to 503 errors.

Why this answer

Option C is correct because high variance in ModelLatency combined with intermittent 503 errors strongly indicates that the model container is running out of memory under load. When memory is insufficient, the inference process may be killed by the kernel (OOM killer) or the container may be throttled, causing sporadic failures that manifest as 503s even though the Invocations metric (request count) appears within limits. The latency spikes occur because the container struggles to allocate memory for each request, leading to timeouts or crashes.

Exam trap

The trap here is that candidates confuse 'Invocations within limits' with 'sufficient capacity,' overlooking that memory exhaustion can cause failures even when request rate is low, and they incorrectly attribute 503s solely to scaling issues (Option D) rather than resource constraints on each instance.

How to eliminate wrong answers

Option A is wrong because a bug in custom inference code would typically cause consistent errors (e.g., 500s) or incorrect predictions, not intermittent 503s with high latency variance; the 503 status specifically points to resource exhaustion or overload, not application logic bugs. Option B is wrong because a misconfigured ALB health check would cause the ALB to mark instances as unhealthy and stop routing traffic to them, resulting in persistent 503s for all requests, not intermittent errors with high latency variance; the health check failure would be visible in ALB metrics, not ModelLatency. Option D is wrong because too few instances would cause the Invocations metric to exceed the instance's capacity, leading to throttling and 503s, but the question states Invocations is within limits; increasing instance count would not fix memory exhaustion on each instance, which is the root cause.


1525

MCQ · easy

A data scientist needs to perform hyperparameter optimization for a model. Which AWS service provides built-in hyperparameter tuning jobs?

A. Amazon EMR

B. AWS Step Functions

C. Amazon SageMaker

D. AWS Batch

Answer: C

SageMaker has automatic model tuning.

Why this answer

Option C is correct because Amazon SageMaker provides built-in hyperparameter tuning (automatic model tuning). Option A is wrong because Amazon EMR is for big data processing.

Option B is wrong because AWS Step Functions orchestrates workflows but does not perform tuning. Option D is wrong because AWS Batch runs batch computing jobs and has no built-in tuning.


1526

Multi-Select · medium

A data scientist is training a model using Amazon SageMaker. The training job is running on GPU instances, but the GPU utilization is low. Which TWO actions could improve GPU utilization?

Select 2 answers

A. Increase the number of epochs

B. Use a larger instance with multiple GPUs

C. Increase the batch size

D. Switch to CPU instances

E. Decrease the batch size

Answers: B, C

More GPUs increase parallelism, and larger batches give each GPU more work per step.

Why this answer

Option C is correct because increasing batch size can better utilize GPU parallelism. Option B is correct because using a larger instance with more GPUs can improve utilization. Option E is wrong because decreasing batch size would reduce utilization.

Option D is wrong because using CPU instances would not utilize GPU. Option A is wrong because adding more epochs does not affect utilization per step.


1527

Multi-Select · medium

Which THREE actions are valid steps in exploratory data analysis when working with a new dataset? (Choose three.)

Select 3 answers

A. Check the data types of each column.

B. Generate descriptive statistics (mean, std, min, max).

C. Fit a linear regression model to identify important features.

D. Split the dataset into training and test sets.

E. Create histograms for numerical features.

Answers: A, B, E

Understanding data types is essential.

Why this answer

Options A, B, and E are correct. A: Checking data types is fundamental. B: Descriptive statistics summarize distributions.

E: Visualizing with histograms reveals patterns. C: Building a linear model is modeling, not EDA. D: Splitting into train/test is modeling, not EDA.
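The three EDA steps can be sketched with the standard library. The records are hypothetical, the helper names are illustrative, and the histogram is reduced to bin counts rather than a plot.

```python
import statistics
from collections import Counter

# Hypothetical records standing in for a freshly loaded dataset.
records = [
    {"age": 23, "city": "Austin"},
    {"age": 25, "city": "Boston"},
    {"age": 31, "city": "Austin"},
    {"age": 34, "city": "Denver"},
    {"age": 35, "city": "Boston"},
    {"age": 41, "city": "Austin"},
    {"age": 47, "city": "Denver"},
    {"age": 52, "city": "Boston"},
]


def column_types(rows):
    """Step A: data type of each column, inferred from the first row."""
    return {col: type(value).__name__ for col, value in rows[0].items()}


def describe(values):
    """Step B: descriptive statistics for a numeric column."""
    return {
        "mean": statistics.fmean(values),
        "std": statistics.stdev(values),
        "min": min(values),
        "max": max(values),
    }


def histogram(values, bin_width):
    """Step E: counts per bin, keyed by the lower edge of each bin."""
    return dict(Counter((v // bin_width) * bin_width for v in values))


ages = [row["age"] for row in records]
print(column_types(records))
print(describe(ages))
print(histogram(ages, 10))
```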


1528

Multi-Select · hard

A data scientist is analyzing a dataset with high multicollinearity. Which TWO techniques can help identify and address multicollinearity?

Select 2 answers

A. Plot a correlation matrix

B. Apply Lasso regression

C. Use Recursive Feature Elimination (RFE)

D. Use Principal Component Analysis (PCA)

E. Compute Variance Inflation Factor (VIF)

Answers: D, E

VIF quantifies multicollinearity, and PCA creates uncorrelated components.

Why this answer

Correct options: D and E. VIF (E) quantifies multicollinearity; PCA (D) creates orthogonal components. Option A is wrong because a correlation matrix only shows pairwise correlations.

Option B is wrong because Lasso does feature selection but does not identify multicollinearity. Option C is wrong because RFE is for feature selection, not multicollinearity detection.
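The variance inflation factor for predictor j is 1 / (1 − R_j²), where R_j² comes from regressing that predictor on the others. The sketch below covers only the two-predictor case, where R_j² equals the squared Pearson correlation. The data and helper names are made up for illustration.

```python
import math


def pearson_r(x, y) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def vif_two_predictors(x1, x2) -> float:
    """VIF of either predictor when the model has exactly two predictors."""
    r = pearson_r(x1, x2)
    return 1.0 / (1.0 - r * r)


# Hypothetical predictors: x2 nearly duplicates x1, x3 is mostly unrelated.
x1 = [1, 2, 3, 4, 5, 6]
x2 = [1.1, 1.9, 3.2, 3.9, 5.1, 6.0]
x3 = [3, 1, 4, 1, 5, 2]

print(f"VIF(x1, x2) = {vif_two_predictors(x1, x2):.1f}")
print(f"VIF(x1, x3) = {vif_two_predictors(x1, x3):.2f}")
```

A common rule of thumb treats a VIF above roughly 5 to 10 as a sign of problematic multicollinearity. With more than two predictors, each R_j² requires a full regression on the remaining predictors.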


1529

MCQ · medium

A data scientist is training a text classification model using Amazon SageMaker. The dataset consists of 100,000 labeled documents. The data scientist notices that the model performs well on the training set but poorly on the validation set. Which regularization technique should the data scientist apply to reduce overfitting?

A. Dropout

B. Data augmentation

C. Batch normalization

D. Early stopping

Answer: A

Dropout randomly drops units during training, preventing co-adaptation and reducing overfitting.

Why this answer

Dropout is a regularization technique that randomly drops a fraction of neurons during training, which prevents the model from relying too heavily on any single feature and forces it to learn more robust representations. This directly addresses the overfitting symptom of high training accuracy and low validation accuracy by reducing the model's capacity to memorize noise in the training data.

Exam trap

The trap here is that candidates often confuse batch normalization with regularization, but batch normalization primarily addresses internal covariate shift and training stability, not overfitting, while dropout is the explicit regularization technique for neural networks.

How to eliminate wrong answers

Option B (Data augmentation) is wrong because it is primarily used for image or audio data to artificially expand the dataset by applying transformations; for text classification, simple augmentation (e.g., synonym replacement) may not be as effective and is not a standard regularization technique for overfitting in this context. Option C (Batch normalization) is wrong because it normalizes layer inputs to stabilize and accelerate training; it can have a slight regularizing effect, but it is not the primary technique for combating overfitting. Option D (Early stopping) is wrong because, while it can prevent overfitting by halting training when validation performance plateaus, the question asks for a regularization technique, and early stopping is an optimization trick rather than a structural regularization method like dropout.

Full explanation →

1530

MCQmedium

A data scientist is using Amazon SageMaker to train a model using a built-in algorithm. The training job fails with an error indicating that the algorithm expects the data to be in recordIO-protobuf format, but the input is CSV. What is the most efficient way to resolve this?

A.Change the inference data to recordIO-protobuf format.

B.Use a boto3 script to convert the CSV files locally and upload.

C.Use a SageMaker processing job to convert the CSV data to recordIO-protobuf format.

D.Switch to a different algorithm that accepts CSV format.

AnswerC

Processing jobs can efficiently transform data into the required format.

Why this answer

Option C is correct because converting CSV to recordIO-protobuf format using a SageMaker processing job is efficient and scalable. Option D is wrong because changing the algorithm may not be desirable. Option A is wrong because the failure concerns the training input, so changing the format of the inference data does not resolve it.

Option B is wrong because converting the files locally with a boto3 script is more manual and less efficient than a processing job.

Full explanation →

1531

MCQhard

A data scientist runs a training job that fails. The CLI output is shown in the exhibit. What is the MOST likely cause of the failure?

A.The S3 bucket or prefix does not exist.

B.The channel name is misspelled.

C.The instance type ml.m5.large is too small.

D.The IAM role does not have s3:GetObject permission.

AnswerA

The error message explicitly says the S3 URI is not found.

Why this answer

The CLI output shows an error indicating that the S3 bucket or prefix does not exist. This is a common failure when the training job's input data path is incorrect, as SageMaker attempts to read from the specified S3 location and fails if the bucket or prefix is missing. The error message directly points to this issue, making it the most likely cause.

Exam trap

The trap here is that candidates may confuse S3 permission errors (403) with bucket-not-found errors (404), leading them to incorrectly select the IAM role permission option when the actual issue is a missing S3 path.

How to eliminate wrong answers

Option B is wrong because a misspelled channel name would typically result in a different error, such as 'Invalid channel name' or 'Channel not found', not an S3 access error. Option C is wrong because the instance type ml.m5.large is a valid and commonly used instance for training; if it were too small, the job would likely start but fail due to resource exhaustion, not an immediate S3-related error. Option D is wrong because an IAM role lacking s3:GetObject permission would produce an 'Access Denied' or '403 Forbidden' error, not a 'bucket or prefix does not exist' error.

Full explanation →

1532

Multi-Selecteasy

Which TWO services can be used to orchestrate a machine learning pipeline?

Select 2 answers

A.Amazon SageMaker Pipelines

B.Amazon SageMaker Ground Truth

C.AWS Step Functions

D.Amazon Redshift

E.AWS Glue

AnswersA, C

SageMaker Pipelines is designed for ML pipeline orchestration.

Why this answer

Options A and C are correct. Option B is wrong because Ground Truth is for data labeling. Option E is wrong because AWS Glue is for data transformation.

Option D is wrong because Amazon Redshift is for data warehousing.

Full explanation →

1533

MCQmedium

A company is building a data lake on Amazon S3. Data arrives from multiple sources in different formats (CSV, JSON, Parquet). The engineering team wants to query this data using Amazon Athena with minimal transformation. Which approach minimizes query cost and improves performance?

A.Use Amazon Redshift Spectrum to query the data directly without transformation

B.Use AWS Glue to convert all data to Parquet format, partition by date, and store in a separate S3 bucket

C.Use Amazon EMR to convert data to CSV format and repartition

D.Store data as-is in S3 and create external tables in Athena for each format

AnswerB

This reduces data scanned, improves performance, and lowers cost.

Why this answer

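Using Parquet with partitioning and compression reduces the amount of data scanned by Athena, lowering cost and improving performance. Athena can query the mixed formats as they are, so converting to a single format is not strictly necessary, but a columnar format like Parquet is what cuts the data scanned. The Glue conversion adds a processing step, which is the overhead accepted in exchange for cheaper and faster queries.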

Full explanation →

1534

MCQeasy

A company uses Amazon Kinesis Data Firehose to deliver streaming data to Amazon S3. The data contains personally identifiable information (PII) that must be redacted before storage. Which AWS service can be integrated with Kinesis Data Firehose to transform the data in real time?

A.Amazon Athena

B.Amazon Kinesis Data Analytics

C.Amazon EMR

D.AWS Lambda

AnswerD

Lambda can be invoked by Firehose to transform records in real time.

Why this answer

Option D is correct. AWS Lambda can be used as a transformation function within a Kinesis Data Firehose delivery stream to redact PII. Option C is wrong because Amazon EMR is for big data processing, not inline transformation.

Option B is wrong because Amazon Kinesis Data Analytics is for analyzing streams, not for transformation. Option A is wrong because Amazon Athena is a query service, not a transformation service.

Full explanation →

1535

MCQeasy

A data scientist is training a neural network on Amazon SageMaker and wants to automatically stop training if the validation loss does not improve for 5 consecutive epochs. Which feature should they use?

A.Implement early stopping in the training script

B.SageMaker Debugger

C.SageMaker Checkpointing

D.SageMaker Hyperparameter Tuning

AnswerA

Early stopping is implemented in the training code (e.g., Keras EarlyStopping callback).

Why this answer

Early stopping with a patience parameter is used to stop training when a metric stops improving. Option C is wrong because checkpointing saves model state; it does not stop training. Option D is wrong because hyperparameter tuning searches for the best hyperparameters.

Option B is wrong because Debugger monitors training, whereas early stopping is a built-in feature of the framework.
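The patience rule described above can be sketched without any framework. This is a minimal illustration, not the Keras callback itself; the class name `EarlyStopping`, the helper `epochs_run`, and the `min_delta` parameter are assumptions introduced here.

```python
class EarlyStopping:
    """Signal a stop once the loss has failed to improve for `patience` epochs."""

    def __init__(self, patience=5, min_delta=0.0):
        self.patience = patience      # consecutive non-improving epochs tolerated
        self.min_delta = min_delta    # smallest decrease that counts as improvement
        self.best = float("inf")
        self.bad_epochs = 0

    def should_stop(self, val_loss):
        if val_loss < self.best - self.min_delta:
            # New best validation loss: reset the counter.
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


def epochs_run(val_losses, patience=5):
    """Return how many epochs execute before early stopping triggers."""
    stopper = EarlyStopping(patience=patience)
    for epoch, loss in enumerate(val_losses, start=1):
        if stopper.should_stop(loss):
            return epoch
    return len(val_losses)
```

In a real training script the same check would run once per epoch on the validation loss, and the loop would break when `should_stop` returns `True`.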

Full explanation →

1536

Multi-Selectmedium

A data scientist is exploring a dataset with 50 features. Which TWO EDA techniques are most effective for detecting multicollinearity?

Select 2 answers

A.Box plots of each feature

B.Variance Inflation Factor (VIF) analysis

C.Scatter plots of each feature pair

D.Histograms of each feature

E.Correlation matrix visualized as heatmap

AnswersC, E

Scatter plots and a correlation heatmap both expose linear relationships between feature pairs.

Why this answer

Option C is correct because scatter plots of feature pairs can reveal linear relationships. Option E is correct because a correlation matrix visualized as a heatmap quantifies pairwise correlations. Option D is wrong because a histogram shows a distribution, not a relationship.

Option B is wrong because VIF is a statistical test, but the question asks for EDA techniques (visual). Option A is wrong because a box plot shows a univariate distribution.

Full explanation →

1537

MCQmedium

A data scientist is exploring a dataset with 10 million rows and 500 features. The target variable is binary. The dataset is stored in an Amazon S3 bucket. The data scientist wants to quickly identify which features have the highest correlation with the target variable. Which approach is MOST efficient?

A.Use Amazon SageMaker Data Wrangler to import the dataset from S3 and generate a correlation matrix.

B.Use Amazon QuickSight to create scatter plots for each feature vs. target.

C.Use Amazon Athena with SQL queries to compute correlation coefficients.

D.Use AWS Glue ETL to compute pairwise correlations and output to Amazon Redshift.

AnswerA

Data Wrangler provides interactive data exploration and correlation analysis.

Why this answer

Option A is correct because Amazon SageMaker Data Wrangler can connect to S3, perform correlation analysis, and generate a correlation matrix without writing code. Option D is wrong because AWS Glue ETL is for ETL pipelines, not interactive exploration. Option C is wrong because Amazon Athena is for SQL queries, not correlation analysis.

Option B is wrong because Amazon QuickSight is for visualization, not statistical correlation.

Full explanation →

1538

MCQmedium

Refer to the exhibit. A data scientist is assigned an IAM policy to deploy a SageMaker model. When the scientist tries to create an endpoint, the action fails with an authorization error. What is the missing permission?

A.iam:PassRole

B.sagemaker:ListEndpoints

C.sagemaker:InvokeEndpoint

D.sagemaker:UpdateEndpoint

AnswerA

The caller needs iam:PassRole to pass an execution role to SageMaker.

Why this answer

The policy allows creating training jobs, models, endpoint configurations, and endpoints, so `sagemaker:CreateEndpoint` and `sagemaker:CreateEndpointConfig` are already present and cannot explain the authorization error.

What the policy lacks is `iam:PassRole`. SageMaker runs under an execution role, and the caller must be allowed to pass that role to the service.

The missing permission is therefore `iam:PassRole`. Option A is correct.

Option C: `sagemaker:InvokeEndpoint` is for inference, not creation. Option D: `sagemaker:UpdateEndpoint` is for updating. Option B: `sagemaker:ListEndpoints` is for listing.
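As a rough sketch of what the missing statement could look like, the snippet below builds an `iam:PassRole` statement scoped to SageMaker. The exhibit policy is not reproduced here, and the account ID and role name are hypothetical placeholders.

```python
import json

# Hypothetical execution role ARN, for illustration only.
EXECUTION_ROLE_ARN = "arn:aws:iam::123456789012:role/ExampleSageMakerExecutionRole"

# Statement that lets the caller hand the execution role to SageMaker.
pass_role_statement = {
    "Effect": "Allow",
    "Action": "iam:PassRole",
    "Resource": EXECUTION_ROLE_ARN,
    "Condition": {
        # Restrict the role so it can only be passed to the SageMaker service.
        "StringEquals": {"iam:PassedToService": "sagemaker.amazonaws.com"}
    },
}

policy = {"Version": "2012-10-17", "Statement": [pass_role_statement]}
policy_json = json.dumps(policy, indent=2)
print(policy_json)
```

Scoping `Resource` to a single role, rather than `*`, keeps the caller from passing more privileged roles to the service.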

Full explanation →

1539

MCQmedium

A data scientist is using SageMaker to train a deep learning model with a large dataset stored in S3. The training is taking a long time. Which action would most likely reduce training time without sacrificing accuracy?

A.Increase the batch size

B.Use SageMaker Pipe Input mode

C.Use a smaller instance type

D.Reduce the number of epochs

AnswerB

Streams data from S3 directly to the algorithm, reducing I/O time.

Why this answer

SageMaker Pipe Input mode streams training data directly from S3 into the algorithm without first downloading it to the local EBS volume. This eliminates the I/O bottleneck caused by large dataset downloads, significantly reducing training time while preserving accuracy because the model sees the same data.

Exam trap

The trap here is that candidates confuse batch size adjustments (which affect convergence stability) with I/O optimization techniques, overlooking that SageMaker Pipe mode directly addresses the data loading bottleneck without altering the training algorithm.

How to eliminate wrong answers

Option A is wrong because increasing the batch size can reduce training time per epoch but may degrade model accuracy due to convergence to sharper minima or increased generalization error, especially in deep learning. Option C is wrong because using a smaller instance type reduces computational capacity, increasing training time rather than decreasing it. Option D is wrong because reducing the number of epochs directly reduces training time but sacrifices accuracy by underfitting the model.

Full explanation →

1540

MCQhard

A data engineer is configuring an IAM policy to allow users to upload objects to an S3 bucket only if the objects are encrypted using SSE-S3. However, users are getting AccessDenied errors when uploading objects without specifying encryption. What is the most likely cause?

A.The condition should check for aws:SourceIp instead of encryption

B.The condition requires encryption to be specified, but the upload does not specify it

C.The policy is attached to the wrong IAM user

D.The bucket policy denies all PutObject without encryption

AnswerB

The condition requires s3:x-amz-server-side-encryption to be AES256, so without it, access is denied.

Why this answer

Option B is correct because the policy allows PutObject only when the encryption header is set to AES256; an upload that specifies no encryption does not satisfy the condition, so it is denied. Option A is wrong because aws:SourceIp restricts requests by IP address and has nothing to do with encryption. Option C is wrong because nothing in the scenario suggests that the policy is attached to the wrong user. Option D is wrong because the scenario describes an IAM policy; no bucket policy is shown.
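The behaviour can be pictured with a toy model of the condition. The sketch below is not the real IAM evaluator; it assumes the policy contains a single `StringEquals` condition on `s3:x-amz-server-side-encryption`, and `put_object_allowed` is a hypothetical helper.

```python
# Condition assumed to be in the IAM policy from the question.
REQUIRED_CONDITION = {
    "StringEquals": {"s3:x-amz-server-side-encryption": "AES256"}
}


def put_object_allowed(request_headers):
    """Toy check of one StringEquals condition; not the real IAM evaluator."""
    required = REQUIRED_CONDITION["StringEquals"]["s3:x-amz-server-side-encryption"]
    # If the header is missing, the condition key is absent and StringEquals fails,
    # so the Allow statement does not apply and the request is implicitly denied.
    return request_headers.get("x-amz-server-side-encryption") == required


print(put_object_allowed({"x-amz-server-side-encryption": "AES256"}))
print(put_object_allowed({}))
```

Uploads succeed once the client sends the header explicitly, for example by requesting SSE-S3 on each PutObject call.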

Full explanation →

1541

MCQhard

A data scientist is using Amazon SageMaker to train a large language model with PyTorch. The training job is taking too long. The dataset is stored in S3 and the training script uses the SageMaker PyTorch container. Which change is MOST likely to reduce training time?

A.Use Pipe mode to stream data from S3 instead of downloading.

B.Increase the number of instances in the training job.

C.Change the optimizer to AdamW.

D.Switch to spot instances to reduce cost.

AnswerA

Pipe mode reduces data loading time.

Why this answer

Option A is correct because SageMaker Pipe mode streams data directly from S3 to the training algorithm via a Unix FIFO (named pipe), eliminating the need to first download the entire dataset to the training instance's local storage. This reduces I/O wait time and disk usage, which is especially beneficial for large language models where dataset sizes can be in terabytes, thereby significantly cutting total training time.

Exam trap

The trap here is that candidates often confuse cost-saving measures (spot instances) or model-tuning changes (AdamW) with performance improvements, while the actual bottleneck in large-scale training is frequently data I/O, not compute or optimizer choice.

How to eliminate wrong answers

Option B is wrong because simply increasing the number of instances does not address the root cause of slow data loading; it may even introduce communication overhead and increase costs without proportional speedup if the bottleneck is I/O. Option C is wrong because changing the optimizer to AdamW affects convergence behavior and model accuracy, not the data ingestion speed or training job duration directly. Option D is wrong because switching to spot instances reduces cost but does not reduce training time; spot instances can actually increase training time if they are interrupted and require checkpointing and resumption.

Full explanation →

1542

MCQmedium

An ML team deploys a real-time inference endpoint on Amazon SageMaker. Users report high latency. The model is a PyTorch model using a custom container. Which combination of changes should the team implement to reduce latency? (Choose the best answer.)

A.Switch to asynchronous inference endpoint.

B.Use SageMaker Elastic Inference to attach an accelerator.

C.Compile the model using SageMaker Neo.

D.Use SageMaker Inference Recommender to benchmark different instance families and select the best.

AnswerD

Inference Recommender automates benchmarking to find the optimal configuration for low latency.

Why this answer

Using SageMaker Inference Recommender (Option D) benchmarks multiple endpoint configurations to find the optimal instance type and model compilation settings. Option C (compilation with Neo only) helps but may not find the best instance. Option B (Elastic Inference) is deprecated.

Option A (Asynchronous Inference) is for near-real-time workloads, not true low-latency inference.

Full explanation →

1543

MCQmedium

A data pipeline uses Amazon Kinesis Data Streams with a Lambda consumer to process clickstream data. The Lambda function sometimes times out because of spikes in traffic. The team wants to buffer the data before processing to handle spikes. Which approach is most effective?

A.Send the data to an Amazon SQS queue and have Lambda poll from it

B.Increase the number of shards in the Kinesis stream

C.Use Kinesis Data Firehose as an intermediary that buffers data and delivers to Lambda in batches

D.Increase the Lambda function timeout to 15 minutes

AnswerC

Firehose can buffer incoming data and invoke Lambda with larger batches, smoothing spikes.

Why this answer

Using Kinesis Data Firehose between the stream and Lambda buffers data and can deliver it to Lambda in batches, reducing timeouts. Option D (increasing the Lambda timeout) may help but is not a buffer. Option B (increasing shards) increases throughput but does not buffer.

Option A (using SQS) adds complexity and delay.

Full explanation →

1544

MCQhard

A company is using Amazon SageMaker to train a large language model with billions of parameters. The training job uses multiple GPU instances in a distributed fashion. The training is converging but the loss is not decreasing as expected. The data scientist suspects that the learning rate is too high. Which technique should the data scientist use to automatically adjust the learning rate during training?

A.Use a fixed learning rate and train for more epochs

B.Increase the batch size to reduce variance

C.Implement learning rate scheduling with a cosine annealing schedule

D.Use gradient clipping

AnswerC

Cosine annealing reduces the learning rate smoothly, helping convergence.

Why this answer

Learning rate scheduling, such as a cosine annealing schedule, can automatically reduce the learning rate over time. This helps the model converge better. SageMaker's built-in algorithms support learning rate scheduling, or the user can implement it in custom training scripts.
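For reference, a cosine annealing schedule can be written in a few lines. This is a minimal sketch of the standard formula, lr = lr_min + 0.5 * (lr_max - lr_min) * (1 + cos(pi * step / total_steps)); the function name and default values are assumptions, and a custom training script would normally use the scheduler provided by its framework.

```python
import math


def cosine_annealing_lr(step, total_steps, lr_max, lr_min=0.0):
    """Learning rate at `step`, decaying from lr_max to lr_min along a half cosine."""
    step = min(max(step, 0), total_steps)  # clamp to the schedule range
    cosine = 0.5 * (1.0 + math.cos(math.pi * step / total_steps))
    return lr_min + (lr_max - lr_min) * cosine


# Print the schedule at a few points of a 100-step run.
for s in (0, 25, 50, 75, 100):
    print(s, cosine_annealing_lr(s, total_steps=100, lr_max=1e-3))
```

The rate starts at `lr_max`, passes through the midpoint halfway through training, and ends at `lr_min`, which gives large steps early and fine adjustments late.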

Full explanation →

1545

MCQmedium

A data scientist is working with a dataset that contains a 'Price' column. After plotting a histogram, they observe that the distribution is right-skewed with many extreme high values. They plan to use a linear model that assumes normally distributed errors. Which of the following transformations should they apply to the 'Price' column to make it more normally distributed?

A.Apply log transformation (log(Price)).

B.Apply square transformation (Price^2).

C.Apply min-max scaling to the 'Price' column.

D.Bin the 'Price' values into equal-width intervals.

AnswerA

Log transformation compresses the tail and makes the distribution more symmetric.

Why this answer

Option A is correct because log transformation is commonly used for right-skewed data to reduce skewness. Option C is wrong because min-max scaling does not change the distribution shape. Option B is wrong because square transformation increases skewness.

Option D is wrong because binning loses information.
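The effect of the log transformation can be checked on a small sample. The prices below are made-up values chosen to be right-skewed, and `skewness` is a hypothetical helper computing the population skewness (mean cubed z-score).

```python
import math
import statistics


def skewness(values):
    """Population skewness: the mean of the cubed z-scores."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)


# Made-up right-skewed prices with one extreme high value.
prices = [10, 12, 15, 18, 20, 25, 30, 50, 120, 900]
log_prices = [math.log(p) for p in prices]

raw_skew = skewness(prices)
log_skew = skewness(log_prices)
print(f"skewness before: {raw_skew:.2f}, after log: {log_skew:.2f}")
```

The log compresses the long right tail, so the skewness drops. If the column can contain zeros, `math.log1p` is a common substitute.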

Full explanation →

1546

Multi-Selecthard

A company uses AWS Glue Data Catalog to manage metadata for its data lake on Amazon S3. The data lake contains terabytes of data in CSV format. The data engineering team wants to improve query performance in Amazon Athena and reduce costs. Which actions should the team take? (Select THREE.)

Select 3 answers

A.Create views in Athena to simplify queries.

B.Compress the data using Snappy or GZIP.

C.Partition the data by commonly filtered columns.

D.Convert the data to Parquet format.

E.Convert the data to JSON format.

AnswersB, C, D

Compression reduces storage and data scanned.

Why this answer

Option D is correct because converting to columnar formats like Parquet reduces data scanned and improves performance. Option C is correct because partitioning limits the amount of data scanned per query. Option B is correct because compressing data reduces storage cost and data scanned.

Option A is wrong because views simplify queries but do not directly address performance or cost for Athena. Option E is wrong because JSON, like CSV, is a row-based text format and is not efficient to scan.

Full explanation →

1547

MCQeasy

A company is building a recommendation system using collaborative filtering. The dataset contains implicit feedback (clicks) from users on items. Which algorithm is best suited for this scenario?

A.Linear Regression

B.Alternating Least Squares (ALS)

C.K-means clustering

D.Singular Value Decomposition (SVD)

AnswerB

ALS is designed for implicit feedback and scales well.

Why this answer

Alternating Least Squares (ALS) is designed for implicit feedback datasets in collaborative filtering. Option D is wrong because SVD is for explicit ratings. Option C is wrong because K-means is clustering, not recommendation.

Option A is wrong because Linear Regression is for supervised regression, not recommendation.

Full explanation →

1548

MCQhard

A data scientist is analyzing a dataset with a target variable that is highly imbalanced (99% negative class, 1% positive class). The dataset has 10 million rows. The goal is to train a binary classifier. Which technique should be applied during exploratory data analysis to best address the imbalance?

A.Assign higher class weights to the minority class

B.Random undersampling of the majority class

C.Synthetic Minority Oversampling Technique (SMOTE)

D.Collect more data for the minority class

AnswerB

Feasible for large datasets and can balance classes.

Why this answer

For large datasets, undersampling the majority class is feasible and can be effective. Option C is wrong because SMOTE generates synthetic samples but may be computationally expensive for 10M rows. Option A is wrong because class weights are set during training, not EDA.

Option D is wrong because collecting more data is not guaranteed to fix imbalance.
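Random undersampling itself is simple to sketch. The example below assumes rows are dictionaries with a binary `label` key and uses a scaled-down 10,000-row dataset with the same 99:1 ratio as the question; `undersample_majority` is a hypothetical helper.

```python
import random


def undersample_majority(rows, label_key="label", seed=0):
    """Keep every minority row and an equally sized random sample of the majority."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    positives = [r for r in rows if r[label_key] == 1]
    negatives = [r for r in rows if r[label_key] == 0]
    minority, majority = sorted([positives, negatives], key=len)
    kept = rng.sample(majority, k=len(minority))
    balanced = minority + kept
    rng.shuffle(balanced)
    return balanced


# Scaled-down stand-in for the dataset: 1% positive, 99% negative.
rows = [{"id": i, "label": 1 if i % 100 == 0 else 0} for i in range(10_000)]
balanced = undersample_majority(rows)
print(len(balanced), sum(r["label"] for r in balanced))
```

A 1:1 ratio discards most of the majority class; a milder target ratio is a common compromise when that loses too much information.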

Full explanation →

1549

Multi-Selecteasy

A company wants to analyze streaming data from IoT devices in near-real-time. They need to store raw data in Amazon S3 and also run SQL queries on the streaming data. Which TWO services should they use?

Select 2 answers

A.AWS Glue

B.Amazon Kinesis Data Streams

C.Amazon Kinesis Data Analytics

D.Amazon Kinesis Data Firehose

E.AWS Lambda

AnswersC, D

Runs SQL on streaming data.

Why this answer

Amazon Kinesis Data Firehose can deliver streaming data to S3. Amazon Kinesis Data Analytics can run SQL queries on the stream. Kinesis Data Streams is for ingestion but requires custom consumers.

Lambda can process streams but not SQL. Glue is for batch ETL.

Full explanation →

1550

MCQmedium

A data scientist is working with a dataset that includes a 'timestamp' column. They want to create features that capture seasonality. Which feature engineering approach is most appropriate?

A.Bin timestamps into fixed intervals.

B.Convert timestamp to Unix epoch seconds.

C.Extract hour of day and apply sine/cosine transformation.

D.One-hot encode the timestamp column.

AnswerC

Sine/cosine encoding preserves cyclic nature.

Why this answer

Option C is correct because sine and cosine transformations capture cyclic patterns like time of day or day of year. Option D is wrong because one-hot encoding creates many sparse features. Option B is wrong because converting to a Unix timestamp loses the cyclic nature.

Option A is wrong because binning loses granularity.
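A short sketch shows why the sine/cosine pair preserves the cycle: hour 23 ends up as close to hour 0 as hour 1 is. The helper names `encode_hour` and `distance` are assumptions for illustration.

```python
import math


def encode_hour(hour):
    """Map an hour of day (0-23) onto the unit circle as (sin, cos)."""
    angle = 2.0 * math.pi * (hour % 24) / 24.0
    return math.sin(angle), math.cos(angle)


def distance(hour_a, hour_b):
    """Euclidean distance between two encoded hours."""
    return math.dist(encode_hour(hour_a), encode_hour(hour_b))


print(distance(23, 0), distance(0, 1), distance(0, 12))
```

With the raw hour value, 23 and 0 would be 23 units apart; on the circle they are neighbours, and midnight and noon are the farthest pair.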

Full explanation →

1551

MCQhard

A retail company runs an e-commerce platform on AWS. They have a Data Engineering team that processes clickstream data using Amazon Kinesis Data Streams (KDS) with a shard count of 5. The data is consumed by an AWS Lambda function that transforms and loads the data into an Amazon S3 bucket partitioned by year/month/day/hour. Recently, the team has noticed that the Lambda function is experiencing throttling errors, and the KDS shard iterator age is increasing, indicating that the consumer cannot keep up with the incoming data rate. The team has already increased the Lambda reserved concurrency to 1000 and enabled batch window of 60 seconds. The metrics show that the Lambda function duration is well under the 5-minute timeout, and there are no errors in the transformation logic. The S3 write operations are not failing. Which course of action would MOST effectively resolve the issue without unnecessary cost or complexity?

A.Increase the number of shards in the Kinesis Data Stream to 20 to increase the parallelism of Lambda consumers.

B.Increase the Lambda reserved concurrency to 5000 to allow more parallel executions.

C.Increase the batch window to 300 seconds to accumulate more records per invocation and reduce the number of calls.

D.Switch to using Amazon Kinesis Data Analytics with a larger instance type to process the stream.

AnswerA

More shards allow more concurrent Lambda invocations, improving throughput and reducing iterator age.

Why this answer

The core issue is that the Lambda consumer cannot keep up with the incoming data rate, as evidenced by the increasing shard iterator age. Increasing the shard count from 5 to 20 raises both the ingest capacity of the stream and the number of concurrent Lambda invocations (one per shard by default). This addresses the bottleneck at the source without adding unnecessary complexity: Lambda concurrency is already set to 1000, and KDS pricing is based on shard hours.

Exam trap

The trap here is that candidates often assume increasing Lambda concurrency or batch window will solve throughput issues, but they fail to recognize that Kinesis shard count is the fundamental limiter of parallelism in the Lambda-Kinesis integration.

How to eliminate wrong answers

Option B is wrong because increasing Lambda reserved concurrency to 5000 does not help when the bottleneck is the number of Kinesis shards; Lambda can only process one shard per concurrent invocation, and with only 5 shards, the maximum parallelism is 5, so additional concurrency is unused. Option C is wrong because increasing the batch window to 300 seconds would increase latency and could cause the shard iterator age to grow further, as records would accumulate longer before being processed, worsening the backlog. Option D is wrong because switching to Kinesis Data Analytics introduces a different service (meant for real-time analytics with SQL or Flink) that adds complexity and cost, and does not directly address the consumer throughput limitation caused by insufficient shard parallelism.

Full explanation →

1552

MCQeasy

A company wants to track and compare metrics from multiple machine learning experiments. Which Amazon SageMaker feature should be used?

A.SageMaker Experiments

B.SageMaker Ground Truth

C.SageMaker Model Monitor

D.SageMaker Debugger

AnswerA

Specifically designed for experiment tracking and comparison.

Why this answer

SageMaker Experiments helps track and compare metrics. Option A is correct. Option C is wrong because Model Monitor is for monitoring drift.

Option D is wrong because Debugger is for debugging training. Option B is wrong because Ground Truth is for labeling.

Full explanation →

1553

MCQhard

A data scientist is tuning a gradient boosting model using Amazon SageMaker Automatic Model Tuning. The objective metric is AUC. The training job converges quickly but the final model has low AUC on the validation set. Which hyperparameter should the data scientist adjust to improve validation AUC?

A.Increase the subsample ratio of training data

B.Decrease the learning rate and increase the number of rounds

C.Increase the learning rate

D.Increase the maximum depth of trees

AnswerB

Lower learning rate with more rounds typically improves generalization and AUC.

Why this answer

Decreasing the learning rate and increasing the number of rounds is the correct approach because a low learning rate forces the model to take smaller steps toward the optimum, reducing overfitting and allowing more trees to contribute to the ensemble. This combination often improves generalization and validation AUC when the training job converges too quickly, indicating that the model is overfitting or underfitting due to aggressive learning.

Exam trap

The trap here is that candidates mistakenly think increasing the learning rate will speed up convergence and improve AUC, but in reality it causes overfitting when the model already converges quickly, while decreasing the learning rate with more rounds is the standard remedy for underfitting or overfitting in gradient boosting.

How to eliminate wrong answers

Option A is wrong because increasing the subsample ratio (e.g., from 0.8 to 1.0) actually uses more training data per iteration, which can increase variance and overfitting, not improve validation AUC when the model already converges quickly. Option C is wrong because increasing the learning rate makes the model converge even faster, exacerbating overfitting and further reducing validation AUC. Option D is wrong because increasing the maximum depth of trees makes each tree more complex and prone to overfitting, which typically degrades validation AUC when the model already converges quickly.

Full explanation →

1554

MCQhard

A company uses Amazon SageMaker to train a model using the built-in Linear Learner algorithm. The training data contains missing values in some features. What is the best practice for handling missing values with this algorithm?

A.Remove rows with missing values

B.Impute missing values using mean or median imputation

C.Set missing values to zero

D.Use the `handle_missing` parameter in the algorithm

AnswerB

Imputation is a standard preprocessing step for handling missing values.

Why this answer

Linear Learner expects dense input; it cannot handle missing values. The best practice is to impute missing values before training, for example with mean or median imputation, so Option B is correct.

Option C is wrong because setting missing values to zero could bias the model. Option A is wrong because removing rows with missing values may lose data.

Option D is wrong because Linear Learner has no built-in parameter for handling missing values.
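Imputation as a preprocessing step can be sketched with the standard library alone. The example assumes missing values are represented as `None` in a single numeric column; `impute_column` is a hypothetical helper, and in practice the fill value should be computed on the training split only and reused for validation and inference data.

```python
import statistics


def impute_column(values, strategy="median"):
    """Replace None entries with the median (default) or mean of the observed values."""
    observed = [v for v in values if v is not None]
    if strategy == "median":
        fill = statistics.median(observed)
    else:
        fill = statistics.fmean(observed)
    return [fill if v is None else v for v in values]


print(impute_column([1.0, None, 3.0, 100.0]))          # median is robust to the outlier
print(impute_column([1.0, None, 3.0], strategy="mean"))
```

The median is usually the safer choice when a feature has outliers, since a single extreme value can pull the mean far from typical observations.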

Full explanation →

1555

MCQhard

A company uses AWS Lake Formation to manage permissions on a data lake stored in Amazon S3. A data analyst tries to query a table using Amazon Athena but receives an 'Access Denied' error. The analyst has SELECT permission on the table in Lake Formation. What is the most likely cause?

A.The S3 bucket is not registered with Lake Formation

B.The S3 bucket is encrypted with a KMS key that the analyst does not have access to

C.The table does not have any partitions defined

D.The IAM role used by Athena does not have lakeformation:GetDataAccess permission

AnswerA

If the bucket is not registered, Lake Formation cannot control access, and the default S3 permissions apply, which may deny access.

Why this answer

Lake Formation requires that the underlying S3 bucket be registered with Lake Formation and that the IAM role used by Athena has Lake Formation permissions. If the S3 bucket is not registered, Lake Formation cannot grant access to it. Option D (missing lakeformation:GetDataAccess permission) could also be a cause, but the most common issue is that the bucket is not registered.

Option B (KMS key) is not likely. Option C (no partitions) would not cause Access Denied.

Full explanation →

1556

MCQeasy

A data engineering team needs to set up a data pipeline that ingests streaming data from an Apache Kafka cluster running on Amazon EKS into an S3 data lake. The data must be stored in Parquet format, partitioned by date and event type. The team wants a fully managed solution with minimal operational overhead. Which solution should they choose?

A.Use Amazon MSK (Managed Streaming for Apache Kafka) and configure an MSK Connect S3 sink connector.

B.Set up a Kinesis Data Firehose delivery stream that reads from Kafka and writes to S3.

C.Use AWS Glue ETL jobs to pull data from the Kafka cluster periodically.

D.Create a Kinesis Data Analytics application to read from Kafka and write to S3.

AnswerA

MSK is fully managed Kafka, and MSK Connect can stream data to S3 in Parquet format.

Why this answer

Option A is correct because Amazon MSK (Managed Streaming for Apache Kafka) is a fully managed Kafka service, and MSK Connect with an S3 sink connector can deliver data directly to S3 in Parquet format. Option D (Kinesis Data Analytics) is for real-time analytics, not for data lake ingestion. Option B (Kinesis Data Firehose) works with Kinesis streams, not Kafka directly.

Option C (Glue ETL) is batch-oriented and adds latency.

Full explanation →

1557

Multi-Selectmedium

A data scientist is exploring a dataset with many features and suspects that some features are highly correlated. Which TWO methods can the scientist use to detect and handle multicollinearity before building a linear regression model?

Select 2 answers

A.Apply Principal Component Analysis (PCA) and use all components.

B.Standardize all features to have zero mean and unit variance.

C.Compute Variance Inflation Factor (VIF) for each feature and remove features with VIF > 10.

D.Use stepwise feature selection.

E.Use Ridge regression (L2 regularization) to shrink coefficients.

AnswersC, E

VIF detects multicollinearity; removing high VIF features reduces it.

Why this answer

Options C and E are correct. VIF is a standard measure for detecting multicollinearity; removing features with high VIF reduces multicollinearity. Ridge regression (L2 regularization) can handle multicollinearity by penalizing large coefficients.

Option A is wrong because PCA reduces dimensionality but makes models less interpretable and does not directly handle multicollinearity in the original features. Option D is wrong because stepwise selection does not specifically address multicollinearity. Option B is wrong because standardization does not affect collinearity.
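As a minimal sketch of the VIF check, assume a model with exactly two predictors, in which case each feature's VIF reduces to 1 / (1 - r^2), where r is the Pearson correlation between them. The feature values and helper names below are hypothetical, chosen only to contrast a collinear pair with a weakly related pair.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def vif_two_features(x, y):
    """VIF of either feature when the model has exactly two predictors."""
    r = pearson(x, y)
    return 1.0 / (1.0 - r ** 2)

# Hypothetical data: floor area in square feet, roughly the same
# quantity in square metres, and an only loosely related bedroom count.
sqft = [800, 1000, 1200, 1500, 2000]
sqm = [74, 93, 112, 139, 186]
bedrooms = [2, 1, 3, 2, 3]

vif_collinear = vif_two_features(sqft, sqm)         # far above the VIF > 10 cutoff
vif_independent = vif_two_features(sqft, bedrooms)  # well below the cutoff
```

With more than two predictors, each feature's VIF comes from regressing that feature on all the others, so a library routine is the usual choice rather than this shortcut.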

Full explanation →

1558

Multi-Selectmedium

A data engineering team is designing a data pipeline to process streaming data from social media feeds. The data must be deduplicated, enriched with customer information from a relational database, and stored in Amazon S3 in Parquet format. Which AWS services should the team use to build this pipeline? (Select TWO.)

Select 2 answers

A.AWS Glue

B.Amazon Kinesis Data Firehose

C.Amazon Athena

D.Amazon SageMaker

E.Amazon Kinesis Data Streams

AnswersA, E

Glue ETL can transform and enrich data from streams and databases.

Why this answer

Option E is correct because Kinesis Data Streams ingests streaming social media data. Option A is correct because AWS Glue ETL jobs can read from the stream, perform deduplication and enrichment using JDBC connections to the relational database, and write to S3 in Parquet. Option B is wrong because Kinesis Data Firehose does not support enrichment with a relational database.

Option C is wrong because Athena is a query engine, not an ETL tool. Option D is wrong because SageMaker is for ML, not data pipelines.

Full explanation →

1559

MCQeasy

An ML engineer is using Amazon SageMaker to train a model on a dataset that contains personally identifiable information (PII). The data must be encrypted at rest and in transit. The company uses AWS KMS for key management. How should the engineer configure the SageMaker training job to meet these encryption requirements?

A.Enable S3 Server-Side Encryption (SSE-S3) on the input data bucket

B.Use a custom Docker image with built-in encryption and disable inter-container traffic encryption for performance

C.Use a VPC with an S3 VPC Endpoint and enable SSL for the endpoint

D.Specify a KMS key for the training job's VolumeKmsKeyId and enable inter-container traffic encryption

AnswerD

This encrypts the ML storage volume and inter-container traffic.

Why this answer

Option D is correct because SageMaker supports KMS encryption for the ML storage volume (at rest) and encryption of inter-container traffic (in transit). Option A is wrong because S3 Server-Side Encryption only covers data at rest in S3, not during training. Option C is wrong because SSL does not encrypt the storage volume.

Option B is wrong because disabling inter-container encryption is not secure.
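The relevant settings can be sketched as a fragment of a CreateTrainingJob request. The field names follow the SageMaker API; the key ARN, bucket, instance type, and sizes are placeholders, and a real request needs further required fields (algorithm, role, input channels) that are omitted here.

```python
# Placeholder KMS key ARN; substitute a key the training role may use.
kms_key_arn = "arn:aws:kms:us-east-1:111122223333:key/EXAMPLE-KEY-ID"

training_job_encryption = {
    "ResourceConfig": {
        "InstanceType": "ml.m5.xlarge",   # placeholder instance type
        "InstanceCount": 2,               # more than one instance, so inter-container traffic exists
        "VolumeSizeInGB": 50,
        "VolumeKmsKeyId": kms_key_arn,    # encrypts the attached ML storage volume at rest
    },
    "OutputDataConfig": {
        "S3OutputPath": "s3://example-bucket/output/",  # placeholder bucket
        "KmsKeyId": kms_key_arn,          # encrypts model artifacts written to S3
    },
    # Encrypts traffic between training containers in a distributed job.
    "EnableInterContainerTrafficEncryption": True,
}
```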

Full explanation →

1560

MCQhard

A team is training a neural network for image classification using Amazon SageMaker. The training loss decreases rapidly but the validation loss starts increasing after a few epochs. Which action should the team take?

A.Reduce the batch size

B.Add more convolutional layers

C.Increase the learning rate

D.Implement early stopping based on validation loss

AnswerD

Early stopping prevents overfitting.

Why this answer

Option D is correct because early stopping prevents overfitting by halting training once validation loss stops improving. Option C is wrong because increasing the learning rate does not address overfitting and may make it worse. Option B is wrong because adding more layers increases model complexity.

Option A is wrong because reducing batch size may slow training but does not prevent overfitting.
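A minimal sketch of patience-based early stopping, assuming validation loss is recorded once per epoch. The function name, the loss values, and the patience setting are illustrative, not taken from any particular framework.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return (stop_epoch, best_epoch) as 0-based epoch indices.

    Training stops once validation loss has failed to improve for
    `patience` consecutive epochs; the best epoch is the checkpoint
    one would restore.
    """
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch
    # Never triggered: training ran through every recorded epoch.
    return len(val_losses) - 1, best_epoch

# Hypothetical validation losses: improving for three epochs, then rising.
val_losses = [0.90, 0.70, 0.62, 0.66, 0.71, 0.80]
stop_epoch, best_epoch = early_stopping_epoch(val_losses, patience=2)
```

Here training halts at epoch 4 and the weights from epoch 2 are the ones to keep.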

Full explanation →

1561

Multi-Selectmedium

A company is using SageMaker to deploy a model for real-time inference. The model requires GPU for low latency. Which THREE configurations should the company consider for high availability and cost optimization? (Choose three.)

Select 3 answers

A.Use Spot instances for the endpoint.

B.Use a multi-model endpoint to share GPU instances among multiple models.

C.Use SageMaker Batch Transform for inference.

D.Use multiple production variants with different instance types.

E.Enable automatic scaling based on invocation count.

AnswersB, D, E

Increases GPU utilization and reduces cost.

Why this answer

Option B is correct because a multi-model endpoint allows multiple models to be hosted on the same GPU-backed instance, sharing the GPU resources and reducing idle time. This improves cost efficiency by maximizing GPU utilization while still providing low-latency inference for each model. It is a recommended pattern for serving many models with GPU requirements without provisioning separate endpoints.

Exam trap

The trap here is that candidates often confuse high availability with cost optimization, incorrectly assuming Spot instances (Option A) are suitable for real-time inference despite their interruption risk, or they overlook multi-model endpoints as a GPU-sharing strategy.

Full explanation →

1562

MCQmedium

A machine learning engineer trains a binary classifier on an imbalanced dataset where the positive class represents 1% of the data. After training, the model achieves 99% accuracy but only 10% recall on the positive class. Which metric should the engineer focus on to evaluate the model's performance on the minority class?

A.F1 score

B.Accuracy

C.AUC-ROC

D.Precision

AnswerA

F1 score considers both precision and recall, giving a better measure for imbalanced data.

Why this answer

Option A is correct because the F1 score balances precision and recall, which is suitable for imbalanced datasets. Option B is wrong because accuracy can be misleading with imbalance. Option D is wrong because precision alone ignores recall.

Option C is wrong because AUC-ROC may still be high even with poor recall.
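The gap between accuracy and F1 can be checked with a small worked example. The confusion-matrix counts below are hypothetical, chosen to match the scenario's 1% positive rate, 99% accuracy, and 10% recall; the false-positive count in particular is an assumption.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# 10,000 samples, 100 positives (1%); the model finds only 10 of them.
metrics = classification_metrics(tp=10, fp=10, fn=90, tn=9890)
```

Under these assumed counts accuracy is 0.99 while F1 is about 0.17, which makes the weak minority-class performance visible.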

Full explanation →

1563

MCQmedium

A data scientist is training a deep learning model using a large dataset stored in S3. The training job runs on a SageMaker training instance with a GPU. The data engineer notices that the GPU utilization is low, and the training is I/O bound. The data is read directly from S3 using the SageMaker SDK. Which change should the data engineer recommend to improve GPU utilization?

A.Increase the batch size in the training script to process more data per step.

B.Mount the S3 bucket to the training instance using Amazon Elastic File System (EFS).

C.Use SageMaker Pipe mode to stream data directly from S3 to the training container.

D.Copy the entire dataset to an Amazon EBS volume attached to the training instance.

AnswerC

Pipe mode streams data from S3 straight into the training container instead of first downloading it to disk, which removes the disk I/O bottleneck.

Why this answer

Option C is correct because enabling SageMaker Pipe mode streams data directly from S3 to the training container without writing to disk, reducing I/O bottlenecks and improving GPU utilization. Option A (increasing batch size) may cause memory issues. Option B (using EFS) adds network latency.

Option D (using EBS) still involves disk I/O.

Full explanation →

1564

MCQeasy

A data scientist is training a Random Forest model on Amazon SageMaker. The model performs well on the training set but poorly on the test set. Which technique should the data scientist use to address this issue?

A.Increase the number of trees in the forest

B.Decrease the maximum depth of each tree

C.Increase the learning rate

D.Increase the maximum depth of each tree

AnswerB

Shallow trees generalize better, reducing overfitting.

Why this answer

The model is overfitting, as indicated by high training performance and poor test performance. Decreasing the maximum depth of each tree limits the complexity of individual trees, reducing overfitting by preventing them from memorizing noise in the training data. This is a standard regularization technique for Random Forest models in Amazon SageMaker.

Exam trap

AWS often tests the misconception that increasing model complexity (e.g., more trees or deeper trees) always improves performance, when in fact overfitting requires reducing complexity or applying regularization.

How to eliminate wrong answers

Option A is wrong because increasing the number of trees in the forest generally improves model stability and reduces variance without significantly increasing overfitting, but it does not address the root cause of overfitting from overly deep trees. Option C is wrong because learning rate is a hyperparameter for gradient boosting models, not for Random Forest; Random Forest does not use a learning rate. Option D is wrong because increasing the maximum depth of each tree would exacerbate overfitting by allowing trees to capture more noise and specific patterns in the training data, worsening test performance.

Full explanation →

1565

MCQhard

A company has an AWS Glue ETL job that reads data from an Amazon RDS for MySQL table and writes to Amazon S3 in Parquet format. The job runs daily and processes 500 GB of data. Recently, the job has been failing with memory errors during the write phase. The data schema is wide (200 columns). Which change should a data engineer make to the Glue job to resolve the memory issue?

A.Increase the number of DPUs for the Glue job.

B.Change the output format from Parquet to CSV.

C.Use the JDBC connection with fetchSize parameter.

D.Configure the write operation with 'groupSize' to limit records per file.

AnswerD

Limiting records per file reduces the memory needed for buffering during writes.

Why this answer

Option D is correct. Using the 'groupSize' or 'maxRecordsPerFile' option in Glue's DynamicFrame writer can control the number of records per Parquet file, reducing memory pressure. Option A is wrong because increasing DPUs may help but is a costlier solution.

Option C is wrong because the JDBC fetchSize parameter affects reading, not writing. Option B is wrong because CSV is less efficient and does not address the memory issue.

Full explanation →

1566

MCQhard

A company is building a real-time fraud detection system using Amazon SageMaker. The model is a gradient boosting classifier trained on 500 GB of transactional data. The inference endpoint is deployed as a SageMaker real-time endpoint using an ml.c5.9xlarge instance. The model is serialized using the native format of the framework (XGBoost). The endpoint receives about 100 requests per second with an average payload size of 10 KB. The company observes that the endpoint's latency is around 200 ms, but they need under 100 ms. The data scientist profiles the endpoint and finds that the model inference time is 50 ms, but the remaining time is spent on data preprocessing and serialization/deserialization. The preprocessing involves converting JSON input to a NumPy array and then to a DMatrix. Which action is most likely to reduce latency to meet the requirement?

A.Use a more efficient serialization format such as Apache Arrow or Protocol Buffers for the input data

B.Switch to SageMaker Batch Transform to process requests in batches

C.Use a larger instance type such as ml.c5.18xlarge

D.Reduce the number of trees in the model

AnswerA

Reducing serialization/deserialization overhead directly addresses the bottleneck.

Why this answer

Option A is correct. Profiling shows that model inference takes only 50 ms; the rest of the roughly 200 ms latency is spent on preprocessing and serialization/deserialization (parsing JSON, building a NumPy array, then a DMatrix). Using a more efficient serialization format (e.g., Protocol Buffers or Apache Arrow instead of JSON) reduces that overhead directly.

Option B is wrong because Batch Transform is for offline, batch processing, not real-time inference. Option C is wrong because a larger instance may not reduce preprocessing and serialization overhead. Option D is wrong because the model accounts for only 50 ms of the latency, and reducing model complexity could hurt accuracy.

Full explanation →

1567

MCQhard

A SageMaker training job log shows the exhibit. The training job fails immediately after starting. The training data is supposed to be provided via Pipe mode from S3. What is the most likely cause?

A.The input data channel is not properly configured

B.The instance type does not have enough memory

C.The S3 bucket has insufficient permissions

D.The training script is using File mode instead of Pipe mode

E.The hyperparameters are incorrectly specified

AnswerA

The training job is looking for data at /opt/ml/input/data/training, but Pipe mode should provide a pipe.

Why this answer

Option A is correct because the error indicates that the data channel path is incorrect or not configured. Option D is wrong because if File mode were used, the path would exist. Option C is wrong because S3 permissions would cause a different error.

Option B is wrong because an undersized instance type would not cause this error. Option E is wrong because hyperparameters would not cause a missing file error.
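For context, the two input modes expose a channel at different locations inside the container. The sketch below assumes the documented SageMaker layout, where File mode gives a directory named after the channel and Pipe mode gives a FIFO named after the channel plus an epoch counter. The helper name is hypothetical.

```python
def channel_path(channel, mode, epoch=0):
    """Return where a training container sees an input channel.

    File mode: a directory containing the downloaded objects.
    Pipe mode: a named pipe (FIFO), one per epoch, read as a stream.
    """
    base = "/opt/ml/input/data"
    if mode == "File":
        return f"{base}/{channel}"
    if mode == "Pipe":
        return f"{base}/{channel}_{epoch}"
    raise ValueError(f"unsupported input mode: {mode!r}")

file_path = channel_path("training", "File")
pipe_path = channel_path("training", "Pipe")
```

A script that opens files under the File mode directory while the channel is configured for Pipe mode will not find them, which is one way a channel misconfiguration shows up as a missing-path error.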

Full explanation →

1568

MCQhard

A data scientist is trying to list objects in an S3 bucket named 'my-bucket' using the AWS CLI command: `aws s3 ls s3://my-bucket/`. The command fails with an access denied error. The IAM policy attached to the scientist's role is shown in the exhibit. What is the most likely cause of the failure?

A.The condition on the ListBucket action requires all objects to have the tag 'data-type'='training', which may not be satisfied.

B.The IAM policy does not include the s3:ListBucket action.

C.The policy does not grant access to the bucket because it uses 'my-bucket' instead of the full ARN.

D.The condition should use 'StringLike' instead of 'StringEquals'.

AnswerA

The condition on ListBucket is problematic and may cause denial.

Why this answer

Option A is correct because the policy allows s3:ListBucket only when the s3:ExistingObjectTag/data-type condition equals 'training'. ListBucket applies to the bucket, not to individual objects, whereas this condition key describes object tags, so conditions on ListBucket are normally avoided or written with different keys. When the condition is not satisfied for the list request, the Allow statement does not apply and the request is denied.

Option B is wrong because the policy includes s3:ListBucket on the bucket ARN. Option C is wrong because the policy already references the bucket by its ARN. Option D is wrong because switching the operator from StringEquals to StringLike would not make the condition satisfiable.

Full explanation →

1569

MCQhard

A company has deployed a machine learning model on Amazon SageMaker for real-time inference. The endpoint uses a single ml.c5.xlarge instance. Recently, the traffic has increased, and the endpoint is returning HTTP 503 (Service Unavailable) errors during peak hours. The CloudWatch metrics show that the CPU utilization is consistently above 90% during peak times, and the Invocations metric shows that requests are being throttled. The data science team has already optimized the model to reduce inference time by 20%, but the errors persist. The company needs to resolve the issue without increasing costs significantly. Which course of action should be taken?

A.Change the instance type to a larger size, such as ml.c5.2xlarge

B.Switch to batch transform to process requests in batches

C.Use spot instances to reduce costs and add more instances

D.Configure auto-scaling for the endpoint to add instances based on CPU utilization

AnswerD

Auto-scaling adds instances only when needed, handling peak traffic and reducing costs during low traffic.

Why this answer

Option D is correct because adding auto-scaling based on CPU utilization or invocations will dynamically adjust the number of instances to handle the load, reducing errors without incurring extra cost during low traffic. Option A is wrong because moving to a larger instance type will increase costs even during low traffic. Option B is wrong because batch transform is for offline processing, not real-time.

Option C is wrong because using spot instances could lead to interruptions and does not solve the capacity issue.
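Endpoint auto-scaling is configured through Application Auto Scaling. The sketch below builds the two request payloads for a target tracking policy on the predefined invocations-per-instance metric; tracking CPU utilization instead would need a customized metric specification. The endpoint name, variant name, capacity limits, target value, and cooldowns are all placeholder assumptions.

```python
endpoint_name = "example-endpoint"   # hypothetical endpoint name
variant_name = "AllTraffic"          # hypothetical production variant name
resource_id = f"endpoint/{endpoint_name}/variant/{variant_name}"

# Declares the variant's instance count as scalable between 1 and 4.
scalable_target = {
    "ServiceNamespace": "sagemaker",
    "ResourceId": resource_id,
    "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
    "MinCapacity": 1,
    "MaxCapacity": 4,
}

# Target tracking: add or remove instances to hold the metric near TargetValue.
scaling_policy = {
    "PolicyName": "invocations-target-tracking",
    "ServiceNamespace": "sagemaker",
    "ResourceId": resource_id,
    "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
    "PolicyType": "TargetTrackingScaling",
    "TargetTrackingScalingPolicyConfiguration": {
        "TargetValue": 70.0,  # assumed invocations per instance per minute
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
        "ScaleOutCooldown": 60,   # seconds
        "ScaleInCooldown": 300,   # seconds
    },
}

# With boto3 installed and credentials configured, these would be sent as:
#   client = boto3.client("application-autoscaling")
#   client.register_scalable_target(**scalable_target)
#   client.put_scaling_policy(**scaling_policy)
```

A longer scale-in cooldown than scale-out cooldown is a common choice so that capacity is added quickly and removed cautiously.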

Full explanation →

1570

Multi-Selecthard

Which TWO of the following are valid reasons to use a sample of the data during exploratory data analysis instead of the full dataset? (Select TWO.)

Select 2 answers

A.Remove bias from the original dataset

B.Ensure rare events are captured in the analysis

C.Improve model accuracy by reducing noise

D.Reduce memory usage and computation time

E.Enable interactive data visualization with large datasets

AnswersD, E

Sampling allows faster iteration with smaller data.

Why this answer

Options D and E are correct. Sampling reduces memory usage and speeds up interactive analysis. Option B is wrong because sampling can miss rare events.

Option C is wrong because model accuracy typically decreases with less data. Option A is wrong because sampling does not remove bias from the original data.
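A small sketch of the trade-off, using a synthetic population in which 10 of 100,000 records are rare events. The labels, sizes, and seed are arbitrary assumptions.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Synthetic population: 0.01% of records are rare events.
population = ["normal"] * 99_990 + ["rare"] * 10

# A 1% simple random sample is much cheaper to hold in memory and plot.
sample = rng.sample(population, k=1_000)
rare_in_sample = sample.count("rare")

# On average only 0.1 rare records land in a sample of this size,
# so most samples contain none at all.
expected_rare = len(sample) * population.count("rare") / len(population)
```

When rare events matter to the analysis, stratified sampling or a separate pass over the rare class is the usual remedy.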

Full explanation →

1571

MCQhard

A team is analyzing a dataset with many categorical features. They notice that one feature has 1,000 unique values but a long tail where most values appear only once. Which encoding method is most appropriate to avoid overfitting?

A.Target encoding

B.Label encoding

C.One-hot encoding

D.Count encoding

AnswerD

Count encoding replaces categories with their frequency, reducing dimensionality and handling rare values.

Why this answer

Option D is correct because count encoding uses frequency counts, which can capture information for rare categories without creating high dimensionality. One-hot encoding (C) would create 1,000 columns. Target encoding (A) can cause overfitting.

Label encoding (B) implies ordinality.
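Count encoding can be sketched in a few lines. The category values and function name are made up for illustration.

```python
from collections import Counter

def count_encode(values):
    """Replace each category with the number of times it occurs.

    Returns the encoded column and the count table, so the same table
    can be reused on new data.
    """
    counts = Counter(values)
    return [counts[v] for v in values], counts

# Hypothetical categorical column with a long tail of one-off values.
cities = ["NYC", "NYC", "LA", "NYC", "Boise", "LA", "Fargo"]
encoded, counts = count_encode(cities)

# A category never seen when the table was built maps to 0.
unseen = counts.get("Tulsa", 0)
```

The column stays one-dimensional however many distinct values exist, and all one-off categories collapse to the same value, 1.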

Full explanation →

1572

MCQhard

A company is using Amazon SageMaker to host a model that performs real-time inference. The model receives around 100 requests per second with occasional spikes up to 500 requests per second. The current endpoint uses 2 ml.m5.large instances. During spikes, latency increases significantly, and some requests time out. What is the MOST cost-effective solution to handle the spikes without losing requests?

A.Replace the instances with a single larger instance type, such as ml.m5.4xlarge

B.Use an Amazon SQS queue to buffer incoming requests and process them asynchronously

C.Use AWS Lambda with provisioned concurrency to handle the spikes

D.Configure SageMaker managed scaling with a target tracking policy and add a buffer based on the average spike duration

AnswerD

Managed scaling with a buffer allows proactive scaling to handle spikes.

Why this answer

Option D is correct because adding a buffer to the autoscaling policy allows the endpoint to scale proactively before the spike fully hits, while managed scaling adjusts instances based on demand. Option A (increase instance size) is less cost-effective than scaling out. Option B (SQS) adds latency.

Option C (Lambda) may not be suitable for real-time inference.

Full explanation →

1573

MCQmedium

A data scientist is using SageMaker to train a deep learning model. The training script uses TensorFlow and runs on a single p3.2xlarge instance. The scientist wants to reduce training time by using multiple GPUs. What should the scientist do?

A.Increase the instance count to 4 without changing the script.

B.Modify the training script to use Horovod for distributed training.

C.Switch to PyTorch framework.

D.Use SageMaker Managed Spot Training.

AnswerB

Horovod enables multi-GPU and multi-instance distributed training.

Why this answer

Modifying the training script to use Horovod (Option B) is required for distributed training across multiple GPUs. Option A (increasing the instance count without changing the script) does not help, because the additional GPUs are used only if the script supports distributed training. Option D (using Managed Spot Training) does not add GPUs.

Option C (switching to PyTorch) is unnecessary.

Full explanation →

1574

MCQmedium

During training of a deep learning model on a GPU instance in SageMaker, the training job fails with an insufficient memory error. Which step should be taken first to resolve this issue?

A.Add dropout layers

B.Use a smaller learning rate

C.Use gradient clipping

D.Reduce the batch size

AnswerD

Smaller batch size reduces GPU memory footprint.

Why this answer

Option D is correct because reducing the batch size directly decreases GPU memory usage. Option B is wrong because a smaller learning rate changes the size of each update, not the memory used per step. Option C is wrong because gradient clipping addresses exploding gradients, not memory.

Option A is wrong because dropout reduces overfitting, not memory.

Full explanation →

1575

MCQeasy

A data scientist is building a regression model to predict house prices. The dataset includes features such as square footage, number of bedrooms, year built, and location. After training a linear regression model, the data scientist notices that the residuals have a clear pattern when plotted against predicted values: they increase with predicted values. The model also has high RMSE. Which action should the data scientist take to improve the model?

A.Remove outliers from the dataset.

B.Use L1 regularization (Lasso) to reduce overfitting.

C.Apply a log transformation to the target variable.

D.Add interaction terms between features.

AnswerC

Log transformation can stabilize variance and linearize the relationship, reducing the residual pattern.

Why this answer

Option C is correct because residuals that grow with the predicted value indicate non-constant variance (heteroscedasticity) and possible non-linearity, and transforming the target variable (e.g., log transformation) can stabilize variance and linearize relationships. Option D is wrong because adding interaction terms may not fix heteroscedasticity. Option A is wrong because outliers are not the root cause.

Option B is wrong because regularization reduces overfitting but does not address non-linearity.
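To see why a log transform stabilizes variance, consider a hypothetical model whose error is proportional to the price (here a flat 10% overshoot). On the raw scale the error grows with the price; on the log scale it is the same constant for every house. The prices are invented for illustration.

```python
import math

# Hypothetical actual prices and predictions that overshoot by 10%.
actual = [100_000, 400_000, 1_600_000]
predicted = [price * 1.10 for price in actual]

# Errors on the raw scale (predicted minus actual) grow with the price.
raw_errors = [p - a for p, a in zip(predicted, actual)]

# Errors on the log scale all equal log(1.10), whatever the price.
log_errors = [math.log(p) - math.log(a) for p, a in zip(predicted, actual)]
```

Predictions made on the log scale are converted back to prices with `math.exp`.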

Full explanation →
