Knowledge + Practice

AWS Certified Machine Learning Specialty (MLS-C01): Questions 901–975

1755 questions total · 24 pages · All types, answers revealed


Page 13 of 24

901

Multi-Select · hard

A company uses Amazon SageMaker to build a text classification model using a pre-trained BERT model. The dataset contains 10,000 labeled documents. The model is overfitting: training accuracy is 99%, validation accuracy is 85%. Which TWO of the following are most likely to help reduce overfitting? (Choose TWO.)

Select 2 answers

A.Add more transformer layers to the model

B.Increase the dropout rate during fine-tuning

C.Increase the batch size

D.Use a larger pre-trained BERT model

E.Decrease the learning rate

Answers: B, E

Dropout is a regularization technique that randomly drops units, reducing overfitting.

Why this answer

Increasing dropout during fine-tuning adds regularization. Decreasing the learning rate tends to keep the fine-tuned weights closer to the pre-trained ones, which helps the model converge to a better solution and limits overfitting to the small training set. Increasing the batch size (Option C) reduces gradient noise and is not a reliable regularizer.

Adding more transformer layers (Option A) increases model capacity and overfitting. Using a larger pre-trained model (Option D) also increases capacity.

902

MCQ · medium

An IAM policy attached to a SageMaker execution role is shown in the exhibit. When a data scientist tries to create a training job that writes logs to CloudWatch Logs, the job fails. What is the MOST likely reason?

A.The policy does not specify the SageMaker API version

B.The S3 bucket policy denies access to the training job

C.The policy lacks permissions for CloudWatch Logs actions

D.The policy has an implicit deny for SageMaker actions

Answer: C

Training jobs need logs:CreateLogGroup, logs:CreateLogStream, and logs:PutLogEvents.

Why this answer

Option C is correct because the policy does not include CloudWatch Logs permissions. Option A is incorrect because IAM policies do not specify a service API version. Option B is incorrect because the S3 actions are allowed.

Option D is incorrect because the SageMaker actions are allowed for all resources and there is no deny statement.

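As a rough illustration of the missing permissions, the sketch below builds an IAM policy document containing the three CloudWatch Logs actions named above. The exhibit is not reproduced here, so this is not the policy from the question; the `Resource` ARN pattern is an assumption chosen for illustration and should be scoped to your own account and log groups.

```python
import json

# Hypothetical statement granting the CloudWatch Logs actions a training job needs.
# The Resource pattern is an assumption, not taken from the exhibit.
logs_statement = {
    "Effect": "Allow",
    "Action": [
        "logs:CreateLogGroup",
        "logs:CreateLogStream",
        "logs:PutLogEvents",
    ],
    "Resource": "arn:aws:logs:*:*:log-group:/aws/sagemaker/*",
}

# "Version" is the IAM policy language version, not a SageMaker API version.
policy = {"Version": "2012-10-17", "Statement": [logs_statement]}

policy_json = json.dumps(policy, indent=2)
print(policy_json)
```

A statement like this would be added to the execution role's policy alongside the existing SageMaker and S3 statements.
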
903

MCQ · medium

A company is using Amazon SageMaker to train a deep learning model. The training job is failing with an error 'CUDA out of memory'. The training instance is an ml.p3.2xlarge with 16 GB GPU memory. The model architecture and batch size are appropriate for this instance size. Which action is most likely to resolve this error?

A.Reduce the number of epochs.

B.Increase the number of GPUs by using a distributed training instance type.

C.Enable automatic mixed precision (AMP) training to reduce memory usage.

D.Use a smaller instance type to force lower memory usage.

Answer: C

AMP uses FP16 where possible, which can substantially reduce activation memory and often resolves out-of-memory errors.

Why this answer

Option C is correct because enabling automatic mixed precision (AMP) training reduces GPU memory usage by storing tensors in half-precision (FP16) where possible, while keeping critical operations in full precision (FP32). This directly addresses the 'CUDA out of memory' error on an ml.p3.2xlarge instance (16 GB GPU memory) without changing the model architecture or batch size, which are already appropriate.

Exam trap

The trap here is that candidates may incorrectly assume the solution is to reduce epochs (Option A) or scale out to more GPUs (Option B), when the root cause is memory exhaustion per GPU, which is best addressed by mixed precision training to shrink the memory footprint without altering the model or batch size.

How to eliminate wrong answers

Option A is wrong because reducing the number of epochs does not affect peak GPU memory usage during training; it only changes the total training time, not the memory footprint per batch. Option B is wrong because adding GPUs (e.g., ml.p3.16xlarge) does not by itself reduce per-GPU memory consumption: with data parallelism each GPU still holds a full copy of the model and its gradients, and the batch size is already stated to be appropriate. Option D is wrong because a smaller instance type would not provide more GPU memory, so it cannot resolve the out-of-memory error.

904

MCQ · hard

A company uses AWS Glue to run ETL jobs that transform data from Amazon RDS for MySQL to Amazon S3. The current job runs daily and takes 3 hours to process 100 GB of data. The company expects data volume to grow 10x in the next year. They need to reduce job runtime and cost. Which approach should they take?

A.Use S3 Select with Glue to filter data before transformation.

B.Use parallel reads with pushdown predicates in the Glue job's source connection, and write the output in columnar format (Parquet) partitioned by date.

C.Increase the number of Glue DPUs to 100 and enable job bookmarking.

D.Use Amazon Redshift Spectrum to perform transformations in place on S3.

Answer: B

Parallel reads with pushdown predicates reduce load on RDS and speed up extraction; Parquet with partitioning reduces storage and query costs.

Why this answer

Option B is correct because parallel reads from RDS with pushdown predicates reduce the load on the source and speed up extraction; using columnar formats like Parquet reduces storage and scanning costs in Athena. Option C is wrong because increasing DPUs without changing the extraction method may not help if the bottleneck is the source database. Option A is wrong because S3 Select performs server-side filtering of objects already in S3 and does not speed up extraction from RDS.

Option D is wrong because Redshift Spectrum is for querying data in S3, not for transforming it.

905

MCQ · easy

A data scientist is training a binary classification model on imbalanced data (95% negative, 5% positive). Which metric is most appropriate for evaluating model performance?

A.R-squared

B.Mean Squared Error (MSE)

C.Area Under the ROC Curve (AUC-ROC)

D.Accuracy

Answer: C

AUC-ROC measures the model's ability to distinguish between classes regardless of threshold, suitable for imbalanced data.

Why this answer

AUC-ROC is the most appropriate metric for imbalanced binary classification because it evaluates the model's ability to distinguish between positive and negative classes across all classification thresholds, without being biased by the 95% negative majority. It measures the trade-off between true positive rate and false positive rate, making it robust to class imbalance.

Exam trap

The trap here is that candidates often default to accuracy as the primary metric, not realizing that with severe class imbalance, accuracy can be artificially high and completely mask poor performance on the minority class.

How to eliminate wrong answers

Option A is wrong because R-squared is a regression metric that measures the proportion of variance explained by the model, and it is not applicable to binary classification problems. Option B is wrong because Mean Squared Error (MSE) is a regression loss function that penalizes large errors quadratically and does not provide meaningful evaluation for classification tasks, especially with imbalanced data. Option D is wrong because accuracy would be misleadingly high (95%) by simply predicting the majority class for all instances, failing to capture the model's performance on the rare positive class.
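
The accuracy trap described above can be checked with a few lines of arithmetic. The sketch below uses made-up toy data (95 negatives, 5 positives) and a hypothetical "model" that scores every example identically; `roc_auc` and `accuracy` are small helper functions written for this illustration, not library calls.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative.

    Ties count as half a win. This is an O(n_pos * n_neg) sketch for small data.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def accuracy(labels, predictions):
    """Fraction of predictions that match the labels."""
    return sum(int(y == p) for y, p in zip(labels, predictions)) / len(labels)


# Hypothetical toy data mirroring the 95% / 5% split in the question.
labels = [0] * 95 + [1] * 5

# A degenerate model: same score for everyone, always predicts the majority class.
constant_scores = [0.0] * 100
majority_predictions = [0] * 100

majority_accuracy = accuracy(labels, majority_predictions)
majority_auc = roc_auc(labels, constant_scores)

print(majority_accuracy)  # 0.95: looks strong despite finding no positives
print(majority_auc)       # 0.5: no better than chance at ranking
```

The same degenerate model scores 95% on accuracy and 0.5 on AUC-ROC, which is why accuracy alone is misleading here.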


906

MCQ · hard

A company uses Amazon SageMaker to train a text classification model. The training data is stored in S3 and contains sensitive personally identifiable information (PII). The company must ensure that the data is encrypted at rest in S3 and that the encryption key is managed by the company's own hardware security module (HSM). Which configuration should be used?

A.Use S3 server-side encryption with S3-managed keys (SSE-S3)

B.Use client-side encryption with the encryption key stored in the HSM

C.Use S3 server-side encryption with customer-provided keys (SSE-C) and store the keys in the HSM

D.Use S3 server-side encryption with AWS KMS managed keys (SSE-KMS) with a customer managed key

Answer: C

SSE-C allows customers to provide their own keys, which can be stored in an HSM.

Why this answer

Option C is correct because SSE-C allows customers to provide their own encryption keys, which can be stored in an HSM. Option A (SSE-S3) uses AWS-managed keys. Option D (SSE-KMS with a customer managed key) uses KMS, not the company's HSM.

Option B (client-side encryption) requires managing encryption in the application.

907

Multi-Select · hard

Which THREE of the following are valid approaches for deploying a machine learning model to an Amazon SageMaker endpoint for real-time inference?

Select 3 answers

A.Use a SageMaker Inference Pipeline with multiple containers

B.Use a pre-built SageMaker container with built-in algorithms

C.Use Amazon EMR to host the model

D.Deploy the model as an AWS Lambda function

E.Bring your own Docker container

Answers: A, B, E

Inference pipelines allow chaining of preprocessing and prediction containers.

Why this answer

Option A is correct because SageMaker Inference Pipelines allow you to chain multiple containers (e.g., preprocessing, prediction, postprocessing) into a single endpoint, enabling complex workflows for real-time inference. This is achieved by defining a sequence of Docker containers in the model definition, where each container's output is passed as input to the next, all within the same SageMaker endpoint.

Exam trap

The trap here is that candidates might confuse Amazon EMR's model serving capabilities (e.g., using Spark MLlib) with SageMaker's managed inference, or assume Lambda can handle large model artifacts despite its payload and timeout constraints.


908

MCQ · hard

A data scientist is training a binary classifier to detect network intrusions. The dataset has 1,000 features and 10 million samples, but only 0.1% are positive (intrusions). The scientist uses XGBoost with scale_pos_weight set to 100. The model achieves a recall of 0.90 and precision of 0.05 on the test set. The business requires precision of at least 0.50 while maintaining recall above 0.80. Which technique should the scientist apply?

A.Switch to a random forest classifier with class weights

B.Randomly undersample the majority class to achieve 1:1 ratio

C.Tune the decision threshold on validation data to maximize F1 score

D.Increase scale_pos_weight to 500

Answer: C

Threshold tuning directly controls the precision-recall trade-off.

Why this answer

Option C (post-training threshold tuning) adjusts the decision threshold to trade off precision and recall. Option D (increase scale_pos_weight) will further increase recall but decrease precision. Option B (undersample the majority class) can help but may reduce recall.

Option A (switch to a random forest) may not achieve the required precision.

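To make the threshold idea concrete, here is a minimal sketch on made-up validation scores. Note that it selects the threshold by the business floors (precision at least 0.50, recall at least 0.80) rather than by maximizing F1 as the answer option states; that is a deliberate simplification, and `pick_threshold` and `precision_recall_at` are hypothetical helpers, not library functions.

```python
def precision_recall_at(labels, scores, threshold):
    """Precision and recall when predicting positive for score >= threshold."""
    tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
    fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
    fn = sum(1 for y, s in zip(labels, scores) if s < threshold and y == 1)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def pick_threshold(labels, scores, min_precision, min_recall):
    """Return (threshold, precision, recall) for the lowest threshold meeting both floors.

    Scanning from the lowest candidate upward favors recall among valid thresholds.
    Returns None if no candidate satisfies both constraints.
    """
    for threshold in sorted(set(scores)):
        precision, recall = precision_recall_at(labels, scores, threshold)
        if precision >= min_precision and recall >= min_recall:
            return threshold, precision, recall
    return None


# Hypothetical validation labels and model scores (not from the question).
val_labels = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1]
val_scores = [0.05, 0.10, 0.20, 0.30, 0.40, 0.55, 0.60, 0.65, 0.70, 0.80, 0.90, 0.95]

result = pick_threshold(val_labels, val_scores, min_precision=0.50, min_recall=0.80)
print(result)
```

The threshold should be chosen on validation data and then confirmed on the held-out test set, since a threshold tuned on the test set would overstate performance.
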
909

MCQ · easy

A data engineer needs to transform large CSV files stored in S3 into Parquet format and load them into a data warehouse for analysis. The transformation must be cost-effective and serverless. Which AWS service should be used?

A.Amazon Athena

B.Amazon EMR with Spark

C.AWS Glue

D.AWS Data Pipeline

Answer: C

AWS Glue is a serverless ETL service that can perform the transformation efficiently.

Why this answer

AWS Glue is the correct choice because it provides a fully managed, serverless ETL service that can convert CSV files from S3 into Parquet format using its built-in Spark engine. It is cost-effective as you only pay for the resources consumed during the job execution, and it integrates directly with data warehouses like Amazon Redshift for loading transformed data.

Exam trap

The trap here is that candidates confuse Amazon Athena's ability to query Parquet files with a managed ETL pipeline. Athena can write Parquet through a CREATE TABLE AS SELECT (CTAS) query, but it is a query engine and does not load data into a data warehouse, while Glue is purpose-built for serverless data transformation.

How to eliminate wrong answers

Option A is wrong because Amazon Athena is an interactive query service for ad-hoc analysis of data in S3; although a CTAS query can write Parquet, Athena is not an ETL service and does not load the results into a data warehouse. Option B is wrong because Amazon EMR with Spark requires provisioning and managing clusters, which is not serverless; it incurs costs for running EC2 instances even when idle, making it less cost-effective for occasional transformations. Option D is wrong because AWS Data Pipeline is a workflow orchestration service that can move and transform data, but it is not serverless (it relies on EC2 instances for task runners) and is primarily designed for scheduled data movement, not optimized for converting CSV to Parquet with built-in Spark capabilities.

910

MCQ · medium

A company uses Amazon Kinesis Data Firehose to ingest streaming data and deliver it to an S3 bucket. The data is in JSON format with a timestamp field. The data science team wants to query the data using Athena with partitioning by year/month/day. How should the S3 data be organized?

A.Configure Firehose to use dynamic partitioning with custom prefix

B.Store data in a single prefix and use Athena's 'partition projection' feature

C.Use AWS Glue crawler to partition the data after delivery

D.Use Amazon EMR to partition the data after delivery

Answer: A

Firehose dynamic partitioning creates S3 prefixes based on record fields or timestamps.

Why this answer

Kinesis Data Firehose can partition data using custom prefixes like 'year=!{timestamp:yyyy}/month=!{timestamp:MM}/day=!{timestamp:dd}/'. This creates Hive-style partitions that Athena can load as table partitions, for example with MSCK REPAIR TABLE or an AWS Glue crawler.

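To show what the resulting object keys look like, the sketch below formats a timestamp into the same Hive-style layout. It does not call Firehose; `hive_prefix` is a hypothetical helper that only mirrors the year/month/day pattern from the prefix expression above.

```python
from datetime import datetime, timezone


def hive_prefix(ts: datetime) -> str:
    """Format a timestamp as a Hive-style year=/month=/day= S3 prefix."""
    return f"year={ts:%Y}/month={ts:%m}/day={ts:%d}/"


# Example timestamp chosen for illustration only.
example = hive_prefix(datetime(2024, 3, 7, 15, 30, tzinfo=timezone.utc))
print(example)  # year=2024/month=03/day=07/
```

Because each path segment is a key=value pair, a table declared with year, month, and day partition columns can map these prefixes to partitions directly.
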
911

MCQ · easy

A data scientist needs to understand the distribution of a numeric feature in a dataset stored in Amazon S3. Which AWS service can be used to run a quick exploratory query without setting up a server?

A.Amazon Redshift

B.Amazon EMR

C.Amazon Athena

D.AWS Glue

Answer: C

Athena is serverless and allows SQL queries directly on S3 data.

Why this answer

Option C is correct because Amazon Athena allows serverless SQL queries on data in S3. Option B (Amazon EMR) requires cluster setup; Option D (AWS Glue) is for ETL; Option A (Amazon Redshift) is a data warehouse.

912

MCQ · hard

A company is deploying a real-time inference endpoint using SageMaker. The model is a large deep learning model (5 GB) with strict latency requirements (< 100 ms per request). The team expects bursty traffic with up to 1000 requests per second. Which configuration best meets the latency and throughput requirements?

A.Deploy an ml.p3.2xlarge instance with automatic scaling based on a custom metric like 'InvocationsPerInstance'

B.Use a multi-model endpoint with ml.c5.4xlarge instances

C.Use SageMaker Serverless Inference with a memory size of 6 GB

D.Deploy a single ml.p3.16xlarge instance with a production variant

Answer: A

GPU instances handle large models; automatic scaling with custom metrics provides elasticity.

Why this answer

Option A is correct because deploying on an ml.p3.2xlarge instance with automatic scaling based on 'InvocationsPerInstance' allows the endpoint to handle bursty traffic up to 1000 requests per second while maintaining sub-100 ms latency. The GPU-accelerated p3 instance provides the necessary compute for a 5 GB deep learning model, and custom scaling on invocations per instance ensures that additional instances are provisioned quickly during traffic spikes without over-provisioning.

Exam trap

The trap here is that candidates often assume a single large instance (like ml.p3.16xlarge) can handle high throughput, but they overlook the need for horizontal scaling to manage bursty traffic without latency degradation.

How to eliminate wrong answers

Option B is wrong because multi-model endpoints share a single instance across multiple models, which can lead to contention and increased latency for a large 5 GB model under bursty traffic, and the ml.c5.4xlarge instances lack GPU acceleration, making them unsuitable for deep learning inference at high throughput. Option C is wrong because SageMaker Serverless Inference has a cold start latency that can exceed 100 ms, especially for a 5 GB model, and its maximum concurrency is limited, making it unable to handle 1000 requests per second with strict latency requirements. Option D is wrong because a single ml.p3.16xlarge instance, while powerful, cannot handle bursty traffic of 1000 requests per second without scaling; a single instance will be overwhelmed, causing latency to spike above 100 ms, and it lacks the elasticity needed for bursty workloads.


913

MCQ · medium

A machine learning team is deploying a model using Amazon SageMaker. The model inference code runs on GPUs and requires a custom container. The team wants to minimize cold start latency. Which SageMaker hosting option should they use?

A.Use a multi-model endpoint with GPU instances.

B.Use a serverless inference endpoint.

C.Use a real-time endpoint with multiple production variants for redundancy.

D.Use a real-time endpoint with a single production variant using a GPU instance.

Answer: D

Real-time endpoints with GPU instances minimize cold start latency for custom containers.

Why this answer

Multi-model endpoints are designed to host multiple models on the same endpoint and load models on demand, so the first request to a model can incur a cold start; they are not intended for serving a single model with minimal latency. Real-time endpoints with a single variant and a GPU instance keep the model loaded and are the standard choice for low-latency inference. Serverless inference does not support GPUs and is subject to cold starts.

Multiple production variants are for A/B testing, not for reducing cold starts. Batch transform is for offline inference.

914

MCQ · hard

A data scientist is training a model using SageMaker's built-in XGBoost algorithm. The dataset has 500 features and 1 million rows. The training job is taking too long. The scientist wants to reduce training time without sacrificing accuracy. Which action is LIKELY to be most effective?

A.Use a smaller instance type to reduce time

B.Reduce the number of trees in XGBoost

C.Use a larger instance type with more vCPUs

D.Apply Principal Component Analysis (PCA) to reduce the number of features

Answer: D

PCA reduces dimensionality, speeding up training while retaining most information.

Why this answer

Option D (apply PCA to reduce the number of features) is correct because it reduces dimensionality, speeding up training. Option C (use a larger instance type) may not be cost-effective. Option A (use a smaller instance type) provides less compute and would not shorten training.

Option B (reduce the number of trees) may reduce accuracy.

915

MCQ · easy

A company wants to use SageMaker to host multiple models behind a single endpoint to reduce costs. Which SageMaker feature should they use?

A.SageMaker Elastic Inference

B.SageMaker inference pipeline

C.SageMaker batch transform

D.SageMaker multi-container endpoints

E.SageMaker Multi-Model Endpoints

Answer: E

Multi-Model Endpoints host multiple models on the same endpoint.

Why this answer

Option E is correct because SageMaker Multi-Model Endpoints allow multiple models on the same endpoint. Option D (multi-container endpoints) is for different containers, not multiple models. Option C (batch transform) is offline.

Option B (inference pipeline) chains containers. Option A (Elastic Inference) accelerates inference.

916

MCQ · hard

A data scientist is training a model using Amazon SageMaker. The training dataset is 500 GB and is stored in S3. The data scientist wants to use Pipe input mode to stream data directly from S3 to the training container. However, the training job fails with an error indicating that the container cannot read the data. What is the most likely cause?

A.The training instance does not have enough memory

B.The IAM role does not have s3:GetObject permission

C.The data is compressed and Pipe mode cannot handle compressed data

D.The S3 bucket is in a different Region

E.The training algorithm does not support Pipe mode

Answer: E

Not all algorithms support Pipe input; they need to read from a pipe.

Why this answer

Option E is correct because Pipe mode requires the training algorithm to read sequentially from a named FIFO pipe rather than from random-access files. Option D is wrong because the S3 bucket is accessible. Option A (instance memory) is not specific to Pipe mode.

Option B is wrong because s3:GetObject is needed, but if permissions are correct, the issue is algorithm support. Option C (compression) is not a problem for Pipe mode.

917

MCQ · medium

A data scientist is analyzing a time-series dataset and wants to check for stationarity. Which EDA technique is most appropriate?

A.Plot the autocorrelation function (ACF).

B.Use time-series cross-validation.

C.Perform the Augmented Dickey-Fuller (ADF) test.

D.Create a scatter plot of the series against its lag.

Answer: C

ADF test formally tests for unit root (non-stationarity).

Why this answer

The Augmented Dickey-Fuller (ADF) test is a formal statistical hypothesis test specifically designed to check for stationarity in a time series. It tests the null hypothesis that a unit root is present, indicating non-stationarity, against the alternative of stationarity. This makes it the most appropriate EDA technique for directly assessing stationarity.

Exam trap

AWS often tests the distinction between visual EDA techniques (like ACF plots) and formal statistical tests (like ADF), trapping candidates who confuse diagnostic plots with hypothesis testing for stationarity.

How to eliminate wrong answers

Option A is wrong because plotting the autocorrelation function (ACF) is a visual diagnostic for identifying autocorrelation patterns and model order (e.g., AR or MA terms), but it does not provide a formal statistical test for stationarity. Option B is wrong because time-series cross-validation is a model evaluation technique used to assess predictive performance, not a method for testing stationarity. Option D is wrong because a scatter plot of the series against its lag can reveal linear relationships and autocorrelation, but it lacks a formal hypothesis test and cannot definitively confirm or reject stationarity.
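
For reference, the ADF test described above is usually stated through the following regression. The symbols (drift α, trend coefficient β, lag order p) follow the common textbook form and are not taken from the question itself.

```latex
\Delta y_t = \alpha + \beta t + \gamma\, y_{t-1} + \sum_{i=1}^{p} \delta_i\, \Delta y_{t-i} + \varepsilon_t

H_0:\ \gamma = 0 \quad \text{(unit root, non-stationary)}
\qquad
H_1:\ \gamma < 0 \quad \text{(stationary)}
```

A small p-value for the test statistic on γ leads to rejecting the unit-root null, which supports treating the series as stationary.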


918

Multi-Select · hard

A company is training a deep learning model on SageMaker using multiple GPUs. The training is slow due to inefficient data loading. Which TWO actions can improve I/O performance?

Select 2 answers

A.Use instance store volumes for data.

B.Increase the instance count to a single large instance.

C.Use Pipe mode input for training data.

D.Use Amazon EBS volumes attached to training instances.

E.Use Amazon EFS as a shared file system.

Answers: C, E

Pipe mode streams data directly from S3, reducing I/O bottlenecks.

Why this answer

Options C and E are correct. Using Pipe mode streams data directly from S3, avoiding local disk writes. Using Amazon EFS can also provide shared, high-throughput storage.

Option D is incorrect because SageMaker does not support EBS volumes as input directly; you would need to use FSx or EFS. Option A is incorrect because instance store volumes are ephemeral and not suitable for persistent data. Option B is incorrect because a single large instance may not improve I/O parallelism.

919

MCQ · easy

A data scientist is training a decision tree classifier and notices that the model performs well on training data but poorly on test data. Which technique should the data scientist use to address this issue?

A.Use a different split criterion

B.Prune the tree

C.Apply L1 regularization

D.Increase tree depth

Answer: B

Pruning reduces overfitting.

Why this answer

Pruning the tree reduces overfitting by removing branches that have little statistical significance or that capture noise in the training data. This technique improves generalization to unseen test data, which directly addresses the symptom of high training accuracy and low test accuracy.

Exam trap

AWS often tests the misconception that regularization techniques like L1/L2 apply universally, when in fact they are specific to models with learnable weights (e.g., linear regression, neural networks) and do not apply to a single decision tree.

How to eliminate wrong answers

Option A is wrong because using a different split criterion (e.g., Gini impurity vs. entropy) changes how splits are selected but does not inherently reduce model complexity or overfitting; it may still produce a deep, overfitted tree. Option C is wrong because L1 regularization is a technique for linear models (e.g., Lasso regression) and is not directly applicable to decision trees, which do not have coefficients to penalize. Option D is wrong because increasing tree depth makes the model more complex, exacerbating overfitting rather than fixing it.


920

MCQ · hard

An S3 event notification triggers an AWS Lambda function when a new object is created. The Lambda function parses the event and processes the object. The function is failing with a timeout error for large objects. Which approach should be used to handle large objects efficiently?

A.Increase the Lambda function timeout to 15 minutes

B.Use an SQS queue to buffer event notifications and configure Lambda with a batch window

C.Stream events to Amazon Kinesis Data Streams and process with Lambda

D.Use AWS Step Functions to orchestrate the processing

Answer: B

SQS decouples events and allows Lambda to process in batches, reducing timeout risk.

Why this answer

Option B is correct because S3 event notifications can be sent to an SQS queue, and Lambda can process messages in batches with longer timeouts. Option A is wrong because increasing timeout alone doesn't solve the issue of large objects; Option C is wrong because Kinesis is not needed; Option D is wrong because Step Functions adds complexity.


921

MCQ · easy

A data scientist is analyzing a dataset with 500 features and 10,000 samples. After running a correlation matrix, they find that many feature pairs have correlation >0.95. What is the most appropriate next step to improve model performance?

A.Collect more training data to reduce the impact of correlated features.

B.Increase the regularization parameter in the model.

C.Apply principal component analysis (PCA) to reduce dimensionality.

D.Remove all features with correlation above 0.95.

Answer: C

PCA reduces multicollinearity by transforming correlated features into orthogonal components.

Why this answer

Option C is correct because high correlation indicates multicollinearity, which can be addressed by dimensionality reduction techniques like PCA. Option A is wrong because adding more data does not fix multicollinearity. Option D is wrong because removing all correlated features may discard useful information.

Option B is wrong because increasing regularization can help but is not the most appropriate first step for a large number of correlated features.

922

MCQ · hard

A team is performing exploratory data analysis on a dataset containing 10 million records stored in Amazon S3. They want to sample the data efficiently to build a representative subset for initial modeling. Which sampling method should they use to minimize bias and ensure the sample reflects the population distribution?

A.Stratified random sampling

B.Simple random sampling

C.Systematic sampling

D.Reservoir sampling

Answer: A

Stratified sampling ensures representation from all strata, reducing bias.

Why this answer

Option A is correct because stratified random sampling ensures that each subgroup (stratum) is proportionally represented, which is important for imbalanced data. Option B is wrong because simple random sampling may miss rare subgroups. Option C is wrong because systematic sampling can introduce bias if there is periodicity.

Option D is wrong because reservoir sampling is for streaming data, not for static datasets.

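A minimal sketch of proportional stratified sampling, using only the standard library. The records, the `label` field, and the 10% fraction are made up for illustration; `stratified_sample` is a hypothetical helper, and a real pipeline over 10 million S3 records would more likely run in Spark or a similar engine.

```python
import random
from collections import defaultdict


def stratified_sample(records, key, fraction, seed=0):
    """Sample the same fraction from every stratum defined by key(record).

    Each stratum contributes at least one record so rare groups are never dropped.
    """
    rng = random.Random(seed)  # fixed seed for a reproducible sample
    strata = defaultdict(list)
    for record in records:
        strata[key(record)].append(record)

    sample = []
    for group in strata.values():
        k = max(1, round(len(group) * fraction))
        sample.extend(rng.sample(group, k))
    return sample


# Hypothetical dataset: 5% of records belong to a rare stratum.
records = [{"id": i, "label": "rare" if i % 20 == 0 else "common"} for i in range(1000)]

subset = stratified_sample(records, key=lambda r: r["label"], fraction=0.1)
rare_count = sum(1 for r in subset if r["label"] == "rare")

print(len(subset), rare_count)  # the rare stratum keeps its 5% share
```

The sample preserves the 5% share of the rare stratum exactly, whereas a simple random sample of the same size could over- or under-represent it by chance.
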
923

MCQ · hard

A data pipeline uses AWS Glue to transform data from Amazon RDS to Amazon S3. The team wants to ensure that only new or updated records are processed in each run, minimizing cost and time. Which AWS Glue feature should be used?

A.Use Glue triggers to run the job on a schedule.

B.Use Glue partition pruning to filter data.

C.Use Glue crawlers to detect new data.

D.Enable Glue Job Bookmarks.

Answer: D

Job Bookmarks maintain state and process only new or changed data.

Why this answer

Option D is correct because Job Bookmarks track processed data and enable incremental processing. Option B is wrong because partition pruning filters data at read time but doesn't track changes. Option C is wrong because crawlers discover schema but don't process incrementally.

Option A is wrong because triggers schedule jobs but don't enable incremental processing.

924

MCQ · easy

A machine learning engineer is performing exploratory data analysis on a dataset containing customer transaction records. The dataset has missing values in the 'age' column and outliers in the 'amount' column. Which combination of techniques should the engineer use to handle these issues during EDA?

A.Impute missing age values with the median and cap outliers in 'amount' using the interquartile range (IQR) method.

B.Remove rows with missing age and apply log transformation to 'amount'.

C.Impute missing age values with a constant (e.g., 0) and cap outliers using mean ± 3*std.

D.Impute missing age values with the mean and remove outliers in 'amount' using z-score.

Answer: A

Median is robust; IQR handles outliers.

Why this answer

Option A is correct because median imputation is robust to outliers, and IQR-based capping is standard for outlier handling. Option D is wrong because mean imputation is sensitive to outliers, as is the z-score, which depends on the mean and standard deviation. Option B is wrong because removing rows with missing age may lose data.

Option C is wrong because imputing a constant such as 0 distorts the age distribution, and mean ± 3*std limits are also sensitive to outliers.

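The two steps in Option A can be sketched with the standard library alone. The sample values are invented, the 1.5 multiplier is the conventional IQR fence rather than anything stated in the question, and `impute_median` and `cap_iqr` are hypothetical helpers.

```python
import statistics


def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    median = statistics.median(observed)
    return [median if v is None else v for v in values]


def cap_iqr(values, k=1.5):
    """Clip values to [Q1 - k*IQR, Q3 + k*IQR] instead of dropping them."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [min(max(v, lower), upper) for v in values]


# Hypothetical columns: two missing ages and one extreme transaction amount.
ages = [25, 31, None, 42, 38, None, 29]
amounts = [10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 500.0]

ages_filled = impute_median(ages)
amounts_capped = cap_iqr(amounts)

print(ages_filled)
print(amounts_capped)
```

Capping keeps every row, so the extreme amount is pulled in to the upper fence rather than removed.
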
925

MCQ · medium

Refer to the exhibit. An IAM policy is attached to an IAM role used by a SageMaker training job. The training job fails with an access denied error when trying to write model artifacts to an S3 bucket. What is the most likely cause?

A.The IAM role does not have permission to write to the S3 bucket

B.The training job is trying to write to a different S3 bucket

C.The IAM role does not have permission to read the training data

D.The IAM role does not have permission to create training jobs

Answer: A

The policy lacks s3:PutObject, so writing model artifacts is denied.

Why this answer

Option A is correct because the policy only allows s3:GetObject, not s3:PutObject, so the training job cannot write artifacts. Option D is wrong because the policy allows sagemaker:CreateTrainingJob. Option C is wrong because the policy allows s3:GetObject for the training data.

Option B is wrong because the policy does not restrict the S3 bucket; it allows GetObject on a specific path.

926

MCQ · hard

A data scientist is training a deep learning model for image segmentation using a U-Net architecture. The model overfits severely. The scientist tries L2 regularization, dropout, and data augmentation, but validation loss remains high while training loss approaches zero. Which additional strategy is most likely to reduce overfitting?

A.Implement early stopping based on validation loss

B.Increase the batch size

C.Use a larger learning rate

D.Add more convolutional layers to increase model capacity

Answer: A

Early stopping prevents overfitting by stopping training before the model starts to memorize the training data.

Why this answer

Early stopping monitors validation loss and halts training when it stops improving, directly addressing overfitting by preventing the model from memorizing noise after it has learned generalizable features. Since the training loss is near zero but validation loss remains high, the model has already started overfitting, and early stopping can cut training at the point just before overfitting worsens.

Exam trap

AWS often tests the misconception that increasing regularization (L2, dropout, augmentation) is always sufficient, but the trap here is that when those techniques fail, early stopping is the next logical step because it directly stops the overfitting process at the optimal point, whereas the other options either increase capacity or destabilize training.

How to eliminate wrong answers

Option B is wrong because increasing batch size typically reduces gradient noise and can lead to sharper minima, which tends to hurt generalization rather than reduce overfitting. Option C is wrong because using a larger learning rate can cause the optimizer to overshoot minima, leading to unstable training and potentially higher validation loss, but it does not specifically target the overfitting problem when training loss is already near zero. Option D is wrong because adding more convolutional layers increases model capacity, which exacerbates overfitting when the model already has enough capacity to memorize the training data.
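As a rough illustration of the early-stopping logic described above, here is a minimal framework-free sketch. The function name, the `patience` parameter, and the list of losses are all made up for the example; a real training loop would compute the validation loss each epoch and usually restore the best checkpoint.

```python
def train_with_early_stopping(val_losses, patience=3):
    """Return (best_epoch, best_loss), stopping after `patience` epochs without improvement.

    `val_losses` stands in for the per-epoch validation loss that a real
    training loop would compute.
    """
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # New best validation loss: remember it and reset the counter
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # validation loss has stopped improving
    return best_epoch, best_loss


# Hypothetical validation curve that bottoms out at epoch 3 and then rises
losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.56, 0.61, 0.70]
print(train_with_early_stopping(losses))
```

With these made-up numbers, training stops three epochs after the minimum, and the epoch with the lowest validation loss is the one reported.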

Full explanation →

927

MCQmedium

Refer to the exhibit. A data scientist is trying to create a SageMaker training job but receives an access denied error. The IAM policy shown is attached to the user. What is the likely issue?

A.The policy does not allow sagemaker:CreateTrainingJob

B.The policy is missing s3:ListBucket permission

C.The policy does not allow sagemaker:CreateModel

D.The policy does not allow s3:PutObject

AnswerB

SageMaker needs to list objects in the bucket.

Why this answer

Option B is correct because the policy allows s3:GetObject and s3:PutObject on the bucket, but the training job also needs s3:ListBucket to read the objects. Option A is wrong because the policy has sagemaker:CreateTrainingJob on all resources. Option C is wrong because the policy allows sagemaker actions.

Option D is wrong because the policy allows s3:PutObject, but the error is about access denied, not about upload.

Full explanation →

928

Multi-Selecteasy

A data engineer is building a data pipeline that ingests streaming data from Amazon Kinesis Data Streams, transforms the data using AWS Lambda, and stores the results in Amazon S3. The engineer needs to ensure that each record is processed exactly once and in order. Which TWO approaches should the engineer consider? (Choose TWO.)

Select 2 answers

A.Configure the Lambda function with a reserved concurrency of 1 and a batch size of 1 to process records sequentially.

B.Use the Kinesis Producer Library (KPL) with a sequence number for each record.

C.Use Amazon SQS FIFO queues to decouple Kinesis and Lambda, ensuring ordering and exactly-once delivery.

D.Enable S3 Event Notifications to trigger Lambda for each object.

E.Use Kinesis Data Firehose to buffer and deliver data to S3, then use Lambda to process.

AnswersA, B

This ensures in-order processing per shard, and with proper idempotency, exactly-once can be achieved.

Why this answer

Option B is correct because Kinesis Data Streams assigns each record a sequence number within its shard, which preserves per-shard order and lets an idempotent consumer discard duplicates to achieve effectively exactly-once processing. Option A is correct because AWS Lambda can be configured with a reserved concurrency of 1 and a batch size of 1 to process records in order. Option D is incorrect because S3 does not provide ordering guarantees.

Option E is incorrect because Kinesis Data Firehose does not guarantee exactly-once delivery. Option C is incorrect because SQS FIFO does not integrate directly with Kinesis.
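The idempotency idea can be sketched as a consumer that remembers which sequence numbers it has already handled. This is a simplified illustration, not a Kinesis client: the function name is hypothetical, sequence numbers are small integers rather than the long strings Kinesis issues, and the `seen` set would need to live in a durable store in a real pipeline.

```python
def process_in_order(records, handle, seen=None):
    """Apply `handle` to each record once, in sequence-number order.

    `records` is a list of (sequence_number, payload) pairs from one shard;
    `seen` holds the sequence numbers that were already processed.
    """
    seen = set() if seen is None else seen
    results = []
    for seq, payload in sorted(records, key=lambda r: r[0]):
        if seq in seen:
            continue  # duplicate delivery after a retry: skip it
        results.append(handle(payload))
        seen.add(seq)
    return results


# Record 2 is delivered twice, as can happen after a retry
batch = [(1, "a"), (2, "b"), (2, "b"), (3, "c")]
print(process_in_order(batch, str.upper))
```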

Full explanation →

929

Multi-Selectmedium

A data scientist is deploying a machine learning model on Amazon SageMaker for real-time inference. The model requires low-latency predictions and must be able to handle up to 1000 requests per second. Which TWO actions should the data scientist take to ensure the endpoint can meet the performance requirements? (Choose 2.)

Select 2 answers

A.Use a multi-model endpoint to host multiple models on the same instance.

B.Enable data capture to Amazon S3 for model monitoring and retraining.

C.Use a serverless inference endpoint to automatically scale.

D.Configure an auto scaling policy for the endpoint based on invocation metrics.

E.Deploy the model on a single large instance (e.g., ml.p3.16xlarge).

AnswersB, D

Data capture logs predictions for monitoring and retraining, which is a best practice.

Why this answer

Option B is correct because enabling data capture to S3 allows model monitoring and retraining. Option D is correct because auto scaling adjusts instances based on load. Option C is wrong because serverless inference has a cold start and max concurrency limits unsuitable for 1000 TPS.

Option E is wrong because increasing instance size alone may not be cost-effective and auto scaling is better. Option A is wrong because multi-model endpoints share resources and may cause contention.

Full explanation →

930

MCQhard

A data scientist is training a binary classification model using SageMaker XGBoost and notices that training loss decreases but validation loss increases after a few epochs. Which action should the data scientist take to address this issue?

A.Increase the number of rounds

B.Set early stopping based on validation loss

C.Increase the learning rate

D.Increase the maximum tree depth

AnswerB

Stops training when validation loss stops improving.

Why this answer

The increasing validation loss while training loss decreases is a classic sign of overfitting. Setting early stopping based on validation loss halts training when the validation loss stops improving, preventing the model from memorizing noise in the training data. SageMaker XGBoost's `early_stopping_rounds` parameter monitors the evaluation metric on the validation set and stops training if no improvement is seen for a specified number of rounds.

Exam trap

AWS often tests the misconception that increasing model complexity (more rounds, deeper trees, higher learning rate) always improves performance, when in fact these actions worsen overfitting when validation loss diverges from training loss.

How to eliminate wrong answers

Option A is wrong because increasing the number of rounds would continue training further, exacerbating overfitting and making validation loss worse. Option C is wrong because increasing the learning rate makes the model converge faster but does not address overfitting; it can actually cause the model to overshoot optimal minima and worsen validation loss. Option D is wrong because increasing the maximum tree depth allows trees to grow deeper, capturing more complex patterns and increasing the risk of overfitting, which is the opposite of what is needed.
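For illustration, a hyperparameter set for the built-in XGBoost algorithm with early stopping enabled might look like the sketch below. Only the parameter names come from SageMaker XGBoost; every value is a hypothetical starting point, and early stopping only takes effect when a validation channel is supplied to the training job.

```python
# Hypothetical values; only the parameter names come from SageMaker XGBoost.
hyperparameters = {
    "objective": "binary:logistic",
    "eval_metric": "logloss",       # metric tracked on the validation channel
    "num_round": "500",             # upper bound on boosting rounds
    "early_stopping_rounds": "10",  # stop after 10 rounds without improvement
    "max_depth": "4",
    "eta": "0.1",
}

for name, value in hyperparameters.items():
    print(f"{name}={value}")
```

Keeping `num_round` high and letting `early_stopping_rounds` decide when to stop is a common pattern, since it avoids hand-picking the number of rounds.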

Full explanation →

931

Multi-Selecthard

A data engineer is designing a data pipeline that ingests data from a relational database into a data lake on Amazon S3. The data must be incrementally loaded daily. Which TWO AWS services can be used together to achieve this?

Select 2 answers

A.AWS Glue

B.Amazon Kinesis Data Streams

C.Amazon Athena

D.Amazon Redshift

E.AWS Database Migration Service (DMS)

AnswersA, E

Glue can use job bookmarks for incremental loads.

Why this answer

AWS Database Migration Service (DMS) can perform continuous replication or scheduled tasks to move data from a relational database to S3. AWS Glue can also connect to databases and run incremental ETL jobs using job bookmarks. Option C (Athena) is for querying, Option B (Kinesis) is for streaming, Option D (Redshift) is a data warehouse.

Full explanation →

932

MCQeasy

A machine learning engineer needs to deploy a TensorFlow model to a SageMaker endpoint. The model expects a specific input format. The engineer has the model artifacts stored in an S3 bucket. Which step is REQUIRED to deploy the model?

A.Register the model in SageMaker Model Registry.

B.Create a SageMaker training job to re-train the model.

C.Save the model as a SavedModel format.

D.Create a SageMaker Model object using the TensorFlow serving image.

AnswerD

A SageMaker Model object is required to specify the container and artifact location for deployment.

Why this answer

Creating a SageMaker Model object (Option D) is required to specify the image and artifact location, which is then used to deploy an endpoint. Option C (saving as SavedModel) is already done. Option A (registering in Model Registry) is optional.

Option B (creating a training job) is not needed for deployment.

Full explanation →

933

MCQeasy

A data scientist is building a regression model to predict house prices. The dataset contains many features, some of which are highly correlated. The model is overfitting. Which regularization technique should the scientist use to penalize large coefficients and perform feature selection?

A.L2 regularization (Ridge)

B.L1 regularization (Lasso)

C.Elastic Net regularization

D.Dropout

AnswerB

L1 regularization can zero out coefficients, performing feature selection.

Why this answer

L1 regularization (Lasso) adds a penalty equal to the absolute value of the coefficients, which can shrink some coefficients to zero, performing feature selection. L2 regularization (Ridge) penalizes squared coefficients but does not zero them out. Elastic Net combines both.

Dropout is for neural networks. Option B: L1 regularization is correct. Option A: L2 regularization does not perform feature selection.

Option C: Elastic Net combines both but L1 alone is simpler for feature selection. Option D: Dropout is not applicable to linear regression.
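The difference between the two penalties is easiest to see in the special case of orthonormal features, where each penalized coefficient has a closed form in terms of the ordinary least squares estimate. The sketch below assumes that special case and made-up coefficients; with correlated features, as in the question, the solutions are not this simple, but the qualitative behavior (L1 zeroes, L2 only shrinks) is the same.

```python
def lasso_shrink(coef, lam):
    """Soft-thresholding: the L1 solution when features are orthonormal."""
    sign = 1.0 if coef > 0 else -1.0
    return sign * max(abs(coef) - lam, 0.0)


def ridge_shrink(coef, lam):
    """Proportional shrinkage: the L2 solution when features are orthonormal."""
    return coef / (1.0 + lam)


# Hypothetical least-squares coefficients and penalty strength
ols = [3.0, -0.4, 0.2, 1.5]
lam = 0.5
lasso = [lasso_shrink(c, lam) for c in ols]
ridge = [ridge_shrink(c, lam) for c in ols]
print("lasso:", lasso)
print("ridge:", ridge)
```

The two small coefficients fall below the threshold and are set to zero by the L1 rule, while the L2 rule keeps every coefficient nonzero.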

Full explanation →

934

Multi-Selecthard

A data scientist is analyzing a dataset and suspects the presence of outliers that could affect the mean and standard deviation. Which TWO methods are robust to outliers for measuring central tendency and dispersion?

Select 2 answers

A.Interquartile range (IQR)

B.Range

C.Standard deviation

D.Median

E.Mean

AnswersA, D

IQR is robust to outliers.

Why this answer

Median and interquartile range (IQR) are robust to outliers. Mean and standard deviation are sensitive to outliers. Range is also sensitive.
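A quick way to see the difference is to add a single extreme value to a small sample and compare the statistics before and after. The numbers below are made up for illustration, and the quartiles use the default method of Python's `statistics.quantiles`.

```python
import statistics


def iqr(values):
    """Interquartile range using the default quartile method of the statistics module."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return q3 - q1


clean = [10, 11, 12, 13, 14, 15, 16]
with_outlier = clean + [500]  # one extreme value

for name, data in [("clean", clean), ("with outlier", with_outlier)]:
    print(
        name,
        "mean:", statistics.mean(data),
        "stdev:", round(statistics.stdev(data), 2),
        "median:", statistics.median(data),
        "IQR:", iqr(data),
    )
```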

Full explanation →

935

MCQhard

A SageMaker endpoint creation fails with the above CloudWatch Logs excerpt. What is the MOST likely cause?

A.The S3 bucket containing the model artifacts has incorrect permissions

B.The inference script has a syntax error

C.The instance type does not have enough memory to load the model

D.The model file is too large and takes longer than 300 seconds to load

AnswerD

The timeout indicates the model loading exceeds the default 300 seconds.

Why this answer

Option D is correct because the model file may be too large to load within the default timeout. Option B (inference script error) would cause a different error. Option A (incorrect S3 permissions) would appear earlier.

Option C (instance type) would cause resource issues but not a timeout on loading.

Full explanation →

936

MCQhard

A data scientist is training a neural network on Amazon SageMaker. The network has many layers and the training is very slow. The scientist suspects that the gradients are vanishing. Which technique is most specifically designed to mitigate the vanishing gradient problem?

A.Use gradient clipping.

B.Use batch normalization.

C.Use data augmentation.

D.Use dropout layers.

AnswerB

Batch normalization reduces internal covariate shift and helps mitigate vanishing gradients.

Why this answer

Batch normalization helps by normalizing the activations, which reduces the problem of vanishing/exploding gradients. Dropout is for regularization. Data augmentation increases data.

Gradient clipping deals with exploding gradients, not vanishing.
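To see why gradients vanish in deep networks, it helps to look at how the derivative of a saturating activation multiplies across layers. The sketch below is a simplification: it assumes sigmoid activations and ignores the weight matrices, so it only illustrates the activation part of the backpropagated product. Part of the usual intuition for batch normalization is that it keeps pre-activations near zero, where the derivative is largest.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x):
    """Derivative of the sigmoid, which peaks at 0.25 when x = 0."""
    s = sigmoid(x)
    return s * (1.0 - s)


def activation_gradient_product(pre_activations):
    """Product of sigmoid derivatives along one path through the network."""
    product = 1.0
    for z in pre_activations:
        product *= sigmoid_grad(z)
    return product


for depth in (2, 10, 30):
    centered = activation_gradient_product([0.0] * depth)   # best case for sigmoid
    saturated = activation_gradient_product([5.0] * depth)  # large pre-activations
    print(depth, centered, saturated)
```

Even in the best case the product shrinks geometrically with depth, and saturated pre-activations make it shrink much faster.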

Full explanation →

937

MCQhard

A company is deploying a real-time inference endpoint using SageMaker. The model has a high memory footprint and requires GPU acceleration. Which instance type and configuration should be used to minimize cost while meeting latency requirements?

A.ml.p3.2xlarge with 1 GPU

B.ml.g4dn.xlarge with 1 GPU

C.ml.c5.xlarge with no GPU

D.ml.p3.16xlarge with 8 GPUs

AnswerA

Good balance of GPU and memory for high-memory models at reasonable cost.

Why this answer

Option A is correct because the p3.2xlarge provides GPU acceleration with sufficient memory for a high-memory model and is cost-effective for real-time inference. Option C is incorrect because ml.c5.xlarge does not have GPU. Option D is incorrect because p3.16xlarge is too large and expensive.

Option B is incorrect because ml.g4dn.xlarge has less memory and may not be suitable for high memory footprint.

Full explanation →

938

Multi-Selecteasy

Which TWO AWS services can be used to deploy a trained model for serverless inference? (Select TWO.)

Select 2 answers

A.AWS Lambda with a container image

B.Amazon SageMaker Serverless Inference

C.Amazon SageMaker batch transform

D.Amazon Elastic Container Service (ECS) with Fargate

E.Amazon EC2 instances

AnswersA, B

Serverless compute for small models.

Why this answer

SageMaker Serverless Inference automatically scales. Lambda can serve models if they fit within its limits. Option C is wrong because SageMaker batch transform is not serverless real-time.

Option D is wrong because ECS is not serverless (requires management). Option E is wrong because EC2 is not serverless.

Full explanation →

939

MCQmedium

An ML team is using Amazon SageMaker to train a model. They notice that the training job is taking longer than expected and the CloudWatch metrics show high GPU utilization but low CPU utilization. Which action is MOST likely to improve training speed?

A.Use SageMaker Pipe mode to stream data from S3 to reduce I/O bottleneck

B.Switch to a CPU-only instance to avoid GPU overhead

C.Increase the number of training instances

D.Use a larger GPU instance with more GPU memory

AnswerA

Pipe mode reduces time spent on data loading, allowing GPU to be more utilized.

Why this answer

Option A is correct because high GPU utilization indicates the GPU is busy, but low CPU may indicate a bottleneck in data loading; using Pipe mode can reduce I/O wait. Option C (increase instance count) may help if the job is parallelizable but not if the bottleneck is data loading. Option D (increase GPU memory) does not address data loading.

Option B (use CPU instance) would slow down training.

Full explanation →

940

Multi-Selectmedium

A data scientist is performing exploratory data analysis on a dataset with 10,000 rows and 20 features. The target variable is binary. The data scientist observes that one feature has 15% missing values. Which TWO actions are appropriate to handle this missing data? (Choose TWO.)

Select 2 answers

A.Replace missing values with the mode of the feature.

B.Identify and remove outliers from the feature.

C.Use multiple imputation to fill in the missing values.

D.Delete all rows that contain missing values for this feature.

E.Drop the entire feature from the dataset.

AnswersC, D

Multiple imputation creates several plausible imputed datasets and combines results.

Why this answer

Option C is correct because multiple imputation is a robust statistical technique that accounts for uncertainty in missing values by creating multiple complete datasets, analyzing each, and pooling results. This is particularly appropriate for a dataset with 10,000 rows and 20 features, as it preserves the sample size and avoids bias that simpler methods might introduce.

Exam trap

AWS often tests the misconception that mode imputation (Option A) is a safe default for missing data, but it ignores feature relationships and can distort distributions, whereas multiple imputation is preferred for non-trivial missingness.

Full explanation →

941

MCQhard

A company is using SageMaker to train a model with a large dataset that is stored in S3. The training job is taking a long time due to high I/O latency. The team has already converted the data to RecordIO format. What should they do next to reduce I/O latency?

A.Use SageMaker fast file mode

B.Use multiple training instances

C.Use Amazon FSx for Lustre as the training data source

D.Shuffle the data before training

E.Use Pipe mode to stream the RecordIO data

AnswerE

Pipe mode avoids disk I/O by streaming data directly from S3.

Why this answer

Option E is correct because using Pipe mode with RecordIO streams data directly, reducing I/O. Option D (shuffle) does not reduce I/O. Option C (FSx for Lustre) provides a high-performance file system but adds complexity.

Option A (fast file mode) still exposes the data through a file system interface. Option B (multiple instances) may increase throughput but not per-instance I/O.

Full explanation →

942

MCQmedium

A company is using Amazon Athena to query a data lake in S3. Queries are slow and expensive. The data is stored as JSON. Which action will improve query performance and reduce cost?

A.Compress the JSON files using gzip

B.Partition the data by date

C.Convert the data to Parquet format

D.Increase the number of Athena workers

AnswerC

Parquet is columnar, reducing scanned data and improving performance.

Why this answer

Option C is correct because converting to columnar formats like Parquet reduces scan volume and improves performance. Option D is wrong because increasing workers is not applicable to Athena (serverless). Option A is wrong because compressing with gzip reduces size but still requires a full scan. Option B is wrong because partitioning helps but columnar format is more impactful.

Full explanation →

943

MCQhard

Refer to the exhibit. A data scientist wants to use SageMaker to train a model using data stored in 'my-bucket'. The training job fails with an access denied error. What is the MOST likely cause?

A.The bucket uses AWS KMS key encryption instead of AES256

B.The bucket name in the policy does not match the actual bucket name

C.The training job is not requesting server-side encryption with AES256

D.The bucket is publicly accessible but the IAM role lacks permissions

AnswerC

The policy denies PutObject if encryption is not AES256, so the job must include the encryption header.

Why this answer

Option C is correct because SageMaker training jobs require server-side encryption with AES256 when the S3 bucket uses default encryption with AES256. If the training job does not explicitly request SSE-S3 (AES256) in its S3 data source configuration, the S3 service denies access, resulting in an 'access denied' error even if the IAM role has full S3 permissions.

Exam trap

The trap here is that candidates often assume 'access denied' always means an IAM permissions issue, but AWS S3 can return 'Access Denied' for encryption policy violations when the request does not match the bucket's default encryption settings.

How to eliminate wrong answers

Option A is wrong because AWS KMS key encryption would cause a different error (e.g., KMS access denied) rather than a generic 'access denied' error, and the question does not mention KMS key permissions. Option B is wrong because a bucket name mismatch would produce a 'NoSuchBucket' error, not an 'access denied' error. Option D is wrong because if the bucket is publicly accessible, the training job would not need IAM role permissions for read access; the error would be a different type (e.g., 403 Forbidden) if the role lacked permissions, but the scenario points to an encryption mismatch.
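Since the exhibit is not reproduced in this section, the following is a hypothetical reconstruction of the kind of bucket policy the answer describes, together with a toy check that mimics how its condition treats an upload request. The statement ID and the helper function are invented for illustration; only the bucket name `my-bucket` comes from the question.

```python
# Hypothetical reconstruction; the actual exhibit is not reproduced in this section.
bucket_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DenyUnencryptedUploads",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:PutObject",
            "Resource": "arn:aws:s3:::my-bucket/*",
            "Condition": {
                "StringNotEquals": {"s3:x-amz-server-side-encryption": "AES256"}
            },
        }
    ],
}


def upload_is_denied(request_headers):
    """Mimic the condition check for a PutObject request (illustration only).

    A missing header also fails the check, because a negated string
    condition matches when the key is absent from the request.
    """
    condition = bucket_policy["Statement"][0]["Condition"]["StringNotEquals"]
    required = condition["s3:x-amz-server-side-encryption"]
    return request_headers.get("x-amz-server-side-encryption") != required


print(upload_is_denied({}))                                          # no encryption header
print(upload_is_denied({"x-amz-server-side-encryption": "AES256"}))  # SSE-S3 requested
```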

Full explanation →

944

MCQmedium

A company runs a machine learning pipeline on Amazon SageMaker. The pipeline consists of three steps: data preprocessing (using a custom container), training (using a built-in algorithm), and model evaluation (using a custom container). The pipeline is orchestrated using AWS Step Functions. Recently, the pipeline has been failing intermittently at the model evaluation step with a 'TimeoutError'. The evaluation step runs a Python script that loads the trained model and a test dataset from S3, computes metrics, and writes results back to S3. The step is configured with a timeout of 600 seconds. The test dataset size has grown over time. The data science team suspects that the timeout is due to the increased data size. They want a solution that minimizes changes to the existing infrastructure and avoids increasing the timeout arbitrarily. Which approach should the team take?

A.Increase the timeout to 1200 seconds and use a larger instance type for the evaluation step.

B.Increase the timeout to 1800 seconds to accommodate the larger dataset.

C.Modify the evaluation script to process the test dataset in parallel batches, and use multiprocessing to distribute the workload within the same container.

D.Switch the evaluation step to use the 'ml.m5.4xlarge' instance type for more memory and compute.

AnswerC

Reduces wall-clock time without increasing timeout or instance size.

Why this answer

Option C is correct because it addresses the root cause—the evaluation script's inability to process the growing dataset within the 600-second timeout—by parallelizing the workload within the same container. This approach minimizes infrastructure changes (no instance type or timeout increase) and leverages Python's multiprocessing to reduce wall-clock time, directly tackling the 'TimeoutError' without arbitrary timeout extensions.

Exam trap

The trap here is that candidates often default to scaling up infrastructure (larger instances or higher timeouts) instead of optimizing the code, which is a classic 'throw hardware at the problem' misconception that the MLS-C01 exam tests by rewarding efficient, cost-conscious solutions.

How to eliminate wrong answers

Option A is wrong because increasing both timeout and instance type is an over-engineered solution that introduces unnecessary cost and complexity, and it does not address the underlying inefficiency in processing the dataset sequentially. Option B is wrong because simply increasing the timeout to 1800 seconds is a temporary band-aid that does not fix the performance bottleneck; as the dataset continues to grow, the timeout will need to be increased again, leading to an unsustainable pattern. Option D is wrong because switching to a larger instance type (ml.m5.4xlarge) only provides more memory and compute but does not change the sequential processing logic; the script will still take the same amount of time (or only marginally less) and may still hit the timeout if the dataset is large enough.
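A minimal sketch of the batching approach might look like the following. All function names are hypothetical, and the per-batch metric (simple accuracy over prediction/label pairs) is a stand-in for whatever the real evaluation script computes; the point is only that per-batch results are small and can be merged after the workers finish.

```python
from multiprocessing import Pool


def split_batches(items, batch_size):
    """Cut the dataset into consecutive batches of at most `batch_size` items."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


def evaluate_batch(batch):
    """Return (correct, total) for one batch of (prediction, label) pairs."""
    correct = sum(1 for pred, label in batch if pred == label)
    return correct, len(batch)


def merge(results):
    """Combine per-batch counts into an overall accuracy."""
    correct = sum(c for c, _ in results)
    total = sum(t for _, t in results)
    return correct / total if total else 0.0


def evaluate_parallel(pairs, batch_size=1000, workers=4):
    """Evaluate batches in separate worker processes inside the same container."""
    batches = split_batches(pairs, batch_size)
    with Pool(processes=workers) as pool:
        return merge(pool.map(evaluate_batch, batches))
```

In a script, `evaluate_parallel(pairs)` would be called from under an `if __name__ == "__main__":` guard so that worker processes can import the module safely. The speedup depends on how many cores the instance has and on whether the work is CPU-bound rather than limited by reading from S3.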

Full explanation →

945

MCQeasy

A startup is using SageMaker to train a model using the built-in XGBoost algorithm. The training job runs successfully but the resulting model performs poorly on the test data. The data scientist suspects overfitting. The training data is relatively small (10,000 rows). Which action should be taken to reduce overfitting?

A.Decrease the number of trees (num_round) to 50

B.Increase the learning rate to 0.3

C.Increase the number of trees (num_round) to 500

D.Use a larger instance type

AnswerA

Fewer trees reduce overfitting.

Why this answer

Option A is correct because reducing the number of trees (or early stopping) reduces overfitting. Option C is wrong because increasing trees increases overfitting. Option B is wrong because increasing learning rate may cause divergence.

Option D is wrong because using a larger instance does not affect overfitting.

Full explanation →

946

MCQmedium

A data scientist is using Amazon SageMaker to train a neural network. The training job fails with the error 'ResourceLimitExceeded: The account-level service limit for ml.p3.8xlarge for training job usage is 0.' What is the most likely cause and solution?

A.The training job is using spot instances; switch to on-demand instances.

B.The instance type is not available in the current region; switch to a different region.

C.The account has not requested a limit increase for ml.p3.8xlarge; submit a limit increase request via AWS Support.

D.The instance type is too large; use a smaller instance type like ml.m5.large.

AnswerC

ResourceLimitExceeded indicates the current limit is zero; a limit increase is needed.

Why this answer

The error message explicitly states that the account-level service limit for ml.p3.8xlarge for training job usage is 0, which means the account has not been granted any capacity for that instance type. AWS enforces service quotas (limits) per account per region, and for GPU-intensive instances like ml.p3.8xlarge, the default limit is often 0 unless a limit increase request has been submitted and approved. Therefore, the correct solution is to request a limit increase via AWS Support.

Exam trap

The trap here is that candidates may confuse a service limit error with instance availability or spot instance issues, but the specific phrase 'limit is 0' directly points to an unrequested quota increase, not a regional or pricing model problem.

How to eliminate wrong answers

Option A is wrong because the error is about a service limit of 0, not about spot instance availability; switching to on-demand instances would still fail because the limit applies to both spot and on-demand training jobs. Option B is wrong because the error message does not indicate regional unavailability; it specifically cites a limit of 0, meaning the instance type exists in the region but the account has no quota. Option D is wrong because the error is not about instance size or resource exhaustion; using a smaller instance type like ml.m5.large would avoid the GPU limit but does not address the root cause of the ml.p3.8xlarge limit being 0.

Full explanation →

947

MCQhard

A company runs a real-time fraud detection pipeline using Amazon Kinesis Data Analytics. The pipeline reads from a Kinesis data stream, performs sliding window aggregations, and writes results to a DynamoDB table. The application is experiencing high latency during peak hours. Which action would MOST effectively reduce latency?

A.Enable DynamoDB auto scaling to handle write spikes.

B.Decrease the parallelism level in the Kinesis Data Analytics application.

C.Increase the number of shards in the Kinesis data stream.

D.Increase the sliding window size to reduce computational frequency.

AnswerC

More shards increase parallelism and reduce processing backlog.

Why this answer

Option C is correct because increasing the number of shards increases parallelism in both the stream and the Kinesis Data Analytics application. Option A is wrong because DynamoDB auto scaling helps with writes but not streaming latency. Option D is wrong because a larger window increases latency.

Option B is wrong because decreasing parallelism reduces throughput.

Full explanation →

948

MCQmedium

During EDA, a data scientist finds that two features have a Pearson correlation coefficient of 0.95. What is the primary concern when using these features together in a linear regression model?

A.The model will underfit because of redundant information

B.Heteroscedasticity will be introduced

C.The model will overfit due to redundant features

D.Multicollinearity will make coefficient estimates unstable

AnswerD

High correlation between predictors leads to multicollinearity, increasing standard errors.

Why this answer

Option D is correct because high correlation indicates multicollinearity, which can destabilize coefficient estimates in linear regression. Option C is wrong because overfitting is not directly caused by correlation. Option A is wrong because high correlation does not cause underfitting.

Option B is wrong because heteroscedasticity is unrelated to correlation.
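One common way to quantify this instability is the variance inflation factor (VIF). The sketch below covers only the simplest case, assumed here for illustration: a model with exactly two predictors, where the R-squared from regressing one predictor on the other equals the squared Pearson correlation.

```python
def vif_two_predictors(r):
    """VIF when the only other predictor has Pearson correlation r with this one."""
    return 1.0 / (1.0 - r ** 2)


for r in (0.0, 0.5, 0.95):
    print(r, round(vif_two_predictors(r), 2))
```

At a correlation of 0.95 the coefficient variance is inflated roughly tenfold compared with uncorrelated predictors, which is why the estimates become unstable.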

Full explanation →

949

MCQmedium

A data scientist is tuning a neural network on a small dataset and observes that the training loss decreases but validation loss increases after a few epochs. Which technique should be applied to mitigate this issue?

A.Add dropout layers to the model.

B.Increase the learning rate.

C.Remove regularization terms from the loss function.

D.Increase the number of epochs.

AnswerA

Dropout randomly drops neurons, reducing overfitting.

Why this answer

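Training loss that keeps decreasing while validation loss increases indicates overfitting. Dropout is a regularization technique that reduces overfitting. Increasing the learning rate does not target overfitting, removing regularization terms makes it worse, and training for more epochs gives the model more time to memorize the training set.

Techniques not listed here behave differently: learning rate decay may help convergence but not specifically overfitting, batch normalization helps training stability, and data augmentation is useful but not always applicable.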

Full explanation →

950

MCQhard

A data scientist is analyzing a dataset with high cardinality categorical features (e.g., user IDs with millions of unique values). They want to visualize the relationship between these categorical features and a continuous target variable. Which approach is most effective for EDA?

A.Group rare categories into an 'Other' category and use box plots

B.Apply one-hot encoding and use scatter plots

C.Use a bar chart with all categories on x-axis

D.Remove the categorical features from analysis

E.Apply feature hashing and visualize the hashed values

AnswerA

Grouping reduces cardinality and box plots effectively show relationship with target.

Why this answer

For high cardinality categorical features, grouping rare categories into an 'Other' category reduces cardinality and allows meaningful visualizations like box plots. Option D is wrong because removing the feature loses information. Option B is wrong because one-hot encoding creates too many columns and is not suitable for visualization.

Option C is wrong because visualizing millions of categories is not feasible. Option E is wrong because feature hashing is for modeling, not EDA visualization.

Full explanation →

951

MCQmedium

A data scientist is training a binary classification model on imbalanced data (95% negative, 5% positive). The model achieves 95% accuracy on the test set but fails to detect any positive cases. Which metric should the scientist focus on to evaluate model performance?

A.Accuracy

B.Recall

C.RMSE

D.Precision

AnswerB

Recall measures the proportion of actual positives correctly identified.

Why this answer

Option B is correct because recall (true positive rate) measures the ability to find all positive samples, which is critical for imbalanced datasets where accuracy can be misleading. Option A is wrong because accuracy is high but misleading. Option D is wrong because precision alone doesn't capture the missed positives.

Option C is wrong because RMSE is for regression.

Full explanation →

952

MCQeasy

A data scientist is training a deep learning model for image classification using Amazon SageMaker. The training job is taking too long. The data scientist wants to use distributed training across multiple GPUs to speed up the process. Which SageMaker feature should the data scientist use?

A.SageMaker Distributed Training Libraries

B.SageMaker Managed Spot Training

C.SageMaker Hyperparameter Tuning

D.SageMaker Automatic Model Tuning

AnswerA

Distributed training libraries enable training across multiple GPUs, reducing wall-clock time.

Why this answer

SageMaker Distributed Training Libraries provide optimized implementations of data parallelism and model parallelism that automatically partition the model and data across multiple GPUs, reducing training time for deep learning models. This is the correct choice because the question specifically asks for a feature to enable distributed training across multiple GPUs, which is exactly what these libraries are designed for.

Exam trap

The trap here is that candidates often confuse cost-saving features (like Spot Training) with performance-optimization features (like distributed training), or they mistakenly think hyperparameter tuning can parallelize a single training job across GPUs.

How to eliminate wrong answers

Option B is wrong because SageMaker Managed Spot Training reduces cost by using spare EC2 capacity, not by distributing training across multiple GPUs; it does not inherently speed up training. Option C is wrong because SageMaker Hyperparameter Tuning automates the search for optimal hyperparameters, but it does not distribute a single training job across multiple GPUs. Option D is wrong because SageMaker Automatic Model Tuning is another name for hyperparameter tuning (same as option C) and does not provide distributed training capabilities.

953

MCQeasy

A data scientist is training a binary classification model for fraud detection. The dataset is highly imbalanced with only 1% fraudulent transactions. The model currently achieves 99% accuracy but only catches 5% of actual fraud cases. Which metric should the data scientist focus on to better evaluate model performance?

A.Precision

B.Accuracy

C.Root Mean Squared Error (RMSE)

D.Recall

AnswerD

Recall measures the ability to find all positive samples, which is crucial for fraud detection.

Why this answer

In fraud detection with highly imbalanced data (1% fraud), accuracy is misleading because a model can achieve 99% accuracy by simply predicting 'not fraud' for all transactions. Recall (true positive rate) measures the proportion of actual fraud cases correctly identified, which is critical when the cost of missing fraud is high. The model currently catches only 5% of fraud, so improving recall is the primary goal to reduce false negatives.
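To make the gap between accuracy and recall concrete, the sketch below computes both from a confusion matrix. The counts are hypothetical, chosen only to match the numbers in the scenario (10,000 transactions, 1% fraud, 5% of fraud caught, 99% accuracy).

```python
# Hypothetical confusion-matrix counts matching the scenario:
# 10,000 transactions, 100 of them fraudulent (1%), 5 of those caught.
tp, fn = 5, 95        # fraud caught / fraud missed
fp, tn = 5, 9895      # legitimate flagged / legitimate passed

total = tp + fn + fp + tn
accuracy = (tp + tn) / total      # dominated by the 9,895 true negatives
recall = tp / (tp + fn)           # share of actual fraud that was caught
precision = tp / (tp + fp)        # share of fraud alerts that were real

print(f"accuracy={accuracy:.2%} recall={recall:.2%} precision={precision:.2%}")
```

Accuracy stays at 99% even though 95 of the 100 fraud cases slip through, which is why recall is the metric to watch here.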

Exam trap

AWS often tests the misconception that accuracy is always the best metric, but in imbalanced classification, recall or precision-recall curves are more informative, and candidates must recognize that high accuracy can mask poor minority class performance.

How to eliminate wrong answers

Option A is wrong because precision measures the proportion of predicted fraud cases that are actually fraud, which is not the primary concern when the model misses 95% of actual fraud; precision focuses on false positives, not false negatives. Option B is wrong because accuracy is dominated by the majority class (99% non-fraud) and does not reflect the model's poor performance on the minority fraud class; a model can have high accuracy while failing to detect fraud. Option C is wrong because RMSE is a regression metric that measures the average magnitude of errors in continuous predictions, not suitable for evaluating binary classification performance, especially with imbalanced classes.

954

Multi-Selectmedium

Which THREE of the following are common issues that can be identified during exploratory data analysis? (Select THREE.)

Select 3 answers

A.Multicollinearity between features

B.High latency in API endpoints

C.Gradient vanishing in neural networks

D.Class imbalance in the target variable

E.Missing values in features

AnswersA, D, E

High correlation between features can be detected via correlation matrix.

Why this answer

Multicollinearity occurs when two or more features in a dataset are highly correlated, meaning they contain redundant information. During exploratory data analysis (EDA), correlation matrices and variance inflation factor (VIF) calculations can reveal this issue, which can destabilize linear regression models and inflate coefficient standard errors. Class imbalance (Option D) and missing values (Option E) are likewise properties of the data itself and show up in target value counts and per-column null counts.

Exam trap

AWS often tests the boundary between data-level issues (EDA) and model training issues, so candidates mistakenly select gradient vanishing (a deep learning optimization problem) or API latency (an operational concern) as EDA findings.

955

Multi-Selectmedium

A data scientist is performing EDA on a dataset with both numeric and categorical features. Which TWO techniques are appropriate for visualizing the relationship between a numeric feature and a binary categorical target?

Select 2 answers

A.Histogram

B.Stacked bar chart

C.Violin plot grouped by target

D.Box plot grouped by target

E.Scatter plot

AnswersC, D

Violin plots show distribution and density across categories.

Why this answer

Option D is correct because a box plot grouped by target shows the distribution of a numeric feature within each category. Option C is correct because a violin plot adds a density estimate to the box plot's summary. Option B is wrong because stacked bar charts compare one categorical variable against another.

Option A is wrong because a histogram shows the distribution of a single variable. Option E is wrong because scatter plots relate two numeric variables.

956

MCQhard

A data scientist is working with a dataset that has imbalanced classes (1% positive). They want to explore the data before modeling. Which visualization technique is most appropriate to understand the distribution of features with respect to the target class?

A.Box plots grouped by class

B.Parallel coordinates plot

C.Histograms overlaid by class

D.Scatter plot matrix

AnswerB

Correct: Parallel coordinates can display multiple features and highlight class separations.

Why this answer

Option B is correct because a parallel coordinates plot can show feature patterns for the minority versus the majority class across many dimensions. Option D is wrong because scatter plot matrices become cluttered with many features. Option C is wrong because histograms are univariate and do not show interactions.

Option A is wrong because box plots are univariate.

957

MCQmedium

Refer to the exhibit. An IAM policy is attached to a SageMaker notebook instance. Which action will the notebook be able to perform?

A.Create a training job

B.Create a model

C.Read data from S3

D.Invoke a SageMaker endpoint

AnswerD

The policy explicitly allows sagemaker:InvokeEndpoint.

Why this answer

Option D is correct because the policy allows sagemaker:InvokeEndpoint. Option A is wrong because sagemaker:CreateTrainingJob is not allowed. Option C is wrong because s3:GetObject is not allowed.

Option B is wrong because sagemaker:CreateModel is not allowed.

958

MCQeasy

A data scientist is exploring a dataset with a column 'transaction_date'. They want to create features for day of week and month. What is the correct AWS service to schedule a recurring ETL job for this transformation?

A.Amazon Athena

B.AWS Glue

C.Amazon SageMaker

D.AWS Lambda

AnswerB

Glue is a managed ETL service.

Why this answer

Option B is correct because AWS Glue is a serverless ETL service whose jobs can run on a schedule. Option C is wrong because SageMaker is for ML modeling. Option A is wrong because Athena is for querying.

Option D is wrong because Lambda runs serverless functions but is not a full ETL service.
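The transformation itself is small. The sketch below shows it in plain Python, independent of any AWS service; the function name `date_features` and the ISO date format are assumptions for illustration, and a real Glue job would typically express the same logic in Spark.

```python
from datetime import date

def date_features(iso_date: str) -> dict:
    """Derive day-of-week and month features from an ISO 'YYYY-MM-DD' string."""
    d = date.fromisoformat(iso_date)
    # date.weekday() numbers days from Monday=0 to Sunday=6.
    return {"day_of_week": d.weekday(), "month": d.month}

# 2024-03-15 fell on a Friday, so day_of_week is 4.
print(date_features("2024-03-15"))
```

Day-numbering conventions differ between tools, so it is worth confirming which day counts as the first before mixing features from several sources.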

959

MCQeasy

A data scientist is exploring a dataset and wants to identify outliers in a numerical feature. The feature is not normally distributed. Which technique is robust to non-normal distributions?

A.Compute the Median Absolute Deviation (MAD) and flag values with MAD > 3.

B.Use the IQR method: flag values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR.

C.Calculate the Z-score and flag values with |Z| > 3.

D.Flag values more than 3 standard deviations from the mean.

AnswerB

Does not assume normality; uses robust quartiles.

Why this answer

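The interquartile range (IQR) method (Option B) does not assume normality and relies only on percentiles. Option C (Z-score) assumes normality. Option D (mean ± 3σ) also assumes normality.

Option A (MAD) rests on a robust statistic, but the rule as worded is ill-defined: MAD is a single number for the whole feature, so it cannot be compared with 3 value by value. The IQR rule is the more common choice for non-normal data.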
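A minimal sketch of the IQR rule using only the standard library follows. The function name `iqr_outliers` and the sample values are hypothetical, and the quartiles use the `inclusive` method of `statistics.quantiles`; other quartile conventions shift the fences slightly.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return values below Q1 - k*IQR or above Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical skewed sample with one extreme value.
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
print(iqr_outliers(sample))
```

Because the fences depend on quartiles, the single extreme value does not widen them the way it would inflate a mean and standard deviation.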

960

MCQmedium

A company is deploying a real-time fraud detection model using Amazon SageMaker. The model must make predictions in under 100 milliseconds. The data scientist uses a pre-trained XGBoost model and deploys it to a SageMaker endpoint with an ml.c5.xlarge instance. After load testing, the average latency is 150 ms. Which action should the data scientist take to reduce latency?

A.Reduce the number of trees in the XGBoost model

B.Deploy multiple instances behind a load balancer

C.Enable SageMaker Neo to compile the model for the target instance

D.Use a larger instance type to increase compute capacity

AnswerC

Neo optimization can reduce inference latency by optimizing the model for the hardware.

Why this answer

Option C is correct because SageMaker Neo optimizes trained models for the target hardware platform by compiling them into an efficient runtime. This reduces inference latency without changing the model architecture, making it ideal for meeting the sub-100ms requirement when the current latency is 150ms on an ml.c5.xlarge instance.

Exam trap

The trap here is that candidates often confuse scaling out (Option B) or scaling up (Option D) with latency reduction, but these primarily address throughput or resource contention, not the per-request inference time on a single instance.

How to eliminate wrong answers

Option A is wrong because reducing the number of trees in the XGBoost model would degrade model accuracy and is not a targeted latency optimization technique; it may also not achieve the required latency reduction without significant accuracy loss. Option B is wrong because deploying multiple instances behind a load balancer improves throughput and availability but does not reduce per-request latency; it may even add network overhead. Option D is wrong because using a larger instance type increases compute capacity but does not guarantee lower latency for a single inference request; it may also increase cost without addressing the root cause of model execution inefficiency.

961

Multi-Selecteasy

Which TWO of the following are valid methods for handling missing values in a dataset before training a machine learning model?

Select 2 answers

A.Remove rows that contain missing values

B.Use a decision tree algorithm that handles missing values internally

C.Increase the number of trees in a random forest

D.Replace missing values with zero

E.Impute missing values with the mean of the column

AnswersA, E

If the proportion of missing data is small, dropping rows is a valid option.

Why this answer

Option A is correct because removing rows with missing values (listwise deletion) is a straightforward and valid method when the missing data is random and the dataset is large enough that the loss of rows does not significantly reduce statistical power or introduce bias. This approach ensures that only complete cases are used for training, avoiding the need to estimate missing values. Option E is also correct: mean imputation fills each gap with the column's average, which keeps every row but understates the feature's variance.
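Listwise deletion and mean imputation can be compared side by side on a small example. The rows, column names, and the helper `impute_mean` below are hypothetical and serve only to illustrate the two approaches.

```python
from statistics import mean

# Hypothetical rows with gaps marked as None.
rows = [
    {"age": 30.0, "income": 50.0},
    {"age": None, "income": 60.0},
    {"age": 40.0, "income": None},
    {"age": 50.0, "income": 70.0},
]

# Listwise deletion: keep only rows with no missing value.
complete = [r for r in rows if all(v is not None for v in r.values())]

def impute_mean(rows, col):
    """Replace None in one column with the mean of its observed values."""
    fill = mean(r[col] for r in rows if r[col] is not None)
    return [{**r, col: fill if r[col] is None else r[col]} for r in rows]

imputed = impute_mean(impute_mean(rows, "age"), "income")

print(len(complete), imputed[1]["age"], imputed[2]["income"])
```

Deletion halves this tiny dataset, while imputation keeps all four rows at the cost of inventing two values.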

Exam trap

AWS often tests the misconception that decision tree algorithms inherently handle missing values without any preprocessing, but in practice, they require explicit handling (e.g., surrogate splits) and do not automatically resolve missing data for all model training scenarios.

962

MCQmedium

A data scientist is using Amazon SageMaker to train a model and wants to use a custom Docker container for training. The container requires access to a private Amazon ECR repository. Which IAM role configuration is needed?

A.Attach an IAM policy to the SageMaker execution role that allows ecr:GetDownloadUrlForLayer, ecr:BatchGetImage, and ecr:GetAuthorizationToken for the ECR repository.

B.Use the AWS account owner's IAM role as the SageMaker execution role.

C.Create a new IAM user with ECR access and store credentials in SageMaker.

D.Add a bucket policy to the ECR repository allowing access from the SageMaker execution role.

AnswerA

These permissions allow SageMaker to pull the container image.

Why this answer

Option A is correct because the SageMaker execution role must have an IAM policy that allows ecr:GetDownloadUrlForLayer, ecr:BatchGetImage, and ecr:GetAuthorizationToken for the ECR repository. Option D is wrong because bucket policies are an S3 construct; an ECR repository does not have one. Option B is wrong because SageMaker does not use the account owner's role.

Option C is wrong because the permissions belong on the SageMaker execution role, not on a separate IAM user with stored credentials.
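A policy document granting these three actions might look like the sketch below, built as a Python dictionary and printed as JSON. The repository ARN (region, account ID, and name) is a hypothetical placeholder. One detail worth noting: ecr:GetAuthorizationToken does not support resource-level scoping, so it sits in its own statement with `"Resource": "*"`. Real-world policies often add further actions such as ecr:BatchCheckLayerAvailability.

```python
import json

# Hypothetical repository ARN; replace region, account ID, and name.
REPO_ARN = "arn:aws:ecr:us-east-1:123456789012:repository/my-training-image"

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            # Token retrieval cannot be limited to a single repository.
            "Effect": "Allow",
            "Action": "ecr:GetAuthorizationToken",
            "Resource": "*",
        },
        {
            # Image pull actions, scoped to the one repository.
            "Effect": "Allow",
            "Action": ["ecr:GetDownloadUrlForLayer", "ecr:BatchGetImage"],
            "Resource": REPO_ARN,
        },
    ],
}

print(json.dumps(policy, indent=2))
```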

963

MCQmedium

A machine learning engineer is deploying a custom XGBoost model for real-time inference on Amazon SageMaker. The model was trained using the SageMaker XGBoost built-in algorithm. The endpoint is deployed with an ml.m5.large instance and is receiving around 50 requests per second. The engineer notices that the endpoint's latency is around 200 ms, but the requirement is under 100 ms. The model's serialized format is a .tar.gz file. The engineer wants to reduce inference latency without modifying the model or retraining. What should the engineer do?

A.Configure SageMaker Debugger to optimize the inference code.

B.Use SageMaker Elastic Inference to attach an accelerator.

C.Use SageMaker Neo to compile the model for the target instance.

D.Use SageMaker Batch Transform instead of a real-time endpoint.

AnswerC

Neo optimizes model for faster inference on specific hardware.

Why this answer

Option C is correct because SageMaker Neo optimizes trained models for target hardware, reducing latency. Option D is wrong because SageMaker Batch Transform is for batch, not real-time, inference. Option A is wrong because SageMaker Debugger monitors training, not inference.

Option B is wrong because Elastic Inference attaches GPU acceleration for deep learning models, not for XGBoost, which is tree-based.

964

Multi-Selecthard

A company is training a deep learning model for object detection using Amazon SageMaker. The training job is taking too long. Which THREE actions can reduce training time?

Select 3 answers

A.Use distributed training with multiple GPUs

B.Use a larger instance type with more vCPUs

C.Use SageMaker managed spot training

D.Use a smaller batch size initially and increase gradually (warm-up)

E.Increase the number of epochs

AnswersA, C, D

Distributed training parallelizes the workload, reducing training time.

Why this answer

Option A is correct because distributed training with multiple GPUs (e.g., using SageMaker's distributed data parallelism or model parallelism) splits the workload across multiple devices, reducing wall-clock time per epoch. This leverages Horovod or SageMaker's own distributed training libraries to synchronize gradients efficiently, directly addressing the long training time.

Exam trap

AWS often tests the misconception that increasing CPU cores (Option B) or epochs (Option E) will speed up training, when in reality deep learning is GPU-bound and more epochs increase time.

965

Multi-Selecthard

Which TWO techniques are used to handle missing values in a dataset before training? (Choose 2.)

Select 2 answers

A.Mean or median imputation.

B.Min-max scaling.

C.Removing rows or columns with missing values.

D.One-hot encoding.

E.Principal component analysis (PCA).

AnswersA, C

Imputation replaces missing values with central tendency.

Why this answer

Option A is correct because imputation fills missing values with the mean or median. Option C is correct because removing rows or columns with missing data is a valid approach. Option B is wrong because min-max scaling rescales numerical features and does nothing about missing values.

Option D is wrong because one-hot encoding is for categorical variables. Option E is wrong because PCA is for dimensionality reduction.

966

MCQeasy

A company uses Amazon SageMaker to train a linear regression model on a dataset with 10 million rows and 50 features. The training job takes 8 hours to complete. A data scientist wants to reduce the training time to under 2 hours without changing the dataset size or the model algorithm. The SageMaker instance type currently used is ml.m5.2xlarge. Which action should the data scientist take to achieve the desired training time?

A.Change the instance type to ml.p3.2xlarge (GPU instance).

B.Change the instance type to ml.m5.4xlarge (double the vCPUs and memory).

C.Reduce the number of features from 50 to 25.

D.Use SageMaker's distributed training with 4 ml.m5.2xlarge instances.

AnswerD

Distributed training parallelizes computation across instances, significantly reducing training time.

Why this answer

Option D is correct because distributed training across four ml.m5.2xlarge instances parallelizes the computation, reducing wall-clock time. Option B (ml.m5.4xlarge) adds compute but at best halves the time, from 8 hours to about 4. Option A (ml.p3.2xlarge, a GPU instance) is not optimal for linear regression, which is CPU-bound.

Option C (reducing features) changes the dataset, which the question rules out.

967

MCQhard

A data scientist is using Amazon Athena to query a CSV file stored in S3. The above error occurs. What is the most likely cause?

A.The CSV file uses a different delimiter than comma.

B.The CSV file is missing a header row.

C.The CSV file is too large for Athena to process.

D.The CSV file has inconsistent number of columns in some rows.

AnswerD

The error indicates row 1502 has 5 fields while header has 4.

Why this answer

Option D is correct because the error clearly states that a row has more fields than the header. Option A is wrong because the error reports a field-count mismatch in one specific row, not a delimiter problem. Option C is wrong because the error points to a particular row's field count, not to the file's size.

Option B is wrong because the header is present and read correctly.
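Rows with an inconsistent field count can be located before the file ever reaches Athena. The sketch below uses Python's `csv` module; the function name `find_ragged_rows` and the sample data are hypothetical.

```python
import csv
import io

def find_ragged_rows(text, delimiter=","):
    """Return (line_number, field_count) for rows that do not match the header."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    expected = len(next(reader))  # the header defines the expected width
    # The header is line 1, so data rows are numbered from 2.
    return [
        (line_no, len(row))
        for line_no, row in enumerate(reader, start=2)
        if len(row) != expected
    ]

# Hypothetical sample: a 4-column header and one row with a fifth field.
sample = "id,name,amount,date\n1,a,10,2024-01-01\n2,b,20,2024-01-02,extra\n"
print(find_ragged_rows(sample))
```

The line numbers assume one record per physical line; quoted fields that contain newlines would need `reader.line_num` instead.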

968

MCQmedium

A data scientist is training a neural network on a dataset with 1 million images. The training loss decreases steadily but the validation loss starts to increase after 10 epochs. Which action should the scientist take to improve generalization?

A.Implement early stopping

B.Add more layers to the network

C.Reduce the learning rate

D.Increase the number of epochs

AnswerA

Early stopping prevents overfitting by halting training when validation loss increases.

Why this answer

A validation loss that rises while the training loss keeps falling indicates overfitting. Option A is correct because early stopping halts training once validation loss stops improving, which addresses the problem directly.

Option B is wrong because adding more layers increases capacity and could worsen overfitting. Option C is wrong because reducing the learning rate might help but does not address the issue as directly. Option D is wrong because training for more epochs would worsen overfitting.
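Early stopping is usually implemented with a patience counter. The sketch below is a framework-independent illustration; the function name `early_stopping_epoch`, the patience of 3, and the loss values are all hypothetical.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return the 1-based epoch at which training would stop, or None."""
    best = float("inf")
    bad_epochs = 0  # consecutive epochs without a new best validation loss
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            bad_epochs = 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return epoch
    return None

# Hypothetical validation losses: improving through epoch 10, then rising.
losses = [1.0, 0.9, 0.8, 0.7, 0.6, 0.55, 0.5, 0.48, 0.46, 0.45,
          0.47, 0.5, 0.54, 0.6]
print(early_stopping_epoch(losses))
```

With a patience of 3, training stops at epoch 13, three epochs after the best validation loss at epoch 10; real implementations typically also restore the weights saved at that best epoch.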

969

MCQeasy

A data scientist wants to use a linear regression model to predict house prices. After training, the model shows high bias and low variance. Which action would most likely improve the model's performance?

A.Add polynomial features to capture non-linear relationships.

B.Increase L2 regularization strength.

C.Use a simpler model, such as linear regression without interaction terms.

D.Reduce the amount of training data.

AnswerA

Increasing model complexity reduces bias by better fitting the data.

Why this answer

Option A is correct because high bias indicates underfitting, and increasing model complexity (e.g., adding polynomial features or using a more complex algorithm) can reduce bias. Option B is wrong because stronger L2 regularization increases bias. Option D is wrong because reducing training data can increase variance and does not reduce bias.

Option C is wrong because using a simpler model would increase bias.

970

MCQeasy

A data scientist has a dataset with 500 features and wants to reduce dimensionality for visualization. Which technique is most appropriate for identifying the two components that capture the most variance?

A.t-Distributed Stochastic Neighbor Embedding (t-SNE)

B.Linear Discriminant Analysis (LDA)

C.Principal Component Analysis (PCA)

D.K-means clustering

AnswerC

PCA projects data onto directions of maximum variance.

Why this answer

Option C is correct because PCA is designed to find principal components that maximize variance. Option A is wrong because t-SNE is used for visualization but is not variance-based; it focuses on preserving local structure. Option B is wrong because LDA is supervised and requires labels.

Option D is wrong because K-means is clustering, not dimensionality reduction.

971

MCQmedium

A data scientist is training a binary classification model on an imbalanced dataset where the positive class represents only 5% of the data. The model currently achieves 95% accuracy but only 10% recall on the positive class. Which metric should the scientist focus on to improve the model's ability to detect the positive class?

A.Recall

B.Accuracy

C.Precision

D.AUC-ROC

AnswerA

Recall measures the proportion of actual positives correctly identified.

Why this answer

Option A is correct because recall measures the ability to find all positive samples, which is the key issue in this imbalanced dataset. Option B is wrong because accuracy is misleading when classes are imbalanced. Option C is wrong because precision does not directly address the low recall.

Option D is wrong because AUC-ROC is a global metric that may not reflect the improvement in recall.

972

Multi-Selecthard

A machine learning team is using Amazon SageMaker to train a model with a large dataset stored in S3. The training job is taking too long. Which THREE of the following actions can reduce training time? (Choose three.)

Select 3 answers

A.Decrease the batch size.

B.Use a GPU instance with more powerful GPUs.

C.Use distributed training with multiple instances.

D.Use Pipe input mode instead of File mode for the training data.

E.Increase the batch size.

AnswersB, C, D

Faster GPUs reduce computation time.

Why this answer

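Using a GPU instance with more powerful GPUs (Option B) reduces training time because it increases the parallel compute capacity for matrix operations, which are the core of deep learning. Amazon SageMaker allows you to select instances like ml.p3.16xlarge with NVIDIA V100 GPUs, which offer significantly higher FLOPS compared to smaller GPU instances, directly accelerating model training. Options C and D are also correct: distributed training splits the work across instances, and Pipe input mode streams data from S3 so that training starts without first downloading the full dataset.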

Exam trap

The trap here is that candidates often confuse batch size adjustments as a primary performance lever, but the exam tests understanding that hardware upgrades (GPU power), parallelism (distributed training), and data streaming (Pipe mode) are the most direct and reliable methods to reduce training time in SageMaker.

973

Multi-Selecthard

A company is deploying a machine learning model using Amazon SageMaker. The model must be updated frequently without downtime. Which TWO strategies can achieve this? (Choose two.)

Select 2 answers

A.Update the model artifact on the existing endpoint.

B.Delete the existing endpoint and create a new one.

C.Use blue/green deployment with endpoint variants.

D.Use rolling update with multiple instances.

E.Use canary deployment by gradually shifting traffic.

AnswersC, E

Traffic is shifted gradually.

Why this answer

Options C and E are correct. Blue/green deployment and canary deployment both allow zero-downtime updates because the existing variant keeps serving while traffic shifts to the new one. Option B is wrong because deleting and recreating the endpoint causes downtime.

Option A is wrong because updating the model directly on the endpoint is not supported without creating a new endpoint. Option D is wrong because SageMaker does not support rolling updates natively.

974

MCQhard

A machine learning engineer is deploying a model to an Amazon SageMaker endpoint for real-time inference. The model is a large ensemble that requires 4 GB of memory. The engineer wants to minimize cost while ensuring the endpoint can handle up to 100 concurrent requests with a latency under 200 ms. Which instance configuration is most appropriate?

A.Two ml.t3.medium instances behind a load balancer.

B.One ml.c5.xlarge instance with auto-scaling up to 2 instances.

C.One ml.m5.2xlarge instance.

D.One ml.p3.2xlarge instance.

AnswerB

ml.c5.xlarge has 8 GB of memory, enough for the 4 GB model; it is cost-effective, and auto-scaling handles load.

Why this answer

Option B is correct because the ml.c5.xlarge instance provides sufficient compute (4 vCPUs, 8 GB memory) for the 4 GB model, and auto-scaling up to 2 instances allows handling 100 concurrent requests with low latency while minimizing cost during low traffic. The ml.c5 family is optimized for compute-intensive inference, and auto-scaling ensures the endpoint scales out only when needed, avoiding over-provisioning.

Exam trap

The trap here is that candidates often choose a single large instance (like ml.m5.2xlarge) thinking it simplifies management, but auto-scaling with a smaller instance type is more cost-effective and still meets latency requirements under variable load.

How to eliminate wrong answers

Option A is wrong because two ml.t3.medium instances (2 vCPUs, 4 GB memory each) are burstable and may not sustain the required 200 ms latency under load, as t3 instances use CPU credits and can throttle under sustained high concurrency. Option C is wrong because one ml.m5.2xlarge instance (8 vCPUs, 32 GB memory) is over-provisioned for the 4 GB model and 100 concurrent requests, leading to higher cost without benefit. Option D is wrong because one ml.p3.2xlarge instance (8 vCPUs, 61 GB memory, GPU) is designed for GPU-accelerated workloads like deep learning, not for a large ensemble model that only needs 4 GB memory, making it unnecessarily expensive.

975

MCQmedium

A data scientist is analyzing a dataset with 500 features and notices that many features are highly correlated. Which AWS service can be used to automatically reduce dimensionality by identifying and removing redundant features before training a model?

A.AWS Glue

B.Amazon SageMaker Data Wrangler

C.Amazon QuickSight

D.Amazon Athena

AnswerB

Provides built-in transformations including correlation analysis and dimensionality reduction.

Why this answer

Option B is correct because Amazon SageMaker Data Wrangler provides built-in transformations including correlation analysis and dimensionality reduction. Option C is wrong because QuickSight is for visualization, not feature reduction. Option A is wrong because Glue is for ETL but lacks automatic dimensionality reduction.

Option D is wrong because Athena is a query service.
