Knowledge + Practice

AWS Certified Machine Learning Specialty MLS-C01 (MLS-C01) — Questions 1426–1500

1755 questions total · 24 pages · All types, answers revealed

1426

MCQmedium

A team is deploying a SageMaker endpoint for a model that was trained with scikit-learn. The endpoint receives spikes in traffic during business hours. The team wants to minimize cost while ensuring availability during spikes. Which endpoint configuration is MOST appropriate?

A.Use SageMaker Serverless Inference

B.Use a production variant endpoint with auto-scaling based on CPU utilization

C.Use a multi-model endpoint with a single instance type

D.Deploy a single large instance that can handle peak load

AnswerB

Auto-scaling handles traffic spikes efficiently.

Why this answer

Option B is correct because a production variant endpoint with auto-scaling based on CPU utilization allows the SageMaker endpoint to dynamically adjust the number of instances in response to traffic spikes, ensuring availability during business hours while minimizing cost by scaling down during off-peak periods. This approach is ideal for a scikit-learn model, which is CPU-bound, making CPU utilization a relevant and effective scaling metric.

Exam trap

The trap here is that candidates often confuse serverless inference with cost optimization for predictable spikes, overlooking that auto-scaling with a relevant metric like CPU utilization provides both cost efficiency and availability for scheduled traffic patterns.

How to eliminate wrong answers

Option A is wrong because SageMaker Serverless Inference is designed for intermittent or unpredictable traffic that can tolerate cold starts; for consistent daily spikes during business hours, cold start latency can cause performance issues and the cost advantage may not hold. Option C is wrong because a multi-model endpoint addresses hosting multiple models on shared instances, not traffic spikes; on a single instance it cannot absorb a spike by itself and would still require a scaling mechanism to ensure availability. Option D is wrong because deploying a single large instance that can handle peak load results in over-provisioning and higher costs during off-peak hours, as the instance remains fully running regardless of actual traffic, contradicting the goal of minimizing cost.
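
The scaling setup behind option B can be sketched as the two requests that Application Auto Scaling expects. This is a minimal sketch, not a definitive configuration: the endpoint name, variant name, capacity limits, target value, and cooldowns are hypothetical, and the metric namespace and dimension names are assumptions to verify against current AWS documentation. The code only builds the request arguments, so it runs without AWS credentials.

```python
def build_cpu_scaling_requests(endpoint_name, variant_name,
                               min_instances=1, max_instances=4,
                               target_cpu_percent=60.0):
    """Build keyword arguments for a scalable target and a CPU target-tracking policy."""
    common = {
        "ServiceNamespace": "sagemaker",
        "ResourceId": f"endpoint/{endpoint_name}/variant/{variant_name}",
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
    }
    # Instance count may move between these bounds (hypothetical values).
    target = {**common, "MinCapacity": min_instances, "MaxCapacity": max_instances}
    policy = {
        **common,
        "PolicyName": f"{endpoint_name}-cpu-target-tracking",
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            "TargetValue": target_cpu_percent,
            # Assumed CloudWatch metric for endpoint instance CPU.
            "CustomizedMetricSpecification": {
                "MetricName": "CPUUtilization",
                "Namespace": "/aws/sagemaker/Endpoints",
                "Dimensions": [
                    {"Name": "EndpointName", "Value": endpoint_name},
                    {"Name": "VariantName", "Value": variant_name},
                ],
                "Statistic": "Average",
                "Unit": "Percent",
            },
            "ScaleInCooldown": 300,   # seconds before scaling in again
            "ScaleOutCooldown": 60,   # seconds before scaling out again
        },
    }
    return target, policy


target_args, policy_args = build_cpu_scaling_requests("sklearn-endpoint", "AllTraffic")
print(target_args["ResourceId"])
```

With credentials in place, the two dictionaries would be passed to the boto3 `application-autoscaling` client as `register_scalable_target(**target_args)` and `put_scaling_policy(**policy_args)`.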

1427

Multi-Selecteasy

Which TWO of the following are examples of unsupervised learning tasks?

Select 2 answers

A.Classifying emails as spam or not spam

B.Dimensionality reduction using PCA

C.Sentiment analysis of product reviews

D.Clustering customer segments

E.Predicting house prices

AnswersB, D

PCA reduces features without labels.

Why this answer

Principal Component Analysis (PCA) is an unsupervised learning technique used for dimensionality reduction. It works by identifying the directions (principal components) that maximize variance in the data, without requiring any labeled target variable. This makes it a classic example of unsupervised learning, as the algorithm learns patterns solely from the input features. Clustering customer segments is likewise unsupervised: the algorithm groups customers by similarity in their features, with no predefined segment labels.

Exam trap

The exam often tests the distinction between supervised and unsupervised learning by presenting tasks that seem intuitive (like clustering) but pairing them with tasks that require labeled outputs (like classification or regression), so candidates must recognize that any task involving a target variable is supervised.
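
Clustering (option D) can be illustrated with a tiny k-means on one feature. This is a sketch under stated assumptions: the spend values and the choice of initial centers are hypothetical, and real workloads would normally use a library implementation. The point is that the function receives only feature values and never a label.

```python
def kmeans_1d(points, centers, iterations=10):
    """Plain k-means on 1-D data; uses only feature values, never a target label."""
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            # Assign each point to its nearest current center.
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Move each center to the mean of its cluster; keep it if the cluster is empty.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters


# Hypothetical monthly spend per customer (arbitrary units).
spend = [1.0, 1.2, 0.8, 9.0, 9.5, 8.5]
centers, clusters = kmeans_1d(spend, centers=[min(spend), max(spend)])
print(centers)
print(clusters)
```

The two groups emerge from the data alone, which is what separates this from classification or regression.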

1428

MCQmedium

A company is building a recommender system using matrix factorization. The dataset contains user-item interactions. The model is trained on a large dataset, but the recommendations for new users are poor. Which approach would MOST effectively address this cold-start problem?

A.Incorporate user demographic features as side information

B.Switch to item-based collaborative filtering only

C.Increase the number of latent factors in the model

D.Use only implicit feedback signals for training

AnswerA

Side information helps generalize to new users by leveraging metadata.

Why this answer

Matrix factorization models learn latent factors only from user-item interactions. For new users with no history, the model cannot compute a meaningful latent vector, leading to poor recommendations. Incorporating user demographic features as side information allows the model to initialize or infer latent factors for new users based on their attributes, directly addressing the cold-start problem.

Exam trap

The trap here is that candidates may think increasing latent factors or switching to implicit feedback improves generalization, but neither addresses the fundamental lack of user interaction data for new users.

How to eliminate wrong answers

Option B is wrong because switching to item-based collaborative filtering still relies on user-item interactions and does not solve the cold-start problem for new users with no history. Option C is wrong because increasing the number of latent factors may improve model capacity but does not provide any information about new users, so it cannot mitigate the cold-start issue. Option D is wrong because using only implicit feedback signals does not introduce any new user attributes; it still requires historical interactions to generate recommendations, leaving the cold-start problem unresolved.
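
One simple way side information can help is sketched here: a new user's latent vector is estimated from existing users who share a demographic attribute. This is an illustrative assumption, not the method of any particular library; the function name, the age-group attribute, and the factor values are all hypothetical.

```python
def cold_start_vector(new_user_group, user_groups, user_factors):
    """Estimate a latent vector for a new user from peers in the same demographic group."""
    peers = [user_factors[u] for u, g in user_groups.items() if g == new_user_group]
    if not peers:
        # No peers in this group: fall back to the average over all users.
        peers = list(user_factors.values())
    dims = len(peers[0])
    return [sum(v[d] for v in peers) / len(peers) for d in range(dims)]


# Hypothetical latent factors learned by matrix factorization for existing users.
user_factors = {"u1": [1.0, 0.0], "u2": [3.0, 2.0], "u3": [-1.0, 4.0]}
user_groups = {"u1": "18-25", "u2": "18-25", "u3": "40-60"}

new_vector = cold_start_vector("18-25", user_groups, user_factors)
print(new_vector)
```

The new user has no interactions at all, yet still receives a usable vector because the demographic attribute links them to users the model already knows.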

1429

MCQmedium

A data scientist is training a neural network on image data using TensorFlow with GPU instances on SageMaker. The training is slow because the GPU utilization is low. The data pipeline uses tf.data with a large number of preprocessing operations. Which action would most likely increase GPU utilization?

A.Increase the learning rate to converge faster.

B.Increase the prefetch buffer size in the tf.data pipeline.

C.Reduce the batch size to speed up each step.

D.Increase the number of CPU instances in the training job.

E.Use smaller image sizes to reduce computation.

AnswerB

Prefetching overlaps CPU data preparation with GPU computation, improving GPU utilization.

Why this answer

Option B is correct because prefetching allows the CPU to prepare batches while the GPU is computing, reducing idle time. Option A (increase the learning rate) does not affect data throughput. Option C (reduce the batch size) may decrease utilization. Option D (increase the number of CPU instances) may not help, because the bottleneck is the input pipeline feeding the GPU. Option E (use smaller images) reduces computation but may not improve the utilization percentage.

1430

MCQeasy

A data scientist runs a SQL query on an Amazon Athena table and notices that the query scans a large amount of data. Which approach would reduce the amount of data scanned without changing the SQL logic?

A.Partition the table on a column that is frequently used in WHERE clauses.

B.Convert the data from CSV to JSON format.

C.Store the data in Parquet format without partitioning.

D.Use GZIP compression on the data files.

AnswerA

Partitioning prunes data and reduces scanned bytes.

Why this answer

Partitioning the table on a frequently filtered column limits the data scanned to the relevant partitions. Option B is wrong because converting to JSON does not reduce the data scanned. Option C is wrong because Parquet is columnar and lets Athena skip unneeded columns, but without partitioning the query still reads the selected columns across the whole table. Option D is wrong because GZIP compression shrinks the files, but every file still has to be read; nothing is skipped on the basis of the query's filters.
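
The effect of partition pruning can be shown with simple arithmetic. This is a toy model, not Athena itself: the partition names and sizes are hypothetical, and the function just adds up the data a query would have to read with and without a filter on the partition column.

```python
def scanned_mb(partition_sizes, wanted=None):
    """Return MB read: all partitions, or only those matching the WHERE filter."""
    if wanted is None:
        return sum(partition_sizes.values())
    return sum(size for name, size in partition_sizes.items() if name in wanted)


# Hypothetical table partitioned by date, sizes in MB.
partition_sizes = {"dt=2024-01-01": 400, "dt=2024-01-02": 350, "dt=2024-01-03": 250}

full_scan = scanned_mb(partition_sizes)
pruned_scan = scanned_mb(partition_sizes, wanted={"dt=2024-01-02"})
print(full_scan, pruned_scan)
```

The same query logic reads 350 MB instead of 1000 MB once the filter column is the partition key.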

1431

MCQhard

A machine learning engineer is analyzing a dataset with high cardinality categorical features. They want to reduce the number of categories by grouping rare categories into an 'Other' category. Which Amazon SageMaker processing job capability is best suited for this task?

A.Amazon SageMaker Processing

B.Amazon SageMaker Data Wrangler

C.AWS Glue Studio

D.Amazon SageMaker Autopilot

AnswerA

Processing jobs allow custom scripts for flexible data transformation.

Why this answer

Option A is correct because Amazon SageMaker Processing jobs can run custom scripts that handle high cardinality categorical features using libraries like pandas. Option B is wrong because SageMaker Data Wrangler is a visual tool and may not be as flexible for custom grouping logic. Option C is wrong because AWS Glue Studio is a visual ETL tool and may not be as tightly integrated with SageMaker. Option D is wrong because SageMaker Autopilot is for automated ML, not for custom data processing.
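
The grouping logic that such a custom script would contain is short. This is a minimal sketch using only the standard library so that it runs anywhere; the function name, the `min_count` threshold, and the sample values are hypothetical, and a real Processing script would more likely apply the same idea to a pandas column.

```python
from collections import Counter


def group_rare_categories(values, min_count=2, other_label="Other"):
    """Replace categories seen fewer than min_count times with a single label."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other_label for v in values]


cities = ["NYC", "NYC", "LA", "LA", "LA", "Boise", "Reno"]
grouped = group_rare_categories(cities, min_count=2)
print(grouped)
```

The threshold should be computed on the training data only and reused at inference time, so that unseen or rare categories map to the same 'Other' bucket.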

1432

MCQhard

A company is using SageMaker to host a model that makes predictions on streaming data from Amazon Kinesis. The model must provide predictions with sub-second latency. Which approach should the company use?

A.Use SageMaker asynchronous inference with a Kinesis trigger

B.Use a SageMaker real-time endpoint and invoke it from an AWS Lambda function that is triggered by Kinesis

C.Use Amazon Kinesis Data Analytics with a built-in ML model

D.Use SageMaker batch transform to process batches of records from Kinesis

AnswerB

Real-time endpoint plus Lambda provides sub-second latency.

Why this answer

Option B is correct: a SageMaker real-time endpoint invoked from a Lambda function triggered by Kinesis provides sub-second latency. Option A (asynchronous inference) has higher latency. Option C (Kinesis Data Analytics) does not use SageMaker. Option D (batch transform) is not real-time.

1433

Multi-Selecteasy

A data engineer is designing a data pipeline that uses Amazon Kinesis Data Streams to ingest sensor data. The data must be processed in real-time, and the results must be stored in Amazon DynamoDB. Which TWO AWS services can be used together to achieve this? (Choose TWO.)

Select 2 answers

A.Amazon Athena

B.Amazon Kinesis Data Analytics

C.AWS Glue

D.Amazon S3

E.AWS Lambda

AnswersB, E

Kinesis Data Analytics can process streaming data in real-time.

Why this answer

Option B is correct because Kinesis Data Analytics can process data in real-time using SQL or Flink. Option E is correct because Lambda can consume from Kinesis and write to DynamoDB. Option A is wrong because Athena is for querying, not processing. Option C is wrong because Glue is for batch ETL, not real-time. Option D is wrong because S3 is not real-time storage.

1434

MCQmedium

A company is deploying a model to an Amazon SageMaker endpoint for real-time inference. The model requires a GPU for low-latency predictions. Which instance type should be chosen?

A.ml.c5.xlarge

B.ml.r5.2xlarge

C.ml.g4dn.xlarge

D.ml.m5.large

AnswerC

GPU instance suitable for inference.

Why this answer

SageMaker GPU instances (ml.p3, ml.p4, ml.g4dn, ml.g5) are designed for GPU inference, so option C (ml.g4dn.xlarge) is correct. Option A is wrong because ml.c5 is CPU only. Option B is wrong because ml.r5 is CPU only. Option D is wrong because ml.m5 is CPU only.

1435

MCQeasy

A company is using SageMaker to train a text classification model using the built-in BlazingText algorithm. The dataset has 500,000 documents, each labeled with one of 10 categories. Training is taking longer than expected. The training job uses a single ml.m4.xlarge instance, and the code uses default hyperparameters. The company has a fixed budget and wants to minimize cost while reducing training time. Which option should the data scientist choose?

A.Increase the learning rate significantly

B.Use the 'mode' hyperparameter set to 'batch_skipgram' instead of 'supervised'

C.Use SageMaker Managed Spot Training

D.Use a larger instance type, such as ml.m4.4xlarge

AnswerC

Spot instances reduce cost, allowing more resources for same budget.

Why this answer

Option C is the best choice because Spot Training lowers cost without affecting training time on a given instance; the same instance type costs less, which leaves room in the fixed budget for more instances if needed. Option A may cause the model not to converge. Option B changes the problem to unsupervised learning, which is not appropriate for a labeled classification task. Option D increases cost.

1436

MCQeasy

Refer to the exhibit. A data scientist lists files in an S3 bucket. The dataset is split into train, test, and validation sets. What is the most likely issue with this data split?

A.The files are not partitioned by date.

B.The training file is missing a header row.

C.The training set is smaller than the test set, which is unusual.

D.The test file should be in JSON format.

AnswerC

Typically training set is largest.

Why this answer

Option C is correct because the training set (1024 bytes) is smaller than the test set (2048 bytes), which is unusual; typically the training set should be the largest. Option A (partitioning by date) is not evident from the listing. Option B (missing header row) cannot be inferred from a file listing. Option D is wrong because CSV is a fine format for the test file.

1437

MCQmedium

A data scientist is analyzing a dataset containing customer reviews. The data scientist wants to understand the most common words used in positive and negative reviews. Which AWS service is most suitable for this task?

A.Amazon Rekognition

B.Amazon Comprehend

C.Amazon Polly

D.Amazon Transcribe

AnswerB

Comprehend provides sentiment analysis and key phrase extraction.

Why this answer

Option B is correct because Amazon Comprehend can perform sentiment analysis and extract key phrases. Option A is wrong because Amazon Rekognition is for image/video analysis. Option C is wrong because Amazon Polly is a text-to-speech service.

Option D is wrong because Amazon Transcribe is for speech-to-text.

1438

MCQmedium

A data scientist is analyzing a dataset with a target variable that is highly imbalanced (only 1% positive class). The goal is to build a binary classifier. During exploratory data analysis, which metric is MOST appropriate to evaluate the performance of different sampling strategies before model training?

A.Root Mean Squared Error (RMSE)

B.Area Under the Receiver Operating Characteristic Curve (AUC ROC)

C.F1 score

D.Accuracy

AnswerB

AUC ROC is threshold-independent and robust to class imbalance.

Why this answer

Option B is correct because the AUC ROC is independent of class distribution and provides a robust measure of separability between classes. Option A is wrong because RMSE is for regression. Option C is wrong because the F1 score depends on a threshold and can be affected by sampling. Option D is wrong because accuracy is misleading for imbalanced data.
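
The contrast between accuracy and AUC ROC on imbalanced data can be made concrete with a small calculation. This sketch computes AUC from its pairwise definition (the fraction of positive/negative pairs where the positive gets the higher score, ties counting half); the labels and scores are hypothetical, and a real project would typically use a library metric instead.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive outscores a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def accuracy(labels, scores, threshold=0.5):
    """Share of correct predictions when scores are cut at a fixed threshold."""
    correct = sum(int(s >= threshold) == y for y, s in zip(labels, scores))
    return correct / len(labels)


# Hypothetical imbalanced sample: 2 positives, 18 negatives.
labels = [1, 1] + [0] * 18
scores = [0.40, 0.10] + [0.05] * 9 + [0.20] * 9

print(accuracy(labels, scores))
print(roc_auc(labels, scores))
```

Every score is below 0.5, so the classifier predicts the negative class for all rows and still reaches high accuracy, while the AUC reflects how well the scores actually rank positives above negatives.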

1439

MCQmedium

Refer to the exhibit. A data scientist is using Amazon SageMaker Ground Truth to label a dataset. The output manifest file references S3 objects with metadata. The scientist notices that a training job using the labeled data yields poor accuracy. What is the most likely issue?

A.The labeled dataset has missing labels for some records.

B.The training data is in an incorrect format for the algorithm.

C.The IAM role used for training does not have permissions to read the manifest file.

D.The data distribution differs significantly between the training set and the real-world inference data.

AnswerB

If the data format does not match the algorithm's expectations, training may complete but produce poor results.

Why this answer

The exhibit shows a head-object response for the raw data object, not for the manifest. Its metadata includes 'sagemaker-import-job': 'true', which indicates the object was imported through a SageMaker import job, and a content length of about 1 GB. Neither value directly indicates a problem, so the answer has to be reasoned from the symptom: the training job completes but accuracy is poor.

Option B is the most plausible. If the data format does not match what the algorithm expects, training can still run but produce poor results. Option A is less likely because missing labels in the manifest would be expected to make training fail rather than finish with poor accuracy. Option C is wrong because missing IAM permissions would cause access errors and a failed job, not poor accuracy. Option D (distribution shift between training and inference data) is plausible in general, but nothing in the exhibit supports it.

1440

Multi-Selecthard

A data engineering team is migrating on-premises Hadoop workloads to AWS. The workloads include batch processing using Apache Spark and interactive SQL queries. The data is stored in HDFS. Which TWO AWS services should be used to replace HDFS and provide a scalable, durable storage layer? (Choose TWO.)

Select 2 answers

A.Amazon EMR with EMRFS

B.Amazon S3

C.Amazon EBS

D.Amazon FSx for Lustre

E.Amazon RDS

AnswersA, B

EMRFS allows EMR to use S3 as a replacement for HDFS.

Why this answer

Amazon S3 is the primary storage for data lakes, replacing HDFS. Amazon EMR with EMRFS can access S3 as if it were HDFS. Option C (Amazon EBS) is block storage, not a scalable shared storage layer. Option D (Amazon FSx for Lustre) is a high-performance file system but not a direct HDFS replacement for durability. Option E (Amazon RDS) is a relational database.

1441

MCQhard

A data scientist is training a neural network for a multi-class classification problem with 100 classes. The model uses a softmax output layer and cross-entropy loss. During training, the loss decreases steadily but the accuracy on the validation set plateaus early. Which of the following is the most likely cause?

A.Batch size is too large

B.The model is overfitting the training data

C.Number of epochs is too small

D.Learning rate is too high

AnswerB

Overfitting occurs when the model learns training data noise, causing training loss to keep decreasing while validation performance stagnates.

Why this answer

When the validation accuracy plateaus early while training loss continues to decrease, it indicates that the model is memorizing the training data rather than learning generalizable patterns. This is classic overfitting, where the softmax output layer produces high-confidence predictions for training samples but fails to generalize to unseen validation data, causing cross-entropy loss to drop on the training set while validation accuracy stagnates.

Exam trap

AWS often tests the distinction between overfitting and underfitting by pairing a decreasing training loss with a plateauing validation metric, tricking candidates into choosing learning rate or epoch issues when the real problem is memorization.

How to eliminate wrong answers

Option A is wrong because a batch size that is too large typically leads to slower convergence or poorer generalization, not a plateau in validation accuracy while training loss decreases; it would more likely cause both losses to be high or unstable. Option C is wrong because too few epochs would cause both training and validation accuracy to be low and still improving, not a plateau in validation accuracy alone. Option D is wrong because a learning rate that is too high usually causes the loss to diverge or oscillate, not a steady decrease in training loss with a plateau in validation accuracy.

1442

Multi-Selecteasy

Which TWO AWS services can be used to schedule and orchestrate a data pipeline that includes multiple steps such as data extraction, transformation, and loading? (Choose 2.)

Select 2 answers

A.AWS Lambda

B.AWS Glue

C.Amazon Managed Workflows for Apache Airflow (MWAA)

D.AWS Step Functions

E.Amazon CloudWatch Events

AnswersC, D

MWAA is a managed orchestration service for data pipelines.

Why this answer

AWS Step Functions is a serverless orchestration service that can coordinate multiple AWS services into workflows. Amazon Managed Workflows for Apache Airflow (MWAA) is a managed version of Apache Airflow for orchestrating pipelines. Option A is wrong because Lambda is a compute service, not an orchestrator. Option B is wrong because Glue is an ETL service, not an orchestrator, though it can be part of a pipeline. Option E is wrong because CloudWatch Events is for event scheduling, not complex orchestration.

1443

MCQmedium

A data engineering team uses Amazon EMR with Spark to transform large datasets in S3. The team notices that the Spark jobs on the EMR cluster are failing with out-of-memory errors. The cluster uses instance types with moderate memory. Which configuration change would MOST effectively reduce memory pressure without increasing cost?

A.Use a larger EMR cluster with more instances of the same type.

B.Increase the number of executor cores to improve parallelism.

C.Switch from Java serialization to Kryo serialization.

D.Enable shuffle compression and set spark.shuffle.compress to true.

AnswerD

Compression reduces the amount of data stored in memory during shuffles.

Why this answer

Option D is correct because enabling compression reduces the amount of data shuffled over the network and stored in memory, thus reducing memory usage. Option A is wrong because using more instances would increase cost. Option B is wrong because increasing the number of executor cores may increase parallelism but does not directly reduce memory per task. Option C is wrong because Kryo serialization reduces object size but is not as effective as compression for shuffle data.

1444

MCQeasy

A DevOps engineer created a SageMaker notebook instance using the Terraform configuration shown. The notebook instance is in a VPC with a public subnet. However, the notebook instance cannot access the internet. What is the most likely cause?

A.The role_arn is incorrect or missing permissions.

B.The instance type ml.t2.medium does not support internet access.

C.The subnet does not have a route to an internet gateway.

D.The direct_internet_access parameter is set to 'Enabled' but should be 'Disabled'.

AnswerC

Without a route to an internet gateway, the notebook cannot access the internet despite the setting.

Why this answer

Option C is correct because a SageMaker notebook instance in a VPC with a public subnet requires a route to an internet gateway (IGW) in the subnet's route table to access the internet. Without that route, traffic from the notebook cannot reach the internet, even if `direct_internet_access` is enabled. The Terraform configuration likely omitted the route to the IGW, causing the connectivity failure.

Exam trap

The trap here is that candidates often confuse `direct_internet_access` with the actual network routing requirement, assuming the parameter alone controls internet access, when in reality it only controls whether the notebook uses a public or private subnet, and the subnet must still have proper routing to the internet gateway.

How to eliminate wrong answers

Option A is wrong because the `role_arn` being incorrect or missing permissions would cause API failures (e.g., unable to create the notebook or access SageMaker resources), not a lack of internet connectivity from the notebook instance itself. Option B is wrong because the instance type `ml.t2.medium` fully supports internet access; SageMaker notebook instances of any type can reach the internet when properly configured. Option D is wrong because setting `direct_internet_access` to 'Enabled' is the correct setting for allowing internet access; setting it to 'Disabled' would intentionally block internet access, which is the opposite of what is needed.

1445

MCQhard

A machine learning engineer is deploying a model that predicts loan defaults. The model uses features like income, credit score, and debt-to-income ratio. After deployment, the model's performance degrades over time. Which concept best describes this phenomenon?

A.Data drift

B.Concept drift

C.Overfitting

D.Model drift

AnswerD

Model drift is the degradation of model performance over time.

Why this answer

Option D is correct because model drift describes the degradation of a deployed model's performance over time, for example when the statistical properties of the target variable change. Option A is wrong because data drift refers specifically to changes in the input distribution. Option B is wrong because concept drift names one possible cause rather than the observed degradation itself. Option C is wrong because overfitting is not a time-dependent degradation.

1446

MCQeasy

A company is using Amazon SageMaker to train a linear regression model. The data scientist notices that the training loss is decreasing but the validation loss has started to increase after a few epochs. What is the most likely cause?

A.The model is underfitting the training data.

B.There is data leakage from the validation set into the training set.

C.The model is overfitting the training data.

D.The learning rate is too high.

AnswerC

Decreasing training loss with increasing validation loss is a classic sign of overfitting.

Why this answer

When training loss decreases but validation loss increases, the model is overfitting to the training data. This is a classic sign of overfitting. Underfitting would show both losses high.

Learning rate too high would cause divergence. Data leakage would cause both losses to be artificially low.
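
The pattern described above can be detected mechanically from the loss curves. This is a minimal sketch with hypothetical loss values and a hypothetical `patience` setting: it returns the epoch with the best validation loss once validation loss has failed to improve for `patience` epochs, which is roughly the point where overfitting sets in.

```python
def overfitting_onset(val_losses, patience=2):
    """Return the best epoch once validation loss has not improved for `patience` epochs."""
    best_epoch, best_loss = 0, val_losses[0]
    for epoch, loss in enumerate(val_losses[1:], start=1):
        if loss < best_loss:
            best_epoch, best_loss = epoch, loss
        elif epoch - best_epoch >= patience:
            return best_epoch
    return None  # validation loss was still improving (or had only just stalled)


# Hypothetical curves: training loss keeps falling, validation loss turns upward.
train_losses = [0.95, 0.70, 0.55, 0.45, 0.38, 0.33]
val_losses = [0.90, 0.70, 0.60, 0.62, 0.66, 0.71]

onset = overfitting_onset(val_losses)
print(onset)
```

Training loss alone would never reveal the problem here, since it decreases at every epoch; only the validation curve shows where generalization stops improving.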

1447

MCQeasy

A data engineer needs to schedule an AWS Glue ETL job to run every hour. Which service should they use for scheduling?

A.AWS Lambda

B.Amazon CloudWatch Events

C.Amazon Simple Queue Service (SQS)

D.AWS Step Functions

AnswerB

CloudWatch Events can trigger Glue jobs on a schedule.

Why this answer

Amazon CloudWatch Events (now part of Amazon EventBridge) can trigger an AWS Glue ETL job on a schedule using a cron or rate expression. This is the native, serverless scheduling service for running jobs at fixed intervals, such as every hour, without needing to manage any infrastructure.

Exam trap

The trap here is that candidates may confuse AWS Lambda as a scheduler because it can be used to run code on a schedule via CloudWatch Events, but the question asks for the service used for scheduling, not for executing the scheduled action.

How to eliminate wrong answers

Option A is wrong because AWS Lambda is a compute service for running code in response to events, not a scheduling service; while Lambda can be used to trigger Glue jobs via custom code, it requires additional setup and is not the direct scheduling mechanism. Option C is wrong because Amazon SQS is a message queue service for decoupling application components, not a scheduler; it cannot natively trigger Glue jobs on a time-based schedule. Option D is wrong because AWS Step Functions is a workflow orchestration service that can coordinate multiple AWS services, but it is not designed for simple time-based scheduling; using it solely for hourly triggers would be over-engineering and incur unnecessary complexity and cost.

1448

MCQmedium

Refer to the exhibit. A data scientist plans to read this CSV file into memory for exploratory data analysis using pandas. The instance has 8 GB of RAM. What is the MOST likely issue the scientist will encounter?

A.The file contains too many rows for pandas to handle

B.The file is too large to load into memory on this instance

C.The file is not in CSV format despite the ContentType

D.The file is not accessible because of insufficient permissions

AnswerB

1 GB CSV file may require >8 GB RAM when loading into pandas.

Why this answer

Option B is correct because the file is 1,073,741,824 bytes (1 GiB), and pandas typically needs 3-5x the file size in memory to parse a CSV, sometimes more for string-heavy columns; add the copies made during exploratory analysis and the working set can approach or exceed the 8 GB of RAM. Option A is wrong because the metadata indicates 10 million rows, which pandas can handle; the file size is the main issue. Option C is wrong because the content type is text/csv, which is correct. Option D is wrong because the file is accessible (HTTP 200).

1449

MCQmedium

Refer to the exhibit. An IAM policy is attached to a data engineering role. The role is used by an AWS Glue ETL job that reads from 'raw/' and writes to 'processed/'. The job fails with an access denied error when trying to write to 'processed/'. What is the likely cause?

A.The Deny statement on s3:DeleteObject prevents overwriting objects.

B.The role is not correctly attached to the Glue job.

C.The policy does not allow both s3:GetObject and s3:PutObject on the same resource.

D.The policy specifies incorrect ARN for the 'processed' folder.

AnswerA

If the job replaces existing objects, the explicit Deny on s3:DeleteObject can block the write.

Why this answer

Option A is the most plausible of the choices. The Deny statement explicitly denies s3:DeleteObject on all objects, and an explicit Deny overrides any Allow. Writing new objects with s3:PutObject should still work, but the Deny can interfere when the job attempts to overwrite or replace existing output in 'processed/' (the exact behavior depends on how the job writes and on the bucket's versioning configuration). A missing s3:ListBucket permission at the bucket level is another common cause of access denied errors, but it is not among the options. Option B is wrong because the role is attached correctly. Option C is wrong because the policy allows both s3:GetObject and s3:PutObject. Option D is wrong because the resource ARN matches.

1450

MCQeasy

A team has a dataset with 500 features and wants to reduce dimensionality. During EDA, they compute the variance of each feature. Which finding would most likely lead to feature removal?

A.Some features have high correlation with each other

B.Some features have negative covariance with the target

C.Some features have very high variance

D.Some features have near-zero variance

AnswerD

Near-zero variance features carry almost no information.

Why this answer

Option D is correct because near-zero variance features provide little information and can be removed. Option A is wrong because high correlation between two features might warrant removal of one, but variance is not the direct indicator. Option B is wrong because negative covariance is still informative. Option C is wrong because high variance is often useful.
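
A variance screen of this kind takes only a few lines. This is a minimal sketch with hypothetical column names, values, and threshold; in practice the features should be on comparable scales before a single threshold is applied to all of them.

```python
from statistics import pvariance


def low_variance_features(columns, threshold=1e-3):
    """Return the names of features whose population variance is below the threshold."""
    return [name for name, values in columns.items() if pvariance(values) < threshold]


# Hypothetical feature columns.
columns = {
    "age": [23, 45, 31, 52, 38],
    "is_active": [1, 1, 1, 1, 1],                      # constant
    "sensor_offset": [0.50, 0.50, 0.51, 0.50, 0.50],   # almost constant
}

to_drop = low_variance_features(columns)
print(to_drop)
```

A constant or nearly constant column cannot help distinguish one record from another, which is why it is a natural candidate for removal.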

1451

MCQhard

A company uses Amazon SageMaker to host a model for real-time inference. The model is a large ensemble that takes 2 seconds to load into memory. To reduce cold start latency, the data scientist uses SageMaker's managed warm pools. However, they notice that during a sudden traffic spike, new instances still experience high latency. What is the BEST way to ensure consistently low latency for all requests?

A.Use a larger instance type to reduce model loading time.

B.Configure auto scaling based on the number of active invocations to maintain a buffer of warmed instances.

C.Reduce the number of instances to minimize cold start frequency.

D.Switch to SageMaker Serverless Inference.

AnswerB

Auto scaling with a buffer ensures that new instances are provisioned ahead of demand, reducing cold start impact.

Why this answer

Option B is correct. Auto scaling on a target metric keeps a buffer of warm instances ahead of demand. Option A is wrong because a larger instance reduces per-request time, but cold starts still occur. Option C is wrong because scaling down aggressively increases cold starts. Option D is wrong because Serverless Inference has its own cold start issues.

1452

MCQmedium

A data scientist runs the above AWS CLI command. What does the command do?

A.It lists objects larger than 1,000,000 bytes under the data/ prefix.

B.It counts the number of objects larger than 1 MB.

C.It lists objects created after January 2023.

D.It lists objects larger than 1 MB in size.

AnswerA

The --query filters Size > '1000000', which is 1,000,000 bytes.

Why this answer

Option A is correct. The command lists objects in the bucket under the prefix 'data/' that are larger than 1,000,000 bytes (Size > '1000000') and returns their keys. Option B is wrong because it returns keys, not counts. Option C is wrong because it filters by size, not date. Option D is wrong because the threshold is 1,000,000 bytes, not 1 MB (1 MB = 1,048,576 bytes).
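The CLI exhibit is not reproduced here, so the following is only a sketch of the same size filter applied in plain Python to a made-up listing. The object keys and sizes are hypothetical; the point is the gap between the 1,000,000-byte threshold and the 1,048,576-byte definition of 1 MB used in the explanation.

```python
# Hypothetical listing, shaped like the "Contents" entries of an S3 list call.
objects = [
    {"Key": "data/small.csv", "Size": 999_999},
    {"Key": "data/medium.csv", "Size": 1_000_001},
    {"Key": "data/large.parquet", "Size": 1_048_577},
]

THRESHOLD_BYTES = 1_000_000    # the value the explanation quotes
ONE_MB_BYTES = 1024 * 1024     # 1,048,576 bytes, as the explanation defines 1 MB

# Keys of objects strictly larger than each threshold.
over_threshold = [o["Key"] for o in objects if o["Size"] > THRESHOLD_BYTES]
over_one_mb = [o["Key"] for o in objects if o["Size"] > ONE_MB_BYTES]

print(over_threshold)
print(over_one_mb)
```

An object of 1,000,001 bytes passes the first filter but not the second, which is why "larger than 1 MB" is not an accurate description of the command.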

1453

MCQeasy

A retail company uses Amazon SageMaker to train a model for product demand forecasting. The dataset contains daily sales data for 10,000 products over 3 years. The data includes features like price, promotions, holidays, and seasonality. The data scientist uses a linear regression model and gets an RMSE of 50 units. However, the business requires more accurate forecasts, especially for products with high variability. The scientist notices that the residuals show a pattern: the model underestimates demand during promotional periods. Which approach should the scientist take to improve the model?

A.Add interaction features between promotion and other variables.

B.Collect more historical data for training.

C.Use a deep learning model like LSTM.

D.Remove promotion features to simplify the model.

AnswerA

Interaction terms capture combined effects.

Why this answer

Option A is correct because interaction terms between promotion and other features can capture the promotional effect that the model currently underestimates. Option B is wrong because more data may not fix the bias. Option C is wrong because moving to deep learning may be overkill. Option D is wrong because excluding promotions removes important information.

1454

MCQeasy

A data scientist is training a binary classification model on a highly imbalanced dataset (99% negative class, 1% positive class). The model currently achieves 99% accuracy but only identifies 0.5% of true positives. Which metric should the data scientist focus on to improve model performance?

A.Precision

B.Root Mean Squared Error (RMSE)

C.Recall

D.Accuracy

AnswerC

Recall measures the ability to find all positive samples, which is crucial for imbalanced data.

Why this answer

Recall (sensitivity) measures the proportion of actual positives correctly identified, which is critical when the dataset is highly imbalanced (99% negative, 1% positive) and the model fails to detect most positives (only 0.5% true positives). Improving recall directly addresses the model's inability to capture the minority class, even if it reduces precision or accuracy. In binary classification with severe class imbalance, accuracy is misleading because a model can achieve 99% accuracy by simply predicting the majority class, as seen here.

Exam trap

The trap here is that candidates see 99% accuracy and assume the model is performing well, failing to recognize that accuracy is a deceptive metric in imbalanced datasets, while recall directly measures the model's ability to find the rare positive class.

How to eliminate wrong answers

Option A is wrong because precision focuses on the proportion of predicted positives that are actually positive, which does not address the low true positive rate (0.5%); improving precision could even further reduce recall by making the model more conservative. Option B is wrong because Root Mean Squared Error (RMSE) is a regression metric that measures the average magnitude of errors in continuous predictions, not applicable to binary classification outcomes like true positive identification. Option D is wrong because accuracy is already 99% and is a poor metric for imbalanced datasets; optimizing for accuracy encourages the model to predict the majority class (negative) for all instances, which is exactly why only 0.5% of positives are found.
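To make the accuracy-versus-recall gap concrete, here is a small sketch computed from confusion-matrix counts. The counts are assumptions chosen to mirror the scenario (100,000 rows, 1% positive, 0.5% of positives found, no false positives); they are not taken from any real dataset.

```python
def accuracy_and_recall(tp, fp, fn, tn):
    """Compute accuracy and recall from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Recall is undefined without positives; report 0.0 in that case.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, recall


# Hypothetical counts: 1,000 positives, only 5 of them detected.
accuracy, recall = accuracy_and_recall(tp=5, fp=0, fn=995, tn=99_000)

print(f"accuracy={accuracy:.5f} recall={recall:.3f}")
```

Accuracy stays just above 99% while recall is 0.5%, which is the mismatch the question is built around.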

1455

MCQeasy

A data scientist is reviewing a dataset and notices that the distribution of a numerical feature is heavily right-skewed with a long tail. Which visualization is most appropriate to assess the distribution?

A.Box plot

B.Line chart

C.Scatter plot

D.Histogram with a logarithmic scale on the x-axis

AnswerD

Log scale helps visualize skewed distributions.

Why this answer

Option D is correct because a histogram with a logarithmic x-axis spreads out the long tail, so the shape of skewed data remains visible. Option A is wrong because a box plot shows quartiles but not the full distribution shape. Option B is wrong because a line chart is for time series. Option C is wrong because a scatter plot is for two variables.
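One way to picture what a logarithmic x-axis does is to bin values by order of magnitude. The sketch below does this with the standard library only; the sample values are invented, and a real analysis would more likely use a plotting library with log-spaced bins.

```python
import math
from collections import Counter


def log10_histogram(values):
    """Count positive values per decade: bin k holds values in [10**k, 10**(k+1))."""
    return Counter(math.floor(math.log10(v)) for v in values if v > 0)


# Hypothetical right-skewed sample with one very large value.
values = [1, 2, 3, 5, 8, 12, 20, 45, 150, 900, 12_000]
hist = log10_histogram(values)

for decade in sorted(hist):
    print(f"10^{decade}: {'#' * hist[decade]}")
```

On a linear axis the single value of 12,000 would squeeze everything else into the first bin; per-decade bins keep the bulk of the data readable.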

1456

Multi-Selectmedium

A data scientist is analyzing a dataset with a binary target variable. The dataset has 50,000 rows and 200 features. The data scientist wants to identify which features are most predictive. Which TWO methods are appropriate for feature selection during EDA?

Select 2 answers

A.Mutual information

B.Chi-squared test

C.LSTM neural network

D.K-means clustering

E.Principal Component Analysis (PCA)

AnswersA, B

Can measure dependency between features and target.

Why this answer

Options A and B are correct: the chi-squared test applies to categorical features, and mutual information can handle both categorical and numerical features. Option C is wrong because LSTM is a model, not an EDA technique. Option D is wrong because k-means is a clustering method. Option E is wrong because PCA is unsupervised.

1457

MCQhard

An ML engineer is performing EDA on a dataset of customer transactions. The dataset has 1 million rows and 20 columns, including a 'transaction_amount' column. The engineer notices that 5% of the transaction amounts are negative, which are data entry errors. The rest are positive. Which approach is most appropriate for handling these negative values during EDA?

A.Impute the negative values with the median of positive transaction amounts.

B.Remove rows with negative transaction amounts from the dataset.

C.Take the absolute value of the negative transaction amounts.

D.Cap the negative values at zero.

AnswerB

Removing erroneous data points cleans the dataset without introducing bias.

Why this answer

Option B is correct because the negative values are errors and likely distort the distribution; removing them is straightforward and valid. Option A is wrong because negative values are not missing, so imputation is not appropriate. Option C is wrong because taking absolute values would incorrectly treat errors as legitimate high values. Option D is wrong because capping may still retain erroneous values.

1458

MCQhard

A data scientist is setting up a SageMaker training job and has attached this IAM policy to the execution role. The training job fails with an access denied error when trying to write to the output path 's3://my-bucket/output/model.tar.gz'. What additional permission is needed?

A.s3:ListBucket

B.s3:GetObject for the output path

C.s3:DeleteObject

D.iam:PassRole on the role itself

AnswerA

SageMaker requires ListBucket permission to access the bucket.

Why this answer

The training job fails because SageMaker needs to verify that the output S3 bucket exists before writing to it. The s3:ListBucket permission is required to list the contents of the bucket (or confirm its existence) as part of the write operation. Without this permission, the service cannot validate the bucket, resulting in an access denied error even if s3:PutObject is allowed.

Exam trap

The trap here is that candidates assume only s3:PutObject is needed for writing to S3, but AWS services like SageMaker often require s3:ListBucket to verify the bucket exists before performing write operations.

How to eliminate wrong answers

Option B is wrong because s3:GetObject is a read permission used for retrieving objects, not for writing output; the training job needs write access (s3:PutObject) to create the model artifact. Option C is wrong because s3:DeleteObject is unrelated to writing output; it is used for removing objects, and the training job does not need to delete anything. Option D is wrong because iam:PassRole is required to pass the execution role to the SageMaker service, but the question states the role is already attached to the training job, so this permission is not missing; the error occurs specifically at the S3 write step.

1459

MCQhard

A data scientist is trying to upload a CSV file to an S3 bucket using the AWS CLI without specifying server-side encryption. The upload fails with an AccessDenied error. Based on the bucket policy exhibit, what is the most likely cause?

A.The upload request did not specify the required server-side encryption.

B.The bucket does not exist.

C.The data scientist does not have any permissions to the bucket.

D.The data scientist used the wrong AWS region.

AnswerA

The condition requires s3:x-amz-server-side-encryption to be AES256.

Why this answer

Option A is correct because the policy requires that all PutObject requests include server-side encryption with AES256. Option B is wrong because the error is AccessDenied, not NoSuchBucket. Option C is wrong because the policy allows GetObject and PutObject, but with a condition. Option D is wrong because the condition is on the encryption, not on the region.
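The bucket policy exhibit is not shown here, so the following is only a simplified model of the condition the explanation describes: a PutObject request is accepted when it carries the `x-amz-server-side-encryption` header with the value `AES256`. The function name and request dictionaries are hypothetical, and real policy evaluation involves more than this single check.

```python
REQUIRED_SSE = "AES256"
SSE_HEADER = "x-amz-server-side-encryption"


def put_object_passes_condition(request_headers):
    """Return True if the request carries the required encryption header."""
    return request_headers.get(SSE_HEADER) == REQUIRED_SSE


# Upload without any encryption header, as in the question.
no_sse = put_object_passes_condition({})
# Upload that names the required algorithm.
with_sse = put_object_passes_condition({SSE_HEADER: "AES256"})
# Upload that names a different algorithm.
other_sse = put_object_passes_condition({SSE_HEADER: "aws:kms"})

print(no_sse, with_sse, other_sse)
```

With the AWS CLI, the header is typically supplied by adding `--sse AES256` to an `aws s3 cp` command.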

1460

MCQeasy

A machine learning engineer is using Amazon SageMaker to deploy a model for real-time inference. The model must respond within 100 milliseconds. The initial deployment uses a single ml.m5.large instance, but latency is too high. Which change should the engineer make to reduce latency?

A.Switch to a compute-optimized instance like ml.c5.2xlarge.

B.Use batch transform instead of real-time endpoint.

C.Deploy to a single ml.t2.medium instance to reduce cost.

D.Deploy the model on a multi-model endpoint.

AnswerA

Compute-optimized instances provide higher CPU performance, reducing prediction latency.

Why this answer

Option A is correct because a more powerful instance reduces inference time. Option B is wrong because batch transform is for offline predictions. Option C is wrong because scaling down to a smaller instance reduces resources. Option D is wrong because a multi-model endpoint can lead to resource contention.

1461

MCQmedium

A company is building a binary classifier to detect fraudulent transactions. The dataset is highly imbalanced (99% legitimate, 1% fraudulent). Which metric is most appropriate for evaluating the model?

A.Accuracy

B.Mean Squared Error

C.F1-score

D.Area Under the ROC Curve (AUC-ROC)

AnswerC

F1-score considers both precision and recall, suitable for imbalanced data.

Why this answer

Precision and recall (or F1-score) are more informative for imbalanced datasets than accuracy, because a model predicting all legitimate would achieve 99% accuracy but be useless. F1-score balances precision and recall.
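The contrast between accuracy and F1 can be checked with a few lines of arithmetic. The sketch below compares an "always legitimate" predictor with a hypothetical fraud model on 10,000 transactions (1% fraud); every count is an assumption made for illustration.

```python
def accuracy_and_f1(tp, fp, fn, tn):
    """Compute accuracy and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return accuracy, f1


# Predicts "legitimate" for everything: no fraud is ever caught.
always_legit = accuracy_and_f1(tp=0, fp=0, fn=100, tn=9_900)
# Hypothetical model that catches 80 of 100 frauds with 40 false alarms.
fraud_model = accuracy_and_f1(tp=80, fp=40, fn=20, tn=9_860)

print(always_legit)
print(fraud_model)
```

The two accuracies differ by less than half a percentage point, while F1 moves from 0 to roughly 0.73, so F1 separates the useless model from the useful one.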

1462

Multi-Selecteasy

A company is using Amazon SageMaker to train a model. Which TWO metrics should be used to evaluate a binary classification model?

Select 2 answers

A.Accuracy

B.Perplexity

C.AUC

D.F1 score

E.Mean Absolute Error

AnswersC, D

AUC is a standard metric for binary classification.

Why this answer

AUC (Area Under the ROC Curve) is a threshold-independent metric that measures the model's ability to distinguish between positive and negative classes across all classification thresholds. For binary classification in SageMaker, AUC is robust to class imbalance and provides a single scalar value representing overall model performance, making it a standard evaluation metric.

Exam trap

The trap here is that candidates often pick Accuracy (A) as a default metric without considering class imbalance, or confuse regression metrics like MAE (E) with classification evaluation, while perplexity (B) is a distractor from NLP contexts.
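For intuition about why AUC is threshold-independent, it can be computed as the fraction of (positive, negative) pairs in which the positive example receives the higher score, with ties counted as half. The sketch below uses that pairwise definition on four made-up predictions; it is quadratic in the number of examples and meant for illustration rather than large datasets.

```python
def roc_auc(labels, scores):
    """Pairwise AUC: P(score of a random positive > score of a random negative)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(pos) * len(neg))


# Hypothetical labels and model scores.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(labels, scores)
print(auc)
```

No classification threshold appears anywhere in the computation; only the ranking of the scores matters.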

1463

MCQeasy

A machine learning engineer is deploying a model to Amazon SageMaker for real-time inference. The model requires low latency and must handle variable traffic patterns. Which SageMaker feature should the engineer use to automatically scale the number of instances based on demand?

A.SageMaker automatic scaling

B.Amazon EC2 Auto Scaling

C.Elastic Inference

D.SageMaker Batch Transform

AnswerA

SageMaker integrates with Application Auto Scaling to scale the number of instances based on demand.

Why this answer

SageMaker automatic scaling (Application Auto Scaling) is the correct feature because it allows the engineer to define scaling policies (e.g., based on CPU utilization or request latency) that automatically adjust the number of instances behind a SageMaker endpoint in response to real-time traffic patterns. This ensures low latency by maintaining sufficient capacity during spikes and reducing costs during lulls, without manual intervention.

Exam trap

The trap here is that candidates confuse Amazon EC2 Auto Scaling (which scales EC2 instances in an Auto Scaling group) with SageMaker automatic scaling (which scales SageMaker endpoint instances via Application Auto Scaling), leading them to pick B even though it does not directly apply to SageMaker endpoints.

How to eliminate wrong answers

Option B (Amazon EC2 Auto Scaling) is wrong because it operates at the EC2 instance level, not at the SageMaker endpoint level; SageMaker endpoints are managed services that require Application Auto Scaling with a SageMaker-specific scalable dimension (sagemaker:variant:DesiredInstanceCount). Option C (Elastic Inference) is wrong because it accelerates inference by attaching a GPU accelerator to an instance, but it does not scale the number of instances based on demand; it only reduces latency for deep learning models. Option D (SageMaker Batch Transform) is wrong because it is designed for offline, asynchronous batch predictions on large datasets, not for real-time inference with variable traffic patterns.
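A minimal sketch of what such a scaling setup might look like as request payloads for Application Auto Scaling. The endpoint name, variant name, capacity limits, target value, and cooldowns are all hypothetical; the code only builds dictionaries and does not call AWS. In practice they could be passed to the boto3 `application-autoscaling` client's `register_scalable_target` and `put_scaling_policy` calls.

```python
def build_scaling_requests(endpoint, variant, min_instances, max_instances,
                           invocations_per_instance):
    """Build Application Auto Scaling payloads for a SageMaker endpoint variant."""
    resource_id = f"endpoint/{endpoint}/variant/{variant}"
    dimension = "sagemaker:variant:DesiredInstanceCount"
    target = {
        "ServiceNamespace": "sagemaker",
        "ResourceId": resource_id,
        "ScalableDimension": dimension,
        "MinCapacity": min_instances,
        "MaxCapacity": max_instances,
    }
    policy = {
        "PolicyName": f"{endpoint}-target-tracking",   # hypothetical name
        "ServiceNamespace": "sagemaker",
        "ResourceId": resource_id,
        "ScalableDimension": dimension,
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            "TargetValue": float(invocations_per_instance),
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance",
            },
            "ScaleOutCooldown": 60,    # assumed values, in seconds
            "ScaleInCooldown": 300,
        },
    }
    return target, policy


# Hypothetical endpoint and variant names.
target, policy = build_scaling_requests("my-endpoint", "AllTraffic", 1, 4, 100)
print(target["ResourceId"])
```

A longer scale-in cooldown than scale-out cooldown is a common choice so that capacity is added quickly during spikes and removed cautiously afterward.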

1464

MCQhard

A company uses Amazon SageMaker to train and deploy machine learning models. The training data is stored in Amazon S3 (Parquet format, 10 TB). The data scientists have been running training jobs using the File mode input, but the jobs are taking too long due to data download time. They want to reduce the training start-up time and overall training time. Which solution is MOST cost-effective and efficient?

A.Configure the SageMaker training job to use Pipe mode, which streams data directly from S3 without downloading to the instance's local storage.

B.Use S3 Transfer Acceleration to speed up the data transfer from S3 to the training instance.

C.Use larger EC2 instances with more vCPUs and memory to speed up the training process.

D.Enable Elastic Fabric Adapter (EFA) on the training instances to improve network throughput.

AnswerA

Pipe mode reduces start-up time by streaming data, and it is cost-effective as it avoids EBS volume costs associated with File mode.

Why this answer

Pipe mode in SageMaker streams training data directly from Amazon S3 to the training algorithm without first downloading it to the instance's local storage. This eliminates the data download step, significantly reducing startup time and overall training time for large datasets like 10 TB. It is the most cost-effective because it avoids the need for larger instances or additional data transfer acceleration services.

Exam trap

The trap here is that candidates often confuse Pipe mode with File mode, assuming both require downloading data, or they over-engineer the solution by choosing expensive network accelerators or larger instances when the simplest streaming approach is both faster and cheaper.

How to eliminate wrong answers

Option B is wrong because S3 Transfer Acceleration is designed to speed up uploads to S3 over long distances, not downloads from S3 to SageMaker training instances, and it incurs additional costs without addressing the core issue of download time. Option C is wrong because using larger EC2 instances with more vCPUs and memory does not reduce the data download time; it only increases compute capacity, which may not help if the bottleneck is I/O from downloading data. Option D is wrong because Elastic Fabric Adapter (EFA) improves inter-node network communication for distributed training, but it does not accelerate data transfer from S3 to the instance, which is the primary bottleneck here.

1465

Multi-Selectmedium

Which TWO AWS services can be used to move data from an on-premises database to Amazon S3 on a recurring schedule without writing custom code? (Choose 2.)

Select 2 answers

A.AWS Glue

B.AWS Snowball Edge

C.AWS Database Migration Service (AWS DMS)

D.Amazon Athena

E.Amazon Kinesis Data Firehose

AnswersA, C

Glue can run scheduled ETL jobs from JDBC sources to S3.

Why this answer

AWS Database Migration Service (AWS DMS) can perform continuous replication from on-premises databases to S3. AWS Glue can run scheduled ETL jobs that pull data from on-premises sources via JDBC and write to S3. Option B is wrong because Snowball Edge is a one-time physical transfer. Option D is wrong because Athena is a query service, not a data movement tool. Option E is wrong because Kinesis Data Firehose is for streaming, not scheduled batch.

1466

MCQhard

Refer to the exhibit. An ML engineer attaches this IAM policy to a user. The user wants to invoke the SageMaker endpoint my-endpoint from an EC2 instance with public IP 52.1.1.1. What will happen?

A.The invocation fails because the user does not have permission to create an endpoint.

B.The invocation is denied because the Deny statement applies to all resources.

C.The invocation is allowed because the source IP is not in the denied ranges.

D.The invocation is denied because the user is not in a VPC.

AnswerC

The Deny condition does not match the public IP, so Allow prevails.

Why this answer

The policy has an Allow for InvokeEndpoint on the specific endpoint, but also a Deny with a condition that denies InvokeEndpoint if the source IP is within private IP ranges (10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16). The user's IP is 52.1.1.1, which is a public IP, not in the deny condition. Therefore, the Allow takes effect, and the invocation is permitted.

Note: Deny always overrides Allow, but the condition does not match, so Deny is not applied.
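The reasoning above can be checked mechanically with Python's `ipaddress` module. The exhibit itself is not reproduced here, so this sketch uses the CIDR ranges quoted in the explanation and a deliberately simplified evaluation rule (an explicit Deny wins when its condition matches; otherwise the Allow applies). The function names are hypothetical, and real IAM evaluation has more steps.

```python
import ipaddress

# Private ranges named in the Deny condition, per the explanation.
DENIED_RANGES = [
    ipaddress.ip_network(cidr)
    for cidr in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")
]


def deny_condition_matches(source_ip):
    """Return True if the source IP falls inside any denied range."""
    addr = ipaddress.ip_address(source_ip)
    return any(addr in network for network in DENIED_RANGES)


def invocation_allowed(source_ip, has_allow=True):
    """Simplified evaluation: a matching explicit Deny overrides the Allow."""
    if deny_condition_matches(source_ip):
        return False
    return has_allow


print(invocation_allowed("52.1.1.1"))   # public IP from the question
print(invocation_allowed("10.2.3.4"))   # hypothetical private IP
```

The public address falls outside all three ranges, so the Deny condition never matches and the Allow decides the outcome.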

1467

Multi-Selectmedium

A company uses AWS Glue to run ETL jobs. The data engineer wants to monitor job performance and troubleshoot failures. Which THREE AWS services or features should they use together? (Choose three.)

Select 3 answers

A.AWS Glue job bookmarks

B.Amazon S3 Event Notifications

C.Amazon Athena

D.Amazon CloudWatch Logs

E.Amazon CloudWatch metrics

AnswersA, D, E

Job bookmarks track processed data and help identify failures.

Why this answer

Correct options: A, D, E. CloudWatch Logs stores Glue job logs, CloudWatch metrics track job performance, and Glue job bookmarks track processed data. Option B (S3 Event Notifications) can trigger jobs but not monitor them. Option C (Athena) is for querying, not monitoring.

1468

Matchingmedium

Match each SageMaker optimization technique to its description.

Concepts

Matches

Train across multiple GPUs or instances

Hyperparameter optimization with Bayesian search

Use spot instances for cost savings

Stream data directly from S3 for faster training

Monitor training and detect issues

Why these pairings

These techniques improve training efficiency and cost.

1469

Multi-Selecthard

You are deploying a custom Docker container for a SageMaker model that requires a specific NVIDIA CUDA version. Which THREE steps must you take to ensure the container runs correctly on SageMaker?

Select 3 answers

A.Define a health check endpoint

B.Use SageMaker Batch Transform

C.Include the SageMaker inference toolkit in the container

D.Choose a GPU instance type for the endpoint

E.Set the container's entry point to the inference script

AnswersC, D, E

Required for SageMaker to interface with the container.

Why this answer

Options C, D, and E are correct. Option C: the container must include the SageMaker inference toolkit. Option D: the endpoint must run on a GPU instance. Option E: the container's entry point must be set to the inference script. Option A is wrong because a health check is recommended but not mandatory. Option B is wrong because Batch Transform is not required.

1470

Multi-Selectmedium

A data scientist is training a neural network for a multi-class classification problem. The model is overfitting. Which TWO of the following techniques can help reduce overfitting? (Choose two.)

Select 2 answers

A.Increase the number of hidden layers.

B.Add dropout layers after the hidden layers.

C.Decrease the learning rate.

D.Add L2 regularization to the loss function.

E.Reduce the batch size.

AnswersB, D

Dropout randomly drops units during training, reducing co-adaptation.

Why this answer

Option B is correct because dropout layers randomly deactivate a fraction of neurons during training, which prevents the network from relying too heavily on any single neuron and forces it to learn more robust features. This reduces co-adaptation among neurons and is a standard regularization technique to combat overfitting in neural networks.

Exam trap

AWS often tests the distinction between techniques that reduce overfitting (regularization) versus those that improve training dynamics (learning rate, batch size), leading candidates to mistakenly select options like decreasing the learning rate or reducing batch size as primary overfitting solutions.

1471

Multi-Selectmedium

Which TWO of the following are valid ways to reduce query costs in Amazon Athena? (Choose 2)

Select 2 answers

A.Use UNLOAD to export query results to S3

B.Partition the data in S3

C.Increase the query timeout limit

D.Use columnar storage formats like Parquet

E.Enable encryption at rest on S3

AnswersB, D

Partitioning limits data scanned per query.

Why this answer

Option B (partitioning) reduces the data scanned, lowering cost. Option D (columnar formats) also reduces scanned data. Option A is wrong because UNLOAD is for exporting, not cost reduction. Option C is wrong because increasing the timeout limit doesn't reduce cost. Option E is wrong because encryption doesn't reduce cost.
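Because Athena bills by the amount of data scanned, the effect of partitioning and columnar formats can be estimated with simple arithmetic. In this sketch the price per terabyte, the table size, the number of partitions, and the column counts are all assumptions for illustration; actual pricing varies by region and should be checked against the current price list. Compression, which usually reduces scanned bytes further, is ignored.

```python
ASSUMED_PRICE_PER_TB_USD = 5.0   # illustrative rate, not an authoritative price
BYTES_PER_TB = 1024 ** 4


def query_cost_usd(bytes_scanned, price_per_tb=ASSUMED_PRICE_PER_TB_USD):
    """Estimate query cost from bytes scanned."""
    return bytes_scanned / BYTES_PER_TB * price_per_tb


full_scan_bytes = 1 * BYTES_PER_TB              # hypothetical 1 TB table
partitioned_bytes = full_scan_bytes / 12        # query touches 1 of 12 monthly partitions
columnar_bytes = partitioned_bytes * 3 / 30     # query reads 3 of 30 columns

full_cost = query_cost_usd(full_scan_bytes)
partitioned_cost = query_cost_usd(partitioned_bytes)
columnar_cost = query_cost_usd(columnar_bytes)

print(f"{full_cost:.2f} {partitioned_cost:.2f} {columnar_cost:.3f}")
```

Under these assumptions the two techniques compound: partition pruning cuts the scan to one twelfth, and reading only the needed columns cuts it by another factor of ten.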

1472

Multi-Selecthard

A company is deploying a machine learning model using Amazon SageMaker. The model requires GPUs for inference. Which THREE configurations can the company use to meet this requirement? (Choose THREE.)

Select 3 answers

A.SageMaker Serverless Inference

B.Real-time endpoints with ml.p3 instance types

C.SageMaker Batch Transform with ml.p3 instances

D.SageMaker Studio

E.SageMaker Elastic Inference (EI)

AnswersB, C, E

Real-time endpoints support GPU instances like ml.p3.

Why this answer

Real-time endpoints support GPU instances. Batch Transform also supports GPU instances. Elastic Inference (Option E) provides GPU acceleration without a full GPU instance. Option A (Serverless Inference) does not support GPUs. Option D (SageMaker Studio) is an IDE, not an inference option.

1473

MCQhard

Refer to the exhibit. The training job 'my-job' failed with the error 'Unable to pull image from ECR'. What is the most likely cause?

A.The IAM role does not have permission to pull images from the ECR repository.

B.The instance type ml.m5.large does not support custom images.

C.The S3 bucket for training data is in a different account.

D.The role ARN is incorrect.

AnswerA

Without ecr:GetDownloadUrlForLayer and BatchGetImage, the pull fails.

Why this answer

The error 'Unable to pull image from ECR' indicates that the SageMaker training job could not retrieve the custom Docker image stored in Amazon ECR. The most likely cause is that the IAM role associated with the training job lacks the `ecr:GetDownloadUrlForLayer` and `ecr:BatchGetImage` permissions required to pull images from the ECR repository. Without these permissions, SageMaker cannot authenticate and download the container image, even if the repository and image exist.

Exam trap

AWS often tests the misconception that any IAM role with basic SageMaker permissions can pull images from ECR. The trap here is that the role must have explicit ECR permissions (not just SageMaker permissions) to download the container image; candidates may incorrectly blame the instance type or the S3 bucket location instead.

How to eliminate wrong answers

Option B is wrong because the instance type ml.m5.large fully supports custom images; SageMaker allows custom Docker images on any supported instance type, including ml.m5.large, as long as the image is compatible with the instance architecture. Option C is wrong because the S3 bucket being in a different account would cause a different error (e.g., 'Access Denied' or 'Bucket not found') and would not affect the ability to pull an image from ECR, which is a separate service. Option D is wrong because an incorrect role ARN would result in a validation error when submitting the job (e.g., 'Invalid IAM Role ARN'), not a runtime error during image pull; the job would fail to start, not fail mid-execution with an ECR pull error.

1474

MCQhard

Refer to the exhibit. A data scientist is training a PyTorch model on a SageMaker ml.p3.2xlarge instance (16 GB GPU memory). The training fails with the shown error. Which change should the scientist make to resolve the error?

A.Reduce the batch size in the training script.

B.Increase the number of instances to 2.

C.Use SageMaker Managed Spot Training.

D.Increase the number of epochs.

AnswerA

Smaller batch size reduces GPU memory consumption.

Why this answer

Reducing the batch size (Option A) reduces GPU memory usage per iteration and can resolve OOM errors. Option B (increase the instance count) does not reduce per-GPU memory. Option C (use Spot) does not help. Option D (increase epochs) will still fail.

1475

Multi-Selecthard

A company uses a SageMaker endpoint for real-time inference. They need to ensure high availability during deployment updates. Which THREE steps achieve this? (Choose 3)

Select 3 answers

A.Use a single instance to save costs

B.Use blue/green deployment with a new endpoint configuration

C.Configure multiple instances behind the endpoint

D.Delete the old endpoint before creating the new one

E.Use Canary or Linear traffic shifting in SageMaker

AnswersB, C, E

Blue/green allows traffic switch after new version is healthy.

Why this answer

Blue/green deployment, multiple instances, and traffic shifting are standard practices for zero-downtime updates.

1476

Multi-Selectmedium

Which THREE evaluation metrics are appropriate for a multi-class classification problem? (Choose 3.)

Select 3 answers

A.Confusion matrix.

B.Accuracy.

C.Mean squared error.

D.Precision-recall curve.

E.F1 score (macro/micro).

AnswersA, B, E

Confusion matrix provides per-class performance.

Why this answer

Option A is correct because a confusion matrix gives detailed per-class performance. Option B is correct because accuracy is common for multi-class problems. Option E is correct because the F1 score with macro/micro averaging is used. Option C is wrong because mean squared error is for regression. Option D is wrong because a precision-recall curve is typically for binary classification.

1477

MCQhard

A data scientist is performing exploratory data analysis on a high-dimensional dataset with 500 features. The scientist wants to visualize the data in 2D to check for clusters. Which dimensionality reduction technique should the scientist use that preserves global structure and is computationally efficient for large datasets?

A.t-SNE

B.Linear Discriminant Analysis (LDA)

C.PCA

D.UMAP

AnswerC

PCA is linear, fast, and preserves global variance.

Why this answer

Option C is correct because PCA is linear, fast, and preserves global variance structure. Option A is wrong because t-SNE is non-linear, slower, and focuses on local structure. Option B is wrong because LDA is supervised and requires labels. Option D is wrong because UMAP can be slow and is also non-linear.

1478

MCQmedium

A data scientist is deploying a SageMaker model using CloudFormation. The stack creation fails with the above error. What is the MOST likely cause?

A.The Docker image has not been pushed to the ECR repository

B.The IAM role does not have permissions to access ECR

C.The model name is incorrect

D.The instance type specified in the endpoint configuration is not available

AnswerA

The error clearly states the image does not exist in ECR.

Why this answer

The error indicates that SageMaker cannot find the Docker image specified in the `PrimaryContainer` of the model definition. CloudFormation creates the SageMaker model by referencing an ECR image URI; if that image has not been pushed to the specified ECR repository, the model creation fails immediately. This is the most common cause when the stack creation fails with an error about a missing or inaccessible image.

Exam trap

The trap here is that candidates confuse a missing image (resource not found) with an IAM permissions error, but the error message for a missing image is distinct and occurs at a different stage of the API call.

How to eliminate wrong answers

Option B is wrong because an IAM role lacking ECR permissions would produce an access denied or authorization error, not a 'not found' error for the image. Option C is wrong because an incorrect model name would cause a different error (e.g., 'Model not found') only when referencing an existing model, not during creation. Option D is wrong because an unavailable instance type would cause a resource allocation failure at the endpoint creation step, not during model creation.

1479

MCQhard

A data scientist is training a model using Amazon SageMaker with a custom Docker container. The training job fails with an error: 'Resource exhausted: Out of memory'. The training data is stored in S3. What should the data scientist do to resolve this issue?

A.Increase the instance memory by selecting a larger instance type.

B.Increase the EBS volume size attached to the training instance.

C.Use Pipe mode for data loading instead of File mode.

D.Reduce the batch size in the training script.

AnswerA

Larger instance provides more memory.

Why this answer

Option A is correct because 'Out of memory' indicates the instance does not have enough memory; selecting a larger instance type resolves the issue. Option B is wrong because EBS volume size does not affect memory. Option C is wrong because Pipe mode streams data directly from S3 and can reduce memory usage, but the error is about memory exhaustion, not data loading. Option D is wrong because reducing the batch size might help but is not the primary fix; increasing instance memory directly addresses the issue.

1480

Multi-Selecthard

A company is deploying a machine learning model on Amazon SageMaker. The model needs to be updated frequently with new versions. The team wants to minimize downtime and test the new model version before routing all traffic to it. Which TWO strategies should be used together?

Select 2 answers

A.Use a rolling update strategy.

B.Use a multi-model endpoint.

C.Use Amazon SageMaker A/B testing.

D.Use Amazon SageMaker canary deployment.

E.Use Amazon SageMaker blue/green deployment.

AnswersD, E

Canary deployment sends a small percentage of traffic to the new version.

Why this answer

Option D (canary deployment) and Option E (blue/green deployment) are correct. Blue/green deploys the new version alongside the old one, and canary traffic shifting routes a small percentage of traffic to the new version for testing before the full cutover. Option A is wrong because a rolling update replaces capacity in batches rather than validating the new version on a small slice of traffic first. Option B is wrong because multi-model endpoints host multiple models but do not facilitate traffic shifting for updates. Option C is wrong because A/B testing in SageMaker is typically done with production variants to compare models and does not inherently include canary routing; canary is a specific feature.

1481

MCQeasy

A CloudFormation stack creation failed. The SageMaker endpoint resource shows CREATE_FAILED. What is the most likely issue?

A.The IAM role used by CloudFormation lacks permissions to create endpoints.

B.The S3 bucket 'my-bucket' does not contain the object 'model.tar.gz'.

C.The SageMaker endpoint configuration is invalid.

D.The instance type specified for the endpoint is not available in the region.

AnswerB

The error states the model data is not accessible, likely because the object does not exist.

Why this answer

The correct answer is B because a CREATE_FAILED status on a SageMaker endpoint resource during CloudFormation stack creation most commonly indicates that the model artifact specified in the Model definition cannot be located. SageMaker requires the S3 bucket and object path (e.g., 's3://my-bucket/model.tar.gz') to exist and be accessible at the time of model creation. If the object is missing, the model resource fails, cascading to the endpoint creation failure.

Exam trap

The trap here is that candidates often assume endpoint failures are always due to configuration or permissions, but the most common root cause in CloudFormation deployments is a missing S3 artifact, which is a prerequisite that is easy to overlook.

How to eliminate wrong answers

Option A is wrong because if the IAM role lacked permissions, CloudFormation would typically fail with an access denied error on the role itself, not specifically on the endpoint resource with CREATE_FAILED; the role is validated before resource creation. Option C is wrong because an invalid endpoint configuration would produce a validation error during stack creation, but the question states the endpoint resource shows CREATE_FAILED, which implies the configuration was accepted but the underlying model or instance caused failure. Option D is wrong because an unavailable instance type would result in a resource creation error with a specific message about insufficient capacity or unavailability, not a generic CREATE_FAILED on the endpoint; CloudFormation would report a different error code.

1482

Multi-Selecteasy

A company wants to centralize logging from multiple AWS accounts and on-premises servers. The logs must be stored cost-effectively and be searchable. Which TWO services should be used? (Choose TWO.)

Select 2 answers

A.Amazon CloudWatch Logs

B.Amazon Redshift

C.Amazon Athena

D.Amazon Kinesis Data Streams

E.Amazon S3

AnswersC, E

Athena can query logs directly on S3.

Why this answer

Amazon S3 provides cost-effective storage for log archives, and Amazon Athena queries the logs directly in S3 without loading them into a database. Option A (CloudWatch Logs) is for real-time monitoring, not long-term storage. Option B (Redshift) is more expensive. Option D (Kinesis Data Streams) is for streaming, not storage.

1483

MCQeasy

A machine learning team is using Amazon SageMaker to train a linear regression model. The team notices that the training loss decreases rapidly initially but then plateaus at a high value. What is the MOST likely cause?

A.The model uses batch normalization

B.The learning rate is set too low

C.The model is over-regularized with L2 regularization

D.The learning rate is set too high

AnswerD

A high learning rate can cause the loss to fluctuate or plateau after an initial drop.

Why this answer

A learning rate set too high causes the optimizer to take excessively large steps, overshooting the minimum of the loss function. This results in rapid initial decrease as the model makes large corrections, but then the loss plateaus at a high value because the parameters oscillate around the optimum without converging. In SageMaker's linear regression (typically using stochastic gradient descent), a high learning rate prevents fine-grained convergence, leading to a high plateau.

Exam trap

The trap here is that candidates often associate a plateau in loss with a learning rate that is too low (underfitting), but the rapid initial decrease followed by a high plateau is a classic sign of a learning rate that is too high, causing divergence or oscillation.

How to eliminate wrong answers

Option A is wrong because batch normalization is not typically used in linear regression models; it is a technique for deep neural networks to stabilize training by normalizing layer inputs, and it would not cause a high plateau. Option B is wrong because a learning rate set too low would cause the loss to decrease very slowly from the start, not rapidly initially and then plateau at a high value. Option C is wrong because over-regularization with L2 regularization would cause the loss to be high from the beginning due to large penalty terms, and the loss would not decrease rapidly initially; it would remain high throughout training.
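
The oscillation described above can be illustrated with a minimal sketch. It assumes plain gradient descent on the toy loss f(w) = w**2, which is a hypothetical stand-in and not SageMaker's Linear Learner; the function name and values are illustrative only. It shows only the "oscillate without converging" part of the argument, not the rapid initial drop.

```python
def gradient_descent(lr, w0=5.0, steps=20):
    """Run gradient descent on the toy loss f(w) = w**2; return the loss at each step."""
    w = w0
    losses = []
    for _ in range(steps):
        losses.append(w * w)
        grad = 2.0 * w      # derivative of w**2
        w = w - lr * grad   # update rule: w <- (1 - 2*lr) * w
    return losses

# With lr = 1.0 the weight flips sign every step (5, -5, 5, ...), so the loss never shrinks.
too_high = gradient_descent(lr=1.0)
# With lr = 0.1 the weight shrinks by a factor of 0.8 per step and the loss approaches 0.
reasonable = gradient_descent(lr=0.1)
```

On this toy problem the high learning rate keeps the loss stuck at its starting value, while the smaller rate converges steadily.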

1484

MCQeasy

A data scientist wants to use Amazon SageMaker to train a deep learning model on a large dataset stored in S3. The training job is expected to take several hours. Which storage option should be used to minimize data loading time and cost?

A.Attach an Amazon EBS volume with the dataset pre-loaded

B.Use File mode to copy data to the training instance's local storage

C.Use Pipe mode to stream data directly from S3 during training

D.Mount an Amazon EFS file system to the training instance

AnswerC

Pipe mode streams data on the fly, reducing startup time and cost.

Why this answer

Option C is correct because Pipe mode streams data directly from S3 without downloading it first, minimizing time and cost. Option A (Amazon EBS) is wrong because a pre-loaded EBS volume is not something SageMaker accepts as a training input data source. Option B (File mode) downloads the entire dataset before training starts, increasing time and cost. Option D (Amazon EFS) is unnecessary and adds complexity.

1485

MCQhard

A company is deploying a model that predicts customer churn. The model's recall for the churn class is 0.9, but precision is 0.4. The business cost of false positives is high. Which strategy would MOST likely improve precision without significantly harming recall?

A.Collect more data for the churn class

B.Use a different algorithm such as Random Forest

C.Decrease the decision threshold for the churn class

D.Increase the decision threshold for the churn class

AnswerD

Higher threshold reduces false positives, improving precision, though recall may drop slightly.

Why this answer

Adjusting the decision threshold to require a higher probability before predicting churn can reduce false positives (increase precision) but may lower recall. The goal is to find a threshold that balances both. Using more aggressive regularization or different algorithms may not directly control the trade-off.
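
The trade-off can be seen in a small sketch. The labels, scores, and the helper name `precision_recall` below are made up for illustration; they are assumptions, not data from the question.

```python
def precision_recall(y_true, scores, threshold):
    """Compute precision and recall for the positive class at a given threshold."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for y, p in zip(y_true, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(y_true, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(y_true, preds) if y == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical churn labels (1 = churn) and predicted probabilities.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.55, 0.6, 0.45, 0.4, 0.3, 0.2]

low = precision_recall(y_true, scores, 0.35)   # many positives predicted
high = precision_recall(y_true, scores, 0.7)   # fewer, more confident positives
```

On this toy data, raising the threshold from 0.35 to 0.7 removes every false positive at the cost of missing one true churner. In practice the threshold would be chosen on a validation set.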

1486

Multi-Selecthard

A company is using Amazon Kinesis Data Streams with a Lambda consumer. The Lambda function writes results to an S3 bucket. The team wants to ensure that each record is processed exactly once and in order. Which TWO configurations should the team implement? (Choose 2.)

Select 2 answers

A.Set the batch size to 1

B.Increase the Lambda function's reserved concurrency

C.Set the parallelization factor to 1

D.Configure a dead-letter queue for failed records

E.Enable S3 bucket versioning to track duplicates

AnswersC, E

A parallelization factor of 1 ensures a single Lambda instance processes each shard, maintaining order.

Why this answer

Option C (set the parallelization factor to 1) and Option E (enable S3 bucket versioning) are correct. Option A is wrong because batch size does not affect ordering. Option B is wrong because concurrency limits do not ensure ordering. Option D is wrong because a dead-letter queue does not affect ordering.

1487

MCQhard

A company runs a streaming data pipeline using Amazon Kinesis Data Streams with 10 shards. The pipeline ingests sensor data from thousands of devices. Each device sends a JSON payload every 5 seconds. The payload size is approximately 2 KB. The data is consumed by a fleet of EC2 instances running a custom Java application that uses the Kinesis Client Library (KCL). Over the past week, the company has observed that the consumer application is experiencing increased latency, and the Kinesis stream's 'GetRecords.IteratorAgeMilliseconds' CloudWatch metric is consistently above 10 seconds. The company has verified that the EC2 instances have sufficient CPU and memory resources. The KCL application is configured with 10 workers, one per shard. The application processes each record by performing a simple transformation and writing to Amazon DynamoDB. The DynamoDB table has sufficient write capacity and is not throttling. The company wants to reduce the iterator age to under 2 seconds. Which action should the company take?

A.Replace Kinesis Data Streams with Amazon Kinesis Data Firehose

B.Increase the write capacity of the DynamoDB table

C.Increase the number of shards in the Kinesis stream to 20

D.Increase the number of KCL workers to 20

AnswerC

More shards increase the number of concurrent consumers and reduce iterator age.

Why this answer

Option C is correct because the consumer is bottlenecked by the number of shards; increasing shards allows more parallelism and reduces latency. Option A is wrong because switching to Firehose would change the architecture and may not support the custom transformation. Option B is wrong because the issue is not DynamoDB capacity. Option D is wrong because increasing the worker count beyond the number of shards does not help: KCL assigns each shard to one worker at a time, so the extra workers would sit idle.

1488

MCQeasy

A company uses SageMaker to train a model, but the training job fails due to insufficient memory. What is the most cost-effective way to resolve this?

A.Use a larger instance type with more memory

B.Use Spot Instances to reduce cost

C.Reduce the batch size in the training script

D.Switch to distributed training across multiple instances

AnswerA

Larger instance types provide more memory, directly solving the issue.

Why this answer

Option A is correct because increasing instance memory addresses the issue directly. Option B is wrong because Spot Instances have no memory advantage. Option C is wrong because reducing batch size may not solve memory issues if the model itself is large. Option D is wrong because distributed training adds complexity and cost.

1489

MCQmedium

During EDA, a data scientist finds that a feature 'age' has 30% missing values. The dataset has 100,000 rows. Which imputation strategy is most robust if the data is not missing at random (MNAR) and the missingness is related to the age value itself?

A.Impute with the mean age

B.Impute with a random sample from the observed ages

C.Impute with the median age

D.Create a separate category indicating missingness and impute with a placeholder

AnswerD

Creating a missing category captures the information that the value is missing, which is informative under MNAR.

Why this answer

Option D is correct because missingness related to the value itself means that the missing data are systematically different; creating a 'missing' category allows the model to learn the pattern. Option A is wrong because mean imputation reduces variance and ignores the systematic difference. Option B is wrong because random imputation introduces noise without capturing the missingness pattern. Option C is wrong because median imputation has similar issues to mean imputation.
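
A minimal sketch of the missingness-indicator approach, assuming missing ages are represented as `None` and using a hypothetical placeholder of -1 (both are illustrative choices, not part of the question):

```python
def add_missing_indicator(ages, placeholder=-1):
    """Return (filled_ages, is_missing_flags) for a list that uses None for missing values."""
    flags = [1 if a is None else 0 for a in ages]
    filled = [placeholder if a is None else a for a in ages]
    return filled, flags

ages = [34, None, 51, None, 27]
age_filled, age_missing = add_missing_indicator(ages)
```

The model then receives both columns, so it can learn from the fact that a value was missing instead of treating the placeholder as a real age.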

1490

MCQhard

A financial services company uses Amazon SageMaker to train a fraud detection model. The training data is stored in an S3 bucket encrypted with AWS KMS. The SageMaker training job is configured to use a custom Docker container that reads data from S3 and writes model artifacts back to S3. The training job fails with the error: 'Unable to write model artifact to s3://my-bucket/output/model.tar.gz. Access Denied.' The IAM role used by the training job has the following permissions: s3:GetObject and s3:PutObject on the bucket, and kms:Decrypt on the KMS key. The training job is not using a VPC. What is the MOST likely cause of the failure?

A.The S3 bucket is in a different region than the training job

B.The IAM role does not have kms:GenerateDataKey permission on the KMS key

C.The S3 bucket requires S3 Batch Operations for writing artifacts

D.The IAM role does not have s3:PutObject permission on the output bucket

AnswerB

kms:GenerateDataKey is required for writing KMS-encrypted objects.

Why this answer

Option B is correct because SageMaker training jobs need kms:GenerateDataKey permission to write encrypted objects. Option A is wrong because nothing in the scenario points to a region mismatch; the job is also not in a VPC, so VPC endpoint issues are ruled out as well. Option C is wrong because S3 Batch Operations is irrelevant. Option D is wrong because the role already has s3:PutObject.
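
As a rough sketch, the corrected role policy might look like the document below, built here as a Python dictionary. The bucket name comes from the question; the KMS key ARN, account ID, and region are hypothetical placeholders, and the exact statements in a real role may differ.

```python
import json

# Hypothetical placeholder; substitute the ARN of the key that encrypts the bucket.
KMS_KEY_ARN = "arn:aws:kms:us-east-1:111122223333:key/EXAMPLE-KEY-ID"

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:PutObject"],
            "Resource": "arn:aws:s3:::my-bucket/*",
        },
        {
            "Effect": "Allow",
            # kms:Decrypt covers reads; kms:GenerateDataKey covers encrypted writes.
            "Action": ["kms:Decrypt", "kms:GenerateDataKey"],
            "Resource": KMS_KEY_ARN,
        },
    ],
}

policy_json = json.dumps(policy, indent=2)
```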

1491

Multi-Selecthard

A data scientist is analyzing a large dataset of images stored in Amazon S3. The dataset is used to train a computer vision model. Which THREE EDA steps are appropriate for this image dataset?

Select 3 answers

A.Compute the distribution of image dimensions (height and width).

B.Check for corrupted or unreadable image files.

C.Decompose the time series of image timestamps to detect seasonality.

D.Visualize a sample of images from each class to verify labels.

E.Perform tokenization and stop word removal on image filenames.

AnswersA, B, D

Important for resizing and batching.

Why this answer

Analyzing image dimensions, identifying corrupt images, and visualizing samples from each class to verify labels are standard EDA steps for image data. Option C (time series decomposition) is irrelevant. Option E (text preprocessing) is for text data.

1492

MCQmedium

Refer to the exhibit. A SageMaker training job uses an IAM role with this policy. The training job writes output to s3://my-bucket/output/. Which statement about the policy is true?

A.The Allow statement allows all PutObject requests regardless of encryption

B.The training job can write output objects only if server-side encryption with SSE-S3 is used

C.The Deny statement blocks all PutObject requests

D.The GetObject permission requires the object to be encrypted with SSE-S3

AnswerB

Deny requires AES256 encryption.

Why this answer

Option B is correct because the Deny statement blocks PutObject requests that do not use SSE-S3 (AES256). Option A is wrong because an explicit Deny overrides the Allow whenever the encryption condition is not met. Option C is wrong because the Deny is conditional, so PutObject succeeds when the request uses SSE-S3. Option D is wrong because GetObject is allowed without an encryption requirement.
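
The exhibit is not reproduced here, so the following is only a sketch of the evaluation logic under an assumed policy shape: one Allow for s3:PutObject and one Deny that applies when the `x-amz-server-side-encryption` header is not `AES256`. The function name is hypothetical.

```python
def is_put_allowed(sse_header):
    """Evaluate the assumed two-statement policy for an s3:PutObject request.

    sse_header is the value of x-amz-server-side-encryption, or None if absent.
    """
    allow_matches = True                    # assumed Allow on s3:PutObject
    deny_matches = sse_header != "AES256"   # assumed Deny unless the header is AES256
    # An explicit Deny always overrides an Allow.
    return allow_matches and not deny_matches

with_sse_s3 = is_put_allowed("AES256")
without_header = is_put_allowed(None)
with_sse_kms = is_put_allowed("aws:kms")
```

Only the request that carries the AES256 header gets through; a missing header or a different encryption type hits the Deny.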

1493

MCQhard

A data scientist is building a multi-class classification model with 10 classes. The dataset has 100,000 samples. After training a random forest with 100 trees, the model achieves 85% accuracy on the test set. However, the data scientist notices that for one rare class (1% of data), recall is only 5%. Which technique is MOST likely to improve recall for the rare class without significantly reducing overall accuracy?

A.Increase the number of trees to 500

B.Apply SMOTE to oversample the rare class in the training data

C.Use stratified sampling only for the test set

D.Reduce the decision threshold for the rare class to 0.1

AnswerB

SMOTE creates synthetic samples for the minority class.

Why this answer

SMOTE (Synthetic Minority Oversampling Technique) generates synthetic samples for the rare class by interpolating between existing minority instances, which directly addresses the class imbalance. This increases the model's exposure to the rare class during training, improving recall without discarding data or significantly altering the overall class distribution, thus preserving overall accuracy.

Exam trap

The exam often tests the misconception that increasing model complexity (more trees) or adjusting thresholds post-training can fix class imbalance, when in fact the root cause is the skewed training data distribution, which requires a data-level technique like SMOTE.

How to eliminate wrong answers

Option A is wrong because increasing the number of trees in a random forest primarily reduces variance and improves generalization, but it does not address class imbalance; recall for a rare class will remain low if the training data is skewed. Option C is wrong because stratified sampling on the test set only ensures the test set reflects the original class distribution, which does nothing to improve the model's ability to learn the rare class during training. Option D is wrong because reducing the decision threshold for the rare class to 0.1 would increase recall but at the cost of dramatically increasing false positives, which would significantly reduce overall accuracy, especially since the rare class is only 1% of the data.
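
The interpolation idea behind SMOTE can be sketched in a few lines. This is a simplified illustration with made-up points and a hypothetical function name, not the reference algorithm or a library implementation.

```python
import random

def smote_like(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a minority sample
    and one of its k nearest minority neighbours (simplified SMOTE idea)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority points by squared Euclidean distance to base.
        others = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(p, base)),
        )
        neighbour = rng.choice(others[:k])
        t = rng.random()  # interpolation weight in [0, 1)
        synthetic.append(tuple(a + t * (b - a) for a, b in zip(base, neighbour)))
    return synthetic

# Hypothetical rare-class samples with two features each.
rare = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.2), (1.2, 2.5)]
new_points = smote_like(rare, n_new=6)
```

Each synthetic point lies on the line segment between two real minority samples. Oversampling should be applied to the training split only, so the test set keeps the original class distribution.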

1494

MCQeasy

A data scientist is building a binary classification model to predict whether a customer will subscribe to a service. The dataset contains 20 features, including categorical variables with high cardinality (e.g., zip code with 10,000 unique values). The scientist uses a logistic regression model and obtains a training AUC of 0.85 and a test AUC of 0.60. The scientist suspects overfitting due to high cardinality features. Which approach should the scientist use to address this issue?

A.Apply label encoding to the zip code feature

B.Remove the zip code feature entirely

C.Apply target encoding with smoothing to the zip code feature

D.Apply one-hot encoding to the zip code feature

AnswerC

Target encoding reduces cardinality and can improve generalization.

Why this answer

Option C (target encoding with smoothing) reduces cardinality while preserving predictive power. Option A (label encoding) may introduce ordinality issues. Option B (remove zip code) may lose important information. Option D (one-hot encoding) increases dimensionality drastically.
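
A minimal sketch of target encoding with additive smoothing follows. The formula, the `smoothing` parameter, and the sample data are one common formulation chosen for illustration; other smoothing schemes exist.

```python
from collections import defaultdict

def target_encode(categories, targets, smoothing=10.0):
    """Map each category to a smoothed mean of the target.

    encoded = (category_sum + smoothing * global_mean) / (category_count + smoothing)
    """
    global_mean = sum(targets) / len(targets)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for c, y in zip(categories, targets):
        sums[c] += y
        counts[c] += 1
    return {
        c: (sums[c] + smoothing * global_mean) / (counts[c] + smoothing)
        for c in counts
    }

# Hypothetical zip codes and subscription labels.
zips = ["10001", "10001", "10001", "10001", "94105"]
subscribed = [1, 1, 1, 0, 1]
encoding = target_encode(zips, subscribed, smoothing=2.0)
```

The zip code seen only once has a raw mean of 1.0, but smoothing pulls its encoded value toward the global mean of 0.8, which limits overfitting on rare categories. The encoding should be fitted on training data only to avoid target leakage.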

1495

MCQeasy

A machine learning team is reviewing a dataset for a regression problem. They notice that the target variable has a right-skewed distribution. Which transformation should they consider applying to the target variable to improve model performance?

A.Apply StandardScaler to the target variable.

B.Apply MinMaxScaler to the target variable.

C.Apply log transformation to the target variable.

D.Apply one-hot encoding to the target variable.

AnswerC

Log transformation reduces right skewness.

Why this answer

Log transformation is commonly applied to right-skewed data to make it more normally distributed, which can improve model performance. Option A (StandardScaler) is for scaling, not skewness. Option B (MinMaxScaler) also doesn't address skewness.

Option D (One-hot encoding) is for categorical variables.
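
To make the effect concrete, here is a small sketch that measures skewness before and after a log transform. The target values and the helper name are invented for illustration.

```python
import math

def skewness(values):
    """Population skewness: the mean of the cubed z-scores."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return sum(((v - mean) / std) ** 3 for v in values) / n

# Hypothetical right-skewed target with a long upper tail.
target = [1, 2, 2, 3, 3, 4, 5, 8, 20, 100]
log_target = [math.log1p(v) for v in target]  # log(1 + v) also handles zeros

skew_before = skewness(target)
skew_after = skewness(log_target)
```

The skewness drops after the transform because the log compresses the large values. Predictions made on the log scale need to be mapped back with `math.expm1` before they are reported in original units.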

1496

MCQeasy

A data scientist is using a decision tree algorithm for a classification task. The tree is very deep and achieves 100% accuracy on the training set but performs poorly on the test set. Which technique should the data scientist use to improve generalization?

A.Add more features to the dataset.

B.Reduce the number of training samples.

C.Prune the decision tree.

D.Increase the maximum depth of the tree.

AnswerC

Pruning reduces tree complexity and improves generalization.

Why this answer

A deep decision tree that achieves 100% training accuracy but poor test accuracy is overfitting the training data. Pruning the tree removes branches that have little statistical power, reducing complexity and improving generalization to unseen data.

Exam trap

The trap here is that candidates may confuse overfitting with underfitting and choose to increase model complexity (Option D) or add features (Option A), when the correct remedy for overfitting is to reduce complexity through pruning.

How to eliminate wrong answers

Option A is wrong because adding more features typically increases the risk of overfitting by giving the tree more opportunities to memorize noise. Option B is wrong because reducing the number of training samples exacerbates overfitting by providing less data for the tree to learn generalizable patterns. Option D is wrong because increasing the maximum depth would make the tree even deeper and more complex, worsening overfitting rather than improving generalization.

1497

MCQeasy

A data scientist is using Amazon SageMaker to train a linear regression model. The training data contains missing values. Which preprocessing step should be applied before training?

A.Ignore missing values; linear regression can handle them.

B.Impute missing values with the mean of the column.

C.Replace missing values with zeros.

D.Remove all rows containing missing values.

AnswerB

Imputation is a common technique to handle missing data.

Why this answer

Option B is correct because linear regression models in Amazon SageMaker cannot handle missing values natively; they require complete numerical input. Imputing missing values with the column mean is a standard preprocessing technique that preserves the overall distribution and avoids introducing bias, ensuring the SageMaker built-in Linear Learner algorithm can train without errors.

Exam trap

The trap here is that candidates may assume linear regression can inherently handle missing values (Option A) due to its statistical robustness, but AWS SageMaker's implementation requires complete data, and ignoring missing values will cause runtime errors or silent model degradation.

How to eliminate wrong answers

Option A is wrong because linear regression algorithms, including SageMaker's Linear Learner, do not accept missing values in the training data; they will either fail or produce incorrect results if missing values are present. Option C is wrong because replacing missing values with zeros can significantly distort the data distribution and model coefficients, especially if the missingness is not random, leading to biased estimates. Option D is wrong because removing all rows with missing values can drastically reduce the dataset size, potentially discarding valuable information and introducing selection bias, which is particularly problematic in small or imbalanced datasets.
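
Mean imputation itself is a short preprocessing step. The sketch below assumes missing values are represented as `None`; the function name and data are illustrative.

```python
def impute_mean(column):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in column if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in column]

feature = [2.0, None, 4.0, 6.0, None]
imputed = impute_mean(feature)
```

The mean should be computed on the training set and reused unchanged for validation and test data.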

1498

Multi-Selecthard

During EDA of a dataset for a regression problem, a data scientist notices that the target variable has a right-skewed distribution. Which THREE transformations are appropriate to address this skewness? (Choose THREE.)

Select 3 answers

A.Log transformation

B.StandardScaler (z-score normalization)

C.Box-Cox transformation

D.Yeo-Johnson transformation

E.Min-Max scaling

AnswersA, C, D

Log transformation compresses large values, reducing right skew.

Why this answer

Options A, C, and D are correct. Log transformation, Box-Cox, and Yeo-Johnson are common transformations for right-skewed data. Option B is wrong because StandardScaler only standardizes and does not reduce skewness. Option E is wrong because Min-Max scaling does not affect skewness.
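
For reference, the Box-Cox transform with a fixed lambda can be sketched as follows. The function name and sample values are illustrative; in practice lambda is usually estimated from the data rather than set by hand.

```python
import math

def box_cox(values, lam):
    """Box-Cox transform for strictly positive values with a fixed lambda."""
    if any(v <= 0 for v in values):
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return [math.log(v) for v in values]  # lambda = 0 is the log transform
    return [(v ** lam - 1.0) / lam for v in values]

transformed = box_cox([1.0, 4.0, 9.0], lam=0.5)
as_log = box_cox([1.0], lam=0)
```

Box-Cox includes the log transform as the special case lambda = 0, and Yeo-Johnson extends the same idea to zero and negative values.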

1499

MCQmedium

A company stores sensor data in Amazon S3. A data scientist wants to explore the data using SQL without moving it. Which AWS service should they use?

A.Amazon EMR

B.Amazon Redshift

C.Amazon QuickSight

D.Amazon Athena

AnswerD

Athena queries data directly in S3 using SQL.

Why this answer

Amazon Athena is the correct choice because it is a serverless interactive query service that allows you to analyze data directly in Amazon S3 using standard SQL without any data movement or infrastructure management. Athena uses Presto under the hood and charges only for the data scanned per query, making it ideal for ad-hoc exploratory analysis on sensor data stored in S3.

Exam trap

The trap here is that candidates often confuse Amazon Athena with Amazon EMR or Redshift, thinking they need a full cluster or data warehouse for SQL queries, but Athena is specifically designed for serverless, direct S3 querying with no data movement.

How to eliminate wrong answers

Option A is wrong because Amazon EMR is a managed big data platform that requires provisioning and managing clusters (e.g., Hadoop, Spark), which involves moving or processing data in a separate compute layer, not querying it directly in S3 with SQL without setup. Option B is wrong because Amazon Redshift is a data warehouse that requires loading data from S3 into its own storage before querying, violating the 'without moving it' requirement. Option C is wrong because Amazon QuickSight is a business intelligence (BI) visualization tool, not a SQL query engine; it can connect to Athena but cannot directly run SQL queries on S3 data on its own.

1500

Multi-Selecthard

A data scientist is analyzing a dataset with many missing values. The scientist wants to decide on an imputation strategy. Which THREE considerations are important for choosing the imputation method?

Select 3 answers

A.The mechanism of missingness (MCAR, MAR, MNAR).

B.The class imbalance of the target variable.

C.The percentage of missing values in each feature.

D.The distribution of the feature (e.g., skewed, normal).

E.The feature importance according to a random forest model.

AnswersA, C, D

The missingness mechanism determines whether a given imputation method is valid.

Why this answer

Missing data mechanism (MCAR/MAR/MNAR), proportion of missing values, and feature distribution (skewness, outliers) all affect imputation choice. Option B (class imbalance) concerns classification targets, not imputation. Option E (feature importance) is not directly relevant.
