Knowledge + Practice

AWS Certified Machine Learning Specialty (MLS-C01): Questions 1651–1725

1755 questions total · 24 pages · All types, answers revealed

Page 23 of 24

1651

MCQ · easy

A data scientist is using Amazon SageMaker to train a linear regression model. The training data contains 100 features and 1 million rows. The scientist notices that the model is overfitting, with training R² of 0.99 and validation R² of 0.65. The scientist has already tried adding L2 regularization and reducing the number of features. Which additional technique should the scientist try to reduce overfitting?

A.Increase the amount of training data

B.Increase the batch size

C.Increase the learning rate

D.Add more features

Answer: A

More data helps the model generalize better.

Why this answer

Option A is correct. Adding more training data can help reduce overfitting by providing a more representative sample. Option B is wrong because increasing the batch size does not address overfitting. Option C is wrong because increasing the learning rate may cause training instability. Option D is wrong because adding more features increases model complexity, which would worsen overfitting.

1652

MCQ · medium

A company uses Kinesis Data Streams to ingest real-time sensor data. The data is consumed by a Lambda function that writes to DynamoDB. During peak hours, the Lambda function throws ProvisionedThroughputExceededException. The team wants to decouple the write operation and improve resilience. What should they do?

A.Use Kinesis Firehose as a consumer of the stream, with a Lambda transformation to write to DynamoDB, and enable error handling.

B.Increase the Lambda function's reserved concurrency and provision more DynamoDB write capacity.

C.Place the Lambda function's output into an Amazon SQS queue, and have a second Lambda function write to DynamoDB.

D.Use Kinesis Data Analytics to process the stream and write results directly to DynamoDB.

Answer: A

Firehose buffers data, retries on failures, and decouples the producer from DynamoDB writes.

Why this answer

Option A is correct because Kinesis Firehose can buffer data and write to DynamoDB via Lambda, providing retry and decoupling. Option B is wrong because Lambda is already being used; the issue is throughput, not compute, and scaling up does not decouple the writes. Option C is wrong because SQS does not integrate directly with Kinesis. Option D is wrong because Kinesis Data Analytics does not write to DynamoDB directly.

1653

MCQ · easy

A data scientist is using SageMaker to train a linear learner algorithm. After training, the evaluation shows that the model has high bias. Which action is most likely to reduce bias?

A.Increase the L2 regularization strength

B.Reduce the amount of training data

C.Add feature crosses for categorical variables

D.Remove some features that have low variance

Answer: C

Adding feature crosses increases model capacity to capture interactions, reducing bias.

Why this answer

High bias indicates that the model is underfitting the data, meaning it is too simple to capture the underlying patterns. Adding feature crosses for categorical variables creates interaction features that allow the linear learner to model non-linear relationships, increasing model complexity and reducing bias. This is a standard technique in linear models to address underfitting without switching to a non-linear algorithm.

Exam trap

The trap here is that candidates often confuse bias with variance and incorrectly choose regularization (Option A) to fix underfitting, when regularization actually increases bias and is used to combat overfitting (high variance).

How to eliminate wrong answers

Option A is wrong because increasing L2 regularization strength penalizes large weights, which simplifies the model further and increases bias, not reduces it. Option B is wrong because reducing the amount of training data typically worsens underfitting by providing fewer examples for the model to learn from, increasing bias. Option D is wrong because removing low-variance features reduces the information available to the model, which can increase bias by discarding potentially useful signals.
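
To make the idea of a feature cross concrete, here is a minimal Python sketch. The function name `cross_features` and the example columns `device` and `region` are hypothetical and do not come from the question or from any SageMaker API; the sketch only shows how two categorical values can be combined into one interaction feature before one-hot encoding.

```python
def cross_features(rows, a, b):
    """Return copies of the rows with an added '<a>_x_<b>' interaction column.

    Each crossed value is the pair of original category values, so a linear
    model can learn a separate weight for every combination.
    """
    crossed_name = f"{a}_x_{b}"
    return [{**row, crossed_name: f"{row[a]}|{row[b]}"} for row in rows]


# Hypothetical example data: two categorical features per record.
records = [
    {"device": "mobile", "region": "eu"},
    {"device": "mobile", "region": "us"},
    {"device": "desktop", "region": "eu"},
]

for row in cross_features(records, "device", "region"):
    print(row["device_x_region"])
```

The crossed column would then be one-hot encoded like any other categorical feature, which is what gives the linear model the extra capacity.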

1654

MCQ · easy

A company uses Amazon Kinesis Data Firehose to deliver data to an Amazon S3 bucket. The data is organized by year/month/day/hour. The team needs to ensure that all data is encrypted at rest in S3 using an AWS KMS customer managed key (CMK). Which configuration should the team implement?

A.Configure the S3 bucket's default encryption to use the customer managed KMS key.

B.Use an AWS Lambda function to encrypt the data after it is delivered to S3.

C.In the Firehose delivery stream configuration, enable S3 destination encryption and select the customer managed KMS key.

D.Add a bucket policy that denies PutObject unless the request includes the correct KMS key.

Answer: C

Firehose supports SSE-KMS for the S3 destination directly.

Why this answer

Option C is correct because Kinesis Data Firehose can be configured to use server-side encryption with AWS KMS (SSE-KMS) for the S3 destination. Option A is wrong because S3 default encryption is applied at the bucket level, but Firehose can override it; the recommended approach is to configure encryption in Firehose. Option B is wrong because re-encrypting objects with a Lambda function after delivery adds complexity and does not protect them with the customer managed key at write time. Option D is wrong because a bucket policy with Deny is not needed if encryption is configured properly.

1655

MCQ · hard

A company is using Amazon SageMaker to train a large natural language processing model. The training job uses a GPU instance and is expected to take several hours. The data scientist wants to monitor GPU utilization in real-time. Which approach is MOST effective?

A.Use SageMaker Managed Spot Training to reduce cost and monitor utilization via spot instance status

B.Modify the training script to periodically log GPU utilization to a file in S3

C.Use SageMaker Debugger to capture GPU utilization tensors

D.Enable CloudWatch metrics for the training job and view GPU utilization in the CloudWatch console

Answer: D

SageMaker automatically publishes GPU metrics to CloudWatch.

Why this answer

Option D is correct because SageMaker publishes CloudWatch metrics for GPU utilization. Option A is wrong because Managed Spot Training is a cost-saving feature, not a monitoring tool. Option B is wrong because it builds a custom solution when a built-in one exists. Option C is wrong because Debugger is intended for debugging, not real-time monitoring.

1656

MCQ · medium

A data scientist is exploring log files stored in S3. They ran the above AWS CLI command. What does the output indicate about the data, and what EDA step should be taken next?

A.All log files are about 150KB-200KB in size.

B.There are 3 objects in the bucket under the prefix.

C.There are 3 log files larger than 100KB in the specified prefix.

D.The prefix 'logs/2023/' contains exactly 3 objects.

Answer: C

The command filters by size >100000 bytes and returns keys and sizes.

Why this answer

Option C is correct because the command filters for objects larger than 100000 bytes, and the output shows three such files. Option A is wrong because the output lists only the three filtered files, not all log files. Option B is wrong because the command does not count all objects; it filters by size. Option D is wrong because the command explicitly filters by size, so the output does not cover every object under the prefix.

1657

MCQ · medium

A data science team is using Amazon SageMaker to train a deep learning model for object detection using the built-in SSD algorithm. The dataset consists of 100,000 labeled images stored in a SageMaker Pipe Mode input. The training job uses a single ml.p3.2xlarge instance. After 2 hours, the training job fails with the error 'ResourceLimitExceeded: The account-level service limit for ml.p3.2xlarge for training job usage is 1. Contact AWS Support to request a limit increase'. However, the team has already submitted a limit increase request and it was approved for 5 instances. What is the most likely cause of the error?

A.The instance is running out of GPU memory

B.The built-in SSD algorithm requires a GPU instance type with at least 16 GB of GPU memory

C.The service limit increase has not yet been applied to the account in the current region

D.The IAM role does not have permission to access the S3 bucket for model artifacts

Answer: C

Limit increases are region-specific and may take some time to become effective.

Why this answer

Option C is correct because the limit increase applies to a specific region and may take time to propagate. Option A is wrong because running out of GPU memory would produce a different error. Option B is wrong because the algorithm's GPU requirements are unrelated to an account-level service limit. Option D is wrong because a missing S3 permission would surface as AccessDenied.

1658

MCQ · medium

A data scientist is using Amazon SageMaker to train a deep learning model on a large dataset stored in S3. The training job is taking too long. The data scientist wants to reduce training time without changing the model architecture. Which action should they take?

A.Use Pipe mode for data input

B.Use a smaller instance type

C.Increase the number of epochs

D.Decrease the batch size

Answer: A

Pipe mode streams data, reducing download time.

Why this answer

Option A is correct because Pipe mode streams data directly from S3 without downloading it first, reducing I/O time. Option B is wrong because a smaller instance may increase training time. Option C is wrong because increasing the number of epochs increases training time. Option D is wrong because reducing the batch size increases the number of training steps.

1659

MCQ · medium

A training job fails with the error shown. The training script expects a file named 'train.csv' in the 'training' channel. What is the most likely cause?

A.The 'train.csv' file is located inside a subfolder within 's3://my-bucket/data/', and the script expects it directly in the channel path.

B.The S3 bucket policy denies access to the 'train.csv' file.

C.The channel name in the input data configuration does not match the script's expected channel name.

D.The training script has a bug that prevents it from reading the file.

Answer: A

SageMaker downloads the entire S3 prefix; if the file is nested, it may not be at the expected location.

Why this answer

The error indicates that the training script cannot find 'train.csv' in the expected location. When SageMaker copies data from an S3 channel path (e.g., 's3://my-bucket/data/') to the training instance, it places the contents of that S3 prefix directly into the channel directory (e.g., '/opt/ml/input/data/training/'). If the CSV file is inside a subfolder (e.g., 's3://my-bucket/data/subfolder/train.csv'), the script will not find it at the top level of the channel path, causing a 'file not found' error.

Exam trap

AWS often tests the distinction between S3 prefix behavior and file location expectations, trapping candidates who assume SageMaker automatically searches subdirectories or flattens the S3 structure.

How to eliminate wrong answers

Option B is wrong because an S3 bucket policy denying access would produce a different error (e.g., 'AccessDenied' or '403 Forbidden'), not a 'file not found' error from the training script. Option C is wrong because the error message does not mention a channel name mismatch; such a mismatch would cause SageMaker to fail to mount the channel, resulting in a different error during the job setup phase. Option D is wrong because the error is specifically about a missing file, not a runtime bug in the script's reading logic; a bug would typically produce a Python traceback or parsing error, not a 'file not found' error.

1660

Multi-Select · easy

A company needs to move 50 TB of data from an on-premises data center to Amazon S3. The company has a limited internet bandwidth of 100 Mbps. The data transfer must be completed within 10 days. Which TWO services should the company use together to meet these requirements?

Select 2 answers

A.Amazon S3 as the destination

B.AWS Direct Connect

C.AWS Site-to-Site VPN

D.AWS Snowball Edge

E.AWS DataSync over the internet

Answers: A, D

Data is ultimately stored in S3.

Why this answer

Options A and D are correct. AWS Snowball Edge is a physical device for large data transfers, and S3 is the destination. Option B is wrong because Direct Connect provides a dedicated network link, not physical transfer. Option C is wrong because a VPN is not designed for bulk data transfer. Option E is wrong because moving 50 TB over the internet at 100 Mbps would take about 46 days even at full line rate, well beyond the 10-day window.
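
The bandwidth argument can be checked with a quick back-of-the-envelope calculation. The sketch below is only illustrative: the function name `transfer_days` is made up for this example, and it assumes decimal units (1 TB = 10^12 bytes, 1 Mbps = 10^6 bits per second) and a link that is fully available to the transfer.

```python
def transfer_days(size_tb: float, link_mbps: float, efficiency: float = 1.0) -> float:
    """Days needed to move size_tb (decimal terabytes) over a link_mbps link.

    efficiency is the usable fraction of the link (1.0 = full line rate).
    """
    bits = size_tb * 1e12 * 8                        # terabytes -> bits
    seconds = bits / (link_mbps * 1e6 * efficiency)  # bits / (bits per second)
    return seconds / 86_400                          # seconds -> days


days = transfer_days(50, 100)
print(f"{days:.1f} days")  # about 46.3 days at full line rate
```

Any realistic efficiency below 1.0 only lengthens the estimate, which is why a physical transfer device is the practical choice for a 10-day deadline.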

1661

MCQ · medium

A company is using Amazon SageMaker to deploy a model for real-time inference. The model has a latency requirement of less than 100 milliseconds. During testing, the latency is around 150 milliseconds. Which action can most likely reduce the latency to meet the requirement?

A.Reduce the batch size for inference.

B.Enable data capture for the endpoint.

C.Increase the initial variant weight for the production variant.

D.Use a larger instance type for the endpoint.

Answer: D

A larger instance type provides more compute resources, reducing inference latency.

Why this answer

Option D is correct because a larger instance type provides more compute for each request, which is the most direct way to reduce inference latency, although it may increase cost. Option A is less likely to help: reducing the batch size lowers latency only when requests are batched, and a real-time endpoint typically processes one request at a time, so the batch size may already be 1. Option B is wrong because enabling data capture adds overhead and increases latency. Option C is wrong because the initial variant weight controls traffic routing between variants, not latency.

1662

MCQ · hard

A company is deploying a machine learning model for real-time fraud detection. The model must have low latency (under 100 ms) and high throughput. The data scientist trains a gradient boosting model and deploys it to a SageMaker endpoint with a single ml.c5.xlarge instance. During load testing, the endpoint exceeds the latency threshold. Which change is MOST likely to reduce latency?

A.Replace the model with a simpler model, such as logistic regression

B.Use a larger instance type, such as ml.c5.4xlarge

C.Switch to batch transform for inference

D.Enable automatic scaling on the endpoint

Answer: A

A simpler model has lower inference latency, meeting the 100 ms requirement.

Why this answer

Option A is correct because replacing the gradient boosting model with a simpler model like logistic regression reduces the computational complexity per inference. Gradient boosting involves traversing many decision trees, each requiring multiple conditional checks and arithmetic operations, while logistic regression is a single linear transformation. This directly lowers CPU utilization per request, reducing latency under the same instance resources.

Exam trap

The trap here is that candidates often assume scaling up instance size or adding automatic scaling will fix latency, but latency is a per-request metric that depends on model complexity, not just infrastructure parallelism or throughput.

How to eliminate wrong answers

Option B is wrong because using a larger instance type (ml.c5.4xlarge) increases available vCPUs and memory, but the bottleneck is likely per-request computation time, not parallelism; a larger instance may improve throughput but does not guarantee per-request latency drops below 100 ms if the model itself is computationally heavy. Option C is wrong because batch transform is designed for offline, asynchronous inference on large datasets, not real-time low-latency serving; switching to batch transform would increase latency dramatically (minutes vs milliseconds) and violate the real-time requirement. Option D is wrong because automatic scaling adjusts the number of instances based on traffic, which helps with throughput under varying load but does not reduce the per-request latency of a single inference; scaling adds more instances behind the endpoint, but each individual request still faces the same model computation time.

1663

MCQ · hard

A company is using Amazon Redshift for data warehousing. The data engineering team notices that queries are slow and the system is frequently writing to disk due to insufficient memory. Which type of workload management (WLM) configuration change would help reduce disk writes?

A.Increase the number of query concurrency slots.

B.Increase the memory percentage allocated to the WLM queue.

C.Enable query monitoring rules to abort queries that spill to disk.

D.Enable short query acceleration (SQA).

Answer: B

More memory per query reduces disk spill.

Why this answer

If queries spill to disk, they need more memory. Increasing the memory percentage allocated to the queue reduces disk spills. Option A is wrong because concurrency slots increase parallelism but may reduce memory per query.

Option C is wrong because query monitoring rules only flag issues, not fix them. Option D is wrong because short query acceleration is for short queries, not memory issues.

1664

MCQ · easy

A data scientist needs to evaluate a binary classification model. The dataset is balanced. Which metric is most appropriate to compare model performance?

A.Recall

B.F1 score

C.Precision

D.Accuracy

Answer: D

For balanced classes, accuracy is a straightforward metric.

Why this answer

For a balanced binary classification dataset, accuracy is the most appropriate metric because it directly measures the proportion of correct predictions (true positives and true negatives) out of all predictions. Since the class distribution is equal, accuracy is not misleadingly high due to class imbalance, making it a reliable and straightforward measure of overall model performance.

Exam trap

AWS often tests the misconception that F1 score or precision-recall metrics are always superior, but for balanced datasets, accuracy is the simplest and most appropriate metric, and candidates may overlook this by defaulting to imbalance-focused metrics.

How to eliminate wrong answers

Option A is wrong because recall focuses only on true positives relative to actual positives, ignoring true negatives and thus not capturing overall performance on a balanced dataset. Option B is wrong because the F1 score is the harmonic mean of precision and recall, which is more useful when there is class imbalance; for a balanced dataset, accuracy is simpler and equally informative. Option C is wrong because precision only considers true positives relative to predicted positives, neglecting true negatives and overall correctness, which is insufficient for balanced data.
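
The relationship between these metrics is easy to see from a confusion matrix. The following sketch is a plain-Python illustration, not tied to any SageMaker API; the function name `binary_metrics` and the example counts are invented for this purpose.

```python
def binary_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Balanced data (50 positives, 50 negatives): the metrics agree.
print(binary_metrics(tp=40, fp=10, fn=10, tn=40))

# Imbalanced data (5 positives, 95 negatives) with an "always negative" model:
# accuracy looks high while recall and F1 collapse to zero.
print(binary_metrics(tp=0, fp=0, fn=5, tn=95))
```

On the balanced example all four metrics tell the same story, so accuracy is sufficient; the second example shows why that stops being true once the classes are skewed.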

1665

MCQ · medium

A data engineer needs to design a data pipeline that ingests CSV files from an SFTP server daily, transforms them, and loads them into Amazon Redshift. The files are typically 2-3 GB. Which combination of AWS services is MOST appropriate?

A.Use AWS Glue ETL with a JDBC connection to the SFTP server to read files directly.

B.Use AWS Lambda to download the files from SFTP, transform them in memory, and write to Redshift using the Data API.

C.Use AWS Transfer Family to automate SFTP file retrieval to S3, then use Redshift COPY to load data.

D.Use Amazon Kinesis Data Firehose with an HTTP endpoint source to receive files from SFTP.

Answer: C

Transfer Family handles SFTP natively, and COPY loads data efficiently into Redshift.

Why this answer

Option C is correct because AWS Transfer Family provides SFTP integration, and the Amazon Redshift COPY command efficiently loads large files from S3. Option A is wrong because Glue ETL can read from S3 but does not directly connect to SFTP. Option B is wrong because Lambda has a 15-minute timeout and a 6 MB invocation payload limit. Option D is wrong because Kinesis is for streaming, not batch file transfers.

1666

MCQ · medium

A company is deploying a real-time inference endpoint using Amazon SageMaker. The model is a large deep learning model that requires low latency. The team is concerned about cost. Which SageMaker hosting option should the team use?

A.Use a SageMaker batch transform job.

B.Use a SageMaker Serverless Inference endpoint.

C.Use a single-instance endpoint with a large instance type.

D.Use a SageMaker multi-model endpoint.

Answer: D

Multi-model endpoints share resources and reduce cost per model.

Why this answer

Option D is correct because multi-model endpoints share resources and reduce cost. Option A is wrong because batch transform is not real-time. Option B is wrong because serverless endpoints can have cold starts. Option C is wrong because a single-instance endpoint is better suited to testing.

1667

MCQ · easy

An ML team wants to perform batch inference on a large dataset stored in Amazon S3 using a pre-trained model. The team needs to process the data in parallel across multiple instances to reduce processing time. Which approach should they use?

A.Use SageMaker Processing to run a custom inference script.

B.Use SageMaker Batch Transform with multiple instances.

C.Use SageMaker Training to run inference as a training job.

D.Use SageMaker Ground Truth to process the data.

Answer: B

Batch Transform splits the input data and runs inference in parallel.

Why this answer

SageMaker Batch Transform is designed for batch inference, automatically distributing the dataset across instances for parallel processing. SageMaker Processing (A) is for data preprocessing, not inference. SageMaker Training (C) is for training models. SageMaker Ground Truth (D) is for labeling.

1668

Multi-Select · medium

Which TWO options are best practices for training machine learning models using SageMaker? (Choose TWO.)

Select 2 answers

A.Train the final model on the combined training and test sets to maximize data usage

B.Use incremental training when you have new data that is similar to the original training data

C.Use SageMaker Managed Spot Training to reduce training costs

D.Always use the largest possible instance type to minimize training time

E.Always enable checkpointing to save the model after every epoch

Answers: B, C

Incremental training saves time by starting from an existing model.

Why this answer

Option B is correct because SageMaker's incremental training allows you to continue training an existing model with new data that shares the same schema and feature space, without retraining from scratch. This is a best practice when you have a steady stream of similar data, as it saves time and compute resources while preserving previously learned patterns.

Exam trap

AWS often tests the misconception that 'more data is always better' (Option A) or that 'bigger instances are always faster' (Option D), when in reality best practices prioritize data integrity, cost efficiency, and appropriate resource scaling.

1669

MCQ · medium

A team uses SageMaker to train a deep learning model. They notice the training job is using only a fraction of the GPU memory. Which configuration change would most improve GPU utilization?

A.Increase the batch size in the training script

B.Decrease the batch size to reduce memory fragmentation

C.Use a single GPU instead of multiple GPUs

D.Enable SageMaker Managed Spot Training

Answer: A

Larger batch sizes consume more GPU memory and improve utilization.

Why this answer

Option A is correct: increasing batch size uses more GPU memory and improves utilization. Option B (reducing batch size) would decrease utilization. Option C (using single GPU) would not help.

Option D (spot training) does not affect utilization.

1670

MCQ · medium

A company is using Amazon SageMaker to train a model. The training job is using a large dataset stored in S3. The data scientist notices that the training job is spending a significant amount of time reading data from S3. Which approach would BEST reduce data loading time?

A.Use the Pipe mode input for the training data

B.Use the File mode input with a larger instance

C.Use a larger training instance with more CPU

D.Increase the batch size to reduce the number of batches

Answer: A

Pipe mode streams data directly, reducing I/O.

Why this answer

Pipe mode streams data directly from S3 into the training algorithm without first downloading it to the training instance's local storage. This eliminates the I/O bottleneck of writing large datasets to disk, significantly reducing data loading time compared to File mode, which downloads the entire dataset before training begins.

Exam trap

The trap here is that candidates often confuse 'batch size' with data loading performance, or assume that more CPU/instance size will speed up S3 reads, when in fact the bottleneck is the network and disk I/O, not compute.

How to eliminate wrong answers

Option B is wrong because File mode requires the entire dataset to be downloaded to the instance's local disk before training starts, which adds significant latency and does not address the root cause of slow S3 reads. Option C is wrong because a larger instance with more CPU does not reduce the time spent reading data from S3; the bottleneck is network I/O and S3 request latency, not compute capacity. Option D is wrong because increasing batch size only affects the number of forward/backward passes per epoch, not the time spent loading data from S3; the data must still be read in its entirety.

1671

MCQ · hard

A data scientist is using SageMaker to train an XGBoost model for regression. The training data contains categorical features with high cardinality (e.g., zip code with over 10,000 unique values). Which feature engineering approach is MOST appropriate to avoid overfitting while preserving predictive power?

A.Use target encoding with smoothing

B.One-hot encode the categorical features

C.Apply frequency encoding based on category occurrence

D.Label encode the categorical features

Answer: A

Target encoding captures category-target relationship with regularization to avoid overfitting.

Why this answer

Target encoding with smoothing is the most appropriate approach because it replaces each high-cardinality category with the mean of the target variable for that category, regularized by a smoothing factor that pulls estimates toward the global mean. This preserves predictive power by capturing the relationship between the category and the target while preventing overfitting on rare categories that have few samples. In SageMaker XGBoost, this avoids the curse of dimensionality from one-hot encoding and the arbitrary ordering from label encoding.

Exam trap

The trap here is that candidates often default to one-hot encoding for categorical features, not realizing that high cardinality makes it computationally infeasible and prone to overfitting, while target encoding with smoothing offers a compact and powerful alternative.

How to eliminate wrong answers

Option B is wrong because one-hot encoding a feature with over 10,000 unique values would create over 10,000 binary columns, drastically increasing dimensionality and memory usage, which leads to overfitting and poor generalization in tree-based models like XGBoost. Option C is wrong because frequency encoding replaces categories with their occurrence counts, which loses the relationship between the category and the target variable, often reducing predictive power and introducing bias toward frequent categories. Option D is wrong because label encoding assigns arbitrary integer labels to categories, which implies an ordinal relationship that does not exist, misleading the XGBoost model into treating the feature as ordered and potentially causing poor splits.
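
For readers who have not seen smoothed target encoding written out, a minimal sketch follows. It assumes the common additive-smoothing form, encoded value = (category sum + m × global mean) / (category count + m); the function name `smoothed_target_encoding`, the smoothing weight `m`, and the sample zip codes are all illustrative choices, not part of SageMaker or XGBoost.

```python
from collections import defaultdict


def smoothed_target_encoding(categories, targets, m=10.0):
    """Map each category to a target mean shrunk toward the global mean.

    m acts like a number of 'virtual' samples at the global mean: categories
    with few rows are pulled strongly toward it, frequent ones barely move.
    """
    global_mean = sum(targets) / len(targets)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for cat, y in zip(categories, targets):
        sums[cat] += y
        counts[cat] += 1
    return {cat: (sums[cat] + m * global_mean) / (counts[cat] + m)
            for cat in counts}


# Hypothetical data: one frequent zip code and one that appears only once.
zips = ["10001", "10001", "10001", "10001", "94105"]
prices = [10.0, 12.0, 11.0, 11.0, 50.0]

encoding = smoothed_target_encoding(zips, prices, m=2.0)
print(encoding)  # the single-row category is pulled well below its raw mean of 50
```

In practice the mapping should be fitted on the training split only (or out-of-fold) and then applied to validation data, otherwise the target leaks into the feature.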

1672

MCQ · hard

A machine learning engineer is deploying a model for real-time inference using Amazon SageMaker. The model is a large ensemble that requires 8 GB of memory and 4 vCPUs. The expected traffic is 100 requests per second with a 200 ms latency requirement. Which instance configuration should they choose?

A.ml.t2.medium (2 vCPU, 4 GB)

B.ml.c5.2xlarge (8 vCPU, 16 GB)

C.ml.p3.2xlarge (8 vCPU, 61 GB, GPU)

D.ml.m5.large (2 vCPU, 8 GB)

Answer: B

Adequate memory and vCPUs for the workload.

Why this answer

Option B is correct because ml.c5.2xlarge has 8 vCPUs and 16 GB memory, which is sufficient and cost-effective for the workload. Option A is wrong because ml.t2.medium is too small for the memory requirement. Option C is wrong because ml.p3.2xlarge is GPU-optimized and overkill. Option D is wrong because ml.m5.large has only 2 vCPUs and 8 GB memory, which may not handle the throughput.

1673

Multi-Select · hard

Which THREE techniques are commonly used to detect multicollinearity in a dataset during exploratory data analysis?

Select 3 answers

A.Heatmap of missing values

B.Eigenvalue analysis from PCA

C.Correlation matrix

D.Variance Inflation Factor (VIF)

E.Scatter matrix of all features

Answers: B, C, D

Near-zero eigenvalues indicate linear dependencies.

Why this answer

Options B, C, and D are correct. B: Eigenvalues from PCA can indicate multicollinearity if some are near zero. C: A correlation matrix shows pairwise correlations; high values indicate collinearity. D: The Variance Inflation Factor (VIF) quantifies how much a feature is explained by the others. Option A is incorrect because a heatmap of missing values is unrelated to multicollinearity. Option E is incorrect because a scatter matrix shows pairwise relationships but is not a multicollinearity measure.
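
As a small worked illustration of two of these checks, the sketch below computes a Pearson correlation and the VIF for the special case of exactly two predictors, where regressing one feature on the other gives R² = r² and therefore VIF = 1 / (1 − r²). The function names and sample values are hypothetical, and a real analysis with more features would fit one regression per feature, typically using a statistics library.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def vif_two_features(x, y):
    """VIF when the model has only these two predictors: 1 / (1 - r^2)."""
    r = pearson(x, y)
    return 1.0 / (1.0 - r * r)


feature_a = [1.0, 2.0, 3.0, 4.0, 5.0]
nearly_copy = [2.1, 3.9, 6.2, 7.8, 10.1]   # almost a linear function of feature_a
unrelated = [1.0, -2.0, 2.0, -2.0, 1.0]    # zero correlation with feature_a

print(vif_two_features(feature_a, nearly_copy))  # very large: strong collinearity
print(vif_two_features(feature_a, unrelated))    # 1.0: no collinearity
```

A commonly quoted rule of thumb treats a VIF above roughly 5 to 10 as a sign of problematic multicollinearity, though the exact cutoff is a convention rather than a hard rule.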

1674

MCQ · hard

A machine learning engineer is training a model using Amazon SageMaker. The training data is stored in S3 and is 10 TB. The engineer wants to use Pipe input mode to stream data from S3. Which algorithm supports Pipe mode?

A.Amazon SageMaker Linear Learner

B.Amazon SageMaker K-Means

C.Amazon SageMaker XGBoost

D.Amazon SageMaker PCA

Answer: C

Supports Pipe input mode.

Why this answer

Option C is correct because Amazon SageMaker XGBoost supports Pipe mode for streaming. Option A is wrong because Linear Learner does not support Pipe mode. Option B is wrong because K-Means does not support Pipe mode. Option D is wrong because PCA does not support Pipe mode.

1675

Multi-Select · hard

A company uses Amazon SageMaker to train a deep learning model using TensorFlow. The training job is failing with an 'OutOfMemory' error. The instance type is ml.p3.2xlarge with 16 GB GPU memory. The model has 10 million parameters. Which THREE actions should be taken to resolve the memory issue? (Choose THREE.)

Select 3 answers

A.Reduce the batch size

B.Increase the number of epochs

C.Enable mixed precision training

D.Increase the batch size

E.Use gradient accumulation

Answers: A, C, E

Smaller batch size directly reduces memory usage.

Why this answer

Options A, C, and E are correct. Reducing the batch size directly decreases memory usage. Mixed precision training reduces the memory footprint. Gradient accumulation simulates larger batch sizes without increasing memory usage. Option B is wrong because increasing the number of epochs does not affect memory per step. Option D is wrong because increasing the batch size exacerbates the memory issue.
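
Why gradient accumulation gives the same update as a large batch can be shown on a toy problem. The sketch below uses a one-parameter linear model in plain Python instead of TensorFlow, so it is only a conceptual illustration; the function names and data are invented. It accumulates size-weighted micro-batch gradients and compares the result with the full-batch gradient.

```python
def mse_grad(w, xs, ys):
    """Gradient of mean squared error for the model y ≈ w * x, with respect to w."""
    n = len(xs)
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n


def accumulated_grad(w, xs, ys, micro_batch):
    """Build the full-batch gradient from micro-batches processed one at a time.

    Only one micro-batch is 'in memory' per step; each partial gradient is
    weighted by its share of the total batch before being added.
    """
    total = len(xs)
    grad = 0.0
    for start in range(0, total, micro_batch):
        batch_x = xs[start:start + micro_batch]
        batch_y = ys[start:start + micro_batch]
        grad += mse_grad(w, batch_x, batch_y) * len(batch_x) / total
    return grad


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0 * x for x in xs]

full = mse_grad(0.5, xs, ys)
accumulated = accumulated_grad(0.5, xs, ys, micro_batch=2)
print(abs(full - accumulated) < 1e-9)  # the two gradients match
```

In a deep learning framework the same idea means running several small forward and backward passes and applying the optimizer step only after the accumulated gradient is complete, so peak activation memory is set by the micro-batch size.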

1676

MCQ · easy

A data scientist needs to run a one-time SQL query on a large dataset in S3 to create a training dataset. The query involves aggregations and joins. Which service is most suitable?

A.AWS Glue ETL job

B.Amazon Athena

C.Amazon EMR with Spark SQL

D.Amazon RDS with data loaded into it

AnswerB

Athena is serverless and allows ad-hoc SQL queries directly on S3 data.

Why this answer

Option B (Amazon Athena) is correct for serverless SQL queries on S3. Option C (Amazon EMR) is overkill for a one-time query. Option A (AWS Glue) is for ETL jobs, not ad-hoc queries.

Option D (Amazon RDS) requires moving data.

Full explanation →

1677

MCQmedium

A data scientist uses SageMaker to train a model. The training job takes 10 hours, but the team needs to reduce costs. Which approach is MOST cost-effective?

A.Enable Managed Spot Training

B.Use SageMaker Automatic Model Tuning

C.Use a larger instance type to finish faster

D.Use SageMaker Distributed Training with more instances

AnswerA

Spot instances offer significant discounts, reducing cost.

Why this answer

Spot instances can reduce costs by up to 90%, so Managed Spot Training is the most cost-effective choice. Option A is correct.

Option D increases cost. Option C may reduce time but not necessarily cost. Option B is for hyperparameter tuning.

Full explanation →

1678

MCQeasy

A team is using Amazon SageMaker to train a linear regression model on a dataset with 10 features. After training, they notice the model has high bias. Which action is MOST likely to reduce bias?

A.Increase the regularization parameter lambda

B.Add L2 regularization

C.Use a smaller training dataset

D.Add polynomial features to capture non-linear relationships

AnswerD

Adding features increases model complexity, reducing bias.

Why this answer

Option D is correct because high bias indicates underfitting, which can be reduced by adding more features or increasing model complexity. Option A increases regularization, which increases bias. Option B reduces the risk of overfitting, not bias.

Option C reduces data, potentially increasing bias.
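To make the bias argument concrete, the following sketch fits a straight line to a purely quadratic target, once on the raw feature and once on a squared feature. The data and helper names are made up for illustration; this is a plain least-squares fit, not the SageMaker Linear Learner.

```python
# Sketch: a polynomial feature removes underfitting on a quadratic target.

def fit_line(xs, ys):
    """Closed-form ordinary least squares for y = a + b * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

def mse(xs, ys, a, b):
    """Mean squared error of the fitted line on the given points."""
    return sum((a + b * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

xs = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
ys = [x ** 2 for x in xs]           # purely quadratic target

a1, b1 = fit_line(xs, ys)           # linear in x: high bias
squared = [x ** 2 for x in xs]      # added polynomial feature x^2
a2, b2 = fit_line(squared, ys)      # linear in x^2: captures the curve

mse_linear = mse(xs, ys, a1, b1)
mse_poly = mse(squared, ys, a2, b2)
print(mse_linear > mse_poly)
```

The linear model stays linear in its parameters; only the feature space changes.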

Full explanation →

1679

Multi-Selecteasy

A data scientist is working with a dataset that contains geolocation coordinates (latitude and longitude) and timestamps. The scientist wants to visualize the data to check for spatial and temporal patterns. Which TWO AWS services can be used for this visualization?

Select 2 answers

A.Amazon Comprehend

B.Amazon SageMaker Data Wrangler

C.Amazon Rekognition

D.Amazon QuickSight

E.AWS Glue

AnswersB, D

Can create scatter plots with coordinates.

Why this answer

Options B and D are correct. Amazon QuickSight supports geospatial charts and time-series. Amazon SageMaker Data Wrangler can also visualize geospatial data as scatter plots.

Option A is wrong because Amazon Comprehend is for NLP. Option C is wrong because Amazon Rekognition is for image/video analysis. Option E is wrong because AWS Glue is for ETL.

Full explanation →

1680

MCQhard

An e-commerce company uses Amazon DynamoDB as the primary data store for user sessions. They want to run analytics on historical session data using Amazon Athena. What is the recommended approach to export DynamoDB data to S3 in a format optimized for Athena?

A.Use AWS Data Pipeline to copy data to S3 as CSV

B.Use Amazon Kinesis Data Firehose to stream data from DynamoDB to S3

C.Use DynamoDB Streams with AWS Lambda to write to S3 as JSON

D.Use AWS Glue ETL to read from DynamoDB and write to S3 as Parquet

AnswerD

Glue can efficiently export data and convert to columnar format.

Why this answer

Option D is correct because AWS Glue ETL can read from DynamoDB and write to S3 in Parquet format, which is optimal for Athena. Option C is wrong because DynamoDB Streams to Lambda to S3 is complex and not optimal. Option A is wrong because Data Pipeline is legacy.

Option B is wrong because Kinesis is for streaming.

Full explanation →

1681

Multi-Selectmedium

A machine learning engineer is training a deep learning model on Amazon SageMaker. The training job is taking a long time. Which THREE actions can reduce training time? (Choose 3.)

Select 3 answers

A.Use SageMaker managed spot training

B.Use SageMaker managed warm pools to reuse the training environment

C.Use SageMaker distributed training (data parallelism)

D.Use a smaller batch size

E.Use SageMaker hyperparameter tuning jobs

AnswersA, B, C

Spot instances can reduce cost and training time if interruptions are tolerated.

Why this answer

Warm pools, distributed training, and spot training can reduce training time.

Full explanation →

1682

MCQmedium

A machine learning engineer is exploring a dataset with 50 features. Some features are highly correlated. Which technique should the engineer use to reduce dimensionality while preserving variance?

A.Principal Component Analysis (PCA)

B.Factor Analysis

C.t-Distributed Stochastic Neighbor Embedding (t-SNE)

D.Linear Discriminant Analysis (LDA)

AnswerA

PCA reduces dimensionality by finding components that maximize variance.

Why this answer

PCA (Principal Component Analysis) is the standard technique for dimensionality reduction by projecting data onto principal components that capture maximum variance. LDA is supervised and aims to separate classes. t-SNE is for visualization. Autoencoders can reduce dimensionality but are more complex.

Factor analysis assumes latent factors.

Full explanation →

1683

MCQhard

A company is using SageMaker Ground Truth to label images for a computer vision model. After launching the labeling job, they notice that the labeling throughput is lower than expected. What should they do to increase throughput?

A.Use a private workforce with more workers.

B.Change the labeling task to use a single annotator per image.

C.Reduce the number of workers assigned to each task.

D.Increase the time allowed for each labeling task.

AnswerA

More workers increase labeling parallelism and throughput.

Why this answer

Option A is correct because using a private workforce with more workers increases parallelism. Option C is incorrect because reducing the number of workers decreases throughput. Option B is incorrect because using a single annotator per task may not speed up labeling.

Option D is incorrect because increasing the time allowed per task slows down throughput.

Full explanation →

1684

Multi-Selecthard

Which THREE are valid reasons to perform feature scaling during exploratory data analysis?

Select 3 answers

A.To improve performance of distance-based algorithms like KNN.

B.To change the shape of the feature distribution.

C.To increase the number of features.

D.To ensure features have zero mean and unit variance.

E.To reduce the effect of outliers by clipping values.

AnswersA, D, E

Distance algorithms are sensitive to scale.

Full explanation →

1685

MCQeasy

A data scientist is analyzing a dataset and notices that the distribution of a continuous feature is heavily right-skewed. Which transformation is most likely to make the distribution more symmetric?

A.Log transformation (natural log)

B.Min-Max scaling

C.One-hot encoding

D.Square transformation

AnswerA

Log transformation compresses high values, reducing right skew.

Why this answer

Option A is correct because log transformation is commonly used to reduce right skewness. Option D is wrong because square transformation increases skewness. Option B is wrong because Min-Max scaling does not change the shape of the distribution.

Option C is wrong because one-hot encoding is for categorical features.
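A quick way to check the effect of a log transformation is to compare skewness before and after. The sketch below uses a made-up right-skewed sample and a hand-written population skewness (third standardized moment); it assumes strictly positive values, since the natural log is undefined at zero and below.

```python
import math

def skewness(values):
    """Population skewness: the third standardized moment."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    third_moment = sum((v - mean) ** 3 for v in values) / n
    return third_moment / variance ** 1.5

# Illustrative right-skewed sample: most values small, a long right tail.
raw = [1, 2, 2, 3, 3, 4, 5, 8, 20, 100]
logged = [math.log(v) for v in raw]

raw_skew = skewness(raw)
log_skew = skewness(logged)
print(log_skew < raw_skew)
```

Min-Max scaling applied to the same sample would leave the skewness unchanged, because it is only a shift and a positive rescale.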

Full explanation →

1686

MCQhard

A company is building a recommendation system using Amazon SageMaker Factorization Machines. The dataset includes user IDs, item IDs, and implicit feedback (clicks). The data is sparse with millions of users and items. The model needs to capture interactions between users and items. Which hyperparameter tuning strategy should be used to improve model performance?

A.Increase L2 regularization to prevent overfitting.

B.Increase the batch size to speed up training.

C.Decrease the learning rate to improve convergence.

D.Change the activation function to ReLU.

E.Increase the number of factors (num_factors) to capture more latent features.

AnswerE

More factors increase model capacity to learn interactions.

Why this answer

Option E is correct because increasing the number of factors allows the model to capture more complex interactions. Option C (learning rate) helps convergence but not specifically interaction complexity. Option B (batch size) affects speed, not capacity.

Option A (regularization) prevents overfitting but does not increase interaction capacity. Option D (activation function) is not relevant for factorization machines (linear model).
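The role of the number of factors can be seen in the standard second-order factorization machine score, where each feature i carries a latent vector v[i] of length k and pairs interact through dot products. The sketch below is a plain-Python illustration with made-up weights; it is not the SageMaker implementation, and all names are hypothetical.

```python
def fm_predict_naive(x, w0, w, v):
    """Score = w0 + sum_i w[i]*x[i] + sum_{i<j} <v[i], v[j]> * x[i]*x[j]."""
    n = len(x)
    linear = sum(w[i] * x[i] for i in range(n))
    pairwise = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            dot = sum(a * b for a, b in zip(v[i], v[j]))
            pairwise += dot * x[i] * x[j]
    return w0 + linear + pairwise

def fm_predict_fast(x, w0, w, v):
    """Same score via the O(k*n) identity:
    sum_{i<j} <v_i, v_j> x_i x_j
        = 0.5 * sum_f [(sum_i v_if x_i)^2 - sum_i (v_if x_i)^2]
    """
    n, k = len(x), len(v[0])
    linear = sum(w[i] * x[i] for i in range(n))
    pairwise = 0.0
    for f in range(k):
        s = sum(v[i][f] * x[i] for i in range(n))
        s_sq = sum((v[i][f] * x[i]) ** 2 for i in range(n))
        pairwise += 0.5 * (s * s - s_sq)
    return w0 + linear + pairwise

# Sparse one-hot input: user 0 (of 2) and item 0 (of 2) are active.
x = [1.0, 0.0, 1.0, 0.0]
w0 = 0.5
w = [0.1, 0.2, 0.3, 0.4]
v = [[0.1, 0.2], [0.0, 0.5], [0.3, -0.4], [0.7, 0.1]]  # k = 2 factors

naive = fm_predict_naive(x, w0, w, v)
fast = fm_predict_fast(x, w0, w, v)
print(abs(naive - fast) < 1e-12)
```

A larger k gives every user and item a longer latent vector, which is the extra capacity the answer refers to.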

Full explanation →

1687

MCQeasy

A data scientist wants to understand the relationship between a categorical feature with 3 levels and a continuous target variable. Which visualization is most appropriate?

A.Correlation matrix

B.Line chart

C.Box plot grouped by category

D.Scatter plot

AnswerC

Box plots compare distributions across categories.

Why this answer

A box plot grouped by category (Option C) is the most appropriate visualization because it directly compares the distribution of a continuous target variable across the three levels of a categorical feature. It displays median, quartiles, and potential outliers for each group, making it ideal for understanding central tendency, spread, and skewness in a side-by-side comparison.

Exam trap

The trap here is that candidates often confuse the purpose of a scatter plot (for two continuous variables) with the need to compare a continuous variable across categories, leading them to choose Option D instead of recognizing that a grouped box plot is the standard tool for this task.

How to eliminate wrong answers

Option A is wrong because a correlation matrix is used to quantify linear relationships between continuous variables, not between a categorical feature and a continuous target. Option B is wrong because a line chart is designed to show trends over a continuous or time-ordered axis, not to compare distributions across discrete categories. Option D is wrong because a scatter plot visualizes the relationship between two continuous variables; it cannot effectively display a categorical feature with only three levels without overplotting or requiring jittering, and it does not summarize distributional properties like median or quartiles.
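The quantities a grouped box plot draws can be computed directly, which helps when checking what the chart should show. The sketch below is a minimal stdlib version with made-up data; the function name and the choice of the "inclusive" quantile method are assumptions, and plotting libraries may use slightly different quartile conventions.

```python
import statistics
from collections import defaultdict

def grouped_box_stats(rows):
    """Return min, quartiles, and max of the target for each category.

    rows is an iterable of (category, value) pairs; each category needs
    at least two values for statistics.quantiles to work.
    """
    groups = defaultdict(list)
    for category, value in rows:
        groups[category].append(value)
    summary = {}
    for category, values in groups.items():
        q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
        summary[category] = {
            "min": min(values), "q1": q1, "median": median,
            "q3": q3, "max": max(values),
        }
    return summary

# Illustrative data: a 3-level categorical feature and a continuous target.
rows = [("low", v) for v in [1, 2, 3, 4, 5]]
rows += [("mid", v) for v in [4, 6, 8, 10, 12]]
rows += [("high", v) for v in [10, 20, 30, 40, 50]]

stats = grouped_box_stats(rows)
for level in ("low", "mid", "high"):
    print(level, stats[level])
```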

Full explanation →

1688

MCQmedium

A team is training a large language model using PyTorch on multiple GPUs. The training is taking too long due to inefficient data loading. Which AWS service can help accelerate data loading by caching data close to the GPU instances?

A.Amazon FSx for Lustre

B.Amazon EBS Snapshots for fast restore

C.Amazon S3 Transfer Acceleration

D.Amazon CloudFront

AnswerA

High-performance file system with sub-millisecond latency.

Why this answer

Option A is correct because FSx for Lustre provides a high-performance file system optimized for ML workloads. Option B is wrong because Amazon EBS snapshots are not for caching. Option C is wrong because S3 Transfer Acceleration speeds up uploads, not data loading.

Option D is wrong because CloudFront is a CDN for web content.

Full explanation →

1689

MCQeasy

A company wants to use Amazon Rekognition to detect objects in images stored in an S3 bucket. The images are uploaded by users. Which IAM policy statement is necessary to allow Rekognition to read from the bucket?

A.s3:PutObject

B.s3:DeleteObject

C.s3:GetObject

D.s3:ListBucket

AnswerC

GetObject allows Rekognition to read images.

Why this answer

Rekognition needs s3:GetObject permission (Option C) to read images. Option A (s3:PutObject) is for writing. Option D (s3:ListBucket) is for listing.

Option B (s3:DeleteObject) is for deletion.

Full explanation →

1690

MCQhard

A company uses Amazon SageMaker to train machine learning models. The data science team has developed a training script that uses TensorFlow. They want to run the training job on a GPU instance (ml.p3.2xlarge) and store the model artifact in Amazon S3. The training job completes successfully, but the model artifact is not saved to S3. The team has confirmed that the S3 bucket policy allows write access from the SageMaker execution role. The training script uses the TensorFlow estimator with the following configuration:

```
tensorflow_estimator = TensorFlow(
    entry_point='train.py',
    role='arn:aws:iam::123456789012:role/SageMakerExecutionRole',
    instance_count=1,
    instance_type='ml.p3.2xlarge',
    output_path='s3://my-bucket/output',
    framework_version='2.3',
    py_version='py37',
)
```

The train.py script saves the model using `model.save('/opt/ml/model')`. What is the MOST likely reason the model artifact is not being saved to S3?

A.The training script must save the model to /opt/ml/model/saved_model instead of /opt/ml/model.

B.The SageMaker execution role does not have the s3:PutObject permission for the S3 bucket.

C.The output_path parameter is incorrectly formatted; it should include a trailing slash.

D.The TensorFlow estimator requires the model_dir parameter to be set to the S3 output path.

AnswerB

Correct: The role needs s3:PutObject to write to S3.

Why this answer

The TensorFlow estimator's output_path specifies where the model artifact should be uploaded after training. SageMaker automatically uploads the contents of /opt/ml/model to S3 at the end of training, and the script is saving to the correct directory.

Because the job completes successfully and the script writes to the right location, the most likely cause is that the SageMaker execution role does not have permission to write to the S3 bucket. The bucket policy allows write access, but the IAM role itself may lack the necessary S3 permissions.

Option B is correct because the role needs s3:PutObject permission on the bucket. Option C is incorrect because the output_path is correctly specified. Option A is incorrect because the script saves to the right directory.

Option D is incorrect because the estimator does not have a 'model_dir' parameter that overrides the default; the default is /opt/ml/model.

Full explanation →

1691

MCQhard

Refer to the exhibit. A data engineer has attached this IAM policy to an IAM role used by an AWS Glue ETL job. The job reads from an S3 bucket (data-bucket) that is encrypted with SSE-KMS using the key arn:aws:kms:us-east-1:123456789012:key/abc123, transforms the data, and writes the result to a different S3 bucket (output-bucket) encrypted with a different KMS key (arn:aws:kms:us-east-1:123456789012:key/xyz789). When the job runs, it fails with an access denied error. What is the cause?

A.The policy does not include s3:GetObject permission for the output bucket.

B.The policy does not include glue:CreateTable permission.

C.The policy does not include s3:PutObject permission for the output bucket.

D.The policy does not grant kms:Encrypt permission for the output bucket's KMS key.

AnswerD

To write to an SSE-KMS encrypted bucket, the role needs kms:Encrypt or kms:GenerateDataKey for that key.

Why this answer

Option D is correct because the policy grants kms:Decrypt and kms:GenerateDataKey for the input bucket's key, but does not grant the equivalent permissions for the output bucket's key (xyz789). The job needs to encrypt the output, so it needs kms:Encrypt (or GenerateDataKey) for the output key. Option C is wrong because the policy includes s3:PutObject.

Option B is wrong because the Glue catalog permissions are sufficient. Option A is wrong because the policy includes s3:GetObject.
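The missing grant can be pictured as one extra policy statement scoped to the output key. The sketch below builds such a statement as a Python dictionary and prints it as JSON. The exhibit policy is not reproduced here, so the surrounding structure is an assumption; only the key ARN and the two KMS action names come from the question and explanation.

```python
import json

# Key ARN of the output bucket's KMS key, as given in the question.
OUTPUT_KEY_ARN = "arn:aws:kms:us-east-1:123456789012:key/xyz789"

# Hypothetical statement to add to the Glue job role's policy.
output_key_statement = {
    "Effect": "Allow",
    "Action": ["kms:Encrypt", "kms:GenerateDataKey"],
    "Resource": OUTPUT_KEY_ARN,
}

# Wrapped in a minimal policy document for illustration only.
policy_fragment = {
    "Version": "2012-10-17",
    "Statement": [output_key_statement],
}

print(json.dumps(policy_fragment, indent=2))
```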

Full explanation →

1692

Multi-Selectmedium

A data scientist is training a deep learning model on SageMaker using a custom container. The training job fails with an 'OutOfMemory' error. Which THREE actions could resolve this issue? (Choose 3.)

Select 3 answers

A.Use gradient accumulation to simulate larger batch sizes.

B.Reduce the number of training epochs.

C.Reduce the batch size.

D.Use an instance type with more memory, such as ml.p3.16xlarge.

E.Increase the learning rate.

AnswersA, C, D

Gradient accumulation allows training with effectively larger batch without memory increase.

Why this answer

OutOfMemory can be resolved by reducing batch size, using gradient accumulation, or using a larger instance. Option B (reducing epochs) does not affect memory usage per batch. Option E (increasing learning rate) might cause instability but does not directly affect memory.

Full explanation →

1693

MCQhard

A team wants to automate the retraining of a model weekly using new data that arrives in S3. Which combination of services should they use?

A.AWS Lambda and S3 events

B.Amazon SageMaker Processing jobs

C.AWS Step Functions and AWS Glue

D.Amazon SageMaker Pipelines and S3 events

AnswerD

SageMaker Pipelines is designed for ML workflows and can be triggered by S3 events.

Why this answer

Option D is correct because SageMaker Pipelines provides a managed workflow for training and retraining. Option C is wrong because Step Functions alone is not ML-specific. Option A is wrong because Lambda is not designed for long-running training.

Option B is wrong because SageMaker Processing is for data processing, not full pipeline automation.

Full explanation →

1694

MCQmedium

A data engineering team needs to move 10 TB of historical data from an on-premises Hadoop cluster to Amazon S3. The data is currently stored in HDFS. Which service should they use for an efficient transfer?

A.AWS DataSync

B.S3 Transfer Acceleration

C.Amazon Kinesis Data Streams

D.AWS Snowball Edge

AnswerA

DataSync can transfer data from HDFS to S3.

Why this answer

AWS DataSync can transfer data from on-premises HDFS to S3 efficiently, with built-in encryption and validation. Option B is wrong because S3 Transfer Acceleration speeds up transfers over the network but does not read from HDFS. Option D is wrong because Snowball Edge is for offline transfer.

Option C is wrong because Kinesis is for streaming, not bulk historical data.

Full explanation →

1695

MCQmedium

A data scientist is troubleshooting access to an S3 bucket. The above IAM policy is attached to their role. What is the likely result when they try to list objects in the 'confidential' folder?

A.Access is allowed because the Allow statement grants s3:ListBucket.

B.Access is denied unconditionally.

C.Access is allowed only if the request uses HTTPS.

D.Access is denied if the request does not originate from the specified VPC endpoint.

AnswerD

The condition requires the request to come from vpce-12345678 to allow access.

Why this answer

Option D is correct because the Deny statement explicitly denies s3:* actions on the confidential folder unless the request comes from a specific VPC endpoint. If the request is from outside that endpoint or from a different VPC, it will be denied. Option A is wrong because the Deny overrides the Allow.

Option B is wrong because the Deny is conditional. Option C is wrong because the condition restricts access by VPC endpoint, not by whether the request uses HTTPS.

Full explanation →

1696

Multi-Selectmedium

A data scientist is training a model using Amazon SageMaker and wants to reduce the training time. The training job uses a single GPU instance. Which THREE actions can reduce training time?

Select 3 answers

A.Use distributed training across multiple GPU instances.

B.Use Pipe input mode instead of File input mode.

C.Use a larger instance type with more GPU memory and compute.

D.Increase the amount of training data.

E.Reduce the batch size.

AnswersA, B, C

Distributed training parallelizes the workload.

Why this answer

Options A, B, and C are correct. Using multiple GPUs (distributed training), using Pipe input mode, and using a larger instance type with more compute power can reduce training time. Option D is wrong because using more training data increases training time.

Option E is wrong because reducing batch size can actually increase training time due to more iterations.

Full explanation →

1697

MCQmedium

Refer to the exhibit. A company is using an IAM role with the attached policy to deploy a SageMaker model. The data scientist can create training jobs and models, but when trying to create an endpoint, they receive an access denied error. What is the missing permission?

A.cloudwatch:PutMetricData

B.iam:PassRole

C.ec2:CreateNetworkInterface

D.sagemaker:InvokeEndpoint

E.kms:Decrypt

AnswerC

SageMaker creates an ENI in the VPC for the endpoint.

Why this answer

Option C is correct because to create an endpoint, SageMaker needs permission to call ec2:CreateNetworkInterface to set up the elastic network interface in the customer's VPC. Option D (sagemaker:InvokeEndpoint) is for invoking, not creating. Option B (iam:PassRole) is needed to pass the execution role to SageMaker, but the error is about creating the endpoint, not passing the role.

Option E (kms:Decrypt) is for encrypted data. Option A (cloudwatch:PutMetricData) is for publishing metrics.

Full explanation →

1698

MCQhard

Refer to the exhibit. A custom training job using Pipe input mode fails. The logs indicate the algorithm cannot read the data. What is the most likely issue?

A.The algorithm expects File mode but Pipe mode is specified

B.The instance type is too small for the data

C.The training data is compressed

D.The training image is not accessible

AnswerA

Pipe mode sends data via pipe; algorithms expecting files will fail.

Why this answer

Pipe mode streams data from S3, but the algorithm must be designed to read sequentially from a named pipe (FIFO) rather than from a regular file. Many custom algorithms expect files. Option D is wrong because the image path is correct.

Option B is wrong because instance type is not the cause. Option C is wrong because compression is none.
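What "designed to read from a pipe" means in practice is that the code consumes the channel strictly sequentially, without seeking or asking for the file size. The sketch below demonstrates this locally with a POSIX FIFO and a producer thread. The file name training_0 only mimics a channel-plus-epoch naming pattern and is an assumption, as are the helper names; nothing here calls SageMaker.

```python
import os
import tempfile
import threading

def read_stream(path, chunk_size=4):
    """Read a FIFO (or regular file) sequentially in fixed-size chunks."""
    chunks = []
    with open(path, "rb") as stream:
        while True:
            chunk = stream.read(chunk_size)
            if not chunk:          # empty read means the writer closed the pipe
                break
            chunks.append(chunk)
    return b"".join(chunks)

def demo():
    """Create a FIFO, feed it from a thread, and read it back (POSIX only)."""
    with tempfile.TemporaryDirectory() as tmp:
        fifo_path = os.path.join(tmp, "training_0")  # hypothetical channel name
        os.mkfifo(fifo_path)

        def producer():
            # Stands in for the platform streaming records into the pipe.
            with open(fifo_path, "wb") as out:
                out.write(b"row1\nrow2\nrow3\n")

        writer = threading.Thread(target=producer)
        writer.start()
        data = read_stream(fifo_path)
        writer.join()
        return data

received = demo()
print(received.decode().splitlines())
```

Code that calls seek(), lists a directory, or reopens the same path for a second pass would work in File mode but fail or hang against a pipe.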

Full explanation →

1699

MCQhard

A data scientist is performing EDA on a dataset with 1,000 features and 10,000 rows. The target variable is binary. After checking for multicollinearity, the scientist finds many pairs of features with correlation > 0.95. Which action should be taken to prepare the data for modeling?

A.Apply PCA to all features to decorrelate them.

B.Standardize all features using StandardScaler.

C.For each highly correlated pair, remove one feature based on domain knowledge or higher correlation with target.

D.Randomly drop half of the correlated features.

AnswerC

This reduces redundancy while retaining predictive power.

Why this answer

Option C is correct because when features are highly correlated (e.g., > 0.95), they introduce multicollinearity, which can destabilize coefficient estimates in linear models and reduce interpretability. Removing one feature from each correlated pair based on domain knowledge or its correlation with the target variable preserves predictive power while reducing redundancy. This approach is more targeted than PCA, which transforms features into uncorrelated components but sacrifices interpretability and may not align with the binary target.

Exam trap

The exam often tests the misconception that PCA is the default solution for multicollinearity, but the trap here is that PCA transforms features into uninterpretable components, whereas removing correlated features directly preserves the original feature space and domain relevance.

How to eliminate wrong answers

Option A is wrong because PCA decorrelates features by projecting them onto orthogonal components, but it does not remove features—it creates new synthetic features that are linear combinations of the originals, losing interpretability and potentially discarding target-specific information. Option B is wrong because standardizing features (e.g., using StandardScaler) only scales them to zero mean and unit variance, which does not address multicollinearity; it is a preprocessing step for algorithms sensitive to feature scales, not a remedy for correlated features. Option D is wrong because randomly dropping half of the correlated features ignores the relationship between features and the target variable, which can discard informative predictors and degrade model performance; a principled selection based on domain knowledge or target correlation is required.
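The selection rule in Option C can be sketched as a small pruning loop: for every pair above the threshold, keep the feature more correlated with the target. The version below is a stdlib illustration with made-up data and hypothetical function names; it assumes no constant columns (which would make the correlation undefined) and does not model the domain-knowledge part of the decision.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)

def prune_correlated(features, target, threshold=0.95):
    """Drop one feature from each pair whose |correlation| exceeds threshold.

    features maps a feature name to its list of values. Within a pair,
    the feature with the weaker |correlation with the target| is dropped.
    """
    names = list(features)
    kept = list(names)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if a not in kept or b not in kept:
                continue  # one of the pair was already removed
            if abs(pearson(features[a], features[b])) > threshold:
                weaker = min((a, b), key=lambda f: abs(pearson(features[f], target)))
                kept.remove(weaker)
    return kept

features = {
    "f1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "f2": [1.1, 2.0, 3.2, 3.9, 5.1, 6.0],   # near-duplicate of f1
    "f3": [2.0, 9.0, 1.0, 7.0, 3.0, 5.0],   # unrelated to f1 and f2
}
target = [0, 0, 0, 1, 1, 1]                 # binary target

selected = prune_correlated(features, target)
print(selected)
```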

Full explanation →

1700

MCQmedium

A company wants to use Amazon SageMaker to train a model using a custom algorithm packaged in a Docker container. Which approach should they use?

A.Use SageMaker Ground Truth

B.Use SageMaker Autopilot

C.Use the SageMaker SDK to create an Estimator with the image URI of the custom container

D.Select one of the built-in algorithms in SageMaker

AnswerC

The Estimator can accept a custom Docker image for training.

Why this answer

SageMaker supports bring-your-own-container for custom algorithms, so Option C is correct. Option D is wrong because SageMaker built-in algorithms are predefined. Option B is wrong because SageMaker Autopilot automates model selection, not custom containers.

Option A is wrong because SageMaker Ground Truth is for labeling.

Full explanation →

1701

MCQmedium

A company is deploying a fraud detection model using Amazon SageMaker. The model is a linear learner trained on 100 GB of data. For inference, the model receives individual transactions and must return a prediction within 100 ms. Which endpoint configuration should the team use to meet the latency requirement?

A.Use a multi-model endpoint with CPU instances.

B.Deploy a single model endpoint using a GPU instance and enable autoscaling.

C.Use a batch transform job scheduled every minute.

D.Deploy using SageMaker Serverless Inference.

AnswerB

GPU instance can process individual transactions fast, autoscaling handles traffic.

Why this answer

Option B is correct because a single-model endpoint on a GPU instance provides the low-latency, high-throughput inference required for real-time fraud detection. GPU instances accelerate linear learner inference by parallelizing matrix operations, enabling sub-100 ms predictions for individual transactions. Autoscaling ensures the endpoint can handle traffic spikes without degrading latency.

Exam trap

The trap here is that candidates often choose multi-model endpoints (Option A) thinking they reduce cost, but they overlook the cold-start latency penalty for large models, which violates the strict 100 ms requirement.

How to eliminate wrong answers

Option A is wrong because multi-model endpoints share a single container and load models on demand, which adds cold-start latency that can exceed 100 ms for individual transactions. Option C is wrong because batch transform jobs are designed for offline, asynchronous processing of large datasets, not real-time inference with a 100 ms latency requirement. Option D is wrong because SageMaker Serverless Inference has a maximum concurrency limit and cold-start latency that can exceed 100 ms, making it unsuitable for sub-100 ms real-time predictions.

Full explanation →

1702

Multi-Selectmedium

A data science team is deploying a machine learning model using Amazon SageMaker. The model requires GPU inference and must handle variable traffic with low latency. Which TWO options should the team implement to meet these requirements? (Choose TWO.)

Select 2 answers

A.Use a SageMaker multi-model endpoint with a GPU instance to serve multiple models.

B.Deploy to a SageMaker real-time endpoint using a CPU instance and attach an Elastic Inference accelerator.

C.Use AWS Lambda with an attached GPU function for inference.

D.Host the model on a SageMaker batch transform job with GPU instances.

E.Deploy the model to a SageMaker real-time endpoint using a GPU instance type.

AnswersA, E

Correct: Multi-model endpoint on GPU provides GPU inference and efficient resource utilization for variable traffic.

Why this answer

Option E (SageMaker real-time endpoint with GPU instance) ensures GPU inference and low latency for real-time predictions. Option A (SageMaker multi-model endpoint) reduces cost by sharing a GPU instance across multiple models, which is efficient for variable traffic. The other options either do not run inference on a GPU instance (B, C) or are not suitable for real-time low-latency inference (D).

Full explanation →

1703

MCQeasy

A healthcare company needs to predict patient readmission risk using clinical notes. Which AWS service can be used to preprocess the text data into numerical features for a machine learning model?

A.Amazon SageMaker Ground Truth

B.Amazon Comprehend

C.Amazon Translate

D.Amazon Rekognition

AnswerB

Comprehend provides NLP capabilities for text feature extraction.

Why this answer

Amazon Comprehend is a natural language processing (NLP) service that can extract entities, key phrases, and sentiment. It is suitable for preprocessing clinical notes into features. SageMaker Ground Truth is for data labeling.

Rekognition is for images. Translate is for translation.

Full explanation →

1704

Multi-Selectmedium

Which TWO of the following are valid approaches to handle missing values in a dataset for a machine learning model?

Select 2 answers

A.Use a neural network to predict missing values

B.Impute missing values with the mean of the column

C.Remove rows with missing values

D.Standardize the features to handle missing values

E.Apply one-hot encoding to convert missing values

AnswersB, C

Mean imputation is a standard technique for numerical features.

Why this answer

Removing rows with missing values is a valid approach (listwise deletion). Imputing with the mean is also valid. Using a neural network to predict missing values is possible but not standard.

Standardization does not handle missing values. One-hot encoding is for categorical variables.
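The two accepted approaches can be sketched in a few lines. The example below represents missing values as None in a small made-up table; the function names are illustrative, and the imputation assumes every column has at least one observed value.

```python
def drop_missing_rows(rows):
    """Listwise deletion: keep only rows with no missing values."""
    return [row for row in rows if None not in row]

def impute_mean(rows):
    """Replace each missing value with the mean of its column."""
    n_cols = len(rows[0])
    means = []
    for c in range(n_cols):
        observed = [row[c] for row in rows if row[c] is not None]
        means.append(sum(observed) / len(observed))
    return [
        [means[c] if row[c] is None else row[c] for c in range(n_cols)]
        for row in rows
    ]

rows = [
    [1.0, 10.0],
    [None, 20.0],   # first column missing
    [3.0, None],    # second column missing
]

complete_only = drop_missing_rows(rows)
imputed = impute_mean(rows)
print(complete_only)
print(imputed)
```

Deletion keeps one of the three rows here, which illustrates the usual trade-off: it is simple but can discard a large share of the data.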

Full explanation →

1705

Drag & Dropmedium

Drag and drop the steps to perform hyperparameter tuning using SageMaker Automatic Model Tuning in the correct order.

Why this order

Tuning involves defining search space, creating a tuning job, setting limits, executing, and selecting best model.

Full explanation →

1706

MCQhard

Refer to the exhibit. A team deploys this CloudFormation stack. The Kinesis stream is created, but the Firehose delivery stream fails to create with a 'Resource handler returned message: Unable to assume role' error. What is the most likely cause?

A.The shard count of 2 is too low for Firehose to read from.

B.The Kinesis stream is encrypted with KMS, but Firehose does not have permission to decrypt.

C.The retention period of 168 hours is too long for Firehose.

D.The IAM role 'firehose-role' does not exist in the AWS account.

AnswerD

Correct: The role ARN is hardcoded; if the role is missing, Firehose cannot assume it.

Why this answer

The Firehose role 'firehose-role' is specified with a static ARN that includes an account ID. If the stack is deployed in a different account, or if the role does not exist, the assumption fails. Option D (role does not exist in the account) is the most likely.

Option B (stream encryption) is not related. Option C (retention period) is fine. Option A (shard count) is fine.

Full explanation →

1707

MCQmedium

A data scientist is using Amazon SageMaker to train a model on a large dataset (10 TB) stored in S3 in Parquet format. The training job uses an ml.p3.16xlarge instance with multiple GPUs. The data scientist notices that the GPU utilization is low (around 30%) and the training is slow. The dataset consists of hundreds of thousands of small Parquet files. The data scientist suspects that the I/O is bottlenecked. What should the data scientist do to improve GPU utilization and training speed?

A.Increase the batch size

B.Consolidate the small Parquet files into larger files (e.g., 1 GB each)

C.Use a smaller instance type to reduce cost

D.Use Pipe input mode to stream data directly

AnswerB

Larger files reduce I/O overhead.

Why this answer

Option B is correct because consolidating small files into larger files reduces the overhead of reading many files from S3, improving I/O throughput and keeping GPUs busy. Option D (use Pipe mode) may help but does not address the file size issue. Option A (increase batch size) may improve utilization but the I/O bottleneck remains.

Option C (use a smaller instance) would not improve speed.
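A back-of-the-envelope model shows why file count matters: every object read pays a fixed request overhead on top of the transfer time. The sketch below uses a deliberately crude serial model, and the overhead, throughput, and file count are hypothetical values chosen for illustration, not measured S3 figures.

```python
def estimated_read_seconds(total_bytes, file_count,
                           per_request_overhead_s, throughput_bytes_per_s):
    """Serial model: fixed overhead per file plus raw transfer time."""
    return (file_count * per_request_overhead_s
            + total_bytes / throughput_bytes_per_s)

GIB = 1024 ** 3
total_bytes = 10 * 1024 ** 4          # 10 TB dataset from the question

# Hypothetical parameters for illustration only.
overhead_s = 0.05                     # assumed fixed cost per object request
throughput = 1 * GIB                  # assumed sustained bytes per second

small_files_s = estimated_read_seconds(total_bytes, 500_000, overhead_s, throughput)
large_files_s = estimated_read_seconds(total_bytes, total_bytes // GIB, overhead_s, throughput)

print(round(small_files_s), round(large_files_s))
```

Real loaders issue requests in parallel, so the absolute numbers are not meaningful; the point is only that the overhead term scales with the number of files while the transfer term does not.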

Full explanation →

1708

MCQeasy

A data engineer is building a data pipeline to process user clickstream data. The data arrives as JSON files in an S3 bucket. The pipeline must transform the JSON into Parquet format and partition by date and event type, then make the data available for Amazon Athena queries. The engineer needs a fully managed, serverless solution with minimal operational overhead. Which combination of AWS services should the engineer use?

A.Use Amazon EMR with Spark to read JSON, convert to Parquet, and partition, then query with Athena.

B.Use AWS Glue ETL jobs to read JSON from S3, transform to Parquet, and write to a partitioned S3 location, then use Athena.

C.Use S3 Event Notifications to trigger an AWS Lambda function that converts the JSON to Parquet and writes to a partitioned S3 location, then query with Athena.

D.Use Amazon Kinesis Firehose to ingest data and convert to Parquet, then write to S3, and query with Athena.

AnswerC

Lambda is serverless, cost-effective for per-file processing, and can partition output easily.

Why this answer

Option C is correct because AWS Lambda triggered by S3 Event Notifications provides a fully serverless, event-driven architecture with minimal operational overhead for converting JSON to Parquet and partitioning by date and event type. Lambda can process each new JSON file as it arrives, perform the transformation in memory (using libraries like PyArrow or Pandas), and write the Parquet output to a partitioned S3 path, which Athena can then query directly. This approach avoids managing any clusters or job scheduling, aligning with the requirement for a fully managed, serverless solution.

Exam trap

AWS exams often test the misconception that AWS Glue is the only serverless ETL option, but the trap here is that Lambda with S3 Event Notifications is a simpler, fully serverless alternative for file-based transformations when the workload fits within Lambda's constraints.

How to eliminate wrong answers

Option A is wrong because Amazon EMR with Spark requires provisioning and managing a cluster (even if ephemeral), incurring operational overhead and not being fully serverless; it also introduces complexity for a simple transformation task. Option B is wrong because AWS Glue ETL jobs, while serverless, involve job scheduling, startup latency, and cost for each job run, and are overkill for a real-time, event-driven pipeline where Lambda can handle the transformation more efficiently with lower latency and cost. Option D is wrong because Amazon Kinesis Firehose is designed for streaming data ingestion, not for batch processing of existing JSON files in S3; it cannot be triggered by S3 events to process files already stored, and its Parquet conversion is limited to the Firehose delivery stream, not arbitrary file transformations.
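The partitioned S3 path mentioned above can be sketched as a Hive-style key, which is the layout Athena reads as partitions. This is a minimal sketch: the function name `partitioned_key`, the `clickstream` prefix, and the file name are assumptions for illustration, and the Parquet conversion itself is not shown.

```python
from datetime import datetime, timezone


def partitioned_key(prefix, event_type, event_time, file_name):
    """Build a Hive-style S3 key partitioned by date and event type.

    event_time is expected to be a timezone-aware datetime.
    """
    day = event_time.astimezone(timezone.utc).strftime("%Y-%m-%d")
    return f"{prefix}/date={day}/event_type={event_type}/{file_name}"


# Hypothetical event written by the transformation step.
key = partitioned_key(
    "clickstream",
    "purchase",
    datetime(2024, 1, 15, 23, 30, tzinfo=timezone.utc),
    "part-0001.parquet",
)
print(key)
```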

Full explanation →

1709

MCQmedium

A machine learning engineer is using Amazon SageMaker to train a model. The training data is stored in an S3 bucket encrypted with AWS KMS. The SageMaker training job fails with an AccessDenied error when trying to read the data. Which IAM policy addition should resolve the issue?

A.Add kms:Decrypt permission for the KMS key.

B.Add s3:GetObject permission for the bucket.

C.Add kms:GenerateDataKey permission for the key.

D.Attach the AmazonSageMakerFullAccess policy.

AnswerA

Decrypt is required to read encrypted objects.

Why this answer

The correct answer is A because when an S3 bucket is encrypted with AWS KMS, the SageMaker training job's execution role must have the `kms:Decrypt` permission for the specific KMS key to read the encrypted objects. Without this permission, the job fails with an AccessDenied error, even if `s3:GetObject` is granted, because SageMaker must decrypt the data before reading it.

Exam trap

The trap here is that candidates often assume `s3:GetObject` is sufficient for reading encrypted objects, overlooking that KMS-encrypted S3 data requires explicit `kms:Decrypt` permissions on the execution role.

How to eliminate wrong answers

Option B is wrong because `s3:GetObject` alone is insufficient; the error occurs specifically due to KMS encryption, so the missing permission is for KMS decryption, not S3 read access. Option C is wrong because `kms:GenerateDataKey` is used for creating new data keys for encryption, not for decrypting existing objects; the required permission for reading encrypted data is `kms:Decrypt`. Option D is wrong because attaching the `AmazonSageMakerFullAccess` managed policy does not automatically grant permissions for customer-managed KMS keys; it only provides basic SageMaker permissions, and explicit KMS key permissions must be added to the role.
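A policy statement granting the missing permission might look like the sketch below. The key ARN and account ID are placeholders, not values from the question, and the helper name `kms_decrypt_statement` is hypothetical.

```python
import json

# Placeholder ARN: substitute the key that actually encrypts the training bucket.
KMS_KEY_ARN = "arn:aws:kms:us-east-1:111122223333:key/EXAMPLE-KEY-ID"


def kms_decrypt_statement(key_arn):
    """Return an IAM statement allowing kms:Decrypt on one specific key."""
    return {
        "Effect": "Allow",
        "Action": ["kms:Decrypt"],
        "Resource": key_arn,
    }


policy = {
    "Version": "2012-10-17",
    "Statement": [kms_decrypt_statement(KMS_KEY_ARN)],
}
print(json.dumps(policy, indent=2))
```

Scoping `Resource` to the single key, rather than `*`, keeps the execution role's access limited to the data it needs to read.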

Full explanation →

1710

MCQeasy

A data scientist is performing EDA on a dataset with both numerical and categorical features. Which technique is best for detecting multicollinearity among numerical features?

A.Chi-square test of independence

B.Box plots for each numerical feature

C.Correlation matrix with heatmap

D.Pair plot

AnswerC

Correlation matrix shows pairwise linear correlations, indicating multicollinearity.

Why this answer

Option C is correct because a correlation matrix quantifies linear relationships between numerical features. Option B is wrong because box plots show distribution, not relationships. Option A is wrong because the chi-square test is for categorical associations.

Option D is wrong because pair plots visualize scatter plots but not a quantitative measure of multicollinearity.
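To make the quantitative point concrete, the sketch below computes a pairwise Pearson correlation matrix in plain Python and flags highly correlated pairs. The feature values and the 0.9 threshold are made up for illustration; in practice the matrix would typically come from a data-frame library and be rendered as a heatmap.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def correlation_matrix(columns):
    """Nested dict of pairwise correlations for a dict of named columns."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


def collinear_pairs(matrix, threshold=0.9):
    """List feature pairs whose absolute correlation meets the threshold."""
    names = list(matrix)
    return [
        (a, b)
        for i, a in enumerate(names)
        for b in names[i + 1:]
        if abs(matrix[a][b]) >= threshold
    ]


# Made-up data: x2 is roughly 2 * x1, x3 is unrelated to both.
features = {
    "x1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "x2": [2.1, 3.9, 6.2, 8.0, 9.9],
    "x3": [5.0, 1.0, 4.0, 2.0, 3.0],
}
print(collinear_pairs(correlation_matrix(features)))
```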

Full explanation →

1711

Multi-Selectmedium

A data engineering team is designing a data pipeline that processes streaming data from Amazon Kinesis Data Streams using AWS Lambda. The team notices that some records are being processed multiple times (duplicates). Which TWO steps should the team take to ensure exactly-once processing?

Select 2 answers

A.Design the Lambda function to be idempotent.

B.Use a unique record identifier and store processed IDs in an external store like DynamoDB.

C.Increase the batch size to reduce the number of invocations.

D.Use Kinesis Producer Library (KPL) to guarantee exactly-once delivery.

E.Disable retries on the Lambda function.

AnswersA, B

Idempotency ensures repeated processing produces same result.

Why this answer

Options A and B are correct. Making the Lambda function idempotent ensures that processing the same record multiple times does not cause duplicates downstream. Using a unique identifier per record and checking against a DynamoDB table allows deduplication.

Option C is wrong because a larger batch size does not prevent duplicates; a retry simply reprocesses more records at once. Option E is wrong because disabling retries may cause data loss. Option D is wrong because the KPL does not guarantee exactly-once delivery; deduplication is still needed.
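The deduplication idea can be sketched with an in-memory stand-in for the external store. The class `ProcessedIdStore` and its `put_if_absent` method are hypothetical; with DynamoDB, the equivalent would likely be a conditional write that succeeds only when the record ID is not already present.

```python
class ProcessedIdStore:
    """In-memory stand-in for an external table keyed by record ID."""

    def __init__(self):
        self._seen = set()

    def put_if_absent(self, record_id):
        """Record the ID and return True only the first time it is seen."""
        if record_id in self._seen:
            return False
        self._seen.add(record_id)
        return True


def handle_batch(records, store, sink):
    """Write each record's payload to the sink at most once."""
    for record in records:
        if store.put_if_absent(record["id"]):
            sink.append(record["payload"])


store, sink = ProcessedIdStore(), []
batch = [
    {"id": "r1", "payload": "click"},
    {"id": "r2", "payload": "view"},
]
handle_batch(batch, store, sink)
handle_batch(batch, store, sink)  # simulated redelivery of the same batch
print(sink)
```

Because the second delivery finds both IDs already recorded, the sink ends up with each payload once.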

Full explanation →

1712

MCQmedium

A company is building a data lake on Amazon S3 and wants to use AWS Glue to catalog the data. The data includes CSV, Parquet, and JSON files. The team wants to ensure that the Glue crawler can infer the schema correctly and update the Data Catalog when new partitions are added. Which crawler configuration should be used?

A.Create separate crawlers for each file format and schedule them at different times.

B.Use a crawler that only catalogs Parquet files because they are more efficient.

C.Use a crawler with 'Update all new and existing partitions' disabled to avoid schema conflicts.

D.Create a single crawler that includes all file extensions and set the 'Update all new and existing partitions' option.

AnswerD

Correct: Single crawler with partition updates ensures comprehensive cataloging.

Why this answer

Option D (create a single crawler that includes all supported file types and enable 'Update all new and existing partitions') is correct because Glue can handle multiple formats, and updating partitions ensures new data is cataloged. Option A (separate crawlers per format) is unnecessary. Option B (only Parquet) ignores the other formats.

Option C (disable partition updates) would miss new partitions.

Full explanation →

1713

MCQeasy

A company uses Amazon Kinesis Data Streams to ingest clickstream data from a website. The data is consumed by a custom application that runs on Amazon EC2 instances. The company notices that the consumer application is falling behind the producer, causing data to be throttled. Which action should the company take to improve the consumer's throughput?

A.Reduce the data retention period of the stream

B.Increase the number of shards in the Kinesis data stream

C.Increase the maximum concurrency of the AWS Lambda function that processes the stream

D.Use Amazon Kinesis Data Firehose to deliver data to Amazon S3

AnswerB

More shards increase the stream's read and write capacity.

Why this answer

Option B is correct because increasing the number of shards increases the stream's capacity and allows more consumers to read in parallel. Option C is wrong because the consumer runs on EC2, not Lambda, so Lambda concurrency is not related to this stream's throughput. Option D is wrong because Kinesis Data Firehose is a different service and does not speed up the existing consumer.

Option A is wrong because reducing the retention period does not increase throughput.
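A back-of-the-envelope shard estimate can be sketched as follows, using the standard per-shard limits of 1 MB/s or 1,000 records/s for writes and 2 MB/s for reads shared across consumers. The traffic figures and the function name `shards_needed` are hypothetical, and the read calculation assumes consumers without enhanced fan-out.

```python
import math

WRITE_MB_PER_SHARD = 1.0        # ingest limit per shard, MB/s
WRITE_RECORDS_PER_SHARD = 1000  # ingest limit per shard, records/s
READ_MB_PER_SHARD = 2.0         # read limit per shard, shared by consumers


def shards_needed(ingest_mb_per_s, records_per_s, consumers=1):
    """Estimate the shard count from whichever limit is tightest."""
    by_write_bytes = ingest_mb_per_s / WRITE_MB_PER_SHARD
    by_write_records = records_per_s / WRITE_RECORDS_PER_SHARD
    # Every consumer reads the full stream, so read demand scales with consumers.
    by_read = ingest_mb_per_s * consumers / READ_MB_PER_SHARD
    return max(1, math.ceil(max(by_write_bytes, by_write_records, by_read)))


# Hypothetical load: 6 MB/s and 4,000 records/s.
print(shards_needed(6, 4000, consumers=2))
print(shards_needed(6, 4000, consumers=4))
```

With four consumers the read side becomes the binding limit, which is the situation where adding shards helps a consumer that is falling behind.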

Full explanation →

1714

Multi-Selecthard

A data scientist is analyzing a dataset with a continuous target variable and suspects that the relationship between a predictor and the target is non-linear. Which THREE techniques can the scientist use to explore and model this non-linearity?

Select 3 answers

A.Apply logistic regression to binarize the target.

B.Compute the Pearson correlation coefficient between the predictor and target.

C.Add polynomial features (e.g., x^2, x^3) and check if model performance improves.

D.Fit a decision tree regressor and examine feature importance.

E.Create a scatter plot and overlay a LOESS (local regression) smooth curve.

AnswersC, D, E

Polynomial features capture non-linearity in linear models.

Why this answer

Options C, D, and E are correct. Scatter plots with a LOESS curve visually reveal non-linearity. Polynomial features allow linear models to capture non-linear relationships.

Decision trees can model non-linear interactions without explicit feature engineering. Option B is wrong because Pearson correlation measures only linear relationships. Option A is wrong because logistic regression is for binary outcomes, and binarizing a continuous target discards information.
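A small worked example shows why Pearson correlation can miss a non-linear relationship that a polynomial feature exposes. The data is synthetic (a target equal to x squared on a symmetric range), chosen only to make the effect obvious.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Synthetic data: the target depends on x only through x squared.
x = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
y = [value ** 2 for value in x]

# The raw predictor looks unrelated to the target under Pearson correlation.
r_raw = pearson(x, y)

# Adding the polynomial feature x^2 makes the relationship linear.
x_squared = [value ** 2 for value in x]
r_poly = pearson(x_squared, y)

print(r_raw, r_poly)
```

The raw predictor has zero linear correlation with the target even though the target is fully determined by it, while the squared feature correlates perfectly.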

Full explanation →

1715

MCQeasy

A machine learning team is analyzing a dataset with numerical features. They compute the pairwise correlation matrix and find that two features, 'X1' and 'X2', have a correlation coefficient of 0.98. The team plans to train a linear regression model. Which of the following actions should the team take to avoid multicollinearity issues?

A.Perform PCA on the dataset to reduce dimensionality.

B.Add an interaction term between X1 and X2 to the model.

C.Standardize both features using Z-score normalization.

D.Remove one of the two highly correlated features.

AnswerD

This directly addresses multicollinearity by eliminating redundancy.

Why this answer

Option D is correct because removing one of the highly correlated features reduces multicollinearity. Option A is wrong because PCA is not necessary for just two correlated features. Option C is wrong because standard scaling does not address correlation.

Option B is wrong because adding interaction terms increases multicollinearity.

Full explanation →

1716

MCQmedium

A data scientist is training a binary classifier on an imbalanced dataset where the positive class represents 1% of the data. The model currently achieves 99% accuracy but a recall of only 10% on the positive class. Which metric combination should the data scientist prioritize to evaluate model improvements?

A.F1 score and AUC-ROC

B.Precision and recall at 90% precision

C.Accuracy and RMSE

D.Precision and RMSE

AnswerA

F1 score balances precision and recall; AUC-ROC is robust to imbalance.

Why this answer

With a highly imbalanced dataset (1% positive class), 99% accuracy is misleading because the model can achieve it by simply predicting the majority class. The low recall (10%) indicates the model fails to identify most positive instances. The F1 score balances precision and recall, providing a single metric for minority class performance, while AUC-ROC evaluates the model's ability to distinguish between classes across all thresholds, making it robust to class imbalance.

Exam trap

The trap here is that candidates see 99% accuracy and assume the model is good, failing to recognize that accuracy is a poor metric for imbalanced datasets, and that metrics like RMSE are for regression, not classification.

How to eliminate wrong answers

Option B is wrong because 'precision and recall at 90% precision' is not a standard metric combination; it fixes precision arbitrarily, which may not be achievable or relevant for evaluating overall model improvements, and it ignores the trade-off with recall. Option C is wrong because accuracy is misleading on imbalanced data (as shown) and RMSE is a regression metric, not suitable for binary classification evaluation. Option D is wrong because RMSE is inappropriate for classification tasks; it measures continuous error, not classification performance, and precision alone does not capture recall or threshold behavior.
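The gap between accuracy and minority-class performance can be checked with simple arithmetic. The confusion-matrix counts below are hypothetical, chosen to match the scenario: 10,000 samples, 1% positive, 99% accuracy, and 10% recall.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts: 100 positives, of which only 10 are found.
metrics = classification_metrics(tp=10, fp=10, fn=90, tn=9890)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Accuracy stays at 0.99 while F1 is roughly 0.17, which is why F1 is the more informative signal for tracking improvements here.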

Full explanation →

1717

MCQmedium

A data scientist has this IAM policy attached to their role. When trying to create a SageMaker endpoint using the AWS CLI, they get an 'AccessDenied' error. What is the most likely reason?

A.The role does not have permission to use KMS to decrypt the model artifacts

B.The policy does not include 'sagemaker:InvokeEndpoint' action

C.The role does not have permission to access the S3 bucket containing the model artifacts

D.The policy uses a wildcard '*' for resources instead of specific ARNs

AnswerB

Creating an endpoint may require additional actions like DescribeEndpoint; InvokeEndpoint is needed for real-time inference.

Why this answer

Option B is correct because the policy only allows creating the endpoint (CreateEndpoint), but not invoking it (InvokeEndpoint). The error may also be due to missing permissions for other actions like sagemaker:DescribeEndpoint. Option C (S3) is not indicated by the exhibit.

Option A (KMS) is not indicated either. Option D is unlikely because a wildcard '*' resource broadens access rather than restricting it.

Full explanation →

1718

MCQeasy

A data scientist is using AWS Glue to prepare training data. The job reads from an S3 bucket, performs transformations, and writes to another S3 bucket. The job is failing due to insufficient memory. Which solution should the data scientist use to fix this?

A.Use AWS Glue's job bookmark feature.

B.Increase the number of DPU (Data Processing Units) for the job.

C.Use Amazon Athena instead of AWS Glue.

D.Use a columnar file format like Parquet.

AnswerB

More workers provide more memory.

Why this answer

Option B is correct because increasing the number of DPUs (workers) adds more memory. Option D is wrong because changing the file format does not increase memory. Option A is wrong because job bookmarks track previously processed data for incremental runs; they do not add memory.

Option C is wrong because Athena is a query service, not an ETL job.

Full explanation →

1719

MCQmedium

Refer to the exhibit. A data scientist creates a SageMaker model using the configuration above. When deploying the model to an endpoint, the endpoint status remains 'Creating' for a long time and then fails. What is the most likely cause?

A.The S3 model artifact does not exist

B.The environment variable SAGEMAKER_REGION is incorrect

C.The model name is already in use

D.The IAM role lacks permission to pull the Docker image from ECR

AnswerD

The image is in a different account; the role needs ecr:GetDownloadUrlForLayer and BatchGetImage permissions.

Why this answer

The image URI points to an ECR repository in account 382416733822, which is not the customer's account. SageMaker expects the image to be in the same account or accessible via cross-account permissions. This URI is likely the AWS account for built-in algorithms, but if the region or repository is incorrect, it may fail.

The most likely issue is that the image does not exist in that account or the role lacks permissions to pull it.

Full explanation →

1720

MCQeasy

A data scientist needs to implement a recommendation system for an e-commerce website. Which Amazon service is specifically designed for building and deploying recommendation models?

A.Amazon SageMaker

B.Amazon Rekognition

C.Amazon Forecast

D.Amazon Personalize

AnswerD

Personalize is specifically for building and deploying recommendation models.

Why this answer

Amazon Personalize is a fully managed machine learning service that provides real-time personalized recommendations. It is purpose-built for recommendation systems. SageMaker is a general-purpose ML platform, but Personalize is specialized for recommendations.

Full explanation →

1721

MCQhard

A company runs an e-commerce platform that generates clickstream data in real-time. The data is ingested into Amazon Kinesis Data Streams (100 shards) and processed by AWS Lambda functions, which aggregate data in 1-minute windows and write the results to Amazon S3. The Lambda functions are triggered by the Kinesis stream using the event source mapping. Recently, the company noticed that some records are being processed multiple times, leading to duplicate data in S3. The Lambda function is idempotent, but the duplicates are causing downstream issues. The Lambda function's concurrency limit is 1000, and the batch size is 100. The average processing time per record is 200 ms. What is the most likely cause of the duplicates, and how should it be fixed?

A.Increase the Lambda concurrency limit to 2000 to handle the load.

B.Ensure the Lambda function is idempotent and uses the sequence number to deduplicate records.

C.Decrease the batch size to 10 to reduce the impact of failures.

D.Use Amazon SQS FIFO queue as a buffer between Kinesis and Lambda to guarantee exactly-once processing.

AnswerB

If the function fails and retries, using sequence numbers allows it to skip already processed records, preventing duplicates.

Why this answer

Option B is correct. Lambda functions process records from Kinesis in batches. If the function fails (e.g., due to timeout or error), the entire batch is retried, causing duplicates if some records were already partially processed.

To avoid duplicates, the function should be idempotent and should not commit partial results. Option A is wrong because the concurrency is sufficient. Option C is wrong because a smaller batch only reduces how many records are reprocessed after a failure; it does not prevent duplicates.

Option D is wrong because a FIFO queue does not integrate with Kinesis.
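The batch-retry behavior described above can be simulated in a few lines. This is a sketch under stated assumptions: the `fail_at` parameter exists only to force a mid-batch failure, the committed set stands in for a durable store, and in practice the deduplication key would likely need to include the shard ID as well as the sequence number.

```python
def process_batch(records, committed, output, fail_at=None):
    """Process records in order, skipping sequence numbers already committed."""
    for index, record in enumerate(records):
        if fail_at is not None and index == fail_at:
            raise RuntimeError("simulated failure mid-batch")
        sequence_number = record["sequence_number"]
        if sequence_number in committed:
            continue  # already written during an earlier attempt
        output.append(record["data"])
        committed.add(sequence_number)


batch = [{"sequence_number": str(n), "data": f"event-{n}"} for n in range(1, 5)]
committed, output = set(), []

try:
    # First attempt fails after two records have been written.
    process_batch(batch, committed, output, fail_at=2)
except RuntimeError:
    # The whole batch is retried; the first two records are skipped.
    process_batch(batch, committed, output)

print(output)
```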

Full explanation →

1722

MCQmedium

A company is building a classification model and discovers that the target variable is imbalanced: 95% of samples belong to class A and 5% to class B. The data scientist needs to understand the distribution of numeric features for each class. Which approach is most appropriate?

A.Run a t-test for each feature to determine statistical significance between classes.

B.Generate box plots for each feature using Amazon QuickSight.

C.Use Amazon SageMaker Data Wrangler to create histograms for each feature, grouped by class label.

D.Compute the correlation matrix between features and the target.

AnswerC

Histograms grouped by class provide a clear view of feature distributions across classes.

Why this answer

Using Amazon SageMaker Data Wrangler to generate histograms segmented by class is a straightforward way to visualize feature distributions for each class. Option A (t-tests) may be used later but doesn't provide distribution visualization. Option B (box plots) is a good alternative but not as comprehensive as histograms for distribution shape.

Option D (correlation matrix) does not show class-wise distribution.

Full explanation →

1723

MCQhard

An organization is migrating its on-premises Hadoop cluster to AWS. The cluster runs Spark jobs that process 50 TB of data daily. The data is stored in HDFS with 3x replication. Which storage option on AWS provides the best price-performance for this workload?

A.Use AWS Glue to run Spark jobs with data stored in S3

B.Use Amazon EMR with S3 as the data store via EMRFS

C.Use Amazon Redshift Spectrum to query the data directly in S3

D.Use Amazon EMR with HDFS on EBS volumes

AnswerB

S3 provides 11 9's durability and is cheaper than EBS. EMRFS seamlessly integrates with Spark.

Why this answer

Amazon EMR with S3 as storage (using EMRFS) allows separating compute and storage. S3 is durable and cost-effective, avoiding 3x replication overhead. HDFS on EBS would require similar replication and is more expensive.

S3 with EMR is the standard recommendation.

Full explanation →

1724

MCQmedium

A data scientist is using Amazon SageMaker to train a linear regression model. The training data has 10 features and 100,000 observations. The model's training loss is decreasing, but the validation loss starts increasing after a few epochs. Which step should the data scientist take first to address this issue?

A.Add more features to the model

B.Reduce the learning rate

C.Increase the batch size

D.Increase the number of epochs

AnswerB

Reducing the learning rate can help the model converge more stably and reduce overfitting.

Why this answer

The increasing validation loss while training loss decreases is a classic sign of overfitting. Reducing the learning rate (Option B) is the first step to stabilize training by allowing the optimizer to take smaller, more controlled steps, which can help the model converge to a better local minimum and reduce validation loss. In SageMaker, this is typically adjusted via the `learning_rate` hyperparameter in the estimator.

Exam trap

The trap here is that candidates often confuse overfitting with underfitting and incorrectly choose to add more features or increase epochs, not realizing that the validation loss increase is a direct sign of overfitting that requires reducing model capacity or learning rate.

How to eliminate wrong answers

Option A is wrong because adding more features increases model complexity, which typically worsens overfitting by giving the model more capacity to memorize noise. Option C is wrong because increasing batch size provides a more accurate gradient estimate but does not directly address overfitting; it may even lead to sharper minima and worse generalization. Option D is wrong because increasing the number of epochs gives the model more iterations to overfit, which will further increase validation loss.

Full explanation →

1725

MCQmedium

A data scientist has this IAM policy attached to their IAM role. They are trying to run a SageMaker training job that reads data from 'my-bucket' and writes output to 'my-bucket'. The job fails. What is the most likely reason?

A.The sagemaker:CreateTrainingJob action is not allowed on specific resources

B.Missing s3:ListBucket permission on the bucket

C.Missing iam:PassRole permission

D.The training job requires permissions to write to CloudWatch Logs

AnswerC

SageMaker needs permission to pass the execution role to the training job.

Why this answer

Option C is correct: the policy does not grant permission to pass the execution role to SageMaker (iam:PassRole). Option B is incorrect because s3:GetObject and s3:PutObject are present. Option A is incorrect because the actions are allowed.

Option D is irrelevant.
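A statement granting the missing permission might look like the sketch below. The role ARN and account ID are placeholders, and the helper name `pass_role_statement` is hypothetical; the `iam:PassedToService` condition restricts the role to being passed only to SageMaker.

```python
import json

# Placeholder ARN: substitute the execution role used by the training job.
EXECUTION_ROLE_ARN = "arn:aws:iam::111122223333:role/ExampleSageMakerExecutionRole"


def pass_role_statement(role_arn):
    """Return an IAM statement allowing the role to be passed to SageMaker only."""
    return {
        "Effect": "Allow",
        "Action": "iam:PassRole",
        "Resource": role_arn,
        "Condition": {
            "StringEquals": {"iam:PassedToService": "sagemaker.amazonaws.com"}
        },
    }


statement = pass_role_statement(EXECUTION_ROLE_ARN)
print(json.dumps(statement, indent=2))
```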

Full explanation →
