Knowledge + Practice

AWS Certified Machine Learning Specialty (MLS-C01): Questions 751–825

1755 questions total · 24 pages · All types, answers revealed

751

MCQmedium

A team is exploring a dataset with missing values in multiple columns. They want to decide whether to drop rows or impute values. Which approach is most appropriate for exploratory data analysis?

A.Impute missing values with the mean of each column

B.Analyze the missing data pattern using visualizations and summary statistics

C.Drop all rows with missing values to ensure data quality

D.Use Amazon SageMaker Data Wrangler to automatically impute missing values

AnswerB

Understanding the missing data pattern is crucial before deciding on imputation or deletion.

Why this answer

Option B is correct because during EDA, it is important to first understand the pattern and extent of missing data before deciding on treatment. Option A is wrong because imputing without understanding the missing mechanism may introduce bias. Option C is wrong because dropping rows without analysis may discard valuable data.

Option D is wrong because EDA does not require using a specific AWS service.
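
As a minimal sketch of what "analyzing the missing data pattern" can look like, the snippet below computes the share of missing values per column and counts which columns tend to be missing together. The rows and column names are hypothetical, and only the Python standard library is used.

```python
from collections import Counter

# Hypothetical sample rows; None marks a missing value.
rows = [
    {"age": 34, "income": 52000, "city": "Austin"},
    {"age": None, "income": 48000, "city": "Boston"},
    {"age": 29, "income": None, "city": None},
    {"age": None, "income": None, "city": "Denver"},
]

def missing_fraction(rows):
    """Fraction of missing values per column."""
    columns = rows[0].keys()
    return {c: sum(r[c] is None for r in rows) / len(rows) for c in columns}

def missing_patterns(rows):
    """Count how often each combination of missing columns occurs."""
    return Counter(
        tuple(sorted(c for c, v in r.items() if v is None)) for r in rows
    )

fractions = missing_fraction(rows)
patterns = missing_patterns(rows)
print(fractions)
print(patterns.most_common())
```

Columns that are frequently missing together hint that the values are not missing at random, which is the kind of evidence needed before choosing between dropping and imputing.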

752

MCQhard

An ML team is deploying a model to a SageMaker endpoint for real-time inference. The model is large (2 GB) and requires GPU for low-latency inference. The team wants to minimize cost while maintaining a response time of under 200 ms. Which instance configuration and SageMaker feature would be best?

A.Use a GPU instance (ml.p3.2xlarge) with SageMaker Elastic Inference.

B.Use a batch transform job on GPU instances.

C.Use a serverless inference endpoint with a CPU instance.

D.Use a multi-model endpoint on a GPU instance.

AnswerA

Elastic Inference provides GPU acceleration at lower cost than a full GPU instance.

Why this answer

Option A is correct because SageMaker Elastic Inference attaches a fraction of GPU acceleration to the endpoint's instance, balancing cost and performance. Option C is wrong because serverless inference may not support GPU and has cold starts. Option D is wrong because multi-model endpoints are for hosting multiple models on the same instance, not primarily for latency.

Option B is wrong because batch transforms are for offline inference.

753

MCQhard

Refer to the exhibit. A data scientist ran an S3 Select query on a large CSV file stored in Amazon S3. The output shows only 2 records returned, but the data scientist expected thousands. The file size is 10 GB. What is the MOST likely reason for the small result set?

A.The file needs to be indexed by S3 Select before querying.

B.The city column may have leading/trailing spaces or case differences.

C.The CSV file contains nested arrays that S3 Select cannot parse.

D.S3 Select does not support the WHERE clause on CSV files.

AnswerB

String comparison is exact; variations cause mismatches, reducing results.

Why this answer

S3 Select performs exact string matching by default, so if the WHERE clause filters on the city column, any leading/trailing spaces or case differences will cause mismatches, returning far fewer rows than expected. The query likely used a literal like 'New York' while the data contains ' New York ' or 'new york', resulting in only 2 matches instead of thousands.
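
The effect of exact matching can be sketched in plain Python. The sample values below are hypothetical, and the SQL string is an assumed form of the query (it presumes the CSV header is used so that `s.city` resolves); it relies only on the `TRIM` and `LOWER` string functions that S3 Select SQL provides.

```python
# Hypothetical values standing in for the CSV's city column.
cities = ["New York", " New York ", "new york", "NEW YORK", "Boston", "New York"]

def exact_matches(values, literal):
    """Exact comparison, as in WHERE s.city = 'New York'."""
    return [v for v in values if v == literal]

def normalized_matches(values, literal):
    """Trim whitespace and lower-case both sides before comparing."""
    target = literal.strip().lower()
    return [v for v in values if v.strip().lower() == target]

# Assumed S3 Select expression applying the same normalization in SQL.
normalized_sql = (
    "SELECT * FROM S3Object s "
    "WHERE LOWER(TRIM(s.city)) = 'new york'"
)

exact_count = len(exact_matches(cities, "New York"))
normalized_count = len(normalized_matches(cities, "New York"))
print(exact_count, normalized_count)
```

Normalizing inside the query keeps the fix on the read side, so the stored file does not need to be rewritten.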

Exam trap

AWS often tests the nuance that S3 Select does not automatically trim or normalize string data, so candidates mistakenly assume the query engine handles such common data quality issues.

How to eliminate wrong answers

Option A is wrong because S3 Select does not require indexing; it scans the entire file and applies the query on the fly. Option C is wrong because S3 Select can parse CSV files with nested arrays as long as the CSV is well-formed (e.g., quoted fields), and nested arrays are not inherently unsupported. Option D is wrong because S3 Select fully supports the WHERE clause on CSV files, including standard SQL predicates.

754

Multi-Selectmedium

A data scientist is training a neural network for image classification. The training loss decreases but validation loss increases after a few epochs. Which TWO actions should be taken to address this?

Select 2 answers

A.Implement early stopping based on validation loss.

B.Increase the learning rate.

C.Increase the dropout rate.

D.Increase the number of epochs.

E.Add more convolutional layers.

AnswersA, C

Early stopping prevents overfitting by halting training when validation loss stops improving.

Why this answer

Option A is correct because early stopping monitors validation loss and halts training when it stops improving, preventing overfitting. This directly addresses the symptom of decreasing training loss with increasing validation loss, which is a classic sign of overfitting.
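
A minimal, framework-independent sketch of the early stopping rule follows. The `patience` parameter and the loss values are illustrative assumptions; deep learning frameworks ship their own callbacks for this.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return the 0-based epoch at which training would stop, or None.

    Training stops once validation loss has failed to improve on its best
    value for `patience` consecutive epochs.
    """
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None

# Hypothetical validation losses: improving, then rising (overfitting).
val_losses = [0.90, 0.70, 0.60, 0.65, 0.72, 0.80]
stop_epoch = early_stopping_epoch(val_losses, patience=2)
print(stop_epoch)
```

In practice the weights from the best epoch (here epoch 2) would be restored rather than those from the stopping epoch.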

Exam trap

AWS often tests the distinction between underfitting and overfitting solutions, where candidates mistakenly choose capacity-increasing options (like more layers or epochs) when the problem is overfitting, not underfitting.

755

MCQhard

A machine learning team is using Amazon SageMaker to train a large language model. The training script uses PyTorch and the model requires significant memory. The team wants to use model parallelism across multiple GPUs. Which SageMaker feature should they use?

A.SageMaker Distributed Training

B.SageMaker model parallelism library

C.SageMaker Horovod

D.SageMaker Debugger

AnswerB

SMP is specifically designed for model parallelism.

Why this answer

SageMaker's model parallelism library (SMP) is designed for distributed training of large models across GPUs. Horovod is for data parallelism, not model parallelism. SageMaker Debugger is for monitoring training.

Distributed Training is a generic term; the specific library is SMP.

756

MCQhard

A data scientist submits a SageMaker training job with the provided configuration. The job fails immediately with the error 'Algorithm not found: 382416733822.dkr.ecr.us-west-2.amazonaws.com/sagemaker-xgboost:1.2-1'. What is the most likely cause?

A.The training region is different from the image region.

B.The ECR repository URI is incorrect or the image does not exist.

C.The input data format is incorrect.

D.The IAM role does not have permission to pull the image.

AnswerB

The URI may have the wrong account ID or tag.

Why this answer

Option B is correct because the error names the image URI, which may contain the wrong account ID or tag. Option A is wrong because the region is specified. Option D is wrong because the role is specified.

Option C is wrong because the format is correct.

757

MCQmedium

An e-commerce company uses Amazon Kinesis Data Firehose to deliver clickstream data to an Amazon S3 bucket. The data is then queried using Amazon Athena. The marketing team wants to run daily reports that aggregate click events by product ID. However, the reports are slow because Athena scans the entire dataset each time. The data is partitioned by date (e.g., s3://bucket/clickstream/2023/01/01/). The product ID is a column within the data. The data engineering team wants to improve query performance without moving the data to another service. Which approach should the team take?

A.Convert the data from JSON to Parquet format

B.Use Amazon Redshift Spectrum to query the data

C.Create a view in Athena that filters by product ID

D.Repartition the data by product ID in addition to date

AnswerD

Partitioning by product ID allows Athena to skip irrelevant partitions.

Why this answer

Partitioning by product ID would allow Athena to prune partitions for queries filtering by product ID. Option A is wrong because converting to Parquet alone may improve performance but does not eliminate the full scan. Option C is wrong because creating a view does not change physical storage.

Option B is wrong because Redshift Spectrum still requires scanning.

758

MCQeasy

A data scientist is training a neural network for image classification. The training loss is decreasing steadily, but the validation loss starts increasing after a few epochs. What is the MOST likely cause?

A.The learning rate is too high

B.The gradients are vanishing

C.The model is underfitting

D.The model is overfitting to the training data

AnswerD

Overfitting causes validation loss to increase.

Why this answer

The correct answer is D because the validation loss increasing while the training loss continues to decrease is the classic signature of overfitting. The model is memorizing the training data (including noise) rather than learning generalizable patterns, causing it to perform poorly on unseen validation data.

Exam trap

AWS often tests the distinction between overfitting and underfitting by describing a scenario where training loss decreases but validation loss increases, and the trap is that candidates may mistakenly attribute this to a high learning rate or vanishing gradients instead of recognizing it as the hallmark of overfitting.

How to eliminate wrong answers

Option A is wrong because a learning rate that is too high typically causes the training loss to oscillate or diverge, not steadily decrease while validation loss increases. Option B is wrong because vanishing gradients prevent the model from learning at all, resulting in stagnant training loss, not a decreasing training loss with increasing validation loss. Option C is wrong because underfitting means the model fails to capture patterns in the training data, leading to high training loss that does not decrease adequately, which contradicts the described steady decrease in training loss.

759

MCQmedium

Refer to the exhibit. An IAM policy is attached to a SageMaker execution role. A data scientist tries to create a training job that reads training data from s3://my-bucket/confidential/data.csv. What will happen?

A.The training job will succeed because there is an Allow on my-bucket/*

B.The training job will succeed because the Deny statement is invalid

C.The training job will fail because the role lacks sagemaker:CreateTrainingJob

D.The training job will fail with an access denied error

AnswerD

The Deny statement blocks access to the confidential prefix.

Why this answer

The policy allows s3:GetObject on my-bucket/* but explicitly denies s3:GetObject on my-bucket/confidential/*. Since explicit Deny overrides any Allow, the training job will fail with an access denied error.
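
The exhibit is not reproduced here, so the policy below is a hypothetical reconstruction of the two statements just described. The evaluator is a deliberately simplified sketch of the "explicit Deny wins" rule, not the full IAM evaluation logic.

```python
from fnmatch import fnmatchcase

# Hypothetical policy mirroring the Allow and Deny statements described above.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow", "Action": "s3:GetObject",
         "Resource": "arn:aws:s3:::my-bucket/*"},
        {"Effect": "Deny", "Action": "s3:GetObject",
         "Resource": "arn:aws:s3:::my-bucket/confidential/*"},
    ],
}

def is_allowed(policy, action, resource):
    """Simplified check: any matching Deny wins; otherwise a matching Allow is required."""
    allowed = False
    for stmt in policy["Statement"]:
        if stmt["Action"] == action and fnmatchcase(resource, stmt["Resource"]):
            if stmt["Effect"] == "Deny":
                return False
            allowed = True
    return allowed

confidential = is_allowed(
    policy, "s3:GetObject", "arn:aws:s3:::my-bucket/confidential/data.csv"
)
public = is_allowed(policy, "s3:GetObject", "arn:aws:s3:::my-bucket/public/data.csv")
print(confidential, public)
```

The confidential object matches both statements, and the Deny decides the outcome regardless of statement order.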

760

MCQhard

A company uses Amazon Redshift for its data warehouse. The data engineering team needs to load 10 TB of data from Amazon S3 into Redshift every night. The team wants to minimize the load time and use the fewest number of COPY commands. The data is in CSV format and is partitioned by date in S3. Which approach should the team take?

A.Use a manifest file with a single COPY command.

B.Use multiple COPY commands, one per partition.

C.Concatenate all data into a single large file before loading.

D.Use AWS Glue to transform the data and then load into Redshift.

AnswerA

A manifest file allows Redshift to load from multiple files in parallel efficiently.

Why this answer

Option A is correct. Using a manifest file with a single COPY command allows Redshift to load multiple files in parallel, optimizing throughput. Option B is wrong because using multiple COPY commands increases overhead.

Option C is wrong because loading all data in one file is impractical for 10 TB. Option D is wrong because AWS Glue adds unnecessary overhead and cost for a simple load task.
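
For illustration, a manifest is a small JSON document listing the files to load. The sketch below builds one and shows a COPY statement that points at it; the bucket, prefix, table name, and IAM role ARN are all hypothetical placeholders.

```python
import json

def build_manifest(s3_urls):
    """Build a Redshift COPY manifest listing every file to load."""
    return {"entries": [{"url": url, "mandatory": True} for url in s3_urls]}

# Hypothetical bucket, prefix, and file names.
urls = [f"s3://example-bucket/sales/2023/01/01/part-{i:04d}.csv" for i in range(3)]
manifest_json = json.dumps(build_manifest(urls), indent=2)

# Assumed table name, manifest location, and IAM role ARN.
copy_sql = (
    "COPY sales FROM 's3://example-bucket/manifests/2023-01-01.manifest' "
    "IAM_ROLE 'arn:aws:iam::123456789012:role/ExampleRedshiftRole' "
    "FORMAT AS CSV MANIFEST;"
)

print(manifest_json)
print(copy_sql)
```

Marking entries as mandatory makes the load fail if a listed file is missing, which is usually preferable to a silent partial load.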

761

MCQmedium

A data engineer is performing EDA on a dataset containing user activity logs from a mobile app. The dataset has 10 million rows and includes columns: 'user_id', 'event_type', 'timestamp', 'device_type', and 'session_duration'. The engineer uses Amazon Athena to query the data stored in S3 as CSV files. The engineer runs a query to find the average session_duration per device_type, but the query takes over 5 minutes and scans 100 GB of data. The engineer wants to reduce query cost and improve performance for future EDA. The dataset is not partitioned, and the engineer anticipates frequent queries filtering on 'timestamp' and 'device_type'. Which action will most effectively reduce data scanned?

A.Partition the table by date derived from timestamp and convert to Parquet.

B.Use random sampling to query a subset of data.

C.Convert the data to Parquet format and use columnar storage.

D.Partition the table by device_type.

AnswerA

Combining partitioning and columnar storage maximizes reduction in scanned data.

Why this answer

Option A is correct because partitioning by date (derived from timestamp) allows partition pruning when filtering by timestamp, and Parquet's columnar layout lets Athena read only the queried columns, significantly reducing data scanned. Converting to Parquet alone (Option C) helps, but without partitioning, queries still read every file. Option D is wrong because it only partitions by device_type, but time-based filters are common.

Option B is wrong because sampling loses accuracy.
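
Partition pruning can be sketched without any AWS service: group records under a date key derived from the timestamp, then read only the keys that satisfy the filter. The `dt=YYYY-MM-DD` naming and the sample events are assumptions made for illustration.

```python
from datetime import datetime

def partition_key(timestamp_iso):
    """Derive a Hive-style date partition (dt=YYYY-MM-DD) from an ISO timestamp."""
    return "dt=" + datetime.fromisoformat(timestamp_iso).date().isoformat()

# Hypothetical events: (timestamp, device_type, session_duration in seconds).
events = [
    ("2023-01-01T08:15:00", "ios", 120),
    ("2023-01-01T21:40:00", "android", 300),
    ("2023-01-02T09:05:00", "ios", 45),
    ("2023-01-03T13:30:00", "web", 600),
]

# Group events by partition, as files would be grouped under S3 prefixes.
partitions = {}
for ts, device, duration in events:
    partitions.setdefault(partition_key(ts), []).append((ts, device, duration))

def partitions_to_scan(partitions, start_date, end_date):
    """Keep only partitions whose date falls inside the filter range."""
    return sorted(
        p for p in partitions if f"dt={start_date}" <= p <= f"dt={end_date}"
    )

scanned = partitions_to_scan(partitions, "2023-01-02", "2023-01-02")
print(scanned, "of", len(partitions), "partitions")
```

A query filtered to one day touches one of the three partitions here; the same idea is what lets a query engine skip whole S3 prefixes.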

762

MCQmedium

A team is using Amazon SageMaker to train a model on a dataset that is 500 GB in size, stored as CSV files in S3. The training job takes 2 hours using a single ml.p3.2xlarge instance. The team wants to reduce training time to under 30 minutes. The model architecture supports distributed training. Which solution will achieve this goal with the LEAST amount of code changes?

A.Use managed spot training to reduce cost and then use cost savings to train with a larger instance.

B.Use a single ml.p3.16xlarge instance with more GPUs and memory.

C.Use multiple ml.p3.2xlarge instances with SageMaker's distributed data parallelism library, enabling automatic sharding of the training data.

D.Change the input mode to Pipe mode to stream data from S3 directly, reducing I/O wait time.

AnswerC

Distributed training across multiple instances reduces time proportionally; minimal code changes with SageMaker's SDK.

Why this answer

Option C is correct because SageMaker's distributed data parallelism library automatically shards the training data across multiple ml.p3.2xlarge instances, enabling parallel gradient computation and reducing wall-clock training time from 2 hours to under 30 minutes without requiring manual code changes to the training script. The model architecture already supports distributed training, so the library handles the communication and synchronization (e.g., AllReduce) transparently.

Exam trap

The trap here is that candidates often confuse 'larger instance' (Option B) with 'distributed training' (Option C), failing to realize that a single large instance cannot parallelize data loading and gradient computation across multiple nodes, while distributed data parallelism with multiple smaller instances can achieve the required speedup with minimal code changes.

How to eliminate wrong answers

Option A is wrong because managed spot training reduces cost but does not inherently reduce training time; using a larger instance with spot training still requires code changes for distributed training and may not achieve the sub-30-minute goal. Option B is wrong because a single ml.p3.16xlarge instance, while having more GPUs and memory, still processes data sequentially on one node and cannot scale training time linearly to under 30 minutes for a 500 GB dataset without distributed data parallelism across multiple instances. Option D is wrong because Pipe mode streams data directly from S3 to reduce I/O wait time, but it does not parallelize computation across multiple GPUs or instances, so the training time remains bound by the single-instance compute capacity.

763

MCQeasy

A data scientist is using Amazon SageMaker to train a linear regression model. The dataset has 500 features and 50,000 observations. The model converges but has high bias. Which technique should the data scientist use to reduce bias?

A.Apply L2 regularization (Ridge) to penalize large coefficients.

B.Add polynomial features or interaction terms to the feature set.

C.Decrease the learning rate.

D.Use feature selection to remove irrelevant features.

E.Increase the number of training epochs.

AnswerB

Increasing model complexity reduces bias.

Why this answer

Option B is correct because adding interaction features or polynomial features allows the linear model to capture non-linear relationships, reducing bias. Option A (regularization) reduces variance, not bias. Option C (decreasing the learning rate) affects convergence speed, not bias.

Option D (feature selection) reduces complexity and may increase bias. Option E (more training epochs) does not help because the model has already converged.
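
A small worked example, using synthetic data and a hand-written single-feature least-squares fit, shows how a polynomial feature removes bias that a straight line cannot. The data and helper names are illustrative assumptions.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept (single feature)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def mse(xs, ys, slope, intercept):
    """Mean squared error of the fitted line on the given data."""
    return sum((slope * x + intercept - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [x ** 2 for x in xs]  # synthetic quadratic target

# Fit on the raw feature x: the line cannot follow the curve (high bias).
linear_mse = mse(xs, ys, *fit_line(xs, ys))

# Fit on the engineered feature x^2: the same linear model now fits exactly.
squared = [x ** 2 for x in xs]
poly_mse = mse(squared, ys, *fit_line(squared, ys))

print(linear_mse, poly_mse)
```

The model class is unchanged in both fits; only the feature set grows, which is why this counts as reducing bias rather than switching algorithms.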

764

MCQeasy

A company uses Amazon S3 to store log files from various applications. The logs are in JSON format and are appended to existing files every few minutes. A data analyst wants to run SQL queries on the logs using Amazon Athena. However, queries return incomplete results because Athena does not support modifying data. The team needs to enable querying of the latest log data with minimal changes to the existing ingestion process. Which solution should the team implement?

A.Convert the logs to Parquet format using a scheduled AWS Glue job and store them in a separate S3 bucket.

B.Stream the logs to Amazon Kinesis Data Firehose, which writes the data to S3 in Parquet format.

C.Create an Athena table using the Hive JSON SerDe that reads the logs directly from the existing S3 bucket.

D.Use AWS Glue to load the JSON logs into Amazon Redshift and query using Redshift.

AnswerC

Athena can query JSON logs with the correct SerDe without changing the ingestion.

Why this answer

Option C is correct because Athena supports reading JSON data with the Hive JSON SerDe. By creating a table with the appropriate SerDe, the analyst can query the JSON logs directly. Option A is wrong because converting to Parquet would require changing the ingestion process.

Option D is wrong because Glue ETL to load into Redshift is overkill and adds latency. Option B is wrong because Kinesis Data Firehose would require changing the ingestion pipeline.

765

MCQeasy

A company wants to build a data lake on Amazon S3. The data lake will store raw data in its original format and also transformed data in Parquet. The data is generated by various sources and must be cataloged for discovery. Which service should the company use to automatically discover, catalog, and make the data searchable?

A.AWS Glue Data Catalog

B.Amazon S3

C.Amazon Athena

D.Amazon EMR

AnswerA

Glue Data Catalog is a managed metadata repository with crawlers.

Why this answer

Option A is correct. AWS Glue Data Catalog is a central metadata repository that can automatically crawl S3 data sources to populate the catalog. Option C is wrong because Athena is a query engine, not a catalog.

Option D is wrong because EMR is a processing framework. Option B is wrong because S3 is storage.

766

MCQmedium

A media company uses SageMaker to train a recommendation model. The training data is stored in an S3 bucket with versioning enabled. The data pipeline updates the training data daily by overwriting objects with new data. Recently, the model's performance degraded, and the team suspects that the training data was corrupted on a specific day. They want to train the model using the data from a previous version. How can the team retrieve the previous version of the training data?

A.Restore the bucket from S3 Glacier, which contains the previous version.

B.Use the S3 GET Object Version API to download the specific version of each object.

C.Use S3 Select to query the previous version of the data.

D.Enable S3 replication to a different bucket and use the replicated data.

AnswerB

S3 versioning stores multiple versions; GET Object with version ID retrieves the desired version.

Why this answer

Option B is correct because S3 versioning allows retrieving any previous version of an object by specifying the version ID. The team can list versions, identify the correct version, and use it for training. Option C is incorrect because S3 Select is for querying data within an object, not for version retrieval.

Option A is incorrect because S3 Glacier is for archival, not for accessing previous versions of current objects. Option D is incorrect because S3 replication does not help in retrieving previous versions from the same bucket.
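
The "list versions, identify the correct version" step can be sketched as a pure function over version metadata. The entries below are hypothetical and only shaped like the "Versions" list returned by boto3's `list_object_versions`; the bucket name, key, and version IDs are made up, and no AWS call is executed.

```python
from datetime import datetime, timezone

def latest_version_before(versions, cutoff):
    """Pick the newest version of an object last modified before `cutoff`."""
    candidates = [v for v in versions if v["LastModified"] < cutoff]
    if not candidates:
        return None
    return max(candidates, key=lambda v: v["LastModified"])["VersionId"]

# Hypothetical version metadata for one object that is overwritten daily.
versions = [
    {"Key": "train/data.csv", "VersionId": "v3",
     "LastModified": datetime(2023, 1, 3, tzinfo=timezone.utc)},
    {"Key": "train/data.csv", "VersionId": "v2",
     "LastModified": datetime(2023, 1, 2, tzinfo=timezone.utc)},
    {"Key": "train/data.csv", "VersionId": "v1",
     "LastModified": datetime(2023, 1, 1, tzinfo=timezone.utc)},
]

# Assume the corrupted upload happened on 2023-01-03.
cutoff = datetime(2023, 1, 3, tzinfo=timezone.utc)
version_id = latest_version_before(versions, cutoff)
print(version_id)

# With boto3 (not executed here), the chosen version could then be fetched with:
#   s3.get_object(Bucket="example-bucket", Key="train/data.csv", VersionId=version_id)
```

Repeating this per object yields a consistent snapshot of the training prefix from before the suspected corruption.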

767

MCQhard

A company uses AWS Glue ETL jobs to process data from an Amazon RDS for MySQL database into Amazon S3. The job runs daily and takes 6 hours to complete. The team wants to reduce runtime and cost. The source table has 50 million rows and is updated continuously. Which combination of changes would be MOST effective?

A.Use a single worker with a larger instance type.

B.Increase the number of DPUs and enable job bookmarking.

C.Use JDBC connections with pushdown predicates and increase the number of DPUs.

D.Change the job trigger from time-based to event-based.

AnswerC

Pushdown predicates filter data at source, reducing data transfer; more DPUs parallelize the work.

Why this answer

Option C is correct because JDBC connections with pushdown predicates reduce the data transferred, and increasing DPUs parallelizes processing. Option B is wrong because increasing DPUs without pushdown may cause a bottleneck on the source. Option A is wrong because a single worker cannot process 50M rows quickly.

Option D is wrong because triggers do not optimize runtime.

768

MCQmedium

A data scientist is training a binary classifier on an imbalanced dataset where the positive class represents only 2% of the data. The model achieves 99% accuracy but only identifies 5% of actual positives. Which metric should the scientist use to evaluate the model's ability to detect the positive class?

A.Accuracy

B.F1-score

C.Precision

D.Recall

AnswerD

Recall directly measures the fraction of actual positives captured.

Why this answer

Recall (sensitivity) measures the proportion of actual positives correctly identified, which is the key concern here. Accuracy is misleading due to class imbalance.
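
The gap between accuracy and recall is easy to see with a confusion matrix. The counts below are hypothetical, chosen only to mirror a 2% positive rate with 5% of positives detected; they are not taken from the question.

```python
def accuracy(tp, fp, tn, fn):
    """Share of all predictions that are correct."""
    return (tp + tn) / (tp + fp + tn + fn)

def recall(tp, fn):
    """Share of actual positives that the model identifies."""
    return tp / (tp + fn)

# Hypothetical counts: 10,000 samples, 200 positives, only 10 detected.
tp, fn = 10, 190
fp, tn = 0, 9800

acc = accuracy(tp, fp, tn, fn)
rec = recall(tp, fn)
print(f"accuracy={acc:.3f} recall={rec:.3f}")
```

Accuracy stays above 98% because the negative class dominates, while recall exposes that almost every positive case is missed.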

769

Drag & Dropmedium

Drag and drop the steps to set up Amazon SageMaker Ground Truth for a labeling job in the correct order.

Why this order

Ground Truth setup involves dataset preparation, job creation, task configuration, instructions, and execution.

770

Multi-Selecthard

Which THREE of the following are common causes of multicollinearity in a linear regression model?

Select 3 answers

A.Including a polynomial term (e.g., x^2) along with the original variable

B.Including interaction terms between independent variables

C.Including all dummy variables for a categorical feature

D.Having two or more predictors that are highly correlated

E.Presence of outliers in the target variable

AnswersA, C, D

Polynomial terms are correlated with the original variable.

Why this answer

Options A, C, and D are correct. Dummy variable trap occurs when all categories are included without dropping one. Highly correlated predictors directly cause multicollinearity.

Including polynomial terms creates correlation with the original variable. Option B (interaction terms) can also introduce multicollinearity but is a less common cause. Option E (outliers in the target variable) does not cause multicollinearity.
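
Two of these causes can be checked numerically with a few lines of standard-library Python. The data is synthetic and the helper names are illustrative.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)

# Polynomial term: x and x^2 are strongly correlated when x is all positive.
x = [float(i) for i in range(1, 11)]
x_squared = [v ** 2 for v in x]
r = pearson(x, x_squared)

# Dummy-variable trap: with every category encoded, each row's dummies sum
# to 1, which duplicates the information in the intercept column.
colors = ["red", "green", "blue", "green"]
categories = sorted(set(colors))
dummies = [[1 if c == cat else 0 for cat in categories] for c in colors]
row_sums = [sum(row) for row in dummies]

print(round(r, 4), row_sums)
```

Centering x before squaring, or dropping one dummy column, are the usual remedies for these two cases.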

771

MCQeasy

Refer to the exhibit. A data scientist wants to update the endpoint to use a new model image. The scientist updates the endpoint configuration with the new image and calls UpdateEndpoint. After the update, the endpoint status is 'Updating' but remains in that state for a long time. What is the most likely cause?

A.The new model image is failing health checks

B.The endpoint is already InService, so it cannot be updated

C.The old model image is no longer available

D.The instance count is too low to deploy both variants

AnswerA

SageMaker waits for the new variant to pass health checks; failure can cause indefinite updating.

Why this answer

Blue/green deployment requires the new model to be healthy before traffic is shifted. If the new model fails health checks, the update may hang. Option B is wrong because the endpoint is updating, not failed.

Option C is wrong because the old model is still running. Option D is wrong because there is no indication of insufficient capacity.

772

Multi-Selecthard

A machine learning engineer is evaluating a classification model that predicts whether a transaction is fraudulent. The model outputs a probability score. The cost of a false negative (missed fraud) is 10 times higher than the cost of a false positive (false alarm). Which TWO evaluation metrics should the engineer use to tune the model? (Choose TWO.)

Select 2 answers

A.F-beta score with beta = 2

B.Accuracy

C.Log loss

D.ROC-AUC

E.Precision-Recall curve

AnswersA, E

F-beta with beta > 1 weights recall higher than precision, matching the cost structure.

Why this answer

Precision-Recall curve and F-beta score (with beta > 1) emphasize recall, which is important when false negatives are costly. Option D (ROC-AUC) is less sensitive to class imbalance. Option B (accuracy) is misleading for imbalanced data.

Option C (log loss) is not directly tied to cost.
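
The F-beta score is defined as (1 + beta^2) * P * R / (beta^2 * P + R), where P is precision and R is recall. The sketch below applies it to two hypothetical models to show how beta = 2 favors the one with higher recall.

```python
def f_beta(precision, recall, beta):
    """F-beta score; beta > 1 weights recall more heavily than precision."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Two hypothetical models with mirrored precision and recall.
high_recall = {"precision": 0.5, "recall": 0.9}
high_precision = {"precision": 0.9, "recall": 0.5}

f1_a = f_beta(high_recall["precision"], high_recall["recall"], beta=1)
f1_b = f_beta(high_precision["precision"], high_precision["recall"], beta=1)
f2_a = f_beta(high_recall["precision"], high_recall["recall"], beta=2)
f2_b = f_beta(high_precision["precision"], high_precision["recall"], beta=2)

print(round(f1_a, 4), round(f1_b, 4), round(f2_a, 4), round(f2_b, 4))
```

F1 cannot tell the two models apart, while F2 ranks the high-recall model first, which matches a cost structure where missed fraud is the expensive error.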

773

Multi-Selecthard

A company is using Amazon SageMaker to train a large language model. The training job is taking too long. The data scientist wants to reduce training time without sacrificing model accuracy. Which THREE strategies are MOST appropriate?

Select 3 answers

A.Use mixed precision training (float16)

B.Increase the batch size to utilize GPU memory more efficiently

C.Switch from GPU instance to CPU instance

D.Increase the maximum sequence length

E.Use gradient accumulation to increase effective batch size

AnswersA, B, E

Mixed precision reduces memory and speeds up training on GPUs.

Why this answer

Mixed precision training (float16) reduces memory usage and accelerates computation by using half-precision floating-point numbers for most operations, while maintaining a single-precision copy of critical parameters to preserve accuracy. This directly reduces training time on compatible GPUs (e.g., NVIDIA V100, A100) without sacrificing model quality, as the loss scaling technique prevents underflow in gradients.

Exam trap

AWS often tests the misconception that increasing batch size always speeds up training, but without gradient accumulation, a larger batch size may exceed GPU memory limits and cause out-of-memory errors, while gradient accumulation safely simulates a larger batch size without increasing memory usage.
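
Gradient accumulation can be illustrated on a toy one-parameter model: averaging the gradients of equal-sized micro-batches reproduces the gradient of the full batch, so only one micro-batch needs to be held in memory at a time. The model, data, and names are assumptions made for this sketch.

```python
def grad_mse(w, batch):
    """Gradient of mean squared error for the model y_hat = w * x over a batch."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

# Hypothetical (x, y) training pairs and a starting weight.
data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2)]
w = 0.5

# Gradient computed on the full batch of four samples at once.
full_batch_grad = grad_mse(w, data)

# Same gradient accumulated over two micro-batches of two samples each.
micro_batches = [data[0:2], data[2:4]]
accumulated_grad = 0.0
for micro_batch in micro_batches:
    accumulated_grad += grad_mse(w, micro_batch) / len(micro_batches)

print(full_batch_grad, accumulated_grad)
```

The optimizer step is taken once after the loop, so the update matches that of the larger batch while peak memory follows the micro-batch size.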

774

MCQhard

A data scientist is using Amazon SageMaker to train a gradient boosting model on a dataset with categorical features. The dataset contains a column 'UserID' with over 1 million unique values. The training is taking very long and the model size is large. Which technique would MOST effectively reduce training time and model size while maintaining accuracy?

A.Use one-hot encoding on UserID.

B.Apply feature hashing to UserID.

C.Use label encoding for UserID.

D.Remove UserID from the dataset.

AnswerB

Feature hashing maps user IDs to a fixed number of buckets (e.g., 2^14), reducing dimensionality and preserving some signal.

Why this answer

Option B is correct because hashing reduces the number of distinct categories to a fixed number of buckets, controlling dimensionality. Option A is wrong because one-hot encoding would explode the feature space. Option D is wrong because removing UserID likely loses important signal.

Option C is wrong because label encoding creates ordinal relationships that may mislead the model.
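
A minimal sketch of feature hashing with the standard library follows. The choice of MD5 as the hash and the user ID format are assumptions for illustration; the bucket count of 2^14 follows the example given above.

```python
import hashlib

def hash_bucket(value, num_buckets=2 ** 14):
    """Map a string to a stable bucket index in [0, num_buckets)."""
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets

# Hypothetical user IDs; a real dataset would have over a million of them.
user_ids = [f"user-{i}" for i in range(50_000)]
buckets = [hash_bucket(u) for u in user_ids]

distinct_ids = len(set(user_ids))
distinct_buckets = len(set(buckets))
print(distinct_ids, distinct_buckets)
```

Different IDs can collide in the same bucket, which is the trade-off accepted in exchange for a fixed, much smaller feature space.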

775

Multi-Selectmedium

A data engineer is designing a streaming pipeline using Amazon Kinesis Data Analytics for Apache Flink. The pipeline reads from a Kinesis data stream and writes to a S3 bucket. The job must recover quickly from failures without reprocessing large amounts of data. Which TWO configurations should be used? (Choose TWO)

Select 2 answers

A.Enable checkpointing with a state backend like RocksDB.

B.Use in-memory state backend for low latency.

C.Configure the S3 sink to use exactly-once delivery semantics.

D.Set the parallelism to the maximum number of shards.

E.Increase the retention period of the Kinesis stream to 365 days.

AnswersA, C

Checkpointing enables state recovery after failure.

Why this answer

Option A is correct because enabling checkpointing with a state backend like RocksDB allows Apache Flink to periodically save the state of the streaming application to durable storage. In the event of a failure, Flink can restart from the last completed checkpoint, avoiding the need to reprocess large amounts of data from the beginning of the stream. RocksDB is specifically designed for large state and provides fast recovery by storing state on disk with memory caching, making it ideal for production streaming pipelines.

Exam trap

The trap here is that candidates often confuse parallelism or stream retention settings with fault-tolerance mechanisms, mistakenly believing that increasing parallelism or retention alone can prevent data reprocessing, when in fact only checkpointing with a durable state backend ensures fast recovery.

776

MCQmedium

A data engineer needs to transform large CSV files stored in Amazon S3 into Parquet format before loading into Amazon Redshift. The transformation logic is complex and requires custom Python code. Which AWS service should be used to perform this transformation with minimal operational overhead?

A.AWS Glue

B.AWS Lambda

C.Amazon EMR

D.AWS Data Pipeline

AnswerA

Glue is a serverless ETL service that can run complex transformations on data in S3 and write to Parquet.

Why this answer

Option A is correct because AWS Glue provides a serverless Spark environment with built-in support for ETL jobs, including converting CSV to Parquet. Option B (AWS Lambda) has a 15-minute timeout and is not suitable for large files. Option C (Amazon EMR) requires managing clusters.

Option D (AWS Data Pipeline) is a legacy service with less flexibility.

777

MCQhard

A data scientist is trying to create a SageMaker endpoint using an IAM role with the attached policy. The operation fails with 'AccessDenied'. What is the MOST likely cause?

A.The policy does not allow sagemaker:InvokeEndpoint.

B.The policy does not allow access to the S3 bucket for model artifacts.

C.The policy does not allow sagemaker:CreateEndpoint.

D.The policy does not allow sagemaker:CreateModel.

AnswerC

Creating an endpoint requires sagemaker:CreateEndpoint, which is not in the policy.

Why this answer

The error 'AccessDenied' when creating a SageMaker endpoint indicates that the IAM role lacks the necessary permissions for the specific API call being made. Since the operation is to create an endpoint, the required action is sagemaker:CreateEndpoint, not sagemaker:InvokeEndpoint. Option A is incorrect because InvokeEndpoint is used to invoke a deployed endpoint for inference, not to create it.

The most likely cause is that the policy does not include sagemaker:CreateEndpoint.

Exam trap

The trap here is that candidates confuse the permissions needed for creating an endpoint (sagemaker:CreateEndpoint) with those for invoking it (sagemaker:InvokeEndpoint), leading them to incorrectly select option A when the actual missing permission is for creation.

How to eliminate wrong answers

Option A is wrong because sagemaker:InvokeEndpoint is for invoking an existing endpoint for predictions, not for creating one; the error occurs during creation, so this permission is irrelevant. Option B is wrong because, while S3 bucket access for model artifacts is needed for CreateModel, the error here is 'AccessDenied' on the CreateEndpoint API call, not on S3; a missing S3 permission would typically surface as an S3-side error. Option C is correct because the failing operation is endpoint creation, which requires sagemaker:CreateEndpoint.

Option D is wrong because sagemaker:CreateModel is needed to create a model, but the operation failing is specifically the endpoint creation step, not the model creation step; a missing CreateModel permission would cause failure earlier in the pipeline.
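The difference between the create and invoke permissions can be illustrated with a small sketch. The policy below is hypothetical (the policy attached in the question is not shown here), and `is_allowed` is an illustrative helper that only checks Allow statements; real IAM evaluation also accounts for explicit denies, resources, conditions, and other policy types.

```python
import fnmatch

# Hypothetical identity-based policy, invented for illustration only.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "sagemaker:CreateModel",
                "sagemaker:CreateEndpointConfig",
                "sagemaker:InvokeEndpoint",
            ],
            "Resource": "*",
        }
    ],
}


def is_allowed(policy, action):
    """Simplified check: True if any Allow statement lists a matching action.

    This ignores Deny statements, Resource, and Condition on purpose.
    """
    for statement in policy["Statement"]:
        if statement.get("Effect") != "Allow":
            continue
        actions = statement["Action"]
        if isinstance(actions, str):
            actions = [actions]
        # Action names are matched case-insensitively; "*" acts as a wildcard.
        if any(fnmatch.fnmatchcase(action.lower(), p.lower()) for p in actions):
            return True
    return False


can_create = is_allowed(policy, "sagemaker:CreateEndpoint")
can_invoke = is_allowed(policy, "sagemaker:InvokeEndpoint")
```

Under this assumed policy the role can invoke an endpoint but cannot create one, which matches the failure described in the question.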

Full explanation →

778

Multi-Selecteasy

A company wants to build a data lake on Amazon S3. The data lake should support both batch and real-time data ingestion. Which AWS services should be used for data ingestion? (Choose TWO.)

Select 2 answers

A.AWS Glue

B.Amazon Kinesis Data Firehose

C.Amazon Redshift

D.Amazon Athena

E.Amazon SQS

AnswersA, B

Glue performs batch ETL and can ingest data into S3.

Why this answer

Option B (Kinesis Data Firehose) is for real-time ingestion, and Option A (AWS Glue) is for batch ETL. Option D (Athena) is for querying, Option C (Redshift) is a data warehouse, and Option E (SQS) is a message queue, not a service for ingesting data into S3.

Full explanation →

779

Multi-Selecthard

A data scientist is deploying a model on Amazon SageMaker. The model requires inference on images, and the data scientist wants to use a GPU instance for low latency. However, the data scientist is unsure about the instance type to choose for the endpoint. Which TWO factors should the data scientist consider when selecting the instance type? (Choose TWO.)

Select 2 answers

A.The time taken to train the model

B.The number of vCPUs on the instance

C.The cost per inference for the instance type

D.The AWS Region of the S3 bucket storing the model

E.The GPU memory available on the instance

AnswersC, E

Cost is a key consideration.

Why this answer

Options C and E are correct. E: GPU memory must be sufficient to hold the model and a batch of images. C: Cost per inference is important for operational efficiency.

Option B (number of vCPUs) is less relevant for GPU inference. Option D (S3 bucket location) does not affect instance choice. Option A (training time) is not relevant for inference.

Full explanation →

780

MCQeasy

A company is using Amazon SageMaker to deploy a model for real-time inference. The model is updated frequently. Which deployment strategy allows for zero-downtime updates and easy rollback?

A.Canary deployment

B.A/B testing with production variants

C.Blue/green deployment using endpoint updates

D.Multi-model endpoint

AnswerC

Blue/green deployment provides zero-downtime updates and rollback.

Why this answer

SageMaker's blue/green deployment (using endpoint updates with production variants) allows traffic shifting and rollback. A/B testing is for comparing variants, not for zero-downtime updates by itself. Canary is a traffic-shifting mode within a blue/green deployment rather than a separate deployment strategy.

Multi-model endpoints are for hosting multiple models.

Full explanation →

781

MCQhard

A company is using Amazon SageMaker Ground Truth to build a training dataset for an image classification model. The company has a large number of unlabeled images stored in Amazon S3. The data science team wants to use a private workforce consisting of internal employees to label the images. The team creates a labeling job with a private workforce. After starting the job, the team notices that the labeling tasks are not being assigned to any workers. The workers have been added to the private workforce and have received their login credentials. What is the MOST likely cause of this issue?

A.The labeling job is configured for a different task type than image classification.

B.The S3 bucket containing the images has incorrect permissions, preventing workers from viewing the images.

C.The workers do not have the required IAM permissions to access the labeling portal.

D.The workers have not been added to the work team that is assigned to the labeling job.

AnswerD

Correct: Workers must be part of the work team to receive tasks.

Why this answer

For a private workforce, workers must be added to a work team that is associated with the labeling job, so Option D is correct. Option C (IAM permissions) would prevent login, not task assignment.

Option B (S3 permissions) would affect data access, not task visibility. Option A (task type) does not affect assignment.

Full explanation →

782

MCQmedium

A data scientist is training a recurrent neural network (RNN) for time series forecasting. The training loss decreases steadily for the first 10 epochs, then plateaus. The validation loss starts increasing after epoch 10. What is the most appropriate action?

A.Stop training early and use the model from epoch 10

B.Continue training for more epochs

C.Add more layers to the network

D.Increase the batch size

AnswerA

Early stopping prevents overfitting; the model at epoch 10 generalizes better.

Why this answer

Option A is correct because validation loss rising while training loss no longer improves indicates overfitting. Early stopping halts training before overfitting worsens. Option B (more epochs) would increase overfitting.

Option D (batch size) does not directly address overfitting. Option C (more layers) may exacerbate overfitting.
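As a rough illustration of the early-stopping logic described above, the sketch below scans a list of per-epoch validation losses and reports the best epoch and the epoch at which training would stop. The loss values, the `patience` setting, and the function name are made up for the example; a real training loop would also save a model checkpoint at the best epoch.

```python
def best_epoch_with_early_stopping(val_losses, patience=3):
    """Return (best_epoch, stopped_epoch), both 1-indexed.

    Training stops once the validation loss has failed to improve
    for `patience` consecutive epochs.
    """
    best_loss = float("inf")
    best_epoch = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            # New best validation loss: remember this epoch.
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` epochs: stop here.
            return best_epoch, epoch
    return best_epoch, len(val_losses)


# Invented validation losses: improving through epoch 10, rising afterwards.
val_losses = [1.00, 0.90, 0.82, 0.75, 0.70, 0.66, 0.63,
              0.61, 0.60, 0.59, 0.60, 0.62, 0.65, 0.69]
best, stopped = best_epoch_with_early_stopping(val_losses, patience=3)
```

With these numbers the best model is the one from epoch 10, and training halts a few epochs later once the rise in validation loss is confirmed.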

Full explanation →

783

MCQhard

A data engineer is designing a data pipeline that transforms raw JSON files (each 50-200 KB) in Amazon S3 into Parquet format using AWS Glue. The pipeline must minimize data processing costs and handle a high volume of small files (millions per day). The engineer configures a Glue ETL job with Spark, but the job is slow and expensive due to overhead of reading many small files. Which optimization should the engineer implement to reduce cost and improve performance?

A.Increase the worker type to G.2X for more memory per worker.

B.Increase the number of DPUs allocated to the Glue job.

C.Change the output format from Parquet to CSV to reduce compression overhead.

D.Use S3 object grouping or batch operations to combine small files before Glue processing.

AnswerD

Combining small files reduces task overhead, leading to faster and cheaper jobs.

Why this answer

Option D is correct. Using S3 object grouping (e.g., S3 batch operations or partitioning) to create larger files reduces the number of tasks and overhead in Spark. Option B is wrong because increasing DPUs increases cost without addressing the small file problem.

Option C is wrong because converting to CSV is not more efficient than Parquet; Parquet is columnar and efficient. Option A is wrong because a larger worker type also increases cost without solving the small file issue.

Full explanation →

784

MCQeasy

A data scientist needs to run a one-time SQL query on a large dataset in Amazon S3. The dataset is stored in Parquet format and is about 500 GB. The query requires complex aggregations and joins. Which AWS service should be used to minimize cost and setup time?

A.Amazon Redshift

B.Amazon Athena

C.Amazon RDS for MySQL

D.Amazon EMR with Spark SQL

AnswerB

Serverless, pay-per-query, no setup required.

Why this answer

Amazon Athena is serverless and charges per query based on the data scanned, which makes it ideal for one-time queries on S3 data. Option D (Amazon EMR) requires cluster setup and management. Option A (Amazon Redshift) requires provisioning a cluster.

Option C (Amazon RDS) is not designed for direct S3 querying.

Full explanation →

785

MCQeasy

A company wants to use Amazon SageMaker to train a model, but the training data contains personally identifiable information (PII). The data scientist needs to ensure that the PII is not accessible during training. The data is stored in S3. What is the MOST secure approach?

A.Configure a VPC for the SageMaker notebook and training job.

B.Load the data into Amazon Redshift and use Redshift ML.

C.Use server-side encryption with S3-managed keys (SSE-S3).

D.Use SageMaker's built-in mechanisms to encrypt data at rest and in transit, and ensure the training container does not have direct S3 access.

AnswerD

SageMaker can use a KMS key to encrypt data, and by not granting S3 access to the container, PII is protected.

Why this answer

Option D is correct. Using SageMaker with a KMS key encrypts data at rest and in transit, and the training container cannot access S3 directly. Option C is wrong because encryption alone does not prevent access.

Option B is wrong because Redshift adds complexity. Option A is wrong because a VPC alone does not encrypt data.

Full explanation →

786

MCQhard

A data scientist is deploying a real-time inference endpoint using SageMaker. The model is a large NLP model requiring GPU for low latency. The endpoint must be highly available across two Availability Zones. Which deployment configuration meets these requirements?

A.Deploy a single model endpoint on an ml.c5.xlarge instance with auto-scaling

B.Use SageMaker batch transform on GPU instances

C.Deploy a multi-model endpoint on an ml.p3.2xlarge instance with auto-scaling and at least two instances in different AZs

D.Deploy a single model endpoint on an ml.p3.2xlarge instance with one instance

AnswerC

GPU, auto-scaling, and multi-AZ provide low latency and high availability.

Why this answer

Option C is correct: a multi-model endpoint on a GPU instance with auto-scaling and at least two instances in different AZs ensures high availability. Option A uses a CPU instance. Option D (single instance) lacks HA.

Option B (batch transform) is not real-time.

Full explanation →

787

MCQhard

A data scientist is training a convolutional neural network (CNN) for image classification using Amazon SageMaker. The training loss decreases steadily but validation loss starts increasing after a few epochs. Which action should the data scientist take to address this issue?

A.Increase the learning rate

B.Add more convolutional layers

C.Increase the batch size

D.Implement early stopping based on validation loss

AnswerD

Early stopping prevents overfitting by stopping training when validation loss plateaus or increases.

Why this answer

The described behavior, training loss decreasing while validation loss increases, is a classic sign of overfitting. Early stopping monitors the validation loss and halts training when it stops improving (or starts to increase), preventing the model from memorizing noise in the training data. In SageMaker, this can be implemented in the training script with an early-stopping callback (for example, Keras's `EarlyStopping`) or, for built-in algorithms that support it, through their early-stopping hyperparameters (for example, `early_stopping_rounds` in XGBoost).

Exam trap

The trap here is that candidates confuse overfitting with underfitting and choose to increase model complexity (Option B) or learning rate (Option A), not recognizing that rising validation loss signals the need to stop training rather than continue with more capacity.

How to eliminate wrong answers

Option A is wrong because increasing the learning rate makes the optimizer take larger steps, which can cause the loss to diverge or oscillate and does nothing to address overfitting. Option B is wrong because adding more convolutional layers increases model capacity, which typically exacerbates overfitting when validation loss is already rising. Option C is wrong because increasing the batch size provides a more accurate gradient estimate but does not directly address overfitting; it may even lead to sharper minima and poorer generalization.

Full explanation →

788

Multi-Selecteasy

A data scientist is using Amazon SageMaker to train a linear regression model. The dataset has outliers. Which TWO techniques can help reduce the impact of outliers? (Choose TWO.)

Select 2 answers

A.Trim the dataset to remove extreme values

B.Add more features

C.Apply L1 regularization

D.Use Huber loss instead of squared error

E.Standardize the features

AnswersA, D

Removing outliers reduces their influence on the model.

Why this answer

Options A and D are correct. Huber loss is robust to outliers, and trimming the dataset removes extreme values. Option E (standardization) does not handle outliers.

Option C (L1 regularization) reduces overfitting but not outlier impact. Option B (more features) is not relevant.
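To see why Huber loss is less sensitive to outliers than squared error, the sketch below compares the two losses on a small and a large residual. The threshold `delta=1.0` and the residual values are arbitrary choices for illustration.

```python
def squared_error(residual):
    """Half squared error: grows quadratically with the residual."""
    return 0.5 * residual ** 2


def huber_loss(residual, delta=1.0):
    """Huber loss: quadratic near zero, linear beyond `delta`."""
    magnitude = abs(residual)
    if magnitude <= delta:
        return 0.5 * magnitude ** 2
    return delta * (magnitude - 0.5 * delta)


small, outlier = 0.5, 10.0

# Identical for small residuals...
small_pair = (squared_error(small), huber_loss(small))
# ...but the outlier contributes far less under Huber loss.
outlier_pair = (squared_error(outlier), huber_loss(outlier))
```

For the residual of 10, squared error contributes 50.0 to the objective while Huber loss contributes 9.5, so a single extreme point pulls the fit much less.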

Full explanation →

789

MCQmedium

A company's ML model training on Amazon SageMaker is taking longer than expected. The training job uses a single ml.p3.2xlarge instance. Which change is most likely to reduce training time?

A.Increase the instance's EBS volume size

B.Use distributed training with multiple GPU instances

C.Enable Managed Spot Training

D.Switch to a compute-optimized instance with more vCPUs

AnswerB

Parallelizes work across GPUs.

Why this answer

Distributed training across multiple GPUs can parallelize computation. Option D is wrong because more vCPUs without a GPU may not help for deep learning. Option C is wrong because Managed Spot Training reduces cost but can be interrupted and does not speed up training.

Option A is wrong because increasing the EBS volume size does not affect compute speed.

Full explanation →

790

MCQeasy

A data scientist receives the above error during model training. What is the most likely cause?

A.The training data contains missing or infinite values.

B.The learning rate is too high.

C.The data format is incorrect; expected CSV but received JSON.

D.The instance type lacks sufficient memory.

AnswerA

The error explicitly states 'Input contains NaN, infinity or a value too large'.

Why this answer

Option A is correct. The error indicates NaN or infinite values in the input data. Option B is wrong because the error is about data, not hyperparameters.

Option D is wrong because the error is not about memory. Option C is wrong because the error is not about data format.
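A quick way to confirm this kind of problem before training is to scan the input for non-finite values. The sketch below uses only the standard library; the sample rows and the helper name `find_non_finite` are invented for illustration.

```python
import math


def find_non_finite(rows):
    """Return (row_index, column_index, value) for every NaN or infinite entry."""
    bad = []
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if not math.isfinite(value):
                bad.append((i, j, value))
    return bad


# Toy feature matrix with one NaN and one infinity.
rows = [
    [1.0, 2.5, 3.0],
    [4.0, float("nan"), 6.0],
    [7.0, 8.0, float("inf")],
]
bad_cells = find_non_finite(rows)
bad_positions = [(i, j) for i, j, _ in bad_cells]
```

Once the offending cells are located, they can be imputed or the affected rows dropped before the training job is resubmitted.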

Full explanation →

791

MCQhard

A data engineer is designing a data pipeline that ingests 500 GB of data daily from an on-premises Oracle database to Amazon S3. The pipeline must minimize data loss and support change data capture (CDC). Which combination of services should they use?

A.AWS Database Migration Service (DMS) with ongoing replication

B.AWS Data Pipeline with SQL query

C.Amazon Kinesis Data Streams with a custom Oracle CDC connector

D.AWS Glue ETL jobs running on a schedule

AnswerA

DMS supports CDC and can write to S3.

Why this answer

AWS Database Migration Service (DMS) with ongoing replication enables CDC from Oracle to S3. Option D is wrong because Glue does not support CDC directly. Option C is wrong because Kinesis requires custom agents.

Option B is wrong because Data Pipeline is batch.

Full explanation →

792

MCQhard

A team is using Amazon SageMaker to train a model. The training job repeatedly fails with a 'ResourceLimitExceeded' error. Which action should the team take to resolve this issue?

A.Request a service limit increase for SageMaker resources.

B.Reduce the size of the training dataset.

C.Switch to using Spot Instances.

D.Use a different instance type with less memory.

AnswerA

ResourceLimitExceeded indicates the account limit has been reached; requesting an increase is the standard resolution.

Why this answer

The 'ResourceLimitExceeded' error in Amazon SageMaker indicates that the AWS account has reached a service quota for SageMaker resources, such as the number of concurrent training jobs, total instance count, or specific instance types. The correct action is to request a service limit increase via the AWS Service Quotas console or by contacting AWS Support, as this directly addresses the quota cap causing the failure.

Exam trap

The trap here is that candidates confuse resource limits with performance or cost issues, leading them to choose dataset reduction, Spot Instances, or smaller instance types, when the root cause is a hard AWS service quota that must be increased.

How to eliminate wrong answers

Option B is wrong because reducing the size of the training dataset does not affect the service quota limits; it might reduce training time or cost but will not resolve a 'ResourceLimitExceeded' error, which is a quota-based issue. Option C is wrong because switching to Spot Instances can reduce cost but does not increase the account's resource limits; Spot Instances are still subject to the same service quotas for instance count and concurrent jobs. Option D is wrong because using a different instance type with less memory does not change the fact that the account has hit a resource limit; the error is about exceeding a quota, not about memory capacity.

Full explanation →

793

MCQhard

A research team is training a deep learning model for image classification using Amazon SageMaker. The model is a convolutional neural network (CNN) with 50 layers. The team uses a single ml.p3.2xlarge instance. After 10 hours of training, the model has not converged and the loss is decreasing very slowly. The team suspects vanishing gradients. They want to diagnose and fix the issue without significant code changes. Which action should they take?

A.Add more convolutional layers to increase model capacity

B.Modify the architecture to include residual connections (skip connections)

C.Use batch normalization after each convolutional layer

D.Increase the learning rate by a factor of 10

AnswerB

Residual connections allow gradients to flow directly through the network.

Why this answer

Option B (use residual connections) directly addresses vanishing gradients. Option D (increase learning rate) may cause divergence. Option A (add more layers) worsens the problem.

Option C (use batch normalization) helps but residual connections are more targeted for vanishing gradients.
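A toy scalar model can show why skip connections help. If each layer computes y = f(x), the gradient through n layers is the product of the local derivatives f'(x); with a residual block y = x + f(x), each factor becomes 1 + f'(x). The sketch below assumes every layer has the same small local derivative, which is a deliberate simplification and not a model of a real CNN.

```python
def plain_gradient(local_derivs):
    """Gradient through stacked layers y = f(x): product of f'(x)."""
    grad = 1.0
    for d in local_derivs:
        grad *= d
    return grad


def residual_gradient(local_derivs):
    """Gradient through stacked residual blocks y = x + f(x): product of 1 + f'(x)."""
    grad = 1.0
    for d in local_derivs:
        grad *= 1.0 + d
    return grad


# Assumed: 50 layers, each with a small local derivative of 0.02.
local_derivs = [0.02] * 50
plain = plain_gradient(local_derivs)
residual = residual_gradient(local_derivs)
```

In this simplified setting the plain gradient shrinks to a vanishingly small number, while the residual path keeps it on the order of 1 because the identity term always passes the gradient through.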

Full explanation →

794

MCQeasy

A data scientist is training a text classification model using a bag-of-words approach. The dataset contains 1 million documents and 100,000 unique words. The resulting feature matrix is very sparse. Which technique should the data scientist use to reduce the dimensionality of the feature space?

A.Apply TF-IDF transformation

B.Use word embeddings to represent documents

C.Remove stop words from the vocabulary

D.Apply Principal Component Analysis (PCA) to the term-document matrix

AnswerB

Word embeddings create dense low-dimensional vectors, reducing sparsity and dimensionality.

Why this answer

Word embeddings (e.g., Word2Vec, GloVe) map words to dense, low-dimensional vectors that capture semantic relationships, effectively reducing the 100,000-dimensional sparse bag-of-words feature space to a much smaller dense representation (e.g., 100–300 dimensions). This directly addresses the sparsity and high dimensionality of the term-document matrix while preserving meaningful word context.

Exam trap

The trap here is that candidates confuse TF-IDF (a reweighting technique) with dimensionality reduction, or assume PCA can be directly applied to sparse text matrices without considering computational cost and loss of interpretability.

How to eliminate wrong answers

Option A is wrong because TF-IDF is a weighting scheme that reweights term frequencies based on inverse document frequency, but it does not reduce the number of features; the feature space remains 100,000 dimensions and still sparse. Option C is wrong because removing stop words reduces the vocabulary size only marginally (typically a few hundred words) and does not significantly reduce the 100,000 unique words or address the sparsity of the feature matrix. Option D is wrong because PCA is a linear dimensionality reduction technique that is computationally infeasible on a 1 million × 100,000 sparse matrix (dense covariance matrix would be 100k × 100k) and destroys the sparse structure without capturing semantic relationships.

Full explanation →

795

Multi-Selecthard

A data scientist is tuning a gradient boosting model using Amazon SageMaker Automatic Model Tuning (AMT). Which THREE hyperparameters should the scientist consider tuning to reduce overfitting? (Select THREE.)

Select 3 answers

A.Subsample ratio

B.Learning rate (eta)

C.Minimum child weight (min_child_weight)

D.Gamma (minimum loss reduction)

E.Maximum depth (max_depth)

AnswersB, C, D

Lower learning rate reduces overfitting.

Why this answer

Learning rate (eta) controls the contribution of each tree to the ensemble. A lower learning rate forces the model to learn more slowly, requiring more trees but reducing the risk of overfitting by preventing any single tree from having too much influence on the final prediction.

Exam trap

The trap here is that candidates often assume all listed hyperparameters are equally effective for reducing overfitting, but the exam expects knowledge that subsample ratio and maximum depth are also valid regularization parameters, yet the question specifically selects min_child_weight, gamma, and learning rate as the three to focus on.

Full explanation →

796

MCQeasy

Refer to the exhibit. An ML engineer creates a CloudFormation stack with this template. The stack creation succeeds, but when the engineer tries to invoke the endpoint, it returns a ModelError. The CloudWatch logs show that the container exited with error. What is the MOST likely cause?

A.The execution role does not have permissions to pull the Docker image from ECR.

B.The initial instance count is set to 2, which is insufficient for the model size.

C.The endpoint is not deployed in a VPC and cannot access the S3 bucket.

D.The EndpointConfig references the model but the model is not yet created.

AnswerA

The role must have ECR permissions to pull the image; if missing, the container fails to start.

Why this answer

The template does not specify a VPC configuration for the endpoint. By default, SageMaker endpoints are not in a VPC and cannot access resources in a VPC unless configured. However, the model artifact is in S3 (s3://my-bucket/model.tar.gz), which is accessible without VPC.

A ModelError is commonly caused by a container image that is not compatible with the instance type or by a missing model file. Given the template, however, a likely issue is that the execution role (SageMakerRole) does not have permissions to access the ECR image or the S3 bucket. The error is not about the VPC (C), the instance count (B), or the endpoint config (D).

Full explanation →

797

MCQeasy

A company needs to move 10 TB of data from an on-premises NAS to Amazon S3 over a 100 Mbps internet connection. The transfer must complete within 3 days. Which solution is the most appropriate?

A.Use AWS DataSync to transfer over the internet

B.Enable S3 Transfer Acceleration on the bucket

C.Use AWS CLI to copy data directly over the internet

D.Use AWS Snowball Edge to transfer the data

AnswerD

Snowball Edge provides physical transport, faster than internet for large data.

Why this answer

Option D is correct because AWS Snowball Edge is a physical device that can move large data volumes faster than this internet connection allows. Option C is wrong because at 100 Mbps, 10 TB would take roughly 9 to 10 days even at full line rate. Option A is wrong because AWS DataSync is limited by the same network link. Option B is wrong because S3 Transfer Acceleration optimizes the path to S3 but cannot push more than the 100 Mbps link allows, which is still insufficient.
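The transfer-time estimate can be checked with a short calculation. The sketch assumes decimal units (1 TB = 10^12 bytes, 1 Mbps = 10^6 bits per second); the `efficiency` parameter is a hypothetical knob for protocol overhead and link sharing, not a measured value.

```python
def transfer_days(size_tb, link_mbps, efficiency=1.0):
    """Days needed to move `size_tb` terabytes over a `link_mbps` link."""
    bits = size_tb * 1e12 * 8                      # decimal terabytes to bits
    seconds = bits / (link_mbps * 1e6 * efficiency)
    return seconds / 86400                         # seconds per day


ideal = transfer_days(10, 100)                     # full line rate
realistic = transfer_days(10, 100, efficiency=0.8) # assumed 80% utilization
meets_deadline = ideal <= 3
```

Even at full line rate the transfer takes a little over nine days, so no network-based option can meet a three-day deadline on this link.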

Full explanation →

798

MCQhard

A company's dataset contains a feature 'zip_code' with 500 unique values. The data scientist wants to use this feature in a linear model. Which EDA step is most important before feature engineering?

A.Check the proportion of missing values

B.Compute the frequency of each zip code

C.Plot a histogram of the feature

D.Calculate the correlation between zip code and the target

AnswerB

Knowing frequency helps decide which categories to combine or how to encode.

Why this answer

Because zip codes are categorical with high cardinality, analyzing the frequency distribution helps decide how to group or encode them (e.g., target encoding). Option C is wrong because histograms are for continuous variables. Option D is wrong because correlation is for numeric features.

Option A is wrong because missing value proportion is unrelated to cardinality handling.
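A frequency count of this kind might look like the sketch below, which tallies each zip code and folds rare values into a single bucket. The sample values and the `min_count` threshold are invented for illustration; in practice the threshold would be chosen after inspecting the full distribution.

```python
from collections import Counter

# Toy sample; a real column would have around 500 distinct values.
zip_codes = ["10001", "10001", "10001", "94105", "94105", "60601", "73301"]

# Frequency of each zip code, most common first.
counts = Counter(zip_codes)
most_common_zip, most_common_count = counts.most_common(1)[0]

# Hypothetical rule: zip codes seen fewer than `min_count` times become "OTHER".
min_count = 2
rare = {z for z, c in counts.items() if c < min_count}
grouped = [z if z not in rare else "OTHER" for z in zip_codes]
```

Grouping rare categories this way keeps the number of encoded columns manageable for a linear model.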

Full explanation →

799

Multi-Selecthard

A company is designing a data pipeline that ingests streaming data from social media feeds. The data must be processed in real-time to detect trending topics, and results must be stored in Amazon DynamoDB for low-latency access. Which services should the company use? (Choose TWO.)

Select 2 answers

A.Amazon Kinesis Data Firehose

B.AWS Lambda

C.Amazon Simple Queue Service (SQS)

D.Amazon Kinesis Data Analytics

E.Amazon Kinesis Data Streams

AnswersD, E

Provides real-time analytics to detect trending topics.

Why this answer

Option E (Kinesis Data Streams) is required for ingestion, and Option D (Kinesis Data Analytics) is required for real-time trending detection. Option C (SQS) is not for streaming. Option B (Lambda) can process records but is not the best fit for real-time analytics on streams.

Option A (Firehose) is near-real-time, not real-time.

Full explanation →

800

Multi-Selectmedium

A company is building a sentiment analysis model for customer reviews. The dataset is balanced with 10,000 positive and 10,000 negative reviews. The model achieves 95% accuracy on the test set but fails to generalize to new reviews from a different product category. Which TWO techniques can improve generalization?

Select 2 answers

A.Increase the training dataset size by collecting more reviews

B.Use stratified k-fold cross-validation during training

C.Apply L2 regularization to the model

D.Add more features like review length and word count

E.Use a more complex model with more layers

AnswersB, C

Cross-validation provides a more reliable estimate of generalization and helps tune hyperparameters.

Why this answer

Option B is correct because stratified k-fold cross-validation ensures that each fold maintains the same class distribution as the original dataset, which helps the model learn more robust patterns across different subsets of data. This technique reduces variance in the evaluation and improves generalization to unseen data from different product categories by preventing overfitting to idiosyncrasies of a single train-test split.

Exam trap

The trap here is that candidates often assume increasing data size or model complexity always improves generalization, but the question specifically tests the understanding that cross-validation techniques like stratified k-fold directly address overfitting and domain shift by providing a more reliable estimate of model performance across diverse data splits.

Full explanation →

801

MCQhard

A data scientist is performing EDA on a dataset with 1 million rows. They suspect the dataset contains duplicate rows. Which approach is most efficient to identify duplicates in Amazon SageMaker Studio?

A.Write a Python script that loops through each row and compares to a set of seen rows.

B.Use pandas drop_duplicates and then check the length difference.

C.Use DuckDB SQL query: SELECT COUNT(*) - COUNT(DISTINCT *) FROM table.

D.Use Amazon Athena to query the S3 data with COUNT(DISTINCT *).

AnswerC

DuckDB efficiently processes large DataFrames in-memory.

Why this answer

Option C is correct because DuckDB is an in-process SQL OLAP database that can run on a single machine and efficiently handle large datasets. Option A (Python loop) is slow; Option B (pandas drop_duplicates) may be memory-intensive; Option D (Athena) is serverless but incurs cost and latency.
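The idea of counting duplicates with a single SQL query can be sketched as follows. Note the substitution: this example uses Python's built-in `sqlite3` module as a stand-in for DuckDB so that it runs without extra packages, and it counts distinct rows through a `SELECT DISTINCT *` subquery. The table name, columns, and rows are invented.

```python
import sqlite3

# In-memory database standing in for the in-process engine.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE reviews (user_id INTEGER, product TEXT, rating INTEGER)")

rows = [
    (1, "A", 5),
    (2, "B", 3),
    (1, "A", 5),  # duplicate of the first row
    (3, "C", 4),
    (2, "B", 3),  # duplicate of the second row
    (1, "A", 5),  # another duplicate of the first row
]
conn.executemany("INSERT INTO reviews VALUES (?, ?, ?)", rows)

# Total rows minus distinct rows = number of redundant (duplicate) rows.
query = """
SELECT (SELECT COUNT(*) FROM reviews)
     - (SELECT COUNT(*) FROM (SELECT DISTINCT * FROM reviews))
"""
(duplicate_rows,) = conn.execute(query).fetchone()
conn.close()
```

The same total-minus-distinct approach carries over to other SQL engines, although the exact syntax each one accepts may differ.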

Full explanation →

802

Multi-Selectmedium

A data scientist is tuning a random forest model using SageMaker Hyperparameter Tuning. The objective metric is validation:accuracy. Which THREE hyperparameters are most commonly tuned for random forest? (Choose THREE.)

Select 3 answers

A.Learning rate

B.Minimum samples per leaf (min_samples_leaf)

C.Maximum depth (max_depth)

D.Number of trees (n_estimators)

E.Batch size

AnswersB, C, D

This parameter helps prevent overfitting.

Why this answer

Options B, C, and D are correct. Common tunable hyperparameters for random forest include number of trees (n_estimators), maximum depth (max_depth), and minimum samples per leaf (min_samples_leaf). Option A (learning rate) is for gradient boosting.

Option E (batch size) is for neural networks.

Full explanation →

803

MCQmedium

A company is streaming real-time sensor data from IoT devices to Amazon Kinesis Data Streams. The data is then consumed by an AWS Lambda function that enriches the records with metadata from an Amazon DynamoDB table and writes the results to an Amazon S3 bucket. Recently, the Lambda function has been failing with 'ProvisionedThroughputExceededException' errors from DynamoDB. The data volume is variable, with occasional bursts. Which solution should a data engineer implement to resolve this issue without losing data?

A.Increase the DynamoDB table's provisioned read capacity units to a high static value.

B.Use an Amazon SQS queue to buffer the Lambda requests before querying DynamoDB.

C.Enable DynamoDB auto scaling for the table to automatically adjust read capacity based on demand.

D.Configure an Amazon SNS topic to throttle the data stream before it reaches Lambda.

AnswerC

Auto scaling adjusts capacity dynamically to handle bursts without manual intervention.

Why this answer

Option C is correct because enabling DynamoDB auto scaling dynamically adjusts read/write capacity based on traffic patterns, handling bursts without manual intervention. Option A (increasing read capacity units) is costly and may not handle all peaks. Option B (SQS) introduces latency and does not address the DynamoDB throttling directly.

Option D (SNS) is for push notifications, not for resolving throughput issues.

Full explanation →

804

MCQhard

A data scientist is working with a dataset containing text reviews. The goal is to build a sentiment analysis model. Which EDA step is most critical before feature extraction?

A.Calculating the vocabulary size

B.Creating a word cloud

C.Removing stop words

D.Checking the distribution of sentiment labels

AnswerD

Class imbalance can significantly impact model performance.

Why this answer

Checking for class imbalance in sentiment labels is critical because it can bias the model. Option C is wrong because stop word removal is part of preprocessing, not EDA. Option B is wrong because word clouds are for visualization, not a critical step.

Option A is wrong because vocabulary size is not a primary concern at this stage.

Full explanation →

805

MCQhard

A data scientist is exploring a dataset with 500 features and 100,000 observations for a regression problem. The scientist notices that many features are highly correlated with each other. Which technique should the scientist use to reduce multicollinearity and improve model interpretability during exploratory data analysis?

A.Compute mutual information between each feature and the target, and keep only the top 50 features.

B.Apply Principal Component Analysis (PCA) to reduce the feature space.

C.Use Lasso regression to select features with non-zero coefficients.

D.Calculate Variance Inflation Factor (VIF) for each feature and remove those with VIF > 10.

AnswerD

VIF quantifies how much a feature is explained by other features; high VIF indicates multicollinearity.

Why this answer

Option D is correct because Variance Inflation Factor (VIF) is a standard metric to detect multicollinearity, and features with high VIF can be removed. Option B is wrong because PCA creates new features that are linear combinations, reducing interpretability. Option C is wrong because Lasso can be used for feature selection but is a modeling step, not exploratory analysis.

Option A is wrong because mutual information measures dependency but not specifically multicollinearity.
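For a feature j, VIF is 1 / (1 - R_j^2), where R_j^2 comes from regressing feature j on the other features. The sketch below covers only the simplest case of two features, where R^2 equals the squared Pearson correlation; the full 500-feature case would require a regression per feature. The feature names and values are invented.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def vif_two_features(x, y):
    """VIF when the only other feature is y: 1 / (1 - r^2)."""
    r_squared = pearson_r(x, y) ** 2
    return 1.0 / (1.0 - r_squared)


# Toy data: square footage and room count move together; age does not.
sqft = [800, 1000, 1200, 1500, 1800, 2200]
rooms = [2, 3, 3, 4, 5, 6]
age = [30, 5, 18, 42, 11, 25]

vif_sqft_rooms = vif_two_features(sqft, rooms)
vif_sqft_age = vif_two_features(sqft, age)
```

With these values the sqft/rooms pair exceeds the VIF > 10 threshold named in the answer, while the sqft/age pair stays close to 1, so one of the first pair would be a candidate for removal.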

Full explanation →

806

Multi-Selectmedium

A data engineer is designing a data pipeline that uses Amazon S3 events to trigger an AWS Lambda function for processing. The pipeline must handle high throughput with low latency. Which TWO configurations should be applied?

Select 2 answers

A.Configure Lambda with reserved concurrency

B.Use an SQS queue between S3 and Lambda

C.Place Lambda in a VPC to reduce network latency

D.Use Amazon Kinesis Data Streams as an intermediary

E.Enable S3 Event Notifications to invoke Lambda directly

AnswersA, E

Ensures Lambda has enough capacity to handle bursts.

Why this answer

Reserved concurrency ensures that the Lambda function always has a guaranteed number of concurrent executions available, preventing it from being throttled by other functions in the same AWS account. This is critical for high-throughput, low-latency pipelines because S3 event notifications can burst many invocations simultaneously, and without reserved concurrency, the function might hit the account-level concurrency limit and drop events.

Exam trap

The trap here is that candidates often confuse 'reducing latency' with 'using a VPC' or 'adding a queue,' but for S3-triggered Lambda, direct invocation with reserved concurrency is the simplest and lowest-latency path, while VPCs and queues add overhead.

Full explanation →

807

MCQmedium

A data scientist is analyzing a dataset with 500 features and 10,000 rows. The target variable is binary. After training a logistic regression model, the coefficients show many non-zero values but the model has low accuracy on the test set. Which EDA step should the data scientist perform next to improve model performance?

A.Apply Principal Component Analysis (PCA) to reduce dimensionality.

B.Collect more training data to improve generalization.

C.Normalize the features using StandardScaler.

D.Use correlation analysis or mutual information to select the most relevant features.

AnswerD

Feature selection removes irrelevant features, reducing noise and overfitting.

Why this answer

Option D is correct because feature selection helps reduce noise and overfitting, improving model accuracy. Option C is wrong because scaling does not reduce the number of features. Option A is wrong because PCA may lose interpretability and is not directly aimed at reducing overfitting due to irrelevant features.

Option B is wrong because more data does not necessarily address the issue of irrelevant features.
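A correlation-based filter of this kind can be sketched as follows. This is an illustrative example only: the helper names `pearson` and `select_top_k` and the toy data are assumptions, and absolute Pearson correlation captures only linear relationships (mutual information would also catch non-linear ones).

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either series is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return 0.0
    return cov / (sx * sy)


def select_top_k(features, target, k):
    """Rank feature names by |correlation with target| and keep the top k."""
    ranked = sorted(
        features,
        key=lambda name: abs(pearson(features[name], target)),
        reverse=True,
    )
    return ranked[:k]


# Toy data: "signal" tracks the binary target, "noise" is uncorrelated with it.
features = {
    "signal": [0.1, 0.2, 0.9, 1.0],
    "noise": [1.0, -1.0, -1.0, 1.0],
}
target = [0, 0, 1, 1]
print(select_top_k(features, target, k=1))
```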

Full explanation →

808

MCQmedium

A company is using Amazon SageMaker to train a deep learning model. The training job uses a script that reads data from Amazon S3 using the SageMaker SDK's `s3_input` method. The training job runs on a single ml.p3.2xlarge instance. The data scientist notices that the GPU utilization is very low during training, often below 20%. The training dataset is large, approximately 50 GB, stored as TFRecord files in S3. What is the MOST likely cause of low GPU utilization?

A.The training script is using a CPU-only version of TensorFlow.

B.The data loading pipeline is not optimized, causing the GPU to wait for data.

C.The batch size is too large, causing the GPU to run out of memory.

D.The instance type does not have enough GPU memory for the model.

AnswerB

Correct: Inefficient data loading leads to GPU starvation.

Why this answer

Low GPU utilization often indicates that the data loading pipeline is a bottleneck. The training script may not be using efficient data loading techniques like prefetching and parallel data extraction. Option B is correct.

Option C (batch size) could be a factor but is less likely given the TFRecord format. Option D (instance type) is unlikely because ml.p3.2xlarge has a capable GPU. Option A (framework) is not the cause.
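The idea behind prefetching can be illustrated with a small standard-library sketch: a background thread keeps a bounded buffer filled so the consumer (standing in for the GPU training step) rarely waits on I/O. The `prefetch` helper is hypothetical and only mimics the concept; in a TensorFlow input pipeline the equivalent is typically `dataset.prefetch(tf.data.AUTOTUNE)`.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the stream


def prefetch(source, buffer_size=2):
    """Yield items from `source` while a background thread loads ahead."""
    buffer = queue.Queue(maxsize=buffer_size)

    def producer():
        # Blocks when the buffer is full, so at most `buffer_size` items wait.
        # Sketch only: an exception raised by `source` is not forwarded.
        for item in source:
            buffer.put(item)
        buffer.put(_DONE)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = buffer.get()
        if item is _DONE:
            return
        yield item


# Batches arrive in their original order while loading overlaps with consumption.
for batch in prefetch(range(5), buffer_size=2):
    print(batch)
```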

Full explanation →

809

MCQmedium

A data scientist is analyzing a dataset with missing values in several features. The dataset is large (10 million rows) and stored in an S3 bucket as CSV files. The scientist wants to use AWS Glue to catalog the data and then use Amazon Athena to query it. However, the missing values are causing errors in downstream machine learning models. Which approach should the scientist take to handle missing values during exploratory data analysis?

A.Use Amazon SageMaker Data Wrangler to create a data flow that imputes missing values and export the transformed dataset to S3.

B.Use AWS Glue ETL jobs with a custom transformation script that uses the AWS Glue library to drop or impute missing values before writing to a new dataset.

C.Use Amazon Redshift Spectrum with an external table to query the data and use SQL COALESCE to handle missing values on the fly.

D.Use Amazon Athena to run SQL queries that impute missing values and write the results to a new table.

AnswerB

AWS Glue provides native transforms like DropNullFields and FillMissingValues, and custom scripts allow handling missing values efficiently at scale.

Why this answer

Option B is correct because AWS Glue provides built-in transforms to handle missing values during the ETL process, and using a custom script with the AWS Glue library allows fine-grained control. Option D is wrong because Athena is a query engine and does not modify the source data. Option A is wrong because SageMaker Data Wrangler is for interactive data preparation, not for large-scale automated ETL.

Option C is wrong because Redshift Spectrum is for querying, not for cleaning missing values.

Full explanation →

810

MCQeasy

A data scientist is training a linear regression model on a dataset with 10 features. After training, the model has high variance on the test set. Which technique should the data scientist use to reduce variance without significantly increasing bias?

A.Use L2 regularization

B.Add more features

C.Use a simpler model

D.Use a deeper decision tree

AnswerA

L2 regularization penalizes large coefficients, reducing variance.

Why this answer

L2 regularization (Ridge regression) adds a penalty term proportional to the square of the magnitude of the coefficients, which shrinks them toward zero. This reduces model complexity and variance by preventing any single feature from having an overly large influence, without eliminating features entirely, thus keeping bias relatively low.

Exam trap

AWS often tests the distinction between L1 (Lasso) and L2 (Ridge) regularization, and the trap here is that candidates might think adding more features or using a simpler model is the only way to reduce variance, overlooking that L2 regularization can reduce variance without the drastic bias increase of feature elimination.

How to eliminate wrong answers

Option B is wrong because adding more features increases model complexity, which typically increases variance further, not reduces it. Option C is wrong because using a simpler model (e.g., reducing the number of features or using a less flexible algorithm) would reduce variance but at the cost of a significant increase in bias, violating the requirement to not significantly increase bias. Option D is wrong because a deeper decision tree increases model complexity and variance, which is the opposite of what is needed to address high variance.
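The shrinkage effect is easiest to see in the one-feature case without an intercept, where minimizing Σ(y − w·x)² + λ·w² gives the closed form w = Σxy / (Σx² + λ). The sketch below is a simplified illustration under that assumption; `ridge_slope` is a hypothetical helper name.

```python
def ridge_slope(xs, ys, lam):
    """Closed-form ridge coefficient for one feature and no intercept."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)


xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

# lam = 0 recovers ordinary least squares; larger lam shrinks the coefficient toward zero.
for lam in (0.0, 1.0, 14.0):
    print(lam, ridge_slope(xs, ys, lam))
```

The coefficient shrinks but never reaches exactly zero, which is why L2 keeps every feature in the model, unlike L1.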

Full explanation →

811

MCQeasy

A data scientist is investigating an application that logs errors to Amazon CloudWatch Logs. The data scientist runs the CloudWatch Logs Insights query shown in the exhibit. The query returns no results, even though the data scientist knows errors have occurred. What is the most likely cause?

A.The stats count() function is misspelled.

B.The filter pattern is case-sensitive and the log messages use a different case for 'error'.

C.The query sorts by timestamp descending, which hides results.

D.The bin(5m) function is not supported in CloudWatch Logs Insights.

AnswerB

CloudWatch Logs Insights is case-sensitive; 'ERROR' will not match 'Error'.

Why this answer

Option B is correct because the query is case-sensitive; 'ERROR' may not match 'error' or 'Error'. Option C is wrong because the sort order does not affect whether results are returned. Option A is wrong because the query uses correct syntax.

Option D is wrong because bin(5m) is valid if there are logs within the time range.
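The case-sensitivity pitfall can be reproduced with Python's `re` module as a stand-in for the query engine. The log line and the `matches` helper are made up for illustration. In Logs Insights itself, a commonly documented workaround is a case-insensitive regex such as `filter @message like /(?i)error/`.

```python
import re


def matches(pattern, message, ignore_case=False):
    """Return True if `pattern` occurs in `message`."""
    flags = re.IGNORECASE if ignore_case else 0
    return re.search(pattern, message, flags) is not None


message = "Error: connection timed out"  # hypothetical log line

print(matches("ERROR", message))                    # case-sensitive search
print(matches("ERROR", message, ignore_case=True))  # case-insensitive search
```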

Full explanation →

812

MCQeasy

A company is using SageMaker to train a linear regression model on a dataset that fits into memory on a single instance. The training job is taking longer than expected. The data scientist wants to reduce training time without changing the algorithm. Which approach is most effective?

A.Disable automatic model tuning.

B.Use a larger instance type with more vCPUs.

C.Use SageMaker's distributed training with multiple instances.

D.Reduce the number of epochs.

AnswerC

Parallel processing reduces training time.

Why this answer

Option C is correct because SageMaker's managed training can automatically distribute the data across multiple instances for parallel training, reducing time. Option B is wrong because increasing instance size might help but may not be as cost-effective as distributed training. Option D is wrong because reducing the number of epochs shortens training only by cutting optimization short, which can degrade the model.

Option A is wrong because disabling automatic model tuning gives up its optimization benefits and does not speed up an individual training job.

Full explanation →

813

MCQhard

Refer to the exhibit. A CloudFormation template creates an S3 bucket. The data engineering team stores daily log files in this bucket and queries them using Amazon Athena. After 30 days, queries on logs older than 30 days start failing with 'Access Denied' errors. What is the MOST likely reason?

A.The lifecycle rule transitions objects to GLACIER after 30 days, making them inaccessible to Athena.

B.The bucket uses default encryption with SSE-S3, which Athena does not support.

C.The lifecycle rule deletes objects after 30 days.

D.The bucket policy denies access to objects older than 30 days.

AnswerA

Athena cannot query GLACIER objects; they must be restored first.

Why this answer

Option A is correct because the lifecycle rule transitions objects to GLACIER after 30 days, and Athena cannot query objects in the GLACIER storage class. Option D is wrong because the transition to GLACIER does not change bucket permissions. Option C is wrong because objects are not deleted until 365 days.

Option B is wrong because SSE-S3 is not mentioned.

Full explanation →

814

Multi-Selecthard

Which THREE of the following are best practices for training a deep learning model on Amazon SageMaker?

Select 3 answers

A.Use Pipe mode for large datasets to reduce I/O overhead

B.Use SageMaker Debugger to automatically fix training errors

C.Set up automatic model tuning (hyperparameter optimization)

D.Use SageMaker Debugger to profile GPU utilization

E.Train on a single instance to avoid distributed training overhead

AnswersA, C, D

Pipe mode streams data directly, reducing disk I/O.

Why this answer

Profiling GPU utilization helps identify bottlenecks. Using Pipe mode for large datasets reduces I/O. Setting up automatic model tuning (hyperparameter optimization) is a best practice.

Training on a single instance is not a best practice for large models. SageMaker Debugger monitors and profiles training; it does not automatically fix errors.

Full explanation →

815

MCQhard

A company deploys a SageMaker model for inference. After a few days, response times increase significantly. CloudWatch metrics show high CPU utilization and memory usage. The model is a large ensemble. What is the most cost-effective solution?

A.Configure SageMaker automatic scaling based on CPU utilization

B.Use CloudWatch alarms to notify the team, who manually launch additional endpoints

C.Migrate the model to AWS Lambda with provisioned concurrency

D.Replace the current instance type with a larger one

AnswerA

Auto scaling dynamically adjusts instance count to handle load cost-effectively.

Why this answer

Option A is correct: automatic scaling adds instances based on demand, handling spikes cost-effectively. Option B (ad hoc monitoring) does not automatically adjust. Option C (migrate to Lambda) may not support large models.

Option D (increase instance size) is less cost-effective than scaling out.

Full explanation →

816

MCQeasy

A data scientist is trying to run a SageMaker training job that writes output to an S3 bucket 'my-bucket'. The IAM policy is shown. The training job fails with an AccessDenied error when trying to write to S3. What is the reason?

A.The S3 bucket is encrypted with AWS KMS and the policy does not include kms:GenerateDataKey

B.The policy does not allow s3:ListBucket

C.The policy does not allow s3:PutObject

D.The policy does not allow s3:PutObjectAcl

AnswerA

When KMS encryption is used, SageMaker needs kms:GenerateDataKey permission to write.

Why this answer

The correct answer is A because the training job fails with an AccessDenied error when writing to an S3 bucket that is encrypted with AWS KMS. The IAM policy shown must include the `kms:GenerateDataKey` permission to allow the SageMaker training job to generate a data key for encrypting the output objects. Without this KMS permission, the S3 PutObject operation is denied even if the policy allows `s3:PutObject`, as KMS encryption requires explicit authorization to use the customer master key (CMK).

Exam trap

The trap here is that candidates often focus only on S3 permissions (like `s3:PutObject`) and overlook the need for KMS permissions when the bucket uses SSE-KMS, leading them to incorrectly select Option C or D.

How to eliminate wrong answers

Option B is wrong because `s3:ListBucket` is not required for writing objects to S3; it is needed for listing bucket contents, not for PutObject operations. Option C is wrong because the policy likely includes `s3:PutObject` (as the question implies the policy allows writing), but the AccessDenied error stems from missing KMS permissions, not from a missing PutObject action. Option D is wrong because `s3:PutObjectAcl` is only required when explicitly setting object ACLs during upload, which is not a default behavior for SageMaker training jobs; the error is not related to ACL management.

Full explanation →

817

MCQeasy

A company stores sensitive customer data in Amazon S3. The company must ensure that data is encrypted at rest. The company also needs to manage the encryption keys using an AWS service that allows automatic rotation of keys. Which solution meets these requirements?

A.Use client-side encryption with AWS CloudHSM

B.Use Server-Side Encryption with AWS KMS-Managed Keys (SSE-KMS) and enable automatic key rotation

C.Use Server-Side Encryption with Customer-Provided Keys (SSE-C)

D.Use Server-Side Encryption with Amazon S3-Managed Keys (SSE-S3)

AnswerB

SSE-KMS allows automatic rotation of KMS keys.

Why this answer

Option B is correct because AWS KMS with automatic key rotation meets the encryption and key management requirements. Option D is wrong because SSE-S3 uses S3-managed keys whose rotation the customer cannot control. Option C is wrong because SSE-C requires the customer to manage keys.

Option A is wrong because CloudHSM is a hardware security module but does not automatically rotate keys.

Full explanation →

818

Multi-Selecteasy

A data scientist wants to understand the distribution and missing values in a large dataset stored in Amazon S3. Which TWO AWS services can be used directly for this exploratory data analysis? (Choose TWO.)

Select 2 answers

A.Amazon SageMaker Data Wrangler

B.AWS CloudTrail

C.Amazon Athena

D.Amazon EMR

E.AWS Glue DataBrew

AnswersC, E

Athena can run SQL queries to compute distributions and count nulls.

Why this answer

Amazon Athena allows SQL-based queries on S3 data, including aggregation for distribution analysis. AWS Glue DataBrew provides visual profiling to detect missing values and distributions. SageMaker Data Wrangler is also a valid choice but is not a direct service for S3 data without additional steps; EMR requires cluster setup.

Full explanation →

819

MCQhard

A data scientist needs to run a hyperparameter tuning job for a deep learning model. Which SageMaker feature should they use?

A.SageMaker Hyperparameter Tuning Job

B.SageMaker Experiments

C.SageMaker Automatic Model Tuning

D.SageMaker Processing

AnswerA

This is the correct feature for hyperparameter optimization.

Why this answer

Option A is correct because SageMaker Hyperparameter Tuning Jobs are designed for this purpose. Option C is wrong because Automatic Model Tuning is the same as hyperparameter tuning but the correct term is 'Hyperparameter Tuning Job'. Option B is wrong because SageMaker Experiments tracks runs; it does not tune hyperparameters.

Option D is wrong because SageMaker Processing is for data processing.

Full explanation →

820

MCQmedium

A company is storing customer transaction data in Amazon S3 as CSV files. A data scientist uses AWS Glue to crawl the data and create a table in the AWS Glue Data Catalog. When querying the table with Amazon Athena, the data scientist notices that some columns have NULL values where data should exist. The data scientist examines the raw CSV files and confirms the data is present. What is the most likely cause of the NULL values?

A.The CSV files have different schemas (e.g., different columns) across partitions.

B.Athena is configured to skip corrupted records, causing NULLs.

C.The Glue crawler incorrectly inferred the data type of the columns.

D.The CSV files use a custom delimiter that the Glue crawler does not recognize.

AnswerA

Schema evolution causes missing columns to appear as NULL when queried.

Why this answer

Option A is correct because the Glue crawler infers schema from the first few files; if later files have different schemas (e.g., more columns), the extra data is not captured. Option D is wrong because the crawler handles CSV without SerDe issues. Option B is wrong because Athena does not modify data.

Option C is wrong because the issue is schema mismatch, not data type inference.

Full explanation →

821

MCQhard

A team is using Amazon SageMaker Autopilot to automatically build models. The dataset has 50 features and 1 million rows. After training, Autopilot generates multiple candidates. The team wants to deploy the model with the highest accuracy. What is the best practice to select and deploy the model?

A.Deploy all candidates behind a multi-model endpoint and route traffic based on request features

B.Select the model with the highest validation accuracy after performing additional hyperparameter tuning

C.Manually review each candidate's architecture and select the one with the simplest design

D.Deploy the candidate with the highest objective metric value from the Autopilot leaderboard

AnswerD

Autopilot ranks candidates by objective metric.

Why this answer

SageMaker Autopilot's best candidate is determined by the objective metric, so Option D is correct. Option A is wrong because deploying all candidates wastes resources. Option B is wrong because the highest accuracy may not generalize; using the best objective metric is standard.

Option C is wrong because manual selection is subjective.

Full explanation →

822

MCQhard

A data scientist is using SageMaker to train a TensorFlow model. The training script uses tf.data.Dataset to load data from S3. Training is slow because of I/O bottleneck. Which change should the data scientist make to improve I/O performance?

A.Enable EBS optimization on the training instance.

B.Use Pipe input mode for the training channel.

C.Use SageMaker local mode for training.

D.Convert the dataset to RecordIO format.

AnswerB

Pipe mode streams data directly from S3, reducing I/O overhead.

Why this answer

Option B is correct because Pipe input mode streams data directly from S3 into the training algorithm without writing to disk, eliminating the I/O bottleneck caused by downloading entire files. This is particularly effective with tf.data.Dataset, as the pipeline can consume data incrementally, reducing latency and improving throughput for large datasets.

Exam trap

The trap here is that candidates often confuse EBS optimization (which improves local disk performance) with S3 data access optimization, or assume that RecordIO is a universal performance fix, ignoring that TensorFlow's native pipeline benefits more from streaming input modes.

How to eliminate wrong answers

Option A is wrong because EBS optimization improves network throughput for EBS volumes, but the training script loads data from S3, not from an EBS volume; the bottleneck is S3 I/O, not EBS. Option C is wrong because SageMaker local mode runs training on the local instance's file system, which does not address S3 I/O bottlenecks and may even exacerbate them if data must be downloaded first. Option D is wrong because converting to RecordIO format is beneficial for SageMaker's built-in algorithms (e.g., XGBoost) that natively support it, but TensorFlow's tf.data.Dataset works optimally with native formats like TFRecord; RecordIO does not improve S3 streaming performance and adds unnecessary conversion overhead.

Full explanation →

823

Multi-Selecthard

A data scientist is evaluating feature engineering options for a dataset containing a categorical variable 'education_level' with values: High School, Bachelor, Master, PhD. The target variable is continuous. Which THREE encoding methods are appropriate for this ordinal categorical variable? (Choose 3)

Select 3 answers

A.One-hot encoding

B.Target encoding (mean of target per category)

C.Hash encoding (using feature hashing)

D.Label encoding (e.g., High School=0, Bachelor=1, Master=2, PhD=3)

E.Binary encoding (convert to binary representation)

AnswersA, B, D

One-hot encoding is a safe option that does not assume any order, though it increases dimensionality.

Why this answer

Options A, B, and D are correct because label encoding preserves ordinality, target encoding captures the relationship with the target, and one-hot encoding is a safe fallback. Option E is wrong because binary encoding assumes nominal categories. Option C is wrong because hash encoding loses interpretability and may cause collisions.
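The three encodings named above can be compared side by side in a short sketch. The helper names and the sample target values are illustrative assumptions. Note that target encoding computed on the full training set can leak the target, so in practice it is usually fitted on training folds only.

```python
ORDER = ["High School", "Bachelor", "Master", "PhD"]  # lowest to highest


def label_encode(values):
    """Map each level to its rank, preserving the ordinal relationship."""
    return [ORDER.index(v) for v in values]


def one_hot_encode(values):
    """One indicator column per level; no order is assumed."""
    return [[1 if v == level else 0 for level in ORDER] for v in values]


def target_encode(values, targets):
    """Replace each level with the mean target observed for that level."""
    sums, counts = {}, {}
    for v, t in zip(values, targets):
        sums[v] = sums.get(v, 0.0) + t
        counts[v] = counts.get(v, 0) + 1
    return [sums[v] / counts[v] for v in values]


education = ["Bachelor", "PhD", "Bachelor", "High School"]
income = [50.0, 90.0, 60.0, 30.0]  # hypothetical continuous target

print(label_encode(education))
print(one_hot_encode(education))
print(target_encode(education, income))
```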

Full explanation →

824

Multi-Selecteasy

A data scientist is performing hyperparameter optimization for a gradient boosting model using Amazon SageMaker Automatic Model Tuning. The objective metric is 'validation:logloss'. Which TWO strategies can help the tuning job converge faster? (Choose TWO.)

Select 2 answers

A.Use Bayesian optimization strategy

B.Increase the number of tuning jobs

C.Increase the resource limits for each training job

D.Use random search strategy

E.Use early stopping based on the objective metric

AnswersA, E

Bayesian optimization intelligently selects hyperparameters to converge faster.

Why this answer

Options A and E are correct. Early stopping terminates poorly performing jobs early, saving resources. Bayesian optimization is more efficient than random search.

Option D is wrong because random search is less efficient. Option B is wrong because more tuning jobs increase time to convergence. Option C is wrong because increasing resource limits does not speed convergence.

Full explanation →

825

MCQeasy

A data scientist wants to understand the statistical relationship between two categorical variables in a dataset. Which test is most appropriate?

A.Chi-squared test

B.Pearson correlation coefficient

C.Student's t-test

D.ANOVA test

AnswerA

Correct: Chi-squared test is used for association between categorical variables.

Why this answer

Option A is correct because the chi-squared test tests independence between categorical variables. Option D is wrong because ANOVA compares a continuous variable across categorical groups. Option B is wrong because Pearson correlation is for continuous variables.

Option C is wrong because the t-test compares the means of two groups.
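The chi-squared statistic for a contingency table can be computed by hand as a sanity check. The sketch below assumes a table of observed counts with no empty row or column; `chi_squared` is a hypothetical helper, and it returns only the statistic and degrees of freedom (a p-value would normally come from a statistics library).

```python
def chi_squared(table):
    """Pearson chi-squared statistic and degrees of freedom for observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of the two variables.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, dof


# Rows and columns are the levels of two categorical variables (made-up counts).
stat, dof = chi_squared([[10, 20], [20, 10]])
print(stat, dof)
```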

Full explanation →
