
AWS Certified Machine Learning Specialty (MLS-C01): Questions 1201–1275

1755 questions total · 24 pages · All types, answers revealed


1201

MCQeasy

A company is using SageMaker Autopilot to automatically build a binary classification model. After the AutoML job completes, the data scientist wants to understand which features are most important for the best candidate model. How can the scientist get feature importance?

A.Open the SageMaker Autopilot job details and view the 'Explainability' tab

B.Re-run the best model using SageMaker built-in XGBoost with the 'feature_importance' hyperparameter

C.Check the CloudWatch Logs for the training job

D.Use SageMaker Ground Truth to label a new dataset

AnswerA

Autopilot provides feature importance in the explainability tab for the best candidate.

Why this answer

SageMaker Autopilot automatically generates an 'Explainability' tab within the job details for the best candidate model. This tab uses SHAP (SHapley Additive exPlanations) values to provide feature importance, showing which features most influence the model's predictions. The data scientist can directly access this information without any additional configuration or re-running the model.

Exam trap

AWS often tests the misconception that feature importance must be manually extracted via code or logs, when in fact SageMaker Autopilot provides it directly in the UI under the 'Explainability' tab for the best candidate model.

How to eliminate wrong answers

Option B is wrong because SageMaker built-in XGBoost does not have a 'feature_importance' hyperparameter; feature importance is a property of the trained model object (e.g., via `get_fscore()` or `plot_importance()`), not a hyperparameter set before training. Option C is wrong because CloudWatch Logs for the training job contain training metrics, loss values, and algorithm logs, but not structured feature importance data; feature importance is not emitted to logs by default. Option D is wrong because SageMaker Ground Truth is a data labeling service for creating labeled datasets, not for extracting feature importance from a trained model; it is unrelated to model interpretability.


1202

Multi-Selecteasy

A data scientist wants to identify outliers in a dataset. Which TWO techniques are commonly used for outlier detection during EDA?

Select 2 answers

A.Box plot

B.Heatmap

C.Z-score analysis

D.Bar chart

E.Pearson correlation coefficient

AnswersA, C

Box plots show outliers as points outside the whiskers.

Why this answer

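Option A (box plot) shows outliers as points beyond the whiskers. Option C (Z-score analysis) flags points whose absolute Z-score exceeds a threshold, commonly |Z| > 3. Option B is wrong because a heatmap shows patterns such as correlations across variables, not individual outlying points.

Option D is wrong because bar charts summarize categorical data. Option E is wrong because the Pearson correlation coefficient measures the linear relationship between two variables.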
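The Z-score rule mentioned above can be sketched in a few lines. This is a minimal illustration using only the Python standard library; the function name `zscore_outliers`, the sample `readings`, and the default threshold of 3 are assumptions made for the example, not part of the question.

```python
from statistics import mean, pstdev

def zscore_outliers(values, threshold=3.0):
    """Return the values whose absolute Z-score exceeds the threshold."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return []  # all values identical, so nothing can be an outlier
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Hypothetical sample: 19 ordinary readings and one extreme value.
readings = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11,
            12, 10, 11, 13, 12, 11, 12, 10, 13, 95]
print(zscore_outliers(readings))  # prints [95]
```

One caveat worth keeping in mind: the extreme value inflates the mean and standard deviation it is measured against, so in very small samples (roughly ten points or fewer) no point can reach |Z| > 3 at all.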


1203

MCQhard

A data scientist wants to run a one-time SQL query on a large dataset stored in Amazon S3 (CSV format, 2 TB) using Amazon Athena. The query involves joining this dataset with a smaller table stored in Amazon RDS. What is the MOST cost-effective and performant approach?

A.Export the RDS table to S3 in Parquet format, then use Athena to join the two S3 datasets

B.Use Amazon Redshift Spectrum to query both S3 and RDS

C.Use Athena Federated Query to query RDS directly

D.Use AWS Glue ETL to join the data and write results back to S3, then query with Athena

AnswerA

This keeps the join inside Athena and stores the smaller table in a columnar format for performance.

Why this answer

Option A is correct. Exporting the RDS table to S3 as Parquet and running the join in Athena avoids data transfer costs and leverages Athena's fast query engine. Option C (Athena Federated Query) adds complexity and may be slower.

Option B (Redshift Spectrum) requires a Redshift cluster. Option D (Glue ETL) is overkill for a one-time query.


1204

Multi-Selecthard

A company is using Amazon SageMaker to deploy a model for real-time inference. The model is a deep neural network that requires GPU for low latency. The endpoint currently uses a single ml.p3.2xlarge instance. Traffic is expected to increase by 5x. Which TWO actions should the company take to handle the increased traffic?

Select 2 answers

A.Use a larger instance type with more GPUs

B.Switch to a CPU-based instance

C.Enable auto-scaling on the endpoint

D.Use a multi-model endpoint

E.Decrease the batch size

AnswersA, C

Larger instance provides more GPU compute.

Why this answer

Option A is correct because a larger instance with more GPUs (e.g., ml.p3.8xlarge) can increase throughput. Option C is correct because enabling auto-scaling lets the endpoint add instances as traffic grows. Option B is wrong because switching to CPU would increase latency.

Option D is wrong because a multi-model endpoint hosts several models on shared instances; it does not add capacity for a single model. Option E is wrong because reducing batch size would decrease throughput.


1205

MCQeasy

A data scientist is performing exploratory data analysis on a dataset stored in Amazon S3 using Amazon SageMaker Studio. The dataset has missing values in several columns. Which approach is the MOST efficient way to handle missing values within SageMaker Studio?

A.Run a Jupyter notebook on a local machine to clean the data and upload back to S3.

B.Use SageMaker Data Wrangler to impute missing values with mean, median, or mode.

C.Use AWS Glue to run a find-and-replace operation.

D.Write a custom Python script using pandas to drop rows with missing values.

AnswerB

Data Wrangler provides a visual interface for imputation.

Why this answer

SageMaker Data Wrangler provides a visual interface to handle missing values from within SageMaker Studio. Option A is wrong because cleaning the data in a notebook on a local machine and uploading it back to S3 is not efficient. Option C is wrong because Glue is external to SageMaker Studio.

Option D is wrong because writing a custom script is less efficient, and dropping rows discards data.


1206

MCQeasy

A data scientist is performing exploratory data analysis on a dataset with missing values. The dataset contains a column 'age' with some missing entries. Which technique is most appropriate for imputing missing values in the 'age' column if the data is normally distributed?

A.Drop all rows with missing values.

B.Replace missing values with the mode of the column.

C.Replace missing values with the mean of the column.

D.Replace missing values with the median of the column.

AnswerC

Mean imputation is suitable for normally distributed data.

Why this answer

Option C is correct because mean imputation is appropriate for normally distributed data. Option D is wrong because median imputation is better suited to skewed data. Option B is wrong because mode imputation is for categorical data.

Option A is wrong because dropping rows reduces sample size.
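To make the contrast between mean and median imputation concrete, here is a small sketch using only the standard library. The helper `impute` and the sample `ages` list are hypothetical; missing entries are represented as `None` purely for illustration.

```python
from statistics import mean, median

def impute(values, strategy="mean"):
    """Fill None entries with the mean or median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed) if strategy == "mean" else median(observed)
    return [fill if v is None else v for v in values]

# Hypothetical 'age' column with one missing entry and one large value.
ages = [22, 25, None, 31, 90]
print(impute(ages, "mean"))    # the fill value is pulled toward 90
print(impute(ages, "median"))  # the fill value stays near the typical ages
```

With roughly symmetric (for example, normally distributed) data the two fill values nearly coincide, which is why the mean is a reasonable default there; the sample above is deliberately skewed to show where they diverge.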


1207

MCQhard

A research lab stores large genomic datasets in Amazon S3 Glacier Deep Archive. They need to run a one-time analysis on a subset of 10 PB of data. The analysis will use an Amazon EMR cluster with Amazon S3 as the data source. What is the MOST cost-effective and performant way to make the data available for the EMR cluster?

A.Restore the data to S3 Standard-IA and delete after the analysis

B.Configure the EMR cluster to read directly from Glacier Deep Archive using S3 Console

C.Initiate a Bulk retrieval request and restore the data to S3 Standard for the duration of the analysis

D.Initiate an Expedited retrieval request and use the temporary copy for the EMR cluster

AnswerC

Bulk retrieval is the lowest cost tier, and restoring to Standard avoids IA minimum charges.

Why this answer

Option C is correct because Bulk retrieval is the cheapest retrieval tier for Glacier Deep Archive, and restoring to S3 Standard allows the EMR cluster to read efficiently. Option B is wrong because reading directly from Glacier Deep Archive is not supported by EMR; archived objects must be restored first. Option D is wrong because Expedited retrieval is more expensive, is not needed for a one-time batch job, and is not offered for the Glacier Deep Archive storage class.

Option A is wrong because restoring to S3 Standard-IA adds retrieval and minimum-duration charges for data that is accessed once.


1208

MCQhard

A company is migrating its on-premises Apache Hadoop cluster to AWS. The cluster processes large datasets using Spark jobs. The company wants to minimize operational overhead and use native AWS services. Which combination of services should the company use?

A.Amazon EMR with Spark and Amazon S3

B.Amazon Redshift with Spectrum and Amazon S3

C.Amazon Athena and AWS Glue

D.Amazon EC2 instances with Apache Spark installed and Amazon S3

AnswerA

EMR is a managed service that runs Spark and integrates with S3.

Why this answer

Option A is correct because Amazon EMR is a managed Hadoop framework that supports Spark, and S3 is a scalable storage layer. Option D is wrong because EC2 would require managing the cluster manually. Option B is wrong because Redshift is a data warehouse, not a Hadoop replacement.

Option C is wrong because Athena is for querying, not for running Spark jobs.


1209

Multi-Selecteasy

Which TWO of the following are appropriate techniques for detecting outliers in a univariate continuous dataset? (Select TWO.)

Select 2 answers

A.Z-score method

B.IQR (Interquartile Range) method

C.Box plot visualization

D.Pearson correlation coefficient

E.K-means clustering

AnswersA, B

Z-scores beyond a threshold (e.g., 3) indicate outliers.

Why this answer

Options A and B are correct. The Z-score method flags points that lie more than a threshold (e.g., 3) standard deviations from the mean. IQR-based outlier detection flags points more than 1.5*IQR below the first quartile or above the third quartile.

Option C is wrong because box plots visualize outliers but are not a detection technique per se; they use IQR. Option D is wrong because correlation is bivariate. Option E is wrong because K-means clustering is a multivariate grouping technique.
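The IQR rule described above can be sketched as follows. This is an illustrative example only: the function name `iqr_outliers` and the sample `values` are assumptions, and `statistics.quantiles` uses its default interpolation method, so other tools may compute slightly different quartiles.

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Return values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = quantiles(values, n=4)  # the three quartile cut points
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical univariate sample with one extreme value.
values = [2, 4, 4, 5, 5, 6, 6, 7, 8, 30]
print(iqr_outliers(values))  # prints [30]
```

Unlike the Z-score rule, the quartiles are barely affected by the extreme value itself, so the fences stay stable even in small samples.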


1210

Multi-Selecteasy

A data scientist is using Amazon SageMaker to train a large neural network on a GPU instance. The training is taking longer than expected. The scientist wants to reduce training time without changing the model architecture. Which TWO approaches should the scientist consider?

Select 2 answers

A.Use SageMaker Automatic Model Tuning to find optimal hyperparameters.

B.Use SageMaker Managed Spot Training to reduce cost.

C.Use SageMaker's distributed training with multiple GPU instances.

D.Switch to a larger GPU instance type with more CUDA cores.

E.Enable SageMaker Debugger to capture training metrics.

AnswersC, D

Distributed training parallelizes computation, reducing wall-clock time.

Why this answer

Using multiple GPU instances with SageMaker distributed training (C) can accelerate training. Using a larger GPU instance (D) with more memory and compute can directly reduce training time. Using SageMaker Managed Spot Training (B) reduces cost but not time.

SageMaker Automatic Model Tuning (A) is for hyperparameter optimization. Using SageMaker Debugger (E) helps debugging but not speed.


1211

MCQhard

A machine learning team is using Amazon SageMaker to train a model using a custom Docker container. The training job fails with an error: 'Unable to write to /opt/ml/model'. The container does not have root access. What is the most likely cause?

A.The /opt/ml/model directory does not exist in the container

B.The container is using an unsupported operating system

C.The container does not have internet access

D.The container process does not have write permission to /opt/ml/model

AnswerD

The process user lacks write permissions to the directory.

Why this answer

SageMaker expects training containers to write the model artifact to /opt/ml/model, so the container process must have write permission to that directory. Root access is not required, but the user the process runs as must be able to write there.

The error reports a failed write rather than an unsupported operating system or a network problem, so missing write permission is the most likely cause.


1212

MCQhard

A company uses Amazon SageMaker to host a model for fraud detection. The model uses a custom XGBoost container. The endpoint receives about 100 requests per second, each with 50 features. The team notices that the model's predictions are occasionally incorrect for a subset of requests. Which approach should the team take to debug the issue?

A.Use SageMaker Debugger to capture tensors during inference.

B.Scale the endpoint to more instances to reduce load.

C.Enable SageMaker Model Monitor to capture and analyze inference data.

D.Enable detailed CloudWatch Logs for the endpoint.

AnswerC

Model Monitor captures input data and predictions, enabling analysis of data quality and drift.

Why this answer

Enabling SageMaker Model Monitor with data capture (Option C) allows the team to review actual input data and predictions to detect data drift or anomalies. Option D (CloudWatch Logs) only shows container logs, not per-request payloads. Option A (Debugger) is for training, not inference.

Option B (increase instances) addresses capacity, not accuracy.


1213

MCQhard

A data scientist is exploring a dataset with 200 features. They compute the pairwise correlation matrix and notice that many features have correlations above 0.95. They want to reduce redundancy before modeling. Which of the following techniques is most appropriate for identifying and removing highly correlated features?

A.Compute mutual information between each feature and the target.

B.Apply PCA and keep the first 50 components.

C.Use Lasso regression to select features.

D.Perform hierarchical clustering on the correlation matrix and select one feature per cluster.

AnswerD

This systematically removes redundancy while retaining representative features.

Why this answer

Option D is correct because hierarchical clustering on correlations groups correlated features; then one can select a representative from each cluster. Option B is wrong because PCA creates new features but does not remove original ones. Option C is wrong because Lasso performs feature selection but may not handle multicollinearity well.

Option A is wrong because mutual information with the target does not capture pairwise redundancy between features directly.
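As a rough sketch of the idea, the snippet below groups features whose absolute Pearson correlation reaches a threshold and keeps the first feature of each group. Grouping by connected components in this way corresponds to cutting a single-linkage hierarchy at distance 1 - threshold; it is a simplified stand-in for a full hierarchical clustering library, and the names `pearson`, `select_representatives`, and the toy `features` are assumptions for illustration.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def select_representatives(features, threshold=0.95):
    """Keep one feature per group of features linked by |r| >= threshold."""
    names = list(features)
    parent = {name: name for name in names}

    def find(name):
        # Follow parent links to the group's root, compressing the path.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for a, b in combinations(names, 2):
        if abs(pearson(features[a], features[b])) >= threshold:
            parent[find(b)] = find(a)  # merge the two groups

    kept, seen_roots = [], set()
    for name in names:
        root = find(name)
        if root not in seen_roots:
            seen_roots.add(root)
            kept.append(name)  # first feature seen in each group
    return kept

# Hypothetical features: 'sqm' is nearly a rescaled copy of 'sqft'.
features = {
    "sqft": [1000, 1500, 2000, 2500, 3000],
    "sqm": [93, 139, 186, 232, 279],
    "bedrooms": [3, 2, 4, 3, 5],
}
print(select_representatives(features))  # prints ['sqft', 'bedrooms']
```

Which member of a cluster to keep is a design choice; this sketch keeps the first one encountered, whereas in practice one might prefer the feature with the fewest missing values or the strongest relationship to the target.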


1214

MCQhard

A team is building a model to predict house prices. They have a dataset with features like 'SquareFootage', 'Bedrooms', 'YearBuilt', and 'Neighborhood'. They notice that 'SquareFootage' has a few extreme values (e.g., 50,000 sq ft) that are likely data entry errors. They want to handle these outliers without losing all the data. Which of the following approaches is most robust?

A.Cap 'SquareFootage' at the 99th percentile value.

B.Replace extreme values with the mean of 'SquareFootage'.

C.Apply log transformation to 'SquareFootage'.

D.Remove rows where 'SquareFootage' is above 3 standard deviations from the mean.

AnswerA

Capping limits extremes while retaining the records.

Why this answer

Option A is correct because capping at percentiles (e.g., 99th) limits extreme values while keeping the data points. Option D is wrong because removing rows with any outlier may discard useful data. Option C is wrong because log transformation does not fix errors.

Option B is wrong because imputing with mean distorts the distribution.
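Percentile capping can be illustrated with a short sketch. It uses a simple nearest-rank percentile so that it depends only on the standard library; the helper names and the synthetic `square_footage` list are assumptions for the example, and other percentile definitions would give slightly different caps.

```python
import math

def percentile_nearest_rank(values, p):
    """Nearest-rank percentile: smallest value with at least p% at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

def cap_at_percentile(values, p=99):
    """Clip every value above the p-th percentile down to that percentile."""
    cap = percentile_nearest_rank(values, p)
    return [min(v, cap) for v in values]

# Hypothetical data: 199 plausible sizes plus one likely data-entry error.
square_footage = list(range(800, 2790, 10)) + [50000]
capped = cap_at_percentile(square_footage)
print(len(capped), max(capped))  # prints 200 2770
```

All 200 rows are retained and the 50,000 sq ft entry is pulled down to the cap. Note that the largest legitimate value (2,780) is clipped as well, which is the usual trade-off of capping at a percentile.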


1215

MCQeasy

A retail company uses Amazon Redshift for its data warehouse. The data engineering team runs ETL jobs that load data from multiple sources into Redshift daily. They notice that the load performance is slow and the cluster CPU utilization is high during the ETL window. The team wants to improve load performance without changing the cluster configuration. They currently load data using INSERT statements from a staging table. What should they do?

A.Run VACUUM and ANALYZE before loading

B.Use the COPY command to load data from S3 in parallel

C.Increase the number of nodes in the Redshift cluster

D.Apply compression encoding on the staging table

AnswerB

COPY is optimized for bulk loading.

Why this answer

Using the COPY command is the most efficient way to load data into Redshift, as it uses parallel processing. Option C is wrong because increasing node count changes the cluster configuration. Option A is wrong because VACUUM is for reclaiming space, not for speeding up loads.

Option D is wrong because compression might help but not as much as COPY.


1216

MCQhard

A data scientist is building a model to predict customer churn. The dataset has 20 features, including categorical variables with high cardinality (e.g., ZIP code). The data scientist wants to use a linear model. Which feature engineering technique is MOST appropriate for the high-cardinality categorical features?

A.One-hot encoding

B.Target encoding

C.TF-IDF

D.Standard scaling

AnswerB

Target encoding handles high cardinality well.

Why this answer

Target encoding is the most appropriate technique for high-cardinality categorical features when using a linear model because it replaces each category with the mean of the target variable for that category, creating a numeric feature that captures the relationship between the category and the target without exploding the feature space. One-hot encoding would create an unmanageable number of binary columns (e.g., thousands of ZIP codes), leading to the curse of dimensionality and making the linear model unstable or computationally infeasible.

Exam trap

AWS often tests the trap that candidates default to one-hot encoding for all categorical variables, failing to recognize that high cardinality makes it impractical for linear models, whereas target encoding is a more efficient alternative that preserves feature information without dimensionality explosion.

How to eliminate wrong answers

Option A is wrong because one-hot encoding for high-cardinality features like ZIP code would generate thousands of dummy variables, causing the linear model to suffer from the curse of dimensionality, multicollinearity, and overfitting. Option C is wrong because TF-IDF is designed for text data to weigh term frequency against inverse document frequency, not for encoding categorical variables like ZIP codes in a churn prediction task. Option D is wrong because standard scaling is a normalization technique for numerical features, not a method for encoding categorical variables, and applying it to raw categorical labels would be meaningless.
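A minimal sketch of target (mean) encoding, under stated assumptions: the function names, the toy ZIP codes, and the choice to fall back to the global mean for unseen categories are all illustrative, not a prescribed implementation.

```python
from collections import defaultdict

def fit_target_encoding(categories, targets):
    """Map each category to the mean target value observed for it."""
    sums, counts = defaultdict(float), defaultdict(int)
    for category, target in zip(categories, targets):
        sums[category] += target
        counts[category] += 1
    global_mean = sum(targets) / len(targets)
    encoding = {c: sums[c] / counts[c] for c in sums}
    return encoding, global_mean

def transform(categories, encoding, global_mean):
    """Replace categories with their encoded value; unseen ones get the global mean."""
    return [encoding.get(c, global_mean) for c in categories]

# Hypothetical training rows: ZIP code and churn label (1 = churned).
zips = ["98101", "98101", "10001", "10001", "10001"]
churned = [1, 0, 1, 1, 0]
encoding, global_mean = fit_target_encoding(zips, churned)
print(transform(["98101", "60601"], encoding, global_mean))  # prints [0.5, 0.6]
```

The single numeric column replaces what would otherwise be one dummy column per ZIP code. Because the encoding is derived from the target, it is commonly fit on training data only (often with smoothing or cross-fold fitting) to limit target leakage.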


1217

MCQeasy

A machine learning engineer needs to process a large dataset that does not fit on a single Amazon SageMaker notebook instance's EBS volume. The data is stored in S3. What is the MOST efficient way to access the data from the notebook?

A.Increase the EBS volume size to 5 TB.

B.Mount the S3 bucket as a file system using s3fs.

C.Read the data directly from S3 using the boto3 library.

D.Use SageMaker File input mode in the notebook.

AnswerC

Reading directly from S3 avoids storage limitations and is efficient for large datasets.

Why this answer

Option C is correct because reading data directly from S3 using the boto3 library is the most efficient approach for a dataset that exceeds the notebook instance's EBS volume capacity. Boto3 allows you to stream data in chunks or use S3 Select for server-side filtering, avoiding the need to download the entire dataset to local storage. This method leverages S3's high-throughput API and eliminates the bottleneck of writing to a local EBS volume, which is limited in size and I/O performance.

Exam trap

The trap here is that candidates confuse SageMaker's File input mode (designed for training jobs) with a general-purpose data access method for notebooks, or they assume that mounting S3 as a filesystem (s3fs) is efficient for large-scale data processing, when in reality it introduces performance penalties due to FUSE overhead and lack of native parallel I/O.

How to eliminate wrong answers

Option A is wrong because increasing the EBS volume to 5 TB does not solve the fundamental issue of the dataset not fitting; it only postpones the problem and incurs unnecessary cost, and a fixed-size volume may still be insufficient for extremely large datasets. Option B is wrong because mounting an S3 bucket as a file system using s3fs relies on FUSE (Filesystem in Userspace), which introduces significant latency and overhead due to metadata caching and POSIX translation, and is not designed for high-throughput data processing in a notebook environment. Option D is wrong because SageMaker File input mode is a training job feature that copies data from S3 to the training container before training starts, not a method for accessing data within a notebook instance; it cannot be used directly in a notebook's kernel.
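Streaming an object from S3 without writing it to the EBS volume might look like the sketch below. It assumes `boto3` is installed and credentials are configured when run against AWS; the function name `count_rows` and its `s3` parameter (which lets a client be passed in) are hypothetical conveniences for the example.

```python
def count_rows(bucket, key, s3=None):
    """Count non-empty lines in an S3 object by streaming it line by line."""
    if s3 is None:
        import boto3  # imported lazily; only needed when talking to AWS
        s3 = boto3.client("s3")
    # get_object returns a streaming body, so the object is never
    # fully loaded into memory or written to local disk.
    body = s3.get_object(Bucket=bucket, Key=key)["Body"]
    rows = 0
    for line in body.iter_lines():
        if line:
            rows += 1
    return rows
```

In a notebook this would be called as `count_rows("my-bucket", "data/part-0001.csv")`, where both names are placeholders. The same loop can aggregate or filter rows as they arrive instead of just counting them.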


1218

MCQmedium

A data scientist is building a regression model to predict house prices. The dataset includes a feature 'zip_code' with 1,000 unique values. What is the best way to handle this categorical feature in the exploratory data analysis phase?

A.One-hot encode the zip_code feature

B.Apply target encoding using the mean house price per zip code

C.Replace zip_code with the frequency of each zip code in the dataset

D.Use label encoding: assign each zip code a unique integer

AnswerB

Target encoding captures the relationship between zip code and price without creating 1,000 new columns.

Why this answer

Option B is correct because target encoding (mean encoding) captures the relationship between the category and the target, and is suitable for high-cardinality features. Option A is wrong because one-hot encoding would create too many dummy variables. Option D is wrong because label encoding implies ordinality which is not present.

Option C is wrong because frequency encoding may not capture price variation well.


1219

Multi-Selecteasy

Which TWO services can be used to perform hyperparameter tuning in Amazon SageMaker? (Choose two.)

Select 2 answers

A.Amazon SageMaker Automatic Model Tuning

B.Amazon SageMaker Experiments

C.AWS Glue

D.Amazon SageMaker Ground Truth

E.Amazon EMR

AnswersA, B

This is the native hyperparameter tuning service.

Why this answer

Options A and B are correct. SageMaker Automatic Model Tuning is the native tuning service. SageMaker Experiments can track tuning jobs.

Option C (AWS Glue) is for ETL. Option D (SageMaker Ground Truth) is for labeling. Option E (Amazon EMR) is for big data processing.


1220

MCQhard

A company runs a real-time fraud detection model on a SageMaker endpoint. The model is a TensorFlow neural network trained on transactional data. The endpoint uses a single ml.p3.2xlarge instance. Recently, the application’s latency has increased from 50ms to 500ms on average. The CloudWatch metrics show that CPU utilization is at 90%, GPU utilization is at 30%, and memory utilization is at 40%. The number of requests per second has remained stable. The ML team suspects the model is not fully utilizing the GPU. What action should the team take to reduce latency without changing the instance type?

A.Switch to SageMaker Batch Transform to process requests in batches

B.Change the endpoint to a compute-optimized instance like ml.c5.large

C.Use SageMaker Neo to compile the model for the target instance

D.Increase the number of instances behind the endpoint and use a load balancer

AnswerC

Neo optimizes the model to make better use of the GPU.

Why this answer

Optimizing the model for inference using SageMaker Neo can reduce latency by better leveraging the GPU. Option D is wrong because increasing instances only helps throughput, not per-request latency. Option A is wrong because SageMaker Batch Transform is for offline inference.

Option B is wrong because a CPU instance would not improve GPU utilization, and it changes the instance type.


1221

MCQmedium

A data scientist is using Amazon SageMaker Ground Truth to create a labeled dataset for object detection. The team has limited budget and wants to minimize labeling costs while ensuring high-quality labels. Which approach is MOST cost-effective?

A.Use only a private workforce of domain experts to label all data.

B.Use a public workforce and have each data point labeled by three workers.

C.Use active learning to automatically label high-confidence data and send only uncertain data to a private workforce.

D.Use the built-in automated labeling feature without human review.

AnswerC

Active learning reduces labeling cost while ensuring quality.

Why this answer

Option C is correct because active learning selects the most uncertain samples for human labeling, reducing the number of labels needed while maintaining quality. Option D is wrong because using only automated labeling may introduce errors. Option A is wrong because labeling all data with a private workforce of domain experts is expensive.

Option B is wrong because having three workers label every data point is costly.


1222

MCQhard

A machine learning team is deploying a model using Amazon SageMaker. The model receives requests with sparse high-dimensional features. The team wants to minimize inference latency. Which SageMaker endpoint configuration is MOST suitable?

A.Use a multi-variant endpoint with two variants

B.Use a serverless endpoint with provisioned concurrency

C.Use a single model endpoint with a large instance type

D.Use a multi-model endpoint on a GPU instance

AnswerD

Multi-model endpoints reduce latency by loading models on demand.

Why this answer

Option D is correct because multi-model endpoints on GPU instances allow multiple models to be loaded into memory on a single GPU-backed instance, reducing cold-start latency for sparse high-dimensional features by keeping models warm and leveraging GPU parallelism for inference. This minimizes inference latency compared to other configurations by avoiding the overhead of separate endpoint invocations and optimizing resource utilization for high-dimensional data.

Exam trap

The trap here is that candidates often assume a large single-instance endpoint (Option C) is sufficient for low latency, but they overlook the GPU acceleration and memory efficiency of multi-model endpoints for sparse high-dimensional data, which is a key optimization tested in the MLS-C01 exam.

How to eliminate wrong answers

Option A is wrong because a multi-variant endpoint with two variants distributes traffic across multiple model versions or instance types, which does not inherently reduce inference latency for sparse high-dimensional features and can introduce additional routing overhead. Option B is wrong because a serverless endpoint with provisioned concurrency is designed for infrequent or variable traffic patterns, but it incurs cold-start latency for each new invocation and is not optimized for high-dimensional sparse data that benefits from GPU acceleration. Option C is wrong because a single model endpoint with a large instance type may provide sufficient compute but does not leverage GPU parallelism for sparse high-dimensional features, leading to suboptimal inference latency compared to GPU-based multi-model endpoints.


1223

MCQmedium

A data scientist is using Amazon SageMaker to train a model using the built-in XGBoost algorithm. The training job is taking a long time. The data scientist notices that the input data is in CSV format and the training job is using File mode. The data size is 50 GB. What is the BEST way to reduce training time?

A.Use a larger instance type with more vCPUs.

B.Convert the data to Parquet format.

C.Reduce the number of features in the dataset.

D.Switch the input mode to Pipe.

AnswerD

Pipe mode reduces I/O wait time by streaming data.

Why this answer

Option D is correct. Pipe mode streams data directly from S3, reducing I/O overhead. Option A is wrong because a larger instance type may not help if I/O is the bottleneck.

Option B is wrong because converting to Parquet does not remove the File-mode download step that is slowing the job. Option C is wrong because reducing the number of features can harm accuracy.


1224

MCQmedium

An ML team uses Amazon SageMaker to train a deep learning model. The training job runs on a single ml.p3.2xlarge instance and is taking 10 hours. The team wants to reduce the training time to under 2 hours without changing the model architecture. Which approach is MOST effective?

A.Use SageMaker distributed training with multiple ml.p3.2xlarge instances.

B.Use SageMaker Managed Spot Training to reduce cost.

C.Switch to a single ml.p3.16xlarge instance with more GPUs.

D.Enable SageMaker Debugger to identify bottlenecks.

AnswerA

Distributed training partitions the model or data across instances, reducing wall-clock time.

Why this answer

Using SageMaker's distributed training with multiple GPU instances (A) can significantly reduce training time by parallelizing the workload. Changing to a larger instance (C) may help but not as much as multiple instances. Using SageMaker Debugger (D) does not speed up training.

Using Managed Spot Training (B) saves cost but not time.


1225

MCQmedium

A company is training a deep learning model on SageMaker using a large dataset stored in S3. The training job is taking a long time due to I/O bottlenecks. Which action would MOST effectively reduce the I/O bottleneck?

A.Use Amazon EFS as the data source.

B.Use Amazon FSx for Lustre as the data source.

C.Increase the number of training instances.

D.Use Pipe input mode in the SageMaker estimator.

AnswerD

Pipe mode streams data, reducing disk I/O.

Why this answer

Using Pipe mode streams data directly from S3 to the algorithm without writing to disk, reducing I/O wait. Option C (increasing instance count) may help but does not directly address I/O. Option A (using EFS) adds latency.

Option B (using FSx for Lustre) is more complex.


1226

MCQmedium

A company is using Amazon Rekognition to detect objects in images stored in S3. They want to reduce costs by processing images only when they are uploaded. Which AWS service should be used to trigger Rekognition automatically?

A.Amazon CloudWatch Events

B.Amazon Simple Notification Service (SNS)

C.AWS Lambda

D.AWS Step Functions

AnswerC

Lambda can be triggered by S3 event and call Rekognition.

Why this answer

Option C is correct because S3 events can trigger Lambda, which calls the Rekognition API. Option A is incorrect because CloudWatch Events can trigger on a schedule, not directly on S3 uploads. Option D is incorrect because Step Functions can orchestrate workflows but is not directly triggered by S3 uploads without Lambda.

Option B is incorrect because SNS is passive and needs a subscriber.
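A Lambda function wired to S3 upload events could be sketched roughly as follows. It assumes the function is subscribed to the bucket's object-created notifications and that its role may call Rekognition; the optional `rekognition` parameter and the `MaxLabels=10` setting are hypothetical choices made so the sketch is easy to exercise.

```python
from urllib.parse import unquote_plus

def handler(event, context, rekognition=None):
    """Detect labels for every image referenced in an S3 event notification."""
    if rekognition is None:
        import boto3  # imported lazily; available in the Lambda runtime
        rekognition = boto3.client("rekognition")
    results = {}
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        response = rekognition.detect_labels(
            Image={"S3Object": {"Bucket": bucket, "Name": key}},
            MaxLabels=10,
        )
        results[key] = [label["Name"] for label in response["Labels"]]
    return results
```

Because the function runs only when an object is uploaded, Rekognition is invoked once per new image rather than on a schedule over the whole bucket.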


1227

Multi-Selectmedium

Which TWO configurations are required to enable AWS Glue to access data stored in a VPC? (Choose two.)

Select 2 answers

A.A VPC endpoint for Amazon S3.

B.An AWS Glue connection object that specifies the VPC, subnet, and security group.

C.A NAT gateway in a public subnet.

D.An Internet gateway attached to the VPC.

E.An S3 bucket policy that allows access from the Glue service principal.

AnswersA, B

Correct: Allows Glue jobs in VPC to access S3 without Internet.

Why this answer

Glue jobs running inside a VPC require a VPC endpoint for S3 (or Internet/NAT) to access S3 data, and a Glue connection that specifies the VPC, subnet, and security group. Option A (VPC endpoint for S3) and Option B (Glue connection) are correct. Option D (Internet gateway) is not secure.

Option C (NAT gateway) is an alternative but not required if using a VPC endpoint. Option E (S3 bucket policy) is not specific to VPC access.


1228

Multi-Selectmedium

A data engineer is exploring a large dataset in Amazon Athena. The dataset is partitioned by date and stored in Parquet format. The engineer wants to check the number of distinct values in a column for a specific date range. Which THREE practices reduce query cost and improve performance?

Select 3 answers

A.Use the COUNT(DISTINCT column) function.

B.Filter the query with a WHERE clause on the partition column.

C.Use ORDER BY to sort the results.

D.Use SELECT * to retrieve all columns.

E.Ensure the table is columnar (Parquet) to reduce I/O.

AnswersA, B, E

Efficiently counts distinct values without fetching all rows.

Why this answer

Options A, B, and E are correct. Filtering on the partition column limits the data scanned to the requested date range, and the columnar Parquet layout lets Athena read only the referenced column. COUNT(DISTINCT) returns the result without fetching all rows, although it still scans the selected column.

Option D is wrong because SELECT * scans all columns. Option C is wrong because ORDER BY without LIMIT requires a full scan and sort.

1229

MCQmedium

A data scientist is using Amazon SageMaker to build a text classification model. The dataset has 100,000 labeled samples and 20 classes. The scientist wants to use a pre-trained BERT model and fine-tune it. Which approach is MOST cost-effective?

A.Train a BERT model from scratch using a larger instance.

B.Fine-tune a pre-trained BERT-base model using a GPU instance.

C.Use a pre-trained BERT-large model with a larger instance.

D.Train a CNN model from scratch using CPU instances.

AnswerB

BERT-base is cost-effective and fine-tuning is efficient.

Why this answer

Option B is correct because fine-tuning a smaller BERT variant such as BERT-base reduces computational cost while still being effective. Option A is wrong because training from scratch is expensive. Option C is wrong because using the full BERT-large is overkill.

Option D is wrong because using a CNN from scratch would require more data and training.

1230

Multi-Selectmedium

Which TWO data formats are columnar and optimized for analytics queries in Amazon S3?

Select 2 answers

A.CSV

B.ORC

C.JSON

D.Avro

E.Parquet

AnswersB, E

ORC is columnar and optimized for analytics.

Why this answer

Parquet and ORC are columnar storage formats. JSON and CSV are row-oriented. Avro is row-oriented.

1231

Multi-Selecthard

A machine learning team is using SageMaker Pipelines to orchestrate a multi-step workflow. The pipeline fails with a 'ThrottlingException' when submitting a training job. Which TWO actions can reduce the likelihood of throttling?

Select 2 answers

A.Use SageMaker Model Registry to version models

B.Implement retry logic with exponential backoff in the pipeline

C.Increase the number of parallel training jobs

D.Reduce the number of concurrent pipeline steps

E.Request a service quota increase for training jobs

AnswersB, D

Exponential backoff reduces request rate after throttling.

Why this answer

Throttling occurs due to API rate limits. Implementing exponential backoff and reducing concurrent submissions help. Options B and D are correct.

Option C increases load, Option A concerns model versioning rather than request rates, and Option E raises resource limits for training jobs but does not lower the API request rate.
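
The retry pattern can be sketched as follows. This is a minimal, generic illustration rather than SageMaker Pipelines' own retry mechanism: `ThrottlingError`, `call_with_backoff`, and the delay values are hypothetical names and defaults introduced for the example.

```python
import random
import time


class ThrottlingError(Exception):
    """Stand-in for a throttling error raised by an API client."""


def call_with_backoff(submit, max_attempts=5, base_delay=1.0, max_delay=30.0,
                      sleep=time.sleep):
    """Retry submit() on throttling, doubling the wait ceiling each attempt."""
    for attempt in range(max_attempts):
        try:
            return submit()
        except ThrottlingError:
            if attempt == max_attempts - 1:
                raise  # give up after the final attempt
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            # Jitter: wait a random time up to the ceiling so that
            # concurrent callers do not all retry at the same moment.
            sleep(random.uniform(0, ceiling))
```

The growing, randomized wait lowers the request rate after each throttled call, which is why backoff reduces the chance of repeated throttling.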

1232

Multi-Selecthard

A data scientist is performing exploratory data analysis on a time-series dataset of website traffic. The dataset contains hourly page views for the past two years. The scientist wants to analyze seasonality and trends. Which THREE techniques are appropriate for this analysis? (Choose THREE.)

Select 3 answers

A.Moving average smoothing

B.Box plot by month

C.Time series decomposition (additive or multiplicative)

D.Linear regression on time index

E.Autocorrelation (ACF) plot

AnswersA, C, E

Smoothing reveals underlying trend.

Why this answer

Decomposition separates time series into trend, seasonal, and residual components. Autocorrelation plot (ACF) helps identify seasonality. Moving average smooths to reveal trends.

Linear regression is not typical for seasonal decomposition. Box plot by month can show seasonal patterns but is less common for trend.
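
To make two of these techniques concrete, here is a small standard-library sketch of moving-average smoothing and sample autocorrelation. The `views` series is synthetic data invented for the example (a fixed daytime bump repeated every 24 hours), not the dataset from the question.

```python
def moving_average(series, window):
    """Simple moving average; returns len(series) - window + 1 values."""
    return [sum(series[i:i + window]) / window
            for i in range(len(series) - window + 1)]


def autocorrelation(series, lag):
    """Sample autocorrelation of the series at the given lag."""
    n = len(series)
    mean = sum(series) / n
    var = sum((x - mean) ** 2 for x in series)
    cov = sum((series[t] - mean) * (series[t + lag] - mean)
              for t in range(n - lag))
    return cov / var


# Synthetic hourly page views with a 24-hour cycle (made-up data).
views = [100 + (50 if 9 <= h % 24 <= 17 else 0) for h in range(24 * 14)]

smoothed = moving_average(views, window=24)   # averages out the daily cycle
daily_acf = autocorrelation(views, lag=24)    # high: pattern repeats daily
half_day_acf = autocorrelation(views, lag=12) # negative: peaks meet troughs
```

A 24-hour window flattens the daily cycle so that only the trend remains, while a spike in the autocorrelation at lag 24 points to daily seasonality.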

1233

MCQmedium

A data scientist is analyzing a dataset and finds that the target variable has a bimodal distribution. Which preprocessing step is most appropriate before modeling?

A.Standardize the target variable to have mean 0 and variance 1.

B.Remove outliers from the target variable.

C.Consider clustering to separate the two modes and model them separately.

D.Apply a log transformation to the target variable.

AnswerC

Bimodal distribution may indicate two subpopulations.

Why this answer

Option C is correct because clustering can identify natural groups, which can be treated as separate modeling tasks. Option D is wrong because log transformation works for skewed unimodal distributions. Option A is wrong because scaling does not change distribution shape.

Option B is wrong because removing outliers would not address bimodality.

1234

MCQmedium

A data scientist needs to deploy a PyTorch model for real-time inference. Which AWS service is best suited for this task?

A.Amazon SageMaker Batch Transform

B.Amazon ECS with Fargate

C.AWS Lambda with custom container

D.Amazon SageMaker real-time endpoint

AnswerD

SageMaker provides managed real-time endpoints with auto-scaling and built-in model hosting.

Why this answer

Option D is correct because Amazon SageMaker provides built-in support for deploying PyTorch models with real-time endpoints. Option C is wrong because AWS Lambda has limited memory and runtime duration. Option A is wrong because SageMaker Batch Transform is for offline inference.

Option B is wrong because ECS requires manual setup for model serving.

1235

MCQmedium

A team is using Amazon SageMaker to train a deep learning model for image classification. The training job is taking too long, and they want to reduce training time without sacrificing model accuracy. Which approach is most effective?

A.Reduce the batch size

B.Reduce the number of training epochs

C.Reduce the image resolution

D.Use transfer learning with a pre-trained model and fine-tune on the target dataset

AnswerD

Transfer learning uses features learned from a large dataset, allowing faster convergence and similar accuracy.

Why this answer

Option D is correct because using a pre-trained model (transfer learning) leverages existing feature representations, reducing training time while maintaining accuracy. Option B is wrong because reducing epochs may harm accuracy. Option A is wrong because reducing batch size can increase training time.

Option C is wrong because reducing image resolution may lose information.

1236

MCQeasy

A data scientist is using Amazon SageMaker to train an XGBoost model on a dataset with missing values. The dataset has both numeric and categorical features. Which preprocessing step is MOST appropriate before training?

A.Impute missing numeric values with the mean and categorical values with the mode, then train without encoding

B.Remove all rows with missing values and train on the remaining data

C.One-hot encode categorical features and let XGBoost handle missing values natively

D.Label encode categorical features and use the built-in missing value handling of XGBoost

AnswerC

XGBoost handles missing values by default; one-hot encoding is appropriate for categorical data.

Why this answer

Option C is correct because XGBoost can handle missing values natively, so imputation may not be necessary, and one-hot encoding is needed for categorical features. Option A (mean and mode imputation) may be acceptable but is unnecessary, and it leaves the categorical features unencoded. Option B (remove rows) loses data.

Option D (label encoding) may create ordinal relationships.

1237

MCQhard

A company uses an Amazon SageMaker notebook to train a model using data from an S3 bucket. The IAM role attached to the notebook has the following policy. What is the MOST specific change needed to allow the notebook to read from the bucket 'ml-data-123'?

A.Add an Allow statement for 's3:GetObject' on 'ml-data-123' to the IAM policy.

B.Remove the Deny statement from the IAM policy.

C.Create an S3 access point and update the IAM policy to use the access point ARN.

D.Add a bucket policy on 'ml-data-123' that grants access to the notebook's IAM role.

AnswerB

An explicit deny overrides any allow; removing the deny allows the existing S3 actions to work.

Why this answer

Option B is correct because the existing policy explicitly denies access to the specific bucket, so the Deny statement must be removed. Option A is wrong because an explicit Allow cannot override an explicit Deny. Option C is wrong because S3 access points are not necessary.

Option D is wrong because an Allow in a bucket policy cannot override the explicit Deny in the IAM policy either.

1238

MCQmedium

A company uses Amazon SageMaker to deploy a real-time inference endpoint for a regression model. The endpoint is experiencing high latency during spikes in traffic. The data scientist needs to reduce latency while maintaining cost efficiency. Which action should the data scientist take?

A.Use batch transform instead of real-time inference

B.Use a larger instance type for the endpoint

C.Deploy the model on a multi-model endpoint

D.Enable automatic scaling for the endpoint

AnswerD

Automatic scaling adds instances during traffic spikes, reducing latency.

Why this answer

Option D is correct because enabling automatic scaling for the SageMaker endpoint allows the number of instances to dynamically adjust based on traffic patterns, reducing latency during spikes by adding capacity when needed and removing it during low traffic to maintain cost efficiency. Automatic scaling uses CloudWatch metrics (e.g., InvocationsPerInstance or CPUUtilization) to trigger scale-out and scale-in policies, ensuring the endpoint can handle bursts without over-provisioning.

Exam trap

The trap here is that candidates confuse automatic scaling with simply adding more resources (Option B) or assume multi-model endpoints (Option C) are a latency solution, when in fact automatic scaling is the only option that directly addresses both latency spikes and cost efficiency through dynamic instance management.

How to eliminate wrong answers

Option A is wrong because batch transform is designed for offline, asynchronous inference on large datasets and does not support real-time inference, so it cannot reduce latency for a real-time endpoint. Option B is wrong because using a larger instance type may reduce latency for individual requests but increases cost significantly and does not dynamically adapt to traffic spikes, leading to either over-provisioning or continued high latency during bursts. Option C is wrong because a multi-model endpoint hosts multiple models on the same instance to improve resource utilization, but it does not inherently reduce latency during traffic spikes; in fact, it can increase latency due to model loading/unloading overhead and contention for shared resources.
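
For reference, endpoint auto scaling is configured through Application Auto Scaling. The sketch below only builds the request parameters and makes no AWS calls; the helper name, the policy name, and the example endpoint and variant names are hypothetical, while the namespace, scalable dimension, and predefined metric type are the values SageMaker endpoint variants use.

```python
def build_autoscaling_requests(endpoint_name, variant_name,
                               min_instances, max_instances,
                               target_invocations_per_instance):
    """Build request parameters for Application Auto Scaling (no AWS calls)."""
    resource_id = f"endpoint/{endpoint_name}/variant/{variant_name}"
    dimension = "sagemaker:variant:DesiredInstanceCount"
    register_target = {
        "ServiceNamespace": "sagemaker",
        "ResourceId": resource_id,
        "ScalableDimension": dimension,
        "MinCapacity": min_instances,
        "MaxCapacity": max_instances,
    }
    scaling_policy = {
        "PolicyName": f"{endpoint_name}-invocations-target",  # hypothetical name
        "ServiceNamespace": "sagemaker",
        "ResourceId": resource_id,
        "ScalableDimension": dimension,
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            "TargetValue": float(target_invocations_per_instance),
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance",
            },
        },
    }
    return register_target, scaling_policy
```

With credentials available, the two dictionaries could be passed to `register_scalable_target(**register_target)` and `put_scaling_policy(**scaling_policy)` on a boto3 `application-autoscaling` client.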

1239

MCQeasy

A data scientist is using Amazon SageMaker to train a linear learner model for regression. After reviewing the training logs, the data scientist notices that the loss is not decreasing and remains high. The learning rate is set to 0.01. The data is normalized. What should the data scientist do to improve convergence?

A.Normalize the data again.

B.Reduce the mini-batch size.

C.Try different learning rates, such as 0.001 or 0.1.

D.Increase the number of epochs.

AnswerC

Tuning the learning rate is a common first step to improve convergence.

Why this answer

Option C is correct. The learning rate may be too high, causing oscillation, or too low, causing slow convergence. Adjusting it can help.

Option D is wrong because more epochs may not help if the learning rate is inappropriate. Option A is wrong because the data is already normalized. Option B is wrong because reducing batch size increases noise but may not resolve convergence issues.

1240

Multi-Selecteasy

A data scientist is training a text classification model using Amazon SageMaker's built-in BlazingText algorithm. The dataset contains 1 million documents. Which TWO hyperparameters are most important to tune for improving model accuracy?

Select 2 answers

A.Learning rate

B.Batch size

C.Loss function

D.Type of optimizer

E.Number of epochs

AnswersA, E

Learning rate controls the step size during optimization and is crucial for convergence.

Why this answer

Learning rate and number of epochs are critical hyperparameters for training neural networks like BlazingText. They control how quickly the model learns and how long it trains.

1241

Multi-Selectmedium

A data scientist is using Amazon SageMaker to deploy a model for real-time inference. The endpoint receives a large number of requests with variable traffic patterns. The team wants to minimize cost while ensuring low latency. Which THREE actions should the team take? (Choose THREE.)

Select 3 answers

A.Use a multi-model endpoint to host multiple models on the same instance.

B.Enable auto-scaling for the endpoint based on the invocation count.

C.Set the initial variant weight to 1 and increase the number of instances.

D.Use a single large instance to handle all traffic.

E.Create a production variant with a smaller instance type.

AnswersA, B, E

Multi-model endpoints reduce cost by sharing resources.

Why this answer

Options A, B, and E are correct. Option E: Using a production variant with a smaller instance type reduces cost. Option B: Enabling auto-scaling adjusts capacity based on traffic.

Option A: Using a multi-model endpoint allows sharing instances among models. Option C is wrong because adding instances up front increases cost regardless of traffic. Option D is wrong because a single large instance may be over-provisioned.

1242

MCQmedium

A data scientist ran an AWS Glue ETL job that failed with the error shown. What is the most likely cause?

A.The CSV file has a header mismatch

B.The DataFrame does not have a column named 'age'

C.The schema is evolving incorrectly

D.The data type of 'age' is incompatible

AnswerB

Correct: The error states 'age' is not in the input columns.

Why this answer

Option B is correct because the error indicates the column 'age' is not found in the input data, which has columns [id, name, salary]. Option A is wrong because the error is about a missing column, not a mismatch. Option C is wrong because schema evolution would add a column, not cause this error.

Option D is wrong because there is no indication of data type issue.

1243

Multi-Selecthard

Which TWO of the following are valid configurations for SageMaker Training Job resource limits? (Select TWO.)

Select 2 answers

A.Maximum number of instances

B.Maximum wait time in seconds

C.Maximum run time in seconds

D.Minimum number of instances

E.Maximum number of spot instances

AnswersA, C

You can limit the number of instances used by the training job.

Why this answer

Options A and C are correct. SageMaker training jobs can have a maximum number of instances (A) and a maximum run time (C). Option D is wrong because there is no minimum instance count limit.

Option E is wrong because the spot instance limit is set separately. Option B is wrong because a maximum wait time applies only to managed spot training, not to training jobs in general.

1244

MCQmedium

An IAM policy is attached to a group. A user in the group tries to read the object s3://data-lake-bucket/sensitive/file.txt from an IP address 192.168.1.1. What will happen?

A.The request is allowed because the Allow statement grants s3:GetObject

B.The request is allowed because the Deny condition does not match

C.The request is denied because of the Deny statement

D.The request is denied because the policy has no explicit Allow for the sensitive prefix

AnswerC

Deny applies when condition is met.

Why this answer

The Deny statement explicitly denies any S3 action on the sensitive prefix when the source IP is not from 10.0.0.0/8. Since the IP 192.168.1.1 is not in that range, the Deny applies. Deny statements override Allow statements.

So the user is denied access.
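
The evaluation order can be illustrated with a heavily simplified simulation. The policy exhibit is not reproduced on this page, so `POLICY` below is a hypothetical reconstruction from the explanation (an Allow for `s3:GetObject` plus a Deny on the sensitive prefix for callers outside 10.0.0.0/8), and the matching logic covers only what this example needs, not the full IAM evaluation rules.

```python
import ipaddress
from fnmatch import fnmatchcase

# Hypothetical reconstruction of the policy described in the explanation.
POLICY = [
    {"Effect": "Allow", "Action": "s3:GetObject",
     "Resource": "arn:aws:s3:::data-lake-bucket/*"},
    {"Effect": "Deny", "Action": "s3:*",
     "Resource": "arn:aws:s3:::data-lake-bucket/sensitive/*",
     "NotIpAddress": "10.0.0.0/8"},
]


def statement_matches(stmt, action, resource, source_ip):
    """True when the action, resource, and IP condition all match."""
    if not fnmatchcase(action, stmt["Action"]):
        return False
    if not fnmatchcase(resource, stmt["Resource"]):
        return False
    cidr = stmt.get("NotIpAddress")
    if cidr is None:
        return True
    # The condition holds only when the caller is outside the CIDR block.
    return ipaddress.ip_address(source_ip) not in ipaddress.ip_network(cidr)


def is_allowed(policy, action, resource, source_ip):
    """Explicit Deny wins; otherwise an Allow is required (default deny)."""
    effects = [s["Effect"] for s in policy
               if statement_matches(s, action, resource, source_ip)]
    if "Deny" in effects:
        return False
    return "Allow" in effects
```

Under these assumptions, a request for the sensitive object from 192.168.1.1 matches both statements and is denied, while the same request from an address inside 10.0.0.0/8 matches only the Allow.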

1245

MCQmedium

A company is building a fraud detection model using a random forest classifier. The dataset is highly imbalanced with 99% legitimate transactions and 1% fraudulent. The model currently achieves 99% accuracy on the test set, but the fraud recall is only 10%. The business requires at least 80% recall for fraud. The data scientist has tried oversampling the minority class and adjusting class weights, but recall remains below 40%. The dataset contains millions of transactions with hundreds of features. Which approach should the data scientist try next to improve fraud recall?

A.Randomly undersample the majority class to a 50:50 ratio

B.Use a gradient boosting machine (e.g., XGBoost) with scale_pos_weight parameter

C.Apply PCA to reduce dimensionality before training

D.Use a logistic regression model with L2 regularization

AnswerB

Gradient boosting often outperforms random forest on imbalance with proper weighting.

Why this answer

Option B (gradient boosting with scale_pos_weight) is effective for imbalance. Option D (logistic regression) likely underperforms. Option A (undersampling) loses too much data.

Option C (PCA) reduces features but may lose signal.
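
A common starting value for `scale_pos_weight` is the ratio of negative to positive examples. The sketch below computes it for a made-up label list with the same 99:1 imbalance as the question; the helper name and the `params` dictionary are illustrative, and in practice the value would still be tuned against the recall target.

```python
def scale_pos_weight(labels):
    """Ratio of negative to positive examples, a common starting value."""
    positives = sum(1 for y in labels if y == 1)
    negatives = len(labels) - positives
    if positives == 0:
        raise ValueError("no positive examples in labels")
    return negatives / positives


# Made-up labels: 1% fraud (1), 99% legitimate (0).
labels = [1] * 1_000 + [0] * 99_000

params = {
    "objective": "binary:logistic",
    "scale_pos_weight": scale_pos_weight(labels),  # 99.0 for this data
}
```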

1246

MCQeasy

A data scientist needs to store and version machine learning models, along with metadata such as hyperparameters and metrics. Which AWS service is designed for this purpose?

A.Amazon S3 with versioning enabled

B.Amazon SageMaker Model Registry

C.Amazon DynamoDB

D.Amazon Elastic Container Registry (ECR)

AnswerB

It provides model versioning, metadata, and approval workflows.

Why this answer

Option B is correct: SageMaker Model Registry is for cataloging and versioning models. Option A (S3) is object storage but not specialized. Option D (ECR) stores container images.

Option C (DynamoDB) is a NoSQL database, not purpose-built for ML models.

1247

MCQhard

A company processes large streams of IoT sensor data using Amazon Kinesis Data Streams with 100 shards. Each sensor reading is about 1 KB. The data is consumed by an Amazon EMR cluster running Spark Streaming jobs. The team notices that the Spark Streaming job's processing time is gradually increasing, and the stream is falling behind. They suspect the issue is due to skewed data distribution across shards. Which approach should the team take to diagnose and resolve the issue?

A.Increase the number of shards to 200 to provide more parallelism.

B.Modify the producer to add a random prefix to the partition key, ensuring even distribution across all shards, and monitor the stream using CloudWatch.

C.Check Amazon CloudWatch metrics for Kinesis to identify hot shards, then manually redistribute the data by repartitioning in Spark.

D.Use the Kinesis Client Library (KCL) with a custom worker to rebalance the load across shards.

AnswerB

Adding a random prefix to partition keys uniformizes distribution, eliminating hot shards; CloudWatch helps confirm the fix.

Why this answer

Option B is correct because adding a random prefix to the partition key ensures that sensor data is evenly distributed across all 100 shards, eliminating hot shards that cause processing delays. This directly addresses the skewed data distribution issue without requiring infrastructure changes, and the team can monitor the improvement using CloudWatch metrics like IncomingBytes and ReadProvisionedThroughputExceeded.

Exam trap

The trap here is that candidates often confuse consumer-side rebalancing (KCL or Spark repartitioning) with producer-side data distribution, and incorrectly assume that increasing shards or using Spark repartitioning can fix a hot shard caused by a poor partition key.

How to eliminate wrong answers

Option A is wrong because simply increasing the number of shards to 200 does not fix the root cause of skewed distribution; it only adds more shards that may still be unevenly loaded if the partition key remains the same, potentially worsening the imbalance. Option C is wrong because while CloudWatch metrics can identify hot shards, manually redistributing data by repartitioning in Spark does not change how data is written to Kinesis shards; the producer-side partition key must be fixed to prevent future skew. Option D is wrong because the Kinesis Client Library (KCL) rebalances consumers across shards, but it cannot change how data is distributed across shards at the producer level; the skew originates from the producer's partition key selection.
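
The effect of salting the partition key can be simulated locally. Kinesis maps each partition key to a shard through an MD5 hash of the key; the sketch below assumes 100 shards with equal-sized hash key ranges, and the sensor ID, salt range, and helper names are hypothetical.

```python
import hashlib
import random
from collections import Counter

NUM_SHARDS = 100
HASH_SPACE = 2 ** 128  # MD5 produces a 128-bit hash key


def shard_for(partition_key, num_shards=NUM_SHARDS):
    """Map a key to a shard index, assuming equal-sized hash key ranges."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") * num_shards // HASH_SPACE


def salted_key(sensor_id, rng, salt_buckets=1000):
    """Prefix the key with a random salt so one sensor spreads across shards."""
    return f"{rng.randrange(salt_buckets)}-{sensor_id}"


rng = random.Random(0)
# Without a salt, every record from one busy sensor lands on a single shard.
hot = Counter(shard_for("sensor-42") for _ in range(10_000))
# With a salt, the same records are spread across many shards.
spread = Counter(shard_for(salted_key("sensor-42", rng)) for _ in range(10_000))
```

One trade-off to keep in mind: records that share a sensor ID no longer land on the same shard, so per-sensor ordering within a shard is lost.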

1248

MCQeasy

A company uses Amazon SageMaker to train a model and wants to track metrics like loss and accuracy in real-time. Which SageMaker feature should be used?

A.SageMaker Model Monitor

B.SageMaker metrics and CloudWatch dashboards

C.SageMaker Experiments

D.SageMaker Debugger

AnswerB

Provides real-time training metrics.

Why this answer

Option B is correct because SageMaker's built-in metrics and CloudWatch integration allow real-time tracking. Option D is wrong because SageMaker Debugger is for model debugging, not real-time metrics. Option C is wrong because SageMaker Experiments is for managing experiments, not real-time tracking.

Option A is wrong because SageMaker Model Monitor is for monitoring inference.

1249

MCQeasy

A data engineer is tasked with building a data pipeline that moves data from an on-premises database to Amazon S3 for analytics. The database is a MySQL instance that is 2 TB in size. The company has a 1 Gbps dedicated network connection to AWS (AWS Direct Connect). The data must be transferred once daily. The engineer needs to choose the most efficient and reliable service for this task. Which service should they use?

A.AWS DataSync

B.AWS Database Migration Service (DMS)

C.AWS Glue

D.Amazon S3 Transfer Acceleration

AnswerB

DMS is designed for database migrations and supports S3 as a target.

Why this answer

AWS Database Migration Service (DMS) is designed for migrating databases to AWS and can continuously replicate data. For a once-daily transfer, DMS can perform a full load and then ongoing replication if needed. It supports MySQL as a source and S3 as a target.

1250

MCQeasy

A machine learning engineer is analyzing a text classification dataset with 50,000 documents. Which EDA step is most important to understand the vocabulary size and frequency distribution?

A.Compute TF-IDF matrix

B.Plot frequency of each word in a bar chart

C.Generate bigram collocations

D.Plot histogram of document lengths

AnswerB

A word-frequency plot shows the vocabulary size and how word counts are distributed.

Why this answer

Option B is correct because the word frequency distribution (e.g., Zipfian) helps decide the vocabulary cutoff. Option A is wrong because TF-IDF is a transformation, not EDA. Option D is wrong because document length distribution is about length, not vocabulary.

Option C is wrong because bigram analysis is more advanced; basic frequency is first.
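
A minimal sketch of this EDA step with the standard library follows. The three documents and the simple regex tokenizer are assumptions made for the example; a real corpus would likely need a more careful tokenizer.

```python
import re
from collections import Counter


def word_frequencies(documents):
    """Count lowercase word tokens across all documents."""
    counts = Counter()
    for doc in documents:
        counts.update(re.findall(r"[a-z']+", doc.lower()))
    return counts


# Tiny made-up corpus standing in for the 50,000 documents.
docs = [
    "The model fits the data",
    "The data is noisy",
    "Noisy data hurts the model",
]

freq = word_frequencies(docs)
vocabulary_size = len(freq)      # number of distinct words
top_words = freq.most_common(2)  # most frequent words first
```

Plotting the sorted counts as a bar chart then shows how quickly frequencies fall off, which informs where to cut the vocabulary.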

1251

Multi-Selecthard

Which THREE techniques can help reduce overfitting in a neural network trained on a small dataset?

Select 3 answers

A.Apply L2 weight regularization

B.Increase the number of hidden layers

C.Train for more epochs

D.Use data augmentation

E.Add dropout layers

AnswersA, D, E

L2 regularization penalizes large weights.

Why this answer

L2 weight regularization (also known as weight decay) penalizes large weights by adding a term to the loss function proportional to the sum of squared weights. This forces the network to learn simpler patterns and reduces sensitivity to noise in the training data, which is especially helpful when the dataset is small and prone to overfitting.

Exam trap

AWS often tests the misconception that increasing model complexity (more layers or epochs) always improves performance, when in fact on small datasets it reliably worsens overfitting.

1252

MCQhard

A machine learning engineer is using Amazon SageMaker to deploy a model for real-time inference. The model is a large ensemble that requires 4 GB of memory and has a latency requirement of 100 ms. Which instance type and deployment configuration should the engineer choose to optimize cost while meeting requirements?

A.ml.m5.large (2 vCPU, 8 GB memory)

B.SageMaker Serverless Inference

C.ml.c5.large (2 vCPU, 4 GB memory)

D.ml.p3.2xlarge (8 vCPU, 61 GB memory, 1 GPU)

AnswerA

8 GB memory provides headroom, and cost is moderate.

Why this answer

ml.m5.large provides 8 GB memory and is cost-effective for real-time inference with moderate latency requirements. Option C is wrong because ml.c5.large has only 4 GB memory, insufficient for a 4 GB model plus overhead. Option D is wrong because ml.p3.2xlarge is GPU-accelerated and expensive, overkill for this model.

Option B is wrong because Serverless Inference has cold start latency that may exceed 100 ms.

1253

MCQhard

A data engineer is building a data pipeline that uses Amazon S3 to store raw data, AWS Lambda for transformation, and Amazon DynamoDB for serving. The Lambda function experiences high latency when writing to DynamoDB. Which action will most effectively reduce the latency?

A.Enable DynamoDB Accelerator (DAX) for caching

B.Use Amazon S3 instead of DynamoDB

C.Configure a VPC gateway endpoint for DynamoDB

D.Increase the DynamoDB write capacity units

AnswerA

DAX provides in-memory caching, reducing latency.

Why this answer

Option A is correct because enabling DynamoDB Accelerator (DAX) provides a caching layer that reduces read/write latency. Option D is wrong because increasing write capacity units helps throughput but not latency. Option B is wrong because S3 is object storage, not a low-latency store. Option C is wrong because using a VPC endpoint does not reduce latency significantly.

1254

MCQhard

A company runs a critical data pipeline using Apache Spark on Amazon EMR. The pipeline reads data from Amazon S3, performs complex transformations, and writes results back to S3. The job runs every hour and must complete within 30 minutes. Recently, the job has been taking longer and occasionally failing due to executor losses. The team suspects memory pressure. Which action should the team take to improve stability and performance without increasing cost?

A.Increase the spark.executor.memory setting to allocate more memory per executor.

B.Increase the number of core nodes in the EMR cluster.

C.Decrease the number of shuffle partitions (spark.sql.shuffle.partitions) to reduce overhead.

D.Enable Spark dynamic allocation to adjust executors based on workload.

AnswerD

Dynamic allocation helps utilize resources efficiently and prevents over-allocation.

Why this answer

Option D is correct because enabling dynamic allocation allows Spark to release idle executors and request more when needed, reducing memory pressure. Option B is wrong because increasing the number of core nodes increases cost. Option A is wrong because increasing executor memory per node may cause YARN containers to fail if the instance memory is exceeded.

Option C is wrong because reducing shuffle partitions may reduce parallelism and increase memory per task.

1255

MCQmedium

A data engineer uses AWS Glue to run ETL jobs that transform data from JSON to Parquet. The job runs successfully but takes 30 minutes longer than expected. CloudWatch metrics show high memory utilization and disk spills. What is the most likely cause?

A.The number of DPUs is too low

B.The sink bucket has insufficient I/O throughput

C.The source data format is too large

D.The data is skewed and not evenly distributed across partitions

AnswerD

Data skew causes some tasks to take longer, leading to spills and increased runtime.

Why this answer

High memory utilization and disk spills indicate that the data is not evenly distributed, causing some executors to process more data than others. This is often due to data skew. Increasing DPUs might help, but addressing skew is more effective.

1256

MCQmedium

Refer to the exhibit. A data engineer is troubleshooting an AWS Glue job that fails with an 'AccessDenied' error when trying to write to the S3 bucket 'my-data-lake'. The IAM policy attached to the Glue service role is shown. What is the missing permission?

A.s3:ListBucket

B.s3:PutObjectAcl

C.s3:GetBucketLocation

D.s3:DeleteObject

AnswerA

Correct: Glue needs ListBucket to list objects in the bucket.

Why this answer

The policy allows s3:GetObject and s3:PutObject on the bucket's objects, but it does not allow s3:ListBucket on the bucket itself. Many Glue operations require ListBucket to discover objects. Option A (s3:ListBucket) is missing.

Option D (s3:DeleteObject) is not needed. Option C (s3:GetBucketLocation) is not required. Option B (s3:PutObjectAcl) is not needed.
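
The distinction between bucket-level and object-level permissions shows up in the resource ARNs. The exhibit is not reproduced on this page, so the policy below is an assumed reconstruction: the first statement mirrors what the explanation says already exists, and the second adds the missing `s3:ListBucket` permission.

```python
import json

BUCKET = "my-data-lake"

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            # Object-level actions apply to object ARNs (bucket/*).
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:PutObject"],
            "Resource": f"arn:aws:s3:::{BUCKET}/*",
        },
        {
            # ListBucket is a bucket-level action on the bucket ARN itself.
            "Effect": "Allow",
            "Action": "s3:ListBucket",
            "Resource": f"arn:aws:s3:::{BUCKET}",
        },
    ],
}

policy_json = json.dumps(policy, indent=2)
```

Attaching `s3:ListBucket` to the `/*` object ARN instead of the bucket ARN is a frequent mistake, because the action would then match no resource.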

1257

MCQhard

An ML team is building a recommendation system. The training data includes user-item interactions stored in Amazon DynamoDB. The team wants to export this data to S3 in Parquet format for use with Amazon SageMaker. The export should be incremental (only new or changed records) and run daily. Which approach meets these requirements with MINIMAL operational overhead?

A.Use the DynamoDB Export to S3 feature and schedule it daily with AWS Glue.

B.Use DynamoDB Streams with AWS Lambda to write changes to S3 in Parquet format.

C.Use a script that scans the DynamoDB table and filters by last updated timestamp.

D.Set up an Amazon EMR cluster running Spark jobs to read DynamoDB and write to S3.

AnswerB

Streams capture changes in near-real-time, enabling incremental exports with minimal overhead.

Why this answer

Option B is correct because DynamoDB Streams capture every change (insert, update, delete) in near real-time, and AWS Lambda can process these events to write only the changed records to S3 in Parquet format. This approach provides incremental, daily exports with minimal operational overhead, as it is fully serverless and requires no infrastructure management.

Exam trap

The trap here is that candidates often choose Option A because they assume 'Export to S3' is incremental, but it actually exports the entire table, not just changes, leading to higher costs and redundant data processing.

How to eliminate wrong answers

Option A is wrong because the DynamoDB Export to S3 feature exports the entire table snapshot, not incremental changes, and scheduling it with AWS Glue adds unnecessary complexity and cost for a full export each day. Option C is wrong because scanning the entire DynamoDB table daily and filtering by last updated timestamp is inefficient, costly (consumes read capacity), and does not capture deletions; it also requires custom scripting and handling of large datasets. Option D is wrong because setting up and managing an Amazon EMR cluster introduces significant operational overhead for a simple incremental export task, and it is overkill compared to the serverless Streams + Lambda approach.

1258

MCQhard

A company is building a real-time fraud detection system using Amazon SageMaker. The model must have low latency (under 10ms) and high throughput (thousands of predictions per second). The team has trained a gradient boosting model using XGBoost. Which SageMaker inference option is MOST suitable?

A.Use SageMaker asynchronous inference.

B.Deploy the model on a SageMaker real-time endpoint with a multi-model endpoint.

C.Deploy the model on a SageMaker serverless endpoint.

D.Use a batch transform job.

AnswerB

Multi-model endpoints optimize cost and latency for high throughput.

Why this answer

A multi-model endpoint (MME) on SageMaker is the most suitable option because it allows you to host multiple XGBoost models on a single endpoint, sharing the underlying instance to maximize throughput and minimize latency. MMEs keep models loaded in memory and route requests to the correct model with sub-10ms overhead, meeting the low-latency and high-throughput requirements for real-time fraud detection.

Exam trap

The trap here is that candidates often confuse 'real-time' with 'serverless' or 'asynchronous', failing to recognize that serverless endpoints introduce cold-start latency and throughput limits that break the sub-10ms and high-throughput requirements.

How to eliminate wrong answers

Option A is wrong because asynchronous inference is designed for large payloads or long processing times (e.g., batch processing with minutes of latency), not for real-time sub-10ms predictions. Option C is wrong because serverless endpoints have a cold-start latency that can exceed 10ms and are throttled at lower concurrency, making them unsuitable for thousands of predictions per second. Option D is wrong because batch transform jobs are offline, not real-time, and cannot provide sub-10ms latency or handle streaming prediction requests.


1259

MCQeasy

A data scientist is building a regression model to predict energy consumption. The dataset includes features like temperature, humidity, day of week, and holiday flags. The scientist uses a linear regression model and obtains an R-squared of 0.85 on training and 0.40 on test. The scientist suspects the model is not capturing non-linear relationships. Which approach should the scientist use to capture non-linearity?

A.Apply PCA to the feature set

B.Increase L1 regularization using Lasso

C.Remove features with low correlation to the target

D.Add polynomial features (e.g., squared terms and interactions)

AnswerD

Polynomial features allow a linear model to fit non-linear patterns.

Why this answer

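Option D (add polynomial features) is correct: the model stays linear in its coefficients, but squared terms and interactions in the inputs let it fit non-linear relationships. Option B (increase L1 regularization) reduces overfitting but does not add non-linearity. Option A (apply PCA) only reduces dimensionality through a linear projection. Option C (remove low-correlation features) may discard useful information and adds no non-linearity.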
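The effect of a squared term can be sketched in a few lines. This is a minimal illustration in plain Python, assuming hypothetical synthetic data in which consumption depends on the square of a temperature deviation; the helper names `fit_line` and `r_squared` are made up for this example and are not part of any library.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x (one feature)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def r_squared(ys, preds):
    """Proportion of variance in ys explained by preds."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, preds))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


# Hypothetical data: usage rises with the square of the temperature deviation.
temps = [float(t) for t in range(-10, 11)]
usage = [3.0 + 0.5 * t * t for t in temps]

# Linear model on the raw feature: cannot follow the U-shaped relationship.
a_raw, b_raw = fit_line(temps, usage)
r2_raw = r_squared(usage, [a_raw + b_raw * t for t in temps])

# Same linear model, but fed the squared feature instead.
squared = [t * t for t in temps]
a_sq, b_sq = fit_line(squared, usage)
r2_squared = r_squared(usage, [a_sq + b_sq * s for s in squared])

print(f"R^2 with raw feature:     {r2_raw:.3f}")
print(f"R^2 with squared feature: {r2_squared:.3f}")
```

On this synthetic data the raw feature explains essentially none of the variance, while the squared feature explains essentially all of it. Real data would fall somewhere in between.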


1260

MCQhard

A company is using AWS Glue to run ETL jobs that transform data from multiple sources into a data lake on S3. The jobs are scheduled to run hourly. Recently, the jobs have been failing intermittently with 'MemoryError' exceptions. The data volume has grown over time. The data engineer needs to resolve this issue cost-effectively. Which action should be taken?

A.Increase the number of DPUs allocated to the Glue job and use a larger worker type.

B.Increase the S3 timeout settings in the Glue job configuration.

C.Switch the Glue job type from Spark to Python shell to reduce memory overhead.

D.Repartition the data using Spark's repartition method before processing.

AnswerA

More DPUs and larger worker types provide more memory to handle larger data volumes.

Why this answer

Option A is correct because Glue jobs can be configured to use more DPUs (Data Processing Units), and a larger worker type such as G.2X or G.4X provides more memory per worker. Option B (increasing the S3 timeout) does not address memory. Option D (repartitioning with Spark) may help but is more complex and will not be sufficient if the workers simply lack memory. Option C (switching to Python shell) makes things worse: Python shell jobs run on a single node with a fraction of a DPU or one DPU, so the job has less memory and will likely fail.


1261

MCQmedium

A company is using Amazon SageMaker to deploy a model for real-time predictions. The model requires access to a DynamoDB table to look up features. The SageMaker endpoint is configured with a VPC and subnet. However, the endpoint cannot connect to DynamoDB. What is the most likely reason?

A.The security group does not allow outbound traffic to DynamoDB

B.The IAM role for the endpoint does not have dynamodb:GetItem permission

C.The VPC does not have a VPC endpoint for DynamoDB or a NAT gateway

D.The DynamoDB table is in a different AWS Region

E.The CloudWatch logs show no errors

AnswerC

Without a route to DynamoDB, the endpoint cannot connect.

Why this answer

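Option C is correct because an endpoint running inside a VPC needs a route to DynamoDB: either a gateway VPC endpoint for DynamoDB, which keeps traffic private, or a NAT gateway that gives the subnet internet access. Option A (security group) can be a cause, but security groups allow all outbound traffic by default, so a missing route is the more common problem. Option B (IAM role) matters for authorization, but a missing permission produces an access-denied response rather than a connection failure. Option D (different Region) is not indicated by the scenario. Option E (no errors in CloudWatch logs) describes an observation, not a cause.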


1262

MCQhard

A machine learning team is building a fraud detection system using Amazon SageMaker. The training data is highly imbalanced (99% legitimate, 1% fraudulent). They need to maximize the recall of the fraud class while keeping precision above 90%. Which approach should they take?

A.Undersample the majority class to create a balanced dataset and train a Random Forest

B.Train a model using the original data, then adjust the decision threshold on the validation set to maximize recall while precision > 90%

C.Train an XGBoost model with scale_pos_weight parameter set to 99

D.Use SMOTE to oversample the fraud class and then train a logistic regression

AnswerB

Threshold tuning directly optimizes recall with a precision constraint.

Why this answer

Option B is correct because adjusting the decision threshold after training, to favor recall while monitoring precision on the validation set, is the most direct way to meet the business requirement. Option D (SMOTE) can help but does not guarantee the precision target. Option C (weighted loss via scale_pos_weight) is good practice but less direct than threshold tuning. Option A (random undersampling) may discard too much of the majority-class data.
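The threshold search described above can be sketched as follows. This is a minimal plain-Python illustration, assuming a small hypothetical validation set of labels and model scores; `precision_recall` and `pick_threshold` are illustrative helper names, not SageMaker or library functions.

```python
def precision_recall(labels, scores, threshold):
    """Precision and recall of the positive class at a given score threshold."""
    tp = fp = fn = 0
    for label, score in zip(labels, scores):
        predicted_positive = score >= threshold
        if predicted_positive and label == 1:
            tp += 1
        elif predicted_positive and label == 0:
            fp += 1
        elif not predicted_positive and label == 1:
            fn += 1
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def pick_threshold(labels, scores, min_precision=0.90):
    """Return (threshold, precision, recall) with the highest recall
    among thresholds whose precision is strictly above min_precision."""
    best = None
    for threshold in sorted(set(scores)):
        precision, recall = precision_recall(labels, scores, threshold)
        if precision > min_precision and (best is None or recall > best[2]):
            best = (threshold, precision, recall)
    return best


# Hypothetical validation labels (1 = fraud) and model scores.
val_labels = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]
val_scores = [0.05, 0.10, 0.20, 0.30, 0.35, 0.40, 0.60, 0.70, 0.80, 0.90]

best = pick_threshold(val_labels, val_scores)
print(best)
```

The search runs on validation data only, so the chosen threshold should still be checked on a held-out test set before deployment.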


1263

MCQeasy

A data scientist is building a classification model to predict customer churn. The dataset has 10,000 samples with 100 features. After training a logistic regression model, the scientist observes that the model has high variance (overfitting). Which technique can reduce overfitting?

A.Remove the regularization term

B.Use L2 regularization (Ridge)

C.Add polynomial features

D.Use a smaller learning rate

AnswerB

L2 regularization penalizes large weights, reducing overfitting.

Why this answer

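L2 regularization (Ridge) adds a penalty on large coefficients, which lowers variance and reduces overfitting. Removing the regularization term (Option A) would do the opposite. Adding polynomial features (Option C) increases model complexity and would worsen overfitting. A smaller learning rate (Option D) changes how the optimizer converges, not the capacity of the model. Collecting more training data would also help, but it is not among the options.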


1264

MCQmedium

A data scientist is using Amazon SageMaker to train a model. The training dataset is stored in S3 as CSV files. The scientist wants to use the SageMaker built-in Linear Learner algorithm. Which input mode should be used for optimal performance?

A.Augmented manifest file mode

B.File mode

C.Pipe mode

D.Fast file mode

AnswerC

Pipe mode streams data, reducing I/O overhead and improving performance.

Why this answer

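Pipe mode streams data directly from S3 to the algorithm without first writing it to disk, reducing I/O overhead and start-up time. File mode downloads the entire dataset to the training volume before training starts, which is slower. Fast file mode is a SageMaker input mode, but the Linear Learner documentation lists File and Pipe as the supported modes for its CSV and recordIO-protobuf inputs, so Pipe mode is the expected answer here. An augmented manifest is a way to describe input data and its metadata, not a performance option.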


1265

MCQhard

A company runs a real-time recommendation system on SageMaker with a model that uses a deep neural network. The endpoint uses a single ml.p3.2xlarge instance. Recently, the number of users has grown, and the endpoint's latency has increased from 50ms to 200ms, exceeding the SLA of 100ms. The model inference code is optimized and cannot be improved further. The company wants to reduce latency while minimizing cost. Which option is the MOST cost-effective and meets the latency requirement?

A.Convert to TensorFlow Lite on CPU

B.Use SageMaker's Elastic Inference

C.Switch to a larger instance type, e.g., ml.p3.8xlarge

D.Deploy on multiple smaller instances behind a load balancer

AnswerB

Elastic Inference provides cost-effective GPU acceleration.

Why this answer

Option B is the most cost-effective because Elastic Inference provides GPU-powered inference acceleration at a fraction of the cost of a full GPU instance. Switching to a larger instance type such as ml.p3.8xlarge increases cost significantly. Deploying multiple instances behind a load balancer may reduce latency but increases cost and complexity. Converting to TensorFlow Lite on a CPU instance may not maintain accuracy and may not meet the latency requirement.


1266

MCQmedium

Refer to the exhibit. An IAM policy is attached to a SageMaker notebook instance. A data scientist is trying to invoke the endpoint 'my-endpoint' from the notebook but receives an AccessDenied error. What is the likely cause?

A.The policy allows InvokeEndpoint only for endpoints with the exact ARN, but the endpoint ARN is different.

B.The policy uses a wildcard for CreateEndpoint, which is too permissive.

C.The policy does not allow sagemaker:CreateEndpoint for the specific endpoint.

D.The policy is not attached to the IAM role used by the notebook instance.

AnswerD

Without the policy, InvokeEndpoint is denied.

Why this answer

Option D is correct because the error 'AccessDenied' when invoking a SageMaker endpoint from a notebook instance typically indicates that the IAM role attached to the notebook does not have the required permissions. The policy shown in the exhibit grants sagemaker:InvokeEndpoint for the specific endpoint ARN, but if the policy is not attached to the IAM role that the notebook instance is using, the role lacks the permission, resulting in the AccessDenied error. Attaching the policy to the correct IAM role resolves the issue.

Exam trap

AWS often tests the distinction between having a policy defined versus having it attached to the correct IAM role; candidates mistakenly assume that if a policy exists in the account, it automatically applies to all resources, but IAM policies must be explicitly attached to the role or user making the request.

How to eliminate wrong answers

Option A is wrong because the policy explicitly allows InvokeEndpoint for the endpoint ARN 'arn:aws:sagemaker:us-east-1:123456789012:endpoint/my-endpoint', so if the endpoint ARN matches, this is not the cause. Option B is wrong because the wildcard for CreateEndpoint is irrelevant to the InvokeEndpoint action; the error is about invoking, not creating, and a permissive CreateEndpoint policy does not cause an AccessDenied on InvokeEndpoint. Option C is wrong because the policy does not need to allow sagemaker:CreateEndpoint for invoking an endpoint; the required action is sagemaker:InvokeEndpoint, which is already allowed in the policy.


1267

Multi-Selecteasy

Which TWO of the following are appropriate use cases for Amazon SageMaker built-in algorithms?

Select 2 answers

A.Classifying customer churn using tabular data

B.Reinforcement learning using Q-learning

C.Classifying text documents using word embeddings

D.Image classification using a custom CNN architecture

E.Time series forecasting using ARIMA

AnswersA, C

XGBoost or Linear Learner handles tabular churn classification, and BlazingText handles text classification.

Why this answer

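XGBoost (or Linear Learner) is suitable for tabular classification such as churn prediction. BlazingText supports text classification and Word2vec word embeddings. The built-in Image Classification algorithm uses a ResNet-based architecture, so a custom CNN architecture calls for your own training script or container. ARIMA is not a SageMaker built-in algorithm; the built-in forecasting algorithm is DeepAR. Q-learning is not a built-in algorithm either; reinforcement learning on SageMaker relies on RL toolkits rather than a built-in algorithm.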


1268

MCQeasy

A machine learning engineer is using Amazon SageMaker to train a model. The training job fails with an out-of-memory error. The training data size is 10 GB and the instance is ml.m5.xlarge (16 GB memory). Which change is MOST likely to resolve the issue without increasing cost?

A.Reduce the batch size in the training script.

B.Switch to a GPU instance like p3.2xlarge.

C.Use a larger instance type like ml.m5.4xlarge.

D.Decrease the training dataset size.

AnswerA

Smaller batch size reduces memory footprint per iteration.

Why this answer

Option A is correct because many algorithms allow you to set a batch size, and reducing it lowers memory usage. Option B is wrong because changing to GPU may not help and could increase cost. Option C is wrong because increasing instance type increases cost.

Option D is wrong because decreasing the dataset size may lose information.


1269

MCQmedium

A data scientist is performing EDA on a time series dataset of daily website visits. The scientist wants to identify any seasonality patterns. Which visualization is most appropriate?

A.Correlation matrix of visits with lagged versions of itself.

B.Scatter plot of visits against the day of the month.

C.Histogram of daily visit counts.

D.Line plot with day on x-axis and visits on y-axis, highlighting weekends.

AnswerD

Reveals periodic patterns over time.

Why this answer

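A line plot of visits over time, with intervals such as weekends highlighted, makes repeating weekly or monthly patterns directly visible. Option C (histogram) shows the distribution of visit counts, not how they change over time. Option B (scatter plot of visits against day of the month) folds every month onto the same axis and hides weekly cycles. Option A (correlation matrix with lagged values) quantifies dependence on past values but is a less direct view of the pattern than the time plot.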


1270

Multi-Selecteasy

A data scientist is evaluating a linear regression model. Which TWO metrics are appropriate for evaluating the model's performance?

Select 2 answers

A.R-squared

B.Root Mean Squared Error (RMSE)

C.Precision

D.Area Under the ROC Curve (AUC-ROC)

E.F1 score

AnswersA, B

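R-squared measures the proportion of variance explained by the model; RMSE measures the typical size of the prediction error.

Why this answer

R-squared is a standard metric for linear regression that measures the proportion of variance in the dependent variable explained by the independent variables. On training data it ranges from 0 to 1, with higher values indicating a better fit; on held-out data it can be negative when the model predicts worse than the mean of the target. Root Mean Squared Error (RMSE) is the square root of the average squared difference between predicted and actual values, expressed in the same units as the target, with lower values indicating a better fit.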

Exam trap

AWS often tests the distinction between regression and classification metrics, and the trap here is that candidates mistakenly apply classification metrics like Precision, AUC-ROC, or F1 score to a regression problem, not recognizing they are fundamentally incompatible with continuous outputs.
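Both regression metrics can be computed directly from predictions. The sketch below uses plain Python and a small set of hypothetical actual and predicted values; the function names are illustrative only.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error, in the same units as the target."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(actual))


def r_squared(actual, predicted):
    """1 - SS_res / SS_tot; negative when worse than predicting the mean."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Hypothetical targets and two sets of predictions.
actual = [3.0, 5.0, 7.0, 9.0]
good_predictions = [2.0, 5.0, 8.0, 9.0]
bad_predictions = [9.0, 7.0, 5.0, 3.0]  # trend reversed

good_rmse = rmse(actual, good_predictions)
good_r2 = r_squared(actual, good_predictions)
bad_r2 = r_squared(actual, bad_predictions)

print(f"good model: RMSE={good_rmse:.3f}, R^2={good_r2:.2f}")
print(f"bad model:  R^2={bad_r2:.2f}")
```

The second set of predictions shows why R-squared is not strictly bounded below by zero: a model that is worse than always predicting the mean yields a negative value.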


1271

MCQeasy

A company has a dataset with 1 million rows and 500 features. They want to reduce dimensionality for visualization. Which technique is most suitable for preserving global structure?

A.Autoencoder

B.t-Distributed Stochastic Neighbor Embedding (t-SNE)

C.Linear Discriminant Analysis (LDA)

D.Principal Component Analysis (PCA)

AnswerD

PCA preserves global variance.

Why this answer

Option D is correct because PCA is a linear technique that preserves global variance. Option B is wrong because t-SNE focuses on local structure. Option C is wrong because LDA requires labels. Option A is wrong because autoencoders are more complex to train and are not primarily a visualization technique.


1272

MCQeasy

A company is using Amazon Kinesis Data Firehose to load streaming data into Amazon S3. The data is in JSON format, and they want to convert it to Parquet before storage. What should they configure?

A.Enable data format conversion in Firehose and specify a Glue table

B.Use an AWS Lambda function to transform the data

C.Run an AWS Glue ETL job after data is in S3

D.Use Kinesis Data Analytics for Apache Flink to convert the format

AnswerA

Firehose can convert to Parquet using a Glue table schema.

Why this answer

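Kinesis Data Firehose can convert incoming JSON to Parquet or ORC using a schema from an AWS Glue table, so Option A needs no custom code. Option B is wrong because a Lambda transformation would require writing and maintaining conversion code when the built-in feature already exists. Option D is wrong because Kinesis Data Analytics is for stream processing and adds an unnecessary component. Option C is wrong because a Glue ETL job converts the data only after it has landed in S3, adding an extra step, cost, and delay.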
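For orientation, the format-conversion settings can be written as the dictionary that the Firehose API expects under `ExtendedS3DestinationConfiguration["DataFormatConversionConfiguration"]`. This is a sketch only: the helper name `build_parquet_conversion_config` and the role ARN, database, and table values are hypothetical placeholders, and the rest of the delivery stream configuration (bucket, buffering, and so on) is omitted.

```python
def build_parquet_conversion_config(role_arn, glue_database, glue_table, region):
    """Build the DataFormatConversionConfiguration block for JSON -> Parquet.

    The schema comes from an existing AWS Glue table; all argument values
    passed in by the caller are assumptions for illustration.
    """
    return {
        "Enabled": True,
        "SchemaConfiguration": {
            "RoleARN": role_arn,
            "DatabaseName": glue_database,
            "TableName": glue_table,
            "Region": region,
            "VersionId": "LATEST",
        },
        # Read incoming records as JSON ...
        "InputFormatConfiguration": {"Deserializer": {"OpenXJsonSerDe": {}}},
        # ... and write them out as Parquet.
        "OutputFormatConfiguration": {"Serializer": {"ParquetSerDe": {}}},
    }


# Hypothetical names, for illustration only.
conversion_config = build_parquet_conversion_config(
    role_arn="arn:aws:iam::123456789012:role/example-firehose-role",
    glue_database="example_db",
    glue_table="example_events",
    region="us-east-1",
)
print(conversion_config["OutputFormatConfiguration"])
```

In practice this dictionary would be passed, as part of the extended S3 destination settings, to the `create_delivery_stream` call of a boto3 Firehose client.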


1273

MCQhard

A team deployed a SageMaker endpoint for real-time inference using a PyTorch model. After monitoring, they notice that the latency is highly variable, with p99 latency 10x the p50 latency. The endpoint uses a single ml.c5.2xlarge instance with auto-scaling based on average CPU utilization. Which change is most likely to reduce latency variability?

A.Increase the batch size for inference

B.Pre-warm the model by sending dummy requests every minute

C.Switch to a GPU instance type

D.Change the auto-scaling metric to 'InvocationsPerInstance'

AnswerD

Scaling on invocations per instance prevents overload and reduces queueing.

Why this answer

Option D is correct because a p99 latency far above the p50 often results from requests queueing when traffic spikes. Average CPU utilization can lag behind or understate request load, whereas scaling on invocations per instance adds capacity in proportion to traffic. Option C (GPU instance) may not help if the model is CPU-bound. Option A (larger batch size) tends to increase per-request latency. Option B (warm-up requests) targets cold starts but does not relieve queueing under load.
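A target-tracking policy on invocations per instance can be sketched as the argument set for Application Auto Scaling. In that API the predefined metric is named `SageMakerVariantInvocationsPerInstance`. The helper name, endpoint name, variant name, and target value below are hypothetical, and the right target depends on load testing of the actual model.

```python
def build_invocations_scaling_policy(endpoint_name, variant_name, target_per_minute):
    """Build put_scaling_policy arguments for target tracking on
    invocations per instance (average invocations per minute, per instance)."""
    return {
        "PolicyName": f"{endpoint_name}-invocations-target-tracking",
        "ServiceNamespace": "sagemaker",
        "ResourceId": f"endpoint/{endpoint_name}/variant/{variant_name}",
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            "TargetValue": float(target_per_minute),
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
            },
        },
    }


# Hypothetical endpoint, variant, and target value.
policy = build_invocations_scaling_policy("example-endpoint", "AllTraffic", 500)
print(policy["ResourceId"])

# With AWS credentials configured, the policy could be applied roughly like this
# (after registering the variant as a scalable target):
#   import boto3
#   client = boto3.client("application-autoscaling")
#   client.put_scaling_policy(**policy)
```

Keeping the policy as a plain dictionary makes it easy to review the target value separately from the code that applies it.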


1274

MCQeasy

A data scientist wants to deploy a PyTorch model for real-time inference. Which SageMaker deployment option provides the lowest latency for single-digit millisecond responses?

A.SageMaker Real-Time Inference endpoint

B.SageMaker Asynchronous Inference

C.SageMaker Serverless Inference

D.SageMaker Batch Transform

AnswerA

Real-Time endpoints provide the lowest latency for online inference.

Why this answer

SageMaker Real-Time Inference endpoints are designed for low-latency, synchronous predictions served from always-on instances. Option C is wrong because Serverless Inference can incur cold starts and higher latency. Option D is wrong because Batch Transform is for offline batch processing. Option B is wrong because Asynchronous Inference queues requests and returns results later, so its latency is higher.


1275

Multi-Selecteasy

Which TWO options are best practices for managing access to data stored in Amazon S3 for a data lake?

Select 2 answers

A.Use S3 access control lists (ACLs) for granular permissions

B.Enable default encryption with SSE-S3

C.Use IAM policies to control user and role permissions

D.Use S3 bucket policies to grant cross-account access

E.Generate pre-signed URLs for all data access

AnswersC, D

IAM policies are central to access management.

Why this answer

IAM policies and bucket policies are standard for access control. S3 ACLs are legacy and not recommended. SSE-S3 is encryption, not access control.

Pre-signed URLs are for temporary access, not general governance.

