AWS Certified Machine Learning Specialty MLS-C01 Questions 301–375 | Page 5/24

301

Multi-Selecteasy

Which TWO AWS services can be used to transform data in transit before storing it in Amazon S3? (Choose TWO.)

Select 2 answers

A.AWS Glue

B.Amazon Redshift Spectrum

C.AWS Data Pipeline

D.Amazon Kinesis Data Firehose

E.Amazon Athena

AnswersA, D

Glue can process streaming data with streaming ETL jobs.

Why this answer

AWS Glue can perform ETL transformations on data in motion through streaming ETL jobs. Amazon Kinesis Data Firehose can invoke Lambda functions to transform data before delivery. Options B, C, and E are not used for transforming data in transit.
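
As a rough illustration of the Firehose path, the sketch below shows what a transformation Lambda might look like. It assumes the records carry JSON payloads; the `source` field and the lower-casing step are hypothetical and stand in for whatever transformation is actually needed.

```python
import base64
import json


def lambda_handler(event, context):
    """Transform each Firehose record before it is delivered to S3."""
    output = []
    for record in event["records"]:
        try:
            payload = json.loads(base64.b64decode(record["data"]))
            # Hypothetical transformation: normalize one field to lower case.
            payload["source"] = str(payload.get("source", "unknown")).lower()
            data = base64.b64encode((json.dumps(payload) + "\n").encode("utf-8"))
            output.append({
                "recordId": record["recordId"],
                "result": "Ok",
                "data": data.decode("utf-8"),
            })
        except (ValueError, AttributeError, TypeError):
            # Hand the untouched record back and mark it as failed.
            output.append({
                "recordId": record["recordId"],
                "result": "ProcessingFailed",
                "data": record["data"],
            })
    return {"records": output}
```

Each returned record keeps its original `recordId` and reports a `result`, which is how Firehose matches transformed records to the ones it sent.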

302

MCQmedium

A company deploys a machine learning model on Amazon SageMaker for real-time inference. The model receives requests with large payloads (up to 5 MB) and the inference latency is high. Which configuration change would MOST likely reduce latency?

A.Pre-load multiple model containers on the same endpoint

B.Reduce the batch size for inference requests

C.Use a larger instance type with more memory and compute

D.Enable payload compression using SageMaker built-in compression

AnswerC

Larger instances can process large payloads faster.

Why this answer

Increasing the instance size (Option C) gives the endpoint more compute and memory to handle large payloads, which is the most direct way to cut per-request latency. Option A is wrong because containers are loaded automatically; pre-loading is not an option. Option B is wrong because reducing the batch size does not reduce the time needed to process each large payload.

Option D is wrong because SageMaker does not support payload compression natively.

303

MCQhard

A machine learning engineer is analyzing a dataset for a regression problem. The target variable has a long-tail distribution with extreme outliers. The engineer wants to reduce the influence of outliers while preserving the relative order of values. Which data transformation should the engineer apply to the target variable?

A.Min-max normalization

B.Box-Cox transformation

C.Rank transformation

D.Log transformation

AnswerC

Rank transformation replaces values with their rank order, making the distribution uniform and robust to outliers.

Why this answer

Option C is correct because the rank transformation maps values to their ranks, so an extreme value counts only as the highest rank while the order of values is preserved. Option A is wrong because min-max scaling does not reduce outlier influence; the outliers set the range and compress the remaining values. Option B is wrong because Box-Cox requires positive values and may not reduce outlier influence.

Option D is wrong because log transformation can reduce skew but still allows outliers to remain influential.
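
A minimal sketch of a rank transformation in plain Python may make the effect concrete. The `target` values are made up, and giving tied values their average rank is one common convention rather than the only one.

```python
def rank_transform(values):
    """Replace each value with its 1-based rank; ties share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return ranks


target = [12.0, 15.0, 11.0, 14.0, 9500.0]  # hypothetical long-tailed target
print(rank_transform(target))  # [2.0, 4.0, 1.0, 3.0, 5.0]
```

The outlier 9500.0 becomes rank 5, only one step above its neighbor, while the ordering of all values is unchanged.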

304

MCQhard

A team is deploying a real-time inference endpoint using Amazon SageMaker. The model is a large ensemble of 10 deep learning models, each 500 MB. The inference latency requirement is under 200 ms. Currently, the endpoint using a single ml.p3.2xlarge instance takes 1.5 seconds per request. Which approach is MOST likely to meet the latency requirement?

A.Increase the batch size to process more requests per invocation.

B.Switch to a compute-optimized instance like c5.18xlarge.

C.Use SageMaker Neo to compile the model for the target instance.

D.Use model parallelism to split the ensemble across multiple GPUs on a single instance.

AnswerD

Model parallelism reduces per-device load and can achieve latency target.

Why this answer

Option D is correct because model parallelism distributes the model across multiple GPUs, reducing per-device memory and computation time. Option A is wrong because increasing batch size increases latency. Option B is wrong because changing instance type alone may not reduce latency enough.

Option C is wrong because SageMaker Neo does not support model parallelism.

305

Multi-Selecthard

A company is using SageMaker to train a model and wants to ensure that the training data is encrypted at rest and in transit, and that the trained model artifacts are also encrypted. Which THREE actions should the company take?

Select 3 answers

A.Specify a KMS key in the SageMaker training job configuration to encrypt the ML storage volume

B.Enable SageMaker model encryption using a KMS key

C.Configure the training job to run in a VPC with no internet access

D.Enable AWS CloudTrail to log all API calls

E.Enable S3 server-side encryption (SSE-KMS) on the training data bucket

AnswersA, B, E

Encrypts the training instance's storage volume.

Why this answer

Options A, B, and E are correct. A: Use a KMS key for the SageMaker training (ML storage) volume. B: Enable SageMaker model encryption with a KMS key so the trained artifacts are encrypted.

E: Enable S3 server-side encryption (SSE-KMS) on the training data bucket. Option C (VPC) is for network isolation, not encryption. Option D (CloudTrail) is for auditing, not encryption.
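
The sketch below shows where the KMS settings would sit in a SageMaker `CreateTrainingJob` request. It is a fragment, not a complete request: the instance settings, key ARN, and bucket name are placeholders, and SSE-KMS on the training data bucket is configured on the S3 side rather than here.

```python
def encryption_settings(kms_key_arn, output_s3_uri):
    """Return only the encryption-related fields of a CreateTrainingJob request."""
    return {
        "ResourceConfig": {
            "InstanceType": "ml.m5.xlarge",  # placeholder instance settings
            "InstanceCount": 1,
            "VolumeSizeInGB": 50,
            # Encrypts the ML storage volume attached to the training instance.
            "VolumeKmsKeyId": kms_key_arn,
        },
        "OutputDataConfig": {
            "S3OutputPath": output_s3_uri,
            # Encrypts the model artifacts that the job writes to S3.
            "KmsKeyId": kms_key_arn,
        },
        # Encrypts traffic between containers in distributed training jobs.
        "EnableInterContainerTrafficEncryption": True,
    }


settings = encryption_settings(
    "arn:aws:kms:us-east-1:111122223333:key/example-key-id",  # hypothetical key
    "s3://example-bucket/model-output/",                      # hypothetical bucket
)
```

These fields would be merged with the rest of the job definition (algorithm, role, input channels) before the request is sent.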

306

MCQmedium

A financial services company is building a fraud detection model using a large dataset of credit card transactions. The dataset contains 10 million rows with 50 features, including transaction amount, merchant category, time of day, and customer historical features. The label is binary: fraudulent (1% of data) or legitimate. The company wants to deploy a real-time inference endpoint using Amazon SageMaker that can score transactions with sub-100ms latency. The current model is a gradient boosting model (XGBoost) trained on a sample of 1 million rows due to memory constraints. The model achieves 0.95 AUC on a held-out test set but the fraud recall (sensitivity) is only 0.4, which is unacceptable because the cost of missing a fraud is high. The data science team has access to a larger compute instance (ml.m5.24xlarge) for training. Which course of action is most likely to improve fraud recall while maintaining latency requirements?

A.Train the XGBoost model on the full 10 million rows using an ml.p3.2xlarge instance with GPU support, and apply SMOTE oversampling to the minority class before training.

B.Engineer additional features from transaction time and merchant category, then retrain the XGBoost model on the same 1 million row sample.

C.Downsample the majority class to 1% of the original size to create a balanced dataset of 200,000 rows, then retrain the XGBoost model on this balanced sample.

D.Replace XGBoost with a logistic regression model trained on the full dataset, as linear models are faster to train and may generalize better on large data.

AnswerA

Using a GPU instance allows training on the full dataset efficiently, and SMOTE oversampling balances the classes, directly improving recall.

Why this answer

Option A is correct because training on the full 10 million rows with a GPU-accelerated instance (ml.p3.2xlarge) allows the XGBoost model to learn from the complete data distribution, addressing the bias introduced by the 1 million row sample. Applying SMOTE oversampling to the minority class (fraud) directly tackles the class imbalance (1% fraud), which is the root cause of the low recall (0.4). SMOTE generates synthetic fraudulent examples, improving the model's ability to detect fraud without significantly increasing inference latency, as the model architecture and deployment remain unchanged.

Exam trap

The trap here is that candidates may choose downsampling (Option C) as a quick fix for class imbalance, overlooking that it discards valuable majority class data and can harm model generalization, while SMOTE (Option A) preserves data and synthetically balances the classes to improve recall without sacrificing latency.

How to eliminate wrong answers

Option B is wrong because engineering additional features and retraining on the same 1 million row sample does not address the fundamental issue of insufficient fraudulent examples in the training data; the model will still suffer from low recall due to class imbalance. Option C is wrong because downsampling the majority class to 1% reduces the dataset to only 200,000 rows, discarding 99% of legitimate transactions, which can lead to loss of valuable patterns and degrade model generalization, while not guaranteeing improved fraud recall. Option D is wrong because replacing XGBoost with logistic regression, a linear model, is unlikely to capture complex non-linear interactions in transaction data, and while it may train faster, it will not improve recall to the required level and may even worsen performance.

307

MCQeasy

A data scientist is training a binary classification model on an imbalanced dataset where the positive class represents 5% of the data. Which metric is most appropriate for evaluating model performance?

A.Accuracy

B.AUC-ROC

C.Root Mean Squared Error (RMSE)

D.R-squared

AnswerB

AUC-ROC evaluates the model's ability to distinguish between classes regardless of threshold and is robust to imbalance.

Why this answer

Option B is correct because AUC-ROC is robust to class imbalance and measures the trade-off between true positive rate and false positive rate. Option A is wrong because accuracy can be misleading with imbalanced data. Option C is wrong because RMSE is for regression.

Option D is wrong because R-squared is for regression.
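
A small worked example, on made-up data with a 5% positive class, shows why accuracy misleads here. The `auc_roc` helper is a plain-Python sketch of the pairwise definition of AUC, not a library function.

```python
def auc_roc(labels, scores):
    """AUC as the probability that a random positive outscores a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    # A tie between a positive and a negative counts as half a win.
    wins = sum(1.0 if p > n else (0.5 if p == n else 0.0) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


labels = [1] * 5 + [0] * 95       # hypothetical dataset, 5% positives
always_negative = [0.0] * 100     # a "model" that never predicts the positive class

accuracy = sum(
    1 for y, s in zip(labels, always_negative) if (s >= 0.5) == (y == 1)
) / len(labels)

print(accuracy)                          # 0.95
print(auc_roc(labels, always_negative))  # 0.5
```

The useless classifier scores 95% accuracy, yet its AUC of 0.5 shows it separates the classes no better than chance.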

308

Multi-Selecthard

A company uses Amazon Athena to query a data lake in Amazon S3. The data is partitioned by year, month, day, and hour. The team notices that queries are slow and expensive. The team wants to improve performance and reduce costs. Which THREE actions should the team take?

Select 3 answers

A.Ensure queries filter on partition columns (year, month, day, hour).

B.Increase the number of partitions by adding a partition for minute.

C.Convert data from CSV to Parquet format.

D.Use CSV format with GZIP compression.

E.Use S3 storage classes like S3 Intelligent-Tiering for cost savings.

AnswersA, C, E

Partition pruning reduces scanned data.

Why this answer

Options A, C, and E are correct. Converting to columnar formats like Parquet reduces the amount of data scanned. Partition pruning using WHERE clauses on partition columns reduces scanned partitions.

S3 Intelligent-Tiering lowers storage cost for data that is accessed infrequently. Option B is wrong because increasing partitions beyond need can increase overhead. Option D is wrong because CSV, even with GZIP compression, is less efficient to query than a columnar format.

309

MCQmedium

A data scientist is performing EDA on a time-series dataset and observes a strong upward trend and seasonal patterns. The scientist needs to make the data stationary for modeling. Which transformation should be applied?

A.Apply one-hot encoding

B.Apply PCA

C.Apply min-max scaling

D.Apply differencing to the series

E.Apply logarithmic transformation

AnswerD

Differencing removes trends and seasonality, making the series stationary.

Why this answer

Differencing is a common technique to remove trends and seasonality to make a time series stationary. Option A is wrong because one-hot encoding is for categorical variables. Option B is wrong because PCA is for dimensionality reduction.

Option C is wrong because min-max scaling does not address trends. Option E is wrong because logarithmic transformation stabilizes variance but does not remove trends.
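
Differencing can be sketched in a few lines. The series below is synthetic (a linear trend plus a period-4 seasonal pattern), and the `difference` helper is illustrative rather than taken from any library.

```python
def difference(series, lag=1):
    """Return series[t] - series[t - lag] for every t where both terms exist."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]


# Hypothetical series: linear trend (2 per step) plus a period-4 seasonal pattern.
season = [0, 3, 1, -2]
series = [2 * t + season[t % 4] for t in range(16)]

detrended = difference(series, lag=1)          # first difference removes the trend
deseasonalized = difference(detrended, lag=4)  # seasonal difference removes the cycle

print(deseasonalized)  # [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
```

On real data the result would be noise around a constant level rather than exact zeros; the point is that neither the trend nor the seasonal cycle survives.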

310

MCQmedium

A company is using Amazon SageMaker to train a model on a dataset with many categorical features. They want to use SageMaker's built-in Linear Learner algorithm. What preprocessing step is required for the categorical features?

A.Apply one-hot encoding to convert them to numerical vectors.

B.Use label encoding to assign integers to categories.

C.Normalize the categorical features using min-max scaling.

D.Remove categorical features with high cardinality.

AnswerA

Linear models need numerical features; one-hot encoding is standard.

Why this answer

The SageMaker Linear Learner algorithm requires numerical input features. Categorical features must be converted to numerical vectors, typically via one-hot encoding, because the algorithm performs linear regression or classification on numerical data. Without this preprocessing, the algorithm cannot interpret categorical values directly.

Exam trap

The trap here is that candidates confuse label encoding (assigning integers) with one-hot encoding, assuming any numerical conversion suffices, but label encoding introduces false ordinality that degrades linear model performance.

How to eliminate wrong answers

Option B is wrong because label encoding assigns arbitrary integers to categories, which implies an ordinal relationship that can mislead the linear model into treating categories as ordered numerical values. Option C is wrong because normalization (min-max scaling) is a scaling technique for numerical features, not a method to convert categorical features to numerical form. Option D is wrong because removing high-cardinality categorical features is a data reduction strategy, not a required preprocessing step for the Linear Learner algorithm; the algorithm can handle one-hot encoded features regardless of cardinality.
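
One-hot encoding can be sketched without any library. The merchant categories below are invented for illustration, and sorting the category names is just one way to fix the column order.

```python
def one_hot_encode(values):
    """Map each categorical value to a 0/1 vector with a single 1."""
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1
        rows.append(row)
    return categories, rows


merchant_category = ["grocery", "fuel", "grocery", "travel"]  # hypothetical feature
columns, encoded = one_hot_encode(merchant_category)
print(columns)  # ['fuel', 'grocery', 'travel']
print(encoded)  # [[0, 1, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]]
```

No category is numerically "larger" than another in this representation, which is what avoids the false ordering that integer labels would introduce.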

311

MCQhard

A data scientist is exploring a large dataset (10 TB) stored in Amazon S3. The dataset is in CSV format and has many columns. The scientist wants to quickly compute summary statistics (mean, min, max, count) for each column without moving the data. Which approach is most cost-effective and efficient?

A.Import the data into Amazon SageMaker Data Wrangler

B.Launch an Amazon EMR cluster with Spark

C.Use S3 Select to compute statistics

D.Use Amazon Athena with SQL queries

E.Use AWS Glue DataBrew to profile the data

AnswerD

Athena queries data in place with no data movement and pay-per-query pricing.

Why this answer

Using Amazon Athena with SQL queries allows serverless querying of data in S3, and is cost-effective (pay per query). It can compute summary statistics directly on the data without moving it. Option A is wrong because SageMaker Data Wrangler requires importing data into SageMaker, which may incur transfer costs and time.

Option B is wrong because launching an EMR cluster adds overhead and cost for a simple task. Option C is wrong because S3 Select works on single objects and is limited. Option E is wrong because Glue DataBrew also processes data but may be more expensive for large datasets.

312

MCQmedium

A company is using Amazon SageMaker to train an XGBoost model on a large dataset. The training job is taking a long time. The data scientist wants to reduce training time without sacrificing model accuracy. The dataset is 100 GB in CSV format stored in S3. What is the most effective approach?

A.Reduce the number of instances to avoid communication overhead.

B.Use Pipe mode to stream data from S3 instead of downloading it first.

C.Use random sampling to reduce the dataset size to 10 GB.

D.Use SageMaker Managed Spot Training to reduce cost, but training time may increase due to interruptions.

AnswerB

Pipe mode reduces I/O time by streaming data directly to the algorithm.

Why this answer

Option B is correct because SageMaker's Pipe mode streams data directly from S3 to the training algorithm without writing it to disk, eliminating the I/O bottleneck of downloading the full 100 GB dataset. This reduces training time significantly by overlapping data loading with computation, while preserving model accuracy since the entire dataset is still used.

Exam trap

The trap here is that candidates often confuse cost optimization (Spot Training) with performance optimization, or incorrectly assume that reducing instances or data size is the only way to speed up training, ignoring SageMaker's specialized data streaming capability.

How to eliminate wrong answers

Option A is wrong because reducing the number of instances increases per-instance data load and can increase training time due to less parallelism, and communication overhead is negligible compared to I/O for large datasets. Option C is wrong because random sampling reduces dataset size, which sacrifices model accuracy by discarding potentially important data patterns, and the goal is to reduce time without sacrificing accuracy. Option D is wrong because SageMaker Managed Spot Training reduces cost, not training time; interruptions can actually increase training time due to checkpoint restarts, making it ineffective for the stated goal.

313

MCQeasy

A Lambda function is triggered by S3 events. The event payload shown in the exhibit is received by the Lambda function. The function is supposed to process the CSV file and load it into DynamoDB. However, the function fails because it cannot read the file. What is the MOST likely cause?

A.The Lambda function lacks DynamoDB write permissions

B.The Lambda function's IAM role does not have s3:GetObject permission

C.The S3 bucket does not exist

D.The S3 event notification is misconfigured

AnswerB

Without read permission, the function cannot access the S3 object.

Why this answer

Option B is correct. The Lambda function needs an IAM role with s3:GetObject permission to read the object. Option A is wrong because DynamoDB permissions are separate and do not affect reading the file.

Option C is wrong because the bucket exists. Option D is wrong because the event is valid.
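
The exhibit is not reproduced here, so the sketch below assumes the standard S3 event notification shape. It pulls out the bucket and key that the function would then read, and pairs them with a minimal policy statement; the bucket name in the policy is a placeholder.

```python
from urllib.parse import unquote_plus


def object_locations(event):
    """Yield (bucket, key) pairs from an S3 event notification payload."""
    for record in event.get("Records", []):
        s3 = record["s3"]
        # Keys arrive URL-encoded in the event, e.g. a space becomes '+'.
        yield s3["bucket"]["name"], unquote_plus(s3["object"]["key"])


# Minimal statement the function's execution role would need in order to read
# the uploaded objects (hypothetical bucket name).
READ_STATEMENT = {
    "Effect": "Allow",
    "Action": "s3:GetObject",
    "Resource": "arn:aws:s3:::example-upload-bucket/*",
}
```

Receiving the event does not by itself grant read access: each `(bucket, key)` pair still has to be fetched with a GetObject call, and that call is the one the role must be allowed to make.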

314

MCQhard

A company is designing a data pipeline to process log files from multiple sources. The logs are written to Amazon S3 every hour. The data is then transformed using AWS Glue ETL jobs and loaded into Amazon Redshift for analysis. The company needs to ensure that the data is available for analysis within 30 minutes of being written to S3. Currently, the Glue job is triggered hourly, but the company wants to reduce the latency. Which solution should the company implement?

A.Increase the frequency of the Glue crawler to run every 5 minutes

B.Use Amazon Redshift Spectrum to query the data directly from S3 without transformation

C.Use Amazon S3 event notifications to invoke an AWS Lambda function that starts the Glue job automatically

D.Reduce the Glue job trigger frequency to every 15 minutes

AnswerC

S3 events trigger Lambda immediately, which starts the Glue job with low latency.

Why this answer

Option C is correct because configuring an S3 event notification to invoke AWS Lambda, which starts the Glue job, allows near-real-time processing within minutes. Option A is wrong because increasing the crawler frequency does not trigger ETL jobs. Option B is wrong because Redshift Spectrum does not transform data.

Option D is wrong because a scheduled trigger still waits for the next scheduled run instead of reacting when a file arrives.
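
A minimal sketch of the event-driven approach might look like the handler below. The job name and the `--source_path` argument are hypothetical, and the optional `glue` parameter exists only so the function can be exercised with a stub instead of a real boto3 client.

```python
from urllib.parse import unquote_plus

JOB_NAME = "example-log-etl-job"  # hypothetical Glue job name


def handler(event, context, glue=None):
    """Start one Glue job run for every object reported in the S3 event."""
    if glue is None:
        import boto3  # imported lazily so a stub client can be injected in tests
        glue = boto3.client("glue")

    run_ids = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = unquote_plus(record["s3"]["object"]["key"])
        response = glue.start_job_run(
            JobName=JOB_NAME,
            # Hypothetical job argument telling the ETL script which file to process.
            Arguments={"--source_path": f"s3://{bucket}/{key}"},
        )
        run_ids.append(response["JobRunId"])
    return run_ids
```

The function does no heavy work itself; it only hands the new object's location to Glue, so latency is bounded by the job's start-up and run time rather than by a schedule.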

315

MCQmedium

A data scientist is analyzing a dataset with missing values in several columns. The dataset contains both numerical and categorical features. Which approach should the data scientist use to handle missing values while minimizing bias and preserving relationships in the data?

A.Use multiple imputation (e.g., MICE) to impute missing values

B.Use forward-fill to propagate the last observed value

C.Delete all rows with missing values

D.Replace missing values with the mean or median of each column

AnswerA

MICE models each variable as a function of others, preserving relationships and reducing bias.

Why this answer

Option A is correct because MICE (Multiple Imputation by Chained Equations) models each variable with missing values as a function of other variables, preserving relationships. Option B is wrong because forward-fill is for time series. Option C is wrong because listwise deletion can introduce bias.

Option D is wrong because mean/median imputation reduces variance.

316

MCQeasy

A company needs to ingest real-time clickstream data from thousands of web servers into AWS for near-real-time analytics. The data volume varies and can spike during promotions. Which service should be used to capture and buffer the data before processing?

A.Amazon SQS

B.Amazon Kinesis Data Firehose

C.Amazon Kinesis Data Streams

D.Amazon MQ

AnswerC

Kinesis Data Streams provides a durable buffer for real-time data, enabling multiple consumers.

Why this answer

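Amazon Kinesis Data Streams is designed for real-time data ingestion and durably buffers records for 24 hours by default, with retention extendable up to 365 days. In on-demand capacity mode it scales automatically to absorb spikes, and it integrates with Kinesis Data Analytics and Lambda.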

317

Multi-Selecthard

Which THREE factors should be considered when choosing between SageMaker built-in algorithms and custom algorithms? (Choose THREE.)

Select 3 answers

A.Custom algorithms allow you to implement any architecture, including proprietary ones

B.Built-in algorithms are optimized for distributed training

C.Built-in algorithms can only be used with CSV and JSON formats

D.Custom algorithms require you to bring your own Docker container, but SageMaker built-in algorithms do not support frameworks like PyTorch

E.Built-in algorithms have predefined hyperparameters that may not fit all use cases

AnswersA, B, E

Custom algorithms offer full flexibility.

Why this answer

Option A is correct because custom algorithms in SageMaker allow you to implement any architecture, including proprietary or novel models that are not available as built-in algorithms. This flexibility is essential when you need to use a custom neural network, a unique loss function, or a model from a research paper that SageMaker does not natively support.

Exam trap

The trap here is that candidates often assume built-in algorithms are limited to CSV/JSON formats and do not support popular frameworks like PyTorch, when in fact SageMaker provides optimized built-in framework containers for PyTorch, TensorFlow, and others, and built-in algorithms support a wide variety of data formats.

318

MCQmedium

A data science team is building a real-time fraud detection system. Transactions are streamed via Amazon Kinesis Data Streams, and a Lambda function performs feature engineering and invokes an Amazon SageMaker endpoint for predictions. The team notices that the Lambda function is timing out and causing data loss. Which solution should the team implement to process the stream reliably and at low latency?

A.Use Amazon Kinesis Data Analytics for Apache Flink to consume the stream, perform feature engineering, and invoke the SageMaker endpoint with exactly-once processing.

B.Use the Kinesis Client Library (KCL) to process the stream in an Amazon EC2 instance, and store the predictions in Amazon DynamoDB.

C.Increase the Lambda function timeout to 15 minutes and allocate more memory to reduce processing time.

D.Configure Amazon Kinesis Firehose to deliver the stream to an Amazon S3 bucket, then trigger a Lambda function to process the data in batches.

AnswerA

Kinesis Data Analytics provides stateful stream processing with checkpointing, ensuring no data loss and low-latency integration with SageMaker.

Why this answer

Option A is correct because Amazon Kinesis Data Analytics for Apache Flink provides a stateful, low-latency stream processing engine that can consume from Kinesis Data Streams, perform feature engineering in real-time, and invoke SageMaker endpoints with exactly-once processing semantics. This eliminates Lambda timeouts and data loss by using a long-running, scalable application instead of a short-lived function.

Exam trap

The trap here is that candidates often assume increasing Lambda resources (timeout/memory) or moving to a batch-based approach (Firehose/S3) can solve real-time streaming issues, but the exam tests the understanding that stateful, long-running stream processing engines like Flink are required for reliable, low-latency, exactly-once processing in production.

How to eliminate wrong answers

Option B is wrong because using the Kinesis Client Library (KCL) on an EC2 instance requires manual management of scaling, fault tolerance, and checkpointing, and does not natively integrate with SageMaker for low-latency predictions; it also adds operational overhead and potential for data loss if the instance fails. Option C is wrong because increasing the Lambda timeout to 15 minutes and allocating more memory only masks the underlying issue of Lambda's 15-minute maximum execution time and does not address the fundamental problem of stream processing at scale; Lambda is not designed for long-running, stateful stream processing and can still lose data if the function fails or throttles. Option D is wrong because Amazon Kinesis Firehose delivers data in batches to S3, which introduces significant latency (typically minutes) and is not suitable for real-time fraud detection; triggering a Lambda on S3 objects adds further delay and does not provide low-latency, per-record processing.

319

MCQhard

A data engineer has attached the above IAM policy to an IAM role used by an AWS Glue ETL job. The job reads from and writes to 'my-data-bucket'. The job is failing with an Access Denied error. What is the most likely cause?

A.The condition restricts access to a specific IP range that does not include the AWS Glue service IPs.

B.The IAM role needs to have s3:ListBucket permission.

C.The IAM role does not have permission to list the bucket.

D.The resource ARN should include the bucket itself, not just the objects.

AnswerA

The condition requires the request source IP to be in 10.0.0.0/24, but Glue's IPs are different.

Why this answer

The policy restricts access to requests originating from the IP range 10.0.0.0/24. AWS Glue jobs run in a VPC that uses private IPs, but the source IP condition is evaluated based on the IP address of the Glue service principal, which is not within that range. The condition should be removed or modified to allow access from the Glue service.

320

MCQmedium

A company uses Amazon Kinesis Data Firehose to deliver streaming data to Amazon S3. They notice that the data is delivered in 5-minute intervals even though they set the buffer interval to 60 seconds. What could be the cause?

A.The source Kinesis stream has insufficient shards.

B.The buffer size is set to a value larger than the incoming data rate.

C.The S3 bucket is in a different region.

D.The IAM role does not have permission to write to S3.

AnswerB

If the buffer size is large and data rate low, Firehose waits longer.

Why this answer

Firehose has a minimum buffer interval of 60 seconds and a maximum of 900 seconds. The actual delivery interval is controlled by both buffer size and interval; if the buffer size is not reached, Firehose waits up to the maximum interval. The default buffer size is 5 MB.

If data rate is low, Firehose will wait 5 minutes (default max interval) unless the buffer size is lowered.

321

Multi-Selecthard

A machine learning engineer is evaluating a multi-class classification model that predicts product categories. The model outputs probabilities for 10 classes. The engineer wants to improve the model's calibration so that the predicted probabilities reflect the true likelihood of each class. Which THREE techniques can help?

Select 3 answers

A.Use temperature scaling

B.Apply isotonic regression

C.Increase model complexity

D.Apply Platt scaling

E.Use focal loss

AnswersA, B, D

Temperature scaling adjusts the softmax temperature to improve calibration for neural networks.

Why this answer

Platt scaling and isotonic regression are common calibration methods for classification models. Temperature scaling is a variant of Platt scaling for neural networks. Using a different loss function like cross-entropy helps but is not a calibration technique per se.
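
Temperature scaling can be sketched as a one-parameter search on held-out data. The logits and labels below are invented, and the grid search over candidate temperatures is a simplification; in practice the temperature is usually fitted with a gradient-based optimizer.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax over logits divided by a temperature (T > 1 softens the output)."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def nll(logit_rows, labels, temperature):
    """Mean negative log-likelihood of the true labels at a given temperature."""
    losses = [-math.log(softmax(row, temperature)[y]) for row, y in zip(logit_rows, labels)]
    return sum(losses) / len(losses)


def fit_temperature(logit_rows, labels, candidates=None):
    """Pick the candidate temperature that minimizes validation NLL."""
    candidates = candidates or [0.5 + 0.1 * i for i in range(46)]  # 0.5 .. 5.0
    return min(candidates, key=lambda t: nll(logit_rows, labels, t))


# Hypothetical validation set: very confident logits, one of four predictions wrong.
val_logits = [[8.0, 0.0, 0.0], [8.0, 0.0, 0.0], [0.0, 8.0, 0.0], [0.0, 0.0, 8.0]]
val_labels = [0, 1, 1, 2]

best_t = fit_temperature(val_logits, val_labels)
print(best_t, softmax(val_logits[0], best_t))
```

Because every logit is divided by the same positive constant, the predicted class never changes; only the confidence attached to it does.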

322

MCQmedium

A media company uses SageMaker to deploy a real-time inference endpoint for content recommendation. The model is a PyTorch model that uses GPU. The endpoint is deployed with an ml.p3.2xlarge instance. Over time, the endpoint's latency increases significantly during peak hours. The company has enabled auto scaling based on CPU utilization. However, the latency spikes occur even when CPU utilization is low. The model is stateless and the inference code is efficient. What is the MOST likely cause of the latency spikes?

A.The model uses stateful processing that accumulates requests

B.Auto scaling is configured based on CPU utilization, but the bottleneck is GPU utilization

C.The inference container has a memory leak that causes gradual slowdown

D.The instance type is too small for the model

AnswerB

GPU metrics should be used for auto scaling.

Why this answer

Option B is correct because GPU utilization is the correct metric for GPU-based models. Auto scaling based on CPU utilization does not capture GPU load. Option A is wrong because the model is stateless.

Option C is wrong because there is no mention of memory issues. Option D is wrong because the model is already on GPU.

323

MCQeasy

Refer to the exhibit. A data scientist examines a sample of data and notices that all columns are numeric. The scientist wants to check for multicollinearity. Which statistic should be computed from this sample?

A.Correlation matrix (Pearson)

B.Chi-square test of independence

C.Variance Inflation Factor (VIF)

D.Covariance matrix

AnswerA

A correlation matrix can reveal high pairwise correlations.

Why this answer

Option A is correct because the correlation matrix shows pairwise Pearson correlations, which can indicate high collinearity. Option B is wrong because chi-square is for categorical variables. Option C is wrong because VIF requires fitting a regression for each feature, which needs more observations than variables.

Option D is wrong because covariance alone is scale-dependent.
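
Since the exhibit is not reproduced here, the sketch below uses a made-up three-column sample to show how a Pearson correlation matrix exposes collinear features. The helper functions are illustrative plain Python, not a library API.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def correlation_matrix(columns):
    """Return a nested dict of pairwise Pearson correlations between columns."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Hypothetical sample: the same height recorded in two units, plus weight.
sample = {
    "height_cm": [150.0, 160.0, 170.0, 180.0, 190.0],
    "height_in": [h / 2.54 for h in [150.0, 160.0, 170.0, 180.0, 190.0]],
    "weight_kg": [55.0, 70.0, 62.0, 90.0, 78.0],
}
matrix = correlation_matrix(sample)
```

Here `matrix["height_cm"]["height_in"]` comes out at essentially 1.0 even though the two columns use different units, which is the scale independence that a raw covariance matrix lacks.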

324

Multi-Selecteasy

Which TWO techniques can help reduce overfitting in a decision tree model?

Select 2 answers

A.Increase the number of trees in the forest

B.Increase the number of features considered per split

C.Limit the maximum depth of the tree

D.Prune the tree after training

E.Increase the maximum depth of the tree

AnswersC, D

Shallower trees generalize better.

Why this answer

Limiting the maximum depth of the tree (Option C) directly restricts the number of splits, preventing the model from learning overly specific patterns in the training data. Pruning the tree after training (Option D) removes branches that have little predictive power, reducing variance and improving generalization. Both techniques combat overfitting by controlling the complexity of the decision tree.

Exam trap

AWS often tests the distinction between techniques that reduce overfitting in a single decision tree versus ensemble methods, so candidates mistakenly apply Random Forest concepts (like increasing trees or features) to a standalone tree.

325

MCQeasy

A team needs to automatically retrain a model every week using new data. Which SageMaker feature is designed to schedule and automate this workflow?

A.SageMaker Pipelines

B.SageMaker Automatic Model Tuning

C.SageMaker Model Monitor

D.SageMaker Data Wrangler

AnswerA

Pipelines can define and schedule training workflows.

Why this answer

SageMaker Pipelines allows building end-to-end ML workflows with scheduling. Option A is correct. Option B is for automatic model tuning.

Option C is for model monitoring. Option D is for feature engineering.

326

MCQhard

A company is using Amazon SageMaker to train a deep learning model for image classification. The training job is using a single p3.2xlarge instance and takes 10 hours. The data scientist wants to reduce training time using distributed training. Which SageMaker feature should be used?

A.Use the SageMaker distributed data parallelism library with multiple p3.2xlarge instances.

B.Use SageMaker Managed Spot Training to reduce cost, but training time remains the same.

C.Use SageMaker Hyperparameter Tuning to find optimal hyperparameters faster.

D.Use the SageMaker distributed model parallelism library with a single p3dn.24xlarge instance.

AnswerA

Data parallelism divides the batch across GPUs and synchronizes gradients, scaling training.

Why this answer

Option A is correct because SageMaker's distributed data parallelism library splits each batch across GPUs on multiple instances and synchronizes gradients, reducing training time. Option B is wrong because Managed Spot Training saves cost but does not reduce training time. Option C is wrong because Hyperparameter Tuning does not distribute a single training job.

Option D is wrong because model parallelism is for models too large to fit on a single GPU, which is not the problem described here.

Full explanation →

327

MCQmedium

A company uses Amazon Kinesis Data Firehose to deliver streaming data to Amazon S3. The data is in JSON format, and the company wants to convert it to Parquet for efficient querying. Which configuration should be used?

A.Enable data transformation in Firehose using an AWS Lambda function to convert JSON to Parquet, and set the output format to Parquet.

B.Use an AWS Glue job to convert the JSON files in S3 to Parquet after delivery.

C.Use Amazon Kinesis Data Analytics to convert the stream to Parquet before sending to Firehose.

D.Configure Firehose to deliver data directly to Amazon Redshift, which automatically converts to Parquet.

AnswerA

Firehose can invoke a Lambda function for transformation and write Parquet to S3.

Why this answer

Option A is correct because Firehose can invoke a Lambda function to transform records and can write Parquet to S3 through its output format setting. Option B is wrong because a Glue job that runs after delivery adds a second processing step and extra delay instead of converting the data inside the delivery stream. Option C is wrong because Kinesis Data Analytics does not convert to Parquet for S3 delivery.

Option D is wrong because Redshift is a data warehouse destination; delivering to it does not produce Parquet files in S3.

Full explanation →

328

MCQeasy

A data engineer needs to set up a data pipeline that ingests data from an Amazon RDS MySQL database into Amazon S3. The pipeline should run daily and capture incremental changes (inserts, updates, deletes) from the source database. Which AWS service should be used as the data ingestion tool?

A.AWS Database Migration Service (DMS) with continuous change data capture (CDC).

B.Amazon Kinesis Data Streams with a Lambda function.

C.AWS Data Pipeline with a SQL activity.

D.AWS Glue with a scheduled crawler.

AnswerA

DMS with CDC can capture incremental changes.

Why this answer

AWS Database Migration Service (DMS) supports continuous replication and captures inserts, updates, and deletes using CDC, so Option A is correct. Option B (Kinesis Data Streams with a Lambda function) is built for streaming events, not for database replication.

Option C (Data Pipeline) is older and less suited for CDC. Option D (Glue with a scheduled crawler) can run in batch but has no native CDC.

Full explanation →

329

MCQeasy

A startup is building a data pipeline that ingests data from multiple sources into an Amazon S3 data lake. The data includes CSV files from legacy systems, JSON from web APIs, and Avro from mobile apps. The data must be transformed into Parquet format and cataloged for querying with Amazon Athena. The pipeline must be serverless and minimize operational overhead. The team has decided to use AWS Glue for ETL and cataloging. However, they are concerned about the cost of running Glue jobs continuously. The data arrives in small batches every 10 minutes. Which approach should the team use to minimize cost while meeting the requirements?

A.Use AWS Lambda functions to transform each file upon arrival and store as Parquet

B.Use Amazon Kinesis Data Firehose to stream data directly into S3 and use Glue to catalog it

C.Use scheduled Glue jobs to process the data every hour, consolidating multiple batches

D.Use a single daily Glue job to process all data at once

AnswerC

Hourly batch processing balances cost and latency.

Why this answer

Running Glue jobs on a schedule (e.g., every hour) to process the accumulated data reduces the number of job runs and the cost, while keeping latency to about an hour. Option A is wrong because transforming files with Lambda is limited by timeout and memory. Option B is wrong because continuous streaming with Firehose may not handle all source formats.

Option D is wrong because running a single daily job may introduce too much latency.
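The trade-off between the three schedules can be put into numbers. The sketch below only counts job runs per day and the worst-case wait before a batch is processed; it makes no assumption about Glue pricing, and the function name is hypothetical.

```python
MINUTES_PER_DAY = 24 * 60

def runs_per_day(interval_minutes: int) -> int:
    """Number of job runs per day when a job starts every interval_minutes."""
    return MINUTES_PER_DAY // interval_minutes

# Intervals taken from the question: one run per 10-minute batch,
# one run per hour, and one run per day.
per_batch_runs = runs_per_day(10)
hourly_runs = runs_per_day(60)
daily_runs = runs_per_day(MINUTES_PER_DAY)

# Batches consolidated into each hourly run (data arrives every 10 minutes).
batches_per_hourly_run = 60 // 10

print(per_batch_runs, hourly_runs, daily_runs, batches_per_hourly_run)
```

Hourly runs cut the number of job starts by a factor of six compared with one run per batch, at the price of data waiting up to an hour instead of up to a day.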

Full explanation →

330

MCQhard

A company is running a real-time inference endpoint on Amazon SageMaker. The endpoint is using an ml.c5.xlarge instance. Over the past month, the CPU utilization has been consistently below 10%, and the latency is well within requirements. The company wants to reduce costs. What should they do?

A.Use a smaller instance type

B.Set up a scaling policy to scale down to zero

C.Switch to a multi-model endpoint

D.Use a batch transform job instead

E.Move to a serverless inference endpoint

AnswerA

A smaller instance can reduce cost while meeting performance.

Why this answer

Option A is correct because a smaller instance type (e.g., ml.c5.large) can handle the load at lower cost. Option B (scaling policy) is not needed when utilization is consistently low; the instance is simply oversized. Option C (multi-model endpoint) can reduce cost by sharing an instance among models, but it offers no benefit when there is only one model.

Option D (batch transform) is for offline inference. Option E (serverless) may have cold starts and is suited to sporadic traffic, not constant low utilization.

Full explanation →

331

MCQhard

A data scientist is performing EDA on a dataset with missing values in 3 of 20 features. The missing rate is 5% for each feature. The scientist wants to preserve as much data as possible while avoiding bias. Which imputation strategy is most appropriate?

A.Remove rows with any missing values.

B.Impute missing values with the mean of each feature.

C.Use K-Nearest Neighbors (KNN) imputation.

D.Impute missing values with the median of each feature.

AnswerD

Median is robust and retains data.

Why this answer

Option D is correct because median imputation is robust to outliers and preserves the dataset size. Option A is wrong because dropping rows with any missing value would lose about 14% of the data (1 - 0.95^3 ≈ 0.14, if missingness is independent across the three features). Option B is wrong because mean imputation can be affected by outliers.

Option C is wrong because KNN imputation may introduce bias and is computationally expensive.
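The 14% figure can be reproduced with a few lines of arithmetic. This is a minimal sketch that assumes missingness is independent across the three features, which the question does not state; if the same rows are missing in all three features, the loss would be closer to 5%.

```python
# Fraction of rows lost by dropping every row with at least one missing value.
# Assumption (not given in the question): missingness is independent across features.
missing_rate = 0.05        # per-feature missing rate from the question
features_with_missing = 3  # number of features that contain missing values

complete_fraction = (1 - missing_rate) ** features_with_missing
rows_lost_fraction = 1 - complete_fraction

print(f"{rows_lost_fraction:.1%}")
```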

Full explanation →

332

MCQhard

A company is using Amazon SageMaker Ground Truth to create labeled datasets for a text classification task. The labeling job uses a private workforce of 10 annotators. After labeling 10,000 items, the quality of labels is inconsistent. Which approach will MOST effectively improve labeling consistency?

A.Remove annotations from annotators with low agreement after the job completes.

B.Increase the number of annotators to 20 to average out inconsistencies.

C.Configure the labeling job to use annotation consolidation with majority voting and require multiple annotations per item.

D.Use active learning to automatically label the most confident samples and only send uncertain ones to annotators.

AnswerC

Consensus from multiple annotators and majority voting yields more consistent labels.

Why this answer

Option C is correct because collecting multiple annotations per item and consolidating them with majority voting reduces individual annotator bias. Option A is wrong because removing low-agreement annotations after labeling may discard valid data. Option B is wrong because increasing workforce size does not directly improve consistency.

Option D is wrong because active learning selects examples for labeling, but does not improve annotator consistency.

Full explanation →

333

MCQeasy

A data scientist is deploying a model using Amazon SageMaker. The model endpoint needs to handle real-time inference requests with low latency. The model is a large ensemble of 10 deep learning models, each approximately 500 MB. What is the most cost-effective deployment strategy that meets the low-latency requirement?

A.Deploy each model to a separate endpoint and use a load balancer.

B.Use a single endpoint with multiple instances behind it.

C.Use a SageMaker batch transform job to process inference requests in batches.

D.Use a SageMaker multi-model endpoint to host all models on one or more instances.

AnswerD

Multi-model endpoints efficiently host multiple models on shared instances, reducing cost.

Why this answer

A SageMaker multi-model endpoint (MME) allows hosting multiple models on a single or few instances, dynamically loading them from Amazon S3 into memory as needed. This is the most cost-effective option for a large ensemble of 500 MB models because it avoids the expense of separate endpoints or multiple instances per model, while still supporting low-latency real-time inference by keeping frequently used models cached.

Exam trap

The trap here is that candidates may confuse multi-model endpoints with multi-container endpoints or assume that a single endpoint cannot host multiple models, leading them to choose the expensive separate-endpoint approach (Option A) or the memory-inefficient single-endpoint approach (Option B).

How to eliminate wrong answers

Option A is wrong because deploying each model to a separate endpoint and using a load balancer would incur high costs (10 endpoints × instance costs) and add network latency from the load balancer, making it neither cost-effective nor optimal for low latency. Option B is wrong because a single endpoint with multiple instances behind it would require all 10 models to be loaded on every instance, consuming excessive memory (5 GB per instance) and increasing cost without leveraging model-sharing efficiencies. Option C is wrong because SageMaker batch transform is designed for asynchronous, offline inference on large datasets, not for real-time requests, and would introduce unacceptable latency for live inference.

Full explanation →

334

MCQmedium

A data scientist is analyzing a dataset with missing values in 30% of the rows for the 'age' column. The data scientist decides to impute the missing values with the median of the observed 'age' values. What is a potential drawback of this approach?

A.The imputation will introduce bias if the missing values are not random.

B.Imputation using median is computationally expensive for large datasets.

C.The imputed values may reduce the variance of the 'age' distribution.

D.The imputed values will increase the variance of the feature, leading to overfitting.

AnswerC

Replacing missing values with a constant reduces the variability of the feature.

Why this answer

Imputing missing values with the median of the observed data artificially concentrates imputed values around the center of the distribution. This reduces the overall variance of the 'age' column because the imputed values do not reflect the natural spread of the data, potentially distorting downstream analyses like regression or clustering that rely on variance structure.
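The variance reduction is easy to see on a toy column. The ages below are made up for illustration (30% missing, as in the question); only the Python standard library is used.

```python
import statistics

# Hypothetical 'age' column; None marks a missing value (3 of 10 rows).
ages = [22, 25, 31, 38, 44, 52, 61, None, None, None]

observed = [a for a in ages if a is not None]
median_age = statistics.median(observed)

# Replace every missing entry with the same constant: the observed median.
imputed = [median_age if a is None else a for a in ages]

var_observed = statistics.pvariance(observed)
var_imputed = statistics.pvariance(imputed)

print(median_age, round(var_observed, 2), round(var_imputed, 2))
```

Every imputed row sits exactly at the median, so it adds nothing to the spread while still counting toward the number of rows, and the variance of the filled column falls below that of the observed values.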

Exam trap

AWS often tests the subtle distinction between bias (which is a general risk of any imputation under non-random missingness) and variance reduction (which is a specific, guaranteed statistical consequence of constant-value imputation).

How to eliminate wrong answers

Option A is wrong because while imputation can introduce bias if data are not missing at random (MNAR), the question specifically asks about a drawback of using median imputation; the bias concern is not unique to median imputation and is a general risk of any imputation method under MNAR, not the primary technical drawback described. Option B is wrong because computing the median is O(n) with efficient algorithms and is not computationally expensive even for large datasets; mean or median imputation is among the cheapest imputation methods. Option D is wrong because median imputation reduces variance, not increases it; increased variance would be a concern with methods like mean imputation with added noise, not with simple median imputation.

Full explanation →

335

MCQeasy

A data engineer needs to transfer 50 TB of historical data from an on-premises HDFS cluster to Amazon S3. The company has a 1 Gbps internet connection. Which service would complete the transfer in the shortest time?

A.AWS Snowball

B.Amazon S3 Transfer Acceleration

C.AWS Direct Connect

D.AWS DataSync

AnswerA

Snowball can transfer 50 TB physically in days.

Why this answer

AWS Snowball is a physical device that moves large amounts of data by shipment, avoiding the bandwidth limits of a network transfer. DataSync is for network transfers. Direct Connect helps but is still limited by bandwidth.

S3 Transfer Acceleration speeds up internet transfers but cannot match physical shipment for 50 TB.
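For comparison, the time the network path would need can be estimated directly. This is a rough sketch: the `utilization` parameter is an assumption (the question does not say how much of the 1 Gbps link is free for the transfer), and terabytes are taken as decimal (10^12 bytes).

```python
def transfer_days(size_tb: float, link_gbps: float, utilization: float = 1.0) -> float:
    """Days needed to move size_tb terabytes over a link of link_gbps gigabits/s."""
    bits = size_tb * 1e12 * 8
    seconds = bits / (link_gbps * 1e9 * utilization)
    return seconds / 86_400  # seconds per day

best_case = transfer_days(50, 1.0)         # link fully dedicated to the transfer
shared_link = transfer_days(50, 1.0, 0.5)  # hypothetical: only half the link is usable

print(round(best_case, 1), round(shared_link, 1))
```

The network estimate grows quickly once the link is shared with production traffic, which is the situation where shipping a device becomes the faster path.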

Full explanation →

336

Multi-Selectmedium

A company is migrating on-premises data to AWS. The data includes both structured and unstructured files, totaling 200 TB. The company has a 1 Gbps dedicated network connection to AWS. They want to minimize migration time and cost. Which TWO AWS services or features should they use together? (Choose two.)

Select 2 answers

A.AWS Snowball Edge

B.Amazon S3 Transfer Acceleration

C.AWS Glue

D.AWS DataSync

E.AWS Direct Connect

AnswersA, D

Snowball Edge can be used to transfer large amounts of data physically, reducing network load.

Why this answer

Correct options: A and D. AWS DataSync can efficiently transfer data over the network to S3, and Snowball Edge can be used for the largest files to reduce network load. Option E (Direct Connect) is already implied by the 1 Gbps dedicated connection; it is not an additional service.

Option B (S3 Transfer Acceleration) speeds up transfers over the public internet and is not needed over a dedicated connection. Option C (Glue) is used for ETL, not for bulk data transfer.

Full explanation →

337

MCQhard

A company uses AWS Glue to run ETL jobs on a daily schedule. The jobs are failing intermittently with 'OutOfMemory' errors. The data volume has grown 5x over the past month. Which is the MOST cost-effective fix?

A.Increase the number of partitions in the source S3 data

B.Increase the number of DPUs for the Glue job

C.Reduce the data volume by sampling

D.Switch from AWS Glue to Amazon EMR

AnswerB

More DPUs provide more memory and parallelism.

Why this answer

Increasing the number of DPUs (data processing units) in the Glue job configuration provides more memory and parallelism cost-effectively. Switching to EMR is more expensive and complex. Reducing data is not a solution.

Increasing S3 partitions does not affect memory.

Full explanation →

338

MCQhard

A data science team is deploying a machine learning model to production using SageMaker. The model is a PyTorch model that requires custom inference logic including image preprocessing. The team needs to ensure that the endpoint can handle variable batch sizes and has low latency. Which deployment approach should the team use?

A.Use SageMaker Inference Pipelines with a preprocessing container followed by the PyTorch model container.

B.Deploy the model as an AWS Lambda function and use API Gateway.

C.Use the SageMaker Python SDK's Predictor class with the model artifact.

D.Use a SageMaker multi-model endpoint to host the model with a custom container.

E.Use SageMaker Batch Transform for real-time inference.

AnswerA

Inference Pipelines allow custom preprocessing and model serving with low latency.

Why this answer

Option A is correct because SageMaker Inference Pipelines chain a preprocessing container and a prediction container behind one endpoint, enabling custom logic and efficient batching. Option B (Lambda) is not designed for real-time inference with large models. Option C (the SageMaker Python SDK's Predictor class) is a client for calling an endpoint, not a deployment approach.

Option D (multi-model endpoints) is for hosting multiple models, not custom inference logic. Option E (SageMaker Batch Transform) is for offline batch inference, not real-time.

Full explanation →

339

Multi-Selecteasy

A data scientist is using Amazon SageMaker to train a linear regression model. The training data contains missing values. Which TWO techniques are appropriate for handling missing values in the dataset?

Select 2 answers

A.Use a decision tree model that can handle missing values internally

B.Set missing values to zero

C.Remove rows with missing values if the proportion is small

D.Impute missing values with the mean of the column

E.Create a separate category for missing values

AnswersC, D

If a small fraction of rows have missing values, removing them is acceptable.

Why this answer

Options C (remove rows with missing values if the proportion is small) and D (impute missing values with the mean of the column) are correct. Imputation is a common technique, and removing rows is acceptable when few rows are affected. Option B (set missing values to zero) can bias the model.

Option A (use a decision tree that handles missing values internally) does not apply because the model is a linear regression. Option E (create a separate category for missing values) is not suitable for the numeric features of a linear regression.

Full explanation →

340

Multi-Selecteasy

Which TWO AWS services can be used to visualize data distributions as part of exploratory data analysis? (Select TWO.)

Select 2 answers

A.AWS Glue

B.Amazon QuickSight

C.Amazon Athena

D.Amazon Comprehend

E.Amazon SageMaker Data Wrangler

AnswersB, E

QuickSight provides interactive dashboards and visualizations.

Why this answer

Amazon QuickSight is a cloud-native business intelligence service that can visualize data distributions through histograms, box plots, scatter plots, and other chart types, making it suitable for exploratory data analysis. Amazon SageMaker Data Wrangler provides a visual interface to create data distribution charts (e.g., histograms, bar charts) directly within the data preparation workflow, enabling quick inspection of feature distributions before model building.

Exam trap

AWS often tests the misconception that AWS Glue or Athena can visualize data distributions because they are used in data preparation or querying, but neither provides native charting or plotting capabilities; they only return raw data or tabular results.

Full explanation →

341

Multi-Selecthard

An ML team trains a deep learning model using Amazon SageMaker with a custom Docker container. Training completes successfully, but the model's accuracy on the test set is significantly lower than expected. The team suspects overfitting. Which two actions should they take to mitigate overfitting? (Choose TWO.)

Select 2 answers

A.Use data augmentation

B.Add dropout layers

C.Reduce the batch size

D.Increase the number of layers

E.Increase the number of training epochs

AnswersA, B

Data augmentation increases effective training data size, reducing overfitting.

Why this answer

Data augmentation (Option A) and dropout (Option B) are effective regularization techniques to reduce overfitting. Option E (increasing epochs) would worsen overfitting. Option C (reducing batch size) can introduce noise but is not a primary regularization method.

Option D (adding more layers) increases model capacity, likely worsening overfitting.

Full explanation →

342

MCQmedium

A company runs a nightly batch job that reads data from Amazon RDS for PostgreSQL, transforms it using AWS Glue, and writes the output to Amazon S3 in Parquet format. The job takes 2 hours to complete, but the data volume has grown, and the job now takes 4 hours, exceeding the allowed window. The team needs to reduce the job duration without increasing cost. Which action is MOST effective?

A.Increase the number of Glue DPUs (Data Processing Units) allocated to the job.

B.Enable AWS Glue job bookmark to skip already processed data.

C.Change the output format from Parquet to CSV to reduce write time.

D.Partition the output S3 data by a high-cardinality column used in filtering during transformation.

AnswerD

Partitioning can reduce data shuffling and improve write performance.

Why this answer

Option D is correct because it is the only choice that does not increase cost and may still shorten the job: partitioning the Parquet output by a column that is frequently used for filtering can improve write performance and reduces the data that downstream steps must read. The gain applies mainly to the write stage, so it may not drastically reduce transformation time; optimizing the transformation logic would be the more direct fix, but it is not among the options. Option A is wrong because increasing the number of DPUs adds parallelism but also increases cost, which the team wants to avoid.

Option B is wrong because a job bookmark helps incremental loads but does not speed up a full load. Option C is wrong because converting to CSV may increase output size and processing time.

Full explanation →

343

Multi-Selectmedium

A data scientist is analyzing a dataset and finds that two features have a Pearson correlation coefficient of 0.95. Which TWO actions should the data scientist consider? (Choose two.)

Select 2 answers

A.Combine the two features into a single feature using PCA or averaging

B.Add interaction terms between the features

C.Increase regularization strength in the model

D.Remove one of the correlated features

E.Apply standard scaling to both features

AnswersA, D

Combining captures information from both while reducing dimensionality.

Why this answer

Options A and D are correct. High correlation can lead to multicollinearity, so combining the two features (A) or removing one of them (D) are valid approaches. Option C is wrong because increasing regularization can dampen the effects of multicollinearity but does not directly address the correlation.

Option E is wrong because scaling does not affect correlation. Option B is wrong because adding interaction terms can increase multicollinearity.
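The claim that standard scaling leaves the correlation unchanged can be checked numerically. The two feature vectors below are made up for illustration; the helper functions are written out with the standard library rather than taken from any ML package.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)

def standardize(values):
    """Standard scaling: subtract the mean, divide by the standard deviation."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / std for v in values]

# Hypothetical, strongly correlated features.
feature_a = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
feature_b = [2.1, 3.9, 6.2, 8.1, 9.8, 12.3]

r_raw = pearson(feature_a, feature_b)
r_scaled = pearson(standardize(feature_a), standardize(feature_b))

print(round(r_raw, 4), round(r_scaled, 4))
```

Both values agree up to floating-point error, because Pearson correlation is invariant to shifting and positively rescaling either variable.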

Full explanation →

344

MCQmedium

A company is using Amazon SageMaker to train a deep learning model for image segmentation. The training job uses a single ml.p3.2xlarge instance and takes 48 hours to complete. The team needs to reduce training time to under 12 hours to meet a deadline. The dataset is 50 GB of images stored in S3. The team currently uses File mode to download the data to the training instance. The model architecture is a convolutional neural network (CNN) with 50 layers. The team has access to multiple instances of the same type. Which approach will most effectively reduce training time?

A.Reduce the number of layers in the CNN to speed up training.

B.Increase the batch size on the single instance to process more data per iteration.

C.Use SageMaker's distributed data parallelism with multiple instances.

D.Switch to Pipe mode to stream data from S3, reducing data loading time.

AnswerC

Distributed training across instances parallelizes computation and can achieve near-linear speedup.

Why this answer

Option C is correct because SageMaker's distributed data parallelism splits the 50 GB dataset across multiple ml.p3.2xlarge instances, allowing each instance to process a subset of the data in parallel. This can reduce training time from 48 hours to under 12 hours, assuming near-linear scaling with the number of instances (e.g., 4 instances for a 4x speedup). The approach directly addresses the need to reduce wall-clock time without altering the model architecture or data loading method.
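The instance count follows from simple arithmetic. The sketch below uses a deliberately crude model in which every instance contributes the same fraction of ideal throughput; the 75% efficiency value is a hypothetical figure, not a measured one, and the function name is made up.

```python
import math

def instances_needed(current_hours: float, target_hours: float,
                     scaling_efficiency: float = 1.0) -> int:
    """Smallest n such that current_hours / (n * scaling_efficiency) <= target_hours."""
    return math.ceil(current_hours / (target_hours * scaling_efficiency))

ideal = instances_needed(48, 12)                 # perfect linear scaling
with_overhead = instances_needed(48, 12, 0.75)   # hypothetical 75% scaling efficiency

print(ideal, with_overhead)
```

Under perfect scaling four instances reach the 12-hour target; with communication overhead the team would need to provision more than four.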

Exam trap

The trap here is that candidates may confuse data loading optimization (Pipe mode) with compute parallelism, overlooking that the 48-hour bottleneck is GPU compute time, not I/O, and that distributed training is the only viable method to achieve a 4x speedup without altering the model.

How to eliminate wrong answers

Option A is wrong because reducing the number of layers in the CNN would degrade model accuracy for image segmentation, and the goal is to reduce training time without compromising model quality; it also does not leverage the available multiple instances. Option B is wrong because increasing the batch size on a single instance may improve GPU utilization but is unlikely to reduce training time from 48 hours to under 12 hours, as it does not address the fundamental bottleneck of sequential processing on one GPU; it can also cause out-of-memory errors or convergence issues. Option D is wrong because switching to Pipe mode streams data directly from S3 without downloading, reducing I/O overhead, but the primary bottleneck is compute (GPU processing), not data loading; the training time is dominated by forward/backward passes through 50 layers, not by data transfer.

Full explanation →

345

Multi-Selectmedium

A data scientist is training a model using SageMaker and wants to use spot instances to reduce costs. Which THREE considerations should the scientist evaluate? (Choose THREE.)

Select 3 answers

A.Spot instances have a fixed, lower price than on-demand.

B.The training job must support checkpointing to save progress.

C.Spot instances are only available for inference, not training.

D.The training algorithm must be fault-tolerant to handle interruptions.

E.Spot instances can be reclaimed with a two-minute notice.

AnswersB, D, E

Needed to resume after interruption.

Why this answer

Option B is correct because Spot Instances can be interrupted, so the training job must checkpoint its progress in order to resume. Option D is correct because the training algorithm must tolerate interruptions and restarts. Option E is correct because Spot Instances can be reclaimed with a two-minute notice, and frequent interruptions reduce the cost savings.

Option A is wrong because Spot prices vary over time; they are not fixed. Option C is wrong because Spot Instances are available for training, not just inference.
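What checkpointing buys can be shown with a toy training loop. This is an illustrative sketch only: the loop body is a placeholder, the interruption is simulated with a hypothetical `interrupt_at` parameter, and a temporary local file stands in for the checkpoint location a real job would persist to durable storage.

```python
import json
import os
import tempfile

def train(total_epochs: int, checkpoint_path: str, interrupt_at=None) -> int:
    """Run (or resume) training; return the number of completed epochs."""
    start_epoch = 0
    if os.path.exists(checkpoint_path):
        # Resume from the last saved position instead of starting over.
        with open(checkpoint_path) as f:
            start_epoch = json.load(f)["next_epoch"]

    for epoch in range(start_epoch, total_epochs):
        if interrupt_at is not None and epoch == interrupt_at:
            return epoch  # simulated Spot interruption before this epoch runs
        # ... one epoch of real training would happen here ...
        with open(checkpoint_path, "w") as f:
            json.dump({"next_epoch": epoch + 1}, f)
    return total_epochs

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "checkpoint.json")
    stopped_at = train(10, path, interrupt_at=4)  # first attempt is interrupted
    with open(path) as f:
        resume_epoch = json.load(f)["next_epoch"]
    finished_at = train(10, path)                 # second attempt resumes

print(stopped_at, resume_epoch, finished_at)
```

Without the checkpoint file the second attempt would start again at epoch 0 and repeat work that had already been paid for.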

Full explanation →

346

MCQeasy

A company wants to analyze historical data stored in Amazon S3 using Amazon Athena. The data is in CSV format and is partitioned by date. Which action will provide the best query performance and cost optimization?

A.Use AWS Glue to compress the CSV files with gzip

B.Create an S3 event notification to trigger a Lambda function that warms up Athena

C.Keep CSV format but ensure partitions are in the format year=YYYY/month=MM/day=DD

D.Convert the data to Parquet format and use the existing partition structure

AnswerD

Parquet is columnar and compressed, reducing scanned data and improving performance.

Why this answer

Converting the data to Parquet while keeping the partitions improves query performance and reduces cost because Athena scans less data. Option C (partitioning only) helps, but CSV is row-based and Parquet, being columnar, is more efficient. Option B (a Lambda function that warms up Athena) is irrelevant.

Option A (using Glue to gzip the CSV files) may help, but Parquet already includes compression.

Full explanation →

347

MCQeasy

A machine learning engineer is performing exploratory data analysis on a dataset containing customer transaction records. The dataset includes a column 'transaction_date' with timestamps. The engineer wants to derive features such as day of the week, hour, and month for modeling. Which AWS service can be used directly to extract these features without writing custom code?

A.AWS Glue ETL with built-in timestamp transforms

B.Amazon Athena with SQL date functions

C.Amazon QuickSight

D.Amazon SageMaker Data Wrangler

AnswerA

AWS Glue provides built-in transforms to derive date components without custom code.

Why this answer

Option A is correct because AWS Glue provides built-in transforms in its ETL jobs to parse timestamps and extract date/time components. Option B is wrong because Athena can extract date parts, but only by writing SQL queries, which is not a no-code solution. Option C is wrong because QuickSight is a visualization tool, not a feature engineering tool.

Option D is wrong because SageMaker Data Wrangler is a visual tool that can create features, but it requires a SageMaker Studio environment.
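The question asks for a no-code option, but it helps to see what is being derived. The following sketch extracts the three features with the Python standard library; the ISO-style timestamp format and the function name are assumptions, since the question does not show the data.

```python
from datetime import datetime

def date_features(timestamp: str) -> dict:
    """Derive day of week (Monday=0), hour, and month from an ISO-format timestamp."""
    dt = datetime.fromisoformat(timestamp)
    return {
        "day_of_week": dt.weekday(),
        "hour": dt.hour,
        "month": dt.month,
    }

# Hypothetical 'transaction_date' value.
features = date_features("2024-03-15 14:30:00")
print(features)
```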

Full explanation →

348

MCQhard

A data engineer is performing EDA on a time-series dataset of server metrics (CPU, memory, disk I/O) collected every minute. The dataset contains 2 years of data. The engineer suspects there are seasonal patterns and wants to decompose the time series for one metric. Which AWS service can be used to perform this decomposition natively?

A.Amazon SageMaker Canvas

B.Amazon Athena with SQL window functions

C.Amazon QuickSight

D.AWS Glue DataBrew

AnswerA

Canvas provides time-series analysis and decomposition.

Why this answer

Option A is correct because Amazon SageMaker Canvas supports time-series forecasting with built-in decomposition. Option B is wrong because Amazon Athena does not have decomposition functions. Option C is wrong because Amazon QuickSight does not decompose time series.

Option D is wrong because AWS Glue DataBrew does not have time-series decomposition.

Full explanation →

349

MCQhard

A company uses Amazon SageMaker to deploy a model for real-time predictions. The model is updated weekly. The company wants to ensure that the new model version is gradually rolled out to a small percentage of traffic before full deployment, and that it can be rolled back quickly if issues are detected. Which deployment strategy should be used?

A.Blue/green deployment

B.A/B testing with a holdout group

C.Canary deployment using SageMaker endpoint variants

D.Rolling deployment across multiple endpoints

AnswerC

Canary deployment allows sending a small percentage of traffic to the new variant and can be rolled back by shifting traffic back.

Why this answer

Option C is correct because SageMaker supports canary deployments using endpoint variants, allowing a small percentage of traffic to be directed to the new model version and shifted back quickly if problems appear. Option A is wrong because a plain blue/green deployment switches all traffic at once. Option B is wrong because A/B testing is for comparing models, not for gradual rollout.

Option D is wrong because rolling a model out across multiple separate endpoints is not how SageMaker shifts traffic; the split happens between variants behind a single endpoint.
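Traffic splitting between variants is driven by relative weights: each variant receives its weight divided by the sum of all weights. The sketch below only reproduces that arithmetic; it does not call any AWS API, and the variant names are hypothetical.

```python
def traffic_share(variant_weights: dict) -> dict:
    """Fraction of requests each variant receives: weight / sum of all weights."""
    total = sum(variant_weights.values())
    return {name: weight / total for name, weight in variant_weights.items()}

# Canary phase: send 10% of requests to the new model.
canary = traffic_share({"current-model": 9, "new-model": 1})

# Rollback: set the new variant's weight to zero.
rollback = traffic_share({"current-model": 1, "new-model": 0})

print(canary, rollback)
```

Because a rollback is only a weight change on an endpoint that is already running, it takes effect without redeploying the previous model.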

Full explanation →

350

MCQhard

A data scientist is using SageMaker to train a custom TensorFlow model. The training script reads data from S3 using TensorFlow's tf.data API. The training is bottlenecked by I/O. Which strategy would MOST effectively improve data throughput?

A.Compress the data files in S3

B.Use Amazon FSx for Lustre as a mounted filesystem

C.Increase the number of parallel workers in tf.data

D.Use SageMaker Pipe mode and shard the S3 dataset

AnswerD

Pipe mode streams data directly, and sharding distributes data across instances, improving throughput.

Why this answer

Using SageMaker Pipe mode with a sharded S3 dataset allows the training instances to stream data in parallel, reducing I/O bottlenecks. Increasing workers in tf.data may help but not as effectively as optimizing data ingestion. Using FSx for Lustre provides high throughput but adds cost and complexity.

Full explanation →

351

Multi-Selecthard

A data engineer needs to set up a data lake on S3 that supports both batch and streaming ingestion. The data must be queryable by Athena, Redshift Spectrum, and EMR. Which TWO configurations are essential? (Choose two.)

Select 2 answers

A.Store data in columnar formats like Parquet or ORC.

B.Use the AWS Glue Data Catalog as a central metadata repository.

C.Enable S3 Select on the target buckets.

D.Enable S3 versioning on all buckets.

E.Set up Kinesis Data Firehose for streaming ingestion.

AnswersA, B

Columnar formats improve query performance and reduce scan costs for Athena and Redshift Spectrum.

Why this answer

Options A and B are correct. Option A stores data in a columnar format for query efficiency. Option B ensures the Glue Data Catalog is used as a shared metastore for all three services.

Option C is wrong because S3 Select is not required. Option D is wrong because S3 versioning is not essential for querying. Option E is wrong because Kinesis Data Firehose is needed only for streaming ingestion.

Full explanation →

352

MCQeasy

A data scientist needs to create a SageMaker notebook instance with access to a private S3 bucket. The bucket uses SSE-KMS encryption. Which additional configuration is required?

A.Add a lifecycle configuration script

B.Modify the bucket policy to allow s3:GetObject

C.Place the notebook instance in a VPC

D.Attach a policy to the notebook's IAM role that allows kms:Decrypt

AnswerD

Needed to decrypt objects encrypted with SSE-KMS.

Why this answer

Option D is correct because the notebook instance's IAM role needs permission to use the KMS key. Option A is wrong because a lifecycle configuration does not handle encryption. Option B is wrong because the bucket policy is separate from KMS key permissions.

Option C is wrong because placing the notebook instance in a VPC is optional.

Full explanation →

353

Multi-Selectmedium

A data engineer is designing a data ingestion pipeline that will receive up to 5 GB of data per hour from thousands of IoT devices. The data must be stored in Amazon S3 and analyzed in near real-time. Which TWO services should be used together to meet these requirements? (Choose TWO.)

Select 2 answers

A.AWS Lambda

B.Amazon Kinesis Data Analytics

C.Amazon Athena

D.Amazon Kinesis Data Firehose

E.Amazon Simple Queue Service (Amazon SQS)

AnswersB, D

Kinesis Data Analytics can run SQL queries on streaming data for near real-time analysis.

Why this answer

Amazon Kinesis Data Firehose can ingest streaming data and deliver it to S3. Amazon Kinesis Data Analytics can perform near real-time analysis on the data stream. Option A (AWS Lambda) can be used for processing but is not a streaming analytics service.

Option C (Amazon Athena) is for ad-hoc queries on S3, not real-time analysis. Option E (Amazon SQS) is for decoupling applications, not streaming analytics.

Full explanation →

354

MCQhard

A data engineer is designing a data lake on Amazon S3. The data comes from various sources and must be stored in a way that supports both batch and real-time analytics. The engineer needs to partition the data to optimize query performance in Amazon Athena. Which partitioning strategy is MOST appropriate?

A.Partition by a hash of the record ID to distribute data evenly

B.Do not partition; use a single prefix for all data

C.Partition by year, then month, then day, then hour

D.Partition by source type and then by date

AnswerC

Hierarchical time partitioning is standard for time-series data and works well with Athena.

Why this answer

Option C is correct. Partitioning by year/month/day/hour allows efficient querying for both batch (daily) and real-time (hourly) use cases, and is a common practice. Option A (hash of the record ID) is effectively random and does not help time-based queries.

Option B (single prefix, no partitioning) defeats the purpose. Option D (source type, then date) may cause small files.
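As a rough illustration of the year/month/day/hour layout, the sketch below builds a Hive-style S3 prefix from an event timestamp. The bucket name and the `partition_prefix` helper are hypothetical, and normalising to UTC is an assumption made here, not something the question states.

```python
from datetime import datetime, timezone


def partition_prefix(base: str, ts: datetime) -> str:
    """Build a Hive-style year/month/day/hour prefix for an event timestamp."""
    # Assumption: partitions are keyed on UTC so that every writer agrees on the hour.
    ts = ts.astimezone(timezone.utc)
    return (
        f"{base.rstrip('/')}/"
        f"year={ts:%Y}/month={ts:%m}/day={ts:%d}/hour={ts:%H}/"
    )


# Hypothetical bucket and timestamp, for illustration only.
example = partition_prefix(
    "s3://example-bucket/events",
    datetime(2024, 3, 7, 14, 30, tzinfo=timezone.utc),
)
print(example)
```

A query that filters on `year`, `month`, `day`, and `hour` can then skip every prefix outside the requested window.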

Full explanation →

355

Multi-Selectmedium

A company is designing a data pipeline to ingest data from multiple sources into an Amazon S3 data lake. The data must be encrypted at rest and in transit. Which TWO actions should be taken to meet these requirements?

Select 2 answers

A.Enable Server-Side Encryption on the S3 bucket

B.Enable S3 Transfer Acceleration

C.Enforce HTTPS for all S3 API requests using bucket policy

D.Use client-side encryption before uploading

E.Use S3 VPC Endpoint

AnswersA, C

Encrypts objects at rest.

Why this answer

Server-Side Encryption (SSE-S3 or KMS) encrypts data at rest in S3. Using HTTPS (SSL/TLS) for all API calls encrypts data in transit. Client-side encryption is an alternative but not the standard AWS approach.

S3 VPC endpoints and S3 Transfer Acceleration do not by themselves enforce encryption in transit.
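One common way to enforce HTTPS is a bucket policy that denies any request where `aws:SecureTransport` is false. The sketch below only builds and prints such a policy document; the bucket name and the `https_only_policy` helper are hypothetical, and attaching the policy to a bucket is left out.

```python
import json

BUCKET = "example-data-lake-bucket"  # hypothetical bucket name


def https_only_policy(bucket: str) -> dict:
    """Return a bucket policy document that denies all non-HTTPS requests."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyInsecureTransport",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                # Cover both the bucket itself and every object in it.
                "Resource": [
                    f"arn:aws:s3:::{bucket}",
                    f"arn:aws:s3:::{bucket}/*",
                ],
                # aws:SecureTransport is "false" when the request did not use TLS.
                "Condition": {"Bool": {"aws:SecureTransport": "false"}},
            }
        ],
    }


policy = https_only_policy(BUCKET)
print(json.dumps(policy, indent=2))
```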

Full explanation →

356

MCQeasy

A data engineer needs to move 50 TB of historical data from an on-premises Hadoop cluster to Amazon S3. The network bandwidth is limited to 100 Mbps. Which AWS service should be used to transfer the data most efficiently?

A.Amazon S3 Transfer Acceleration to speed up the transfer.

B.AWS Snowball Edge device to physically ship the data.

C.AWS Direct Connect to establish a dedicated network connection.

D.AWS Site-to-Site VPN to connect and copy data.

AnswerB

Snowball bypasses network limitations by shipping data physically.

Why this answer

Option B is correct because AWS Snowball Edge is designed for large data transfers when network bandwidth is limited. Option A is wrong because S3 Transfer Acceleration speeds up transfers over the internet but is still limited by the 100 Mbps link. Option C is wrong because Direct Connect requires provisioning a dedicated connection and does not physically move data.

Option D is wrong because VPN is not efficient for 50 TB.
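A quick back-of-the-envelope calculation shows why the network options are impractical here. The sketch assumes decimal units (1 TB = 10^12 bytes) and a fully saturated link with no protocol overhead, so real transfers would take longer; `transfer_days` is a hypothetical helper.

```python
def transfer_days(size_tb: float, bandwidth_mbps: float, utilization: float = 1.0) -> float:
    """Estimate transfer time in days for a dataset over a network link."""
    bits = size_tb * 1e12 * 8                      # decimal terabytes to bits
    seconds = bits / (bandwidth_mbps * 1e6 * utilization)
    return seconds / 86_400                        # seconds per day


days = transfer_days(50, 100)
print(f"{days:.1f} days at full line rate")
```

Under these assumptions, 50 TB over 100 Mbps takes roughly 46 days, which is why physically shipping a device is the more efficient choice.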

Full explanation →

357

MCQmedium

A data scientist is using SageMaker to train a model using the built-in XGBoost algorithm. The training job fails with the error 'AlgorithmError: Framework error: No module named 'xgboost''. What is the most likely cause?

A.The training data is not in CSV format.

B.The training job is using a custom container that does not have XGBoost installed.

C.The IAM role does not have permission to access SageMaker.

D.The S3 output path is incorrect.

AnswerB

Missing module indicates container issue.

Why this answer

Option B is correct because the built-in XGBoost algorithm requires the 'xgboost' Python package; SageMaker's built-in algorithms provide the necessary environment, but if the container is overridden or the wrong image is used, the module may be missing. Option A is wrong because the error is about missing module, not data format. Option C is wrong because the error is not about permissions.

Option D is wrong because the error is not about output path.

Full explanation →

358

MCQmedium

A machine learning team needs to preprocess large volumes of clickstream data stored in Amazon S3 before training a model. The preprocessing includes data cleaning, feature engineering, and normalization. The team wants to use a serverless solution that minimizes operational overhead. Which combination of services should the team use?

A.Amazon SageMaker Notebooks with custom Python scripts.

B.Amazon EMR with Spark clusters.

C.AWS Glue ETL jobs reading from and writing to S3.

D.Amazon Athena with SQL queries.

AnswerC

AWS Glue is serverless and designed for ETL on data lakes.

Why this answer

AWS Glue provides a serverless Spark environment for running ETL jobs on data in S3. Amazon SageMaker Processing jobs are another managed way to run preprocessing, but they are not among the options. Option A is wrong because SageMaker Notebooks are interactive, not automated.

Option B is wrong because EMR requires cluster management. Option D is wrong because Athena is for ad-hoc queries, not complex transformations.

Full explanation →

359

MCQmedium

A company uses Amazon Kinesis Data Analytics for Apache Flink to process streaming data. The application reads from a Kinesis data stream and writes results to a sink. The application is failing with an 'OutOfMemoryError'. The application has parallelism set to 4 and uses 1 Kinesis Processing Unit (KPU). What is the MOST likely cause and solution?

A.The application is using too many operators; reduce parallelism to 2.

B.The heap memory per operator is too low; increase parallelism to 8.

C.The checkpoint interval is too short; increase it to 5 minutes.

D.The buffer timeout is too high; reduce it to 50 ms.

AnswerB

Higher parallelism allocates more total memory across tasks.

Why this answer

With parallelism set to 4 but only 1 KPU, each operator slot receives a fraction of the available heap memory, leading to an OutOfMemoryError. Increasing parallelism to 8 distributes the workload across more slots, but more importantly, it forces Kinesis Data Analytics to allocate additional KPUs (each KPU provides 4 GB of memory), thereby increasing the total heap memory available to the application.
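The relationship described above can be sketched as simple arithmetic. This assumes the number of KPUs is the application parallelism divided by a parallelism-per-KPU setting, rounded up, and that the setting is 4 (implied by parallelism 4 running on 1 KPU); the 4 GB per KPU figure comes from the explanation. `provisioned_kpus` is a hypothetical helper, not a service API.

```python
import math

GB_PER_KPU = 4  # memory per KPU, as stated in the explanation above


def provisioned_kpus(parallelism: int, parallelism_per_kpu: int) -> int:
    """Assumed sizing rule: KPUs = ceil(parallelism / parallelism per KPU)."""
    return math.ceil(parallelism / parallelism_per_kpu)


for parallelism in (4, 8):
    kpus = provisioned_kpus(parallelism, parallelism_per_kpu=4)
    print(f"parallelism={parallelism}: {kpus} KPU(s), {kpus * GB_PER_KPU} GB total")
```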

Exam trap

The trap here is that candidates assume increasing parallelism always reduces per-operator memory, but in Kinesis Data Analytics, parallelism is tied to KPU allocation, so increasing parallelism can actually increase total memory by provisioning more KPUs.

How to eliminate wrong answers

Option A is wrong because reducing parallelism would further decrease the number of operator slots, concentrating memory usage and worsening the OutOfMemoryError. Option C is wrong because a short checkpoint interval can cause backpressure and increased memory usage, but the primary issue here is insufficient heap memory per operator, not checkpoint timing. Option D is wrong because buffer timeout affects latency and batching behavior, not heap memory allocation; reducing it would increase the number of small records processed, potentially increasing memory pressure.

Full explanation →

360

Multi-Selectmedium

Which TWO metrics are MOST appropriate for evaluating a regression model that predicts house prices, where the business is most sensitive to large errors?

Select 2 answers

A.Root Mean Squared Error (RMSE)

B.Mean Absolute Percentage Error (MAPE)

C.Accuracy

D.Mean Absolute Error (MAE)

E.R-squared

AnswersA, B

RMSE squares errors, so large errors are penalized heavily.

Why this answer

RMSE is most appropriate because it squares the errors before averaging, which heavily penalizes large errors. Since the business is most sensitive to large errors in house price predictions, RMSE directly aligns with this requirement by amplifying the impact of outliers, making it a suitable metric for evaluating model performance in this context.
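A small worked example makes the difference concrete. The prices below are made up (in thousands); both sets of predictions have the same MAE, but the set with one large miss has twice the RMSE.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error: every unit of error counts equally."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error: errors are squared before averaging."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


actual = [300, 450, 500, 620]        # illustrative house prices, in thousands
even_errors = [320, 430, 520, 600]   # every prediction is off by 20
one_big_miss = [300, 450, 500, 700]  # three exact predictions, one off by 80

for name, preds in (("even errors", even_errors), ("one big miss", one_big_miss)):
    print(f"{name}: MAE={mae(actual, preds):.1f}, RMSE={rmse(actual, preds):.1f}")
```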

Exam trap

The trap here is that candidates often choose MAE (Option D) because it is a common regression metric, but they fail to recognize that MAE does not penalize large errors more heavily, which is the key business requirement in this scenario.

Full explanation →

361

MCQmedium

A data scientist trains a model using SageMaker and notices that the training loss decreases but validation loss increases after a few epochs. What is the MOST likely issue?

A.The learning rate is too low.

B.There is data leakage from validation to training.

C.The model is underfitting the training data.

D.The model is overfitting the training data.

AnswerD

Classic sign of overfitting: training loss decreases, validation loss increases.

Why this answer

Overfitting occurs when the model performs well on training data but poorly on validation data. Option A (learning rate too low) would slow convergence. Option B (data leakage) would show good performance on both.

Option C (underfitting) would show high training loss.

Full explanation →

362

MCQhard

A research team is developing a deep learning model to classify medical images into 10 disease categories. They have a dataset of 50,000 labeled images, but the class distribution is highly imbalanced: the most common class has 20,000 images, while the rarest class has only 200 images. To address this, they apply data augmentation (random rotations, flips, and brightness adjustments) to the minority classes until each class has 20,000 images. They then train a convolutional neural network (CNN) from scratch using cross-entropy loss. The model achieves 95% overall accuracy but only 30% recall on the rarest class. Which change is MOST likely to improve recall on the rarest class without significantly reducing overall accuracy?

A.Increase dropout rate from 0.2 to 0.5 to reduce overfitting

B.Replace cross-entropy loss with focal loss

C.Switch from Adam optimizer to SGD with momentum

D.Reduce the batch size from 64 to 16 to increase stochasticity

AnswerB

Focal loss reduces the loss contribution from easy examples and focuses on hard, minority examples, improving recall.

Why this answer

Focal loss is specifically designed to address class imbalance by down-weighting the loss contribution from well-classified examples (majority classes) and focusing training on hard, misclassified examples (minority classes). This directly improves recall on the rarest class, while cross-entropy loss treats all classes equally, causing the model to be biased toward the majority classes.
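The down-weighting can be seen in a minimal per-example sketch. It assumes the usual form of focal loss, `-alpha * (1 - p_t)**gamma * log(p_t)`, where `p_t` is the predicted probability of the true class; `gamma=2.0` and `alpha=1.0` are illustrative defaults chosen here, not values from the question.

```python
import math


def cross_entropy(p_true: float) -> float:
    """Cross-entropy for one example, given the probability of the true class."""
    return -math.log(p_true)


def focal_loss(p_true: float, gamma: float = 2.0, alpha: float = 1.0) -> float:
    """Focal loss: cross-entropy scaled by (1 - p_true) ** gamma."""
    return -alpha * (1.0 - p_true) ** gamma * math.log(p_true)


for label, p in (("easy example", 0.95), ("hard example", 0.10)):
    ce, fl = cross_entropy(p), focal_loss(p)
    print(f"{label}: p_true={p:.2f}, CE={ce:.4f}, focal={fl:.4f}, ratio={fl / ce:.4f}")
```

With `gamma=0` the scaling factor is 1 and the expression reduces to plain cross-entropy.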

Exam trap

The exam often tests the distinction between regularization techniques (dropout, batch size) and loss function modifications (focal loss) for class imbalance, trapping candidates who think overfitting is the primary issue when the real problem is the model's bias toward majority classes.

How to eliminate wrong answers

Option A is wrong because increasing dropout from 0.2 to 0.5 is a regularization technique that reduces overfitting, but the model already achieves 95% overall accuracy, indicating it is not overfitting; this change would likely reduce capacity and hurt recall on the rare class without addressing the imbalance. Option C is wrong because switching from Adam to SGD with momentum changes the optimization dynamics (e.g., learning rate scheduling, convergence speed) but does not directly address the class imbalance problem; it may even slow convergence and fail to improve minority class recall. Option D is wrong because reducing batch size from 64 to 16 increases gradient stochasticity, which can help escape local minima but does not specifically target the imbalance; it may cause training instability and does not re-weight the loss to focus on minority classes.

Full explanation →

363

Multi-Selectmedium

A company is using SageMaker Autopilot to automatically build ML models. They want to ensure that the generated models are reproducible. Which TWO settings should they configure?

Select 2 answers

A.Set a random seed.

B.Specify a validation split.

C.Use multiple trials.

D.Enable early stopping.

E.Enable automatic feature engineering.

AnswersA, B

Random seeds make train/test split and model initialization deterministic.

Why this answer

Options A and B are correct. Setting a random seed ensures that random processes (like train/test split) are deterministic. Disabling automatic data splitting and manually providing a validation split gives control over the split.

Option C is incorrect because using multiple trials introduces randomness. Option D is incorrect because enabling early stopping can vary based on training dynamics. Option E is incorrect because automatic feature engineering may introduce non-deterministic transforms.

Full explanation →

364

MCQhard

A data scientist is performing EDA on a dataset containing text reviews. To understand the most common words, the data scientist generates a word cloud. Which preprocessing step is most important to ensure the word cloud reflects meaningful content?

A.Stop word removal

B.Part-of-speech tagging

C.Stemming

D.Tokenization

AnswerA

Stop word removal eliminates common, uninformative words.

Why this answer

Option A is correct because removing stop words (common words like 'the', 'and') ensures that the word cloud highlights meaningful words. Stemming (C) may not be necessary for a word cloud. Tokenization (D) is fundamental but not the most critical for meaningfulness.

POS tagging (B) is overkill.
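The effect of stop word removal on word frequencies can be sketched with the standard library. The reviews and the stop word list below are tiny made-up examples; a real pipeline would use a fuller list.

```python
import re
from collections import Counter

# Tiny illustrative stop word list, not a complete one.
STOP_WORDS = {"the", "and", "a", "is", "it", "to", "of", "was", "but", "this", "i"}


def word_counts(reviews, stop_words=frozenset()):
    """Tokenize on lowercase letters and count words, skipping stop words."""
    tokens = []
    for review in reviews:
        tokens.extend(re.findall(r"[a-z']+", review.lower()))
    return Counter(t for t in tokens if t not in stop_words)


reviews = [
    "The battery is great and the screen is great",
    "The delivery was slow but the battery is great",
]

raw = word_counts(reviews)
filtered = word_counts(reviews, STOP_WORDS)
print("without removal:", raw.most_common(3))
print("with removal:   ", filtered.most_common(3))
```

Without removal the most frequent token is "the"; with removal the counts are led by content words such as "great" and "battery".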

Full explanation →

365

Multi-Selectmedium

Which THREE techniques are commonly used in exploratory data analysis to understand the relationships between features and the target variable? (Select THREE.)

Select 3 answers

A.Use box plots to compare feature distributions across target classes.

B.Perform K-means clustering on the features.

C.Compute the correlation matrix between features and target.

D.Generate scatter plots or pair plots to visualize feature interactions.

E.Apply Principal Component Analysis (PCA) to reduce dimensions.

AnswersA, C, D

Box plots by class reveal differences in feature distributions.

Why this answer

Options A, C, and D are correct. A: Box plots by target show distribution differences. C: Correlation matrix quantifies linear relationships.

D: Pair plots allow visual inspection of multiple relationships. B is wrong because clustering is unsupervised and not directly for feature-target relationships. E is wrong because PCA is for dimensionality reduction, not EDA of relationships.

Full explanation →

366

Multi-Selecteasy

A data scientist is training a k-means clustering model on a dataset with 1,000 points. The scientist uses the elbow method to choose the number of clusters. The elbow plot shows a clear bend at k=4. After running k-means with k=4, the scientist wants to evaluate the quality of the clustering. Which THREE of the following are suitable internal clustering validation metrics? (Choose THREE.)

Select 3 answers

A.Adjusted Rand index

B.Rand index

C.Calinski-Harabasz index

D.Silhouette score

E.Davies-Bouldin index

AnswersC, D, E

Ratio of between-cluster variance to within-cluster variance; higher is better.

Why this answer

Silhouette score, Davies-Bouldin index, and Calinski-Harabasz index are all internal validation metrics that do not require ground truth labels. They measure compactness and separation. Rand index and adjusted Rand index require ground truth labels (external validation).

Full explanation →

367

MCQeasy

A data scientist is training a linear regression model on a dataset with 50 features. After training, they notice that the model performs well on training data but poorly on test data. They suspect overfitting. Which action should they take to reduce overfitting?

A.Use a larger learning rate

B.Add L2 regularization (Ridge regression)

C.Add more features to the model

D.Increase the number of training epochs

AnswerB

L2 regularization penalizes large coefficients, reducing overfitting.

Why this answer

Regularization is a standard technique to combat overfitting in linear models.
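To see how an L2 penalty shrinks coefficients, consider a simplified single-feature model without an intercept (the question's 50-feature case works the same way in matrix form). Minimizing `sum((y - w*x)**2) + lam * w**2` gives the closed form `w = sum(x*y) / (sum(x*x) + lam)`. The data and `ridge_slope` helper are illustrative.

```python
def ridge_slope(xs, ys, lam):
    """Closed-form ridge solution for y ~ w * x with no intercept."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)


xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]  # exactly y = 2x

for lam in (0.0, 1.0, 14.0):
    print(f"lambda={lam}: w={ridge_slope(xs, ys, lam):.3f}")
```

With `lam=0` the fit recovers the ordinary least squares slope of 2; larger penalties pull the coefficient toward zero.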

Full explanation →

368

MCQmedium

A machine learning engineer is working on a customer churn prediction project. The dataset contains 100,000 records with 15 features, including customer demographics, account information, and usage patterns. The target variable 'churned' is binary with 15% positive examples. During EDA, the engineer notices that the feature 'tenure' (number of months the customer has been with the company) has a multimodal distribution with peaks at 1, 12, 24, and 36 months. Also, the feature 'monthly_charges' has a strong positive correlation with 'total_charges' (correlation coefficient = 0.95). The engineer wants to build a logistic regression model. Which preprocessing steps should the engineer take to address these issues? (Select TWO.)

A.Bin the 'tenure' feature into categorical groups (e.g., 0-6, 7-12, 13-24, 25-36, 36+) to capture the non-linear relationship.

B.Remove one of the correlated features, such as 'total_charges', to reduce multicollinearity.

C.Apply log transformation to the 'tenure' feature to make it unimodal.

D.Create polynomial features up to degree 3 for 'tenure' to capture non-linearity.

E.Standardize all numerical features to have mean 0 and variance 1.

AnswerA, B

Binning can effectively capture the peaks in the distribution and model the non-linear effect of tenure on churn.

Why this answer

Option A is correct because binning the 'tenure' feature into categorical groups (e.g., 0-6, 7-12, 13-24, 25-36, 36+) captures the multimodal distribution and non-linear relationship with churn, which logistic regression (a linear model) cannot model directly. This transforms the feature into a format that allows the model to learn different churn probabilities for each tenure segment without imposing a linear assumption.
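The binning in Option A can be sketched with `bisect`. The labels come from the option text; treating a tenure of exactly 36 months as part of the 25-36 group is an assumption, since the option's "36+" label overlaps it.

```python
from bisect import bisect_left

UPPER_EDGES = [6, 12, 24, 36]                        # inclusive upper bound of each bin
LABELS = ["0-6", "7-12", "13-24", "25-36", "36+"]    # labels from Option A


def tenure_bin(months: int) -> str:
    """Map tenure in months to its categorical group."""
    # bisect_left returns the index of the first edge >= months,
    # or len(UPPER_EDGES) when months exceeds every edge.
    return LABELS[bisect_left(UPPER_EDGES, months)]


for months in (1, 12, 13, 24, 36, 48):
    print(months, "->", tenure_bin(months))
```

The resulting categories would then be one-hot encoded before fitting the logistic regression.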

Exam trap

The exam often tests the misconception that standardizing or transforming features alone can fix non-linear relationships or multicollinearity, but these steps do not address the root cause of multimodal distributions or high feature correlation in linear models like logistic regression.

How to eliminate wrong answers

Option C is wrong because applying a log transformation to 'tenure' would not make a multimodal distribution unimodal; log transformations are used to reduce right skewness, not to eliminate multiple peaks, and would distort the natural grouping at contract milestones. Option D is wrong because creating polynomial features up to degree 3 for 'tenure' would introduce multicollinearity and overfitting without addressing the multimodal nature, and logistic regression still assumes a linear relationship in the log-odds space, which polynomial terms do not resolve for discrete peaks. Option E is wrong because standardizing numerical features (mean 0, variance 1) is a general best practice for gradient descent convergence but does not address the multimodal distribution of 'tenure' or the multicollinearity between 'monthly_charges' and 'total_charges'; it is not a targeted preprocessing step for the issues described.

Full explanation →

369

MCQeasy

A data scientist is exploring a dataset and wants to understand the distribution of a continuous feature. Which visualization is most appropriate for identifying skewness and potential outliers?

A.Bar chart

B.Scatter plot

C.Box plot

D.Heatmap

AnswerC

Box plots display median, quartiles, and outliers, ideal for assessing skewness and outliers.

Why this answer

Option C is correct because a box plot explicitly shows median, quartiles, and outliers. Option A is wrong because bar charts are for categorical data. Option B is wrong because scatter plots show relationships between two variables, not distribution.

Option D is wrong because heatmaps show correlations, not distribution.
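The numbers a box plot draws can be computed directly. The sketch below uses the common 1.5 × IQR rule for flagging outliers; the data are made up, and `method="inclusive"` is one of several quartile conventions.

```python
import statistics


def box_plot_summary(values):
    """Return the quartiles and the points beyond the 1.5 * IQR whiskers."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = [v for v in values if v < low or v > high]
    return {"q1": q1, "median": median, "q3": q3, "outliers": outliers}


data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]  # one extreme value in the right tail
summary = box_plot_summary(data)
print(summary)
```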

Full explanation →

370

MCQeasy

A data scientist is training a linear regression model and observes that the training loss is low but validation loss is high. Which step should the data scientist take to address this issue?

A.Apply L2 regularization to the model

B.Increase the number of training epochs

C.Reduce the size of the training dataset

D.Add more features to the model

AnswerA

Regularization penalizes large weights, reducing overfitting.

Why this answer

Option A is correct because low training loss with high validation loss indicates overfitting; regularization reduces overfitting. Option B is wrong because increasing training time typically increases overfitting. Option C is wrong because reducing training data may worsen overfitting.

Option D is wrong because adding more features may increase overfitting.

Full explanation →

371

Matchingmedium

Match each hyperparameter tuning strategy to its description.

Concepts

Matches

Exhaustive search over specified hyperparameter values

Random sampling of hyperparameter combinations

Probabilistic model to guide search

Early stopping and resource allocation

SageMaker automatic tuning

Why these pairings

These strategies are used to optimize hyperparameters.

Full explanation →

372

MCQmedium

A company is deploying a machine learning model for real-time fraud detection. The model must respond within 100ms. Which SageMaker endpoint deployment strategy should be used?

A.Deploy the model to a SageMaker Serverless Inference endpoint.

B.Deploy the model to a SageMaker Real-Time Inference endpoint with a Multi-Model Endpoint configuration.

C.Deploy the model as an AWS Lambda function with an API Gateway trigger.

D.Use SageMaker Batch Transform to process requests in batches.

AnswerB

Multi-Model Endpoints provide low latency and cost efficiency for real-time serving.

Why this answer

Option B is correct because a SageMaker Real-Time Inference endpoint with a Multi-Model Endpoint configuration provides low-latency (sub-100ms) responses by keeping models loaded in memory and routing requests efficiently. This architecture is ideal for real-time fraud detection where multiple models may be needed, and it meets the strict latency requirement without the cold-start overhead of serverless options.

Exam trap

The trap here is that candidates may confuse serverless or Lambda-based solutions as inherently low-latency, overlooking the cold-start penalty and network overhead that make them unsuitable for sub-100ms real-time inference in SageMaker.

How to eliminate wrong answers

Option A is wrong because SageMaker Serverless Inference endpoints have a cold-start latency that can exceed 100ms, especially for infrequent or bursty traffic, making them unsuitable for real-time fraud detection with strict latency constraints. Option C is wrong because deploying as an AWS Lambda function with API Gateway introduces additional network hops and cold-start delays, and Lambda has a maximum execution timeout of 15 minutes but is not optimized for sub-100ms ML inference with large models or frameworks. Option D is wrong because SageMaker Batch Transform is designed for asynchronous, offline processing of large datasets in batches, not for real-time, low-latency inference required by fraud detection.

Full explanation →

373

MCQmedium

A company is using Amazon SageMaker to train a model. The training job is taking too long. The data scientist notices that the GPU utilization is low. Which action should be taken to improve training performance?

A.Increase the number of training instances.

B.Use spot instances to reduce cost.

C.Decrease the batch size to reduce memory usage.

D.Increase the batch size in the training script.

AnswerD

Larger batch size keeps GPU busy.

Why this answer

Option D is correct because increasing the batch size can improve GPU utilization by keeping the GPU busy. Option A is wrong because increasing the number of instances may not help if each instance is underutilized. Option B is wrong because using spot instances can reduce cost but does not improve utilization.

Option C is wrong because decreasing batch size would reduce utilization further.

Full explanation →

374

MCQmedium

A data scientist is working on a binary classification problem to predict loan default. The dataset has 200,000 samples and 50 features. The target variable is imbalanced: 5% default, 95% non-default. The scientist trains a logistic regression model and achieves 95% accuracy, but the recall for the default class is only 20%. The business requires that at least 70% of actual defaults be identified (recall >= 0.7). Which approach should the scientist take to improve recall without significantly sacrificing precision?

A.Use random undersampling of the majority class to balance the dataset

B.Use oversampling techniques like SMOTE to create synthetic samples of the minority class

C.Change the decision threshold to 0.3

D.Increase the regularization strength (C) in logistic regression

AnswerB

SMOTE generates synthetic minority samples, helping the model learn better decision boundaries for the minority class, improving recall with less precision loss.

Why this answer

SMOTE (Synthetic Minority Oversampling Technique) creates synthetic samples for the minority class by interpolating between existing minority instances, which increases the representation of the default class in the training data. This directly addresses the low recall (20%) by providing the logistic regression model with more balanced class distributions, enabling it to learn decision boundaries that capture more true positives without discarding majority class information. Unlike simple oversampling, SMOTE reduces overfitting risk by generating novel samples rather than duplicating existing ones, which helps maintain precision while improving recall.
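The interpolation idea can be sketched in a few lines. This is a simplified illustration, not the reference SMOTE implementation: it picks a random minority point, picks one of its `k` nearest minority neighbours, and places a new point at a random position on the segment between them. The points, `k=2`, and the `smote_like` helper are all illustrative.

```python
import math
import random


def smote_like(minority, n_new, k=2, seed=0):
    """Generate synthetic minority points by interpolating toward near neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the chosen point (excluding itself).
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]  # made-up feature vectors
new_points = smote_like(minority, n_new=6)
for point in new_points:
    print(point)
```

Each synthetic point lies between two real minority samples, so it is new rather than a duplicate. Oversampling of this kind should be applied to the training split only.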

Exam trap

The trap here is that candidates often choose threshold adjustment (Option C) as a quick fix for recall, failing to recognize that it is a superficial change that does not improve the model's learned decision boundary and typically sacrifices precision disproportionately, whereas SMOTE addresses the root cause of imbalance in the training data.

How to eliminate wrong answers

Option A is wrong because random undersampling of the majority class discards most of the 190,000 non-default samples (about 180,000 to reach a 1:1 ratio with the 10,000 defaults), which can lead to significant information loss and reduced model precision due to a smaller, less representative training set. Option C is wrong because changing the decision threshold to 0.3 is a post-hoc adjustment that does not address the underlying class imbalance; while it may increase recall by classifying more instances as positive, it typically causes a sharp drop in precision as many false positives are introduced. Option D is wrong because increasing regularization strength (lower C value) penalizes model complexity more heavily, which can cause underfitting and further reduce recall by making the decision boundary too simplistic to capture the minority class patterns.

Full explanation →

375

MCQmedium

A data science team is using Amazon SageMaker to train a model. The training job is failing with an 'OutOfMemory' error. The team is using a p3.2xlarge instance with 61 GB of memory. They need to resolve this issue as quickly as possible. Which action should they take?

A.Use a larger instance type, such as p3.8xlarge

B.Reduce the batch size in the training script

C.Use a spot instance to save costs

D.Enable distributed training across multiple instances

AnswerA

Larger instance types have more memory and can handle the workload.

Why this answer

Option A is correct because switching to a larger instance type with more memory will immediately resolve the out-of-memory error. Option B is wrong because reducing batch size may help but requires code changes and might still not be enough. Option C is wrong because using spot instances does not affect memory.

Option D is wrong because using distributed training adds complexity and may not resolve memory issues on a single instance.

Full explanation →
