Knowledge + Practice

AWS Certified Machine Learning Specialty (MLS-C01): Questions 376–450

1755 questions total · 24 pages · All types, answers revealed


Page 6 of 24

376

MCQmedium

A company is migrating its on-premises Hadoop cluster to AWS. They have a large amount of historical data stored in HDFS. Which approach is the most efficient for transferring this data to Amazon S3?

A.Use AWS Snowball Edge devices.

B.Use AWS Direct Connect.

C.Use AWS DataSync over the internet.

D.Use S3 Transfer Acceleration.

AnswerA

Snowball is designed for large offline data transfers.

Why this answer

AWS Snowball Edge is designed for large offline transfers and is the usual choice when network bandwidth is limited relative to the size of the dataset. AWS DataSync moves data over the network, so it is slower for very large datasets. S3 Transfer Acceleration speeds up uploads, but the transfer still goes over the network. Direct Connect is also network-based.


377

Matchingmedium

Match each SageMaker built-in algorithm to its primary use case.


Concepts

Matches

Gradient boosted trees for regression and classification

Word2Vec and text classification

Learning embeddings for pairs of objects

Anomaly detection in IP traffic

Time series forecasting

Why these pairings

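These descriptions correspond to SageMaker built-in algorithms: XGBoost (gradient boosted trees for regression and classification), BlazingText (Word2Vec and text classification), Object2Vec (learning embeddings for pairs of objects), IP Insights (anomaly detection in IP traffic), and DeepAR (time series forecasting).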


378

MCQeasy

Refer to the exhibit. A data scientist checks the status of a SageMaker endpoint and sees the output above. The endpoint is receiving traffic, but the data scientist notices that the number of instances has not increased to the desired count. What is the most likely reason?

A.The endpoint is performing a rolling update

B.The endpoint is currently being updated

C.The account has reached its instance limit

D.Automatic scaling is not configured for the endpoint

AnswerD

The desired instance count will not be applied automatically without a scaling policy; it's just a target.

Why this answer

Option D is correct because the endpoint is receiving traffic but not scaling out, which indicates that automatic scaling (Application Auto Scaling) has not been configured for the SageMaker endpoint. Without a scaling policy, the endpoint will only use the initial instance count, regardless of traffic load. The status shown does not indicate any update or quota issue, so the lack of scaling is the most likely cause.

Exam trap

AWS often tests the distinction between endpoint status (e.g., 'InService' vs. 'Updating') and scaling configuration, trapping candidates who assume that any traffic increase automatically triggers scaling without an explicit scaling policy.

How to eliminate wrong answers

Option A is wrong because a rolling update would put the endpoint in the 'Updating' status rather than the steady state shown, and it would not by itself prevent scaling. Option B is wrong for the same reason: an endpoint that is being updated reports 'Updating' rather than a steady-state status, and its instance count would not simply stay at the initial value. Option C is wrong because reaching an instance limit would surface as a scaling failure or an error message, not as a healthy endpoint that receives traffic and stays at its initial instance count.
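As a rough illustration of what "configuring automatic scaling" involves, the sketch below builds the two Application Auto Scaling requests usually needed for an endpoint variant: a scalable target and a target-tracking policy. The endpoint name, variant name, capacity bounds, and target value are assumptions made for the example; the boto3 call is imported lazily and is not executed here.

```python
def build_autoscaling_requests(endpoint_name, variant_name="AllTraffic",
                               min_capacity=1, max_capacity=4,
                               invocations_per_instance=100.0):
    """Build request bodies for Application Auto Scaling (all values are illustrative)."""
    resource_id = f"endpoint/{endpoint_name}/variant/{variant_name}"
    dimension = "sagemaker:variant:DesiredInstanceCount"
    target = {
        "ServiceNamespace": "sagemaker",
        "ResourceId": resource_id,
        "ScalableDimension": dimension,
        "MinCapacity": min_capacity,
        "MaxCapacity": max_capacity,
    }
    policy = {
        "PolicyName": f"{endpoint_name}-invocations-target-tracking",  # hypothetical name
        "ServiceNamespace": "sagemaker",
        "ResourceId": resource_id,
        "ScalableDimension": dimension,
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            "TargetValue": invocations_per_instance,
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
            },
        },
    }
    return target, policy


def apply_autoscaling(target, policy):
    """Send both requests; needs boto3 and AWS credentials, so it is not called below."""
    import boto3  # imported lazily so the builder runs without AWS access
    client = boto3.client("application-autoscaling")
    client.register_scalable_target(**target)
    client.put_scaling_policy(**policy)


target, policy = build_autoscaling_requests("my-endpoint")  # hypothetical endpoint name
print(target["ResourceId"])
```

Without a registered scalable target and policy of this kind, the variant stays at its initial instance count regardless of traffic.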


379

MCQhard

Refer to the exhibit. A data scientist ran a SageMaker training job using a built-in XGBoost algorithm. The job failed with the error shown. Which step should the data scientist take to fix the issue?

A.Write a custom algorithm that calculates accuracy

B.Remove the metric definition from the training job configuration

C.Change the metric to 'validation:rmse'

D.Use a different built-in algorithm that supports accuracy

AnswerB

XGBoost will use its default metrics (rmse for regression) if not specified, avoiding the error.

Why this answer

The error indicates that SageMaker's built-in XGBoost algorithm does not support a custom metric named 'accuracy' because XGBoost's built-in objective functions (e.g., 'binary:logistic', 'reg:squarederror') do not compute accuracy natively. Removing the metric definition from the training job configuration resolves the issue by allowing SageMaker to use the default metrics that XGBoost does support, such as 'validation:rmse' or 'validation:error'.

Exam trap

AWS often tests the misconception that any metric name can be used with built-in algorithms, when in fact the metric must be one of the predefined strings supported by the algorithm's container (e.g., 'validation:error' for XGBoost classification).

How to eliminate wrong answers

Option A is wrong because writing a custom algorithm to calculate accuracy is unnecessary and over-engineered; the built-in XGBoost already supports accuracy-like metrics (e.g., 'validation:error') if configured correctly, and the error is about an unsupported metric name, not a missing metric calculation. Option C is wrong because 'validation:rmse' is a valid metric for regression tasks, but the error is about an unsupported metric name 'accuracy', and simply changing to 'validation:rmse' does not address the root cause—the metric definition should be removed or corrected to a supported metric like 'validation:error' for classification. Option D is wrong because using a different built-in algorithm is an overreaction; the XGBoost algorithm fully supports classification and can compute error rate (which is 1 - accuracy) via the 'validation:error' metric, so the issue is purely a misconfiguration of the metric name.


380

MCQmedium

Refer to the exhibit. An IAM policy is attached to a SageMaker notebook instance. The data scientist runs a training job that reads from s3://my-bucket/training-data/ and writes to s3://my-bucket/output/. The training job fails with an access denied error. What is the most likely cause?

A.The policy does not allow sagemaker:CreateTrainingJob

B.The policy does not allow s3:PutObject on the output location

C.The policy is missing the sagemaker:InvokeEndpoint action

D.The policy does not allow s3:GetObject on the training data

AnswerB

The s3:PutObject action is restricted to the training-data prefix only.

Why this answer

The policy allows s3:PutObject only for the training-data prefix, not the output prefix, and the training job needs write access to the output location. Option D is wrong because the policy does include s3:GetObject on the training data. Option A is wrong because the SageMaker actions are allowed. Option C is wrong because sagemaker:InvokeEndpoint is used to call a deployed endpoint, not to run a training job.
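The exhibit is not reproduced on this page, so the sketch below uses a hypothetical policy with the same shape as the one described: read and write allowed only under the training-data prefix. The `is_allowed` helper is a deliberately simplified stand-in for IAM evaluation (Allow statements and wildcard matching only; no Deny, conditions, or principals).

```python
from fnmatch import fnmatchcase

# Hypothetical policy for illustration; the real exhibit may differ.
POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:PutObject"],
            "Resource": "arn:aws:s3:::my-bucket/training-data/*",
        }
    ],
}


def as_list(value):
    """IAM allows a single string or a list for Action and Resource."""
    return value if isinstance(value, list) else [value]


def is_allowed(policy, action, resource):
    """Simplified check: True if any Allow statement matches action and resource."""
    for statement in policy["Statement"]:
        if statement["Effect"] != "Allow":
            continue
        action_ok = any(fnmatchcase(action, a) for a in as_list(statement["Action"]))
        resource_ok = any(fnmatchcase(resource, r) for r in as_list(statement["Resource"]))
        if action_ok and resource_ok:
            return True
    return False


read_ok = is_allowed(POLICY, "s3:GetObject", "arn:aws:s3:::my-bucket/training-data/train.csv")
write_ok = is_allowed(POLICY, "s3:PutObject", "arn:aws:s3:::my-bucket/output/model.tar.gz")
print(read_ok, write_ok)
```

Under this assumed policy the read succeeds and the write to the output prefix is denied, which matches the failure described in the question.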


381

MCQmedium

A financial services company is developing a fraud detection model using gradient boosting. The dataset contains 10 million transactions with 0.1% fraudulent. The model is trained on a SageMaker ml.m5.2xlarge instance and takes 8 hours. The team needs to reduce training time without sacrificing model performance. They have permission to use up to 4 instances. What should they do?

A.Switch to a built-in XGBoost with GPU support and use a p3.2xlarge instance

B.Use SageMaker hyperparameter tuning to find faster hyperparameters

C.Use SageMaker's distributed training with data parallelism across 4 ml.m5.2xlarge instances

D.Use SageMaker managed spot training with checkpointing

AnswerA

GPU acceleration can significantly reduce training time for gradient boosting.

Why this answer

GPU instances like p3.2xlarge accelerate XGBoost training substantially.


382

MCQmedium

A data scientist is analyzing a dataset with a time series component. They suspect there is a weekly seasonality. Which technique should they use to confirm this?

A.Plot the time series line chart

B.Compute autocorrelation function (ACF)

C.Perform Fourier transform

D.Compute a 7-day moving average

AnswerB

Correct: ACF at lag 7 will be significant if weekly seasonality exists.

Why this answer

Option B is correct because the autocorrelation function (ACF) shows a peak at lag 7 when weekly seasonality is present. Option A is wrong because a line plot can show patterns but reading them is subjective. Option D is wrong because a moving average smooths the data and may hide seasonality. Option C is wrong because a Fourier transform analyzes the frequency content, but the ACF is the simpler check.
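The lag-7 check can be sketched in plain Python. The series below is synthetic (a 7-day sine cycle plus a small deterministic perturbation), chosen only to illustrate the idea; the sample ACF formula is the standard one.

```python
import math


def autocorrelation(series, lag):
    """Sample autocorrelation of a series at the given lag."""
    n = len(series)
    mean = sum(series) / n
    variance = sum((x - mean) ** 2 for x in series)
    covariance = sum((series[t] - mean) * (series[t + lag] - mean) for t in range(n - lag))
    return covariance / variance


# Synthetic daily data: weekly cycle plus a small deterministic wobble (illustrative only).
series = [
    10 + 5 * math.sin(2 * math.pi * t / 7) + 0.3 * ((t * 37) % 11 - 5)
    for t in range(140)
]

acf = {lag: autocorrelation(series, lag) for lag in range(1, 15)}
for lag in (1, 3, 7, 14):
    print(lag, round(acf[lag], 2))
```

A strongly positive value at lag 7 (and again at lag 14), with weak or negative values in between, is the pattern that points to weekly seasonality.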


383

Multi-Selectmedium

Which TWO actions can help reduce overfitting in a neural network? (Choose 2.)

Select 2 answers

A.Increase the number of layers.

B.Decrease the learning rate.

C.Apply L1 or L2 regularization.

D.Increase the training dataset size.

E.Add dropout layers.

AnswersC, E

Regularization penalizes large weights, reducing overfitting.

Why this answer

Option E is correct because dropout randomly drops units during training, preventing co-adaptation. Option C is correct because L1 and L2 regularization penalize large weights. Option A is wrong because adding more layers increases model complexity. Option B is wrong because reducing the learning rate may not prevent overfitting. Option D is not one of the keyed answers, although a larger training set generally also reduces overfitting; the question keys the two regularization techniques applied to the network itself.


384

Multi-Selecteasy

A company stores IoT sensor data in Amazon S3 and uses Amazon Athena for ad-hoc queries. The data is partitioned by date, but queries are still slow and expensive. Which TWO actions can improve query performance and reduce cost? (Choose TWO.)

Select 2 answers

A.Use S3 lifecycle policies to compact small files into larger ones

B.Convert the data from CSV to Parquet format

C.Disable server-side encryption on the S3 bucket

D.Use AWS Glue instead of Athena for querying

E.Increase the number of partitions to hour-level granularity

AnswersA, B

Fewer, larger files reduce the overhead of opening many files in Athena.

Why this answer

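Option B (convert to Parquet) reduces the data scanned, because Athena reads only the columns a query needs. Option A (compact small files into larger ones) reduces per-file overhead; in practice the compaction is done by an ETL job or an Athena CTAS query, since S3 lifecycle rules themselves only transition or expire objects. Option E (hour-level partitions) can create many small files. Option D (use Glue instead of Athena) swaps the service rather than optimizing the queries. Option C (disable encryption) is unrelated to query performance.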


385

MCQmedium

A data scientist is building a model to predict insurance claim amounts. The target variable is right-skewed with many small claims and a few very large claims. The scientist wants to minimize the impact of outliers. Which loss function or transformation is MOST appropriate?

A.Use mean squared error loss without any transformation

B.Use quantile loss to predict the median

C.Use Poisson loss assuming the target follows a Poisson distribution

D.Apply a log transformation to the target variable

AnswerD

Log transformation reduces skewness and makes the distribution more symmetric, reducing outlier impact.

Why this answer

A log transformation of the target, or a model with a log link function, reduces skewness and the impact of outliers. Option A (mean squared error) is sensitive to outliers. Option B (quantile loss) is robust but less common for mean prediction. Option C (Poisson loss) is meant for count data. Option D (log transformation of the target) is standard for skewed continuous targets.


386

MCQmedium

A data scientist is using principal component analysis (PCA) for dimensionality reduction before training a classifier. The classifier's performance on the test set is poor. What is the most likely cause?

A.The classifier is overfitting

B.The data was not scaled before applying PCA

C.Too few principal components were retained, losing important information

D.Too many principal components were retained, including noise

AnswerC

Discards discriminative features.

Why this answer

C is correct because PCA is an unsupervised dimensionality reduction technique that projects data onto principal components capturing the maximum variance. If too few components are retained, the reduced representation may discard features that are critical for the classifier to distinguish between classes, leading to poor test performance due to underfitting.

Exam trap

AWS often tests the misconception that PCA always improves classifier performance by removing noise, but the trap here is that candidates may overlook the risk of underfitting when too few components are retained, especially when the discarded variance contains critical discriminative features.

How to eliminate wrong answers

Option A is wrong because overfitting would cause high training accuracy but poor test accuracy, whereas the question states the classifier's performance on the test set is poor without mentioning training performance, making underfitting from information loss more likely. Option B is wrong because while scaling is a best practice for PCA (since PCA is sensitive to variances), unscaled data would typically distort component directions and degrade performance, but the most likely cause given poor test performance is retaining too few components, not scaling alone. Option D is wrong because retaining too many components, including noise, would typically lead to overfitting (high variance, poor generalization), but the question's scenario of poor test performance without context of training performance points more directly to underfitting from insufficient components.


387

Multi-Selecthard

A data scientist is analyzing a dataset of customer reviews. The dataset contains a text column 'review' and a numerical rating from 1 to 5. The data scientist wants to create features for sentiment analysis. Which THREE preprocessing steps should be applied to the text data before feature extraction? (Choose THREE.)

Select 3 answers

A.Standardize the text data using z-score normalization.

B.Apply stemming to reduce words to their root form.

C.Tokenize the text into individual words.

D.Convert all text to lowercase.

E.Remove common stop words (e.g., 'the', 'and', 'is').

AnswersB, D, E

Stemming groups related words, reducing feature dimensionality.

Why this answer

Option B is correct because stemming reduces words to their root form (e.g., 'running' to 'run'), which consolidates variations of the same word and reduces feature dimensionality. This is a standard preprocessing step before feature extraction in NLP tasks like sentiment analysis, as it helps the model generalize across different word forms.

Exam trap

AWS often tests the distinction between preprocessing steps that are specific to text (such as stemming, lowercasing, and stop word removal) and those meant for numerical data (such as normalization). Candidates may mistakenly apply scaling techniques to text, or forget that tokenization is a prerequisite that is not always listed as a separate 'correct' step in multi-select questions.
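The text-specific steps named above can be sketched in a few lines. The stop word list is a tiny illustrative sample, and `naive_stem` is a toy suffix stripper written for this example, not the Porter stemmer or any library implementation.

```python
import re

# Small illustrative stop word list; real lists are much longer.
STOP_WORDS = {"the", "and", "is", "a", "an", "of", "to", "was"}


def naive_stem(token):
    """Toy stemmer: strip one common suffix if at least three characters remain."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(review):
    """Lowercase, tokenize, drop stop words, then stem each remaining token."""
    tokens = re.findall(r"[a-z']+", review.lower())
    return [naive_stem(token) for token in tokens if token not in STOP_WORDS]


tokens = preprocess("The delivery was delayed and the packaging is damaged")
print(tokens)
```

Note that tokenization still happens inside `preprocess`, which is the point of the trap: it is a prerequisite even when it is not one of the keyed answers.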


388

Multi-Selectmedium

Which TWO techniques are appropriate for detecting outliers in a univariate numeric dataset?

Select 2 answers

A.Cook's distance

B.Mahalanobis distance

C.Z-score method

D.Interquartile range (IQR) method

E.DBSCAN clustering

AnswersC, D

Z-score flags points beyond a threshold (e.g., |z|>3).

Why this answer

Options C and D are correct: the Z-score method flags points that lie many standard deviations from the mean, and the IQR method flags points that fall too far outside the interquartile range. DBSCAN (E) is for multivariate clustering. Mahalanobis distance (B) is a multivariate measure. Cook's distance (A) measures the influence of observations in a regression.
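Both univariate methods fit in a short sketch using only the standard library. The thresholds (|z| > 3 and 1.5 × IQR) are the common conventions, and the data values are made up for illustration.

```python
import statistics


def zscore_outliers(values, threshold=3.0):
    """Return values whose absolute z-score exceeds the threshold."""
    mean = statistics.fmean(values)
    stdev = statistics.pstdev(values)
    return [v for v in values if abs(v - mean) / stdev > threshold]


def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Illustrative data: nineteen ordinary readings and one extreme value.
data = [10, 11, 12, 11, 10, 12, 13, 11, 10, 12,
        11, 13, 12, 10, 11, 12, 11, 10, 13, 95]

print(zscore_outliers(data))
print(iqr_outliers(data))
```

The IQR method is usually the more robust of the two, because an extreme value inflates the mean and standard deviation that the Z-score itself depends on.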


389

MCQeasy

A data scientist is using Amazon SageMaker to train a classification model. The dataset contains categorical features with high cardinality. Which encoding method is most appropriate for handling high-cardinality categorical features in a linear model?

A.Target encoding

B.Label encoding

C.One-hot encoding

D.Ordinal encoding

AnswerA

Target encoding replaces categories with the mean of the target variable, reducing dimensionality and capturing predictive power.

Why this answer

One-hot encoding creates one binary column per category, which can cause the curse of dimensionality for high-cardinality features. Label encoding assigns arbitrary integers, which a linear model may interpret as an ordering; ordinal encoding has the same problem. Target encoding (mean encoding) replaces each category with the mean of the target variable for that category, which captures predictive information without expanding dimensionality, so it is often used for high-cardinality features.
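A minimal sketch of mean target encoding follows. The function names, the optional smoothing term, and the sample data are assumptions made for illustration; in practice the encoding should be fitted on training folds only, to avoid leaking the target into the features.

```python
from collections import defaultdict


def fit_target_encoding(categories, targets, smoothing=0.0):
    """Map each category to its (optionally smoothed) mean target value."""
    global_mean = sum(targets) / len(targets)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for category, target in zip(categories, targets):
        sums[category] += target
        counts[category] += 1
    encoding = {
        category: (sums[category] + smoothing * global_mean) / (counts[category] + smoothing)
        for category in counts
    }
    return encoding, global_mean


def transform(categories, encoding, default):
    """Replace categories with encoded values; unseen categories get the default."""
    return [encoding.get(category, default) for category in categories]


# Illustrative data: one categorical column and a binary target.
cities = ["a", "a", "b", "b", "b", "c"]
labels = [1, 0, 1, 1, 1, 0]

encoding, global_mean = fit_target_encoding(cities, labels)
encoded = transform(["a", "b", "z"], encoding, default=global_mean)
print(encoding, encoded)
```

The single numeric column produced here replaces what would otherwise be one binary column per distinct category.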


390

MCQhard

A data engineering team is building a real-time fraud detection pipeline. The pipeline ingests transaction data from an Amazon Kinesis Data Stream with 10 shards. Each shard produces about 500 records per second, each record is 2 KB. The data is processed by a Lambda function that runs for about 200 ms and then writes results to an Amazon DynamoDB table. The team notices that the Lambda function is experiencing a high number of throttles, and there are increasing numbers of records being retried. The Lambda function's reserved concurrency is set to 100. The DynamoDB table has 100 read capacity units and 100 write capacity units. Which change would most effectively reduce throttling and improve processing throughput?

A.Decrease the Lambda function's batch size to 10.

B.Increase the DynamoDB write capacity units to 1000.

C.Increase the number of shards in the Kinesis stream to 100.

D.Increase the Lambda function's reserved concurrency to 1000.

AnswerD

More concurrency allows the function to handle more concurrent invocations.

Why this answer

The Lambda function is throttling because the concurrent executions needed exceed the reserved concurrency. Each shard invokes Lambda with batches, and with 10 shards and a batch size of 100 (default), the number of concurrent invocations can be high. Increasing reserved concurrency to a higher value (e.g., 1000) allows more concurrent executions, reducing throttling.

However, if DynamoDB write capacity is also a bottleneck, increasing it might help. But the primary issue is Lambda throttling.
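The figures in the scenario can be worked through to see where each limit sits. This is a back-of-the-envelope sketch under stated assumptions: every record becomes one DynamoDB write of the same size, one write capacity unit covers one write per second of up to 1 KB, and the Kinesis event source mapping runs one concurrent batch per shard by default (a parallelization factor of 1, configurable up to 10).

```python
import math

# Figures taken from the question.
shards = 10
records_per_shard_per_second = 500
record_size_kb = 2
reserved_concurrency = 100
provisioned_wcu = 100

# Stream throughput.
records_per_second = shards * records_per_shard_per_second
ingest_kb_per_shard_per_second = records_per_shard_per_second * record_size_kb

# DynamoDB: assumes one write per record, 1 WCU per 1 KB (rounded up) per write.
wcu_per_write = math.ceil(record_size_kb / 1)
required_wcu = records_per_second * wcu_per_write

# Lambda: concurrent batches = shards x parallelization factor (1 by default, at most 10).
default_concurrency = shards * 1
max_concurrency = shards * 10

print("records per second:", records_per_second)
print("ingest per shard (KB/s):", ingest_kb_per_shard_per_second)
print("required WCU vs provisioned:", required_wcu, provisioned_wcu)
print("Lambda concurrency range vs reserved:",
      default_concurrency, max_concurrency, reserved_concurrency)
```

Under these assumptions the write capacity needed is far above the 100 provisioned units, so the caveat about DynamoDB in the explanation is worth weighing alongside the concurrency setting.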


391

Multi-Selectmedium

Which TWO metrics are appropriate for evaluating a binary classification model trained on imbalanced data? (Select TWO.)

Select 2 answers

A.Log loss

B.F1 score

C.Accuracy

D.Precision-recall curve

E.ROC-AUC

AnswersB, D

F1 balances precision and recall.

Why this answer

The F1 score is appropriate for imbalanced binary classification because it balances precision and recall, making it robust when the positive class is rare. Unlike accuracy, it does not get inflated by a majority negative class, and it directly penalizes models that predict the majority class for all instances.

Exam trap

AWS often tests the misconception that ROC-AUC is always the best metric for imbalanced data. The trap here is that ROC-AUC can be misleadingly high when the positive class is rare, whereas the precision-recall curve and F1 score better reflect model performance on the minority class.


392

MCQhard

A data engineer is setting up a data lake on Amazon S3 for a large retail company. The data includes customer transactions, inventory, and web logs. The company wants to use AWS Glue for ETL and Amazon Athena for ad-hoc queries. The data is partitioned by year, month, day, and hour. The engineer notices that Athena queries are slow and often scan large amounts of data even when only a specific hour is needed. The engineer has already enabled partitioning and used columnar formats like Parquet. What additional step should the engineer take to optimize query performance and reduce data scanned?

A.Use a coarser partition layout, such as partitioning only by date, and leverage Hive-style partitioning with AWS Glue Crawlers to avoid excessive small files.

B.Convert the Parquet files to CSV format to reduce the overhead of columnar storage and improve compression.

C.Use S3 Select to push down filters to S3, reducing the amount of data scanned by Athena.

D.Increase the granularity of partitioning to include minute-level partitions to further limit data scanned.

AnswerA

Coarser partitions reduce the number of partitions and improve query planning.

Why this answer

Option A is correct because partitioning down to the hour can lead to many small files, which increases metadata overhead. Using a coarser layout, such as one partition per day, with Hive-style partitioning and AWS Glue Crawlers reduces the number of partitions and improves query performance. Option C is incorrect because S3 Select filters within a single object; it does not optimize queries across multiple objects. Option D is incorrect because adding minute-level partitions would worsen the small-files problem. Option B is incorrect because converting to CSV would increase the data scanned and slow queries down.


393

MCQhard

A data engineer is running an Amazon SageMaker Data Wrangler flow on a dataset with 5 million rows. The flow includes several transformations. The engineer wants to validate the data quality by checking for missing values and outliers before training. Which approach is most efficient?

A.Use Data Wrangler's data quality and insights report to generate a report with statistics and visualizations.

B.Export the transformed data to S3 and query with Amazon Athena.

C.Use Amazon EMR with Spark to compute statistics.

D.Import the data into Amazon QuickSight and create dashboards.

AnswerA

Data Wrangler has a built-in report for data quality.

Why this answer

Using Data Wrangler's built-in data quality and insights report is the most efficient way to get statistics and detect issues without custom code. Option B (Athena) requires exporting the data and writing SQL queries. Option D (QuickSight) requires importing the data and building dashboards. Option C (EMR) is overkill.


394

MCQmedium

A data scientist is analyzing a dataset with 100 features and 10,000 samples. The target variable is highly imbalanced (1% positive class). Which exploratory data analysis step is most critical before model training?

A.Apply PCA and visualize the first two principal components

B.Compute pairwise correlation matrix among all features

C.Impute missing values using mean imputation

D.Plot the histogram of the target variable

AnswerD

A histogram of the target variable shows how severe the class imbalance is.

Why this answer

Option D is correct because understanding the distribution of the target variable is essential for imbalanced datasets: it guides the choice of sampling techniques and evaluation metrics. Option B is wrong because correlation analysis is less critical than the target distribution. Option A is wrong because PCA is a dimensionality reduction technique, not primarily an EDA step. Option C is wrong because missing-value imputation is important but not the most critical step for an imbalanced target.


395

MCQmedium

A company uses Amazon Kinesis Data Streams for real-time clickstream analysis. The data is consumed by a Lambda function that enriches the records and stores them in Amazon S3. Recently, the Lambda function has been failing with throttling errors, and the consumer is falling behind. The team needs to increase the throughput of the consumer without changing the data format or the Lambda function code. What should the team do?

A.Add a second Kinesis data stream and send duplicate records to both.

B.Increase the batch size in the event source mapping for Lambda.

C.Increase the number of shards in the Kinesis data stream.

D.Increase the reserved concurrency of the Lambda function.

AnswerC

More shards increase the stream's capacity and number of Lambda consumers.

Why this answer

Option C is correct because increasing the number of shards increases the parallelism of the stream, allowing more Lambda invocations to run in parallel. Option D is wrong because raising the Lambda concurrency limit may help, but the bottleneck is the stream's throughput. Option B is wrong because a larger batch size may help, but not as effectively as adding shards. Option A is wrong because a second stream carrying duplicate records adds load without raising the throughput of the existing consumer, so it is not a direct solution.


396

Multi-Selecthard

Which THREE techniques are effective for reducing overfitting in a deep neural network?

Select 3 answers

A.Increasing model complexity

B.Early stopping

C.Dropout

D.Reducing the amount of training data

E.L2 regularization

AnswersB, C, E

Early stopping prevents overfitting by stopping training.

Why this answer

Dropout randomly drops neurons during training, L2 regularization penalizes large weights, and early stopping halts training before the model starts to overfit, so the correct answers are B, C, and E. Option A is wrong because increasing model complexity (for example, adding layers) has the opposite effect. Option D is wrong because reducing the training data would worsen overfitting. Data augmentation is also effective but is not among the options.


397

MCQeasy

During EDA, a data scientist discovers that a numerical feature 'income' has a skewness of 3.5. Which transformation should the scientist apply to make the distribution more symmetric?

A.Standardization (Z-score)

B.Square transformation

C.Log transformation

D.Min-Max scaling

AnswerC

Log transformation compresses the tail and reduces right skewness.

Why this answer

Option C is correct because a log transformation is commonly used on right-skewed, positive data to reduce skewness. Option A is wrong because standardization (Z-score) rescales the data without changing its skewness. Option D is wrong because Min-Max scaling does not change the shape of the distribution. Option B is wrong because a square transformation would increase the right skew.
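The effect on skewness can be checked directly. The sketch below computes the moment-based sample skewness before and after a log transformation and after standardization; the income values are invented for illustration.

```python
import math


def skewness(values):
    """Moment-based sample skewness: third central moment over variance^1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Illustrative right-skewed incomes: mostly modest values and one very large one.
incomes = [20_000, 25_000, 30_000, 32_000, 35_000,
           40_000, 45_000, 60_000, 120_000, 900_000]

logged = [math.log1p(v) for v in incomes]

# Standardization is a linear rescaling, so it leaves skewness unchanged.
mean = sum(incomes) / len(incomes)
stdev = math.sqrt(sum((v - mean) ** 2 for v in incomes) / len(incomes))
standardized = [(v - mean) / stdev for v in incomes]

print("raw:", round(skewness(incomes), 2))
print("log1p:", round(skewness(logged), 2))
print("standardized:", round(skewness(standardized), 2))
```

The log-transformed values have lower skewness than the raw ones, while the standardized values keep the original skewness, which is why options A and D do not help.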


398

MCQhard

A machine learning team is using Amazon SageMaker to train a PyTorch model on a dataset that is 500 GB in size. The training job runs on a single ml.p3.2xlarge instance, but the training takes over 48 hours, which exceeds the maximum allowed time. The team wants to reduce training time to under 24 hours. They are open to using multiple instances and have budget for up to 4 instances. The dataset is stored in Amazon S3 and can be split into shards by a key. The model architecture must remain unchanged. What should the team do?

A.Use SageMaker distributed data parallelism with 4 ml.p3.2xlarge instances.

B.Use SageMaker Processing to split the data and train separate models.

C.Change the instance type to ml.p3.16xlarge.

D.Switch to Pipe input mode to stream data faster.

AnswerA

Distributed training can reduce time proportionally with data parallelism.

Why this answer

Option A is correct because SageMaker's distributed data parallelism library (SMDDP) can efficiently split data across multiple GPUs with minimal code changes. Option C is wrong because moving to a larger instance type alone may not halve the training time. Option D is wrong because Pipe mode reduces I/O time but not computation time. Option B is wrong because SageMaker Processing is meant for preprocessing, not training.


399

Multi-Selecthard

You are building a CI/CD pipeline for SageMaker using AWS CodePipeline. Which THREE components are essential for a fully automated model training and deployment pipeline?

Select 3 answers

A.AWS CodeCommit to store the training script and model code

B.AWS CodeBuild to run the training job as a build step

C.AWS Lambda function to create or update the SageMaker endpoint

D.AWS CodeDeploy to deploy the model to an endpoint

E.AWS CloudFormation to define the infrastructure

AnswersA, B, C

Source control is essential for CI/CD.

Why this answer

Options A, B, and C are correct. Option A: CodeCommit stores the training script and model code. Option B: CodeBuild can run the training job as a build step. Option C: a Lambda function can create or update the SageMaker endpoint. Option D is wrong because CodeDeploy targets compute platforms such as EC2, Lambda, and ECS, not SageMaker endpoints. Option E is wrong because CloudFormation is optional.


400

MCQeasy

A company wants to deploy a machine learning model that requires very low latency predictions (under 10ms). The model is a small ensemble of decision trees. Which SageMaker deployment option is most suitable?

A.SageMaker Notebook instance

B.AWS Lambda function with the model packaged

C.SageMaker endpoint with a single instance

D.SageMaker Batch Transform

AnswerC

Provides real-time low-latency inference.

Why this answer

C is correct because a SageMaker endpoint with a single instance provides a persistent, real-time inference API that can achieve sub-10ms latency for a small ensemble of decision trees. The endpoint keeps the model loaded in memory and uses synchronous HTTP requests, minimizing cold start and network overhead, which is essential for low-latency predictions.

Exam trap

The trap here is that candidates often confuse batch processing (Batch Transform) with real-time inference, or assume that serverless options like Lambda are always the fastest, ignoring cold start and timeout constraints.

How to eliminate wrong answers

Option A is wrong because a SageMaker Notebook instance is an interactive development environment, not a deployment target; it cannot serve real-time predictions with a stable endpoint. Option B is wrong because AWS Lambda has a maximum execution timeout of 15 minutes and a cold start latency that often exceeds 10ms, especially when loading a model package; it is designed for short, stateless functions, not persistent low-latency inference. Option D is wrong because SageMaker Batch Transform is an asynchronous, batch processing service that processes large datasets offline; it does not provide real-time endpoints and has no latency guarantee under 10ms.


401

MCQmedium

A data scientist needs to run complex ETL transformations on a large dataset stored in Amazon S3. The transformations are written in PySpark and require occasional access to Hive metastore. The solution should minimize operational overhead and allow the data scientist to focus on code development. Which AWS service should be used?

A.Amazon Redshift

B.Amazon EMR

C.AWS Glue

D.Amazon SageMaker

AnswerB

EMR provides a managed Spark environment with Hive support and allows custom PySpark code.

Why this answer

Amazon EMR is a managed Hadoop framework that supports PySpark and Hive metastore. AWS Glue is good for simpler ETL but has limitations on custom PySpark code. Amazon SageMaker is for ML training, not general ETL.

Amazon Redshift is a data warehouse.


402

MCQeasy

A data scientist is training a binary classifier on an imbalanced dataset (95% negative, 5% positive). The model achieves 99% accuracy but only correctly identifies 2% of the positive samples. Which metric should the data scientist focus on to improve the model's performance?

A.Precision

B.RMSE

C.Recall

D.Accuracy

AnswerC

Recall measures the proportion of actual positives correctly identified.

Why this answer

Option C is correct because recall measures the proportion of actual positives correctly identified, which is critical for imbalanced datasets. Option D is wrong because accuracy is misleading when classes are imbalanced. Option B is wrong because RMSE is a regression metric. Option A is wrong because precision does not directly address the low share of positives being identified.
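The gap between accuracy and recall is easy to see from a confusion matrix. The counts below are illustrative assumptions (1,000 samples with 5% positives and one true positive found), not figures taken from the question.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Assumed counts: 950 negatives all predicted negative, 50 positives with only 1 found.
metrics = classification_metrics(tp=1, fp=0, fn=49, tn=950)
for name, value in metrics.items():
    print(name, round(value, 3))
```

With these counts the model looks strong on accuracy and precision while missing almost every positive, which only recall (and, through it, F1) exposes.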


403

MCQmedium

A data scientist is training a binary classification model on a highly imbalanced dataset (0.1% positive class). To improve recall, the team decides to use SageMaker's built-in XGBoost algorithm. Which parameter adjustment is most likely to increase recall without significantly sacrificing precision?

A.Increase max_depth from 5 to 10

B.Reduce num_round from 100 to 50

C.Increase subsample from 0.8 to 1.0

D.Set scale_pos_weight to the ratio of negative to positive samples

AnswerD

scale_pos_weight adjusts class weights to focus on the minority class, improving recall.

Why this answer

Setting scale_pos_weight to the ratio of negative to positive samples (approximately 999:1) tells XGBoost to assign a higher penalty to misclassifications of the minority positive class. This directly increases the gradient contribution from positive samples during training, which shifts the decision boundary to improve recall while maintaining a balance that avoids excessive false positives, thus preserving precision.

Exam trap

The exam often tests the misconception that simply increasing model complexity (max_depth) or data usage (subsample) will fix imbalance, when the correct approach is to use a class-weighting parameter like scale_pos_weight that directly addresses the skewed gradient contributions.

How to eliminate wrong answers

Option A is wrong because increasing max_depth from 5 to 10 makes the model more complex and prone to overfitting, which can actually hurt generalization and may not specifically target recall improvement for the minority class. Option B is wrong because reducing num_round from 100 to 50 decreases the number of boosting iterations, which typically reduces model capacity and can lower recall by underfitting the minority class patterns. Option C is wrong because increasing subsample from 0.8 to 1.0 uses all training data for each tree, which reduces randomness and can increase overfitting without addressing class imbalance; it does not directly influence recall for the positive class.
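As a rough sketch of how the ratio could be derived from training labels, the snippet below computes it and places it in a hyperparameter dictionary. The label list and the other hyperparameter values are made up for illustration; only scale_pos_weight, max_depth, num_round, subsample, and objective are XGBoost parameter names.

```python
def compute_scale_pos_weight(labels):
    """Ratio of negative to positive samples for 0/1 labels."""
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0:
        raise ValueError("no positive samples; the ratio is undefined")
    return n_neg / n_pos


# Hypothetical training labels with a 0.1% positive class.
labels = [1] * 10 + [0] * 9990

# Illustrative hyperparameters; values other than scale_pos_weight are arbitrary.
hyperparameters = {
    "objective": "binary:logistic",
    "num_round": 100,
    "max_depth": 5,
    "subsample": 0.8,
    "scale_pos_weight": compute_scale_pos_weight(labels),
}
```

In practice the ratio is often treated as a starting point and tuned against a validation set, since a very large weight can push precision down.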

Full explanation →

404

MCQeasy

A data scientist is building a model to predict customer churn. The dataset includes both numerical features (e.g., account age, usage minutes) and categorical features (e.g., region, plan type). The data scientist wants to use a linear classifier. Which feature engineering step is required before training?

A.Normalize numerical features

B.Impute missing values

C.Remove outliers

D.One-hot encode categorical features

AnswerD

Linear models require numerical input; one-hot encoding converts categories to binary vectors.

Why this answer

Linear classifiers (e.g., logistic regression, linear SVM) require numerical input and cannot directly process categorical text labels. One-hot encoding converts each categorical feature into binary indicator columns, allowing the linear model to learn separate weights for each category. Without this step, the model would either fail to train or treat categorical strings as ordinal values, which is mathematically invalid for linear decision boundaries.

Exam trap

The trap here is that candidates may assume normalization (A) is the most critical step for linear models, overlooking that categorical features must be converted to numerical form before any linear classifier can process them.

How to eliminate wrong answers

Option A is wrong because normalizing numerical features is beneficial for convergence speed and weight interpretation but is not strictly required before training a linear classifier; many implementations handle unscaled data. Option B is wrong because imputing missing values is a data cleaning step that may be necessary but is not specific to the requirement of using a linear classifier with categorical features. Option C is wrong because removing outliers is a data preprocessing technique that can improve model robustness but is not a mandatory step for linear classifiers to function with categorical data.
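The encoding step can be sketched in plain Python as follows. The column names and rows are hypothetical churn data invented for the example; a real pipeline would more likely use a library encoder.

```python
def one_hot_encode(rows, column):
    """Replace a categorical column with one 0/1 indicator column per category."""
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        # Keep every other feature unchanged.
        new_row = {k: v for k, v in row.items() if k != column}
        for category in categories:
            new_row[f"{column}={category}"] = 1 if row[column] == category else 0
        encoded.append(new_row)
    return encoded


# Hypothetical rows with one numerical and one categorical feature.
rows = [
    {"account_age": 12, "plan_type": "basic"},
    {"account_age": 30, "plan_type": "premium"},
    {"account_age": 7, "plan_type": "basic"},
]

encoded_rows = one_hot_encode(rows, "plan_type")
```

Each category gets its own column, so the linear model can learn a separate weight per category instead of treating the categories as ordered values.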

Full explanation →

405

MCQhard

A bank is building a credit risk model using a large dataset with 500 features and 2 million samples. The dataset contains many categorical features with high cardinality (e.g., zip code, occupation). The model must be deployed on SageMaker and provide real-time predictions with low latency. They also need to explain individual predictions for regulatory compliance. Which approach is most appropriate?

A.Use a linear model with target encoding for categorical features and deploy with SageMaker's built-in linear learner algorithm

B.Use a deep neural network with embedding layers for categorical features and use SageMaker's built-in Debugger for explanations

C.Use XGBoost with one-hot encoding for categorical features and deploy with SageMaker's built-in SHAP explainer

D.Use a gradient boosting model with ordinal encoding for categorical features and use SageMaker's built-in XGBoost with SHAP

AnswerD

Ordinal encoding handles high cardinality without explosion; XGBoost captures interactions; SHAP provides explanations.

Why this answer

XGBoost with ordinal encoding and SHAP balances performance, latency, and explainability.

Full explanation →

406

MCQhard

A machine learning engineer is deploying a model using SageMaker and wants to use automatic scaling for the endpoint based on the number of concurrent requests. The engineer has defined a scaling policy using the SageMakerVariantInvocationsPerInstance metric. However, the scaling is not triggering as expected. What could be the issue?

A.A scheduled scaling action must be created first.

B.The scaling policy does not have a cooldown period configured, or the cooldown period is too long.

C.The metric must be published to CloudWatch manually.

D.The metric is not available for automatic scaling.

AnswerB

Cooldown prevents scaling actions from triggering too frequently.

Why this answer

Option B is correct because scaling policies use a cooldown period (default 300 seconds) to prevent rapid scaling; a missing or overly long cooldown can keep the policy from activating when expected. Option D is wrong because the metric is valid for automatic scaling.

Option C is wrong because the metric is emitted by default. Option A is wrong because a scaling policy can be defined without a scheduled action.

Full explanation →

407

MCQeasy

A data scientist is using Amazon SageMaker to train a model. The training job is taking longer than expected. The scientist wants to reduce training time without changing the algorithm or the hardware. Which action is most likely to help?

A.Increase the batch size used during training.

B.Add regularization to the loss function.

C.Use data augmentation to increase the dataset size.

D.Reduce the number of training epochs.

AnswerA

Increasing batch size reduces the number of iterations per epoch, speeding up training. It may require tuning the learning rate, but it is a common technique to reduce training time.

Why this answer

Increasing the batch size processes more samples per step, so each epoch needs fewer iterations and the existing hardware is used more efficiently. The learning rate may need retuning to keep convergence stable.

Reducing the number of epochs (option D) also shortens training, but it does so by training the model less, which risks underfitting rather than making training itself faster.

Data augmentation (option C) increases the dataset size and therefore the training time.

Adding regularization (option B) changes the loss function but does not reduce training time.
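The effect of batch size on the number of steps can be checked with simple arithmetic. The sketch below assumes a hypothetical dataset of 10,000 samples; the function name is illustrative.

```python
import math


def iterations_per_epoch(num_samples, batch_size):
    """Number of optimizer steps needed to see every sample once."""
    return math.ceil(num_samples / batch_size)


num_samples = 10_000  # hypothetical dataset size

steps_small_batch = iterations_per_epoch(num_samples, 32)
steps_large_batch = iterations_per_epoch(num_samples, 128)
```

Fewer steps per epoch only translates into less wall-clock time when the hardware can process the larger batch efficiently, which is why this is a likely rather than guaranteed speed-up.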

Full explanation →

408

MCQeasy

A company has customer feedback data stored in CSV files in S3. The data includes a 'feedback_text' column. Which AWS service is best suited for performing sentiment analysis as part of exploratory data analysis?

A.Amazon Comprehend

B.Amazon Rekognition

C.Amazon Textract

D.Amazon Lex

AnswerA

Comprehend provides sentiment analysis as a managed service.

Why this answer

Option A is correct because Amazon Comprehend is a natural language processing (NLP) service that can perform sentiment analysis directly. Option D is wrong because Amazon Lex is for conversational interfaces, not text analysis. Option B is wrong because Amazon Rekognition is for image and video analysis.

Option C is wrong because Amazon Textract is for extracting text from documents, not sentiment.

Full explanation →

409

MCQeasy

A data engineer is tasked with building a system to process a continuous stream of IoT sensor data. The data must be processed in near real-time, and the results must be stored in Amazon S3 partitioned by hour. Which AWS service is the most cost-effective and simplest to implement?

A.Amazon Simple Queue Service (SQS) with AWS Lambda

B.Amazon Kinesis Data Firehose

C.Amazon Kinesis Data Streams with Amazon EC2 consumers

D.AWS Database Migration Service (DMS) for continuous replication

AnswerB

Serverless, automatic partitioning, and direct delivery to S3.

Why this answer

Amazon Kinesis Data Firehose is the simplest and most cost-effective way to ingest streaming data and deliver it to S3 with automatic partitioning by time. Option C (Kinesis Data Streams with EC2 consumers) requires custom consumers. Option A (Amazon SQS with Lambda) is for message queues, not streaming.

Option D (AWS Database Migration Service) is for database migration.

Full explanation →

410

MCQeasy

A data scientist is analyzing a dataset with 1,000 features. They suspect many features are redundant and want to reduce dimensionality before training a model. Which technique is most appropriate for identifying the most important features?

A.Apply principal component analysis (PCA) and select the top components

B.Use L1 regularization (Lasso) to shrink coefficients to zero

C.Train a random forest and remove features with low importance

D.Compute the correlation matrix and remove features with high correlation

AnswerA

PCA reduces dimensionality by keeping the components that capture the most variance.

Why this answer

Option A is correct because principal component analysis (PCA) is a dimensionality reduction technique that identifies the principal components capturing the most variance. Option D is wrong because a correlation matrix only shows pairwise linear relationships, not importance. Option B is wrong because regularization can shrink coefficients but is not a dedicated dimensionality reduction technique.

Option C is wrong because random forests can provide feature importance but are not a dimensionality reduction technique per se.

Full explanation →

411

MCQhard

A company uses Amazon Kinesis Data Analytics for real-time anomaly detection on a stream of IoT sensor data. The application is experiencing high latency. The data volume has doubled. Which action would MOST effectively reduce latency?

A.Increase the Parallelism setting of the Kinesis Data Analytics application

B.Change the record format from JSON to Avro

C.Decrease the retention period of the source stream

D.Increase the number of shards in the source Kinesis stream

AnswerA

More KPUs allow parallel processing of records.

Why this answer

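Increasing the parallelism (number of KPUs) in Kinesis Data Analytics allows processing more data in parallel, reducing latency. Changing the record format may help, but not as much as scaling. Decreasing the retention period is not relevant to processing latency.

Increasing the number of shards raises the ingestion capacity of the source stream but does not by itself add processing capacity to the application.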

Full explanation →

412

MCQeasy

A data scientist is using Amazon SageMaker to train a model using a built-in algorithm. The training job uses a large dataset stored in Amazon S3, and the scientist wants to use pipe mode to stream the data directly from S3 to the training instance, reducing the time needed to download the data. The training job is configured with 'InputMode' set to 'Pipe'. However, the training job fails with an error indicating that the algorithm does not support pipe mode. What should the scientist do to resolve this issue?

A.Change the 'InputMode' to 'File'

B.Use a different instance type that supports pipe mode

C.Use AWS Glue to stream the data to the training instance

D.Switch to a different built-in algorithm that supports pipe mode

AnswerA

File mode downloads the data first; it is supported by all algorithms.

Why this answer

Option A is correct because not all built-in algorithms support Pipe mode; the scientist should use File mode instead. Option B is wrong because the issue is not with the instance type. Option D is wrong because changing to a different algorithm is not necessary if the current algorithm works with File mode.

Option C is wrong because SageMaker does not support using Glue to stream data directly to training jobs.

Full explanation →

413

MCQeasy

A team uses AWS Glue ETL jobs to preprocess data for SageMaker training. The job runs successfully but the output data is empty. What is the most likely cause?

A.There is a data type mismatch between source and target

B.The source data is partitioned and only a subset of partitions is read

C.The filter transformation condition is too restrictive, removing all rows

D.The Glue job runs out of memory and fails silently

AnswerC

Filtering all rows results in empty output.

Why this answer

Option C is correct: if the filter condition excludes all records, the output is empty. Option B (reading only a subset of partitions) would not cause empty output if data exists in those partitions. Option A (data type mismatch) causes errors, not empty output.

Option D (insufficient memory) causes job failure, not empty output.

Full explanation →

414

MCQhard

A machine learning engineer is using Amazon SageMaker to train a deep learning model. The training job is taking longer than expected. The engineer notices that the GPU utilization is low (around 30%) while CPU utilization is high. Which action is most likely to improve training speed?

A.Increase the number of data loading workers

B.Use a smaller instance type with fewer GPUs

C.Decrease the number of data loading workers

D.Increase the batch size

AnswerA

More workers can parallelize data loading and reduce I/O bottleneck, improving GPU utilization.

Why this answer

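Low GPU utilization with high CPU utilization suggests a data loading bottleneck. Increasing the number of data loading workers keeps the GPU fed. Decreasing the number of workers would make the bottleneck worse, increasing the batch size does not relieve a saturated input pipeline, and using a smaller instance would not help.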

Using Pipe mode (streaming) might help but not as directly as increasing workers.

Full explanation →

415

MCQhard

A company has a large dataset of customer transactions stored in Amazon Redshift. A data scientist wants to perform EDA using Python libraries like pandas and matplotlib. The dataset is too large to fit into memory on a single EC2 instance. What is the most efficient approach?

A.Launch an Amazon SageMaker notebook instance with an attached EBS volume large enough to hold the data

B.Use Amazon Athena Federated Query to run SQL queries against Redshift and retrieve aggregated results

C.Use a SQLAlchemy connection to read the entire table into a pandas DataFrame and sample it

D.Export the Redshift table to Amazon S3 in Parquet format, then use pandas to read the Parquet files

AnswerB

Athena Federated Query returns only aggregated results, so the full dataset never has to fit in memory.

Why this answer

Option B is correct because Amazon Athena allows querying Redshift data directly via federated queries, returning only aggregated results, avoiding the need to move large datasets. Option C is wrong because reading all data into a local DataFrame would exceed memory. Option D is wrong because writing to S3 and then reading with pandas still requires loading all data into memory.

Option A is wrong because the SageMaker notebook's local memory is still limited, regardless of the EBS volume size.

Full explanation →

416

Multi-Selectmedium

A company uses Amazon SageMaker to train models. The data scientist wants to automate the retraining process whenever new data arrives in an S3 bucket. Which THREE services can be used together to achieve this? (Choose THREE.)

Select 3 answers

A.Amazon S3

B.Amazon EC2

C.AWS Lambda

D.Amazon SageMaker

E.AWS Glue

AnswersA, C, D

S3 events can trigger the pipeline.

Why this answer

Options A, C, and D are correct. A: Amazon S3 can trigger events on new data. C: AWS Lambda can process the event and start the training job.

D: SageMaker can run the training job. Option B (Amazon EC2) is not needed. Option E (AWS Glue) is for ETL, not directly for triggering retraining.

Full explanation →

417

MCQhard

A data scientist trains a gradient boosting model on a large dataset using SageMaker. The training completes successfully, but when deploying the model to a real-time endpoint, inference latency is too high. Which change is MOST likely to reduce latency without significant accuracy loss?

A.Use a larger instance type for the endpoint

B.Prune the trees by removing nodes with low importance

C.Increase the number of trees in the ensemble

D.Use SageMaker Batch Transform instead of real-time

AnswerB

Pruning reduces model size and inference time.

Why this answer

Pruning trees by removing nodes with low importance reduces the model's complexity, which directly decreases inference latency because fewer decision paths need to be evaluated. In gradient boosting, this can be done with minimal accuracy loss if the removed nodes correspond to splits that contribute little to the overall prediction, as measured by feature importance or gain.

Exam trap

The trap here is that candidates often confuse scaling the endpoint (Option A) as the primary fix for latency, when the real issue is model complexity that can be reduced through pruning without significant accuracy loss.

How to eliminate wrong answers

Option A is wrong because using a larger instance type may reduce latency through more CPU/memory, but it does not address the root cause of high latency from model complexity and increases cost; it is a scaling workaround, not a model optimization. Option C is wrong because increasing the number of trees in the ensemble would increase model size and inference computation, making latency worse, not better. Option D is wrong because SageMaker Batch Transform is designed for offline, asynchronous inference on large datasets and does not provide real-time endpoints; switching to batch transform would not meet the requirement for a real-time endpoint and introduces significant latency for individual predictions.

Full explanation →

418

MCQmedium

A machine learning engineer is performing exploratory data analysis on a large dataset stored in S3 using Amazon Athena. The dataset contains a timestamp column 'event_time' of type string. The engineer wants to analyze daily trends. Which approach is the most cost-effective and efficient?

A.Create a view that casts the column to timestamp and query the view.

B.Use the CAST function in the SELECT statement to convert the string to timestamp.

C.Convert the data to Parquet format with a timestamp column and re-query.

D.Partition the table by date derived from the event_time string and query using partition filtering.

AnswerD

Partition pruning reduces data scanned; can use date_format or substring to derive partition key.

Why this answer

Option D is correct because partitioning the table by a date derived from event_time lets Athena use partition pruning, reducing the data scanned. Option B is wrong because CAST in SELECT still scans all data. Option A is wrong because creating a view does not reduce data scanned.

Option C is wrong because converting to Parquet is beneficial but not the most direct for the given task.

Full explanation →

419

Multi-Selectmedium

Which TWO of the following are appropriate techniques for handling missing data during exploratory data analysis? (Select TWO.)

Select 2 answers

A.Ignore missing values and proceed with modeling

B.Replace missing values with -1 to indicate missing

C.Impute missing values using mean or median for numerical features

D.Visualize the missing data pattern using heatmaps or bar charts

E.Delete all rows with any missing values

AnswersC, D

Mean/median imputation is a common EDA technique.

Why this answer

Options C and D are correct. Visualizing missing data patterns (D) helps understand the missing mechanism. Using imputation methods like mean/median (C) is common during EDA.

Option E is wrong because deleting all rows with missing values may discard too much data. Option A is wrong because ignoring missing values can lead to errors. Option B is wrong because replacing with -1 can distort data.
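A minimal sketch of mean/median imputation for a single numerical feature is shown below, using only the standard library. Missing values are represented as None, and the sample values are invented for illustration.

```python
from statistics import mean, median


def impute(values, strategy="median"):
    """Fill None entries with the mean or median of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a feature with no observed values")
    fill = median(observed) if strategy == "median" else mean(observed)
    return [fill if v is None else v for v in values]


# Hypothetical feature with one missing value and one large outlier.
usage_minutes = [1, None, 3, 100]

filled_median = impute(usage_minutes, strategy="median")
filled_mean = impute(usage_minutes, strategy="mean")
```

The median is less sensitive to the outlier than the mean, which is one reason it is often preferred for skewed features.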

Full explanation →

420

MCQmedium

A machine learning engineer is responsible for deploying a model that was trained using a custom algorithm in Amazon SageMaker. The engineer has built a Docker container that includes the inference code and has tested it locally. The engineer now wants to deploy the container to a SageMaker endpoint for real-time inference. The engineer has already created the model in SageMaker by specifying the image URI and the model artifacts location in S3. However, when the engineer tries to create an endpoint configuration, the operation fails with an error indicating that the model is not in an 'Active' state. What should the engineer do to resolve this issue?

A.Check the CloudWatch logs for the container to ensure the inference server starts correctly

B.Create the endpoint configuration with a different model name

C.Delete and re-create the model, then wait for a few minutes

D.Re-create the model using a different image URI

AnswerA

The health check requires the container to respond to a ping request. Logs will show if the server failed to start.

Why this answer

Option A is correct because the model must be in an 'Active' state before it can be deployed, and this requires the container to pass SageMaker's health check. The engineer should check the CloudWatch logs for the container to diagnose the health check failure. Option C is wrong because re-creating the model with the same image will not fix the health check issue.

Option D is wrong because the model is already created; the issue is its state, not the image URI. Option B is wrong because the endpoint configuration cannot be created if the model is not active, whatever name is used.

Full explanation →

421

Multi-Selecthard

A data engineer is designing a data pipeline to process streaming data from Amazon Kinesis Data Streams and store the results in Amazon S3 in Parquet format. The data must be available for querying in Amazon Athena within minutes of arrival. Which TWO services should be used together? (Choose TWO.)

Select 2 answers

A.Amazon EMR

B.Amazon Redshift

C.Amazon Kinesis Data Firehose

D.Amazon Kinesis Data Analytics

E.AWS Glue

AnswersC, E

Firehose can deliver streaming data to S3 in Parquet format.

Why this answer

Kinesis Data Firehose can write data to S3 in Parquet format with near-real-time delivery. AWS Glue provides the Data Catalog for table metadata, and Athena queries the data. Option D (Kinesis Data Analytics) is for real-time analytics on streams, not for storage.

Option A (EMR) is for batch processing, not streaming. Option B (Redshift) is for data warehousing, not immediate S3 querying.

Full explanation →

422

MCQhard

An IAM policy attached to an AWS Glue job allows reading and writing to an S3 bucket and accessing Glue Data Catalog. The job fails with an access denied error when trying to create a table in the Data Catalog. What is the likely issue?

A.The Glue Data Catalog is not enabled for the account.

B.The job does not have permission to write to the S3 bucket.

C.The S3 bucket is encrypted with a KMS key that the job cannot access.

D.The policy does not include the glue:CreateTable action.

AnswerD

Only GetTable and GetDatabase are allowed, not CreateTable.

Why this answer

The policy allows GetTable and GetDatabase actions, but not CreateTable. The job needs glue:CreateTable permission. The S3 actions are sufficient.

The error is specifically about creating a table.

Full explanation →

423

MCQeasy

A data analyst is examining the distribution of a continuous variable and notices that its histogram is heavily skewed to the right. Which transformation should the analyst apply to make the distribution more symmetrical?

A.Box-Cox transformation with lambda=2.

B.Logarithmic transformation (log).

C.Standardization (z-score).

D.Square root transformation.

AnswerB

Log transformation reduces right skewness.

Why this answer

Option B is correct because log transformation is commonly used to reduce right skewness by compressing the long tail. Option D is wrong because the square root transformation is less effective for severe skewness. Option A is wrong because Box-Cox with lambda=2 is essentially a squaring transformation, which stretches the right tail further; the log transformation corresponds to the lambda=0 case of the Box-Cox family.

Option C is wrong because standardization does not change the shape of the distribution.
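The effect can be checked numerically with a small sketch that compares sample skewness before and after a log transformation. The values are made up, and the skewness formula used is the simple moment-based one (third central moment divided by the second central moment raised to 1.5).

```python
import math


def skewness(values):
    """Moment-based sample skewness: m3 / m2**1.5."""
    n = len(values)
    mu = sum(values) / n
    m2 = sum((v - mu) ** 2 for v in values) / n
    m3 = sum((v - mu) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical right-skewed, strictly positive values.
values = [1, 2, 2, 3, 3, 4, 5, 8, 20, 100]

# The log is only defined for positive values.
log_values = [math.log(v) for v in values]

skew_before = skewness(values)
skew_after = skewness(log_values)
```

For data like this the skewness after the transformation is smaller than before, though it does not necessarily reach zero.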

Full explanation →

424

MCQeasy

A company is using Amazon SageMaker to build a binary classification model. The dataset is highly imbalanced, with 95% negative class and 5% positive class. Which technique should be used to address the class imbalance?

A.Use a weighted loss function during training.

B.Use accuracy as the primary evaluation metric.

C.Perform random under-sampling of the majority class.

D.Remove all examples from the majority class.

AnswerA

Weighted loss penalizes errors on minority class more heavily.

Why this answer

Option A is correct because using a weighted loss function during training assigns higher weight to the minority class, helping the model learn better from imbalanced data. Option D is wrong because removing the majority class reduces data size and may lose important patterns. Option C is wrong because random under-sampling can discard useful data.

Option B is wrong because using accuracy as the evaluation metric is inappropriate for imbalanced data; precision/recall or AUC are better.
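One way to picture a weighted loss is a binary log loss in which errors on the positive class are multiplied by a weight. The following is a sketch under that assumption; the function name, the weight of 19 (the 95:5 ratio), and the predictions are all illustrative.

```python
import math


def weighted_log_loss(y_true, y_prob, pos_weight=1.0):
    """Binary log loss where positive-class terms are scaled by pos_weight."""
    eps = 1e-12  # clamp probabilities to avoid log(0)
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)
        if y == 1:
            total += -pos_weight * math.log(p)
        else:
            total += -math.log(1 - p)
    return total / len(y_true)


# Hypothetical predictions: the positive sample is scored far too low.
y_true = [1, 0]
y_prob = [0.1, 0.1]

unweighted = weighted_log_loss(y_true, y_prob, pos_weight=1.0)
weighted = weighted_log_loss(y_true, y_prob, pos_weight=19.0)
```

With the weight applied, the missed positive contributes far more to the loss, so the optimizer is pushed harder to correct it.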

Full explanation →

425

MCQmedium

A data scientist uses SageMaker to train a model and wants to automatically stop the training job if the loss is not improving after a certain number of steps. Which feature should be used?

A.SageMaker Experiments

B.SageMaker Debugger

C.SageMaker Automatic Model Tuning

D.SageMaker Ground Truth

AnswerB

Debugger can monitor and stop jobs based on rules.

Why this answer

SageMaker Debugger can monitor loss and trigger actions like stopping the job. Option B is correct. Option C is wrong because automatic tuning is for hyperparameter optimization.

Option A is wrong because Experiments is for tracking. Option D is wrong because Ground Truth is for labeling.

Full explanation →

426

MCQmedium

A company is building a data pipeline that ingests data from multiple sources into a centralized data lake on Amazon S3. The data must be transformed before it is available for analysis. The pipeline should be event-driven, automatically triggering transformation jobs when new data arrives. Which combination of AWS services should be used?

A.Amazon Kinesis Data Analytics for transformation

B.Amazon S3 event notifications to invoke AWS Lambda, which triggers an AWS Glue job

C.Amazon EMR with automatic scaling

D.AWS Step Functions to orchestrate the pipeline

AnswerB

S3 events trigger Lambda, which starts a Glue ETL job; this is event-driven and serverless.

Why this answer

Amazon S3 can send events to AWS Lambda or SQS when new objects are created. AWS Glue can be triggered by Lambda to run ETL jobs. Step Functions (option D) can orchestrate but adds complexity.

Kinesis Data Analytics (option A) is for streaming analytics, not batch. EMR (option C) requires cluster management.
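The Lambda function in this pattern could look roughly like the sketch below. The job name and the argument keys passed to the job are hypothetical; start_job_run is the boto3 Glue client call, and the optional glue_client parameter is an addition for testing, not part of the Lambda handler contract.

```python
from urllib.parse import unquote_plus

GLUE_JOB_NAME = "transform-new-data"  # hypothetical Glue job name


def lambda_handler(event, context, glue_client=None):
    """Start one Glue job run for each object reported in an S3 event."""
    if glue_client is None:
        import boto3  # imported lazily so the sketch can be tested without AWS
        glue_client = boto3.client("glue")

    run_ids = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys in S3 event notifications are URL-encoded.
        key = unquote_plus(record["s3"]["object"]["key"])
        response = glue_client.start_job_run(
            JobName=GLUE_JOB_NAME,
            # Hypothetical job arguments read by the Glue script.
            Arguments={"--source_bucket": bucket, "--source_key": key},
        )
        run_ids.append(response["JobRunId"])
    return {"job_run_ids": run_ids}
```

The function only starts the job and returns, so the transformation itself runs in Glue rather than within the Lambda time limit.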

Full explanation →

427

Multi-Selecteasy

Which TWO actions are appropriate when handling missing data in a dataset for machine learning? (Select TWO.)

Select 2 answers

A.Use a machine learning model to predict missing values based on other features

B.Drop all rows that contain any missing value

C.Impute missing values with the mean or median of the feature

D.Remove the feature entirely if it contains missing values

E.Fill missing values with zero

AnswersA, C

Mean/median imputation and model-based imputation both fill gaps without discarding data.

Why this answer

Options A and C are correct. Imputing with mean/median is a common technique, and using a model to predict missing values is also valid. Option B is wrong because dropping all rows with missing values can discard too much data.

Option E is wrong because filling with zeros may not be appropriate for all features. Option D is wrong because removing the feature entirely may lose important information.

Full explanation →

428

MCQhard

A data scientist notices that a linear regression model trained on a dataset has high variance. The model performs well on the training data but poorly on the test data. Which action is most likely to reduce the variance?

A.Decrease the amount of training data

B.Apply L2 regularization to the model

C.Increase the number of gradient descent iterations

D.Add more features to the model

AnswerB

L2 regularization shrinks coefficients and reduces model complexity, thereby reducing variance.

Why this answer

High variance indicates the model is overfitting to the training data. L2 regularization (ridge regression) adds a penalty proportional to the square of the magnitude of the coefficients, which shrinks them toward zero. This reduces the model's sensitivity to noise in the training data, thereby lowering variance and improving generalization to the test set.

Exam trap

The exam often tests the bias-variance tradeoff by making candidates confuse regularization with optimization steps or feature engineering, so the trap here is assuming that changing the amount of training data or running more iterations always improves model performance without considering the effect on variance.

How to eliminate wrong answers

Option A is wrong because decreasing the amount of training data typically increases variance, as the model has fewer examples to learn from and is more likely to overfit. Option C is wrong because increasing gradient descent iterations does not reduce variance; it only ensures the optimization converges to a minimum, which may even worsen overfitting if the model is already complex. Option D is wrong because adding more features increases model complexity, which generally raises variance and exacerbates overfitting, not reduces it.
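The shrinking effect of the L2 penalty can be seen in the simplest case: a single feature with no intercept, where minimizing the squared error plus l2_penalty times the squared weight gives the closed form w = sum(x*y) / (sum(x*x) + l2_penalty). The data below is invented for the sketch.

```python
def ridge_slope(xs, ys, l2_penalty):
    """Closed-form slope for one-feature ridge regression without intercept."""
    numerator = sum(x * y for x, y in zip(xs, ys))
    denominator = sum(x * x for x in xs) + l2_penalty
    return numerator / denominator


# Hypothetical data lying exactly on y = 2x.
xs = [1, 2, 3]
ys = [2, 4, 6]

ols_slope = ridge_slope(xs, ys, l2_penalty=0.0)      # no regularization
shrunk_slope = ridge_slope(xs, ys, l2_penalty=14.0)  # penalty equal to sum(x*x)
```

A larger penalty pulls the coefficient toward zero, which trades a little bias for lower variance.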

Full explanation →

429

MCQhard

A data scientist queried an Athena table and got only one row back, but the CSV file is 1 MB. What is the most likely reason?

A.The table is partitioned but the partition is not correctly defined

B.The CSV file contains only one row

C.The table is not an external table

D.Athena does not support CSV format

AnswerA

If the partition is not correctly mapped, the query may read only part of the data.

Why this answer

Option A is correct because the file is large but the query returned only one row, suggesting the table's partition mapping is wrong; the WHERE clause on date may not match the actual partition. Option C is wrong because whether the table is external does not limit the rows read. Option B is wrong because a 1 MB file likely has many rows.

Option D is wrong because Athena supports CSV.

Full explanation →

430

Multi-Selecthard

A data engineer is designing an ETL pipeline using AWS Glue to process data from Amazon S3 and load it into Amazon Redshift. The pipeline must handle incremental data loads and ensure data consistency. Which THREE features should the engineer use to achieve this? (Choose THREE.)

Select 3 answers

A.Pushdown predicates to filter partitions in S3

B.Glue data preview to validate transformation logic

C.Glue partition filters to limit data scanned

D.Redshift transactional tables with automatic commit

E.Glue job bookmarks to track processed data

AnswersA, D, E

Pushdown predicates reduce the amount of data read from S3, improving performance.

Why this answer

Option E (job bookmark) enables incremental processing. Option A (pushdown predicate) reduces data scanned. Option D (transactional table) ensures consistency.

Option C (partition filter) is less efficient. Option B (data preview) is for development.

Full explanation →

431

MCQeasy

A data scientist is training a linear regression model and notices high bias in the training set. What action is most likely to reduce bias?

A.Apply L1 regularization.

B.Increase the learning rate.

C.Increase the amount of training data.

D.Add more relevant features to the model.

AnswerD

Adding features increases model capacity, which can reduce high bias.

Why this answer

High bias indicates that the model is underfitting the training data, meaning it is too simple to capture the underlying patterns. Adding more relevant features increases the model's capacity to learn complex relationships, directly reducing bias. This is a standard approach in linear regression to address underfitting.

Exam trap

The trap here is that candidates confuse high bias with high variance and incorrectly choose increasing training data (Option C) or regularization (Option A), which are solutions for overfitting, not underfitting.

How to eliminate wrong answers

Option A is wrong because L1 regularization (Lasso) reduces overfitting by shrinking coefficients to zero, which increases bias rather than reducing it. Option B is wrong because increasing the learning rate affects the convergence speed of gradient descent, not the model's bias; it may cause divergence or oscillation. Option C is wrong because increasing the amount of training data helps reduce variance (overfitting) but does not address high bias; with high bias, the model is already too simple to fit the data well.
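As a rough illustration of why a relevant feature reduces bias, the sketch below fits ordinary least squares twice on synthetic data generated from y = x^2: once with only x, and once with x and x^2. The data, helper names, and the small normal-equation solver are assumptions made for this example only.

```python
def solve(a, b):
    """Solve the linear system a @ w = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    w = [0.0] * n
    for i in range(n - 1, -1, -1):
        rest = sum(m[i][j] * w[j] for j in range(i + 1, n))
        w[i] = (m[i][n] - rest) / m[i][i]
    return w

def training_mse(rows, targets):
    """Fit least squares via the normal equations and return the training MSE."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(k)]
    w = solve(xtx, xty)
    errors = [sum(wi * ri for wi, ri in zip(w, r)) - t for r, t in zip(rows, targets)]
    return sum(e * e for e in errors) / len(errors)

xs = [i / 10 for i in range(-20, 21)]
ys = [x * x for x in xs]  # the true relationship is quadratic

mse_linear = training_mse([[1.0, x] for x in xs], ys)            # intercept + x
mse_quadratic = training_mse([[1.0, x, x * x] for x in xs], ys)  # adds the x^2 feature

print(mse_linear, mse_quadratic)
```

The model with only x cannot bend to follow the curve, so its training error stays high (underfitting); once x^2 is available as a feature, the same linear method fits the data almost exactly.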

Full explanation →

432

MCQmedium

A company uses AWS Glue to catalog data in S3. Data is partitioned by year, month, day. The Glue crawler runs daily but sometimes misses new partitions. What should be done to ensure all partitions are cataloged?

A.Use a custom classifier to detect partition patterns.

B.Increase the crawler schedule to run every hour.

C.Configure the crawler to update all partitions on each run.

D.Enable partition indexing in the Glue table properties.

AnswerD

Partition indexing helps Athena query without full scan.

Why this answer

Option D is correct because enabling partition indexing in the Glue table properties allows the Glue Data Catalog to automatically discover and register new partitions as they are added to S3, without relying solely on crawler runs. This feature uses the Hive-style partition structure (e.g., year=2024/month=01/day=15) to index partitions, ensuring that even if the crawler misses a run, new partitions are still cataloged via the partition index.

Exam trap

The trap here is that candidates often assume increasing crawler frequency or using custom classifiers will solve partition discovery issues, but the correct solution is to leverage Glue's built-in partition indexing feature, which decouples partition discovery from crawler runs.

How to eliminate wrong answers

Option A is wrong because custom classifiers are used to infer the schema of data formats (e.g., CSV, JSON) and do not affect partition discovery or cataloging. Option B is wrong because increasing the crawler schedule to run every hour does not guarantee that all partitions are cataloged if the crawler fails or if partitions are added between runs; it only reduces the window of missed partitions but does not solve the underlying issue of missed partitions. Option C is wrong because configuring the crawler to update all partitions on each run would be inefficient and does not address the root cause of missed partitions; the crawler still depends on its schedule and may skip partitions if they are not present during the crawl.

Full explanation →

433

Multi-Selectmedium

A data engineering team is designing a data lake on AWS for machine learning workloads. The data includes structured, semi-structured, and unstructured data. The team needs to ensure that the data is cataloged, easily discoverable, and can be queried by Amazon Athena and Amazon EMR. The team also wants to enforce fine-grained access control at the column and row level for sensitive data. Which combination of AWS services should the team use? (Select TWO.)

Select 2 answers

A.AWS Lake Formation

B.AWS Identity and Access Management (IAM)

C.AWS Glue Data Catalog

D.Amazon RDS for PostgreSQL

E.Amazon DynamoDB

AnswersA, C

Lake Formation provides fine-grained access control and integrates with Glue Catalog.

Why this answer

AWS Lake Formation is correct because it provides a centralized service to build, secure, and manage data lakes on AWS. It enables fine-grained access control at the column and row level for sensitive data, which directly meets the requirement for enforcing such controls. Additionally, Lake Formation integrates with Amazon Athena and Amazon EMR for querying and processing the cataloged data.

Exam trap

The trap here is that candidates often assume IAM alone can handle fine-grained data access control, but IAM lacks the column- and row-level filtering capabilities that Lake Formation provides through its integration with the Glue Data Catalog and query engines.

Full explanation →

434

Matchingmedium

Match each AWS AI service to its capability.

Drag a concept onto its matching description — or click a concept then click the description.

Concepts

Matches

Natural language processing

Language translation

Text-to-speech

Speech-to-text

Conversational chatbots

Why these pairings

These are AWS AI services for various NLP tasks.

Full explanation →

435

MCQhard

An ML team is using SageMaker Processing jobs to run feature engineering scripts. The scripts require a specific Python package not included in the default SageMaker image. How should the team provide this package?

A.Include 'pip install <package>' in the processing script

B.Use the SageMaker prebuilt deep learning container with the package

C.Place a requirements.txt file in the input data S3 bucket

D.Create a custom Docker image that includes the package and use it for the Processing job

AnswerD

Standard best practice for custom dependencies.

Why this answer

A custom container allows full control over dependencies. Option A is wrong because pip install in script is not persisted. Option B is wrong because SageMaker doesn't support requirements.txt directly.

Option D is wrong because a prebuilt image may not have the package.

Full explanation →

436

MCQmedium

Refer to the exhibit. An IAM policy is attached to a SageMaker notebook instance role. When the data scientist tries to run a training job that writes model artifacts to 's3://my-bucket/models/', the job fails with an access denied error. What is the MOST likely cause?

A.The IAM role does not have a trust policy

B.Missing s3:PutObject permission for the output S3 bucket

C.The policy does not include any S3 actions

D.The sagemaker:CreateTrainingJob action is not allowed on the specific resource

AnswerB

Write access is needed for model artifacts.

Why this answer

The error occurs because the IAM policy attached to the SageMaker notebook instance role does not grant the s3:PutObject permission on the 's3://my-bucket/models/' path. SageMaker training jobs require this permission to write model artifacts to the specified S3 output bucket. Without it, the API call to upload the model fails with an access denied error, even if other S3 actions are allowed.

Exam trap

The trap here is that candidates often assume the error is due to a missing trust policy or a missing sagemaker:CreateTrainingJob permission, but the actual failure is at the S3 write step, which requires explicit s3:PutObject on the output bucket.

How to eliminate wrong answers

Option A is wrong because a missing trust policy would prevent SageMaker from assuming the role at all, so the job would fail before it ever reached the S3 write. Option C is wrong because the problem is not the absence of all S3 actions; the policy may allow other S3 actions while lacking the specific s3:PutObject permission for the output location. Option D is wrong because the failure occurs at the S3 write step, not at the training job creation step.
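For reference, a policy statement granting the missing write permission might look like the following. The exhibit is not shown here, so this is an assumed minimal statement built from the bucket path in the question, written as a Python dictionary and printed as JSON.

```python
import json

# Assumed minimal statement; the real policy in the exhibit is not reproduced here.
output_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:PutObject"],
            # Object-level actions apply to object ARNs under the output prefix.
            "Resource": "arn:aws:s3:::my-bucket/models/*",
        }
    ],
}

print(json.dumps(output_policy, indent=2))
```

Note that the resource is the object path under the prefix, not the bucket ARN itself, because s3:PutObject is an object-level action.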

Full explanation →

437

MCQmedium

A team is training an XGBoost model using SageMaker with a large dataset in S3 (100 GB). Training is taking too long. Which change will most likely reduce training time without sacrificing accuracy?

A.Reduce the number of training instances

B.Configure Pipe mode for data input

C.Enable SageMaker Managed Spot Training

D.Use a larger instance type with more vCPUs

AnswerB

Pipe mode streams data directly from S3, reducing I/O bottleneck and training time.

Why this answer

Option B is correct: Pipe mode streams data from S3 to the training container for faster I/O, instead of downloading the full dataset before training starts. Option A (reducing the number of instances) would increase training time.

Option C (Managed Spot Training) reduces cost but does not accelerate training. Option D (a larger instance type with more vCPUs) adds compute but does not by itself address the data input bottleneck.

Full explanation →

438

MCQhard

A data scientist is training a deep learning model on a large dataset using SageMaker. The training job is taking too long. Upon reviewing the CloudWatch logs, the scientist notices that the GPU utilization is below 10% most of the time. Which change is MOST likely to improve GPU utilization and reduce training time?

A.Increase the batch size in the training script.

B.Use a different optimizer that requires less computation.

C.Switch to a smaller instance type to reduce data transfer overhead.

D.Reduce the size of the training dataset.

AnswerA

Increasing batch size can improve GPU utilization by processing more data per step.

Why this answer

Low GPU utilization often indicates a data loading bottleneck. Increasing the batch size can improve GPU utilization by feeding more data at once, but it may also cause memory issues. Using a larger instance type with more GPU memory could help if the model is large.

However, the most common fix is to use SageMaker Pipe Mode or Fast File Mode to stream data efficiently, reducing I/O wait. Among the options, increasing batch size is a direct way to increase GPU utilization.

Full explanation →

439

Multi-Selectmedium

A company wants to use Amazon SageMaker to train a model using data stored in Amazon S3. The data is sensitive and must be encrypted at rest and in transit. Which THREE steps should be taken to ensure data security?

Select 3 answers

A.Configure the SageMaker training job to use an IAM role with least privilege and enable network isolation

B.Enable default encryption on the S3 bucket using AWS KMS

C.Use an S3 VPC endpoint to keep traffic within the AWS network

D.Store the data in Amazon Redshift instead of S3

E.Allow internet access for the SageMaker notebook instance

AnswersA, B, C

Network isolation ensures no internet egress.

Why this answer

Encrypting the S3 bucket with KMS ensures encryption at rest. Using VPC endpoints for S3 ensures data does not traverse the public internet. Enabling encryption in transit between SageMaker and S3 (using HTTPS) is essential.

Option E (internet access for the notebook instance) is not secure; Option D (Redshift) is irrelevant.

Full explanation →

440

Multi-Selectmedium

A company is using Amazon SageMaker to train an XGBoost model. The training data contains missing values. Which TWO methods can XGBoost handle missing values internally?

Select 2 answers

A.Use surrogate splits to handle missing values.

B.Drop rows with missing values.

C.Learn the best direction to go when a value is missing.

D.Treat missing values as a separate category.

E.Impute missing values with the mean of the feature.

AnswersC, D

XGBoost uses a sparsity-aware algorithm that learns the optimal split direction for missing values.

Why this answer

XGBoost handles missing values internally with its sparsity-aware algorithm, so you don't need to impute or drop rows first. It can learn the best direction to go when a value is missing (Option C) and treat missing values as a separate category (Option D).

Option A (surrogate splits) is not XGBoost's default behavior. Option B (drop rows) and Option E (impute with the mean) are not internal to XGBoost; they are preprocessing steps.
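The idea of learning a default direction can be sketched as follows. This is a simplified illustration, not XGBoost's implementation: it scores a single split by squared error rather than XGBoost's gradient-based gain, and the function names and data are made up for the example.

```python
def sse(values):
    """Sum of squared deviations from the mean (0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_default_direction(xs, ys, threshold):
    """Try sending rows with a missing feature left and right; keep the better one."""
    left = [y for x, y in zip(xs, ys) if x is not None and x < threshold]
    right = [y for x, y in zip(xs, ys) if x is not None and x >= threshold]
    missing = [y for x, y in zip(xs, ys) if x is None]
    loss_if_left = sse(left + missing) + sse(right)
    loss_if_right = sse(left) + sse(right + missing)
    return "left" if loss_if_left <= loss_if_right else "right"

# None marks a missing feature value; the targets of the missing rows
# resemble the targets of the rows that fall below the threshold.
xs = [1, 2, None, 8, 9, None]
ys = [1.0, 1.2, 1.1, 5.0, 5.2, 0.9]

direction = best_default_direction(xs, ys, threshold=5)
print(direction)
```

Because the direction is chosen from the training data at each split, no separate imputation step is needed before training.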

Full explanation →

441

MCQeasy

A team is building a data pipeline using Amazon Kinesis Data Firehose to deliver real-time clickstream data to an Amazon S3 bucket. The data must be partitioned by year, month, day, and hour. Which configuration should the team use to achieve this?

A.Configure an S3 lifecycle rule to move data into partition folders after delivery

B.Use an AWS Lambda function to write data to S3 with the desired partition structure

C.Enable dynamic partitioning in Firehose and configure the partition keys as YYYY/MM/dd/HH

D.Use Amazon Athena partition projection to dynamically create partitions

AnswerC

Firehose dynamic partitioning automatically creates folder structures.

Why this answer

Option C is correct because Firehose has a built-in dynamic partitioning feature that can use keys like YYYY/MM/dd/HH based on the timestamp. Option B is wrong because Lambda can partition but adds complexity. Option A is wrong because S3 lifecycle rules do not repartition data.

Option D is wrong because partition projection is for Athena, not Firehose.
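To make the target layout concrete, the sketch below shows how a record timestamp maps to a year/month/day/hour prefix. It only illustrates the resulting folder structure in plain Python; it is not Firehose configuration, and `partition_prefix` is a hypothetical helper.

```python
from datetime import datetime, timezone

def partition_prefix(event_time):
    """Build a year/month/day/hour S3 prefix from a timezone-aware timestamp."""
    return event_time.astimezone(timezone.utc).strftime("%Y/%m/%d/%H/")

prefix = partition_prefix(datetime(2024, 1, 15, 9, 30, tzinfo=timezone.utc))
print(prefix)  # 2024/01/15/09/
```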

Full explanation →

442

MCQmedium

A data scientist is performing EDA on a dataset with 500 features. The dataset has a mix of numeric and categorical features. The scientist wants to identify which features have a strong nonlinear relationship with the target variable. Which technique is most appropriate?

A.Use ANOVA to compare feature means across target classes.

B.Compute Pearson correlation coefficients.

C.Calculate mutual information between each feature and the target.

D.Perform chi-squared tests for each feature.

AnswerC

Mutual information measures any dependency, including nonlinear.

Why this answer

Mutual information can capture any kind of dependency (including nonlinear) between features and target. Option B (Pearson correlation) captures only linear relationships. Option D (chi-squared test) is for categorical features.

Option A (ANOVA) is for comparing means across groups.
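The difference can be seen on a small synthetic example where the target is the square of the feature. The sketch below uses a simple histogram-based estimate of mutual information; the bin count, helper names, and data are assumptions for illustration, and library estimators use more refined methods.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def bin_index(value, lo, hi, bins):
    """Map a value in [lo, hi] to one of `bins` equal-width bins."""
    return min(int((value - lo) / (hi - lo) * bins), bins - 1)

def mutual_information(xs, ys, bins=10):
    """Histogram estimate of mutual information, in nats."""
    n = len(xs)
    x_lo, x_hi, y_lo, y_hi = min(xs), max(xs), min(ys), max(ys)
    joint, px, py = {}, {}, {}
    for x, y in zip(xs, ys):
        i = bin_index(x, x_lo, x_hi, bins)
        j = bin_index(y, y_lo, y_hi, bins)
        joint[(i, j)] = joint.get((i, j), 0) + 1
        px[i] = px.get(i, 0) + 1
        py[j] = py.get(j, 0) + 1
    # Sum p(x, y) * log(p(x, y) / (p(x) * p(y))) over the occupied cells.
    return sum((c / n) * math.log(c * n / (px[i] * py[j])) for (i, j), c in joint.items())

xs = [-1 + 2 * i / 1000 for i in range(1001)]
ys = [x * x for x in xs]  # strong but purely nonlinear dependence

print(pearson(xs, ys))
print(mutual_information(xs, ys))
```

Pearson correlation comes out near zero because the relationship is symmetric around x = 0, while the mutual information estimate is clearly positive, which is why it is the better screening tool for nonlinear dependence.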

Full explanation →

443

Multi-Selecthard

A company uses SageMaker to train a model. The training job fails with 'ResourceLimitExceeded' error. Which TWO actions should the company take to resolve this?

Select 2 answers

A.Launch the training job in a different AWS region.

B.Use a different instance type that is not at its limit.

C.Use SageMaker Managed Spot Training to reduce cost.

D.Compress the training data to reduce storage requirements.

E.Request a service limit increase for SageMaker training job resources.

AnswersB, E

Different instance types may have separate limits.

Why this answer

Option B is correct because the 'ResourceLimitExceeded' error indicates that the requested instance type has reached its concurrent usage limit in the current AWS region. Switching to a different instance type that is not at its limit allows the training job to proceed without exceeding the service quota. Option E is correct because requesting a service limit increase for SageMaker training job resources directly raises the cap on the number of concurrent instances or total instance count, resolving the underlying quota issue.

Exam trap

The trap here is that candidates confuse 'ResourceLimitExceeded' with cost or storage issues, leading them to select Managed Spot Training or data compression, which do not address the underlying AWS service quota limit.

Full explanation →

444

MCQeasy

A data scientist has trained a model using SageMaker and wants to deploy it to an endpoint. Which step is required before deployment?

A.Upload the training data to S3

B.Create a custom Docker image

C.Retrain the model with more data

D.Register the model in SageMaker Model Registry

AnswerD

Model must be registered to be deployable.

Why this answer

Option D is correct because a model must be registered. Option A is wrong because training data is not needed after training. Option C is wrong because the model is already trained.

Option B is wrong because custom Docker images are not required for built-in algorithms.

Full explanation →

445

MCQmedium

A data engineer runs the AWS CLI command above to inspect an object in S3. The engineer wants to query this metadata (kafka-offset) using Amazon Athena to track processing progress. How can the engineer make this metadata available for Athena queries without modifying the existing data pipeline?

A.Use S3 object tags instead of metadata and query the tags using Athena.

B.Use an AWS Lambda function to copy the metadata into the object's content as a new line.

C.Use AWS Glue to create a table that includes the metadata as a column by running an ETL job.

D.Use Amazon Athena to query the object metadata directly by referencing the metadata field.

AnswerC

A Glue ETL job can read objects, extract metadata, and write to a table that Athena can query.

Why this answer

Option C is correct. S3 object metadata is not automatically available in Athena, and Glue crawlers do not capture custom metadata by default. A Glue ETL job, however, can read the objects, extract the metadata, and write it as a column to a table (for example, in Parquet) that Athena can query.

Option D is wrong because Athena cannot query object metadata directly. Option A is wrong because Athena does not query S3 object tags. Option B is wrong because Lambda cannot add the metadata to existing objects without rewriting them.

Full explanation →

446

MCQmedium

A team wants to build a data pipeline that processes incoming JSON files from an S3 bucket and loads them into a Redshift table. The pipeline must handle schema evolution and data validation. Which combination of services would be MOST appropriate?

A.Amazon S3 + AWS Glue + Amazon Redshift

B.Amazon S3 + Amazon SQS + Amazon Redshift

C.Amazon S3 + AWS Data Pipeline + Amazon Redshift

D.Amazon S3 + AWS Lambda + Amazon Redshift

AnswerA

Glue provides schema inference and ETL.

Why this answer

AWS Glue can crawl the S3 data to infer schema, perform ETL transformations, and load into Redshift. SQS is not needed. Lambda is event-driven but lacks built-in schema evolution.

Data Pipeline is older and less flexible.

Full explanation →

447

MCQeasy

Which AWS service can be used to generate a data profile (including histograms, correlations, and statistics) for a dataset stored in Amazon S3 without writing code?

A.Amazon QuickSight

B.AWS Glue DataBrew

C.Amazon Athena

D.Amazon SageMaker Data Wrangler

AnswerD

Data Wrangler provides visual data profiling.

Why this answer

Option D is correct because Amazon SageMaker Data Wrangler provides a visual interface to create data profiles. Option A (QuickSight) is for visualization, not profiling; Option B (Glue DataBrew) also profiles but Data Wrangler is more integrated with SageMaker; Option C (Athena) is for querying.

Full explanation →

448

MCQeasy

A data scientist is using Amazon SageMaker to train a deep learning model with a large dataset. The training job fails with a 'CUDA out of memory' error. What is the MOST efficient way to resolve this issue?

A.Switch to a CPU-only instance

B.Use a larger instance type with more GPUs

C.Increase the batch size

D.Reduce the batch size

AnswerD

Smaller batch size reduces memory consumption per GPU.

Why this answer

The 'CUDA out of memory' error occurs when the GPU's memory is insufficient to hold the model parameters, gradients, optimizer states, and the current batch of data. Reducing the batch size decreases the memory footprint per training step, allowing the model to fit within the available GPU memory without requiring a more expensive instance or sacrificing GPU acceleration.

Exam trap

AWS often tests the misconception that 'more resources' (larger instance or more GPUs) is always the best fix, when in fact adjusting hyperparameters like batch size is the most efficient and cost-effective first step.

How to eliminate wrong answers

Option A is wrong because switching to a CPU-only instance would eliminate GPU acceleration entirely, drastically slowing training for deep learning workloads, and does not address the root cause of memory pressure. Option B is wrong because using a larger instance with more GPUs is an expensive overprovisioning solution that does not optimize resource usage; it may also introduce additional complexity with multi-GPU data parallelism. Option C is wrong because increasing the batch size would increase GPU memory consumption, exacerbating the out-of-memory error rather than resolving it.

Full explanation →

449

MCQeasy

An ML engineer is troubleshooting why an automated CI/CD pipeline cannot deploy an updated model to an existing SageMaker endpoint. The pipeline uses the IAM role that has the attached policy shown in the exhibit. What is the MOST likely cause of the failure?

A.The pipeline tries to update an existing endpoint, but the sagemaker:UpdateEndpoint action is not allowed.

B.The pipeline tries to create a new endpoint, but the sagemaker:CreateEndpoint action is denied.

C.The pipeline tries to delete the old endpoint, but the sagemaker:DeleteEndpoint action is denied by a Deny statement.

D.The pipeline attempts to invoke the endpoint, but the sagemaker:InvokeEndpoint action is denied.

AnswerA

The policy does not include sagemaker:UpdateEndpoint, which is required to update an existing endpoint. Without this permission, the update fails.

Why this answer

The pipeline is attempting to deploy an updated model to an existing SageMaker endpoint, which requires the sagemaker:UpdateEndpoint action. The IAM policy shown in the exhibit (not provided here but implied) does not include this action, so the API call fails with an access denied error. Without explicit permission to update the endpoint, the CI/CD pipeline cannot modify the deployed configuration.

Exam trap

The trap here is that candidates may confuse the actions required for updating an existing endpoint (UpdateEndpoint) with those for creating a new one (CreateEndpoint), leading them to incorrectly select Option B when the pipeline is actually performing an update.

How to eliminate wrong answers

Option B is wrong because the pipeline is not creating a new endpoint; it is updating an existing one, so sagemaker:CreateEndpoint is not the required action. Option C is wrong because the pipeline does not need to delete the old endpoint; SageMaker endpoints are updated in-place via UpdateEndpoint, which handles traffic shifting automatically. Option D is wrong because the pipeline is not invoking the endpoint during deployment; InvokeEndpoint is used for inference requests, not for model deployment operations.

Full explanation →

450

Multi-Selecthard

A company is using Amazon DynamoDB as a source for a machine learning pipeline. The data is exported nightly to Amazon S3 using DynamoDB Streams and an AWS Glue job. The Glue job reads the stream records, transforms them, and writes to S3 in Parquet format. The team notices that the Glue job is taking too long and consuming high DynamoDB read capacity. Which THREE actions would reduce the load on DynamoDB and improve performance? (Choose THREE.)

Select 3 answers

A.Use Amazon DynamoDB export to S3 (incremental) feature instead of Glue

B.Increase the DynamoDB write capacity units to handle the stream writes

C.Use DynamoDB Streams with AWS Lambda to write data directly to S3 in near-real-time, bypassing Glue

D.Increase the DynamoDB read capacity units to handle Glue's workload

E.Configure Glue to read from a S3 snapshot exported earlier instead of directly from DynamoDB

AnswersA, C, E

The export feature does not consume read capacity and can be automated.

Why this answer

Option C is correct because enabling DynamoDB Streams with a Lambda function to write to S3 directly avoids Glue's read from DynamoDB. Option A is correct because using DynamoDB export to S3 (incremental) does not consume read capacity. Option E is correct because using S3 as the source for Glue reduces DynamoDB reads.

Option D is wrong because increasing read capacity increases load. Option B is wrong because increasing write capacity does not affect reads.

Full explanation →
