Knowledge + Practice

AWS Certified Machine Learning Specialty (MLS-C01): Questions 601–675

1755 questions total · 24 pages · All types, answers revealed

601

Multi-Selecthard

Which TWO techniques can be used to detect multicollinearity among numerical features during exploratory data analysis? (Choose two.)

Select 2 answers

A.Apply Principal Component Analysis (PCA) and examine loadings.

B.Compute a correlation matrix and look for pairs with absolute correlation > 0.8.

C.Perform a t-test between each pair of features.

D.Calculate Variance Inflation Factor (VIF) for each feature.

E.Use a chi-square test of independence.

AnswersB, D

High correlation indicates multicollinearity.

Why this answer

Options B and D are correct. B: A correlation matrix shows pairwise correlations; high absolute values indicate collinearity. D: Variance Inflation Factor (VIF) quantifies how much of a feature's variance is explained by the other features.

A: PCA reduces dimensionality but does not detect collinearity directly; C: t-tests compare means; E: Chi-square tests independence for categorical variables.
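
The two correct techniques can be sketched in a few lines of NumPy. This is an illustrative example on synthetic data, not part of the exam material: the feature names, the 0.8 correlation cutoff taken from option B, and the helper functions `high_correlation_pairs` and `vif` are assumptions made for the sketch. VIF is computed as 1 / (1 - R²), where R² comes from regressing each feature on the others.

```python
import numpy as np

def high_correlation_pairs(X, names, threshold=0.8):
    """List feature pairs whose absolute Pearson correlation exceeds threshold."""
    corr = np.corrcoef(X, rowvar=False)
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(corr[i, j]) > threshold:
                pairs.append((names[i], names[j], float(corr[i, j])))
    return pairs

def vif(X):
    """Variance Inflation Factor per column: 1 / (1 - R^2) of column ~ other columns."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    result = []
    for k in range(p):
        y = X[:, k]
        others = np.delete(X, k, axis=1)
        A = np.column_stack([np.ones(n), others])  # add an intercept term
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1.0 - float(resid @ resid) / float(((y - y.mean()) ** 2).sum())
        result.append(float("inf") if r2 >= 1.0 else 1.0 / (1.0 - r2))
    return result

# Synthetic data (hypothetical): c is almost a copy of a, b is independent.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + 0.05 * rng.normal(size=500)
X = np.column_stack([a, b, c])

pairs = high_correlation_pairs(X, ["a", "b", "c"])
vifs = vif(X)
```

Here `pairs` flags only the (a, c) pair, and the VIF values for `a` and `c` are far above the commonly quoted rule-of-thumb range of 5 to 10, while `b` stays close to 1.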

602

MCQeasy

A data engineer needs to extract data from an Amazon RDS for MySQL database into Amazon S3 for further processing. The data volume is 2 TB and the job must run daily within a 1-hour window. Which AWS service is most suitable for this task?

A.Amazon Kinesis Data Firehose

B.AWS Database Migration Service (DMS)

C.Amazon Athena

D.AWS Glue

AnswerD

AWS Glue provides managed ETL jobs that can extract from JDBC sources and write to S3 on a schedule.

Why this answer

AWS Glue is designed for extract, transform, and load (ETL) jobs and can connect to JDBC sources like RDS and write to S3. DMS is for database migration, not scheduled ETL. Athena is for querying data in S3.

Kinesis is for streaming data.

603

MCQmedium

A company uses Amazon SageMaker to train an XGBoost model on a large dataset. Training takes a long time. Which action can reduce training time without significantly affecting model accuracy?

A.Use a deep neural network instead

B.Increase the learning rate

C.Use a larger instance type

D.Enable early stopping

AnswerD

Early stopping halts training once the validation metric stops improving.

Why this answer

Early stopping halts training when the model's performance on a validation set stops improving for a specified number of rounds. This prevents overfitting and reduces training time by eliminating unnecessary iterations, while typically preserving accuracy because the optimal model is already found.

Exam trap

AWS often tests the misconception that increasing learning rate or using more powerful hardware always speeds up training without side effects, but the correct answer focuses on algorithmic efficiency rather than resource scaling.

How to eliminate wrong answers

Option A is wrong because replacing XGBoost with a deep neural network generally increases training time and requires more data and tuning, not reducing time. Option B is wrong because increasing the learning rate can cause the model to converge to a suboptimal solution or diverge, significantly reducing accuracy. Option C is wrong because using a larger instance type increases computational resources and may reduce wall-clock time, but it does not reduce the total compute time or algorithmic iterations; it also incurs higher cost and does not address the root cause of long training.
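
The early-stopping rule described for this question can be sketched as a small patience check over per-round validation losses. This is a generic illustration, not SageMaker's or XGBoost's implementation; the function name, the `patience` and `min_delta` parameters, and the loss values are hypothetical.

```python
def early_stopping_round(val_losses, patience=3, min_delta=0.0):
    """Return (stop_round, best_round), both 0-indexed.

    Training stops at the first round where the validation loss has not
    improved by more than min_delta for `patience` consecutive rounds.
    """
    best = float("inf")
    best_round = -1
    for r, loss in enumerate(val_losses):
        if loss < best - min_delta:
            best, best_round = loss, r
        elif r - best_round >= patience:
            return r, best_round
    # Never triggered: training ran to the last round.
    return len(val_losses) - 1, best_round

# Hypothetical validation losses: improvement ends at round 3.
losses = [0.90, 0.70, 0.60, 0.58, 0.59, 0.60, 0.61, 0.62, 0.63]
stop, best = early_stopping_round(losses, patience=3)
```

With these numbers training stops at round 6 and the model from round 3 is kept, so the last two rounds are never computed.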

604

Multi-Selecthard

Which TWO statements about data leakage in machine learning are correct? (Select TWO.)

Select 2 answers

A.Using the target variable to filter features before splitting leads to data leakage

B.Applying SMOTE after splitting the dataset prevents data leakage

C.Applying standardization on the entire dataset before splitting into training and test sets can cause data leakage

D.Using cross-validation eliminates all possible data leakage

E.For time series data, using a random train-test split is recommended to avoid data leakage

AnswersA, C

Any statistic or selection step computed with data outside the training split leaks information into evaluation.

Why this answer

Options A and C are correct. Scaling before splitting is a classic source of data leakage. Using target information to filter features is also leakage.

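Option B is wrong because applying SMOTE after the split (to the training set only) avoids one specific source of leakage but does not prevent leakage in general. Option D is wrong because cross-validation prevents leakage only when every preprocessing step is fit inside each fold; it does not eliminate all possible leakage. Option E is wrong because a random split on time series lets future observations inform the model; a time-based split is the valid way to prevent leakage there.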
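
To make the standardization point concrete, the sketch below fits the scaling statistics on the training split only and then reuses them for the test split. The data is synthetic and the helper names `fit_standardizer` and `apply_standardizer` are made up for this illustration.

```python
import numpy as np

def fit_standardizer(train):
    """Compute per-feature mean and std from the training split only."""
    mean = train.mean(axis=0)
    std = train.std(axis=0)
    std = np.where(std == 0, 1.0, std)  # guard against constant columns
    return mean, std

def apply_standardizer(X, mean, std):
    """Scale any split with statistics that were learned on the training split."""
    return (X - mean) / std

rng = np.random.default_rng(42)
X = rng.normal(loc=5.0, scale=2.0, size=(1000, 3))

# Split first, then fit the scaler; the test rows never influence mean/std.
idx = rng.permutation(len(X))
train, test = X[idx[:800]], X[idx[800:]]
mean, std = fit_standardizer(train)
train_s = apply_standardizer(train, mean, std)
test_s = apply_standardizer(test, mean, std)
```

The scaled training split has zero mean and unit variance by construction; the test split is only approximately standardized, which is the expected, leak-free behavior.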

605

MCQeasy

A machine learning team needs to create a training dataset by joining two large datasets (10 TB and 5 TB) stored in S3. The join key is 'user_id'. They want to minimize data movement and cost. Which approach should they use?

A.Use AWS Glue ETL to read both datasets, join them using Spark DataFrames, and write the result to S3.

B.Launch an Amazon EMR cluster with Spark, read data from S3, perform the join, and write results back to S3.

C.Use Amazon Athena to run a SQL query joining the two datasets directly on S3.

D.Load both datasets into Amazon Redshift using COPY commands, then perform the join in Redshift.

AnswerC

Athena queries data in place, charges by the amount of data scanned, and requires no infrastructure management.

Why this answer

Option C is correct because Amazon Athena allows serverless SQL joins directly on S3 data without moving it, and is cost-effective for large ad-hoc queries. Option D is wrong because loading both datasets into Amazon Redshift with COPY moves the data out of S3 and adds warehouse storage and compute cost. Option B is wrong because EMR requires provisioning clusters and incurs compute costs even when idle.

Option A is wrong because Glue ETL typically moves data into a transformation environment.

606

MCQmedium

A company is using Amazon SageMaker to host a model for real-time inference. The model was trained using SageMaker's built-in Linear Learner algorithm. The endpoint has been running for a week, and the operations team notices that the endpoint's latency has increased from 50 ms to 150 ms over the past few days. The number of requests per second has remained steady at about 200. The team suspects a memory leak in the inference container. What should the team do to diagnose the issue?

A.Enable CloudWatch Logs and use Container Insights to view memory utilization.

B.Use Amazon CloudWatch to monitor the endpoint's latency metric.

C.Use SageMaker Debugger to inspect the inference container.

D.Use SageMaker Model Monitor to detect data drift.

AnswerA

Container Insights shows memory usage trends, helping diagnose leaks.

Why this answer

Option A is correct because CloudWatch Logs and Container Insights can show memory usage over time. Option B is wrong because the CloudWatch latency metric confirms the symptom but does not show memory usage. Option C is wrong because SageMaker Debugger is for training.

Option D is wrong because SageMaker Model Monitor is for data drift, not memory leaks.

607

MCQmedium

A data scientist is using Amazon SageMaker to train a model on a dataset that contains both numerical and categorical features. The categorical features have high cardinality (e.g., postal codes, product IDs). Which feature engineering approach is most suitable for handling these high-cardinality categorical features in a tree-based model?

A.One-hot encode the categorical features

B.Use label encoding

C.Apply target encoding

D.Apply binary encoding

AnswerB

Tree-based models like XGBoost can effectively use label-encoded features because they make splits based on ordering.

Why this answer

Label encoding is suitable for tree-based models because these models split on feature values and can handle ordinal relationships implicitly. Unlike linear models, tree-based models do not assume any distance metric between categories, so label encoding avoids the dimensionality explosion of one-hot encoding while preserving the ability to capture splits based on high-cardinality features.

Exam trap

The trap here is that candidates often default to one-hot encoding for categorical features without considering the model type, failing to recognize that tree-based models can effectively use label encoding for high-cardinality features without the drawbacks of dimensionality explosion.

How to eliminate wrong answers

Option A is wrong because one-hot encoding high-cardinality features (e.g., thousands of unique postal codes) creates an extremely sparse feature matrix with many columns, leading to increased memory usage, slower training, and potential overfitting in tree-based models. Option C is wrong because target encoding, while effective for high-cardinality features, introduces target leakage and can cause overfitting if not carefully regularized, making it less robust than label encoding for tree-based models in a straightforward SageMaker training pipeline. Option D is wrong because binary encoding, though more compact than one-hot, still creates multiple binary columns per feature and can complicate interpretability; tree-based models can handle label encoding directly without needing this transformation.

608

Drag & Dropmedium

Drag and drop the steps to create an Amazon SageMaker notebook instance in the correct order.

Why this order

Creating a notebook instance requires navigating the SageMaker console, configuring instance settings, IAM role, and VPC, then launching.

609

MCQhard

A data engineer is investigating why an Athena query against the my-data-lake bucket is slow. The query filters on year, month, and day. The exhibit shows the metadata of one Parquet file. What is the MOST likely cause of the slow query?

A.The version ID is null, causing data inconsistency

B.The file is too large, causing Athena to process it in a single task

C.The partition columns are not being used in the query

D.The storage class is STANDARD, which is slower than GLACIER

AnswerB

Large files limit parallelism; Athena works best with files 128-512 MB.

Why this answer

The file is 1 GB (1073741824 bytes), which is large for a single Parquet file. Athena splits files into tasks; a single large file cannot be parallelized, causing slow performance. Partitioning is fine, but file size matters.

The metadata is not missing. Storage class is standard. Versioning is not enabled.

610

MCQeasy

A team is training a linear regression model to predict house prices. After training, they observe that the model has high bias (underfitting). Which action is most likely to reduce bias?

A.Increase the regularization strength.

B.Reduce the amount of training data.

C.Decrease the number of model parameters.

D.Add more relevant features and increase model complexity.

AnswerD

Adding features reduces bias.

Why this answer

High bias (underfitting) means the model is too simple to capture the underlying patterns in the data. Adding more relevant features and increasing model complexity (e.g., using polynomial features or more interaction terms) gives the linear regression model greater capacity to fit the training data, directly reducing bias. This aligns with the bias-variance tradeoff, where increasing complexity lowers bias at the cost of potentially increasing variance.

Exam trap

The trap here is that candidates often confuse regularization (which controls overfitting) with bias reduction, mistakenly thinking increasing regularization or reducing parameters will fix underfitting, when in fact those actions increase bias.

How to eliminate wrong answers

Option A is wrong because increasing regularization strength (e.g., L1 or L2 penalty) forces the model to shrink coefficients toward zero, which increases bias and worsens underfitting. Option B is wrong because reducing the amount of training data does not address model simplicity; it typically increases variance and can exacerbate bias if the model cannot learn the true distribution. Option C is wrong because decreasing the number of model parameters (e.g., removing features or using a simpler model) reduces complexity, which directly increases bias and makes underfitting worse.
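
A small numerical illustration of the bias argument: on data with a quadratic relationship, a straight-line fit underfits, and adding a squared term removes most of the training error. The data is synthetic and `training_mse` is a helper invented for this sketch.

```python
import numpy as np

def training_mse(x, y, degree):
    """Fit a polynomial of the given degree and return its training MSE."""
    coeffs = np.polyfit(x, y, deg=degree)
    pred = np.polyval(coeffs, x)
    return float(np.mean((y - pred) ** 2))

# Synthetic target with a quadratic dependence on x plus a little noise.
rng = np.random.default_rng(1)
x = np.linspace(-3, 3, 200)
y = 2.0 * x**2 + 1.0 + rng.normal(scale=0.5, size=x.size)

mse_linear = training_mse(x, y, degree=1)     # too simple: high bias
mse_quadratic = training_mse(x, y, degree=2)  # adds the x^2 feature
```

The quadratic fit's error drops to roughly the noise level, while the linear fit stays far above it. Validation error should still be checked, since extra complexity can raise variance.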

611

MCQmedium

A data scientist is using Amazon SageMaker to train a natural language processing model using a custom Docker container. The training script reads data from an S3 bucket and writes checkpoints to an S3 bucket. The training job is failing with the error 'Unable to write to checkpoint path: s3://my-bucket/checkpoints/'. The IAM role associated with the training job has the following policy: {'Effect': 'Allow', 'Action': 's3:PutObject', 'Resource': 'arn:aws:s3:::my-bucket/checkpoints/*'}. The bucket 'my-bucket' exists and the prefix 'checkpoints/' is empty. What is the most likely cause of the failure?

A.The IAM role is missing the s3:ListBucket permission

B.The IAM role does not have s3:PutObject permission

C.The S3 bucket does not exist

D.The checkpoint prefix already contains objects

AnswerA

SageMaker needs ListBucket to access the bucket.

Why this answer

The error 'Unable to write to checkpoint path' occurs because the SageMaker training job's IAM role lacks the `s3:ListBucket` permission. Even though the role has `s3:PutObject` on the checkpoint prefix, SageMaker's S3 client first performs a `ListObjects` (or `HeadObject`) call to verify the bucket exists and to check the prefix state before writing. Without `s3:ListBucket` on the bucket itself, the API call fails, causing the write operation to abort.

Exam trap

The trap here is that candidates assume `s3:PutObject` alone is sufficient for writing to S3, but AWS requires `s3:ListBucket` on the bucket to verify the path before writing, a nuance frequently tested in MLS-C01 and SAA exams.

How to eliminate wrong answers

Option B is wrong because the policy explicitly includes `s3:PutObject` on the checkpoint path, so the permission is present. Option C is wrong because the question states the bucket 'my-bucket' exists, so the bucket is not missing. Option D is wrong because the prefix is explicitly described as empty, and even if it contained objects, `s3:PutObject` would still succeed; the error is about the inability to write, not about overwriting existing objects.
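
For reference, a policy that grants both permissions discussed in this question might look like the sketch below. The bucket and prefix come from the question; the `checkpoint_policy` helper is hypothetical. Note that `s3:ListBucket` is a bucket-level action, so its resource is the bucket ARN, whereas `s3:PutObject` applies to object ARNs under the prefix.

```python
import json

def checkpoint_policy(bucket, prefix):
    """Build an IAM policy document allowing listing the bucket and writing checkpoints."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Bucket-level action: the resource is the bucket itself.
                "Effect": "Allow",
                "Action": "s3:ListBucket",
                "Resource": f"arn:aws:s3:::{bucket}",
            },
            {
                # Object-level action: the resource is the set of keys under the prefix.
                "Effect": "Allow",
                "Action": "s3:PutObject",
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}*",
            },
        ],
    }

policy = checkpoint_policy("my-bucket", "checkpoints/")
policy_json = json.dumps(policy, indent=2)
```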

612

MCQeasy

A data scientist is working on a project to predict customer churn. The dataset contains 50,000 rows and 20 features, including categorical variables like 'Region' (10 categories) and 'SubscriptionType' (5 categories). The target variable is binary (churn or not). During exploratory data analysis, they plot the distribution of each feature and notice that 'Region' has a highly imbalanced distribution: one region accounts for 80% of the data. Which of the following is the most appropriate next step?

A.Apply one-hot encoding to the 'Region' feature.

B.Remove the 'Region' feature from the dataset.

C.Group rare categories into an 'Other' category.

D.Oversample the minority classes in the target variable.

AnswerC

This reduces sparsity and helps the model learn patterns for rare categories.

Why this answer

Option C is correct because imbalanced categorical features may cause the model to ignore rare categories; grouping rare levels into an 'Other' category can improve model performance. Option B is wrong because removing the feature could discard useful information. Option A is wrong because one-hot encoding does not address imbalance.

Option D is wrong because oversampling addresses target imbalance, not feature imbalance.
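
Grouping rare levels can be sketched as follows. The region names, counts, and the 5% share threshold are invented for illustration; in practice the set of kept categories would be decided on the training split and reused for new data.

```python
from collections import Counter

def group_rare(values, min_share=0.05, other="Other"):
    """Replace categories whose share of rows is below min_share with `other`."""
    counts = Counter(values)
    total = len(values)
    keep = {v for v, c in counts.items() if c / total >= min_share}
    return [v if v in keep else other for v in values]

# Hypothetical 'Region' column: one dominant level and several rare ones.
regions = ["North"] * 80 + ["South"] * 10 + ["East"] * 4 + ["West"] * 3 + ["Central"] * 3
grouped = group_rare(regions, min_share=0.05)
counts = Counter(grouped)
```

After grouping, the column has three levels (North, South, Other) instead of five, and the merged level has enough rows for a model to learn from.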

613

MCQmedium

A company uses SageMaker to host a real-time inference endpoint for a classification model. The endpoint receives traffic spikes that cause high latency. The team wants a solution that automatically scales based on demand while keeping costs low. Which approach is BEST?

A.Use provisioned concurrency for the endpoint

B.Use a multi-model endpoint to serve multiple models

C.Deploy the endpoint on Spot Instances

D.Enable automatic scaling on the endpoint using Application Auto Scaling

AnswerD

Automatic scaling adjusts instance count based on demand, balancing cost and latency.

Why this answer

SageMaker endpoints support automatic scaling through Application Auto Scaling, typically with a target-tracking policy on the predefined metric 'SageMakerVariantInvocationsPerInstance' (invocations per instance). Provisioned concurrency applies to SageMaker Serverless Inference, not to instance-based real-time endpoints. Spot Instances are not recommended for real-time endpoints due to interruptions.

Multi-model endpoints help but scaling is still needed.
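
As a rough sketch of what enabling auto scaling involves, the functions below build the arguments for the two Application Auto Scaling calls, `register_scalable_target` and `put_scaling_policy`. The endpoint name, variant name, capacity limits, target value, and cooldowns are placeholders, and the helper functions themselves are hypothetical; no AWS call is made in this block.

```python
def scalable_target_args(endpoint_name, variant_name, min_instances, max_instances):
    """Arguments for application-autoscaling register_scalable_target."""
    return {
        "ServiceNamespace": "sagemaker",
        "ResourceId": f"endpoint/{endpoint_name}/variant/{variant_name}",
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
        "MinCapacity": min_instances,
        "MaxCapacity": max_instances,
    }

def target_tracking_policy_args(endpoint_name, variant_name, invocations_per_instance):
    """Arguments for application-autoscaling put_scaling_policy (target tracking)."""
    return {
        "PolicyName": f"{endpoint_name}-invocations-target",  # placeholder name
        "ServiceNamespace": "sagemaker",
        "ResourceId": f"endpoint/{endpoint_name}/variant/{variant_name}",
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            "TargetValue": float(invocations_per_instance),
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
            },
            "ScaleInCooldown": 300,   # seconds; illustrative values
            "ScaleOutCooldown": 60,
        },
    }

target = scalable_target_args("churn-endpoint", "AllTraffic", 1, 4)
policy = target_tracking_policy_args("churn-endpoint", "AllTraffic", 200)

# With boto3 installed and credentials configured, these would be applied as:
#   client = boto3.client("application-autoscaling")
#   client.register_scalable_target(**target)
#   client.put_scaling_policy(**policy)
```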

614

Multi-Selecthard

A machine learning engineer is tuning a Gradient Boosting model for a regression task. The dataset contains 50 features and 100,000 samples. The engineer wants to speed up training without sacrificing predictive performance significantly. Which THREE hyperparameters should the engineer consider adjusting? (Choose THREE.)

Select 3 answers

A.Reduce the subsample ratio (e.g., from 1.0 to 0.5)

B.Increase learning_rate and decrease n_estimators proportionally

C.Increase the number of estimators

D.Decrease max_depth of trees

E.Reduce max_features (e.g., from 'auto' to 0.5)

AnswersA, D, E

Using fewer samples per tree speeds training.

Why this answer

Option A (subsample) uses a fraction of samples per tree, reducing overfitting and training time. Option E (max_features) limits features considered for splits, reducing computation. Option B (learning_rate with n_estimators) is a trade-off: the two settings are coupled, since a lower learning rate needs more estimators, and raising the rate to cut estimators risks losing predictive performance.

Option D (max_depth): lower depth speeds up training. Option C (n_estimators): increasing the number of estimators slows training.

615

Multi-Selectmedium

A company is using Amazon SageMaker to train and deploy machine learning models. The data science team wants to track and compare model versions, hyperparameters, and metrics across multiple training jobs. Which TWO AWS services should they use together to achieve this? (Choose TWO.)

Select 2 answers

A.Amazon RDS

B.Amazon CloudWatch Logs

C.AWS Glue

D.Amazon S3

E.Amazon SageMaker Experiments

AnswersD, E

S3 stores experiment artifacts and outputs.

Why this answer

Options D and E are correct. SageMaker Experiments provides tracking of training jobs, metrics, and parameters. Amazon S3 stores the experiment outputs and artifacts.

Option C is wrong because AWS Glue is for ETL, not for experiment tracking. Option A is wrong because Amazon RDS is a relational database. Option B is wrong because CloudWatch Logs is for logs, not for experiment tracking.

616

MCQhard

A research team is training a deep learning model for object detection using SageMaker's built-in SSD algorithm. The dataset contains 50,000 images with bounding box annotations. The team uses a single ml.p3.2xlarge instance. After 24 hours of training, the model's loss has plateaued, but the mean average precision (mAP) on validation is only 0.45. The team wants to improve mAP without increasing training time. Which action should they take?

A.Increase the learning rate by a factor of 2

B.Use a pre-trained model as the backbone (e.g., ResNet-50 pre-trained on ImageNet)

C.Increase the batch size to 64

D.Add more convolutional layers to the backbone

AnswerB

Transfer learning boosts accuracy with no additional training time.

Why this answer

Option B (use a pre-trained backbone) transfers learned features, often improving accuracy. Option C (increase batch size) may not improve mAP and could slow convergence. Option D (add more layers) increases training time.

Option A (increase learning rate) may destabilize training.

617

MCQmedium

A machine learning team is analyzing a dataset with a target variable that is highly imbalanced (99% negative class, 1% positive class). They want to understand the distribution and relationships before modeling. Which exploratory data analysis technique is most appropriate to visualize the imbalance and guide resampling strategy?

A.Confusion matrix on a sample of the data

B.Scatterplot matrix of all features colored by class

C.Box plots of each feature grouped by the target class

D.Bar chart of class frequencies and a correlation heatmap

AnswerD

Bar chart shows imbalance clearly; correlation heatmap helps identify features related to the target.

Why this answer

Option D is correct because a bar chart of class counts clearly shows the imbalance, and a correlation heatmap helps understand feature relationships with the target. Option B is wrong because a scatterplot matrix is for continuous variables, not for a binary target. Option C is wrong because box plots show distribution of continuous features by class, but not the imbalance itself.

Option A is wrong because a confusion matrix is for model evaluation, not for initial data exploration.

618

MCQhard

A company is streaming data from thousands of devices using Amazon Kinesis Data Streams. The data is consumed by an AWS Lambda function that processes each record. The Lambda function is experiencing high error rates and throttling due to the volume of data. Which action would MOST effectively improve the processing throughput and reduce errors?

A.Send the data to Amazon SQS first and then process with Lambda

B.Use Amazon Kinesis Data Firehose instead of Kinesis Data Streams

C.Increase the Lambda function's batch size and reduce the batch window

D.Increase the number of shards in the Kinesis stream

AnswerD

More shards increase parallelism and throughput, reducing throttling.

Why this answer

Increasing the number of shards increases the stream's capacity. Using a larger batch size in Lambda (option C) can improve throughput but may cause timeouts. Using SQS (option A) introduces another queue.

Using Firehose (option B) changes the architecture. Increasing shards is the most direct way to increase throughput.
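
The capacity argument can be turned into a quick estimate. Each provisioned shard accepts up to 1,000 records per second and about 1 MB per second of writes, so the number of shards needed is set by whichever limit is hit first. The traffic figures below are made up, the `required_shards` helper is hypothetical, and 1 MB is treated conservatively as 1,000,000 bytes.

```python
import math

RECORDS_PER_SHARD_PER_SEC = 1000
BYTES_PER_SHARD_PER_SEC = 1_000_000  # conservative reading of the 1 MB/s write limit

def required_shards(records_per_second, avg_record_bytes):
    """Minimum shard count that satisfies both per-shard write limits."""
    by_count = math.ceil(records_per_second / RECORDS_PER_SHARD_PER_SEC)
    by_bytes = math.ceil(records_per_second * avg_record_bytes / BYTES_PER_SHARD_PER_SEC)
    return max(1, by_count, by_bytes)

# Hypothetical workloads.
small_records = required_shards(records_per_second=5000, avg_record_bytes=500)
large_records = required_shards(records_per_second=2000, avg_record_bytes=2000)
```

The first workload is bound by record count and the second by bytes. By default Lambda processes one batch per shard at a time, so adding shards also raises the number of concurrent Lambda invocations reading from the stream.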

619

MCQmedium

A machine learning team is using SageMaker to train a deep learning model. The training job is failing due to insufficient GPU memory. Which approach should the team take to resolve this issue without changing the model architecture?

A.Increase the batch size.

B.Use gradient accumulation to reduce the effective batch size per step.

C.Add more GPUs to the training instance.

D.Decrease the learning rate.

AnswerB

Gradient accumulation allows training with larger effective batches while keeping per-step memory low.

Why this answer

Option B is correct because gradient accumulation simulates larger batch sizes without increasing memory per step. Option A is wrong because increasing batch size increases memory usage. Option D is wrong because reducing the learning rate does not affect memory.

Option C is wrong because adding more GPUs to a single instance may not help if memory is already exhausted on each GPU; the key is to reduce per-GPU memory, which gradient accumulation achieves by processing a smaller micro-batch per step.
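
The idea behind gradient accumulation can be checked numerically: summing suitably weighted gradients from small micro-batches gives the same gradient as one large batch, while only one micro-batch needs to be held in memory at a time. The sketch uses a linear model with mean squared error in NumPy rather than a deep learning framework; the data and function names are illustrative.

```python
import numpy as np

def mse_gradient(w, X, y):
    """Gradient of 0.5 * mean((X @ w - y)^2) with respect to w."""
    return X.T @ (X @ w - y) / len(y)

def accumulated_gradient(w, X, y, micro_batch_size):
    """Accumulate micro-batch gradients into the full-batch gradient."""
    total = np.zeros_like(w)
    for start in range(0, len(y), micro_batch_size):
        Xb = X[start:start + micro_batch_size]
        yb = y[start:start + micro_batch_size]
        # Weight each micro-batch by its share of the full batch.
        total += mse_gradient(w, Xb, yb) * (len(yb) / len(y))
    return total

rng = np.random.default_rng(7)
X = rng.normal(size=(64, 5))
y = rng.normal(size=64)
w = rng.normal(size=5)

full = mse_gradient(w, X, y)                               # one batch of 64
accum = accumulated_gradient(w, X, y, micro_batch_size=16)  # four batches of 16
```

In a real training loop the optimizer step is applied once after the accumulation, so the update matches the large-batch update while peak activation memory follows the micro-batch size.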

620

Multi-Selectmedium

Which TWO configuration steps are necessary to deploy a custom Docker container for training in Amazon SageMaker? (Choose two.)

Select 2 answers

A.Expose a REST API endpoint for inference

B.Implement the train function in the container that saves model artifacts to /opt/ml/model

C.Define a Docker Compose file to manage multi-container training

D.Include a training script that reads hyperparameters from /opt/ml/input/config/hyperparameters.json

E.Push the container image to Docker Hub

AnswersB, D

SageMaker expects the model to be saved in /opt/ml/model.

Why this answer

Options B and D are correct. SageMaker runs the container's train entry point, supplies hyperparameters in /opt/ml/input/config/hyperparameters.json, and packages whatever is written to /opt/ml/model as the model artifact. Option E is incorrect because the image is expected in Amazon ECR, not Docker Hub.

Option A is only needed for inference containers, not for training. Option C is incorrect because SageMaker does not use Docker Compose.
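
A minimal sketch of the training side of such a container is shown below. It reads hyperparameters from the path named in option D and writes an artifact under the model directory named in option B. The hyperparameter names, the JSON "model", and the `base_dir` parameter (added so the function can be exercised outside SageMaker) are assumptions for illustration; hyperparameter values are parsed from strings, which is how they are expected to arrive in that file.

```python
import json
import os

def train(base_dir="/opt/ml"):
    """Read hyperparameters, 'train', and save an artifact under <base_dir>/model."""
    hp_path = os.path.join(base_dir, "input", "config", "hyperparameters.json")
    with open(hp_path) as f:
        raw = json.load(f)

    # Values are parsed from strings; names and defaults here are hypothetical.
    epochs = int(raw.get("epochs", "1"))
    learning_rate = float(raw.get("learning_rate", "0.1"))

    # Placeholder for real training: the "model" is just the parsed settings.
    model = {"epochs": epochs, "learning_rate": learning_rate}

    model_dir = os.path.join(base_dir, "model")
    os.makedirs(model_dir, exist_ok=True)
    out_path = os.path.join(model_dir, "model.json")
    with open(out_path, "w") as f:
        json.dump(model, f)
    return out_path

# Inside the container, the script invoked for training would simply call train().
```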

621

MCQmedium

A company wants to deploy a machine learning model that predicts customer churn. The model must provide interpretable predictions to explain why a customer is likely to churn. Which algorithm is most appropriate?

A.Gradient boosting machine

B.Support vector machine (SVM)

C.Decision tree

D.Deep neural network

AnswerC

Decision trees are highly interpretable.

Why this answer

Decision trees are inherently interpretable because they produce a clear, rule-based structure that shows exactly which features and thresholds lead to a churn prediction. This white-box nature allows stakeholders to trace the reasoning for each prediction, meeting the requirement for interpretability without needing post-hoc explanation methods.

Exam trap

AWS often tests the trade-off between model accuracy and interpretability, where candidates mistakenly choose a high-performance black-box model (like gradient boosting or neural networks) without recognizing that the question explicitly prioritizes interpretability over raw predictive power.

How to eliminate wrong answers

Option A is wrong because gradient boosting machines are ensemble models that combine many weak learners, making them highly accurate but difficult to interpret directly; they require techniques like SHAP or LIME for explanation, which adds complexity. Option B is wrong because support vector machines operate in high-dimensional feature spaces using kernel functions, producing decision boundaries that are not easily interpretable without additional tools. Option D is wrong because deep neural networks are black-box models with multiple hidden layers and non-linear transformations, making their predictions opaque and requiring external interpretability methods.

622

Multi-Selecthard

Which THREE considerations are important when designing a data lake on Amazon S3?

Select 3 answers

A.Setting up S3 Lifecycle policies to transition data to colder storage

B.Using Provisioned IOPS for S3

C.Partitioning data by date to improve query performance

D.Using a single Availability Zone for data storage

E.Encrypting data at rest using AWS KMS

AnswersA, C, E

Lifecycle policies manage cost.

Why this answer

Organizing data with partitioning, encrypting data at rest, and setting up lifecycle policies are key. Single Availability Zone is not an S3 consideration (S3 is regional). Using Provisioned IOPS is for block storage.

623

Multi-Selecthard

A data scientist is training a random forest model for regression. The model shows high variance on the validation set. Which TWO actions are most likely to reduce variance? (Choose 2.)

Select 2 answers

A.Use bootstrap sampling with replacement

B.Decrease the maximum depth of trees

C.Increase the minimum samples per leaf

D.Increase the number of trees in the forest

E.Increase the number of features considered at each split

AnswersB, C

Shallow trees reduce overfitting, lowering variance.

Why this answer

Decreasing depth and increasing min samples per leaf both reduce complexity, lowering variance.

624

MCQeasy

A machine learning team is training a deep learning model on Amazon SageMaker and notices that the training loss is decreasing but the validation loss is increasing. What is the most likely cause?

A.Vanishing gradients

B.Overfitting the training data

C.Learning rate is too high

D.Underfitting the training data

AnswerB

Overfitting occurs when the model learns noise in the training data, causing validation loss to increase after a point.

Why this answer

When training loss continues to decrease while validation loss increases, the model is memorizing the training data rather than learning generalizable patterns. This is the classic symptom of overfitting, where the model's capacity exceeds what is needed for the underlying data distribution, causing it to fit noise in the training set. In Amazon SageMaker, this can be observed by monitoring the validation loss metric during training jobs.

Exam trap

AWS often tests the distinction between overfitting and a high learning rate by presenting a scenario where training loss decreases but validation loss increases; candidates mistakenly attribute it to a learning rate that is too high, not recognizing that a high learning rate would cause both losses to diverge or oscillate.

How to eliminate wrong answers

Option A is wrong because vanishing gradients cause the model to stop learning entirely, resulting in both training and validation loss stagnating or decreasing very slowly, not a divergence between the two. Option C is wrong because a learning rate that is too high typically causes the loss to oscillate or diverge on both training and validation sets, not a monotonic decrease in training loss with an increase in validation loss. Option D is wrong because underfitting means the model is too simple to capture patterns, leading to high loss on both training and validation sets, not a decreasing training loss.

625

Multi-Selecthard

A machine learning team is using Amazon SageMaker to train a deep learning model on a large dataset stored in Amazon S3. The training job is taking too long. The team wants to reduce training time without modifying the model architecture. Which THREE actions should the team take? (Choose 3.)

Select 3 answers

A.Enable SageMaker Managed Spot Training to use cheaper spot instances.

B.Use a larger instance type with more vCPUs and memory.

C.Use distributed training with multiple GPU instances.

D.Use Pipe input mode to stream data from S3 instead of downloading it.

E.Use SageMaker Processing to preprocess the data.

AnswersA, C, D

Spot instances can reduce cost and training time if interruptions are handled.

Why this answer

Option D is correct because Pipe mode streams data directly from S3, reducing download time. Option A is listed as correct because SageMaker Managed Spot Training uses cheaper Spot Instances; its main benefit is cost, and interruptions must be handled with checkpoints. Option C is correct because distributed training with multiple GPUs can parallelize computation.

Option B is wrong because increasing instance size may help but is less effective than distributed training. Option E is wrong because SageMaker Processing is for data processing, not training.

626

Multi-Selecthard

Which THREE factors should be considered when choosing an instance type for a SageMaker training job?

Select 3 answers

A.The number of vCPUs needed for parallel processing

B.The memory requirements of the model

C.The endpoint latency requirement

D.The AWS region where the instance is launched

E.The GPU requirements for model training

AnswersA, B, E

More vCPUs can speed up training.

Why this answer

Options A, B, and E are correct. Option D is wrong because the region is not a factor in sizing the instance; you can use any region. Option C is wrong because endpoints are for serving, not training.

627

MCQhard

A data scientist is performing feature engineering on a dataset with high cardinality categorical features (e.g., ZIP codes with thousands of unique values). Which technique is most effective for reducing dimensionality while preserving predictive power?

A.Hash encoding

B.One-hot encoding

C.Target encoding

D.Label encoding

AnswerC

Target encoding reduces cardinality by using target statistics, preserving predictive power.

Why this answer

Option C is correct because target encoding (mean encoding) replaces categories with the mean of the target variable, which captures predictive signal and reduces cardinality. Option B is wrong because one-hot encoding creates many columns, leading to high dimensionality. Option D is wrong because label encoding implies ordinality that may not exist.

Option A is wrong because hashing can cause collisions and loss of information.
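
Target (mean) encoding can be sketched as below. The ZIP codes and labels are invented, and the additive smoothing toward the global mean is one common variant chosen for this illustration rather than something the question specifies. The mapping is fit on training rows only, and unseen categories fall back to the global mean.

```python
from collections import defaultdict

def fit_target_encoding(categories, targets, smoothing=10.0):
    """Learn a category -> smoothed target mean mapping from training data."""
    global_mean = sum(targets) / len(targets)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for c, t in zip(categories, targets):
        sums[c] += t
        counts[c] += 1
    # Shrink each category mean toward the global mean; rare categories shrink more.
    mapping = {
        c: (sums[c] + smoothing * global_mean) / (counts[c] + smoothing)
        for c in counts
    }
    return mapping, global_mean

def transform(categories, mapping, global_mean):
    """Encode categories; anything unseen during fitting gets the global mean."""
    return [mapping.get(c, global_mean) for c in categories]

# Hypothetical training data: two ZIP codes with opposite outcomes.
train_zip = ["10001"] * 4 + ["94105"] * 4
train_y = [1, 1, 1, 1, 0, 0, 0, 0]
mapping, global_mean = fit_target_encoding(train_zip, train_y, smoothing=4.0)
encoded = transform(["10001", "94105", "60601"], mapping, global_mean)
```

Each category becomes a single numeric column regardless of how many distinct values exist. Because the encoding uses the target, it should be fit inside each training fold to avoid leaking labels into validation data.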

628

MCQmedium

Refer to the exhibit. An ML engineer runs the above CLI command to inspect files in an S3 bucket. The training data consists of 200 CSV files, each 1 GB. The engineer plans to use Amazon SageMaker to train a model using this data. What should the engineer do to optimize training performance?

A.Increase the number of training instances to process files in parallel.

B.Use Amazon Athena to transform the data into CSV format with headers.

C.Use the File input mode and copy all files to the training instance's EBS volume.

D.Convert the CSV files to Parquet format and use Pipe input mode.

AnswerD

Parquet is columnar and compressed; Pipe mode streams data directly from S3.

Why this answer

Option D is correct because converting to Parquet and using Pipe input mode reduces I/O and improves throughput. Option C is wrong because copying all 200 GB to the instance's EBS volume in File mode is not efficient. Option B is wrong because Athena is for querying, not training. Option A is wrong because increasing instances may help but does not address the inefficiency of the CSV format.

629

MCQmedium

A data engineer is designing a data lake on Amazon S3. The data is collected from IoT devices and is highly variable in volume. The engineer needs to ensure that the data is ingested reliably and can be processed in near real-time. Which AWS service should be used to ingest the data into the data lake?

A.Amazon Kinesis Data Firehose

B.AWS Glue

C.Amazon Kinesis Data Streams

D.Amazon Simple Queue Service (SQS)

AnswerA

Firehose can load streaming data directly into S3 with near real-time latency.

Why this answer

Amazon Kinesis Data Firehose is the best choice for reliably ingesting streaming data into S3 with near real-time delivery, automatic scaling, and optional transformations.

630

MCQmedium

A company is deploying a machine learning model using SageMaker. The model is a PyTorch model that requires GPU for inference. The company wants to minimize costs while ensuring low latency. Which instance type should be used for the SageMaker endpoint?

A.ml.m5.large

B.ml.p3.2xlarge

C.ml.c5.2xlarge

D.ml.g4dn.xlarge

AnswerB

GPU instance optimized for inference, cost-effective.

Why this answer

Option B is correct because ml.p3.2xlarge is a GPU instance optimized for inference with low latency, and it is cost-effective for production workloads. Option C is wrong because ml.c5.2xlarge is a CPU instance and does not have a GPU. Option D is wrong because ml.g4dn.xlarge is also a GPU instance but may not provide the same performance for PyTorch models as ml.p3. Option A is wrong because ml.m5.large is a general-purpose CPU instance without GPU support.

631

MCQhard

A data engineer configures an S3 event notification to trigger an AWS Lambda function when a new object is created in 'my-input-bucket'. The Lambda function processes the CSV file and writes results to 'my-output-bucket'. The engineer notices that the Lambda function is not triggered for some objects. Which step should the engineer take to diagnose the issue?

A.Check the Lambda function's execution role for permissions to write to the output bucket.

B.Review the CloudWatch Logs for the Lambda function to see if there are errors.

C.Check the Lambda function's resource-based policy to ensure S3 has permission to invoke the function.

D.Verify that the S3 event notification is configured with the correct prefix and suffix filters.

AnswerC

Missing invoke permission is a common cause of trigger failure.

Why this answer

Option C is correct. The most likely issue is missing permissions: S3 must have permission to invoke the Lambda function, so the engineer should check the Lambda resource-based policy to ensure it allows invocation from S3. Option D is wrong because the event notification filter configuration is separate from the invoke permission. Option A is wrong because the Lambda function's execution role needs permission to write to the output bucket, but a missing write permission would not prevent triggering. Option B is wrong because CloudWatch Logs show invocations that happened, not events that never triggered the function.
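The check described above can be sketched as a small function that inspects a Lambda resource-based policy document. The helper name `s3_can_invoke` and the sample policy are hypothetical. The sketch does no wildcard matching and only looks at an `ArnLike` condition on `AWS:SourceArn`, so it is a simplification rather than a full policy evaluator.

```python
import json


def s3_can_invoke(policy_json, bucket_arn):
    """Return True if some statement lets S3 invoke the function for bucket_arn."""
    policy = json.loads(policy_json)
    for statement in policy.get("Statement", []):
        actions = statement.get("Action", [])
        if isinstance(actions, str):
            actions = [actions]
        principal = statement.get("Principal", {})
        service = principal.get("Service") if isinstance(principal, dict) else None
        source_arn = (
            statement.get("Condition", {}).get("ArnLike", {}).get("AWS:SourceArn")
        )
        if (
            statement.get("Effect") == "Allow"
            and service == "s3.amazonaws.com"
            and "lambda:InvokeFunction" in actions
            and source_arn in (None, bucket_arn)  # no condition, or exact match
        ):
            return True
    return False


# Illustrative policy document; the function ARN is made up for this example.
sample_policy = json.dumps({
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "AllowS3Invoke",
        "Effect": "Allow",
        "Principal": {"Service": "s3.amazonaws.com"},
        "Action": "lambda:InvokeFunction",
        "Resource": "arn:aws:lambda:us-east-1:111122223333:function:process-csv",
        "Condition": {"ArnLike": {"AWS:SourceArn": "arn:aws:s3:::my-input-bucket"}},
    }],
})
allowed = s3_can_invoke(sample_policy, "arn:aws:s3:::my-input-bucket")
```

The policy text itself could be fetched with `aws lambda get-policy`, which returns the document as a JSON string.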

632

MCQhard

A data engineer runs the AWS CLI command shown and notices a zero-byte file in the results. What is the most likely cause of this zero-byte file?

A.The S3 bucket has a lifecycle policy that deleted the content.

B.The file was written with the wrong prefix.

C.The file was compressed, reducing size to zero.

D.The file was created by a failed Spark task that wrote no data.

AnswerD

Failed tasks can produce empty files.

Why this answer

Zero-byte files often occur when an ETL job fails partway through writing, or when a task starts but writes no data; a completed write would have non-zero size. The other options are less likely: a wrong prefix changes where the object is written, not its size; a lifecycle policy expires or transitions whole objects rather than emptying them; and compression would still produce some output.

633

MCQhard

A company operates a real-time fraud detection system using SageMaker. The model is deployed on an ml.c5.xlarge instance behind an Application Load Balancer (ALB). Recently, during a sales event, traffic spiked and the endpoint returned HTTP 503 errors. The team scaled the instance count from 2 to 5, but errors persisted. CloudWatch metrics show low CPU utilization (~30%) and high memory usage (~90%). The model loads a large dictionary file (2GB) into memory at startup. Which action should resolve the issue?

A.Enable auto-scaling with Spot instances.

B.Switch to a compute-optimized instance type like c5.2xlarge.

C.Increase the number of instances further to 10.

D.Use a memory-optimized instance type like r5.large.

AnswerD

r5.large has 16 GiB of memory versus 8 GiB for c5.xlarge, allowing the model to handle more requests.

Why this answer

Option D is correct because the high memory usage indicates the instance type does not have enough memory to handle concurrent requests. Using a memory-optimized instance like r5.large provides more memory per vCPU, reducing memory pressure and preventing OOM errors. Option B is incorrect because low CPU utilization suggests the bottleneck is not CPU. Option C is incorrect because increasing instances does not help if each instance is memory-constrained. Option A is incorrect because using Spot instances may cause interruptions and does not fix the memory issue.

634

MCQhard

A large e-commerce company is using Amazon DynamoDB as the source for real-time analytics. The data is streamed to Amazon Kinesis Data Streams using DynamoDB Streams and then processed by an AWS Lambda function. The Lambda function writes the data to an Amazon Elasticsearch Service cluster for search and visualization. Recently, the Lambda function has been failing with throttling errors from the Elasticsearch cluster. What is the MOST effective way to handle this?

A.Increase the Lambda function's reserved concurrency to handle more invocations.

B.Increase the number of shards in the Kinesis data stream.

C.Decrease the Kinesis stream's retention period to reduce the data volume.

D.Configure a Dead Letter Queue (DLQ) on the Lambda function to capture failed records and implement retry logic.

AnswerD

DLQ captures records that fail due to throttling, allowing later reprocessing without blocking the stream.

Why this answer

Option D is correct: using a Dead Letter Queue (DLQ) for failed records allows the Lambda function to continue processing without blocking, and the failed records can be reprocessed later. Option A is wrong because increasing Lambda concurrency would exacerbate the throttling. Option B is wrong because increasing shards increases throughput but doesn't address Elasticsearch throttling. Option C is wrong because the retention period only controls how long records stay in the stream; shortening it does not reduce the write rate hitting Elasticsearch.
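The retry-then-DLQ pattern can be sketched as follows. This is a minimal illustration, not the Lambda or Kinesis API: `ThrottledError`, `write_with_retry`, and `process_batch` are hypothetical names, and a plain list stands in for the real dead letter queue.

```python
import time


class ThrottledError(Exception):
    """Hypothetical error raised when the downstream cluster throttles a write."""


def write_with_retry(write, record, max_attempts=4, base_delay=0.1, sleep=time.sleep):
    """Try write(record) with exponential backoff; return True on success."""
    for attempt in range(max_attempts):
        try:
            write(record)
            return True
        except ThrottledError:
            if attempt < max_attempts - 1:
                sleep(base_delay * 2 ** attempt)  # 0.1 s, 0.2 s, 0.4 s, ...
    return False


def process_batch(records, write, dead_letters, **retry_options):
    """Write each record; records that exhaust their retries go to dead_letters."""
    for record in records:
        if not write_with_retry(write, record, **retry_options):
            dead_letters.append(record)
```

Records that land in `dead_letters` can be replayed later, so one throttled write does not block the rest of the batch.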

635

MCQeasy

A data scientist wants to deploy a PyTorch model for real-time inference with low latency. Which AWS service should they use?

A.Amazon Elastic Container Service (ECS)

B.Amazon SageMaker batch transform

C.Amazon SageMaker real-time endpoint

D.AWS Lambda

AnswerC

Designed for low-latency inference.

Why this answer

Amazon SageMaker real-time endpoints provide low-latency inference for custom models. Option B is wrong because SageMaker batch transform is for offline processing. Option D is wrong because AWS Lambda has a 15-minute timeout and is not optimized for large model inference. Option A is wrong because Amazon ECS requires more operational overhead.

636

MCQhard

A data scientist is troubleshooting a failed SageMaker training job that uses a custom Docker image. The failure reason shows 'unrecognized arguments: --sagemaker_program'. What is the most likely cause?

A.The Docker image is tagged incorrectly and cannot be pulled

B.The training job is in a different region than the ECR repository

C.The input mode is File mode, but the container expects Pipe mode

D.The custom Docker image does not use the SageMaker training toolkit and thus does not accept SageMaker hyperparameters

AnswerD

Custom containers that are not toolkit-based ignore SageMaker hyperparameters, causing unrecognized argument errors if the entry point tries to parse them.

Why this answer

The error 'unrecognized arguments: --sagemaker_program' indicates that the custom Docker image does not include the SageMaker Training Toolkit. The SageMaker Training Toolkit is a Python library that provides a default entry point to parse and handle SageMaker-specific hyperparameters (like --sagemaker_program, --sagemaker_submit_directory, etc.). Without this toolkit, the container's entry point does not recognize these arguments, causing the training job to fail.

Exam trap

The trap here is that candidates often confuse container-level errors (like pull failures or region mismatches) with argument parsing errors, failing to recognize that the SageMaker Training Toolkit is required to handle SageMaker-specific CLI arguments.

How to eliminate wrong answers

Option A is wrong because if the Docker image were tagged incorrectly or could not be pulled, the error would be an ECR pull failure (e.g., 'CannotPullContainerError' or 'RepositoryNotFoundException'), not an argument parsing error. Option B is wrong because a region mismatch between the training job and the ECR repository would result in a 'RepositoryNotFoundException' or access denied error, not an unrecognized argument error. Option C is wrong because the input mode (File vs. Pipe) affects how data is ingested (files under /opt/ml/input/data/ versus a named pipe), but it does not affect the parsing of command-line hyperparameters like --sagemaker_program.
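To see where an 'unrecognized arguments' message comes from, the sketch below uses Python's standard `argparse`. It is an illustration of the failure mode only, not how the SageMaker Training Toolkit is implemented; the `--epochs` and `--learning-rate` hyperparameters and the function name `parse_hyperparameters` are assumptions for the example.

```python
import argparse


def parse_hyperparameters(argv):
    """Parse the hyperparameters this script knows and collect the rest."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--epochs", type=int, default=1)
    parser.add_argument("--learning-rate", type=float, default=0.01)
    # parse_args() would exit with "unrecognized arguments: --sagemaker_program ..."
    # parse_known_args() returns the unknown arguments instead of failing.
    known, unknown = parser.parse_known_args(argv)
    return known, unknown


args, extras = parse_hyperparameters(
    ["--epochs", "5", "--sagemaker_program", "train.py"]
)
```

An entry point written with strict `parse_args()` fails as soon as it receives an argument it did not declare, which matches the failure reason quoted in the question.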

637

Multi-Selecthard

A company is deploying a machine learning model on SageMaker for real-time inference. The model requires GPU for low latency. Which THREE steps are necessary to set up the endpoint?

Select 3 answers

A.Train the model using a SageMaker training job

B.Create a SageMaker batch transform job

C.Create a SageMaker model object that points to the S3 bucket containing the model artifacts and the inference container image

D.Create an endpoint configuration specifying the instance type (e.g., ml.p3.2xlarge) and initial instance count

E.Create a SageMaker endpoint using the endpoint configuration

AnswersC, D, E

A model object is required to deploy an endpoint.

Why this answer

Correct options: C (create a SageMaker model that points to the model artifacts and the inference container image), D (create an endpoint configuration with a production variant specifying the instance type and initial instance count), and E (create an endpoint using the endpoint configuration). A is not necessary because the model is already trained. B is not necessary for real-time inference.
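The three steps can be sketched against the boto3 SageMaker client, using its `create_model`, `create_endpoint_config`, and `create_endpoint` operations. The wrapper name `deploy_realtime_endpoint`, the variant name `AllTraffic`, and the `-config` suffix are assumptions for this example; the client is passed in so the sketch itself makes no AWS calls.

```python
def deploy_realtime_endpoint(
    sm,
    name,
    image_uri,
    model_data_url,
    role_arn,
    instance_type="ml.p3.2xlarge",
    instance_count=1,
):
    """Create the model, endpoint configuration, and endpoint; return the config name."""
    # Step 1: model object pointing at the container image and the S3 artifacts.
    sm.create_model(
        ModelName=name,
        PrimaryContainer={"Image": image_uri, "ModelDataUrl": model_data_url},
        ExecutionRoleArn=role_arn,
    )
    # Step 2: endpoint configuration with one production variant on a GPU instance.
    config_name = f"{name}-config"
    sm.create_endpoint_config(
        EndpointConfigName=config_name,
        ProductionVariants=[
            {
                "VariantName": "AllTraffic",
                "ModelName": name,
                "InstanceType": instance_type,
                "InitialInstanceCount": instance_count,
            }
        ],
    )
    # Step 3: the endpoint itself, built from the configuration.
    sm.create_endpoint(EndpointName=name, EndpointConfigName=config_name)
    return config_name
```

With real credentials, `sm` would typically be `boto3.client("sagemaker")`; endpoint creation is asynchronous, so the endpoint is not ready to serve traffic the moment the call returns.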

638

MCQmedium

A company is using Amazon SageMaker to train a deep learning model on a large dataset stored in S3. The training job is failing with an OutOfMemory error. The data scientist wants to minimize cost while resolving the issue. Which action should the data scientist take?

A.Increase the instance type to one with more memory.

B.Use the 'auto' setting for the input mode.

C.Reduce the batch size hyperparameter.

D.Change the input mode from 'File' to 'Pipe'.

AnswerD

Pipe mode streams data, reducing memory footprint.

Why this answer

The OutOfMemory error occurs because the 'File' input mode downloads the entire training dataset to the instance's local storage before training begins, consuming significant memory. Switching to 'Pipe' mode streams data directly from S3 to the training algorithm, reducing memory footprint and avoiding the need for larger instances. This minimizes cost by using the existing instance type while resolving the memory issue.

Exam trap

The trap here is that candidates may assume reducing the batch size (Option C) is the standard fix for memory issues, but they overlook that the 'File' input mode's full dataset download is the primary cause, and 'Pipe' mode directly addresses this without additional cost.

How to eliminate wrong answers

Option A is wrong because increasing the instance type to one with more memory would resolve the error but at a higher cost, contradicting the goal to minimize cost. Option B is wrong because SageMaker has no 'auto' input mode; the valid input modes are 'File', 'FastFile', and 'Pipe'. Option C is wrong because reducing the batch size hyperparameter may reduce memory usage per step but does not address the root cause of the dataset being fully loaded in 'File' mode, and it could negatively impact model convergence or training time.

639

MCQmedium

Refer to the exhibit. An administrator has attached this IAM policy to a user. The user tries to start a SageMaker training job that uses a custom Docker image from Amazon ECR. The training job fails with an access denied error. What is the MOST likely reason?

A.The s3:* action is too permissive and should be scoped.

B.The policy is missing ecr:GetDownloadUrlForLayer and ecr:BatchGetImage.

C.The iam:PassRole permission is missing the SageMaker service principal.

D.The sagemaker:* action should be restricted to specific resources.

AnswerB

ECR permissions are required to pull the custom image.

Why this answer

Option B is correct: the policy grants s3:* on the bucket but does not include the ECR permissions needed to pull the Docker image. Option A is wrong because s3:* is too broad but is not what causes the failure. Option C is wrong because iam:PassRole on the role is allowed. Option D is wrong because restricting sagemaker:* to specific resources would not fix the access denied error.

640

MCQhard

Refer to the exhibit. A data scientist is trying to run a SageMaker training job using a script that reads training data from 's3://my-bucket/training/data.csv'. The job fails with an access denied error. What is the MOST likely reason?

A.The S3 bucket policy may deny access, or the IAM role lacks necessary permissions beyond GetObject.

B.The training job is running in a VPC without S3 VPC endpoint.

C.The sagemaker:CreateTrainingJob action is not allowed on the specific resource.

D.The S3 path is incorrectly formatted.

AnswerA

Common reason: the training script may need to list the bucket or access other prefixes, or a bucket policy denies the request.

Why this answer

Option A is correct. The IAM policy allows s3:GetObject only on objects under 'training/', but the script may also need to read other objects or the bucket itself. Option C is wrong because the action is allowed. Option D is wrong because the S3 URI is valid. Option B is wrong because no VPC is mentioned.

641

Multi-Selecthard

Which THREE of the following are appropriate methods to reduce overfitting in a decision tree model?

Select 3 answers

A.Increase the number of features considered for each split

B.Increase the maximum depth of the tree

C.Prune the tree after training

D.Set a minimum number of samples required to split an internal node

E.Limit the maximum depth of the tree

AnswersC, D, E

Pruning removes branches that have little predictive power, reducing overfitting.

Why this answer

Option C is correct because pruning reduces tree size. Option E is correct because limiting depth reduces complexity. Option D is correct because requiring a minimum number of samples to split a node stops the tree from fitting tiny groups of samples. Option B is wrong because increasing depth increases overfitting. Option A is wrong because increasing the number of features considered per split may increase overfitting.

642

MCQeasy

Refer to the exhibit. What is the recall of the model?

A.0.85

B.0.80

C.0.89

D.0.90

AnswerB

Recall = 80/(80+20) = 0.80.

Why this answer

Recall is calculated as True Positives divided by the sum of True Positives and False Negatives. From the confusion matrix, True Positives = 80 and False Negatives = 20, so recall = 80 / (80 + 20) = 0.80. Option B is correct.

Exam trap

The exam often tests the distinction between recall and precision, where candidates mistakenly compute precision (TP/(TP+FP)) instead of recall, leading to option A (0.85).

How to eliminate wrong answers

Option A (0.85) is wrong because it divides True Positives by the sum of True Positives and False Positives (80/94 ≈ 0.85), which is precision, not recall. Option C (0.89) is wrong because it likely results from dividing True Positives by an incorrect denominator (80/90 ≈ 0.89), which is not the recall formula. Option D (0.90) is wrong because it does not follow from the recall formula applied to these counts; it most likely comes from a miscalculation such as using True Negatives incorrectly.
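The arithmetic can be checked with a short sketch. TP = 80 and FN = 20 come from the explanation above; FP = 14 is only inferred from the 80/94 figure quoted for option A, since the exhibit is not reproduced here.

```python
def recall(tp, fn):
    """Share of actual positives that the model found: TP / (TP + FN)."""
    return tp / (tp + fn)


def precision(tp, fp):
    """Share of predicted positives that were correct: TP / (TP + FP)."""
    return tp / (tp + fp)


tp, fn, fp = 80, 20, 14  # fp is inferred from the 80/94 figure, not from the exhibit

model_recall = recall(tp, fn)        # 80 / 100
model_precision = precision(tp, fp)  # 80 / 94
```

The two formulas differ only in the denominator, which is why swapping FN for FP produces the distractor value.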

643

MCQeasy

During exploratory data analysis, a machine learning engineer finds that a dataset has a significant number of missing values in a categorical feature with 10 levels. Which approach should they take to handle these missing values before modeling?

A.Impute missing values with the mean of the feature.

B.Create a new category labeled 'Missing' for missing values.

C.Drop all rows with missing values.

D.Impute missing values with the mode of the feature.

AnswerB

Preserves the missingness pattern and avoids bias.

Why this answer

Option B is correct because creating a separate 'Missing' category preserves the missingness pattern and avoids data loss or bias from imputation for categorical features. Option C is incorrect because dropping rows with missing values may discard valuable data. Option A is incorrect because mean imputation is for numerical features, not categorical. Option D is incorrect because mode imputation may introduce bias if missingness is not random.

644

MCQmedium

A data scientist is building a recommendation system for an e-commerce platform. The dataset includes user-item interactions (clicks, purchases, ratings). The scientist wants to use matrix factorization. Which approach is most appropriate for handling implicit feedback (e.g., clicks) rather than explicit ratings?

A.Use k-means clustering to segment users and then use item popularity within clusters

B.Use singular value decomposition (SVD) on the interaction matrix with missing values filled with 0

C.Use a deep neural network with a softmax output to predict item probabilities

D.Use weighted alternating least squares (WALS) with confidence weights

AnswerD

WALS is specifically designed for implicit feedback by assigning confidence to observed and unobserved interactions.

Why this answer

Weighted Alternating Least Squares (WALS) is specifically designed for implicit feedback scenarios because it treats unobserved interactions as negative signals with low confidence, rather than missing values. By assigning confidence weights (e.g., based on click frequency or dwell time), WALS can factorize the implicit feedback matrix effectively, avoiding the bias introduced by treating all zeros as true negatives.

Exam trap

The trap here is that candidates often assume SVD (Option B) is the standard matrix factorization method, but they overlook that SVD requires a complete matrix, so every filled-in zero is treated as a fully observed negative with the same weight as a real interaction, which is invalid for implicit feedback where a zero may simply mean the user never saw the item.

How to eliminate wrong answers

Option A is wrong because k-means clustering followed by item popularity ignores the collaborative signal between users and items, and does not learn latent factors that capture nuanced preferences. Option B is wrong because SVD requires a dense matrix and assumes missing values are zero, which is inappropriate for implicit feedback where zeros can mean either no interaction or negative preference, leading to poor factorization. Option C is wrong because while a deep neural network with softmax can predict item probabilities, it is not the most appropriate or efficient approach for implicit feedback matrix factorization; WALS is a simpler, proven method that directly handles the implicit feedback structure without overfitting or requiring extensive hyperparameter tuning.
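The role of confidence weights can be illustrated with the objective that a weighted factorization minimizes. The sketch below uses one common weighting scheme, confidence = 1 + alpha × interaction count, which is an assumption here rather than something stated in the question; regularization and the alternating solver are left out, and the function names are hypothetical.

```python
def confidence(interactions, alpha=40.0):
    """Confidence weight: 1 for unobserved cells, larger for frequent interactions."""
    return 1.0 + alpha * interactions


def weighted_loss(R, U, V, alpha=40.0):
    """Confidence-weighted squared error between binary preferences and U·V."""
    loss = 0.0
    for u, row in enumerate(R):
        for i, r in enumerate(row):
            preference = 1.0 if r > 0 else 0.0
            prediction = sum(a * b for a, b in zip(U[u], V[i]))
            loss += confidence(r, alpha) * (preference - prediction) ** 2
    return loss


# One user, two items: 3 clicks on item 0, none on item 1 (made-up data).
R = [[3, 0]]
V = [[1.0], [1.0]]
loss_when_predicting_one = weighted_loss(R, [[1.0]], V)
loss_when_predicting_zero = weighted_loss(R, [[0.0]], V)
```

Missing the clicked item costs far more than wrongly scoring the unclicked one, which is how the weighting keeps unobserved cells from dominating the fit.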

645

MCQhard

A data scientist is using Amazon SageMaker Autopilot to automatically build a model for a regression problem. The dataset has 100 features and 50,000 rows. Autopilot recommends a model with an R² of 0.85 on the validation set. However, when deployed to production, the model performs poorly (R² of 0.2). What is the most likely cause?

A.The model is overfitting to the training data

B.The production data distribution has shifted from the training data distribution

C.The model is underfitting due to insufficient training

D.Autopilot selected the wrong features

AnswerB

Data drift causes model performance to degrade in production.

Why this answer

Option B is correct because a large discrepancy between validation and production performance often indicates data drift. Option A (overfitting) is possible but less likely given the validation performance. Option C (underfitting) is inconsistent with a validation R² of 0.85. Option D (Autopilot selecting the wrong features) is not the direct cause, since poor feature selection would also have hurt the validation score.

646

MCQeasy

A company uses SageMaker to deploy a model for predicting customer churn. The model was trained on historical data and achieves 85% accuracy on the test set. After deployment, the model's predictions are significantly worse on new data due to changes in customer behavior. What is the MOST likely cause?

A.Data leakage during training

B.The training dataset was too small

C.Concept drift in the underlying data distribution

D.The model is overfitting to the training data

AnswerC

Changes in customer behavior cause concept drift, reducing model accuracy over time.

Why this answer

The model's performance degradation on new data, despite high accuracy on the test set, is a classic symptom of concept drift. Concept drift occurs when the statistical properties of the target variable (customer churn) change over time due to shifts in customer behavior, making the trained model's decision boundary obsolete. SageMaker deployed the model as a persistent endpoint, but the underlying data distribution has evolved, so the model no longer generalizes to the current environment.

Exam trap

The trap here is that candidates confuse concept drift with overfitting, assuming any performance drop after deployment must be due to the model memorizing noise, but the key differentiator is the temporal nature of the degradation tied to changing customer behavior, not a static training-data issue.

How to eliminate wrong answers

Option A is wrong because data leakage would inflate test set accuracy artificially, but the model would fail immediately on new data—not after a period of deployment—and the scenario describes a gradual change in customer behavior, not a training flaw. Option B is wrong because a small training dataset typically causes high bias or variance, leading to poor accuracy on both test and new data, whereas here the model initially achieved 85% accuracy on the test set. Option D is wrong because overfitting would cause poor performance on the test set (not 85% accuracy) and would not explain a delayed degradation tied to changing customer behavior; overfitting is a static issue, not a temporal one.

647

MCQhard

A company is using Amazon Kinesis Data Analytics for Apache Flink to process real-time data. The data source is a Kinesis data stream, and the output is written to an S3 bucket. Recently, the processing latency has increased significantly. The team suspects that the Flink application is encountering backpressure. Which metric should the team monitor to confirm backpressure?

A.currentLowWatermark

B.busyTimeMsPerSecond

C.numberOfFailedCheckpoints

D.numRecordsInPerSecond

AnswerB

High busy time indicates operator is overloaded, causing backpressure.

Why this answer

Option B is correct: backpressure in Flink is indicated by a high 'busyTimeMsPerSecond' metric, which shows the time the operator is busy processing. Option D is wrong because 'numRecordsInPerSecond' measures throughput, not backpressure. Option A is wrong because 'currentLowWatermark' tracks event time progress. Option C is wrong because 'numberOfFailedCheckpoints' indicates checkpoint failures, not backpressure directly.

648

MCQmedium

The Glue job my-glue-job fails after a few successful runs. The error log shows 'Job run exceeds max concurrent runs limit'. The CloudFormation template is shown in the exhibit. What change should be made to allow multiple runs to execute concurrently?

A.Change the IAM role to one with more permissions

B.Increase the MaxRetries property to 3

C.Remove the --job-bookmark-option argument

D.Set the MaxConcurrentRuns property to 3

AnswerD

This allows up to 3 concurrent job runs.

Why this answer

The 'MaxConcurrentRuns' property is set to 1, which prevents parallel executions. Setting it to a higher value (e.g., 3) allows concurrent runs. MaxRetries controls the retry count, not concurrency. The IAM role and the --job-bookmark-option argument do not affect concurrency.

649

MCQeasy

A data scientist runs the AWS CLI command shown in the exhibit. The output shows that job-2 failed. Which action should the data scientist take to diagnose the failure?

A.Check the CloudWatch Logs log group for job-2

B.Check the S3 bucket for any error logs uploaded by the training job

C.Run `aws sagemaker list-training-jobs --name-contains job-2` to get more details

D.Run `aws sagemaker describe-training-job --training-job-name job-2` to see the failure reason

AnswerD

DescribeTrainingJob includes a FailureReason field.

Why this answer

Option D is correct because the DescribeTrainingJob API provides detailed status and the failure reason. Option C (list-training-jobs) has already been run and does not add detail. Option A (CloudWatch Logs) is useful, but the first step is to get the failure reason from DescribeTrainingJob. Option B (S3 error logs) is not standard.

650

MCQmedium

A data science team uses Amazon SageMaker to train models on a large dataset stored in S3. The dataset is 500 GB in CSV format and is updated daily. The team wants to optimize data loading for training jobs to reduce I/O wait time. Which data ingestion strategy is MOST effective?

A.Use SageMaker File input mode and increase the EBS volume size to 1 TB.

B.Use SageMaker Pipe input mode to stream data directly from S3.

C.Convert the CSV files to Parquet format and use File input mode.

D.Load the data into an Amazon EFS file system and mount it to the training instance.

AnswerB

Pipe mode streams data on-the-fly, eliminating the need to download the full dataset, thus reducing I/O wait time.

Why this answer

Option B is correct because SageMaker Pipe input mode streams data directly from S3 to the training algorithm without writing to the instance's EBS volume, eliminating disk I/O bottlenecks. This is especially effective for large datasets (500 GB) that are updated daily, as it reduces startup time and avoids the need to download the entire dataset before training begins.

Exam trap

The trap here is that candidates often assume converting to a columnar format like Parquet always improves performance, but they overlook that File input mode still requires a full download to disk, whereas Pipe mode avoids that entirely regardless of file format.

How to eliminate wrong answers

Option A is wrong because increasing the EBS volume size to 1 TB does not reduce I/O wait time; it only provides more storage space, and the data must still be downloaded from S3 to the EBS volume before training, which adds latency. Option C is wrong because while converting CSV to Parquet can improve read performance and reduce data size, using File input mode still requires the entire dataset to be downloaded to the instance's EBS volume before training starts, negating the benefit of reduced I/O wait time. Option D is wrong because mounting an Amazon EFS file system to the training instance introduces network file system latency and is not optimized for the high-throughput, low-latency data loading required for training jobs; SageMaker's built-in Pipe mode is designed specifically for this purpose.

651

MCQhard

A machine learning team is building a recommendation system for an e-commerce platform. They have user-item interaction data (clicks, purchases). They need to choose an algorithm that can capture both user and item latent factors and handle missing data. Which algorithm should they use?

A.Linear regression

B.Principal component analysis (PCA)

C.Matrix factorization

D.Convolutional neural network (CNN)

AnswerC

Matrix factorization learns latent factors and handles missing data.

Why this answer

Matrix factorization is the correct choice because it decomposes the user-item interaction matrix into lower-dimensional latent factors for users and items, capturing underlying patterns in preferences. It naturally handles missing data by learning from observed interactions only, making it ideal for recommendation systems with sparse data.

Exam trap

The trap here is that candidates may choose PCA because it also performs dimensionality reduction, but PCA cannot handle missing data or model user-item interactions for collaborative filtering, which is the core requirement of the question.

How to eliminate wrong answers

Option A is wrong because linear regression models a continuous target variable from features but cannot capture latent factors or handle missing data in a user-item matrix. Option B is wrong because PCA is an unsupervised dimensionality reduction technique that does not model user-item interactions or handle missing data; it requires a complete matrix and ignores the collaborative filtering structure. Option D is wrong because CNNs are designed for spatial data like images and are not suited for collaborative filtering or latent factor extraction from sparse interaction matrices.

652

MCQhard

A data scientist is training a deep learning model on SageMaker using a custom Docker container. The training job fails with an error indicating that the container exited with a non-zero status. The CloudWatch logs show 'FileNotFoundError: [Errno 2] No such file or directory: '/opt/ml/input/data/training/data.csv''. What is the most likely cause?

A.The container's Docker entry point is misconfigured.

B.The S3 data source path is incorrect or the data has not been uploaded.

C.The training script has a syntax error.

D.The model output path in the training job configuration is wrong.

AnswerB

Missing training data leads to FileNotFoundError.

Why this answer

Option B is correct because the error indicates the training data is missing at the expected path, which typically occurs when the S3 data source configuration is incorrect or the data is not uploaded. Option A is wrong because the error is about missing data, not the container entry point. Option C is wrong because the error is not related to training script syntax. Option D is wrong because the error does not mention model artifacts.

653

MCQeasy

A company uses AWS Lambda to process events from Amazon S3. The Lambda function transforms the data and writes results to another S3 bucket. Recently, the function has been failing due to timeout errors when processing large files. Which solution should the data engineer implement?

A.Increase the Lambda function memory and timeout limit

B.Increase the Lambda timeout to 15 minutes

C.Use S3 Batch Operations with a Lambda function to process objects

D.Use Amazon SQS to queue the events and process them in batches

AnswerC

Batch Operations can invoke Lambda for each object, handling large volumes.

Why this answer

Option C is correct because S3 Batch Operations with a Lambda function processes large numbers of objects asynchronously, avoiding Lambda timeouts. Option A is wrong because increasing memory also increases CPU but may not resolve timeouts for large files. Option D is wrong because SQS does not help with processing large files.

Option B is wrong because increasing the timeout may not be sufficient and is not best practice for large files.

Full explanation →

654

MCQeasy

A data scientist is training a binary classification model on imbalanced data (95% negative, 5% positive). The model achieves 95% accuracy but only 10% recall on the positive class. Which metric should be used to evaluate model performance?

A.F1 score

B.Accuracy

C.Recall

D.Precision

AnswerA

F1 score balances precision and recall, suitable for imbalanced data.

Why this answer

Option A is correct because accuracy is misleading on imbalanced data, while the F1 score balances precision and recall. Option B is wrong because accuracy is high but not informative here.

Option D is wrong because precision alone ignores recall. Option C is wrong because recall alone ignores precision.
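As a rough worked example of why accuracy misleads here, the sketch below plugs in hypothetical confusion-matrix counts (1,000 samples, chosen only to match the 95% accuracy and 10% recall in the question) and computes precision, recall, and F1. The counts and the function name are illustrative assumptions, not part of the question.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical counts: 1,000 samples, 50 positive (5%), 950 negative (95%).
tp, fn = 5, 45    # 5 of 50 positives found -> 10% recall
tn, fp = 945, 5   # chosen so that overall accuracy is 95%

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision, recall, f1 = precision_recall_f1(tp, fp, fn)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```

With these assumed counts, accuracy stays at 0.95 while F1 is roughly 0.17, which reflects the poor performance on the positive class.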

Full explanation →

655

MCQhard

A company needs to process sensitive data from multiple sources. They want to use AWS Glue to catalog and transform the data. Which feature should they use to ensure that sensitive columns are masked before the data is available for querying?

A.AWS Glue DataBrew

B.AWS Glue Studio

C.AWS Lake Formation

D.Amazon Macie

AnswerA

DataBrew allows data masking and cleansing interactively.

Why this answer

Glue DataBrew provides data masking and cleansing capabilities. Glue Studio is for building ETL jobs, but masking requires custom code. Lake Formation is for fine-grained access control, not masking.

Macie is for discovering sensitive data, not masking.

Full explanation →

656

MCQeasy

A data scientist needs to train a machine learning model using a large dataset (500 GB) stored in an S3 bucket. The training will be performed on a SageMaker notebook instance. The data scientist wants to minimize data transfer costs and reduce training time. Which data ingestion approach should the data engineer recommend?

A.Use the SageMaker SDK to directly read the data from S3 during training without copying it to the notebook.

B.Copy the dataset to the notebook instance's attached EBS volume before training.

C.Load the dataset into an Amazon RDS database and query it from the notebook.

D.Mount the S3 bucket to the notebook instance using Amazon Elastic File System (EFS).

AnswerA

SageMaker can read data directly from S3, minimizing transfer and storage costs.

Why this answer

Option A is correct because using S3 directly as the data source for SageMaker avoids copying data onto the notebook instance, reducing transfer costs and time. Option B (copying to the notebook's EBS volume) incurs transfer costs and uses local storage. Option D (EFS) adds cost and complexity.

Option C (RDS) is for structured data and incurs additional costs.

Full explanation →

657

MCQeasy

A data scientist is analyzing a dataset of online retail transactions. The dataset contains 500,000 rows and 10 columns: 'TransactionID', 'CustomerID', 'ProductID', 'Quantity', 'UnitPrice', 'TransactionDate', 'PaymentMethod', 'ShippingAddress', 'Country', and 'TotalAmount'. The data scientist loads the data into a SageMaker notebook and performs initial EDA. The data scientist finds that 'UnitPrice' has a range from $0.01 to $10,000, with a mean of $50 and a median of $20. 'Quantity' ranges from -10 to 100, with negative values indicating returns. 'TotalAmount' is calculated as Quantity * UnitPrice. The data scientist also notices that 2% of the 'CustomerID' values are missing, and 1% of 'ProductID' values are missing. There are no missing values in other columns. The data scientist wants to clean the data and prepare it for customer segmentation. Which course of action is most appropriate?

A.Impute missing 'CustomerID' with the mean of 'CustomerID' and missing 'ProductID' with the mode.

B.Remove all rows with any missing values.

C.Keep negative 'Quantity' and treat them as errors; replace them with the median of positive quantities.

D.Remove rows with negative 'Quantity' to focus on purchases. Impute missing 'CustomerID' and 'ProductID' with a placeholder such as 'Unknown'.

AnswerD

Negative quantities are returns; imputing with 'Unknown' preserves rows.

Why this answer

Option D is correct because negative quantities are returns and should be removed if the goal is to model purchase behavior, and missing CustomerID and ProductID values can be imputed with 'Unknown' to avoid data loss. Option A is wrong because mean imputation is not valid for CustomerID, which is categorical. Option B is wrong because removing all rows with any missing values would discard up to 3% of the data.

Option C is wrong because negative quantities are meaningful as returns, not errors.
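The cleaning steps in the selected answer can be sketched in a few lines. This is a minimal illustration using plain Python dictionaries rather than any particular dataframe library; the function name and the sample rows are hypothetical.

```python
def clean_transactions(rows):
    """Drop returns (negative Quantity) and fill missing IDs with 'Unknown'."""
    cleaned = []
    for row in rows:
        if row["Quantity"] < 0:
            continue  # negative quantity is a return, not a purchase
        fixed = dict(row)  # copy so the input rows are left untouched
        for key in ("CustomerID", "ProductID"):
            if fixed.get(key) is None:
                fixed[key] = "Unknown"
        cleaned.append(fixed)
    return cleaned


# Hypothetical sample rows for illustration only.
sample = [
    {"CustomerID": "C1", "ProductID": "P1", "Quantity": 2},
    {"CustomerID": None, "ProductID": "P2", "Quantity": 1},
    {"CustomerID": "C3", "ProductID": None, "Quantity": 5},
    {"CustomerID": "C4", "ProductID": "P4", "Quantity": -10},
]
cleaned_rows = clean_transactions(sample)
```

Whether 'Unknown' customers should later be excluded from the segmentation itself is a separate modeling decision that this sketch does not address.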

Full explanation →

658

MCQmedium

A company is using Amazon SageMaker to deploy a model for real-time inference. The model receives requests that are small but arrive in bursts. The data scientist wants to minimize latency and cost. Which deployment option is MOST appropriate?

A.Use a real-time endpoint with a single instance

B.Use a multi-model endpoint with auto-scaling

C.Use Amazon SageMaker Serverless Inference

D.Use a batch transform job triggered by a schedule

AnswerC

Serverless scales automatically and you pay only for inference duration.

Why this answer

Amazon SageMaker Serverless Inference is the most appropriate option because it automatically scales compute resources based on request volume, charges only for the compute time used during inference (per-millisecond billing), and has no idle costs. This matches the bursty, small-request pattern perfectly, minimizing both latency and cost without requiring manual instance management.

Exam trap

The trap here is that candidates often confuse 'multi-model endpoints' with 'serverless' and assume auto-scaling eliminates idle costs, but multi-model endpoints still require a minimum number of running instances, incurring continuous charges.

How to eliminate wrong answers

Option A is wrong because a single-instance real-time endpoint incurs continuous hourly costs even when idle, and cannot handle burst traffic without significant latency or throttling. Option B is wrong because a multi-model endpoint with auto-scaling still requires at least one running instance at all times, leading to idle costs and slower scaling compared to serverless. Option D is wrong because batch transform jobs are designed for offline, asynchronous processing of large datasets, not real-time inference, and cannot meet low-latency requirements.

Full explanation →

659

MCQmedium

A team is training a deep learning model on Amazon SageMaker. The training job is slow because the data is stored in S3 as many small files. Which approach is MOST effective to improve training throughput?

A.Use SageMaker Pipe mode for training input

B.Increase the number of ml.c5.xlarge instances

C.Shuffle the S3 objects to randomize order

D.Use Amazon EFS instead of S3

AnswerA

Pipe mode streams data directly, avoiding the need to download all files first, improving throughput.

Why this answer

Using SageMaker Pipe mode streams data directly from S3, reducing startup time. Shuffling files or increasing instance count does not address the small file overhead. Using EFS would introduce latency.

Full explanation →

660

MCQhard

A data engineer created an IAM policy to allow a Glue ETL job to read and write objects to an S3 bucket. The ETL job fails when writing data with the error 'Access Denied'. The job is configured to use SSE-S3 (AES256) encryption. What is the likely issue?

A.The policy grants s3:PutObject on all buckets, not just the specific one.

B.The condition requires objects to be encrypted with SSE-KMS, but the job uses SSE-S3.

C.The policy does not grant s3:PutObject on the bucket itself, which is needed for some write operations.

D.The condition requires objects to use SSE-S3, but the job uses SSE-KMS.

AnswerC

Bucket-level permissions may be required for certain write operations.

Why this answer

The error 'Access Denied' when writing to S3 with SSE-S3 encryption typically occurs because the IAM policy lacks the `s3:PutObject` permission on the bucket resource itself. While the policy may grant `s3:PutObject` on the object ARN (`arn:aws:s3:::bucket/*`), some S3 write operations—especially those involving encryption headers or bucket-level checks—also require the permission on the bucket ARN (`arn:aws:s3:::bucket`). Without this, the request is denied even if the object-level permission exists.

Exam trap

The trap here is that candidates assume `s3:PutObject` on the object ARN is sufficient for all write operations, overlooking that S3 requires the same permission on the bucket ARN for certain encryption-related or bucket-policy-evaluation scenarios.

How to eliminate wrong answers

Option A is wrong because granting `s3:PutObject` on all buckets would be overly permissive, not restrictive; the issue is missing permission on the specific bucket, not an overly broad scope. Option B is wrong because the job uses SSE-S3, and the condition requiring SSE-KMS would cause a different error (e.g., 'The request was denied because the encryption key is not authorized'), not a generic 'Access Denied'. Option D is wrong because the job uses SSE-S3, not SSE-KMS, so a condition requiring SSE-S3 would actually match and not cause a denial.

Full explanation →

661

Multi-Selecthard

A data scientist is developing a deep learning model for object detection using Amazon SageMaker. The training dataset has 50,000 labeled images. The data scientist wants to improve model generalization without collecting more data. Which TWO techniques can be applied? (Choose two.)

Select 2 answers

A.Increase the learning rate to speed up convergence.

B.Increase the number of training epochs to ensure convergence.

C.Apply data augmentation techniques such as random cropping and horizontal flipping.

D.Use transfer learning from a pre-trained model on ImageNet.

E.Increase the batch size to reduce variance.

AnswersC, D

Data augmentation increases data diversity without new data.

Why this answer

Options C and D are correct. Option C: Data augmentation (e.g., random crops, flips) effectively increases dataset diversity. Option D: Using a pre-trained model (transfer learning) improves generalization.

Option E (increasing batch size) may hurt generalization. Option A (increasing learning rate) can cause divergence. Option B (increasing epochs) may lead to overfitting.

Full explanation →

662

MCQhard

A data scientist attempts to create a SageMaker training job using the IAM policy shown in the exhibit. The training job fails with an access denied error. What is the most likely cause?

A.The S3 bucket policy does not grant access to the SageMaker service principal

B.The IAM policy is missing the sagemaker:DescribeTrainingJob permission

C.The IAM policy is missing the s3:ListBucket action

D.The IAM policy is missing the s3:PutObject permission

AnswerC

SageMaker needs ListBucket to read objects from the bucket.

Why this answer

Option C is correct: the policy allows s3:GetObject but not s3:ListBucket, which SageMaker needs in order to list and read the training objects. Option B (missing sagemaker:DescribeTrainingJob) is wrong because that permission is not needed to create a training job. Option A (the bucket policy not granting access to the SageMaker service principal) is less likely, because the gap is in the IAM policy itself.

Option D (missing s3:PutObject) is not required for reading training data.
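The exhibit is not reproduced on this page, so the following is only a sketch of what a policy covering both actions might look like. The bucket name is hypothetical. The point it illustrates is that `s3:ListBucket` applies to the bucket ARN, while `s3:GetObject` applies to the object ARN (`/*`).

```python
import json

BUCKET = "example-training-bucket"  # hypothetical bucket name

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            # Bucket-level action: the resource is the bucket itself.
            "Effect": "Allow",
            "Action": ["s3:ListBucket"],
            "Resource": f"arn:aws:s3:::{BUCKET}",
        },
        {
            # Object-level action: the resource is the objects in the bucket.
            "Effect": "Allow",
            "Action": ["s3:GetObject"],
            "Resource": f"arn:aws:s3:::{BUCKET}/*",
        },
    ],
}

print(json.dumps(policy, indent=2))
```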

Full explanation →

663

Multi-Selectmedium

A data scientist is building a text classification model using a bag-of-words approach with logistic regression. The dataset has 10,000 documents and 50,000 unique tokens. The model overfits. Which TWO techniques can help reduce overfitting?

Select 2 answers

A.Increase the number of n-grams features

B.Use one-hot encoding instead of bag-of-words

C.Use a more complex model such as a neural network

D.Reduce the vocabulary size by removing rare and very frequent terms

E.Apply L2 regularization to the logistic regression model

AnswersD, E

Reducing the number of features reduces model complexity and overfitting.

Why this answer

Option D is correct because removing rare and very frequent terms reduces the feature space and eliminates noise, which helps the logistic regression model generalize better. Rare terms often act as noise that the model can latch onto for spurious correlations, while very frequent terms (like stopwords) provide little discriminative power. This dimensionality reduction directly combats overfitting by simplifying the model. Option E is also correct because L2 regularization penalizes large coefficients, which limits how strongly the model can weight any single token among the 50,000 features.

Exam trap

AWS often tests the misconception that adding more features or using a more complex model always improves performance, when in fact these actions increase overfitting risk in high-dimensional sparse datasets.
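Vocabulary pruning by document frequency can be sketched as follows. The thresholds (`min_df`, `max_df_ratio`), the function name, and the toy documents are assumptions chosen for illustration; real cutoffs depend on the corpus.

```python
from collections import Counter


def prune_vocabulary(docs, min_df=2, max_df_ratio=0.5):
    """Keep tokens that appear in at least `min_df` documents and in at most
    `max_df_ratio` of all documents."""
    doc_freq = Counter()
    for doc in docs:
        # Count each token once per document (document frequency).
        doc_freq.update(set(doc.lower().split()))
    n_docs = len(docs)
    return {
        token
        for token, count in doc_freq.items()
        if count >= min_df and count / n_docs <= max_df_ratio
    }


# Toy corpus for illustration only.
docs = ["the cat sat", "the dog sat", "the bird flew", "the cat flew away"]
vocab = prune_vocabulary(docs)
```

Here "the" is dropped as too frequent, and "dog", "bird", and "away" are dropped as too rare, leaving a smaller feature space for the classifier.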

Full explanation →

664

MCQhard

A machine learning team is deploying a time-series forecasting model using Amazon SageMaker. The model is trained on historical data and needs to be updated daily with new data. The team wants to automate the retraining pipeline and avoid manual intervention. Which approach is the most efficient?

A.Use AWS Step Functions to orchestrate retraining, but require a manual approval step.

B.Use SageMaker training jobs manually triggered by the team each day.

C.Use a cron job on an EC2 instance to run a training script.

D.Use Amazon SageMaker Pipelines with a scheduled Lambda function to trigger retraining daily.

AnswerD

Combines SageMaker Pipelines for automated ML workflows with Lambda for scheduling, providing a fully automated solution.

Why this answer

Option D is correct because Amazon SageMaker Pipelines provides a fully managed, end-to-end orchestration service for building, training, and deploying machine learning models. By combining it with a scheduled AWS Lambda function, the team can automate daily retraining without manual intervention, leveraging SageMaker's native integration for step sequencing, artifact tracking, and model registry updates.

Exam trap

The trap here is that candidates might choose Option C (cron job on EC2) because it seems simpler, but they overlook the operational burden of managing EC2 and the lack of native SageMaker integration for model lineage and automated deployment.

How to eliminate wrong answers

Option A is wrong because requiring a manual approval step contradicts the requirement to avoid manual intervention, making the pipeline not fully automated. Option B is wrong because manually triggering training jobs each day is the opposite of automation and introduces human error and operational overhead. Option C is wrong because using a cron job on an EC2 instance requires managing the instance (patching, scaling, security), and the training script would lack native integration with SageMaker's managed infrastructure, model registry, and pipeline lineage tracking.
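A scheduled Lambda function that starts a pipeline run might look roughly like the sketch below. The pipeline name and display name are hypothetical, and the schedule that invokes the function (for example, a daily rule) is assumed to be configured separately. The optional `sagemaker_client` argument is an addition made here so the handler can be exercised without AWS credentials.

```python
PIPELINE_NAME = "daily-forecast-retraining"  # hypothetical pipeline name


def lambda_handler(event, context, sagemaker_client=None):
    """Start one execution of an existing SageMaker pipeline."""
    if sagemaker_client is None:
        # Imported lazily so the module loads even where boto3 is absent.
        import boto3

        sagemaker_client = boto3.client("sagemaker")

    response = sagemaker_client.start_pipeline_execution(
        PipelineName=PIPELINE_NAME,
        PipelineExecutionDisplayName="scheduled-daily-retrain",
    )
    return {"executionArn": response["PipelineExecutionArn"]}
```

Keeping the handler this thin leaves the training, evaluation, and registration logic inside the pipeline definition, where lineage is tracked.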

Full explanation →

665

Multi-Selectmedium

Which TWO of the following are best practices for training deep learning models on Amazon SageMaker? (Select TWO.)

Select 2 answers

A.Use SageMaker Processing to perform data augmentation before training.

B.Use Pipe input mode to stream data directly from S3 to the algorithm.

C.Store training data on Amazon EBS volumes attached to the training instance.

D.Use managed spot training to reduce costs.

E.Disable checkpointing to improve training speed.

AnswersB, D

Pipe mode reduces startup time and storage.

Why this answer

Option B is correct because SageMaker's Pipe input mode streams training data directly from Amazon S3 to the algorithm without writing it to disk, reducing I/O latency and eliminating the need for large local storage. This is especially beneficial for deep learning models that iterate over large datasets, as it allows training to start faster and avoids the overhead of downloading data to EBS volumes. Option D is also correct because managed spot training runs jobs on spare EC2 capacity at a lower price; combined with checkpointing, an interrupted job can resume rather than restart.

Exam trap

The trap here is that candidates often confuse SageMaker Processing with a general-purpose compute environment for any training task, when in fact it is specifically for data processing jobs, not for augmenting data during model training.

Full explanation →

666

MCQmedium

A machine learning engineer is analyzing a dataset with a mix of categorical and numerical features. The engineer wants to understand the correlation between categorical features and the target variable. Which statistical test is most appropriate for measuring association between a categorical feature and a binary target?

A.Pearson correlation coefficient

B.ANOVA (Analysis of Variance)

C.Chi-squared test of independence

D.Mutual information

AnswerC

Chi-squared test tests association between two categorical variables.

Why this answer

Option C is correct because the Chi-squared test of independence is used to determine if there is a significant association between two categorical variables, which is applicable here. Option A is wrong because Pearson correlation is for continuous variables. Option B is wrong because ANOVA is for comparing means across groups, but assumes continuous target.

Option D is wrong because Mutual Information can be used but is not a statistical test with a p-value.
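To make the test concrete, the sketch below computes the chi-squared statistic for a contingency table by hand. The table values are made up for illustration (rows are two categories of a feature, columns are the two target classes), and the function name is hypothetical.

```python
def chi_squared_statistic(table):
    """Return (statistic, degrees_of_freedom) for a contingency table
    given as a list of rows of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected

    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof


# Hypothetical counts: rows = feature category, columns = target class.
observed = [[30, 10], [20, 40]]
stat, dof = chi_squared_statistic(observed)
print(f"chi2={stat:.3f}, dof={dof}")
```

For one degree of freedom, the critical value at the 0.05 significance level is about 3.84, so a statistic well above that suggests the feature and the target are associated.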

Full explanation →

667

MCQhard

A company is building a recommendation system using Amazon SageMaker's Factorization Machines algorithm. The dataset includes user IDs, item IDs, and ratings. The data is sparse. Which data format should be used for training?

A.CSV format with one row per rating.

B.JSON lines format with nested structures.

C.RecordIO-protobuf format with sparse feature vectors.

D.Parquet format with columns for each feature.

AnswerC

Protobuf with sparse encoding is efficient and recommended.

Why this answer

Option C is correct because Factorization Machines (FM) in SageMaker are optimized for sparse, high-dimensional data. The RecordIO-protobuf format allows you to directly specify sparse feature vectors using integer keys and float values, which avoids the memory overhead of dense representations and enables efficient distributed training. This format is the recommended input for SageMaker's built-in FM algorithm.

Exam trap

The trap here is that candidates assume CSV is always the simplest and most compatible format, overlooking the fact that SageMaker's Factorization Machines specifically require sparse data representation for performance and correctness, making RecordIO-protobuf the only optimal choice among the options.

How to eliminate wrong answers

Option A is wrong because CSV format with one row per rating forces dense representation, which is inefficient for sparse data and does not leverage FM's native support for sparse feature vectors. Option B is wrong because JSON lines format with nested structures is not natively supported by SageMaker's Factorization Machines; the algorithm expects RecordIO-protobuf or CSV with a specific schema, not arbitrary nested JSON. Option D is wrong because Parquet format, while efficient for columnar storage, is not directly supported by SageMaker's FM algorithm and would require conversion to RecordIO-protobuf or CSV for training.

Full explanation →

668

MCQeasy

A data scientist is starting a new machine learning project and needs to understand the dataset. The dataset is stored as CSV files in Amazon S3, with a total size of 50 GB. The data scientist wants to quickly get summary statistics (count, mean, standard deviation, min, max) for each numerical column, and also check for missing values. The data scientist has access to SageMaker Studio. What is the most efficient way to achieve this?

A.Use AWS Glue Crawler to infer schema and then query with Athena.

B.Write a PySpark script in a SageMaker notebook to compute statistics.

C.Load a sample into Amazon QuickSight and use SPICE to compute statistics.

D.Use SageMaker Data Wrangler to import the data and generate a data quality report.

AnswerD

Data Wrangler provides summary statistics and missing value analysis.

Why this answer

Option D is correct because SageMaker Data Wrangler can profile the data without writing code. Option A is wrong because a Glue Crawler infers a schema but does not compute statistics. Option B is wrong because writing a Spark job is overkill.

Option C is wrong because QuickSight requires importing the data.

Full explanation →

669

MCQmedium

A data engineering team is using Apache Spark on Amazon EMR to process streaming data from Amazon Kinesis Data Streams. The Spark application uses structured streaming to read from Kinesis, perform transformations, and write to Amazon S3 in Parquet format. The team notices that the application is falling behind and the processing latency is increasing. The Kinesis stream has 5 shards, and the EMR cluster has 5 core nodes of type r5.xlarge. The Spark application is configured with 5 executors, each with 2 cores and 8 GB memory. The team wants to reduce processing latency. Which change would be most effective?

A.Increase the executor memory to 16 GB.

B.Increase the number of shards in the Kinesis stream to 10 and increase the number of core nodes to 10.

C.Use a larger instance type for the core nodes, such as r5.4xlarge.

D.Change the output format from Parquet to CSV to reduce write time.

AnswerB

More shards increase parallelism, and more nodes allow more concurrent processing.

Why this answer

The number of shards (5) matches the number of executors (5), but each shard can be processed by a single executor. To increase parallelism, the team should increase the number of shards in the Kinesis stream and correspondingly increase the number of executors or cores. Alternatively, they can increase the number of cores per executor to allow parallel processing of multiple shards per executor.

Full explanation →

670

MCQhard

A company uses Amazon SageMaker to train a model for fraud detection. The dataset has 1 million samples with 200 features. The data is highly imbalanced (0.1% fraud). The team wants to use a random forest model. Which technique should they use to handle the class imbalance during training?

A.Synthetic Minority Over-sampling Technique (SMOTE)

B.Use class weights inversely proportional to class frequencies

C.Random undersampling of the majority class

D.Adjust the decision threshold after training

AnswerA

SMOTE generates synthetic samples, effectively balancing the dataset.

Why this answer

Option A is correct because SMOTE generates synthetic samples of the minority class to balance the dataset. Option C is wrong because undersampling discards majority class data, losing information. Option B is wrong because class weights adjust the loss function but are not specific to random forest.

Option D is wrong because threshold tuning is post-training, not during training.
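The core idea behind SMOTE, interpolating between a minority sample and one of its nearest minority neighbors, can be sketched as below. This is a simplified illustration and not the reference implementation; the function names, the neighbor count `k`, and the sample points are assumptions.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def smote_like_samples(minority, n_new, k=2, seed=0):
    """Create `n_new` synthetic points by interpolating between a random
    minority point and one of its `k` nearest minority neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        others = [p for p in minority if p is not base]
        neighbors = sorted(others, key=lambda p: squared_distance(base, p))[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # position along the segment base -> neighbor
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbor))
        )
    return synthetic


# Hypothetical minority-class points with two features each.
minority_points = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.5), (1.2, 2.2)]
new_points = smote_like_samples(minority_points, n_new=6)
```

Synthetic points should be generated from the training split only, so that no interpolated samples leak into validation or test data.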

Full explanation →

671

Multi-Selecthard

A data scientist is using Amazon SageMaker to train a deep learning model for natural language processing. The training job is taking too long to converge. The data scientist wants to speed up training without significantly sacrificing model accuracy. Which THREE strategies should the data scientist consider? (Choose three.)

Select 3 answers

A.Reduce the model size by using fewer layers or smaller hidden dimensions.

B.Increase the learning rate by a factor of 10 to accelerate convergence.

C.Increase the batch size to its maximum possible value to utilize GPU memory fully.

D.Use mixed precision training (FP16) to reduce memory and speed up matrix operations.

E.Use SageMaker's distributed data parallelism across multiple instances.

AnswersA, D, E

Smaller models train faster but may lose some accuracy.

Why this answer

Options A, D, and E are correct. Mixed precision training (D) speeds up computation on GPUs. Reducing model size (A) reduces computations.

Using distributed data parallelism (E) leverages multiple GPUs. Option C (increase batch size) may cause convergence issues. Option B (increase learning rate) can destabilize training.

Full explanation →

672

MCQhard

A data scientist examines a dataset with 100 features and suspects that some features are redundant due to high pairwise correlations. Which EDA technique should the scientist use to systematically identify groups of highly correlated features?

A.Generate a correlation matrix and visualize it as a heatmap.

B.Plot histograms for each feature.

C.Create scatter plots for each pair of features.

D.Use box plots to identify outliers.

AnswerA

Heatmap of correlation matrix quickly reveals high pairwise correlations.

Why this answer

Option A is correct because a correlation matrix heatmap visually identifies high correlations. Option B is wrong because histograms show univariate distributions. Option C is wrong because scatter plots are for pairs, not systematic.

Option D is wrong because box plots show outliers.
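The numbers behind such a heatmap are pairwise Pearson correlations. The sketch below computes them in plain Python and lists the pairs above a cutoff; the 0.8 threshold, the function names, and the toy feature values are assumptions for illustration. A plotting library would normally render the full matrix as the heatmap.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation is undefined for a constant feature
    return cov / sqrt(var_x * var_y)


def highly_correlated_pairs(features, threshold=0.8):
    """Return (name_a, name_b, r) for pairs with |r| above the threshold."""
    pairs = []
    for a, b in combinations(features, 2):
        r = pearson(features[a], features[b])
        if abs(r) > threshold:
            pairs.append((a, b, round(r, 3)))
    return pairs


# Toy data: height in two units is redundant; shoe_code is unrelated noise.
features = {
    "height_cm": [150, 160, 170, 180, 190],
    "height_in": [59.1, 63.0, 66.9, 70.9, 74.8],
    "shoe_code": [3, 1, 4, 1, 5],
}
redundant = highly_correlated_pairs(features)
```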

Full explanation →

673

MCQhard

A company runs a data pipeline using AWS Glue ETL jobs that process about 10 TB of data daily from Amazon S3. The jobs are triggered by a schedule and write results to a separate S3 bucket. Recently, the jobs have been taking longer to complete, and the data engineering team has observed that the number of files in the source bucket has increased significantly, from thousands to millions of small files (each about 100 KB). The Glue jobs are configured to use the 'Group Files' option, but performance is still poor. The team needs to improve the job performance without changing the source data generation process. Which course of action should the team take?

A.Increase the number of DPUs allocated to the existing Glue job

B.Switch the ETL processing to Amazon EMR with Spark

C.Use AWS Lambda to pre-process the files and combine them

D.Create a separate Glue job that runs before the main job to consolidate small files into larger ones in the source bucket

AnswerD

Consolidation reduces the number of files, improving read performance.

Why this answer

Option D is correct because using an AWS Glue job to periodically compact small files into larger files (e.g., 100 MB) before the main ETL job runs reduces the overhead of reading millions of small files. Option A is wrong because increasing DPUs may help but does not address the root cause of many small files. Option B is wrong because Spark on EMR may still suffer from the small-file issue.

Option C is wrong because Lambda has limitations on processing large numbers of files.
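To give a feel for how much compaction reduces the file count, the sketch below greedily groups file sizes into batches near a target size. It only plans the grouping on a list of sizes; it does not read or write S3, and the function name and the 10,000-file sample are assumptions.

```python
def plan_compaction(file_sizes, target_bytes):
    """Greedily group file sizes into batches of at most `target_bytes`
    (a single file larger than the target gets its own batch)."""
    batches, current, current_size = [], [], 0
    for size in file_sizes:
        if current and current_size + size > target_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        batches.append(current)
    return batches


KB, MB = 1024, 1024 * 1024

# Hypothetical sample: 10,000 files of about 100 KB, compacted to ~100 MB.
small_files = [100 * KB] * 10_000
batches = plan_compaction(small_files, target_bytes=100 * MB)
print(f"{len(small_files)} input files -> {len(batches)} output files")
```

In this sample, each output file holds up to 1,024 inputs, so the main job would open roughly a thousand times fewer objects.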

Full explanation →

674

MCQmedium

A company is fine-tuning a BERT model on Amazon SageMaker for a text classification task. The training script uses PyTorch and Hugging Face Transformers. The training job completes successfully, but the final model accuracy is low. The dataset has 10,000 labeled samples. What is the most likely cause and solution?

A.The instance type is insufficient; use a larger instance

B.The model is overfitting due to small dataset; use a pre-trained checkpoint and fine-tune only top layers

C.The learning rate is too high; reduce it

D.The training script has a bug in the data loader

AnswerB

Fine-tuning a pre-trained BERT on a small dataset may overfit; using a pre-trained checkpoint and freezing lower layers helps.

Why this answer

Option B is correct because BERT is large and 10,000 samples may not be enough; using a pre-trained checkpoint and doing transfer learning is standard. Option C (learning rate) is possible but not most likely. Option D (a bug in the data loader) is unlikely.

Option A (instance type) doesn't affect accuracy directly.

Full explanation →

675

MCQhard

During EDA, a data scientist plots the distribution of a feature and sees a bimodal pattern. What does this likely indicate?

A.The data may contain two distinct groups.

B.The feature has missing values.

C.The feature contains outliers.

D.The feature needs to be standardized.

AnswerA

Bimodal suggests mixture of two populations.

Why this answer

Option A is correct because a bimodal distribution often indicates two underlying subpopulations. Option B is wrong because missing values cause spikes, not a bimodal shape. Option C is wrong because outliers cause tails, not two peaks.

Option D is wrong because scaling does not create bimodality.

Full explanation →
