Knowledge + Practice

AWS Certified Machine Learning Specialty (MLS-C01): Questions 526–600

1755 questions total · 24 pages · All types, answers revealed

526

MCQeasy

In exploratory data analysis, a data scientist notices that the distribution of a continuous variable is bimodal. The scientist suspects that the two modes correspond to two different groups in the data. Which visualization is MOST appropriate to confirm this suspicion?

A.Box plot

B.Bar chart

C.Histogram with overlaid densities by group

D.Scatter plot

AnswerC

Overlaying densities by group allows visual comparison of the two modes.

Why this answer

Option C is correct because a histogram with a separate hue (color) and overlaid density for each group shows the group distributions side by side, so each mode can be matched to a group. Option A is wrong because a box plot shows summary statistics but not the shape of the distribution. Option D is wrong because a scatter plot is for two continuous variables.

Option B is wrong because a bar chart is for categorical data.


527

MCQhard

A company is deploying a real-time fraud detection system using a gradient boosting model on AWS SageMaker. The model uses 200 features and is trained on 50 GB of data. The inference latency requirement is under 10 ms per request. During load testing, the endpoint shows average latency of 15 ms. Which change is MOST likely to reduce latency below 10 ms?

A.Switch to a GPU-based instance type

B.Reduce the number of features to the top 50 based on feature importance

C.Increase the number of trees in the model

D.Use a larger batch size for inference

AnswerB

Fewer features reduce inference computation time, directly lowering latency.

Why this answer

Reducing the number of features from 200 to the top 50 directly decreases the amount of data each inference request must process, which lowers both feature engineering overhead and model evaluation time. For gradient boosting models on SageMaker, fewer features mean fewer decision tree splits to traverse per prediction, which can significantly reduce latency without requiring hardware changes. This is the most direct and cost-effective way to meet the 10 ms requirement.

Exam trap

The trap here is that candidates often assume GPU instances universally speed up inference, but for tree-based models like gradient boosting, the bottleneck is sequential tree traversal, not parallel computation, so feature reduction is the correct optimization.

How to eliminate wrong answers

Option A is wrong because switching to a GPU-based instance type does not inherently reduce latency for gradient boosting models; GPUs excel at parallel matrix operations (e.g., deep learning) but offer minimal benefit for tree-based models where inference is sequential and CPU-bound. Option C is wrong because increasing the number of trees in the model increases the ensemble size, requiring more sequential evaluations per prediction, which would increase latency, not reduce it. Option D is wrong because using a larger batch size for inference increases throughput (requests per second) but does not reduce per-request latency; in fact, it can increase latency for individual requests due to queuing and processing delays.


528

Multi-Selecthard

A data scientist is using Amazon SageMaker to train a model using a custom Docker container. The training job fails with an error message indicating that the container exited with a non-zero code. Which THREE steps should the data scientist take to diagnose the issue? (Choose THREE.)

Select 3 answers

A.Retry the training job with the same configuration; the error might be transient.

B.Use the SageMaker Debugger to capture system metrics and output tensors for analysis.

C.Check the CloudWatch Logs for the training job to see the container's stdout and stderr.

D.Increase the number of training instances to distribute the workload.

E.Run the container locally using SageMaker Local Mode to simulate the training environment.

AnswersB, C, E

Debugger can capture detailed metrics that help identify why the container exited.

Why this answer

Options B, C, and E are correct because using SageMaker Debugger, checking the CloudWatch Logs, and reproducing the run locally with SageMaker Local Mode can each help identify the error. Option D is wrong because increasing the instance count does not fix the error. Option A is wrong because a retry with the same configuration may mask the issue.


529

MCQeasy

A data scientist is trying to create a SageMaker training job using an execution role with the attached IAM policy. The training job fails with an access denied error when trying to read training data from the S3 bucket 'my-bucket'. What is the most likely cause?

A.The S3 bucket policy explicitly denies access to the role.

B.The IAM policy does not include s3:ListBucket permission.

C.The S3 bucket is in a different AWS account.

D.The sagemaker:CreateTrainingJob action is not allowed.

AnswerA

Even if IAM allows, bucket policy can deny.

Why this answer

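Option A is correct because an explicit deny in the bucket policy overrides an allow in the role's IAM policy. Option B is wrong because the role has s3:GetObject. Option D is wrong because sagemaker:CreateTrainingJob is allowed.

Option C is wrong because the access it describes is allowed as well.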


530

Multi-Selectmedium

Which THREE of the following are appropriate data visualization techniques for exploring the relationship between two numerical variables?

Select 3 answers

A.Scatter plot

B.Hexbin plot

C.Box plot

D.Bar chart

E.Pair plot

AnswersA, B, E

Scatter plots directly show the relationship between two numerical variables.

Why this answer

Scatter plot, hexbin plot, and pair plot are designed for bivariate numerical relationships. Bar chart is for categorical. Box plot is for numerical vs categorical.


531

MCQeasy

During exploratory data analysis, a data scientist notices that the target variable is highly imbalanced. Which technique should be used to address this issue before training a classification model?

A.Apply PCA to reduce dimensionality

B.Remove outliers from the majority class

C.Use cross-validation to evaluate the model

D.Apply feature scaling to all features

E.Use SMOTE to generate synthetic samples for the minority class

AnswerE

SMOTE is a standard technique for imbalanced classification.

Why this answer

SMOTE (Synthetic Minority Over-sampling Technique) is a popular method for handling imbalanced datasets by generating synthetic samples for the minority class, so option E is correct. Option B is wrong because removing outliers does not address class imbalance. Option D is wrong because feature scaling does not affect imbalance.

Option A is wrong because PCA is for dimensionality reduction, not imbalance. Option C is wrong because cross-validation is a model evaluation technique, not an imbalance solution.


532

MCQhard

A machine learning engineer is deploying a model on Amazon SageMaker that was trained using a custom Docker container. The container is stored in Amazon ECR. The engineer creates a SageMaker model and endpoint configuration, but when creating the endpoint, it fails with an error: 'Could not find the inference code at the expected path.' The engineer verified that the container image is correct and the model artifacts are in S3. What is the most likely cause?

A.The container is not compatible with the SageMaker inference environment.

B.The SageMaker execution role does not have ECR pull permissions.

C.The model artifacts are not in the correct format.

D.The inference code is not placed in the /opt/ml/model directory inside the container.

AnswerD

SageMaker expects code in /opt/ml/model for custom containers.

Why this answer

Option D is correct because SageMaker expects the inference code to be in the /opt/ml/model/ directory. Option B is wrong because missing ECR permissions would cause a different error. Option C is wrong because the model artifacts are separate from the inference code.

Option A is wrong because the engineer has already verified that the container image is correct.


533

MCQhard

A data engineer is preparing a dataset for training a binary classification model. The target variable is highly imbalanced (95% negative, 5% positive). The engineer needs to split the data into training and test sets while maintaining the class distribution in both sets. Which method should the engineer use?

A.Use k-fold cross-validation and then split the data

B.Oversample the minority class first, then do a random split

C.Perform a simple random 80/20 split

D.Use stratified random sampling to split the data

AnswerD

Stratified split preserves class proportions in each subset.

Why this answer

Option D is correct because stratified random sampling ensures the proportion of classes is preserved in both training and test sets. Option C is wrong because simple random sampling may result in an uneven distribution, especially with only 5% positives. Option B is wrong because oversampling should be done after splitting to avoid data leakage.

Option A is wrong because k-fold cross-validation is an evaluation procedure, not a method for producing a single train/test split.
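As a rough illustration of what a stratified split does, the sketch below splits the indices of each class separately and then combines them. The function name `stratified_split` and the toy label list are hypothetical, chosen only to mirror the 95%/5% scenario in the question; in practice a library routine such as scikit-learn's `train_test_split` with its `stratify` argument would normally be used instead.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) with class proportions preserved."""
    rng = random.Random(seed)

    # Group row indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        # Shuffle within the class, then carve off the test share.
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx


# Toy data: 5% positive, 95% negative (illustrative only).
labels = [1] * 50 + [0] * 950
train_idx, test_idx = stratified_split(labels)

train_pos_rate = sum(labels[i] for i in train_idx) / len(train_idx)
test_pos_rate = sum(labels[i] for i in test_idx) / len(test_idx)
print(train_pos_rate, test_pos_rate)
```

Both subsets keep the same positive rate as the full dataset, which a plain random split does not guarantee.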


534

MCQeasy

A data scientist is building a binary classifier and obtains the following confusion matrix on the test set: TP=80, FP=20, TN=70, FN=30. What is the precision?

A.0.727

B.0.8

C.0.75

D.0.762

AnswerB

Precision = TP/(TP+FP) = 80/100 = 0.8.

Why this answer

Precision = TP / (TP+FP) = 80/(80+20)=0.8. Recall = TP/(TP+FN)=80/110≈0.727. Accuracy = (80+70)/200=0.75.

F1 = 2*(0.8*0.727)/(0.8+0.727)≈0.762.
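The arithmetic above can be checked with a few lines of code. This is a minimal sketch; the helper name `classification_metrics` is an assumption introduced here for illustration, and the inputs are the confusion-matrix counts from the question.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute common metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {
        "precision": precision,
        "recall": recall,
        "accuracy": accuracy,
        "f1": f1,
    }


metrics = classification_metrics(tp=80, fp=20, tn=70, fn=30)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Each of the three distractor values in the question corresponds to one of the other metrics, which is why it helps to compute all four.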


535

MCQeasy

An AWS Glue job is failing with an error that it cannot access an S3 bucket. The IAM role attached to the Glue job is shown in the exhibit. What is the MOST likely cause of the failure?

A.The S3 bucket has a bucket policy that denies access to this role

B.The role lacks S3 permissions

C.The role does not have permission to call S3 APIs

D.The trust policy does not allow Glue to assume the role

AnswerA

A bucket policy can override the role's permissions.

Why this answer

Option A is correct because the trust policy already allows the Glue service to assume the role, but if the S3 bucket has a bucket policy that denies access to the role, the Glue job will still fail. Option D is wrong because the trust policy allows Glue. Option B is wrong because the role has S3 full access.

Option C is wrong because the role explicitly allows S3 access.


536

MCQmedium

Refer to the exhibit. A data scientist is unable to read a CSV file from the S3 bucket 'my-bucket' using SageMaker. The IAM policy attached to the SageMaker execution role is shown. What is the most likely cause of the failure?

A.The policy does not allow the s3:GetObject action

B.The policy does not grant read access to the bucket

C.The bucket uses server-side encryption with AWS KMS (SSE-KMS) and the policy lacks kms:Decrypt permission

D.The policy does not include s3:ListBucket action

AnswerC

KMS-encrypted objects require kms:Decrypt permission.

Why this answer

The policy grants s3:GetObject and s3:ListBucket, so the S3 read permissions are in place and the policy itself looks correct. The most plausible remaining cause is that the bucket uses SSE-KMS and the policy does not include kms:Decrypt. Option C is correct because reading objects from a KMS-encrypted bucket requires KMS permissions on the role in addition to the S3 permissions.

Option A is wrong because GetObject is present. Option D is wrong because ListBucket is present. Option B is wrong because the policy allows read access.


537

MCQhard

A company deploys a real-time inference endpoint using Amazon SageMaker with an ML model that has strict latency requirements. The endpoint currently uses a single ml.c5.xlarge instance. During a load test, the p99 latency exceeds the 100ms threshold. The team adds more instances but latency does not improve because the model is heavily CPU-bound. What is the MOST cost-effective change to meet the latency requirement?

A.Change the instance type to a GPU instance such as ml.g4dn.xlarge.

B.Use a multi-model endpoint to serve multiple models on the same instance.

C.Enable automatic scaling based on inference latency.

D.Increase the number of instances and use a target tracking scaling policy.

AnswerA

GPU instances accelerate model inference, reducing per-request latency.

Why this answer

Switching to an instance with GPU acceleration (e.g., ml.g4dn.xlarge) offloads computation to the GPU, which relieves the CPU bottleneck and lowers per-request latency. Adding more instances (D) increases throughput but not per-request latency if the model is CPU-bound. Multi-model endpoints (B) help when serving many models but do not lower single-model latency.

Automatic scaling (C) helps with varying load but does not improve per-request latency.


538

Multi-Selectmedium

A data scientist is performing EDA on a dataset with 500,000 rows and 20 columns. The dataset contains missing values in some columns. Which TWO approaches are appropriate for handling missing data during EDA? (Choose 2)

Select 2 answers

A.Use forward fill to propagate the last observed value

B.Remove all rows with any missing value (listwise deletion)

C.Create an indicator column to flag whether the value was missing, then impute with a placeholder

D.Impute missing values with the mean of each column

E.Impute missing values with the median for numerical columns and mode for categorical columns

AnswersC, E

This retains the information about missingness and is a common practice.

Why this answer

Options C and E are correct because imputation with the median (numerical) or mode (categorical) is robust, and flagging missingness with an indicator variable preserves information. Option B is wrong because listwise deletion can introduce bias and reduce sample size. Option D is wrong because mean imputation is sensitive to outliers.

Option A is wrong because forward fill is for time series, not general EDA.
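The two recommended ideas, robust imputation and a missingness indicator, can be combined in one pass. The sketch below is only illustrative: `impute_with_indicator` is a hypothetical helper, missing values are assumed to be represented as `None`, and the sample columns are made up.

```python
from statistics import median, mode


def impute_with_indicator(values, kind):
    """Fill None entries and return (imputed_values, was_missing_flags).

    kind: "numerical" uses the median, anything else uses the mode.
    """
    observed = [v for v in values if v is not None]
    fill = median(observed) if kind == "numerical" else mode(observed)
    imputed = [fill if v is None else v for v in values]
    was_missing = [v is None for v in values]
    return imputed, was_missing


# Toy columns; the value 1000 is an outlier that would distort a mean.
ages, ages_missing = impute_with_indicator(
    [25, None, 40, 31, None, 1000], "numerical"
)
colors, colors_missing = impute_with_indicator(
    ["red", None, "blue", "red"], "categorical"
)
print(ages, ages_missing)
print(colors, colors_missing)
```

With the outlier present, the median fill stays near the typical values, whereas a mean fill would be pulled far upward.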


539

MCQeasy

A data analyst is performing EDA on a dataset containing timestamps of user logins. They want to understand daily login patterns. The timestamp column is in Unix epoch format (integer). Which of the following is the most appropriate transformation to extract day-of-week patterns?

A.Convert the timestamps to datetime objects and extract the day-of-week.

B.Convert the timestamps to string and split into date and time.

C.Apply min-max scaling to the timestamp values.

D.Bin the timestamps into 1-hour intervals.

AnswerA

This enables grouping by day of the week to analyze patterns.

Why this answer

Option A is correct because converting to datetime allows extracting the day-of-week. Option D is wrong because binning into hours loses day information. Option B is wrong because converting to string does not facilitate analysis.

Option C is wrong because scaling does not help.
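A minimal sketch of the conversion, assuming the epoch values are in seconds and should be interpreted in UTC (the question specifies neither); the sample timestamps are made up for illustration.

```python
from collections import Counter
from datetime import datetime, timezone


def day_of_week(epoch_seconds):
    """Convert a Unix epoch timestamp (seconds, UTC) to a weekday name."""
    moment = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    return moment.strftime("%A")


# Toy login timestamps: 1970-01-01, two on 1970-01-02, and 1970-01-08.
logins = [0, 86_400, 90_000, 604_800]
counts = Counter(day_of_week(ts) for ts in logins)
print(counts)
```

Grouping by the extracted weekday is what makes the daily pattern visible; the raw integers carry the same information but cannot be grouped this way directly.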


540

MCQeasy

A data analyst needs to visualize the distribution of a numerical feature in a dataset. Which AWS service can be used to create a histogram directly from data stored in S3 without writing code?

A.Amazon Athena

B.Amazon SageMaker Studio

C.Amazon QuickSight

D.AWS Glue

AnswerC

QuickSight provides no-code visualizations like histograms.

Why this answer

Option C is correct because Amazon QuickSight is a BI service that can connect to S3 data and create histograms without coding. Option B is wrong because Amazon SageMaker Studio requires code or a notebook. Option A is wrong because Amazon Athena outputs query results, not visualizations directly.

Option D is wrong because AWS Glue is for ETL, not visualization.


541

MCQhard

A data science team at a financial services company is building a fraud detection model using a dataset of credit card transactions. The dataset contains 10 million rows and 20 features, including transaction amount, merchant category, time since last transaction, and customer ID. The target variable 'is_fraud' is highly imbalanced: only 0.1% of transactions are fraudulent. The team is performing exploratory data analysis (EDA) on a sample of 100,000 rows. They compute the correlation matrix and find that 'transaction amount' has a correlation of 0.02 with 'is_fraud'. They also plot the distribution of 'transaction amount' and see that it is heavily right-skewed with a long tail. The team wants to understand the relationship between 'transaction amount' and fraud more deeply before feature engineering. They have access to AWS SageMaker and can run processing jobs. Which course of action is most appropriate?

A.Conclude that 'transaction amount' is not predictive because the correlation is near zero

B.Train a random forest model on the sample and use feature importance to assess the predictive power of 'transaction amount'

C.Create bins for 'transaction amount' (e.g., 0-10, 10-50, 50-100, 100+) and compute the fraud rate per bin to detect any non-linear patterns

D.Apply a log transformation to 'transaction amount' to reduce skewness and re-run the correlation analysis

AnswerC

Binning and examining fraud rates per bin can reveal non-linear relationships.

Why this answer

Option C is correct because binning the transaction amount and computing fraud rates per bin can reveal non-linear relationships that correlation might miss. Option D is wrong because a log transformation followed by another correlation analysis still does not reveal a non-linear relationship with the target. Option A is wrong because a near-zero correlation can mask non-linearity, so it does not show the feature is unpredictive.

Option B is wrong because feature importance from a tree model is more appropriate after feature engineering, not during EDA.
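The binning approach can be sketched in a few lines. The bin edges follow the example ranges given in the question (0-10, 10-50, 50-100, 100+); the function name `fraud_rate_per_bin` and the eight toy transactions are assumptions made up for illustration, not data from the scenario.

```python
from bisect import bisect_right

BIN_EDGES = [10, 50, 100]  # upper edges of the first three bins
BIN_LABELS = ["0-10", "10-50", "50-100", "100+"]


def fraud_rate_per_bin(amounts, is_fraud):
    """Return {bin_label: fraud_rate}, or None for an empty bin."""
    totals = [0] * len(BIN_LABELS)
    frauds = [0] * len(BIN_LABELS)
    for amount, label in zip(amounts, is_fraud):
        bin_index = bisect_right(BIN_EDGES, amount)
        totals[bin_index] += 1
        frauds[bin_index] += label
    return {
        name: (frauds[i] / totals[i] if totals[i] else None)
        for i, name in enumerate(BIN_LABELS)
    }


# Toy sample in which fraud occurs only at very small and very large amounts.
amounts = [5, 8, 20, 30, 75, 80, 150, 900]
is_fraud = [1, 0, 0, 0, 0, 0, 0, 1]
rates = fraud_rate_per_bin(amounts, is_fraud)
print(rates)
```

A U-shaped pattern like this one is exactly the kind of relationship a single linear correlation coefficient can hide.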


542

MCQmedium

A data engineering team is building a real-time fraud detection system. Transactions are ingested via Amazon Kinesis Data Streams, and a machine learning model (deployed on Amazon SageMaker) scores each transaction. The team needs to store the raw transactions and the model's predictions in Amazon S3 for later analysis. Which architecture should the team use?

A.Use AWS Lambda to read from Kinesis, invoke SageMaker, and write directly to S3.

B.Use Amazon Kinesis Data Firehose with a transformation Lambda to call SageMaker.

C.Use Amazon Kinesis Data Analytics for Apache Flink to enrich records with SageMaker predictions, then output to Firehose for S3.

D.Use AWS Lambda to invoke the SageMaker endpoint for each record, then write to S3 via Firehose.

AnswerC

Flink can handle high-throughput, call SageMaker per record, and output to Firehose.

Why this answer

Option C is correct. Use Kinesis Data Analytics with a Flink application to enrich each record with the SageMaker prediction, then output to Kinesis Data Firehose for delivery to S3. Option A is wrong because Lambda cannot directly invoke SageMaker for every record in high-throughput streams due to concurrency limits.

Option B is wrong because Kinesis Data Firehose does not support invoking SageMaker directly. Option D is wrong because Lambda is not suitable for high-frequency real-time scoring.


543

MCQmedium

A data scientist is using Amazon SageMaker to train a model with a custom Docker container. The training job fails with an error: 'Container exited with code 137'. What is the most likely cause?

A.The training data was corrupted.

B.The training job exceeded the maximum runtime.

C.The Docker entrypoint script was not found.

D.The training instance ran out of memory.

AnswerD

Exit code 137 indicates OOM kill.

Why this answer

Exit code 137 (128+9) indicates the container was killed by the SIGKILL signal, which typically occurs when the Linux Out-Of-Memory (OOM) killer terminates a process that has exceeded its memory allocation. In Amazon SageMaker, training instances have finite memory, and if the training algorithm or data loading exceeds that limit, the OOM killer forcibly stops the container, resulting in exit code 137.
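The 128 + signal-number convention can be decoded programmatically. This is a small sketch assuming a POSIX system (signal names differ on Windows); `describe_exit_code` is a hypothetical helper, and it reports only the signal, since the exit code alone cannot prove that the OOM killer sent it.

```python
import signal


def describe_exit_code(code):
    """Translate a container or shell exit code into a short description."""
    if code > 128:
        # Codes above 128 conventionally mean "terminated by signal code - 128".
        return f"killed by {signal.Signals(code - 128).name}"
    return f"exited with status {code}"


print(describe_exit_code(137))
print(describe_exit_code(143))
print(describe_exit_code(1))
```

Confirming memory exhaustion still requires checking the instance's memory metrics or logs for the training job.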

Exam trap

The trap here is that candidates often confuse exit code 137 with a generic 'container error' or 'runtime timeout' (option B), not realizing that 137 specifically signals a SIGKILL from the OOM killer due to memory exhaustion.

How to eliminate wrong answers

Option A is wrong because corrupted training data would typically cause a non-zero exit code like 1 or a Python traceback, not a SIGKILL (137). Option B is wrong because exceeding the maximum runtime results in exit code 143 (SIGTERM) or a timeout error, not 137. Option C is wrong because a missing entrypoint script would cause an immediate container startup failure with exit code 127 (command not found) or 126 (permission denied), not a memory-related kill signal.


544

MCQhard

A company is using Amazon Kinesis Data Analytics for Apache Flink to process real-time sensor data. The application reads from a Kinesis data stream, performs windowed aggregations, and writes results to an S3 bucket. Recently, the application has been experiencing high latency and checkpoint failures. What is the MOST likely cause?

A.The number of shards in the Kinesis stream is insufficient for the data volume

B.The S3 destination bucket is located in a different AWS Region than the Kinesis application

C.The record size in the Kinesis stream exceeds the 1 MB limit

D.The parallelism of the Flink application is set too low for the number of shards

AnswerB

Cross-region writes increase latency and can cause checkpoint timeouts.

Why this answer

Option B is correct. If the S3 bucket is in a different Region, cross-Region data transfer can introduce latency and checkpoint failures. Option D (parallelism) would cause resource issues, not necessarily checkpoint failures.

Option A (shard count) would cause throttling. Option C (record size) does not apply because records are limited to 1 MB.


545

MCQeasy

A data scientist needs to deploy a trained model to Amazon SageMaker for real-time inference. The model is stored as a .tar.gz file in Amazon S3. Which AWS service is used to create a SageMaker endpoint?

A.SageMaker Model and Endpoint Configuration

B.AWS Lambda

C.AWS CloudFormation

D.Amazon ECS

AnswerA

You create a Model, then EndpointConfig, then Endpoint.

Why this answer

Option A is correct because a SageMaker endpoint requires a model and an endpoint configuration. Options B, C, and D are not required for creating the endpoint.


546

MCQmedium

A company is using SageMaker's built-in image classification algorithm to classify product images into 100 categories. The training takes 3 hours on a single p3.2xlarge instance. They need to reduce training time to under 1 hour. They have access to a cluster of 4 p3.2xlarge instances. Which approach should they take?

A.Use SageMaker's hyperparameter tuning to find faster convergence

B.Use a smaller batch size on each instance

C.Use SageMaker's managed spot training with checkpointing

D.Use SageMaker's distributed training with data parallelism using Horovod

AnswerD

Data parallelism across 4 instances can reduce training time nearly linearly.

Why this answer

Distributed training with data parallelism effectively reduces training time.


547

Multi-Selecthard

Which THREE of the following are best practices for optimizing performance of Amazon EMR clusters? (Choose 3)

Select 3 answers

A.Use Spot Instances for task nodes

B.Consolidate small files into larger ones before processing

C.Use instance fleets for heterogeneous instances

D.Enable EBS optimization on EC2 instances

E.Use Spot Instances to reduce costs

AnswersB, C, D

Consolidation reduces overhead and improves performance.

Why this answer

Option B is correct because consolidating small files into larger ones before processing on Amazon EMR reduces the overhead of the Hadoop Distributed File System (HDFS) metadata operations. Each small file consumes a block of memory in the NameNode, and processing many small files leads to excessive task launches and I/O overhead, degrading performance. Using tools like `s3-dist-cp` to combine files into fewer, larger blocks improves throughput and reduces job execution time.

Exam trap

The trap here is that candidates confuse cost optimization strategies (like Spot Instances) with performance optimization, leading them to select options A or E even though the question explicitly asks for performance best practices.


548

MCQhard

A data scientist is performing EDA on a dataset of 1 million images stored in Amazon S3. Each image is 100x100 pixels in RGB format. The data scientist wants to compute the mean pixel value per channel across the entire dataset. Which approach is most efficient?

A.Use Amazon SageMaker Processing with a custom Python script that iterates over S3 objects and aggregates pixel values.

B.Use Amazon Athena with a SQL query on the image metadata stored in a CSV file.

C.Use AWS Glue ETL to read images and compute the mean.

D.Use a SageMaker notebook instance with a large instance type to load all images into memory and compute the mean.

AnswerA

SageMaker Processing can distribute the workload across multiple instances for efficient computation.

Why this answer

Option A is correct because using SageMaker Processing with a custom script can distribute the computation across multiple instances, making it efficient for large datasets. Option D is wrong because loading all images into memory on a single notebook instance is not feasible. Option B is wrong because Athena is designed for structured data, not images.

Option C is wrong because AWS Glue is for ETL on tabular data, not image processing.
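One reason this computation distributes well is that a mean can be built from per-shard sums and counts, so no instance ever needs the whole dataset in memory. The sketch below shows only that aggregation logic: images are represented as plain lists of (R, G, B) tuples, and `partial_sums` and `merge_means` are hypothetical helpers. Reading and decoding the S3 objects inside an actual processing script is out of scope here.

```python
def partial_sums(images):
    """Per-channel pixel sums and pixel count for one shard of images."""
    sums, count = [0, 0, 0], 0
    for image in images:
        for pixel in image:  # pixel is an (r, g, b) tuple
            for channel in range(3):
                sums[channel] += pixel[channel]
            count += 1
    return sums, count


def merge_means(partials):
    """Combine (sums, count) pairs from all shards into per-channel means."""
    total, total_count = [0, 0, 0], 0
    for sums, count in partials:
        for channel in range(3):
            total[channel] += sums[channel]
        total_count += count
    return [t / total_count for t in total]


# Two toy shards, each holding one tiny two-pixel "image".
shard_a = [[(0, 0, 0), (255, 255, 255)]]
shard_b = [[(10, 20, 30), (30, 20, 10)]]
means = merge_means([partial_sums(shard_a), partial_sums(shard_b)])
print(means)
```

Each worker would run the first function over its share of the objects, and a final step would merge the small partial results.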


549

MCQeasy

A data scientist trains a model using Amazon SageMaker's built-in XGBoost algorithm. The model overfits on the training data. Which hyperparameter adjustment is MOST likely to reduce overfitting?

A.Increase the value of the max_depth hyperparameter.

B.Increase the value of the subsample hyperparameter to 1.0.

C.Increase the value of the lambda (L2 regularization) hyperparameter.

D.Increase the value of the num_round hyperparameter.

AnswerC

L2 regularization penalizes large coefficients, reducing model complexity and overfitting.

Why this answer

Option C is correct because increasing the L2 regularization lambda penalizes large weights and reduces overfitting. Option A is wrong because increasing max_depth increases model complexity, worsening overfitting. Option D is wrong because increasing num_round can lead to more overfitting.

Option B is wrong because raising subsample to 1.0 removes the row subsampling that helps regularize the model; lowering subsample, not raising it, is what tends to reduce overfitting.


550

Multi-Selecthard

A company is using Amazon SageMaker to build a custom model. The training job is failing with a 'ResourceLimitExceeded' error. Which TWO actions should the company take to resolve this issue?

Select 2 answers

A.Request a service quota increase for SageMaker training instances.

B.Use spot instances for training.

C.Use Amazon EFS for training data.

D.Reduce the size of the training dataset.

E.Use a smaller instance type.

AnswersA, B

Increases the maximum number of instances.

Why this answer

Option A is correct because it increases the limit. Option B is correct because spot instances can help. Option D is wrong because it doesn't address the limit.

Option C is wrong because it's for storage. Option E is wrong because it doesn't help with compute limits.


551

MCQeasy

A data scientist is training a binary classification model and wants to evaluate its performance using a metric that is robust to class imbalance. Which metric should be used?

A.Mean squared error

B.Area under the ROC curve (AUC)

C.F1 score

D.Accuracy

AnswerC

F1 score balances precision and recall and is robust to class imbalance.

Why this answer

The F1 score is the harmonic mean of precision and recall and is robust to class imbalance because it considers both false positives and false negatives. Accuracy can be misleading with imbalanced classes.


552

MCQhard

An ML engineer runs the AWS CLI command shown in the exhibit on a file in S3. The engineer wants to use this file in a SageMaker training job. What does the output reveal about the data?

A.The file is 5 GB and is stored as a CSV

B.The file is 5 MB and is stored as a CSV

C.The file is in Parquet format with two features and one label

D.The file is versioned and can be accessed by version ID

AnswerA

ContentLength indicates 5 GB, and metadata shows format: csv.

Why this answer

Option A is correct because ContentLength is 5,368,709,120 bytes = 5 GB, and the file is a CSV. Option B is wrong because the file is 5 GB, not 5 MB. Option D is wrong because the output shows VersionId: null, so there is no version ID to access the object by.

Option C is wrong because the metadata shows the file is CSV, not Parquet.
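The size conversion is easy to get wrong by a factor of 1,000, so a quick check helps. This sketch uses binary units (1 GiB = 1024^3 bytes), which is the convention under which 5,368,709,120 bytes comes out to exactly 5; `human_size` is a hypothetical helper.

```python
def human_size(num_bytes):
    """Format a byte count using binary units (KiB, MiB, GiB, TiB)."""
    units = ["B", "KiB", "MiB", "GiB", "TiB"]
    size = float(num_bytes)
    for unit in units:
        # Stop at the first unit where the value fits, or at the largest unit.
        if size < 1024 or unit == units[-1]:
            return f"{size:g} {unit}"
        size /= 1024


content_length = 5_368_709_120  # value quoted in the explanation above
print(human_size(content_length))
print(content_length / 1024**3)
```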


553

MCQhard

A data scientist is analyzing a dataset with 1 million records and 20 features. The target variable is continuous. The scientist wants to identify non-linear relationships between features and the target. Which technique is MOST suitable for this purpose during exploratory data analysis?

A.Visualize the correlation matrix heatmap of all features.

B.Apply Principal Component Analysis (PCA) and examine the loadings.

C.Calculate mutual information scores between each feature and the target.

D.Compute Pearson correlation coefficients between each feature and the target.

AnswerC

Mutual information captures non-linear dependencies.

Why this answer

Option C is correct because mutual information captures any kind of dependency, including non-linear. Option D is wrong because Pearson correlation only measures linear relationships. Option B is wrong because PCA is for dimensionality reduction, not feature-target relationships.

Option A is wrong because a correlation matrix heatmap shows pairwise linear correlations, so it has the same limitation as Pearson correlation with the target.
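A toy example makes the difference concrete: for y = x^2 over values symmetric around zero, the Pearson correlation is zero even though y is fully determined by x. The sketch below uses a simple plug-in estimate of mutual information for discrete values; both function names and the data are assumptions for illustration. Continuous features would need binning or a dedicated estimator such as scikit-learn's `mutual_info_regression`.

```python
from collections import Counter
from math import log, sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def mutual_information(xs, ys):
    """Plug-in mutual information (in nats) for discrete-valued sequences."""
    n = len(xs)
    count_x, count_y = Counter(xs), Counter(ys)
    count_xy = Counter(zip(xs, ys))
    return sum(
        (c / n) * log(c * n / (count_x[x] * count_y[y]))
        for (x, y), c in count_xy.items()
    )


# Toy data: a purely non-linear (quadratic) relationship.
xs = [-2, -1, 0, 1, 2] * 20
ys = [x * x for x in xs]

r = pearson(xs, ys)
mi = mutual_information(xs, ys)
print(r, mi)
```

The correlation comes out at zero while the mutual information is clearly positive, which is the behavior the question is testing.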


554

MCQeasy

A data scientist is analyzing a dataset with a target variable that is binary (0/1). Which visualization is most appropriate to explore the relationship between a continuous feature and the target?

A.Scatter plot of the feature vs. the target.

B.Bar chart of the feature.

C.Box plot of the feature grouped by target.

D.Histogram of the feature.

AnswerC

Box plots compare distributions across categories.

Why this answer

Option C is correct because a box plot shows the distribution of the continuous feature across the binary classes. Option A is wrong because a scatter plot is for two continuous variables. Option D is wrong because a histogram shows the distribution of one variable.

Option B is wrong because a bar chart is for categorical features.


555

MCQhard

Refer to the exhibit. A data engineer runs the AWS CLI command to check an object in an S3 bucket. The bucket is part of a data lake and is configured with versioning enabled. However, the output shows "VersionId": null. What is the most likely reason for this?

A.The object is encrypted using SSE-S3, which hides the version ID

B.The object was uploaded before versioning was enabled

C.The command must include the --version-id parameter to display the version ID

D.Versioning is not enabled on the bucket

AnswerB

Objects uploaded before versioning was enabled have a null version ID.

Why this answer

Option B is correct. The head-object command was run without the --version-id parameter, so it retrieves the latest version, and on a versioning-enabled bucket an object written after versioning was turned on has a version ID. A null version ID therefore indicates that the object was uploaded before versioning was enabled.

Option C is wrong because the command does not need the --version-id parameter to show the version ID. Option D is wrong because versioning is enabled at the bucket level, as the question states. Option A is wrong because server-side encryption does not affect versioning.

Full explanation →

556

MCQhard

A company uses Amazon SageMaker to train a deep learning model for image classification. The training dataset consists of 500,000 images, each 256x256 pixels, stored in S3. The team uses a single ml.p3.2xlarge instance for training. The training time is unacceptably long (over 48 hours). The team wants to reduce training time without sacrificing model accuracy. They have already optimized the data pipeline by using SageMaker Pipe mode and sharding the S3 dataset. The model is a ResNet-50 implemented in TensorFlow. The team is considering the following options: A) Switch to a ml.p3.16xlarge instance which has 8 GPUs and more memory. B) Implement distributed data parallelism using Horovod across multiple instances. C) Use SageMaker's built-in Hyperparameter Tuning to find optimal hyperparameters. D) Reduce the image resolution to 128x128 to speed up training. Which option will MOST effectively reduce training time while maintaining accuracy?

A.Switch to a ml.p3.16xlarge instance

B.Reduce the image resolution to 128x128

C.Implement distributed data parallelism using Horovod across multiple instances

D.Use SageMaker's built-in Hyperparameter Tuning

AnswerC

Horovod enables efficient multi-GPU, multi-instance training, so throughput can scale close to linearly with the number of GPUs.

Why this answer

Using multiple instances with Horovod for distributed data parallelism can scale training throughput close to linearly with the number of GPUs, significantly reducing training time. A larger single instance (ml.p3.16xlarge) provides 8 GPUs but is still limited to what one instance offers. Hyperparameter tuning does not directly reduce training time.

Reducing the resolution may cost accuracy.
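The idea behind data parallelism can be sketched without any GPU library: each worker computes a gradient on its own shard, the gradients are averaged (the role an allreduce plays in Horovod), and every worker applies the same update. This is a toy single-process illustration with a hypothetical 1-D model, not Horovod code.

```python
def grad_mse(w, batch):
    """Gradient of mean squared error for the 1-D model y_hat = w * x."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

def allreduce_mean(values):
    """Stand-in for an allreduce: every worker ends up with the mean."""
    return sum(values) / len(values)

# Hypothetical toy data following y = 3x, split evenly across 4 "workers".
data = [(float(x), 3.0 * x) for x in range(1, 9)]
num_workers = 4
shards = [data[i::num_workers] for i in range(num_workers)]

w, lr = 0.0, 0.01
for step in range(200):
    # On real hardware these gradients are computed in parallel, one per GPU.
    local_grads = [grad_mse(w, shard) for shard in shards]
    w -= lr * allreduce_mean(local_grads)

# With equal-sized shards the averaged gradient equals the full-batch gradient.
full = grad_mse(0.0, data)
sharded = allreduce_mean([grad_mse(0.0, s) for s in shards])
print(f"learned w = {w:.4f}; full grad = {full}, averaged shard grad = {sharded}")
```

Because the averaged gradient matches the full-batch gradient, adding workers changes how fast each step runs rather than what the model learns, which is why accuracy can be preserved.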

Full explanation →

557

MCQhard

A company uses Amazon EMR to run Spark jobs on a transient cluster that processes data from S3. The jobs are failing with 'OutOfMemory' errors. The data engineer has already increased the executor memory. Which additional configuration change would MOST likely resolve the issue?

A.Use fewer, larger instance types for the core nodes

B.Increase the number of partitions in the data

C.Increase the driver memory

D.Increase the number of executors

AnswerB

More partitions means smaller data per task, reducing memory usage.

Why this answer

Increasing the number of partitions reduces the amount of data each task must hold in memory, which relieves memory pressure on the executors. Option A is wrong because switching to fewer, larger nodes does not shrink the data each task processes and may worsen the problem. Option C is wrong because increasing driver memory addresses driver-side issues.

Option D is wrong because more executors increase parallelism but do not reduce the data each task processes.

Full explanation →

558

MCQmedium

A team is building a data pipeline that ingests data from an Amazon S3 bucket, transforms it using AWS Glue, and loads it into Amazon Redshift for analysis. The Glue job runs on a schedule every hour. The team has noticed that the job takes longer than expected and sometimes fails due to memory issues. The data volume is variable, with occasional spikes. Which solution should the team implement to optimize the pipeline?

A.Decrease the number of workers to reduce memory contention.

B.Enable job bookmarks to process only new data and use a G.2X worker type for more memory.

C.Increase the schedule frequency to run the job more often with smaller data increments.

D.Replace AWS Glue with Amazon EMR using Spark.

AnswerB

Job bookmarks prevent reprocessing and larger workers provide more memory.

Why this answer

Option B is correct because Glue job bookmarks track already-processed data so that each run handles only new data, and a larger worker type such as G.2X provides more memory to absorb spikes. Option A is wrong because decreasing the number of workers leaves less total memory and would worsen the memory issues. Option C is wrong because increasing the schedule frequency would not address the root cause.

Option D is wrong because AWS Glue already uses Spark internally.

Full explanation →

559

MCQmedium

A company is using Amazon SageMaker to deploy a model for real-time inference. The model requires 500 MB of memory and has a latency requirement of 100 ms. The endpoint is receiving 10 requests per second. Which instance type should be chosen for cost-effectiveness?

A.ml.c5.xlarge

B.ml.t2.medium

C.ml.m5.large

D.ml.p3.2xlarge

AnswerC

Adequate memory and cost-effective.

Why this answer

Option C is correct because ml.m5.large (2 vCPU, 8 GB) is more than sufficient for the memory and throughput requirements and is cost-effective. Option A is wrong because ml.c5.xlarge (4 vCPU, 8 GB) is more expensive than needed. Option B is wrong because ml.t2.medium (2 vCPU, 4 GB) has enough memory but is a burstable instance whose CPU may be limited under sustained load.

Option D is wrong because ml.p3.2xlarge is a GPU instance and overkill for this workload.

Full explanation →

560

MCQeasy

A data science team is deploying a machine learning model to production using Amazon SageMaker. The model requires real-time inference with low latency. Which SageMaker feature should they use to deploy the model?

A.SageMaker Notebook Instance

B.SageMaker Batch Transform

C.SageMaker Autopilot

D.SageMaker Realtime Endpoint

AnswerD

Provides low-latency, real-time inference.

Why this answer

Option D is correct because SageMaker real-time endpoints provide low-latency, synchronous inference. Option A is wrong because SageMaker notebook instances are for development, not deployment. Option B is wrong because Batch Transform is for offline, batch predictions.

Option C is wrong because SageMaker Autopilot automates model building, not deployment.

Full explanation →

561

MCQeasy

A data scientist is using Amazon SageMaker to train a linear regression model. The target variable is right-skewed. Which transformation should the data scientist apply to the target variable to improve model performance?

A.Min-max scaling

B.One-hot encoding

C.Log transformation

D.Principal Component Analysis (PCA)

AnswerC

Log transformation reduces right skewness.

Why this answer

Option C is correct because log transformation is commonly used to reduce right skewness. Option A is wrong because min-max scaling does not address skewness. Option B is wrong because one-hot encoding is for categorical variables.

Option D is wrong because PCA is for dimensionality reduction.
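To see the effect of a log transformation on a right-skewed target, the sketch below compares skewness before and after applying log(1 + y). The target values are hypothetical.

```python
import math

def skewness(values):
    """Population skewness: the third standardized moment."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Hypothetical right-skewed target: most values small, a few very large.
target = [1, 2, 2, 3, 3, 4, 5, 8, 20, 100]
log_target = [math.log1p(v) for v in target]  # log(1 + v) also handles zeros

raw_skew = skewness(target)
log_skew = skewness(log_target)
print(f"skewness before: {raw_skew:.2f}, after log1p: {log_skew:.2f}")
```

Predictions made on the log scale need to be mapped back with math.expm1 before they are reported in the original units.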

Full explanation →

562

MCQmedium

A data scientist runs a logistic regression and obtains a model with 95% accuracy on the training set. However, the model performs poorly on the test set. Which exploratory data analysis step should have been performed to identify this issue?

A.Generating a correlation matrix of features

B.Log transformation of skewed features

C.Checking for class imbalance in the target variable

D.Creating a heatmap of missing values

AnswerC

Class imbalance can lead to high training accuracy but poor generalization.

Why this answer

Checking for class imbalance is critical because it can cause a model to predict the majority class and still achieve high accuracy, but fail on the minority class in unseen data. Option A is wrong because a correlation matrix helps with multicollinearity. Option B is wrong because log transformation is for skewness, not class imbalance.

Option D is wrong because missing value heatmaps show missing data patterns.

Full explanation →

563

MCQeasy

A data scientist uses Amazon SageMaker to train a model. The training dataset is 10 GB and stored in S3. The training job uses a ml.m5.large instance. The data must be available on the local file system during training. Which input mode should be used?

A.Local input mode

B.Batch input mode

C.File input mode

D.Pipe input mode

AnswerC

File mode downloads data to the local file system, making it available for training.

Why this answer

File input mode is correct because it downloads the entire training dataset from S3 to the local file system of the training instance before training begins, so the data is available locally as required. The only sizing concern is storage: the storage volume attached to the ml.m5.large instance must be provisioned large enough to hold the 10 GB dataset plus anything the job writes during training.

Exam trap

The trap here is that candidates may confuse 'File input mode' with 'Pipe input mode' and incorrectly choose Pipe mode for local file availability, or invent 'Local input mode' as a plausible-sounding option.

How to eliminate wrong answers

Option A is wrong because 'Local input mode' is not a valid SageMaker input mode; the correct term is 'File input mode' for local file system access. Option B is wrong because 'Batch input mode' is not a SageMaker input mode; SageMaker uses 'File' or 'Pipe' modes, and batch processing refers to Batch Transform jobs, not training input. Option D is wrong because 'Pipe input mode' streams data directly from S3 to the training algorithm without writing to the local file system, which does not satisfy the requirement that data must be available on the local file system during training.
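A minimal sketch of where the input mode and volume size appear in a training job request, written as a plain Python dictionary. The field names follow the CreateTrainingJob request shape; the bucket name and the 30 GB volume size are hypothetical, and the fragment is not a complete request.

```python
# Fragment of a CreateTrainingJob request (not a complete request).
# The bucket name and volume size below are hypothetical.
training_job_fragment = {
    "InputDataConfig": [
        {
            "ChannelName": "train",
            "InputMode": "File",  # copy the channel's data to local storage first
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "S3Prefix",
                    "S3Uri": "s3://example-bucket/train/",
                }
            },
        }
    ],
    "ResourceConfig": {
        "InstanceType": "ml.m5.large",
        "InstanceCount": 1,
        "VolumeSizeInGB": 30,  # must be larger than the dataset in File mode
    },
}

dataset_gb = 10
volume_gb = training_job_fragment["ResourceConfig"]["VolumeSizeInGB"]
fits = volume_gb > dataset_gb
print(f"dataset {dataset_gb} GB fits on a {volume_gb} GB volume: {fits}")
```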

Full explanation →

564

MCQhard

A data scientist is training a time series forecasting model using Amazon SageMaker's DeepAR algorithm. The dataset contains daily sales data for 10,000 products over 2 years. The scientist splits the data chronologically: training on the first 18 months, validation on the next 3 months, and test on the last 3 months. The model performs well on validation but poorly on test. The data scientist suspects the model is overfitting to the validation period. Which action should the scientist take to improve test performance?

A.Use time series cross-validation with an expanding window

B.Reduce the context length to 30 days

C.Add more exogenous features like holidays and promotions

D.Use the entire dataset for training and ignore validation

AnswerA

Proper cross-validation reduces overfitting to a specific validation period.

Why this answer

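Option A is correct because time series cross-validation with an expanding window respects time order and evaluates the model on several validation periods instead of a single one. Option B (reduce context length) may lose long-term patterns. Option C (add more features) may worsen overfitting.

Option D (use the entire dataset for training) increases the training data, which may help, but does not specifically address overfitting to the validation period.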
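Expanding-window cross-validation can be sketched as a generator of index ranges: the training window always starts at the beginning and grows, and each validation block lies strictly after it. The function name and the monthly granularity are illustrative assumptions.

```python
def expanding_window_splits(n_periods, initial_train, horizon):
    """Yield (train_indices, validation_indices) with a growing training window."""
    train_end = initial_train
    while train_end + horizon <= n_periods:
        yield range(0, train_end), range(train_end, train_end + horizon)
        train_end += horizon

# 24 months of data: start with 12 months of training, validate on 3-month blocks.
splits = list(expanding_window_splits(n_periods=24, initial_train=12, horizon=3))
for train, val in splits:
    print(f"train months 0-{train[-1]}, validate months {val[0]}-{val[-1]}")
```

Averaging the validation metric over all folds gives an estimate that depends less on any single three-month period.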

Full explanation →

565

MCQeasy

A data engineering team needs to process streaming data from thousands of IoT devices. The data must be ingested with low latency and processed in near real-time to detect anomalies. Which AWS service should they use for ingestion?

A.Amazon Kinesis Data Firehose

B.Amazon Kinesis Data Analytics

C.Amazon S3

D.Amazon Kinesis Data Streams

AnswerD

Kinesis Data Streams is the correct service for real-time streaming ingestion.

Why this answer

Amazon Kinesis Data Streams is designed for real-time streaming data ingestion and can handle large throughput with low latency. Option A is wrong because Kinesis Data Firehose is for loading data into destinations. Option B is wrong because Kinesis Data Analytics is for processing.

Option C is wrong because S3 is for object storage, not streaming.

Full explanation →

566

MCQhard

A machine learning team is using Amazon SageMaker Experiments to track multiple training runs. They need to compare the performance of different models based on metrics like accuracy and F1 score. However, when they view the experiment list in SageMaker Studio, the metrics are not displayed. What is the MOST likely cause?

A.The training job did not define metric definitions in the algorithm specification.

B.The training script did not use the SageMaker SDK to log the metrics.

C.The training job is running on an instance type that does not support Experiments.

D.The IAM role used by SageMaker does not have permission to write to the Experiments table.

AnswerB

Metrics must be logged with the SageMaker SDK's log_metric() call, or are captured automatically when using frameworks with SageMaker integration.

Why this answer

Option B is correct because metrics must be explicitly logged via the SageMaker SDK's log_metric() or in the training job definition to appear in Experiments. Option A is wrong because metric definitions in the training job are optional and not required for logging. Option C is wrong because Experiments works with any instance type.

Option D is wrong because permission to view is separate from metric capture.

Full explanation →

567

MCQeasy

A data scientist is analyzing a dataset with missing values in several columns. The dataset is stored in an S3 bucket. What is the most efficient method to identify the percentage of missing values per column using AWS services?

A.Use Amazon SageMaker Notebook with pandas to load the dataset and compute missing percentages.

B.Use Amazon QuickSight to connect to S3 and calculate missing value percentages via calculated fields.

C.Use Amazon Athena to query the data with SQL using COUNT(*) and CASE statements to compute missing percentage per column.

D.Use AWS Glue Crawler to infer schema and view missing values statistics in the AWS Glue Data Catalog.

AnswerC

Athena is serverless and can query S3 data directly, making it efficient for this task.

Why this answer

Option C is correct because Amazon Athena runs SQL queries directly on data in S3, and COUNT and CASE expressions can compute missing-value percentages efficiently without moving data. Option A is wrong because a SageMaker notebook requires manual coding and is less efficient for quick checks. Option B is wrong because QuickSight is a visualization tool, not for direct SQL-based analysis.

Option D is wrong because an AWS Glue crawler only catalogs metadata and does not perform data analysis.
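The COUNT(*) and CASE pattern can be tried locally. The sketch below uses an in-memory SQLite database as a stand-in for Athena, with a hypothetical table and columns; the same SELECT shape is the kind of query one would run against a table defined over S3 data.

```python
import sqlite3

# SQLite stands in for Athena here; table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (age INTEGER, income REAL)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(34, 52000.0), (None, 61000.0), (45, None), (29, None)],
)

# One pass over the table: count NULLs per column and divide by the row count.
query = """
SELECT
    100.0 * SUM(CASE WHEN age IS NULL THEN 1 ELSE 0 END) / COUNT(*) AS age_missing_pct,
    100.0 * SUM(CASE WHEN income IS NULL THEN 1 ELSE 0 END) / COUNT(*) AS income_missing_pct
FROM customers
"""
age_pct, income_pct = conn.execute(query).fetchone()
print(f"age: {age_pct:.1f}% missing, income: {income_pct:.1f}% missing")
```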

Full explanation →

568

MCQeasy

A data scientist wants to automate the selection of optimal hyperparameters for a model. Which SageMaker feature should be used?

A.SageMaker Debugger

B.SageMaker Model Monitor

C.SageMaker Automatic Model Tuning

D.SageMaker Experiments

AnswerC

Automatic Model Tuning optimizes hyperparameters.

Why this answer

SageMaker Automatic Model Tuning (AMT) is the correct feature because it automates hyperparameter optimization by running multiple training jobs with different hyperparameter combinations, using algorithms like Bayesian optimization or random search to find the best set. This directly addresses the requirement to automate selection of optimal hyperparameters.

Exam trap

The trap here is that candidates confuse SageMaker Experiments (which tracks and compares runs) with Automatic Model Tuning (which actively searches for optimal hyperparameters), leading them to pick D instead of C.

How to eliminate wrong answers

Option A is wrong because SageMaker Debugger monitors and debugs training jobs in real-time (e.g., detecting vanishing gradients or overfitting), but it does not perform hyperparameter optimization. Option B is wrong because SageMaker Model Monitor detects data drift and quality issues in deployed endpoints, not hyperparameter tuning during training. Option D is wrong because SageMaker Experiments tracks and organizes training runs, metrics, and parameters for comparison, but it does not automatically select optimal hyperparameters.

Full explanation →

569

MCQhard

A data pipeline uses AWS Lambda to process small files (10-50 MB) from an S3 bucket and write results to DynamoDB. The Lambda function times out after 15 seconds for larger files. The team wants to handle files up to 100 MB without changing the Lambda code. Which solution is MOST cost-effective?

A.Use AWS Glue Python shell job to replace Lambda

B.Increase the Lambda function timeout to 5 minutes

C.Use Amazon ECS with AWS Fargate to run the processing task

D.Configure an SQS queue to buffer the S3 events and batch them

AnswerB

Lambda allows up to 15 minutes, and 5 minutes is sufficient for 100 MB. No code changes needed.

Why this answer

Increasing Lambda timeout is the simplest and most cost-effective solution for occasional larger files. Using ECS Fargate or Glue is overkill and more expensive. SQS does not solve the timeout issue.

Full explanation →

570

MCQmedium

A media company ingests video metadata from multiple sources into an Amazon S3 bucket. Each metadata record is a JSON file about 2 KB. They use AWS Glue ETL jobs to process these files and load them into Amazon Redshift for analytics. The jobs currently run hourly and take about 10 minutes to process all new files. However, the company is growing and expects the number of files to increase 100x. The data engineering team wants to minimize processing time and cost. The Glue job currently reads all files from the S3 bucket using a full scan. What should they do to optimize the pipeline?

A.Consolidate the small JSON files into larger files using a scheduled job

B.Convert the data to Parquet format and partition it

C.Increase the number of Glue DPUs to process files faster

D.Use S3 event notifications to trigger Glue jobs only for new files

AnswerD

Event-driven eliminates full scan and reduces cost.

Why this answer

Using S3 event notifications to start Glue jobs for new objects only eliminates the full scan and reduces latency. Option A is wrong because consolidating files reduces the number of objects but adds another scheduled job and does not remove the full scan. Option B is wrong because converting to Parquet helps, but the job would still scan all files.

Option C is wrong because increasing DPUs costs more.

Full explanation →

571

MCQhard

A training job log shows this error. The training instance is an ml.m5.large with 8 GB EBS storage. The training data is 500 MB, and the model size is expected to be 200 MB. What is the most likely cause?

A.The training data is not fully downloaded from S3 before processing

B.The S3 bucket does not have write permissions

C.The training instance does not have enough RAM

D.The training process is generating large temporary files that fill the instance's local storage

AnswerD

Intermediate files, checkpoints, or logs can exceed the 8 GB storage.

Why this answer

Option D is correct: the instance's local storage is full due to temporary files or checkpoints. Option A (data download) would show download errors. Option B (S3 permissions) would show AccessDenied.

Option C (insufficient memory) would show MemoryError.

Full explanation →

572

MCQhard

A company wants to build a machine learning model to predict customer churn. The dataset includes customer demographics, usage patterns, and support interactions. The data is stored in Amazon S3. The data scientist needs to perform feature engineering, including creating aggregate features from support interactions and encoding categorical variables. Which AWS service is most suitable for building the feature engineering pipeline?

A.AWS Glue

B.Amazon EMR

C.AWS Batch

D.Amazon SageMaker Processing

AnswerD

SageMaker Processing is purpose-built for data preprocessing and feature engineering with SageMaker.

Why this answer

Amazon SageMaker Processing is the most suitable service because it is purpose-built for data preprocessing and feature engineering within the SageMaker ecosystem. It allows you to run custom Python scripts (e.g., using pandas or PySpark) on managed infrastructure to create aggregate features from support interactions and encode categorical variables, and it integrates seamlessly with SageMaker for model training and deployment.

Exam trap

The trap here is that candidates often confuse AWS Glue (a general ETL tool) with SageMaker Processing, but the question specifically asks for a service that integrates with the SageMaker model building pipeline, making SageMaker Processing the correct choice.

How to eliminate wrong answers

Option A is wrong because AWS Glue is primarily a serverless ETL service for data cataloging and schema discovery, not optimized for running custom feature engineering scripts with tight integration to SageMaker training jobs. Option B is wrong because Amazon EMR is a big data platform for running distributed frameworks like Spark and Hadoop, which is overkill and less integrated for simple feature engineering tasks that SageMaker Processing can handle more directly. Option C is wrong because AWS Batch is a general-purpose batch computing service for running any containerized workload, but it lacks native integration with SageMaker’s model building pipeline and does not provide the same level of convenience for feature engineering steps.

Full explanation →

573

Multi-Selecteasy

A data scientist needs to select a model training infrastructure that supports distributed training across multiple GPUs and provides automatic model parallelism. Which TWO AWS services should the scientist consider?

Select 2 answers

A.AWS Glue

B.AWS Lambda

C.Amazon Redshift

D.Amazon EMR

E.Amazon SageMaker

AnswersD, E

EMR with Spark MLlib can perform distributed training.

Why this answer

Options D (Amazon EMR) and E (Amazon SageMaker) are correct. SageMaker supports distributed training with model parallelism. EMR with Spark supports distributed ML.

Option A (AWS Glue) is for ETL, not training. Option B (AWS Lambda) is not for large-scale training. Option C (Amazon Redshift) is a data warehouse.

Full explanation →

574

Multi-Selecteasy

A data scientist is performing feature selection for a linear regression model. Which TWO methods are appropriate? (Choose TWO.)

Select 2 answers

A.Lasso (L1) regularization

B.Ridge (L2) regularization

C.t-distributed stochastic neighbor embedding (t-SNE)

D.Forward selection

E.Principal component analysis (PCA)

AnswersA, D

Lasso can zero out feature coefficients, effectively selecting features.

Why this answer

Forward selection and Lasso regularization are both feature selection methods. Lasso adds an L1 penalty that can shrink coefficients exactly to zero. Option B is wrong because L2 regularization (Ridge) shrinks coefficients but does not set them to zero.

Option C is wrong because t-SNE is for visualization. Option E is wrong because PCA reduces dimensions but does not select original features.
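The difference between the two penalties is easiest to see in the simplified case of orthonormal features, where each coefficient can be treated separately. Under that assumption (and up to how the penalty is scaled), Lasso applies soft-thresholding while Ridge applies proportional shrinkage. The coefficient values below are hypothetical.

```python
def lasso_shrink(beta, lam):
    """Soft-thresholding: Lasso solution for one coefficient (orthonormal case)."""
    if beta > lam:
        return beta - lam
    if beta < -lam:
        return beta + lam
    return 0.0  # small coefficients are set exactly to zero

def ridge_shrink(beta, lam):
    """Ridge solution for one coefficient (orthonormal case): scaled, never zeroed."""
    return beta / (1.0 + lam)

ols = [2.5, -0.3, 0.1, -1.8]  # hypothetical least-squares coefficients
lam = 0.5
lasso = [lasso_shrink(b, lam) for b in ols]
ridge = [ridge_shrink(b, lam) for b in ols]
selected = [i for i, b in enumerate(lasso) if b != 0.0]

print("lasso:", lasso)
print("ridge:", [round(b, 3) for b in ridge])
print("features kept by lasso:", selected)
```

Lasso drops the two weak features outright, while Ridge keeps all four with smaller magnitudes, which is why only Lasso counts as a selection method.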

Full explanation →

575

MCQmedium

A company is building a data pipeline to process sensitive customer data. The pipeline uses AWS Glue for ETL and stores results in Amazon S3. The security team requires that all data be encrypted at rest in S3 using customer-managed AWS KMS keys. Additionally, the Glue job must be able to write encrypted data to S3. What should the data engineer do to meet these requirements?

A.Attach a policy to the Glue job's IAM role that includes kms:GenerateDataKey and kms:Decrypt actions for the KMS key.

B.Use S3 server-side encryption with customer-provided keys (SSE-C).

C.Use S3 server-side encryption with SSE-S3, which is enabled by default.

D.Configure an S3 bucket policy to enforce encryption and attach it to the Glue job's IAM role.

AnswerA

These permissions allow Glue to encrypt and decrypt data using the KMS key.

Why this answer

The Glue job's IAM role needs kms:GenerateDataKey and kms:Decrypt permissions on the KMS key. Option B is wrong because SSE-C requires the customer to manage keys and Glue cannot provide the encryption key in each request. Option C is wrong because SSE-S3 uses Amazon-managed keys, not customer-managed keys.

Option D is wrong because an S3 bucket policy alone does not grant the Glue job access to the KMS key.
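A minimal sketch of the policy statement described above, built as a Python dictionary and printed as JSON. The key ARN is a hypothetical placeholder, and a real role would also need the usual S3 and Glue permissions, which are omitted here.

```python
import json

# Hypothetical key ARN; substitute the ARN of the customer-managed key.
KMS_KEY_ARN = "arn:aws:kms:us-east-1:111122223333:key/EXAMPLE-KEY-ID"

# Statement to attach to the Glue job's IAM role (S3 and Glue permissions omitted).
glue_role_kms_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["kms:GenerateDataKey", "kms:Decrypt"],
            "Resource": KMS_KEY_ARN,
        }
    ],
}

print(json.dumps(glue_role_kms_policy, indent=2))
```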

Full explanation →

576

MCQmedium

A data scientist is using Amazon SageMaker to train a model using the built-in XGBoost algorithm. The training job uses a hyperparameter tuning job to optimize hyperparameters. The tuning job has been running for 3 hours and has completed 20 training jobs. The data scientist wants to stop the tuning job early if it is not making progress. What should the data scientist do to accomplish this?

A.Configure the tuning job with early stopping enabled.

B.Set up a CloudWatch alarm to stop the tuning job if a metric does not improve.

C.Use SageMaker Experiments to monitor and manually stop the tuning job.

D.Use SageMaker Debugger to stop training jobs that are not improving.

AnswerA

Built-in early stopping stops underperforming training jobs.

Why this answer

Option A is correct because SageMaker automatic model tuning supports early stopping through the tuning job's training job early stopping type setting. Option B is wrong because a CloudWatch alarm cannot stop a tuning job directly. Option C is wrong because SageMaker Experiments is for tracking, not stopping.

Option D is wrong because SageMaker Debugger acts on individual training jobs, not tuning jobs.
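As a sketch of where the early stopping setting lives, the dictionary below mirrors the tuning job configuration section of a CreateHyperParameterTuningJob request. The objective metric and resource limits are illustrative assumptions, and parameter ranges are omitted.

```python
# Fragment of a hyperparameter tuning job configuration (parameter ranges omitted).
# The metric name and limits are illustrative assumptions.
tuning_job_config = {
    "Strategy": "Bayesian",
    "HyperParameterTuningJobObjective": {
        "Type": "Maximize",
        "MetricName": "validation:auc",
    },
    "ResourceLimits": {
        "MaxNumberOfTrainingJobs": 50,
        "MaxParallelTrainingJobs": 5,
    },
    # "Auto" lets the tuner stop training jobs that are unlikely to beat the best so far.
    "TrainingJobEarlyStoppingType": "Auto",
}

early_stopping_enabled = tuning_job_config["TrainingJobEarlyStoppingType"] == "Auto"
print("early stopping enabled:", early_stopping_enabled)
```

Note that this setting ends individual underperforming training jobs early, which shortens the tuning job overall rather than cancelling it outright.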

Full explanation →

577

MCQhard

A company is building a near-real-time dashboard using data from multiple sources. They need to aggregate millions of events per second with sub-second latency. The architecture must be fully managed and minimize operational overhead. Which service should they use for the aggregation layer?

A.Amazon Kinesis Data Analytics for Apache Flink.

B.AWS Lambda functions triggered by Kinesis Data Streams.

C.Amazon EMR with Spark Streaming.

D.Amazon Redshift with materialized views refreshed frequently.

AnswerA

Kinesis Data Analytics with Flink provides low-latency, stateful stream processing at scale.

Why this answer

Option A is correct because Kinesis Data Analytics is designed for real-time streaming analytics with sub-second latency and is fully managed. Option B is wrong because Lambda has concurrency limits and is not optimized for millions of events per second. Option C is wrong because EMR requires cluster management.

Option D is wrong because Redshift is not designed for sub-second streaming ingestion; it is batch-oriented.

Full explanation →

578

MCQhard

During exploratory data analysis, a data scientist observes a strong correlation (r=0.95) between two numeric features. The model to be trained is a linear regression. What is the most appropriate action?

A.Apply standardization to both features.

B.Remove one of the correlated features.

C.Use L2 regularization (Ridge regression) without removing features.

D.Create an interaction term between the two features.

AnswerB

Removing reduces multicollinearity in linear regression.

Why this answer

Option B is correct because such a high correlation indicates multicollinearity, which can be addressed by removing one of the two features. Option A is wrong because scaling does not change the correlation. Option C is wrong because regularization helps but is not the first step; removal is simpler.

Option D is wrong because interaction terms increase multicollinearity.
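The severity of the multicollinearity can be quantified with the variance inflation factor (VIF). With only two predictors, the VIF of each reduces to 1 / (1 - r^2), so the observed r = 0.95 can be plugged in directly. The function name is illustrative.

```python
def vif_two_features(r):
    """Variance inflation factor when a feature has one other predictor with correlation r."""
    return 1.0 / (1.0 - r ** 2)

r = 0.95
vif = vif_two_features(r)
print(f"r = {r}: VIF = {vif:.2f}")

# For comparison, a weakly correlated pair barely inflates the variance.
weak_vif = vif_two_features(0.30)
print(f"r = 0.30: VIF = {weak_vif:.2f}")
```

A VIF of roughly 10 means the variance of each coefficient estimate is about ten times what it would be with uncorrelated features, which is commonly treated as a sign that one feature should be dropped.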

Full explanation →

579

MCQhard

An e-commerce company uses Amazon Kinesis Data Firehose to deliver clickstream data to Amazon S3. The data arrives at unpredictable rates, with occasional bursts. The company needs to ensure data is delivered within 60 seconds of ingestion, and the data must be partitioned by year/month/day/hour. Which configuration meets these requirements?

A.Set the buffer size to 1 MB and disable dynamic partitioning

B.Use a Lambda function to process data and write to S3 with partitioning

C.Use AWS Glue streaming ETL to read from Firehose and write to S3

D.Set the buffer interval to 60 seconds and enable dynamic partitioning

AnswerD

Buffer interval controls delivery frequency; dynamic partitioning creates time-based folders.

Why this answer

Setting the buffer interval to 60 seconds and enabling dynamic partitioning ensures data is delivered within 60 seconds and automatically partitioned by time. Option A (buffer size 1 MB) would cause excessive small files. Option B (Lambda function) adds latency.

Option C (Glue streaming) is not directly integrated with Firehose.

Full explanation →

580

MCQeasy

A data scientist is training a binary classification model on an imbalanced dataset (95% negative class, 5% positive class). The model achieves 95% accuracy but only predicts the negative class for all examples. Which metric should the scientist use to evaluate model performance more appropriately?

A.F1 score

B.Mean squared error

C.Accuracy

D.AUC-ROC

AnswerD

AUC-ROC evaluates the model's ability to distinguish between classes regardless of threshold and is robust to imbalance.

Why this answer

AUC-ROC is robust to class imbalance because it evaluates the model's ability to discriminate between positive and negative classes across all classification thresholds, rather than relying on a single threshold. In this scenario, the model predicts only the negative class, so its true positive rate is 0 and false positive rate is 0, yielding an AUC-ROC of 0.5 (random performance), which correctly reflects the model's lack of predictive power.

Exam trap

The trap here is that candidates often choose F1 score (Option A) thinking it handles imbalance well, but they forget that F1 score requires at least some true positives to be meaningful, and in this extreme case where the model predicts only negatives, F1 score collapses to 0 or undefined, whereas AUC-ROC correctly identifies random performance.

How to eliminate wrong answers

Option A is wrong because F1 score is a harmonic mean of precision and recall, but when the model predicts only the negative class, recall is 0 (no true positives), making the F1 score undefined or 0, which does not provide a meaningful evaluation of the model's overall discriminative ability. Option B is wrong because mean squared error (MSE) is a regression metric that measures average squared differences between predicted and actual values; it is not designed for binary classification and does not account for class imbalance or threshold behavior. Option C is wrong because accuracy is misleading on imbalanced datasets; a model that always predicts the majority class achieves high accuracy (95%) but fails to identify any positive instances, so accuracy does not reflect the model's true performance on the minority class.
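The scenario can be reproduced with a few lines of plain Python. The sketch below scores a model that gives every example the same score on a 95/5 dataset, computing AUC as the probability that a random positive is ranked above a random negative, with ties counted as one half. The helper names are illustrative.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def auc_roc(y_true, scores):
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y_true = [1] * 5 + [0] * 95   # 5% positive, 95% negative
scores = [0.0] * 100          # the model gives every example the same score
y_pred = [0] * 100            # so every prediction is the negative class

acc = accuracy(y_true, y_pred)
auc = auc_roc(y_true, scores)
print(f"accuracy = {acc:.2f}, AUC-ROC = {auc:.2f}")
```

Accuracy comes out at 0.95 while AUC-ROC is 0.5, the value expected from random ranking.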

Full explanation →

581

MCQmedium

A data scientist is working with a dataset containing categorical features with high cardinality. The scientist wants to use a tree-based model. Which encoding method should be used?

A.Ordinal encoding

B.Target encoding

C.Label encoding

D.One-hot encoding

AnswerA

Ordinal encoding assigns integers without implying order, suitable for trees.

Why this answer

Option A is correct because tree-based models can handle ordinal encoding naturally. Option B is wrong because target encoding may cause overfitting. Option C is wrong because label encoding may impose an ordinal relationship.

Option D is wrong because one-hot encoding creates many dimensions, which is not ideal for high cardinality.

Full explanation →

582

MCQmedium

A data scientist is training a binary classification model on an imbalanced dataset where the positive class represents 1% of the data. The model needs to maximize recall while keeping precision above 0.7. Which sampling strategy should the data scientist use?

A.NearMiss from imbalanced-learn to undersample the majority class based on distance to minority samples.

B.SMOTE from imbalanced-learn to generate synthetic samples for the minority class.

C.RandomUnderSampler from imbalanced-learn to undersample the majority class.

D.TomekLinks from imbalanced-learn to remove overlapping samples.

E.RandomOverSampler from imbalanced-learn to oversample the minority class.

AnswerB

SMOTE creates synthetic samples, balancing the dataset and improving recall while preserving precision.

Why this answer

Option B is correct because SMOTE generates synthetic samples for the minority class, which can improve recall without discarding data. Option A (NearMiss) focuses on hard samples and may reduce recall. Option C (RandomUnderSampler) may discard too many majority samples, reducing precision.

Option D (TomekLinks) only removes noisy instances and does not address the imbalance effectively. Option E (RandomOverSampler) can cause overfitting.

Full explanation →

583

MCQmedium

A company is using Amazon SageMaker to build a binary classification model for customer churn. The dataset is highly imbalanced (90% no churn, 10% churn). Which technique is MOST effective for handling class imbalance?

A.Use accuracy as the evaluation metric.

B.Undersample the majority class.

C.Use SMOTE to generate synthetic samples for the minority class.

D.Train a random forest model instead of logistic regression.

AnswerC

SMOTE is a standard oversampling technique.

Why this answer

SMOTE (Synthetic Minority Oversampling Technique) is the most effective option because it generates synthetic samples for the minority class by interpolating between existing minority instances, thereby balancing the dataset without discarding valuable majority-class data. This approach directly addresses the class imbalance in a binary classification task on SageMaker, improving model recall for the churn class without the information loss caused by undersampling.

Exam trap

The trap here is that candidates often assume switching to a tree-based model (like random forest) inherently solves class imbalance, but the exam tests that explicit resampling or cost-sensitive techniques are required for effective handling.

How to eliminate wrong answers

Option A is wrong because accuracy is a misleading metric for imbalanced datasets; a model that predicts 'no churn' for all instances would achieve 90% accuracy but fail to identify any churn cases. Option B is wrong because undersampling the majority class discards potentially useful data, which can lead to loss of information and reduced model performance, especially when the dataset is not extremely large. Option D is wrong because simply switching to a random forest model does not inherently address class imbalance; while tree-based models can handle imbalance better than logistic regression, they still require explicit imbalance-handling techniques like SMOTE or class weighting to be effective.
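The point about accuracy can be checked with a few lines of arithmetic. The sketch below uses a hypothetical 100-row sample with the same 90/10 split and a model that always predicts 'no churn'; the helper name accuracy_and_recall is an assumption.

```python
def accuracy_and_recall(y_true, y_pred, positive=1):
    """Return (accuracy, recall for the positive class)."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    true_pos = sum(
        t == positive and p == positive for t, p in zip(y_true, y_pred)
    )
    actual_pos = sum(t == positive for t in y_true)
    return correct / len(y_true), true_pos / actual_pos

# 10 churners (label 1) and 90 non-churners (label 0).
y_true = [1] * 10 + [0] * 90
# A degenerate model that predicts 'no churn' for every customer.
y_pred = [0] * 100

accuracy, recall = accuracy_and_recall(y_true, y_pred)
```

Here accuracy is 0.9 while recall on the churn class is 0.0, which is why accuracy alone is not a useful metric for this dataset.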

584

MCQeasy

A company is using SageMaker to train a model. The training data includes personally identifiable information (PII). The company must ensure that the data is encrypted at rest and in transit. Which combination of actions meets these requirements?

A.Use S3 server-side encryption with S3 managed keys (SSE-S3)

B.Enable SSL for data in transit and use VPC endpoints

C.Place all resources in private VPC subnets with no internet access

D.Use S3 server-side encryption and enable SageMaker inter-container traffic encryption

AnswerD

S3 SSE encrypts at rest; SageMaker inter-container encryption uses TLS for in-transit.

Why this answer

Option D is correct: S3 server-side encryption (SSE-S3 or SSE-KMS) encrypts data at rest, and SageMaker inter-container traffic encryption uses TLS to protect data in transit. Option B (SSL and VPC endpoints only) lacks at-rest encryption. Option A (S3 SSE only) lacks in-transit encryption.

Option C (VPC only) isolates the resources but does not encrypt anything.

585

MCQeasy

A data analyst wants to understand the distribution of a continuous variable. Which visualization is most appropriate for this purpose?

A.Box plot

B.Bar chart

C.Histogram

D.Scatter plot

AnswerC

Histogram displays the distribution of a single continuous variable.

Why this answer

Option C is correct because a histogram shows the frequency distribution of a continuous variable. Option D is wrong because a scatter plot shows the relationship between two variables. Option A is wrong because a box plot shows summary statistics but not the full distribution.

Option B is wrong because a bar chart is for categorical data.

586

Multi-Selecteasy

Which TWO actions are best practices for tuning hyperparameters using Amazon SageMaker Automatic Model Tuning?

Select 2 answers

A.Set the number of training jobs to a very large value

B.Use the same hyperparameters as the baseline model

C.Use Bayesian optimization strategy

D.Use grid search strategy

E.Use random search strategy

AnswersC, E

Bayesian optimization is effective and efficient.

Why this answer

Bayesian optimization and random search are both supported strategies that explore the search space efficiently, so C and E are correct. D: Grid search is also possible but is not efficient when there are many hyperparameters.

A: Setting a very large number of training jobs is costly and is not a best practice. B: Reusing the baseline hyperparameters does not tune anything.
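As a rough illustration, the tuning strategy and job limits are set in the HyperParameterTuningJobConfig portion of a CreateHyperParameterTuningJob request. The fragment below is only a sketch built as a Python dict; the metric name, parameter name, ranges, and job counts are assumptions, not recommended values.

```python
# Fragment of a CreateHyperParameterTuningJob request, expressed as a dict.
# All concrete values here are hypothetical placeholders.
tuning_job_config = {
    "Strategy": "Bayesian",  # "Random" is the other option the answer selects
    "HyperParameterTuningJobObjective": {
        "Type": "Maximize",
        "MetricName": "validation:auc",  # assumed objective metric
    },
    "ResourceLimits": {
        # A bounded budget rather than a very large number of jobs.
        "MaxNumberOfTrainingJobs": 20,
        "MaxParallelTrainingJobs": 2,
    },
    "ParameterRanges": {
        "ContinuousParameterRanges": [
            {
                "Name": "eta",  # assumed hyperparameter name
                "MinValue": "0.01",
                "MaxValue": "0.3",
                "ScalingType": "Logarithmic",
            }
        ]
    },
}
```

Keeping MaxParallelTrainingJobs small relative to the total gives a Bayesian strategy more completed results to learn from before it picks the next candidates.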

587

MCQhard

A data engineering team is designing a data lake on Amazon S3. They need to enforce encryption at rest for all data stored in the bucket. The security policy requires that the encryption keys be managed by the organization using AWS Key Management Service (KMS), and that the bucket must deny uploads of unencrypted objects. Which bucket policy should be applied?

A.A bucket policy that denies PutObject unless the request includes the 'x-amz-server-side-encryption' header with value 'AES256'

B.A bucket policy that denies PutObject if the 'x-amz-server-side-encryption' header is not present

C.A bucket policy that denies PutObject unless the request includes the 'x-amz-server-side-encryption-aws-kms-key-id' header matching the desired KMS key ID

D.Enable default encryption on the bucket with AWS-KMS

AnswerC

This enforces the use of a specific KMS key.

Why this answer

To enforce encryption, a bucket policy can deny PutObject if the object is not encrypted with the required KMS key. The condition key 's3:x-amz-server-side-encryption-aws-kms-key-id' checks the key ID. Option C correctly denies requests that do not include the required key.

Option A requires SSE-S3 (AES256) instead of KMS; Option B is incomplete because it checks only that the encryption header is present, not which key is used; Option D sets a default encryption method but does not by itself deny non-compliant uploads.
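A policy of the kind described in the correct option might look like the sketch below, built as a Python dict and serialized to JSON. The bucket name and KMS key ARN are hypothetical placeholders, and the statement is a minimal illustration rather than a complete production policy.

```python
import json

BUCKET = "example-data-lake-bucket"  # hypothetical bucket name
KMS_KEY_ARN = (
    "arn:aws:kms:us-east-1:111122223333:key/EXAMPLE-KEY-ID"  # hypothetical key
)

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DenyUploadsWithoutRequiredKmsKey",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:PutObject",
            "Resource": f"arn:aws:s3:::{BUCKET}/*",
            "Condition": {
                # Deny when the request's KMS key ID differs from the
                # required key.
                "StringNotEquals": {
                    "s3:x-amz-server-side-encryption-aws-kms-key-id": KMS_KEY_ARN
                }
            },
        }
    ],
}

policy_json = json.dumps(policy, indent=2)
```

The Deny effect with a negated condition is what makes the bucket reject non-compliant uploads, as opposed to default encryption, which applies a setting without rejecting requests.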

588

MCQhard

A data scientist is training a deep learning model for image classification. The model is overfitting on the training data. Which combination of techniques will most effectively reduce overfitting?

A.Add dropout layers and use data augmentation

B.Reduce the batch size

C.Train for more epochs without early stopping

D.Increase the number of layers and neurons

AnswerA

Dropout randomly drops units to prevent co-adaptation; data augmentation increases effective training set size, both reduce overfitting.

Why this answer

Dropout layers randomly deactivate a fraction of neurons during training, which forces the network to learn more robust features and prevents co-adaptation. Data augmentation artificially expands the training dataset by applying transformations (e.g., rotation, flipping, cropping), which reduces the model's ability to memorize spurious patterns and improves generalization. Together, these techniques directly counteract overfitting by increasing regularization and effective training diversity.

Exam trap

AWS often tests the misconception that increasing model complexity (more layers/neurons) or training longer will fix overfitting, when in reality these actions worsen it, and that simple hyperparameter changes like batch size reduction are not primary regularization techniques.

How to eliminate wrong answers

Option B is wrong because reducing the batch size introduces noisier gradient estimates, which can sometimes act as a mild regularizer but is not a primary or reliable technique to combat overfitting; it may even destabilize training. Option C is wrong because training for more epochs without early stopping will exacerbate overfitting, as the model will continue to memorize noise in the training data. Option D is wrong because increasing the number of layers and neurons increases model capacity, which makes overfitting worse by allowing the network to fit training data more precisely.
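The dropout mechanism described above can be sketched in a few lines. This is a simplified, framework-free version of "inverted" dropout on a flat list of activations; the function name and the seed parameter are assumptions for illustration, and real frameworks apply the same idea to tensors.

```python
import random

def dropout(activations, rate, training=True, seed=None):
    """Zero each activation with probability `rate` during training and
    scale the survivors by 1 / (1 - rate) so the expected value is unchanged.
    At inference time the activations pass through untouched."""
    if not training or rate == 0.0:
        return list(activations)
    rng = random.Random(seed)
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

acts = [1.0] * 10
train_out = dropout(acts, rate=0.5, seed=0)          # mix of 0.0 and 2.0
eval_out = dropout(acts, rate=0.5, training=False)   # unchanged
```

Because a different random subset of units is dropped on every training step, no single unit can rely on a specific partner being present, which is the co-adaptation the explanation refers to.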

589

MCQeasy

A data scientist is analyzing a dataset with missing values. Which technique is most appropriate for imputing missing values in a numerical feature that follows a normal distribution?

A.Mean imputation

B.Standard deviation imputation

C.Mode imputation

D.Median imputation

AnswerA

Mean imputation preserves the mean of the normal distribution.

Why this answer

Option A is correct: mean imputation suits normally distributed data because it preserves the mean. Option D (median) is valued mainly for its robustness to outliers and skew, which is not the concern here. Option C (mode) is for categorical data.

Option B is wrong because standard deviation is a measure of spread, not an imputation method.
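A small worked example shows why mean imputation leaves the feature's mean unchanged. The sketch below uses only the standard library; the helper name impute_mean and the sample values are hypothetical, with None standing in for a missing value.

```python
from statistics import mean

def impute_mean(values):
    """Replace None entries with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

heights = [170.0, None, 165.0, 175.0, None]
imputed = impute_mean(heights)
# The observed mean is 170.0, so both gaps are filled with 170.0.
```

The mean is preserved, but the variance shrinks because every filled value sits exactly at the centre, which is worth remembering when many values are missing.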

590

MCQmedium

A research lab is using SageMaker to train deep learning models on a custom dataset stored in S3. Each training job uses a single ml.p3.2xlarge instance. Recently, training jobs have been failing intermittently with 'NetworkError: Connection reset by peer' during the data download phase. The data scientist notices that the dataset is 50GB and the network throughput is low. The training script uses the default S3 download method (boto3) to copy data from S3 to the local instance storage. Which solution should the data scientist implement to resolve the issue?

A.Mount an EBS volume to the instance and copy data there before training.

B.Use SageMaker Pipe mode to stream data directly from S3.

C.Add retry logic in the training script to handle network errors.

D.Use a larger instance type like p3.8xlarge for better network bandwidth.

AnswerB

Pipe mode avoids large local file downloads and is more resilient.

Why this answer

Option B is correct because SageMaker Pipe mode streams data directly from S3 without writing it to local disk, which is more reliable and avoids large downloads. Option C is incorrect because simply retrying may not fix the underlying network issue. Option D is incorrect because using a larger instance does not guarantee improved network reliability.

Option A is incorrect because using EBS volumes adds cost and does not solve the network reset issue; the data still needs to be downloaded.

591

MCQhard

A data scientist is training a neural network for image classification. The dataset has 50,000 images across 100 classes. The model uses a ResNet-50 architecture pre-trained on ImageNet. The training loss decreases rapidly, but validation loss starts to increase after 5 epochs. Which of the following is the most effective technique to address this?

A.Increase the learning rate

B.Add more layers to the network

C.Use data augmentation to increase the diversity of the training set

D.Use a smaller batch size

AnswerC

Data augmentation artificially expands the training set, reducing overfitting and improving generalization.

Why this answer

The rapid decrease in training loss followed by an increase in validation loss after only 5 epochs is a classic sign of overfitting. Data augmentation artificially expands the training set by applying random transformations (e.g., rotations, flips, crops) to existing images, which improves the model's generalization and reduces overfitting. This is the most effective technique among the options because it directly addresses the lack of diverse training examples without changing the model architecture or training hyperparameters in a way that could destabilize learning.

Exam trap

The trap here is that candidates often confuse overfitting with underfitting or training instability, and incorrectly choose to increase learning rate or add layers, not recognizing that the validation loss rising while training loss falls is the textbook symptom of overfitting that requires regularization or more data.

How to eliminate wrong answers

Option A is wrong because increasing the learning rate would likely cause the optimizer to overshoot minima, making both training and validation loss unstable or diverge, which does not fix overfitting. Option B is wrong because adding more layers to an already deep ResNet-50 would increase model capacity and exacerbate overfitting, especially with a fixed dataset size. Option D is wrong because using a smaller batch size introduces more noise into gradient estimates, which can sometimes act as a regularizer but is less reliable and effective than data augmentation for addressing overfitting in image classification; it may also slow convergence.
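As a minimal illustration of data augmentation, the sketch below applies a random horizontal flip to an image stored as a nested list of pixel rows. The function name and the tiny 2x3 "image" are assumptions; in practice a framework's image transforms would be used on real tensors.

```python
import random

def random_horizontal_flip(image, p=0.5, rng=random):
    """Return a horizontally flipped copy of `image` with probability p,
    otherwise an unmodified copy. `image` is a list of pixel rows."""
    if rng.random() < p:
        return [row[::-1] for row in image]
    return [row[:] for row in image]

image = [[1, 2, 3],
         [4, 5, 6]]
flipped = random_horizontal_flip(image, p=1.0)    # always flips
unchanged = random_horizontal_flip(image, p=0.0)  # never flips
```

Applying such transforms on the fly means the network rarely sees exactly the same input twice, even though the underlying 50,000 images are fixed.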

592

MCQhard

A data scientist is working on a multi-class classification problem with 10 classes. The model outputs probabilities and the scientist wants to evaluate the model's ability to rank classes correctly. Which metric is most appropriate?

A.F1 score

B.Accuracy

C.Area Under the ROC Curve (AUC-ROC)

D.Log loss

AnswerC

AUC-ROC measures ranking ability for multi-class via one-vs-rest.

Why this answer

Option C is correct: the Area Under the Receiver Operating Characteristic Curve (AUC-ROC) measures how well the model's scores separate the classes, which is a ranking property, and for multi-class problems a one-vs-rest AUC can be used. Option D is wrong because log loss measures probability calibration, not ranking.

Option A is wrong because F1 score is computed from thresholded predictions (binary or per class) and is not a ranking metric. Option B is wrong because accuracy does not consider ranking.
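The one-vs-rest idea can be sketched without any libraries. The code below computes a pairwise-ranking AUC per class and averages the results (macro average); the function names and the toy three-class data are assumptions for illustration.

```python
def binary_auc(labels, scores):
    """Probability that a random positive is scored above a random negative
    (ties count as half)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg
    )
    return wins / (len(pos) * len(neg))

def ovr_macro_auc(y_true, probs, n_classes):
    """Treat each class in turn as 'positive' and average the binary AUCs."""
    aucs = []
    for c in range(n_classes):
        labels = [1 if y == c else 0 for y in y_true]
        scores = [row[c] for row in probs]
        aucs.append(binary_auc(labels, scores))
    return sum(aucs) / n_classes

y_true = [0, 1, 2, 0, 1, 2]
probs = [
    [0.7, 0.2, 0.1],
    [0.2, 0.6, 0.2],
    [0.1, 0.3, 0.6],
    [0.5, 0.3, 0.2],
    [0.3, 0.4, 0.3],
    [0.2, 0.2, 0.6],
]
macro_auc = ovr_macro_auc(y_true, probs, n_classes=3)
```

Only the ordering of the scores matters here: rescaling the probabilities without changing their order leaves the AUC unchanged, whereas log loss would change.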

593

MCQeasy

A machine learning team is using Amazon SageMaker to train models on a large dataset stored in Amazon S3. The dataset is 5 TB in size and is partitioned by date. The team wants to minimize data transfer costs and reduce training time by caching frequently accessed data locally on the training instances. The training instances are EC2 instances with attached Amazon EBS volumes. The team is considering using SageMaker Pipe mode to stream data directly from S3, but they are concerned about network bandwidth. Which approach should the team use to optimize data loading for training?

A.Use Amazon FSx for Lustre as a high-performance file system linked to the S3 bucket, and mount it on the training instances.

B.Use SageMaker File mode with Amazon EFS, which allows multiple training instances to share the same file system and caches data from S3.

C.Increase the size of the EBS volumes attached to the training instances and copy the entire dataset to the volumes before training.

D.Use SageMaker Pipe mode to stream data from S3 directly to the training algorithm, which automatically caches data in memory.

AnswerB

File mode with EFS enables caching and sharing, reducing repeated S3 downloads.

Why this answer

Option B is correct because Amazon SageMaker File mode with Amazon Elastic File System (EFS) provides a shared file system that can cache data across training jobs, reducing the need to repeatedly download from S3. Option A is incorrect because FSx for Lustre is optimized for high-performance computing but not specifically for SageMaker training. Option D is incorrect because SageMaker Pipe mode streams data but does not cache it.

Option C is incorrect because EBS volumes are attached per instance and cannot be shared across jobs for caching.

594

MCQhard

A team is building a regression model to predict house prices. The dataset includes a column 'zip_code' with 100 unique values. The data scientist one-hot encodes this column, resulting in 100 new binary columns. The model shows poor performance on a validation set. What is the most likely cause?

A.One-hot encoding introduced multicollinearity among the binary columns.

B.One-hot encoding reduced the number of features, causing underfitting.

C.The one-hot encoding introduced high variance, but the validation set has low variance.

D.The model suffers from the curse of dimensionality due to the large number of features.

AnswerD

With 100 additional sparse features, the model may overfit and not generalize well.

Why this answer

One-hot encoding 'zip_code' adds 100 sparse binary features, and each column is nonzero only for the rows in that zip code. With so many sparse dimensions relative to the examples available per category, the model can fit noise in the training data and generalize poorly to the validation set, which is the effect the answer attributes to the curse of dimensionality.

Option D is therefore the best of the four choices: the other options either misdescribe what one-hot encoding does or do not explain the poor validation performance.

Exam trap

AWS often tests the misconception that one-hot encoding always causes multicollinearity or the curse of dimensionality, when in reality the primary risk is overfitting due to sparse representation of high-cardinality categories.

How to eliminate wrong answers

Option A is wrong because one-hot encoding does not introduce multicollinearity; in fact, it creates orthogonal binary columns that are linearly independent when the intercept is dropped. Option B is wrong because one-hot encoding increases the number of features, not reduces them, so it cannot cause underfitting due to feature reduction. Option C is wrong because one-hot encoding can increase variance (overfitting) but the validation set having low variance is not a direct consequence; the issue is that the model may overfit the training data, not that the validation set has low variance.

595

MCQmedium

A data scientist is building a regression model to predict house prices. The dataset includes features such as square footage, number of bedrooms, and location. After training a linear regression model, the scientist notices that the residuals have a pattern: they increase as the predicted value increases. Which action is most appropriate?

A.Remove outliers from the dataset

B.Use Ridge regression instead of linear regression

C.Add polynomial features to the model

D.Apply a log transformation to the target variable

AnswerD

Log transformation can stabilize variance and reduce heteroscedasticity.

Why this answer

Residuals that grow with the predicted value indicate heteroscedasticity, which violates a linear regression assumption. Log-transforming the target variable can stabilize the variance. Adding polynomial features or interactions may help with non-linearity but does not specifically address heteroscedasticity.

Ridge regression addresses multicollinearity, not patterned residuals.
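A short numeric sketch shows why the log transform helps when errors are proportional to the price. It assumes, purely for illustration, that actual prices fall between 10% below and 10% above each predicted value; the numbers are hypothetical.

```python
import math

predicted = [100.0, 1000.0, 10000.0]

# Width of the error band on the original scale: grows with the prediction.
spread_raw = [(p * 1.1) - (p * 0.9) for p in predicted]

# Width of the same band on the log scale: log(1.1 / 0.9) for every price.
spread_log = [math.log(p * 1.1) - math.log(p * 0.9) for p in predicted]
```

On the raw scale the band widens from about 20 to about 2000, while on the log scale it stays at roughly 0.2 for every price, which is the variance-stabilizing effect the explanation describes.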

596

MCQhard

A company uses Amazon SageMaker to train a model for fraud detection. The dataset has 1 million transactions, with 0.1% fraud. The data scientist trains a random forest model and achieves 99.9% accuracy but 0% recall on the fraud class. Which technique is most likely to improve recall without significantly reducing precision?

A.Tune the classification threshold

B.Use cost-sensitive learning with a high cost for fraud misclassification

C.Apply SMOTE to generate synthetic fraud samples

D.Undersample the majority class

AnswerC

SMOTE creates synthetic instances of the minority class, balancing the dataset and improving recall while maintaining precision.

Why this answer

Option C is correct because SMOTE (Synthetic Minority Oversampling Technique) generates synthetic fraud samples by interpolating between existing minority class instances, which directly addresses the extreme class imbalance (0.1% fraud). This increases the representation of the fraud class in the training data, allowing the random forest model to learn decision boundaries that capture fraud patterns, thereby improving recall without introducing the noise or information loss associated with other methods.

Exam trap

The trap here is that candidates often assume cost-sensitive learning (Option B) is the best approach for imbalanced data, but in extreme imbalance with 0% recall, oversampling techniques like SMOTE are more effective because they directly increase the minority class representation rather than just adjusting penalties.

How to eliminate wrong answers

Option A is wrong because tuning the classification threshold can trade off precision and recall, but with 0% recall, the model is already predicting all instances as non-fraud; lowering the threshold may increase recall but will likely cause a drastic drop in precision due to the overwhelming majority class. Option B is wrong because cost-sensitive learning assigns a higher penalty to fraud misclassification during training, which can improve recall, but it does not directly address the lack of fraud examples in the dataset and may still result in poor precision if the model cannot learn from sparse data. Option D is wrong because undersampling the majority class reduces the dataset size and discards potentially useful information, which can lead to loss of decision boundary details and decreased model performance, often harming precision more than helping recall.
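The precision and recall trade-off from moving the classification threshold can be seen on a toy example. The labels, scores, and helper name below are hypothetical and only illustrate the mechanics, not the behavior of any particular fraud model.

```python
def precision_recall(y_true, scores, threshold):
    """Precision and recall for the positive class at a given threshold."""
    pred = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for t, p in zip(y_true, pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Two fraud cases (label 1) whose scores never reach the default 0.5.
y_true = [1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.45, 0.30, 0.40, 0.20, 0.10, 0.05, 0.35, 0.15]

at_default = precision_recall(y_true, scores, threshold=0.5)  # (0.0, 0.0)
at_lowered = precision_recall(y_true, scores, threshold=0.3)  # (0.5, 1.0)
```

Lowering the threshold recovers both fraud cases but also flags two legitimate transactions, so recall rises to 1.0 while precision is only 0.5.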

597

MCQeasy

An ML engineer needs to run a hyperparameter tuning job on Amazon SageMaker. The training algorithm supports distributed training across multiple GPUs. The engineer wants to minimize the total time to find the best hyperparameters. Which strategy should be used?

A.Use random search to explore a wide range.

B.Use grid search to cover all combinations.

C.Use Hyperband which is designed for distributed training.

D.Use Bayesian optimization as the tuning strategy.

AnswerD

Bayesian optimization adaptively selects hyperparameters, reducing total tuning time.

Why this answer

Bayesian optimization uses the results of past training jobs to select the next hyperparameters, so it typically converges in fewer jobs than random or grid search. Random search (A) does not use past results. Grid search (B) is exhaustive and slow.

Hyperband (C) is a bandit method that relies on early stopping; Bayesian optimization is better for minimizing total time with a fixed budget.

598

MCQhard

A data engineer is performing EDA on a dataset with 1 million rows and 200 columns. The dataset is stored in S3 as CSV files. The engineer notices that some columns have a high proportion of zeros. What is the best approach to determine if these zeros represent missing data or actual zero values?

A.Check correlation of zero columns with other features; if low, assume zeros are missing.

B.Calculate the percentage of zeros and compare with other columns; if unusually high, treat as missing.

C.Use AWS Glue Data Catalog to view column statistics and infer missing values.

D.Consult the data source documentation or domain experts to understand the meaning of zero values.

AnswerD

Domain knowledge is crucial for accurate interpretation of data.

Why this answer

Option D is correct because domain knowledge and documentation are the most reliable ways to understand the meaning of zeros. Option A is wrong because statistical methods cannot distinguish missing values from actual zeros without context. Option C is wrong because the Glue Data Catalog metadata may not have this detail.

Option B is wrong because comparing the share of zeros with other columns might be misleading.

599

MCQhard

Refer to the exhibit. A SageMaker training job is launched with the CLI command shown. The job fails with an error 'S3 data distribution type not supported for File mode'. What is the most likely fix?

A.Change TrainingInputMode to Pipe

B.Change InstanceCount to 2

C.Increase VolumeSizeInGB to 100

D.Increase MaxRuntimeInSeconds to 7200

AnswerA

FullyReplicated is only supported in Pipe mode.

Why this answer

Option A is correct because FullyReplicated data distribution is only supported in Pipe mode. Option C is wrong because increasing the volume size does not affect data distribution. Option B is wrong because changing the instance count does not fix the mode mismatch.

Option D is wrong because MaxRuntimeInSeconds is not related to the error.

600

MCQhard

A team is using SageMaker to train a deep learning model for image classification. The training job is failing with a 'CUDA out of memory' error. The team is using a p3.2xlarge instance (1 GPU, 16 GB GPU memory). The dataset consists of 256x256 RGB images. Which action is MOST likely to resolve the error without changing the instance type?

A.Increase the batch size to utilize GPU more efficiently

B.Enable automatic model tuning to optimize hyperparameters

C.Use Spot Instances to reduce cost

D.Reduce the batch size

AnswerD

Smaller batch size reduces memory footprint per iteration, resolving OOM errors.

Why this answer

The 'CUDA out of memory' error indicates that the GPU memory is exhausted. Reducing the batch size directly decreases the memory footprint per training step, allowing the model to fit within the 16 GB GPU memory of the p3.2xlarge instance. This is the most direct and effective fix without changing the instance type.

Exam trap

The trap here is that candidates may confuse 'CUDA out of memory' with a performance issue and incorrectly choose to increase batch size for efficiency, when in fact the error is a hard memory limit that requires reducing memory usage.

How to eliminate wrong answers

Option A is wrong because increasing the batch size would increase GPU memory consumption, worsening the out-of-memory error. Option B is wrong because automatic model tuning (hyperparameter optimization) does not directly address GPU memory limits; it may even suggest larger batch sizes that exacerbate the issue. Option C is wrong because Spot Instances reduce cost but do not affect GPU memory capacity; the error would persist regardless of instance pricing model.
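A back-of-the-envelope calculation shows how memory scales with batch size. The sketch below only counts the float32 input tensor for 256x256 RGB images; activations, gradients, and optimizer state use far more memory in a real model, but they also grow roughly in proportion to the batch size. The function name and the batch sizes are assumptions.

```python
def input_batch_mib(batch_size, height=256, width=256, channels=3,
                    bytes_per_value=4):
    """Memory in MiB for one batch of float32 images (inputs only)."""
    total_bytes = batch_size * height * width * channels * bytes_per_value
    return total_bytes / 2**20

mib_at_64 = input_batch_mib(64)  # 48.0 MiB
mib_at_32 = input_batch_mib(32)  # 24.0 MiB
```

Halving the batch size halves this footprint, and the same linear relationship is what lets a smaller batch fit within the 16 GB of GPU memory.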
