AWS Certified Machine Learning Specialty MLS-C01 Questions 1576–1650 | Page 22/24

1576

MCQeasy

An engineer sees the error in the exhibit when trying to deploy a model from a model registry in SageMaker. What is the MOST likely cause?

A.The IAM role lacks permission to access the model registry

B.The model package version does not exist in the registry

C.The model package is still in 'Approved' status

D.The SageMaker endpoint is already deployed with the same model

AnswerB

The ARN includes a version number; the error says 'Could not find'.

Why this answer

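Option B is correct because the error indicates that the model package ARN, which includes a version number, does not exist in the registry. Option A is wrong because missing IAM permissions would produce a different error. Option C is wrong because nothing in the error points to approval status.

Option D is wrong because an existing endpoint would produce a different error.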

1577

MCQmedium

An organization stores streaming data in Amazon Kinesis Data Streams. A data analyst wants to perform real-time exploratory data analysis on the incoming data to detect anomalies. Which AWS service should the analyst use to run SQL queries on the streaming data?

A.Amazon Kinesis Data Analytics

B.Amazon SageMaker

C.AWS Glue

D.Amazon Athena

AnswerA

Kinesis Data Analytics supports SQL queries on streaming data for real-time analysis.

Why this answer

Option A is correct because Amazon Kinesis Data Analytics runs SQL queries on streaming data in real time. Option D is wrong because Athena is for batch queries on S3. Option C is wrong because Glue is for ETL, not real-time SQL.

Option B is wrong because SageMaker is for ML model training, not streaming SQL.

1578

Multi-Selectmedium

A company is using Amazon SageMaker to run a hyperparameter tuning job. The tuning job uses Bayesian optimization. Which THREE statements about Bayesian optimization are correct? (Choose THREE.)

Select 3 answers

A.It can only handle a maximum of 5 hyperparameters

B.It works well for continuous hyperparameters

C.It selects hyperparameter combinations based on previous trial results

D.It often finds optimal hyperparameters in fewer trials than random search

E.It requires more trials than grid search to find optimal values

AnswersB, C, D

Bayesian optimization handles continuous parameters naturally.

Why this answer

Options B, C, and D are correct. Bayesian optimization is suitable for continuous hyperparameters (B), it uses past trial results to choose the next hyperparameter combination (C), and it often finds good values in fewer trials than random search (D). Option E is false because Bayesian optimization typically requires fewer trials than grid search.

Option A is false because it can handle many hyperparameters, though it may need more trials as the number grows.
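
The comparison with grid search comes down to trial counts. As a rough illustration (the grid sizes below are made-up values, not taken from any SageMaker job), an exhaustive grid needs one trial per combination, so the count is the product of the per-hyperparameter grid sizes:

```python
from math import prod


def grid_trial_count(grid_sizes):
    """Number of trials an exhaustive grid search must run."""
    return prod(grid_sizes)


# Hypothetical grids: 5 candidate values for each of 4 hyperparameters.
print(grid_trial_count([5, 5, 5, 5]))  # 625
# Tuning two more hyperparameters multiplies the cost again.
print(grid_trial_count([5] * 6))  # 15625
```

A Bayesian or random search runs whatever fixed trial budget it is given, regardless of how many hyperparameters are tuned, which is why it can stop far short of these counts.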

1579

MCQhard

A company uses Amazon SageMaker to train a model. The training job uses a custom Docker container. The job fails with the error 'CannotStartContainerError: API error (500).' Which of the following is the most likely cause?

A.The Docker image is built for a different CPU architecture.

B.The training script has a syntax error.

C.The S3 input data is missing.

D.The output path is not writable.

AnswerA

Incompatible architecture prevents container from running.

Why this answer

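Option A is correct because an image built for an incompatible CPU architecture can prevent the container from starting. Option C is wrong because missing S3 input data would cause a different error. Option B is wrong because a syntax error in the training script would only surface after the container starts, whereas this error occurs during start.

Option D is wrong because the error mentions the container, not the file system.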

1580

MCQmedium

A company has a time series dataset of daily sales for the past 5 years. They want to forecast sales for the next 30 days. The data shows weekly seasonality and a slight upward trend. Which Amazon SageMaker algorithm is most appropriate for this task?

A.DeepAR

B.Linear Learner

C.XGBoost

D.K-Means

AnswerA

DeepAR is a built-in SageMaker algorithm for time series forecasting that handles seasonality and trends.

Why this answer

DeepAR is purpose-built for time series forecasting with seasonal patterns and trends. It uses a recurrent neural network (RNN) to model the conditional distribution of future values given past observations, and it natively handles multiple time series, missing data, and known seasonal periods (e.g., weekly). The weekly seasonality and upward trend in the daily sales data are exactly the kind of patterns DeepAR is designed to capture.

Exam trap

The trap here is that candidates often pick XGBoost (Option C) because it is a powerful tree-based model, but they overlook that it lacks native time series capabilities and requires manual feature engineering to capture seasonality and trend, whereas DeepAR is the only option specifically designed for this forecasting task.

How to eliminate wrong answers

Option B (Linear Learner) is wrong because it is a general-purpose linear regression or classification algorithm that cannot model seasonality or temporal dependencies without extensive manual feature engineering (e.g., lag variables, Fourier terms). Option C (XGBoost) is wrong because while it can be used for time series via feature engineering, it is not a dedicated forecasting algorithm and does not natively handle temporal order, autocorrelation, or seasonality; it treats each prediction as an independent regression task. Option D (K-Means) is wrong because it is an unsupervised clustering algorithm that groups data points by similarity and has no mechanism for forecasting future values in a time series.
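
For reference, DeepAR's JSON Lines training input puts each time series on its own line with a `start` timestamp and a `target` array. The sketch below builds one such line from synthetic daily sales; the numbers, the start date, and the `context_length` and `epochs` values are illustrative assumptions, not taken from the question:

```python
import json

# Synthetic daily sales: a slight upward trend plus a bump on two days of
# every 7-day cycle. Only 28 points are generated here for brevity.
target = [100 + day + (20 if day % 7 in (5, 6) else 0) for day in range(28)]

# One JSON object per line, one line per time series.
line = json.dumps({"start": "2020-01-01 00:00:00", "target": target})

# Hyperparameters matching the scenario: daily frequency, 30-day horizon.
hyperparameters = {
    "time_freq": "D",
    "prediction_length": "30",
    "context_length": "30",  # assumed value; tune for the dataset
    "epochs": "100",         # assumed value
}

print(line[:60])
```

No lag or calendar columns are built by hand here, which is the contrast with the XGBoost and Linear Learner options.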

1581

MCQmedium

A company wants to deploy a machine learning model that performs real-time inference with sub-second latency. The model is a deep neural network with 500 MB of weights. The inference endpoint must scale to zero when not in use to minimize cost. Which AWS service should the company use?

A.Deploy the model as an AWS Lambda function with provisioned concurrency.

B.Use Amazon SageMaker Serverless Inference to host the model.

C.Host the model on Amazon ECS with Fargate and use a target tracking scaling policy.

D.Create an Amazon SageMaker real-time endpoint with automatic scaling policies.

AnswerB

SageMaker Serverless Inference automatically scales to zero when idle, reducing costs, and can handle sub-second latency for suitable workloads. It also supports large model sizes.

Why this answer

Amazon SageMaker Serverless Inference is designed for workloads with intermittent traffic patterns, automatically scaling to zero when idle and scaling up for real-time requests. It supports models up to 1 GB in size and provides sub-second latency for inference, making it ideal for this 500 MB deep neural network. This service eliminates the need to manage underlying infrastructure while meeting the latency and cost requirements.

Exam trap

The trap here is that candidates often confuse SageMaker Serverless Inference with SageMaker real-time endpoints, assuming automatic scaling can reduce costs to zero, but real-time endpoints always require a minimum instance count, whereas Serverless Inference truly scales to zero.

How to eliminate wrong answers

Option A is wrong because AWS Lambda .zip deployment packages are limited to 250 MB unzipped (including layers), which is too small for a 500 MB model, and provisioned concurrency keeps execution environments initialized and billed, so the function does not scale to zero. Option C is wrong because Amazon ECS with Fargate does not natively scale to zero; it requires at least one running task to handle requests, and target tracking scaling policies maintain a baseline capacity, incurring costs even when idle. Option D is wrong because Amazon SageMaker real-time endpoints with automatic scaling policies cannot scale to zero; they maintain a minimum number of instances to ensure availability, leading to ongoing costs when not in use.

1582

MCQmedium

A machine learning team is deploying a model for real-time fraud detection. The model must make predictions with less than 100ms latency. The team uses SageMaker and the model is a large ensemble of decision trees. Which SageMaker hosting option is MOST suitable?

A.SageMaker Multi-model endpoint

B.SageMaker Serverless Inference

C.SageMaker Elastic Inference

D.SageMaker Batch Transform

AnswerA

Supports multiple models on one instance with low latency.

Why this answer

A SageMaker Multi-model endpoint is the most suitable option because it allows hosting multiple models (including large ensembles) on a single endpoint while sharing underlying compute resources, which reduces cost and latency. The endpoint dynamically loads and caches models based on inference requests, enabling sub-100ms predictions for large ensemble models by keeping frequently used models in memory.

Exam trap

The trap here is that candidates might choose SageMaker Serverless Inference (Option B) thinking it automatically handles scaling for real-time workloads, but they overlook the cold start latency penalty that makes it unsuitable for sub-100ms latency requirements.

How to eliminate wrong answers

Option B (SageMaker Serverless Inference) is wrong because it has a cold start latency that can exceed 100ms, making it unsuitable for real-time fraud detection requiring consistent sub-100ms responses. Option C (SageMaker Elastic Inference) is wrong because it accelerates deep learning models by attaching GPU accelerators, but it does not benefit decision tree ensembles which are CPU-bound and do not leverage GPU acceleration. Option D (SageMaker Batch Transform) is wrong because it is designed for offline, asynchronous batch predictions on large datasets, not for real-time inference with low latency requirements.

1583

MCQmedium

A data scientist is analyzing a dataset with missing values. The missing data is not random and is correlated with other features. Which imputation method is most appropriate to minimize bias?

A.Last observation carried forward

B.Multiple imputation using MICE

C.Listwise deletion

D.Mean imputation

AnswerB

Correct: MICE models missing values using other features, suitable for non-random missingness.

Why this answer

Option B is correct because Multiple Imputation by Chained Equations (MICE) accounts for relationships between features and preserves variability. Option D is wrong because mean imputation can bias estimates when data is not missing completely at random. Option C is wrong because dropping rows reduces sample size and may introduce bias.

Option A is wrong because last observation carried forward is intended for time series.

1584

MCQeasy

A data engineer needs to schedule an AWS Glue ETL job to run every hour. Which service should be used to trigger the job?

A.AWS Lambda

B.AWS Step Functions

C.Amazon CloudWatch Events (EventBridge)

D.AWS Data Pipeline

AnswerC

EventBridge can schedule cron jobs to trigger Glue.

Why this answer

Option C is correct because Amazon CloudWatch Events (EventBridge) can trigger Glue jobs on a schedule. Option A is wrong because Lambda can trigger Glue but is not a scheduler. Option D is wrong because Data Pipeline is for complex workflows. Option B is wrong because Step Functions is for state machines, not simple scheduling.

1585

MCQhard

You are a data engineer at a fintech company. The company processes real-time stock market data from multiple exchanges. The data is ingested via Amazon Kinesis Data Streams with 50 shards. Each record is about 1 KB, and the ingestion rate is 5,000 records per second. The data is consumed by a Java application running on Amazon ECS that performs real-time analytics and stores results in Amazon DynamoDB. Recently, the application has been experiencing high latency, and some records are stuck in the shards for minutes before being consumed. The CloudWatch metrics show that the application's CPU utilization is low, but the iterator age is increasing. The application uses the Kinesis Client Library (KCL) with a single worker. What is the most likely cause and how should it be fixed?

A.Increase the number of shards to 200 to provide more throughput.

B.Increase the CPU capacity of the ECS task by moving to a larger instance type.

C.Move the destination from DynamoDB to Amazon RDS to reduce write latency.

D.Scale the number of KCL workers to match the number of shards (e.g., 50 workers) to process shards in parallel.

AnswerD

A single worker can only process one shard at a time; with 50 shards, records in other shards wait. Multiple workers can process shards concurrently, reducing latency.

Why this answer

Option D is correct because a single KCL worker can only process one shard at a time; with 50 shards, records sit idle. Increasing the number of workers (e.g., to 50) allows parallel processing of all shards, reducing iterator age. Option A is wrong because the ingestion rate is well within the 50 shards' capacity (50 MB/s write vs ~5 MB/s actual).

Option B is wrong because CPU is low, not high. Option C is wrong because DynamoDB is not part of the ingestion path.
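
The capacity comparison can be checked with a few lines of arithmetic. This sketch uses the documented per-shard write limits for Kinesis Data Streams (1 MB/s and 1,000 records/s) and the figures from the question; the variable names are just for illustration:

```python
KB = 1024
MB = 1024 * KB

# Per-shard write limits for Kinesis Data Streams.
SHARD_WRITE_BYTES_PER_SEC = 1 * MB
SHARD_WRITE_RECORDS_PER_SEC = 1000

# Figures from the scenario.
shards = 50
records_per_sec = 5000
record_size_bytes = 1 * KB

ingest_bytes_per_sec = records_per_sec * record_size_bytes
byte_utilization = ingest_bytes_per_sec / (shards * SHARD_WRITE_BYTES_PER_SEC)
record_utilization = records_per_sec / (shards * SHARD_WRITE_RECORDS_PER_SEC)

print(f"write bandwidth used: {byte_utilization:.1%}")
print(f"record rate used: {record_utilization:.1%}")
```

Both figures come out at roughly 10% of the stream's write capacity, so adding shards would not address the growing iterator age.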

1586

MCQhard

An ML engineer is deploying a model on a SageMaker endpoint and wants to ensure that only authorized users and services can invoke the endpoint. The company uses AWS IAM for access control and requires that the endpoint be invoked only from within a specific VPC. What combination of actions should the engineer take? (Choose the single best answer.)

A.Use AWS CloudFront to restrict access based on IP addresses.

B.Use API Gateway in front of the SageMaker endpoint and attach a resource policy to API Gateway.

C.Create a VPC endpoint for Amazon SageMaker and attach a policy that only allows invocation from the VPC. Use IAM roles to restrict which users can invoke the endpoint.

D.Configure network ACLs on the VPC subnet to allow only the endpoint's security group.

AnswerC

VPC endpoint with policy ensures only traffic from the VPC can reach SageMaker API, and IAM controls user permissions.

Why this answer

SageMaker endpoints do not support resource-based policies directly, so access is restricted with a VPC endpoint policy combined with IAM. Option C is correct: create a VPC endpoint for SageMaker and attach a policy that restricts invocation to the VPC, then use IAM roles that allow only specific principals to invoke the endpoint.

Option D (network ACLs) is not sufficient for authentication. Option B (API Gateway) adds unnecessary complexity. Option A (CloudFront) is a CDN, not an access control mechanism.
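
One common way to express the "only from this VPC" rule on the IAM side is an explicit deny keyed on the `aws:SourceVpc` condition. The sketch below builds such a policy document as a Python dictionary; the VPC ID, account ID, and endpoint name are hypothetical placeholders, and the policy is one possible shape rather than the only valid one:

```python
import json

# Hypothetical identifiers used only for illustration.
VPC_ID = "vpc-0123456789abcdef0"
ENDPOINT_ARN = "arn:aws:sagemaker:us-east-1:111122223333:endpoint/fraud-model"

# Explicit deny for any InvokeEndpoint call that did not arrive from the VPC.
# aws:SourceVpc is only present on requests made through a VPC endpoint.
deny_outside_vpc = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DenyInvokeOutsideVpc",
            "Effect": "Deny",
            "Action": "sagemaker:InvokeEndpoint",
            "Resource": ENDPOINT_ARN,
            "Condition": {"StringNotEquals": {"aws:SourceVpc": VPC_ID}},
        }
    ],
}

print(json.dumps(deny_outside_vpc, indent=2))
```

Because an explicit deny overrides any allow, this statement would sit alongside the allow statements that grant `sagemaker:InvokeEndpoint` to the approved roles.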

1587

MCQmedium

A data scientist is analyzing a dataset with missing values in a numeric column. The missing rate is 30% and the data is not missing completely at random. Which imputation method should the data scientist avoid to minimize bias?

A.Mean imputation

B.Model-based imputation using linear regression

C.k-Nearest Neighbors imputation

D.Multiple imputation using chained equations

AnswerA

Mean imputation can introduce bias and reduce variance, especially when data is not missing completely at random.

Why this answer

Option A is the method to avoid because mean imputation introduces bias when data is not missing completely at random: it reduces variance and distorts relationships between features. Option D (multiple imputation) and Option B (model-based imputation) are appropriate for non-random missing data. Option C (k-NN imputation) can also be used and is generally less biased than mean imputation.
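
The effect of mean imputation on a column that is not missing at random can be seen on a toy example. The values below are synthetic and chosen so that the three largest entries (30% of rows) are the ones missing:

```python
from statistics import mean, pvariance

# Synthetic column; values are illustrative only.
true_values = [10, 12, 14, 16, 18, 20, 30, 40, 50, 60]

# Not missing at random: the three largest values (30% of rows) are missing.
observed = [v for v in true_values if v < 40]

# Mean imputation fills every gap with the mean of the observed values.
fill = mean(observed)
imputed = observed + [fill] * (len(true_values) - len(observed))

print(f"true mean={mean(true_values):.1f}, imputed mean={mean(imputed):.1f}")
print(f"true variance={pvariance(true_values):.1f}, "
      f"imputed variance={pvariance(imputed):.1f}")
```

The imputed column has both a lower mean and a much smaller variance than the true column, which is the bias and variance shrinkage the explanation refers to.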

1588

MCQeasy

During exploratory data analysis, a data scientist notices that the Pearson correlation coefficient between two continuous variables is 0.85. What does this indicate?

A.A causal relationship between the two variables

B.A strong positive linear relationship

C.A weak negative linear relationship

D.No relationship between the variables

AnswerB

Values close to 1 indicate a strong positive linear relationship.

Why this answer

Option B is correct because a correlation of 0.85 indicates a strong positive linear relationship. Option A is wrong because correlation does not imply causation. Option C is wrong because 0.85 is positive and not weak.

Option D is wrong because 0.85 is far from 0.
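
For readers who want to see where such a coefficient comes from, here is a minimal pure-Python computation of Pearson's r. The data is synthetic and chosen only to give a strong positive value:

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Synthetic data: y rises with x, with some scatter.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2, 1, 4, 3, 7, 5, 8, 6]
print(round(pearson_r(x, y), 2))
```

The result is close to the 0.85 in the question: the points trend upward together without lying exactly on a line, and nothing in the calculation says which variable, if either, drives the other.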

1589

MCQmedium

A company uses Amazon SageMaker to train a linear regression model. After training, the model shows high bias on the training set. Which action is MOST likely to reduce bias?

A.Add more features

B.Collect more training data

C.Apply L2 regularization

D.Deploy the model to a larger instance

AnswerA

More features can capture patterns better.

Why this answer

High bias indicates that the model is underfitting the training data, meaning it is too simple to capture the underlying patterns. Adding more features increases the model's capacity to learn complex relationships, directly addressing underfitting by reducing bias. In SageMaker, this can be done by engineering additional input columns or using feature transformations before training.

Exam trap

The trap here is that candidates confuse high bias with high variance and incorrectly choose regularization or more data, which are solutions for overfitting, not underfitting.

How to eliminate wrong answers

Option B is wrong because collecting more training data does not reduce bias; it primarily helps with high variance (overfitting) by providing more examples to generalize from. Option C is wrong because L2 regularization (ridge regression) penalizes large coefficients, which increases bias to reduce variance, making bias worse in an already underfit model. Option D is wrong because deploying the model to a larger instance affects inference performance (latency/throughput) but does not change the model's learned parameters or its bias-variance tradeoff.
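
A toy example makes the underfitting point concrete. In this sketch the data is synthetic and purely quadratic, and fitting on the engineered column alone stands in for adding it to the model; `fit_line` and `mse` are helper names introduced here for illustration:

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept (one feature)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x


def mse(xs, ys, slope, intercept):
    """Mean squared error of the fitted line on the given data."""
    return sum((slope * x + intercept - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Synthetic data with a purely quadratic relationship.
x = [-3, -2, -1, 0, 1, 2, 3]
y = [v ** 2 for v in x]

# Model 1: raw feature only. The best straight line is flat, so it underfits.
mse_raw = mse(x, y, *fit_line(x, y))

# Model 2: the engineered feature x**2 gives the model what it needs.
x_sq = [v ** 2 for v in x]
mse_engineered = mse(x_sq, y, *fit_line(x_sq, y))

print(mse_raw, mse_engineered)
```

The training error drops from 12 to 0 once the extra feature is available, while collecting more rows of the same `x` would leave the straight-line fit just as poor.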

1590

Multi-Selectmedium

A data scientist is exploring a dataset with 50 features and a binary target. The data scientist computes the correlation matrix and finds that two features, X1 and X2, have a correlation coefficient of 0.95. Which TWO actions should the data scientist consider? (Choose 2.)

Select 2 answers

A.Apply a log transformation to X1 and X2.

B.Remove one of the highly correlated features from the dataset.

C.Apply Principal Component Analysis (PCA) to the feature set.

D.Create an interaction term between X1 and X2.

E.Impute missing values for X1 and X2.

AnswersB, C

Removing one feature reduces multicollinearity.

Why this answer

Option B is correct because high correlation indicates multicollinearity; removing one feature reduces redundancy. Option C is correct because PCA can create uncorrelated components. Option D is wrong because adding interaction terms would increase multicollinearity.

Option E is wrong because correlation does not imply missing values. Option A is wrong because log transformation does not address correlation between features.

1591

Multi-Selecthard

Which THREE techniques can help reduce overfitting in a neural network? (Select THREE.)

Select 3 answers

A.Increase training epochs

B.Dropout

C.Early stopping

D.Increase the number of layers

E.L2 regularization

AnswersB, C, E

Dropout randomly drops units.

Why this answer

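Dropout (B) is correct because it randomly deactivates a fraction of neurons during training, forcing the network to learn redundant representations and preventing co-adaptation of features. This reduces overfitting by acting as an ensemble method without increasing computational cost at inference time. Early stopping (C) halts training once validation performance stops improving, and L2 regularization (E) penalizes large weights; both also limit overfitting.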
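
The dropout mechanism can be sketched in a few lines. This is the common "inverted dropout" formulation, written in plain Python for illustration rather than taken from any particular framework; the function and parameter names are assumptions:

```python
import random


def dropout(activations, rate, rng, training=True):
    """Inverted dropout: zero each unit with probability `rate` while training
    and rescale the survivors so the expected activation is unchanged."""
    if not training or rate == 0.0:
        return list(activations)  # inference: no masking, no extra work
    keep_prob = 1.0 - rate
    return [a / keep_prob if rng.random() < keep_prob else 0.0
            for a in activations]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
layer_output = [1.0] * 10
print(dropout(layer_output, rate=0.5, rng=rng))
print(dropout(layer_output, rate=0.5, rng=rng, training=False))
```

Rescaling during training is what lets inference skip the mask entirely, which is why dropout adds no cost at prediction time.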

Exam trap

AWS often tests the misconception that adding more capacity (layers/epochs) always improves performance, when in fact it increases overfitting without proper regularization.

1592

MCQmedium

A data science team needs to process streaming data from thousands of IoT devices and perform real-time anomaly detection. The data must be persisted in Amazon S3 for batch processing later. Which combination of AWS services should be used to meet these requirements?

A.Amazon Kinesis Data Streams for ingestion, Amazon Kinesis Data Analytics for anomaly detection, and Amazon Kinesis Data Firehose to deliver data to Amazon S3.

B.Amazon Kinesis Data Streams for ingestion, AWS Glue for anomaly detection, and Amazon S3 for storage.

C.AWS Lambda for both ingestion and anomaly detection, and Amazon S3 for storage.

D.Amazon Simple Queue Service (SQS) for ingestion, AWS Lambda for anomaly detection, and Amazon S3 for storage.

AnswerA

This combination provides real-time ingestion, analytics, and durable storage.

Why this answer

Option A is correct because Kinesis Data Streams ingests streaming data, Kinesis Data Analytics performs real-time anomaly detection, and Firehose delivers data to S3 for batch processing. Option D is wrong because SQS is not optimized for streaming and does not have built-in analytics. Option C is wrong because Lambda alone cannot handle high-throughput streaming ingestion.

Option B is wrong because Glue is a batch ETL service, not real-time.

1593

MCQmedium

A data scientist is working with a dataset containing 10,000 observations and 100 features. The scientist wants to detect outliers in the dataset. Which method is most appropriate for outlier detection in a high-dimensional space?

A.Use Z-score to identify points beyond 3 standard deviations

B.Use Isolation Forest

C.Use Mahalanobis distance

D.Use interquartile range (IQR) for each feature

AnswerB

Isolation Forest is designed for high-dimensional data and does not assume distribution.

Why this answer

Option B is correct because Isolation Forest is effective for high-dimensional data and uses tree-based isolation. Option A is wrong because Z-score assumes normality and is univariate. Option D is wrong because IQR is univariate and not suitable for high dimensions.

Option C is wrong because Mahalanobis distance assumes multivariate normality and is sensitive to dimensionality.

1594

MCQmedium

A machine learning team needs to process a large dataset stored in Amazon S3 using Apache Spark. They want to minimize cost and avoid managing infrastructure. Which AWS service should they use?

A.AWS Glue

B.Amazon Athena

C.Amazon EMR

D.Amazon SageMaker

AnswerA

Glue provides serverless Spark for ETL on S3 data.

Why this answer

AWS Glue is a serverless Spark environment that can process data in S3 without provisioning clusters. EMR requires cluster management. Athena is SQL-only.

SageMaker is for training models, not general-purpose Spark.

1595

Matchingmedium

Match each AWS service to its primary purpose in a machine learning pipeline.

Concepts

Matches

Build, train, and deploy ML models

ETL and data cataloging

Object storage for datasets and models

Serverless compute for preprocessing

Image and video analysis

Why these pairings

These services are commonly used in ML workflows on AWS.

1596

MCQeasy

A company is building a data pipeline to process streaming data from IoT devices. The data must be ingested with low latency, transformed in real-time using custom logic, and stored in Amazon S3 partitioned by device ID and timestamp. Which combination of AWS services should the company use to meet these requirements?

A.Amazon Kinesis Data Firehose with direct S3 delivery

B.Amazon Managed Streaming for Apache Kafka (MSK) with Amazon S3 sink connector

C.Amazon DynamoDB Streams with AWS Lambda and Amazon S3

D.Amazon Kinesis Data Streams with AWS Lambda and Amazon S3

AnswerD

Kinesis Data Streams for ingestion, Lambda for real-time transformation, and S3 for storage with partitioning.

Why this answer

Option D is correct because Amazon Kinesis Data Streams provides low-latency ingestion, AWS Lambda can apply custom transformation logic in real time, and the results can be written to Amazon S3 under keys partitioned by device ID and timestamp. Option A is wrong because Kinesis Data Firehose with direct S3 delivery applies no custom transformation logic unless a Lambda function is added. Option B is wrong because Amazon MSK (Managed Streaming for Apache Kafka) is more complex than needed and not as tightly integrated.

Option C is wrong because Amazon DynamoDB Streams captures changes to DynamoDB tables and is not designed for ingesting this volume of streaming data.
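
As a rough sketch of the Lambda step, the code below decodes records from a Kinesis-triggered event and derives a partitioned S3 key for each. The key layout and the payload fields `device_id` and `timestamp` are assumptions made for illustration, and the actual S3 write is left out so the example runs locally:

```python
import base64
import json
from datetime import datetime, timezone


def s3_key_for(device_id, epoch_seconds):
    """Build a partitioned S3 key (the layout is an assumed convention)."""
    ts = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    return (f"device_id={device_id}/"
            f"year={ts:%Y}/month={ts:%m}/day={ts:%d}/hour={ts:%H}/"
            f"{epoch_seconds}.json")


def transform(event):
    """Decode Kinesis records from a Lambda event and pair each with its key.

    Payload fields `device_id` and `timestamp` are hypothetical.
    """
    results = []
    for record in event["Records"]:
        # Kinesis delivers each record's data base64-encoded.
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        key = s3_key_for(payload["device_id"], payload["timestamp"])
        results.append((key, payload))  # a real handler would write these to S3
    return results


# Local usage with a hand-built event shaped like a Kinesis trigger payload.
sample = {"device_id": "sensor-42", "timestamp": 1700000000, "temp_c": 21.5}
event = {"Records": [{"kinesis": {"data": base64.b64encode(
    json.dumps(sample).encode()).decode()}}]}
print(transform(event)[0][0])
```

Any custom transformation logic would go between decoding the payload and building the key.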

1597

MCQhard

A data scientist uses Amazon SageMaker Data Wrangler to explore a dataset. The target column is 'price' (continuous). Which EDA analysis would best help decide between linear regression and tree-based models?

A.Compute variance inflation factor (VIF) for features

B.Check linear relationships between features and target

C.Detect outliers using Z-score

D.Identify class imbalance in the target

AnswerB

Checking linearity between features and the target tests the core assumption of linear regression.

Why this answer

Option B is correct because checking linearity (e.g., scatter plots of features vs. target) is fundamental for linear model assumptions. Option A is wrong because VIF measures multicollinearity among features; it does not show whether the relationship with the target is linear. Option D is wrong because class imbalance applies to classification, and the target here is continuous.

Option C is wrong because outlier detection is important but not the primary factor for model selection.

1598

MCQmedium

A data engineer needs to continuously ingest streaming data from thousands of IoT devices and store the raw data in Amazon S3 for archival processing. The data volume varies significantly throughout the day, and the solution must be serverless, scalable, and cost-effective. Which AWS service should be used to capture and buffer the streaming data before writing to S3?

A.Amazon Kinesis Data Firehose

B.Amazon Kinesis Data Streams

C.AWS Glue

D.Amazon Simple Queue Service (SQS)

AnswerA

Kinesis Data Firehose is a serverless service that can directly deliver streaming data to S3 with buffering.

Why this answer

Amazon Kinesis Data Firehose (option A) is a serverless service that can capture, transform, and load streaming data into S3. It scales automatically and buffers data to handle varying volumes. Amazon Kinesis Data Streams (option B) is for real-time processing but requires a consumer to write to S3.

Amazon SQS (option D) is a message queue, not designed for streaming data. AWS Glue (option C) is a batch ETL service.

1599

MCQmedium

A company uses Amazon SageMaker Data Wrangler to perform exploratory data analysis. They want to detect outliers in a numerical column using the Interquartile Range (IQR) method. Which transformation should they apply in Data Wrangler?

A.Impute

B.Normalize

C.Handle outliers

D.Binning

AnswerC

This transform supports IQR method.

Why this answer

Option C is correct because Data Wrangler has a built-in 'Handle outliers' transform that allows IQR-based detection. Option B (Normalize) scales data; Option D (Binning) groups values; Option A (Impute) fills missing values.
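
The IQR rule itself is simple to state in code. The sketch below is a plain-Python illustration of the rule, not Data Wrangler's implementation; the `k=1.5` multiplier is the conventional default and the data is synthetic:

```python
from statistics import quantiles


def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Synthetic column with one extreme value.
ages = [18, 22, 25, 30, 35, 40, 45, 50, 150]
print(iqr_outliers(ages))  # [150]
```

Because the bounds come from quartiles rather than the mean and standard deviation, the single extreme value does not stretch the fence that is used to detect it.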

1600

Multi-Selectmedium

A data scientist is exploring a dataset with skewed numerical features. Which THREE transformations can help make the features more normally distributed?

Select 3 answers

A.Min-max scaling

B.Standardization (Z-score)

C.Yeo-Johnson transformation

D.Box-Cox transformation

E.Log transformation

AnswersC, D, E

Correct: Yeo-Johnson works for both positive and negative values.

Why this answer

Correct options: C, D, E. Yeo-Johnson (C), Box-Cox (D), and log transformation (E) are common for normalizing skewed data. Option B is wrong because standardization (Z-score) does not change distribution shape.

Option A is wrong because min-max scaling does not fix skewness.
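
The difference between reshaping and rescaling shows up directly in the skewness statistic. In this sketch the data is synthetic and strictly positive, and `skewness` is a helper defined here for illustration:

```python
from math import log


def skewness(values):
    """Population skewness: third central moment over variance ** 1.5."""
    n = len(values)
    mu = sum(values) / n
    m2 = sum((v - mu) ** 2 for v in values) / n
    m3 = sum((v - mu) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Synthetic right-skewed, strictly positive feature.
raw = [1, 2, 2, 3, 3, 3, 4, 5, 8, 13, 40, 100]

# Log transformation compresses the long right tail.
logged = [log(v) for v in raw]

# Standardization only shifts and rescales; the shape is untouched.
mu = sum(raw) / len(raw)
sd = (sum((v - mu) ** 2 for v in raw) / len(raw)) ** 0.5
standardized = [(v - mu) / sd for v in raw]

print(f"raw={skewness(raw):.2f} log={skewness(logged):.2f} "
      f"z-score={skewness(standardized):.2f}")
```

The log-transformed column is noticeably less skewed, while the standardized column has the same skewness as the raw data. Min-max scaling behaves like standardization here, since it is also a linear rescaling.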

1601

MCQeasy

A company uses Amazon SageMaker to deploy a model that predicts customer churn. The model is retrained weekly. The data scientist notices that the model's accuracy remains high, but the business reports that the model is not capturing new churn patterns. What is the most likely cause?

A.The model is underfitting the data

B.The model has data leakage from future data

C.The model is overfitting to the training data

D.The model is experiencing concept drift

AnswerD

Concept drift means the underlying data distribution changes, so the model's accuracy on old patterns remains high but it misses new patterns.

Why this answer

Concept drift occurs when the statistical properties of the target variable change over time, causing the model's predictions to become less relevant even if accuracy metrics remain high. In this scenario, the model is retrained weekly but still fails to capture new churn patterns because the underlying customer behavior has shifted—a classic sign of concept drift rather than a data or overfitting issue. Amazon SageMaker's built-in Model Monitor can detect such drift by comparing inference data distributions against a baseline.

Exam trap

The trap here is that candidates see 'accuracy remains high' and assume the model is overfitting or underfitting, but the key clue is 'not capturing new churn patterns'—which points to a shift in the underlying data distribution (concept drift), not a static model fit issue.

How to eliminate wrong answers

Option A is wrong because underfitting would manifest as consistently low accuracy on both training and test data, not as high accuracy with missed new patterns. Option B is wrong because data leakage from future data would cause unrealistically high performance during training and evaluation, not a failure to capture new churn patterns after deployment. Option C is wrong because overfitting would show high training accuracy but poor generalization on unseen data from the same distribution, whereas the problem here is that the data distribution itself has changed over time.

1602

MCQhard

A data scientist is using SageMaker to train a random forest model. The dataset has 100 features and 1 million rows. The training job fails with a 'ResourceLimitExceeded' error. What is the MOST likely cause?

A.The S3 bucket containing the training data is not in the same region.

B.The instance type selected does not have enough GPU memory.

C.The wrong algorithm was specified for the training job.

D.The account has reached its limit on the number of SageMaker training instances.

AnswerD

ResourceLimitExceeded indicates a service quota limit.

Why this answer

Option D is correct: 'ResourceLimitExceeded' indicates that the account has exceeded a service quota, such as the allowed number of training instances or a vCPU limit. Option B (GPU memory) is unlikely because random forest usually runs on CPU instances and does not require a GPU.

Option A (S3 bucket Region) does not cause a resource limit error. Option C (wrong algorithm) would produce a different error.

1603

MCQmedium

A data scientist is training a deep learning model on Amazon SageMaker for image classification. The training is taking a long time and the GPU utilization is consistently below 30%. What should the data scientist do to improve GPU utilization and reduce training time?

A.Use early stopping to stop training earlier.

B.Increase the batch size.

C.Switch to a CPU-only instance.

D.Reduce the number of layers in the model.

AnswerB

Larger batches use GPU memory more efficiently and increase utilization.

Why this answer

Low GPU utilization (below 30%) indicates that the GPU is spending most of its time waiting for data to process, often due to small batch sizes that underutilize the GPU's parallel compute capacity. Increasing the batch size allows the GPU to process more samples per forward/backward pass, improving arithmetic intensity and hardware utilization, which directly reduces total training time on SageMaker.

Exam trap

The trap here is that candidates confuse 'low GPU utilization' with 'overfitting' or 'model complexity,' leading them to choose early stopping or reducing layers, when the real issue is insufficient data parallelism per batch.

How to eliminate wrong answers

Option A is wrong because early stopping halts training based on validation performance, but it does not address the root cause of low GPU utilization or improve hardware efficiency during each training step. Option C is wrong because switching to a CPU-only instance would drastically reduce computational throughput, making training even slower and further underutilizing resources. Option D is wrong because reducing the number of layers decreases model capacity and may harm accuracy, but it does not directly improve GPU utilization; the bottleneck is data throughput, not model depth.

Full explanation →

1604

MCQhard

A data engineering team is designing a data pipeline to process large CSV files (10-50 GB each) stored in Amazon S3. The pipeline must transform the data using AWS Glue and load it into Amazon Redshift for analytics. The team wants to minimize costs while ensuring the pipeline can handle peak loads. Which approach is the most cost-effective?

A.Use AWS Lambda to process each file and load into Redshift.

B.Use Amazon EMR with Hive to transform the data and load into Redshift.

C.Use an AWS Glue Python shell job with a single r5.xlarge worker.

D.Use AWS Glue with Spark and dynamic frames, scaling the number of workers based on file size.

AnswerD

Correct: Glue Spark jobs handle large files efficiently; dynamic frames simplify schema handling.

Why this answer

Option D (AWS Glue with Spark and dynamic frames) is correct because Glue's Spark-based ETL handles large files efficiently, the number of workers can be scaled with file size, and dynamic frames allow schema inference without manual parsing. Option A (Lambda) has time and memory limits unsuitable for large files. Option B (EMR with Hive) is more complex and typically more expensive than Glue for this use case.

Option C (a Glue Python shell job with a single r5.xlarge worker) may not handle peak loads.

Full explanation →

1605

Multi-Selecteasy

During EDA, a data scientist notices that a numeric feature 'age' has values ranging from 0 to 150, but expects adult ages between 18-100. Which TWO steps should the scientist take to investigate?

Select 2 answers

A.Remove all rows with age > 100

B.Compute summary statistics (min, max, percentiles)

C.Apply log transformation to normalize the distribution

D.Impute age values outside 18-100 with the mean

E.Create a box plot to visualize outliers

AnswersB, E

Summary statistics and a box plot expose extreme values without altering the data.

Why this answer

Option B is correct because summary statistics (min, max, percentiles) reveal extreme values. Option E is correct because box plots show outliers. Option A is wrong because removing outliers before understanding their context is premature.

Option C is wrong because a log transformation changes the scale but does not help identify outliers. Option D is wrong because mean imputation would distort the distribution.
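As a rough illustration of the two investigative steps, the sketch below computes summary statistics and applies the 1.5 x IQR whisker rule that a box plot uses. The `ages` list is made-up sample data, and the 18-100 range is the expectation stated in the question.

```python
import statistics

# Hypothetical 'age' values, including suspicious entries (0, 3, 150).
ages = [0, 3, 22, 25, 31, 34, 38, 41, 45, 52, 58, 63, 70, 150]

# Summary statistics: min, max, and quartiles.
q1, median, q3 = statistics.quantiles(ages, n=4, method="inclusive")
summary = {"min": min(ages), "q1": q1, "median": median, "q3": q3, "max": max(ages)}

# Box-plot whisker rule: points beyond 1.5 * IQR from the quartiles are outliers.
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
box_plot_outliers = [a for a in ages if a < lower or a > upper]

# Values outside the expected adult range are flagged for review, not removed.
out_of_range = [a for a in ages if a < 18 or a > 100]

print(summary)
print("box-plot outliers:", box_plot_outliers)
print("outside 18-100:", out_of_range)
```

In this sample the whisker rule flags only the largest value, while the range check also surfaces the implausibly low ages, which is why both views are worth inspecting before any cleaning decision.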

Full explanation →

1606

MCQeasy

A data scientist is exploring a dataset and wants to check for missing values. Which method is most appropriate to identify the percentage of missing values per column?

A.Use Amazon S3 Select to query missing values

B.Use Amazon Athena to run a SELECT COUNT(*) query

C.Use Amazon QuickSight to create a missing value dashboard

D.Use AWS Glue Crawler to detect missing values

E.Use pandas .isnull().sum() in a SageMaker notebook

AnswerE

This is a direct and efficient way to count missing values per column.

Why this answer

Using pandas .isnull().sum() in a SageMaker notebook is a standard approach to count missing values per column; dividing by the row count (or using .isnull().mean()) gives the percentage. Option A is wrong because S3 Select is for filtering S3 objects, not for data analysis. Option B is wrong because Athena requires SQL and is less direct for EDA.

Option C is wrong because QuickSight is for visualization, not programmatic missing value analysis. Option D is wrong because a Glue Crawler discovers schema, not missing values.
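A minimal sketch of the pandas approach, assuming a small made-up DataFrame (the column names and values are illustrative only). `.isnull().sum()` gives counts, and `.isnull().mean() * 100` turns them into percentages.

```python
import pandas as pd

# Hypothetical dataset with a few missing entries.
df = pd.DataFrame({
    "age": [34, None, 51, 29],
    "city": ["Austin", "Boise", None, None],
    "income": [72000, 58000, 91000, 64000],
})

# Count of missing values per column.
missing_counts = df.isnull().sum()

# Share of missing values per column, expressed as a percentage.
missing_pct = df.isnull().mean() * 100

print(missing_counts)
print(missing_pct)
```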

Full explanation →

1607

MCQmedium

A data scientist is training a binary classification model on a dataset with 100 features and 10,000 samples. The model achieves 99% accuracy on the training set but only 65% on the test set. Which technique should be applied first to address this issue?

A.Reduce the size of the training dataset

B.Increase the number of trees in a random forest

C.Apply L2 regularization to the model

D.Add more features to the model

AnswerC

L2 regularization penalizes large weights, reducing overfitting.

Why this answer

The gap between 99% training accuracy and 65% test accuracy indicates overfitting. Regularization (L1/L2) is a direct method to reduce overfitting by penalizing large coefficients. Option A is wrong because reducing the training data would exacerbate the problem.

Option B is wrong because adding trees does not constrain the model and is unlikely to close the gap. Option D is wrong because adding more features would worsen overfitting.
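To make the "penalizes large weights" idea concrete, here is a small sketch for a one-feature linear model with no intercept. Minimizing sum((y - w*x)^2) + l2 * w^2 has the closed form w = sum(x*y) / (sum(x^2) + l2), so a larger penalty shrinks the weight toward zero. The data and the `fit_slope` helper are illustrative assumptions, not part of any SageMaker API.

```python
def fit_slope(xs, ys, l2=0.0):
    """Least-squares slope for y ~ w * x with an L2 penalty of l2 * w**2."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + l2)

# Hypothetical training points lying close to y = 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

w_plain = fit_slope(xs, ys)            # no regularization
w_ridge = fit_slope(xs, ys, l2=10.0)   # L2 penalty shrinks the weight

print(w_plain, w_ridge)
```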

Full explanation →

1608

MCQmedium

A machine learning engineer is examining a dataset containing text reviews. They want to convert the text into numerical features for a model. During EDA, they notice that the word 'the' appears in almost every review, while words like 'excellent' appear rarely. Which of the following techniques should they use to reduce the impact of very common words?

A.Apply TF-IDF transformation.

B.Remove stopwords from the text.

C.Use word2vec embeddings.

D.Use a bag-of-words representation.

AnswerA

TF-IDF reduces the weight of terms that appear frequently across documents.

Why this answer

Option A is correct because TF-IDF downweights words that appear in most documents. Option B is wrong because removing stopwords helps but does not adjust for frequency beyond the stopword list. Option C is wrong because word2vec focuses on context, not frequency weighting.

Option D is wrong because bag-of-words uses raw counts and does not weight terms.
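The effect on a word like 'the' can be seen in a short, dependency-free sketch. It uses the plain idf = log(N / df) form; libraries commonly apply smoothing, so their exact numbers will differ. The `reviews` list and the `tf_idf` helper are made up for illustration.

```python
import math
from collections import Counter

# Hypothetical reviews: 'the' and 'was' appear in every one.
reviews = [
    "the food was excellent",
    "the service was slow",
    "the room was clean",
]

def tf_idf(docs):
    """Return one {word: tf-idf score} dict per document."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each word.
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores.append({
            word: (count / len(tokens)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return scores

scores = tf_idf(reviews)
print(scores[0])
```

Words present in every review get a score of zero, while a rare word such as 'excellent' keeps a positive weight.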

Full explanation →

1609

MCQmedium

A company uses Amazon SageMaker to train machine learning models. The training data contains personally identifiable information (PII). The company needs to ensure that the data is encrypted in transit between S3 and SageMaker. Which configuration is REQUIRED?

A.Use SageMaker in a VPC with VPC endpoints for S3

B.Enable S3 server-side encryption

C.Set the S3 bucket policy to require SSL

D.Use a custom Docker image with TLS configured

AnswerA

VPC endpoints ensure traffic stays within AWS network and uses HTTPS.

Why this answer

Option A is correct because SageMaker accesses S3 over HTTPS by default, and VPC endpoints keep that traffic within the AWS network. Option B covers encryption at rest, not in transit. Option C is optional.

Option D is not required for encryption in transit.

Full explanation →

1610

Multi-Selectmedium

A data scientist is building a text classification model using a bag-of-words approach. The dataset contains 100,000 documents with a vocabulary of 50,000 unique words. The model is overfitting. Which THREE techniques can help reduce overfitting? (Choose THREE.)

Select 3 answers

A.Increase max_features to include more words

B.Apply L1 or L2 regularization

C.Reduce the n-gram range to unigrams only

D.Use feature selection to remove rare words

E.Use TF-IDF instead of raw counts

AnswersB, C, D

Regularization penalizes large coefficients, reducing overfitting.

Why this answer

Feature selection (removing rare words), regularization (L1/L2), and lowering the n-gram range reduce model complexity and overfitting. Option A (increasing max_features) can increase overfitting. Option E (using TF-IDF) is a weighting scheme, not a regularization technique.

Full explanation →

1611

MCQhard

A company runs a real-time recommendation system that uses Amazon SageMaker endpoints for inference. The system ingests user activity data from a mobile app via Amazon API Gateway and AWS Lambda, which writes events to an Amazon Kinesis Data Stream. A second Lambda function consumes the stream, calls a SageMaker endpoint to generate recommendations, and stores the results in Amazon DynamoDB. The system has been working well, but recently the team noticed an increase in latency from the time a user action occurs to when the recommendation is stored. The SageMaker endpoint shows increased invocation latency but no throttling. CloudWatch metrics show that the Kinesis stream's IteratorAgeMilliseconds is increasing, indicating the consumer is falling behind. The Lambda consumer's duration is within limits, but the number of invocations is lower than expected. The team suspects the issue is with the event source mapping. Which course of action should the team take to reduce the latency?

A.Increase the batch size in the event source mapping to process more records per invocation.

B.Increase the number of shards in the Kinesis data stream to increase parallelism.

C.Decrease the Lambda function's reserved concurrency to force it to scale down.

D.Replace the Lambda consumer with an Amazon Kinesis Data Firehose delivery stream.

AnswerA

Larger batches improve throughput by reducing overhead per invocation.

Why this answer

Option A is correct because the consumer is falling behind despite adequate Lambda duration, suggesting that the batch size or parallelization factor is too low. Increasing the batch size allows each Lambda invocation to process more records, increasing throughput. Option B is wrong because increasing shards increases cost and may not help if the consumer is the bottleneck.

Option C is wrong because reducing concurrency would worsen the situation. Option D is wrong because the Lambda function is already consuming from Kinesis; using Firehose would not directly solve the consumer lag.

Full explanation →

1612

MCQeasy

A data scientist is exploring a dataset with 100 features. The goal is to build a binary classification model. The dataset is highly imbalanced with 95% negative class and 5% positive class. The data scientist wants to understand the relationship between features and the target. Which technique is most appropriate for initial exploratory analysis?

A.Remove the minority class samples and analyze the majority class only.

B.Use stratified sampling to create a balanced subset for visualization and correlation analysis.

C.Use random sampling to select 10% of the data for EDA.

D.Apply SMOTE to the dataset before performing EDA.

AnswerB

Stratified sampling preserves the proportion of each class and ensures the minority class is included in the analysis.

Why this answer

Option B is correct because stratified sampling ensures that the minority class is adequately represented in the sample. Option A is wrong because removing the minority class would prevent analyzing the target. Option C is wrong because random sampling may underrepresent the rare class.

Option D is wrong because SMOTE is a data augmentation technique for training, not for EDA.
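One way to build such a subset is to sample within each class separately. The sketch below draws the same number of rows per class so the rare class is visible in plots; the `rows` data and the `balanced_sample` helper are hypothetical.

```python
import random
from collections import defaultdict

def balanced_sample(rows, label_key, per_class, seed=0):
    """Draw up to `per_class` rows from each class, sampling within strata."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row in rows:
        by_class[row[label_key]].append(row)
    sample = []
    for members in by_class.values():
        sample.extend(rng.sample(members, min(per_class, len(members))))
    return sample

# Hypothetical dataset: 1,000 rows, 5% positive class.
rows = [{"id": i, "target": int(i % 20 == 0)} for i in range(1000)]

subset = balanced_sample(rows, "target", per_class=50)
positives = sum(r["target"] for r in subset)
print(len(subset), positives)
```

Because the subset no longer reflects the true class ratio, it suits visualization and correlation checks rather than estimates of prevalence.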

Full explanation →

1613

MCQhard

A company is using Amazon SageMaker to train a model with a custom algorithm. The training script reads data from an S3 bucket using boto3. The training job fails with an 'AccessDenied' error when trying to access the S3 bucket. The IAM role attached to the SageMaker notebook instance has full S3 access. What is the most likely cause?

A.The S3 bucket has a bucket policy that denies access from the SageMaker service.

B.The SageMaker execution role used for the training job does not have S3 access permissions.

C.The training script is using an incorrect S3 bucket name.

D.The SageMaker training job is not configured to use the S3 VPC endpoint.

AnswerB

The training job uses its own execution role, which must be granted S3 access.

Why this answer

The IAM role attached to the SageMaker notebook instance is used for interactive development, but training jobs run under a separate SageMaker execution role. Even if the notebook role has full S3 access, the training job's execution role must also have explicit S3 permissions. The 'AccessDenied' error indicates that the execution role lacks the necessary s3:GetObject or s3:ListBucket actions for the S3 bucket.

Exam trap

The trap here is that candidates confuse the IAM role attached to the SageMaker notebook instance with the execution role used by the training job, assuming they are the same or that permissions propagate automatically.

How to eliminate wrong answers

Option A is wrong because a bucket policy that denies SageMaker access would typically produce a different error (e.g., 'AccessDenied' with a specific denial message), and the question states the role has full S3 access, so a bucket policy conflict is less likely than a missing execution role permission. Option C is wrong because an incorrect bucket name would result in a 'NoSuchBucket' or '404' error, not an 'AccessDenied' error. Option D is wrong because a missing S3 VPC endpoint would cause a network timeout or connectivity error, not an IAM permission error, and SageMaker can access S3 over the public internet by default.

Full explanation →

1614

Multi-Selectmedium

Which TWO of the following are valid techniques to handle missing data in a dataset?

Select 2 answers

A.Normalizing the data

B.Adding a constant value of 0

C.Mean imputation

D.Synthetic Minority Over-sampling (SMOTE)

E.Deleting rows with missing values

AnswersC, E

Replacing missing values with the mean is a standard technique.

Why this answer

Mean imputation (Option C) is a valid technique for handling missing data because it replaces missing values with the mean of the observed values for that feature, preserving the overall mean of the dataset. This approach is simple and effective for numerical data that is missing completely at random (MCAR), as it does not introduce bias in the mean estimate. Deleting rows with missing values (Option E) is also valid: it is the simplest approach and works well when only a small fraction of rows is affected.

Exam trap

The exam often tests the distinction between missing-data techniques (like imputation) and unrelated preprocessing techniques (like normalization or SMOTE). The trap here is that candidates may treat SMOTE or normalization as valid missing data handling methods because they are common preprocessing steps, but they serve entirely different purposes.
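Both techniques fit in a few lines. This sketch uses a made-up column with `None` marking missing entries and shows that mean imputation leaves the column mean unchanged.

```python
from statistics import mean

# Hypothetical numeric column; None marks a missing value.
values = [4.0, None, 6.0, 8.0, None]

# Row deletion: keep only the observed values.
rows_deleted = [v for v in values if v is not None]

# Mean imputation: replace each missing value with the mean of the observed ones.
fill = mean(rows_deleted)
mean_imputed = [fill if v is None else v for v in values]

print(rows_deleted)
print(mean_imputed)
```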

Full explanation →

1615

MCQhard

A data scientist is performing EDA on a dataset with 100 features. They want to identify which features are most predictive of the target using a model-agnostic method. Which technique should they use?

A.Pearson correlation matrix

B.L1 regularization

C.SHAP values

D.Permutation feature importance

AnswerD

Permutation importance works with any model and measures drop in performance when a feature is shuffled.

Why this answer

Option D is correct because permutation importance is model-agnostic: it measures importance as the drop in performance when a feature is shuffled. Option A is wrong because a correlation matrix is bivariate and captures only linear relationships. Option B is wrong because L1 regularization is tied to a specific model.

Option C is wrong because SHAP values are often computed with model-specific explainers.
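The shuffling idea can be sketched without any ML library. Here `predict` is a stand-in for any fitted model's prediction function, and the data, the accuracy metric, and the helper names are all illustrative assumptions.

```python
import random

def accuracy(predict, X, y):
    return sum(predict(row) == label for row, label in zip(X, y)) / len(y)

def permutation_importance(predict, X, y, n_repeats=10, seed=0):
    """Mean drop in accuracy when each feature column is shuffled."""
    rng = random.Random(seed)
    baseline = accuracy(predict, X, y)
    importances = []
    for col in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            shuffled = [row[col] for row in X]
            rng.shuffle(shuffled)
            # Rebuild the dataset with only this column permuted.
            X_perm = [row[:col] + [v] + row[col + 1:] for row, v in zip(X, shuffled)]
            drops.append(baseline - accuracy(predict, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances

# Hypothetical data: feature 0 determines the label, feature 1 is noise.
data_rng = random.Random(42)
X = [[data_rng.random(), data_rng.random()] for _ in range(200)]
y = [int(row[0] > 0.5) for row in X]

# Stand-in for a trained model: it only looks at feature 0.
def predict(row):
    return int(row[0] > 0.5)

importances = permutation_importance(predict, X, y)
print(importances)
```

The function only needs a callable that returns predictions, which is what makes the method independent of the model type.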

Full explanation →

1616

Multi-Selectmedium

A company is building a data lake on Amazon S3 and wants to ensure that data is encrypted at rest using AWS KMS. Which TWO actions are required to achieve this? (Choose TWO.)

Select 2 answers

A.Configure the KMS key policy to allow the S3 service to use the key

B.Enable default encryption on the S3 bucket with SSE-KMS

C.Add a bucket policy that denies PutObject without encryption

D.Enable encryption in transit using HTTPS for all S3 API calls

E.Use client-side encryption on all data before uploading

AnswersA, B

The key policy must grant the S3 service principal permission to encrypt/decrypt.

Why this answer

Option B is correct because default bucket encryption with SSE-KMS ensures all objects are encrypted with KMS. Option A is correct because the KMS key policy must allow S3 to use the key. Option C is wrong because a bucket policy can reject unencrypted uploads but is not required once default encryption is enabled.

Option D is wrong because encryption in transit (HTTPS) is separate from at-rest encryption. Option E is wrong because client-side encryption is not required if server-side encryption is used.

Full explanation →

1617

Multi-Selectmedium

A company uses Amazon SageMaker to train a linear regression model. During evaluation, they observe that the model has high bias (underfitting). Which THREE actions can reduce bias?

Select 3 answers

A.Increase L2 regularization.

B.Add polynomial features.

C.Reduce the regularization strength.

D.Use a smaller training dataset.

E.Use a random forest model instead of linear regression.

AnswersB, C, E

Polynomial features increase model capacity, reducing bias.

Why this answer

Options B, C, and E are correct. Adding polynomial features increases model complexity. Using a more complex algorithm (random forest) can capture non-linear patterns.

Reducing regularization allows the model to fit more closely. Option A is wrong because increasing L2 regularization increases bias. Option D is wrong because a smaller training dataset does not reduce bias.

Full explanation →

1618

Drag & Dropmedium

Drag and drop the steps to set up cross-validation in a SageMaker training job using the built-in XGBoost algorithm in the correct order.

Why this order

Cross-validation requires data splitting, job configuration with CV parameters, execution, and model selection.

Full explanation →

1619

MCQmedium

A data scientist is performing EDA and observes that a feature 'purchase_amount' has many zeros and a long tail of positive values. What type of model would be appropriate for this target variable?

A.Zero-inflated negative binomial regression.

B.Linear regression after log transformation.

C.Logistic regression on binary indicator of purchase.

D.Poisson regression.

AnswerA

Handles excess zeros and overdispersion.

Why this answer

Option A is correct because zero-inflated models handle excess zeros, and the negative binomial component handles overdispersion. Option B is wrong because linear regression assumes normally distributed errors, and a log transformation is undefined at zero. Option C is wrong because logistic regression is for binary outcomes.

Option D is wrong because Poisson regression is for count data without excess zeros.

Full explanation →

1620

MCQmedium

A data scientist is training a text classification model using Amazon SageMaker's BlazingText algorithm. The dataset consists of 1 million documents, each labeled with one of 10 categories. The model achieves 92% accuracy on a held-out test set. However, when deployed, the model performs poorly on documents containing slang and typos. What should the data scientist do to improve model robustness?

A.Remove all documents with slang or typos from the training set.

B.Augment the training data by introducing common slang replacements and typos.

C.Increase the embedding dimension from 100 to 300.

D.Increase the number of training epochs.

AnswerB

Data augmentation exposes the model to realistic noise, improving robustness.

Why this answer

Option B is correct. Data augmentation with noise helps the model generalize to variations. Option A is wrong because removing such documents reduces training data and hides the very patterns seen in production.

Option C is wrong because a larger embedding dimension may not help with slang. Option D is wrong because increasing epochs may lead to overfitting.

Full explanation →

1621

MCQeasy

A data scientist needs to run a one-time query on 10 TB of data stored in S3 using Amazon Athena. The query scans 5 TB and returns a small result set. Which approach minimizes cost?

A.Query the data directly in Athena without any preprocessing

B.Create an S3 Select query to filter data before Athena

C.Use Amazon Redshift Spectrum to query the data

D.Use AWS Glue to convert the data to Parquet format and repartition by date

AnswerA

For a one-time query, scanning 5 TB at $5 per TB is $25, which is minimal compared to preprocessing costs.

Why this answer

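Athena charges based on data scanned, so a single query that scans 5 TB costs about $25 at $5 per TB. Partitioning and columnar formats such as Parquet (Option D) reduce the data scanned on later queries, but the Glue conversion must first read and rewrite all 10 TB, which does not pay off for a one-time query. Option C is wrong because Redshift Spectrum requires a Redshift cluster.

Option B is wrong because S3 Select filters one object at a time and does not act as a preprocessing step for Athena.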
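The $25 figure in the answer is plain arithmetic on data scanned. The sketch below reproduces it, using the $5 per TB rate quoted above as an assumption; actual pricing may differ by Region and over time.

```python
# Rate quoted in the answer above; treat it as an assumption, not current pricing.
PRICE_PER_TB_USD = 5.0

def athena_query_cost(tb_scanned, price_per_tb=PRICE_PER_TB_USD):
    """Estimated cost of one Athena query, billed on data scanned."""
    return tb_scanned * price_per_tb

# The query scans 5 TB of the 10 TB stored.
one_time_cost = athena_query_cost(5)
print(f"${one_time_cost:.2f}")
```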

Full explanation →

1622

Multi-Selecteasy

A data scientist is exploring a dataset with a binary target variable. Which TWO metrics are appropriate for evaluating the balance of the target classes? (Choose two.)

Select 2 answers

A.Count plot of the target variable

B.Histogram of a feature

C.Scatter plot of two features colored by target

D.value_counts() on the target column

E.Correlation matrix of all features

AnswersA, D

Count plot shows frequency of each class.

Why this answer

Options A and D are correct. A count plot and value_counts() both show class frequencies. Option B is wrong because a histogram of a feature shows the distribution of that feature, not of the target.

Option C is wrong because a scatter plot shows the relationship between two numeric variables. Option E is wrong because a correlation matrix shows relationships between features.

Full explanation →

1623

MCQhard

A data scientist is analyzing a dataset with 500 features and 100,000 observations. The target variable is binary. The dataset contains highly correlated features and some categorical variables with high cardinality. Which combination of techniques should the data scientist use to reduce dimensionality while preserving interpretability for EDA?

A.Apply Principal Component Analysis (PCA) to all features and then train a model on the top 50 components.

B.Use mutual information to select top features and apply label encoding to categorical variables.

C.Use chi-squared test to select top features and one-hot encode categorical variables.

D.Apply correlation-based feature selection to remove highly correlated pairs, then use target encoding for high-cardinality categorical variables.

AnswerD

Correlation filter reduces redundancy; target encoding converts categoricals to numeric without increasing dimensionality.

Why this answer

Option D is correct because correlation-based feature selection removes redundant features, and target encoding handles high-cardinality categoricals without expanding dimensions. Option A is wrong because PCA reduces interpretability and does not handle categoricals. Option C is wrong because one-hot encoding high-cardinality categoricals explodes the number of dimensions, and the chi-squared test only suits categorical or non-negative features.

Option B is wrong because mutual information is useful for feature selection, but label encoding imposes an arbitrary order and does not address high cardinality directly.

Full explanation →

1624

MCQmedium

A data engineer is responsible for managing a data lake on Amazon S3. The data lake contains CSV files from various sources, totaling 10 TB. The engineer needs to make this data queryable using Amazon Athena. However, Athena queries are currently taking a long time and scanning large amounts of data. The engineer has noticed that the CSV files are not partitioned, and there are no indexes. The engineer wants to improve query performance and reduce costs. The data is accessed frequently for the last 30 days, but older data is rarely queried. The engineer also wants to minimize the amount of data scanned by Athena. What should the engineer do?

A.Convert the CSV files to JSON format and use Athena to query them.

B.Convert the CSV files to Parquet format and partition the data by date.

C.Create indexes on the S3 objects using AWS Glue.

D.Convert the CSV files to ORC format and create a view in Athena.

AnswerB

Parquet is columnar and compressed; partitioning by date allows partition pruning, reducing scan size.

Why this answer

Option B is correct. Converting CSV to Parquet reduces scan size due to columnar storage and compression. Partitioning by date allows Athena to skip irrelevant partitions.

Option A is wrong because JSON does not address the partitioning issue or reduce the data scanned. Option C is wrong because Athena does not support indexes. Option D is wrong because converting to ORC without partitioning helps, but not as much as a columnar format combined with partitioning.

Full explanation →

1625

MCQeasy

A company uses Amazon RDS for its transactional database and needs to export a daily snapshot of a table to Amazon S3 in Parquet format for analytics. Which AWS service can perform this export without writing custom code?

A.Amazon Redshift

B.AWS Database Migration Service (DMS)

C.Amazon Athena

D.AWS Glue

AnswerD

Glue can run scheduled ETL jobs to extract from RDS and write to S3 in Parquet.

Why this answer

AWS Glue can connect to RDS, extract data, and write to S3 in Parquet format using a Glue job. AWS Database Migration Service (DMS) is primarily for migration, not scheduled exports. Amazon Athena cannot directly connect to RDS.

Amazon Redshift is a data warehouse, not a migration tool.

Full explanation →

1626

MCQeasy

A team is building a product recommendation system using matrix factorization in Amazon SageMaker. They notice that the model's training loss decreases steadily but validation loss starts increasing after 5 epochs. What is the most likely cause?

A.Underfitting

B.Not enough training data

C.Learning rate too high

D.Overfitting

AnswerD

The model is memorizing the training data.

Why this answer

In matrix factorization for recommendation systems, a decreasing training loss with an increasing validation loss after several epochs is a classic sign of overfitting. The model is memorizing the training data (including noise) rather than learning generalizable patterns, which degrades its performance on unseen validation data.

Exam trap

The trap here is that candidates may confuse the symptom of overfitting (training loss decreasing, validation loss increasing) with underfitting or a learning rate issue, but the key is the divergence between the two loss curves after a period of convergence.

How to eliminate wrong answers

Option A is wrong because underfitting would show high training loss that does not decrease sufficiently, not a diverging gap between training and validation loss. Option B is wrong because insufficient training data can contribute to overfitting, but the direct symptom described—training loss decreasing while validation loss increases—is the hallmark of overfitting, not a data quantity issue alone. Option C is wrong because a learning rate that is too high typically causes the loss to oscillate or diverge entirely, not a steady decrease in training loss with a later increase in validation loss.

Full explanation →

1627

MCQhard

A SageMaker endpoint has a CloudWatch alarm configured as shown in the exhibit. The alarm fires when the p99 latency exceeds 500 ms for two consecutive minutes. Which action should the data scientist take to reduce latency?

A.Increase the number of instances behind the endpoint

B.Increase the batch size in the inference request

C.Use SageMaker asynchronous inference instead of real-time

D.Switch to GPU instances even if the model does not require GPU

AnswerA

More instances distribute load, reducing latency.

Why this answer

Option A is correct: increasing the instance count reduces the load per instance, which reduces latency. Option B (batch size) depends on the model and can increase per-request latency. Option C (async inference) changes the architecture.

Option D (GPU) may not help if the model is not compute-bound.

Full explanation →

1628

MCQeasy

During exploratory data analysis, a data scientist notices that a categorical feature 'city' has over 1,000 unique values. The dataset has 10,000 rows. Which technique should the scientist consider to reduce the cardinality of this feature?

A.Apply label encoding to assign numeric labels.

B.Group low-frequency categories into a single 'other' category.

C.Apply one-hot encoding to all categories.

D.Apply frequency encoding to replace each category with its frequency.

AnswerB

Grouping rare categories reduces cardinality effectively.

Why this answer

Grouping rare categories into an 'other' bucket is a common technique to reduce cardinality. Option A (label encoding) doesn't reduce the number of unique values. Option C (one-hot encoding) would create too many columns.

Option D (frequency encoding) replaces categories with their frequency but does not reduce the number of categories in a controlled way.
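Grouping rare categories can be sketched with a frequency count and a threshold. The `cities` list, the `min_count` threshold, and the `group_rare` helper are illustrative choices; a real threshold would depend on the dataset.

```python
from collections import Counter

def group_rare(values, min_count=2, other_label="other"):
    """Replace categories seen fewer than `min_count` times with `other_label`."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other_label for v in values]

# Hypothetical 'city' column with two rare categories.
cities = ["Austin", "Austin", "Boise", "Austin", "Reno", "Boise", "Fargo"]

grouped = group_rare(cities)
print(grouped)
print(len(set(cities)), "->", len(set(grouped)))
```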

Full explanation →

1629

MCQhard

A data scientist is working on a customer churn prediction project for a telecom company. The dataset contains 50,000 records with 25 features, including 'tenure' (number of months customer stayed), 'monthly_charges', 'total_charges', 'contract_type' (month-to-month, one year, two year), 'payment_method', and a target 'churn' (Yes/No). The data is stored in an S3 bucket as a single CSV file. The scientist uses Amazon SageMaker Data Wrangler to perform EDA. After importing the data, the scientist notices that the 'total_charges' column has many missing values (about 20% of rows). The scientist suspects that missing values occur only for customers with tenure = 0 (new customers). After verifying that suspicion, the scientist wants to handle the missing values appropriately. Which course of action should the scientist take?

A.Use a regression model to predict total_charges based on other features.

B.Impute missing total_charges with the mean of non-missing values.

C.Drop all rows with missing total_charges to avoid bias.

D.Impute missing total_charges with 0, since missing values correspond to customers with tenure=0.

AnswerD

Given the pattern, total_charges should be 0 for new customers; imputing with 0 preserves data integrity.

Why this answer

Option D is correct because if total_charges is missing only for tenure=0, those customers have not been billed yet, so total_charges should be 0. Imputing with 0 is appropriate. Option A is wrong because using a model to predict missing values is overkill and may introduce error when the true value is known to be 0.

Option B is wrong because imputing with the mean would assign incorrect values to new customers. Option C is wrong because dropping rows with missing total_charges would remove all new customers, biasing the dataset.
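The verify-then-impute sequence might look like the following sketch. It uses plain dictionaries rather than Data Wrangler, and the records and the helper name are hypothetical.

```python
def impute_new_customer_charges(customers):
    """Fill missing total_charges with 0.0 after checking the tenure == 0 pattern."""
    # Verify the suspicion: every missing value belongs to a tenure == 0 customer.
    pattern_holds = all(
        c["tenure"] == 0 for c in customers if c["total_charges"] is None
    )
    if not pattern_holds:
        raise ValueError("total_charges is missing for customers with tenure > 0")
    return [
        {**c, "total_charges": 0.0 if c["total_charges"] is None else c["total_charges"]}
        for c in customers
    ]

# Hypothetical records; None marks a missing total_charges value.
customers = [
    {"tenure": 0, "total_charges": None},
    {"tenure": 12, "total_charges": 840.0},
    {"tenure": 0, "total_charges": None},
    {"tenure": 3, "total_charges": 210.5},
]

imputed = impute_new_customer_charges(customers)
print(imputed)
```

Raising an error when the pattern fails keeps the zero fill from being applied to rows where the true value is unknown.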

Full explanation →

1630

MCQhard

A data scientist is using SageMaker Autopilot to automatically build a classification model. The dataset is highly imbalanced (1% positive class). Which configuration should the scientist set to handle the class imbalance?

A.Set the problem_type to 'BinaryClassification' and enable 'balance_class_weights'.

B.Use the 'AutoMLJobObjective' with 'F1' metric.

C.Set the 'sample_weight' attribute in the input data.

D.Manually downsample the majority class before training.

AnswerB

Optimizing for F1 helps address class imbalance by balancing precision and recall.

Why this answer

SageMaker Autopilot does not automatically correct class imbalance. The user specifies the target attribute and, optionally, the problem type and an objective metric. With a 1% positive class, accuracy is misleading because a model that always predicts the majority class scores 99%.

The correct answer is to use 'AutoMLJobObjective' with the 'F1' metric: Autopilot then optimizes F1, which balances precision and recall and is sensitive to imbalance.

Full explanation →

1631

Multi-Selectmedium

A data engineer is building a streaming pipeline using Amazon Kinesis Data Streams and AWS Lambda. The Lambda function processes records and writes results to Amazon S3. The engineer notices that the Lambda function is experiencing throttling and some records are being dropped. Which TWO actions should the engineer take to improve the reliability of the pipeline?

Select 2 answers

A.Increase the number of shards in the Kinesis data stream.

B.Set a reserved concurrency on the Lambda function to prevent other functions from using its capacity.

C.Add a Dead Letter Queue to the Lambda function to capture failed records.

D.Decrease the batch size in the Lambda event source mapping.

E.Increase the Kinesis stream's retention period to 7 days.

AnswersA, B

More shards increase parallelism and throughput.

Why this answer

Increasing the number of shards in the Kinesis stream increases throughput and reduces throttling. Setting the Lambda function's reserved concurrency ensures it has sufficient capacity to handle the incoming records.

Full explanation →

1632

Multi-Selecthard

A company is deploying a machine learning model for real-time fraud detection. The model must have extremely low latency (<10 ms) and high throughput. Which THREE design choices meet these requirements? (Choose 3.)

Select 3 answers

A.Use GPU instances (e.g., ml.p3) for the endpoint.

B.Use one endpoint per model to avoid interference.

C.Use SageMaker Batch Transform for real-time predictions.

D.Use SageMaker multi-model endpoints to host multiple models on the same instance.

E.Use SageMaker Elastic Inference to attach GPU acceleration to a CPU instance.

AnswersA, D, E

GPU accelerates inference, reducing latency.

Why this answer

Option A is correct because GPU instances like ml.p3 provide massively parallel compute capability that accelerates matrix operations common in deep learning models, enabling inference latencies under 10 ms. For real-time fraud detection, the GPU's high throughput and low latency are essential for processing thousands of transactions per second without bottlenecks.

Exam trap

The exam often tests the misconception that batch processing services like Batch Transform can be used for real-time inference. The key distinction is that Batch Transform is designed for offline, asynchronous workloads and cannot meet low-latency requirements.


1633

MCQeasy

A data engineer wants to stream clickstream data from a web application to Amazon S3 for near-real-time analytics. Which AWS service should be used to ingest and buffer the data before landing in S3?

A.Amazon AppFlow

B.Amazon Kinesis Data Streams

C.Amazon Kinesis Data Firehose

D.AWS Glue

AnswerC

Firehose can directly deliver streaming data to S3.

Why this answer

Amazon Kinesis Data Firehose is the correct service for loading streaming data into S3 without custom code. Option A is wrong because AppFlow is for SaaS integrations. Option B is wrong because Kinesis Data Streams requires a consumer to write to S3.

Option D is wrong because Glue is for ETL, not real-time streaming.
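As a rough illustration of the "no custom consumer" point, the sketch below sends one clickstream event to a Firehose delivery stream, which then buffers and delivers it to S3. It assumes a boto3 Firehose client and its `put_record` call; the stream name, the event fields, and the `RecordingClient` stand-in (used so the example runs without AWS credentials) are hypothetical.

```python
import json


def send_click(firehose_client, stream_name, click):
    """Send one click event to a Firehose delivery stream as newline-delimited JSON."""
    payload = (json.dumps(click) + "\n").encode("utf-8")
    return firehose_client.put_record(
        DeliveryStreamName=stream_name,
        Record={"Data": payload},
    )


class RecordingClient:
    """Hypothetical stand-in for boto3.client("firehose") that only records calls."""

    def __init__(self):
        self.calls = []

    def put_record(self, **kwargs):
        self.calls.append(kwargs)
        return {"RecordId": "dry-run"}


client = RecordingClient()
send_click(client, "clickstream-to-s3", {"page": "/home", "user": "u1"})
```

With real credentials, the stand-in would be replaced by `boto3.client("firehose")`; the delivery stream itself, including its S3 destination and buffering settings, is assumed to exist already.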


1634

MCQhard

A machine learning engineer is evaluating a dataset for building a fraud detection model. The dataset has 1 million transactions, but only 500 are fraudulent. The engineer wants to understand the distribution of fraudulent vs. non-fraudulent transactions over time. Which EDA visualization is most suitable?

A.Bar chart of transaction count per day with colors for fraud status

B.Scatter plot of transactions over time colored by fraud status

C.Box plot of transaction amount per month grouped by fraud status

D.Line plot of daily fraud rate and non-fraud rate

AnswerD

A line plot of daily rates shows how fraud and non-fraud activity change over time.

Why this answer

Option D is correct because a time series line plot with two lines (fraud vs. non-fraud) shows temporal patterns. Option A is wrong because bar chart of counts per day is less effective for two categories. Option B is wrong because scatter plot with 1 million points is overwhelming.

Option C is wrong because box plot shows distribution per time period but not temporal trend.
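A small sketch of how the daily fraud rate behind such a line plot could be computed before plotting. The `(date, is_fraud)` record layout and the sample rows are assumptions made for illustration; plotting is left out to keep the example dependency-free.

```python
from collections import defaultdict
from datetime import date


def daily_fraud_rate(transactions):
    """Return {day: fraction of that day's transactions flagged as fraud}."""
    totals = defaultdict(int)
    frauds = defaultdict(int)
    for day, is_fraud in transactions:
        totals[day] += 1
        if is_fraud:
            frauds[day] += 1
    # Sort by day so the result can be fed straight into a line plot.
    return {day: frauds[day] / totals[day] for day in sorted(totals)}


# Hypothetical sample: (transaction date, fraud flag)
sample = [
    (date(2024, 1, 1), False),
    (date(2024, 1, 1), False),
    (date(2024, 1, 1), True),
    (date(2024, 1, 2), False),
    (date(2024, 1, 2), False),
    (date(2024, 1, 2), False),
    (date(2024, 1, 2), False),
]
rates = daily_fraud_rate(sample)
```

The non-fraud rate for each day is simply one minus the fraud rate, so both lines can be drawn from the same dictionary.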


1635

Multi-Selecthard

Which TWO of the following are techniques used to reduce overfitting in a neural network?

Select 2 answers

A.Increase the number of layers

B.Batch normalization

C.L2 regularization

D.Dropout

E.Increase the learning rate

AnswersC, D

L2 regularization penalizes large weights.

Why this answer

Options C and D are correct. C is correct because L2 regularization penalizes large weights. D is correct because dropout randomly drops units, preventing co-adaptation.

A is wrong because increasing the number of layers increases model complexity, which can worsen overfitting. B is wrong because batch normalization helps training but does not primarily reduce overfitting. E is wrong because increasing the learning rate may cause divergence, not reduce overfitting.
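The two regularization techniques named above can be sketched in a few lines of plain Python. This is an illustrative toy, not a framework implementation: the function names, the weight values, and the `lam` and `drop_prob` settings are all assumptions.

```python
import random


def l2_penalty(weights, lam):
    """L2 term added to the loss: lam times the sum of squared weights."""
    return lam * sum(w * w for w in weights)


def inverted_dropout(activations, drop_prob, rng):
    """Zero each unit with probability drop_prob and rescale the survivors.

    Survivors are divided by (1 - drop_prob) so the expected activation is unchanged.
    """
    keep_prob = 1.0 - drop_prob
    return [a / keep_prob if rng.random() < keep_prob else 0.0 for a in activations]


rng = random.Random(0)
penalty = l2_penalty([1.0, -2.0, 0.5], lam=0.01)
dropped = inverted_dropout([1.0, 1.0, 1.0, 1.0], drop_prob=0.5, rng=rng)
```

Larger weights raise the penalty quadratically, which is what pushes the optimizer toward smaller weights; dropout is applied only during training and is turned off at inference time.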


1636

MCQeasy

A data scientist is training a binary classification model on an imbalanced dataset where the positive class represents only 1% of the data. The model achieves 99% accuracy but fails to identify most positive cases. Which metric should the data scientist use to evaluate model performance?

A.R-squared

B.F1 score

C.Accuracy

D.RMSE

AnswerB

F1 score balances precision and recall, suitable for imbalanced data.

Why this answer

The F1 score is the harmonic mean of precision and recall, making it ideal for imbalanced datasets where accuracy is misleading. Since the model achieves 99% accuracy by simply predicting the majority class (negative), it fails to capture positive cases; F1 score penalizes this by balancing false positives and false negatives, providing a more truthful performance measure.

Exam trap

The trap here is that candidates often default to accuracy as the primary metric, overlooking how imbalanced data can inflate accuracy while hiding poor positive class detection, which the F1 score directly addresses.

How to eliminate wrong answers

Option A is wrong because R-squared is a regression metric that measures the proportion of variance explained by the model, not applicable to binary classification. Option C is wrong because accuracy is misleading on imbalanced datasets; a model predicting all negatives achieves 99% accuracy but fails to identify any positives, so it does not reflect true performance. Option D is wrong because RMSE is a regression metric that measures the square root of the average squared differences between predicted and actual values, not suitable for binary classification outcomes.
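To make the accuracy-versus-F1 gap concrete, the sketch below scores a classifier that always predicts the negative class on a dataset with 1% positives. The dataset size of 1,000 is an assumption chosen for easy arithmetic.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels encoded as 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def f1_from_counts(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when the denominator is 0."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# 1% positives, and a model that always predicts the majority (negative) class.
y_true = [1] * 10 + [0] * 990
y_pred = [0] * 1000

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)   # 0.99
f1 = f1_from_counts(tp, fp, fn)      # 0.0
```

Accuracy rewards the 990 correct negatives, while F1 ignores true negatives entirely, which is why it exposes the missed positives.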


1637

MCQeasy

A SageMaker endpoint configuration is shown in the exhibit. The company wants to deploy the model to a real-time endpoint. What is missing from this configuration to successfully create the endpoint?

A.The model name is missing

B.The endpoint name is not specified in the configuration

C.The initial instance count is missing

D.The accelerator type is missing

E.The data capture configuration is missing

AnswerB

Endpoint name is provided when creating the endpoint, not in the config.

Why this answer

Option B is correct because the endpoint name is specified when creating the endpoint, not in the endpoint configuration; the configuration already contains the required production variant. Option A is wrong because the model name is specified. Option C is wrong because the instance count is set.

Option D is wrong because accelerator type is optional. Option E is wrong because data capture config is optional.


1638

MCQmedium

A company uses AWS Glue jobs with job bookmarks enabled to process incremental data. They notice that the job processes all data each time instead of only new data. What is the most likely reason?

A.The TempDir is not configured correctly.

B.The job bookmark option is set to 'job-bookmark-enable' but should be 'job-bookmark-disable'.

C.The source data does not have a column that can be used as a bookmark key.

D.The MaxConcurrentRuns is set to 3, which can cause bookmark conflicts.

AnswerD

Multiple concurrent runs can corrupt bookmark state.

Why this answer

Job bookmarks require a unique identifier to track processed data. If the source data lacks a monotonically increasing column or the schema does not include a suitable key, bookmarks may not work. Setting MaxConcurrentRuns > 1 can cause issues with bookmarks.

The correct answer: MaxConcurrentRuns is set to 3, which can interfere with bookmark state.


1639

MCQhard

An engineer runs the AWS CLI command in the exhibit to create a SageMaker endpoint configuration. The endpoint is created successfully, but when invoked, the inference response is slow. The engineer wants to test with a different instance type. Which action should the engineer take?

A.Create a new endpoint configuration and use it to create a new endpoint

B.Modify the existing endpoint directly using the update-endpoint API with a new instance type parameter

C.Delete the endpoint and create a new one with the desired instance type

D.Update the endpoint configuration with the new instance type and then update the endpoint

AnswerD

You can update the endpoint configuration and then call update-endpoint to apply changes.

Why this answer

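Option D is correct because updating the endpoint configuration and then updating the endpoint applies the new instance type. Endpoint configurations are immutable, so in practice this means creating a new endpoint configuration with the new instance type and passing it to update-endpoint. Option A is wrong because creating a second endpoint is less efficient than updating the existing one. Option B is wrong because update-endpoint takes an endpoint configuration name, not an instance type parameter.

Option C is wrong because the endpoint can be updated in place, so deleting it is unnecessary.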
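A minimal sketch of the configuration-then-endpoint update flow, assuming boto3's SageMaker client and its `create_endpoint_config` and `update_endpoint` calls. The endpoint, model, and configuration names, the variant name, and the `RecordingClient` stand-in (used so the example runs without AWS credentials) are hypothetical.

```python
def switch_instance_type(sm_client, endpoint_name, model_name, new_config_name, instance_type):
    """Create a new endpoint configuration and point the existing endpoint at it."""
    sm_client.create_endpoint_config(
        EndpointConfigName=new_config_name,
        ProductionVariants=[
            {
                "VariantName": "AllTraffic",  # hypothetical variant name
                "ModelName": model_name,
                "InitialInstanceCount": 1,
                "InstanceType": instance_type,
            }
        ],
    )
    # The endpoint keeps its name; only the configuration behind it changes.
    sm_client.update_endpoint(
        EndpointName=endpoint_name,
        EndpointConfigName=new_config_name,
    )


class RecordingClient:
    """Hypothetical stand-in for boto3.client("sagemaker") that only records calls."""

    def __init__(self):
        self.calls = []

    def create_endpoint_config(self, **kwargs):
        self.calls.append(("create_endpoint_config", kwargs))

    def update_endpoint(self, **kwargs):
        self.calls.append(("update_endpoint", kwargs))


client = RecordingClient()
switch_instance_type(client, "fraud-endpoint", "fraud-model", "fraud-config-v2", "ml.c5.2xlarge")
```

With real credentials, `boto3.client("sagemaker")` would take the place of the stand-in. The old configuration is left untouched, which makes rolling back a matter of calling `update_endpoint` with the previous configuration name.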


1640

MCQhard

A data scientist is analyzing a dataset with many categorical features. The target variable is binary. Which statistical test should be used to assess the association between each categorical feature and the target?

A.Pearson correlation coefficient

B.Chi-squared test of independence

C.ANOVA

D.Kolmogorov-Smirnov test

AnswerB

Chi-squared tests association between categorical variables.

Why this answer

Option B is correct because the Chi-squared test of independence assesses the association between categorical variables. Option A is wrong because Pearson correlation is for continuous variables. Option C is wrong because ANOVA is for a continuous target.

Option D is wrong because Kolmogorov-Smirnov test is for distribution comparison.
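For intuition, the Chi-squared statistic can be computed by hand from a contingency table of one categorical feature against the binary target. The sketch below does this in plain Python; the 2x2 counts are invented for illustration.

```python
def chi_squared_statistic(table):
    """Return (statistic, degrees of freedom) for a contingency table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the feature and the target were independent.
            expected = row_totals[i] * col_totals[j] / grand_total
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof


# Hypothetical counts: rows = feature category, columns = target (0, 1)
table = [[30, 10], [20, 40]]
stat, dof = chi_squared_statistic(table)
```

Here the statistic is about 16.67 with 1 degree of freedom, well above the 3.841 critical value at the 0.05 level, so independence would be rejected for this made-up table. In practice a library routine such as SciPy's `chi2_contingency` would normally be used instead of hand-rolled code.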


1641

MCQmedium

Refer to the exhibit. An IAM policy is attached to a data engineering team's role. The team needs to upload data to the 'confidential' prefix in the 'my-data-lake' bucket. However, they are receiving 'AccessDenied' errors. What is the likely cause?

A.The condition in the Deny statement requires the team to use a specific source IP address.

B.The Allow statement only grants GetObject and PutObject, but the team needs ListBucket.

C.The Deny statement with the condition explicitly denies access to the 'confidential' prefix for accounts other than 123456789012.

D.The Allow statement's resource does not include the 'confidential' prefix.

AnswerC

The Deny statement applies to all actions on the confidential prefix for accounts not matching 123456789012, overriding the Allow.

Why this answer

Option C is correct. The Deny statement denies all S3 actions on the confidential prefix for any principal account that is not 123456789012. If the team's role is from a different account (e.g., 111111111111), the Deny applies, and an explicit Deny always overrides an Allow.

Option A is incorrect because the condition does not require the team to use a specific source IP address. Option B is incorrect because uploading requires PutObject, which the Allow statement already grants, not ListBucket. Option D is incorrect because, although the Allow statement does not include the confidential prefix explicitly, it is the Deny that overrides the Allow.


1642

MCQeasy

A company is building a binary classifier to predict customer churn. The dataset has 10,000 samples with 500 churners (5% positive class). After training a logistic regression model, the precision is 0.8 and recall is 0.2. Which metric should the data scientist focus on to improve the model's ability to identify churners while minimizing false positives?

A.Increase accuracy

B.Increase precision

C.Increase recall

D.Increase F1 score

AnswerC

Recall is low (0.2), so improving it will capture more churners.

Why this answer

Option C is correct because recall is very low (0.2), meaning the model misses most churners. Improving recall will capture more churners. Option B (precision) is already high at 0.8, so it is not the bottleneck.

Option A (accuracy) is misleading due to class imbalance. Option D (F1) balances precision and recall, but the primary concern is low recall.
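The figures in the question can be turned into confusion-matrix counts to show what a recall of 0.2 means in practice. This is a worked-arithmetic sketch; the helper name is hypothetical, and counts are rounded because precision and recall are given to one decimal place.

```python
def counts_from_metrics(positives, precision, recall):
    """Derive TP, FN, FP from the number of actual positives, precision, and recall."""
    true_positives = round(recall * positives)               # churners the model catches
    false_negatives = positives - true_positives             # churners the model misses
    predicted_positives = round(true_positives / precision)  # everything flagged as churn
    false_positives = predicted_positives - true_positives
    return true_positives, false_negatives, false_positives


tp, fn, fp = counts_from_metrics(positives=500, precision=0.8, recall=0.2)
f1 = 2 * 0.8 * 0.2 / (0.8 + 0.2)  # about 0.32
```

With 500 churners, the model catches 100, misses 400, and raises 25 false alarms, which shows why the missed churners, not the false positives, dominate the error.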


1643

MCQeasy

An S3 event notification is configured to trigger a Lambda function when new objects are created. The Lambda function processes the event JSON shown. Which field should the function use to read the new object from S3?

A.s3.s3SchemaVersion

B.awsRegion

C.eventName

D.s3.bucket.arn and s3.object.key

AnswerD

These provide the bucket ARN and object key.

Why this answer

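Of the listed fields, only option D identifies both the bucket and the object. The event JSON also contains the bucket name under s3.bucket.name, which together with the object key under s3.object.key lets the function construct the S3 URI and read the object. The other fields (schema version, region, and event name) do not locate the object.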
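A short sketch of how a Lambda handler might pull the bucket name and object key out of an S3 event. The sample event below is a trimmed, hypothetical payload, not the exhibit from the question. Object keys arrive URL-encoded in S3 event notifications, so the key is decoded before use.

```python
from urllib.parse import unquote_plus


def object_locations(event):
    """Return (bucket name, object key) pairs for each record in an S3 event."""
    locations = []
    for record in event.get("Records", []):
        s3_info = record["s3"]
        bucket = s3_info["bucket"]["name"]
        # Keys are URL-encoded in the notification (e.g. spaces arrive as '+').
        key = unquote_plus(s3_info["object"]["key"])
        locations.append((bucket, key))
    return locations


# Trimmed, hypothetical event payload
sample_event = {
    "Records": [
        {
            "awsRegion": "us-east-1",
            "eventName": "ObjectCreated:Put",
            "s3": {
                "s3SchemaVersion": "1.0",
                "bucket": {"name": "example-bucket", "arn": "arn:aws:s3:::example-bucket"},
                "object": {"key": "uploads/my+file.csv"},
            },
        }
    ]
}
locations = object_locations(sample_event)
```

Each resulting pair could then be passed to an S3 client's `get_object` call as its `Bucket` and `Key` arguments.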


1644

MCQmedium

A machine learning engineer is deploying a PyTorch model on SageMaker for real-time inference. The model requires GPU for low latency. Which instance type and configuration should the engineer choose?

A.Deploy to an ml.c5.4xlarge instance with SageMaker batch transform.

B.Deploy to an ml.m5.large instance with a SageMaker model endpoint.

C.Deploy to an ml.p3.2xlarge instance with a SageMaker endpoint.

D.Deploy to an ml.p3.2xlarge instance with SageMaker batch transform.

AnswerC

p3 provides GPU; endpoint enables real-time inference.

Why this answer

SageMaker real-time endpoints support GPU instances like ml.p3.2xlarge. Option A (ml.c5.4xlarge with batch transform) is CPU only and not real-time. Option B (ml.m5.large) is CPU only.

Option D (ml.p3.2xlarge with batch transform) provides a GPU, but batch transform is not real-time; an endpoint is needed.


1645

MCQhard

Refer to the exhibit. A data scientist is running an Amazon EMR Spark job for exploratory data analysis on a large dataset. The job fails with the error shown. What is the most appropriate action to resolve this?

A.Reduce the number of worker nodes.

B.Convert the input data to Parquet format.

C.Increase the executor memory in Spark configuration.

D.Increase the driver memory.

AnswerC

More memory per executor prevents heap overflow.

Why this answer

Option C is correct because increasing the executor memory in the Spark configuration lets each executor handle larger data. Option A (fewer worker nodes) reduces resources; Option B (Parquet) may help but does not directly address memory; Option D (more driver memory) may not help if the executors are the issue.


1646

MCQhard

A company uses Amazon SageMaker to train a custom TensorFlow model for image classification. The training job runs on a single ml.p3.2xlarge instance. The dataset contains 500,000 images stored in S3. The training time is too long (over 24 hours). The data scientist wants to reduce training time without changing the model architecture. The dataset is already in TFRecord format. The training script uses the default TensorFlow data pipeline. Which change will MOST significantly reduce training time?

A.Use SageMaker Pipe mode and increase the number of data files.

B.Use SageMaker's distributed data parallelism with multiple instances.

C.Switch the input mode from File to Pipe.

D.Optimize the data pipeline using tf.data.Dataset.prefetch and cache.

AnswerB

Distributed training across multiple GPUs significantly reduces wall-clock training time.

Why this answer

Option B is correct. Using SageMaker's distributed data parallelism across multiple instances spreads each batch over more GPUs, so wall-clock training time falls roughly in proportion to the number of GPUs, less communication overhead. Options A and C are wrong because Pipe mode streams data from S3 and eases File mode I/O bottlenecks but does not reduce computation.

Option D is wrong because optimizing the data pipeline with prefetch and cache helps, but less than adding more compute.


1647

MCQhard

A media company uses SageMaker to train a neural network for content recommendation. The model uses embeddings for users and items. Training is slow and they want to reduce time. The dataset has 10 million users and 1 million items. They have a cluster of 8 p3.16xlarge instances. Which strategy is most likely to reduce training time?

A.Use data parallelism to replicate the model on each GPU and synchronize gradients

B.Reduce the embedding dimension from 256 to 64

C.Use SageMaker's model parallelism to split the embedding layers across GPUs

D.Use a smaller batch size to fit on each GPU

AnswerC

Model parallelism distributes large embedding tables across devices, reducing memory and enabling larger batches.

Why this answer

Model parallelism is designed for large models with memory-intensive layers like embeddings.


1648

MCQmedium

A company is using AWS Glue to run ETL jobs that transform data from Amazon S3 to Amazon Redshift. The jobs are currently failing due to insufficient memory. The data volume varies, with occasional spikes. Which solution should be used to handle the variable memory requirements efficiently?

A.Migrate the ETL jobs to Amazon EMR with Apache Spark

B.Use AWS Glue Flex execution to allocate resources dynamically

C.Increase the number of DPUs (Data Processing Units) for all jobs

D.Split the jobs into smaller steps and run them sequentially

AnswerB

Flex execution provides flexible resources that adapt to workload, optimizing cost and performance.

Why this answer

AWS Glue Flex execution allows jobs to use flexible resources that can handle varying memory needs, and it is cost-effective for variable workloads. Option A (migrating to Apache Spark on EMR) increases management overhead. Option C (increasing DPUs for all jobs) would waste resources during low volume.

Option D (splitting jobs) adds complexity and may not handle spikes.


1649

MCQeasy

A company is using Amazon SageMaker to train a deep learning model. The training job is taking a long time, and the data scientist wants to reduce training time without sacrificing accuracy. Which technique should they use?

A.Use a larger instance type

B.Use a smaller instance type

C.Reduce the number of epochs

D.Use managed spot training with checkpointing

AnswerD

Spot instances are cheaper and can speed up training with checkpointing for interruptions.

Why this answer

Option D is correct because managed spot training can reduce cost and training time by using preemptible instances, often with checkpointing. Option A is wrong because a larger instance type may not reduce time proportionally and could increase cost. Option C is wrong because reducing epochs may sacrifice accuracy.

Option B is wrong because using a smaller instance is likely to increase time.


1650

Multi-Selecthard

A machine learning team is deploying a real-time inference endpoint for a fraud detection model using Amazon SageMaker. The model is a LightGBM classifier trained on 1 GB of tabular data. The endpoint must respond within 100 ms for 99% of requests, with a throughput of 10 requests per second. During load testing, the team observes that the 99th percentile latency is 250 ms and the endpoint CPU utilization is consistently above 90%. The team has already selected an ml.c5.xlarge instance with auto scaling enabled. Which combination of actions should the team take to meet the latency requirement? (Choose 3.)

Select 3 answers

A.Upgrade the instance type to ml.c5.2xlarge to increase CPU resources per instance.

B.Reduce the number of trees in the LightGBM model to decrease inference time.

C.Enable SageMaker's data compression for endpoint input payloads.

D.Switch to using SageMaker Batch Transform instead of a real-time endpoint.

AnswersA, B, C

More CPU reduces per-request processing time, lowering latency.

Why this answer

Option A (upgrading to ml.c5.2xlarge) provides more CPU capacity, reducing latency. Option B (reducing the number of trees in LightGBM) directly reduces inference computation time. Option C (enabling SageMaker's data compression) reduces network transfer time and I/O overhead.

Option D (using Batch Transform instead of a real-time endpoint) is a fundamental change in architecture that would not meet real-time requirements. Increasing the instance count is already handled by auto scaling, and on its own it may not reduce latency per request if each instance is saturated.

