AWS Certified Machine Learning Specialty (MLS-C01): Questions 826-900

1755 questions total · 24 pages · All types, answers revealed

Page 12 of 24

826
Multi-Selectmedium

A data scientist is training a binary classification model on an imbalanced dataset (95% negative class, 5% positive class). The model currently achieves 94% accuracy but a recall of only 0.10 on the positive class. Which TWO strategies should the data scientist consider to improve recall without significantly sacrificing precision? (Choose 2.)

Select 2 answers
A.Undersample the majority class to match the minority class size.
B.Increase the regularization strength to reduce overfitting.
C.Assign higher class weights to the positive class in the loss function.
D.Use a deeper neural network with more layers.
E.Oversample the minority class using SMOTE.
AnswersC, E

Higher weight for positive class penalizes false negatives, improving recall.

Why this answer

Assigning higher class weights to the positive class (option C) penalizes misclassified positives more heavily, which pushes the model to reduce false negatives. Oversampling the minority class with SMOTE (option E) increases the number of positive examples, which helps the model learn a better decision boundary for the positive class. Both techniques directly address class imbalance.

Option A (undersampling) may discard useful negative samples and harm performance. Option B (increasing regularization) typically reduces overfitting but does not specifically improve recall. Option D (using a deeper network) may increase overfitting and does not target recall directly.
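As a rough illustration of class weighting, the sketch below computes inverse-frequency weights with the common "balanced" heuristic, n_samples / (n_classes * class_count). The helper name balanced_class_weights and the toy labels are assumptions made for this example; they are not part of the question.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return {class: n_samples / (n_classes * class_count)} for each class."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Toy labels mirroring the 95% negative / 5% positive split in the question.
labels = [0] * 95 + [1] * 5
weights = balanced_class_weights(labels)
print(weights)
```

With these weights, an error on a positive example contributes far more to the loss than an error on a negative example, which is the mechanism that pushes recall up.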

827
MCQhard

A data engineer needs to build a pipeline that ingests CSV files from an S3 bucket, validates the schema, and loads the data into an Amazon Redshift cluster. The pipeline must handle schema evolution gracefully by adding new columns as they appear in the source files. Which combination of AWS services and configurations would meet these requirements with minimal operational overhead?

A.Use AWS Glue to create a crawler that updates the schema, then use Redshift Spectrum to query the data directly from S3
B.Use Amazon Kinesis Data Firehose to ingest the files and load into Redshift, with a Lambda function to detect schema changes
C.Use Amazon Athena to create external tables with schema-on-read, and insert results into Redshift using INSERT INTO
D.Use AWS Glue to create a crawler and an ETL job that writes to Redshift, with 'resolveChoice' to handle new columns
AnswerD

Glue handles schema evolution via DynamicFrame and resolveChoice, and loads into Redshift.

Why this answer

Option D is correct because AWS Glue can crawl the S3 data to infer schema, and Glue ETL jobs can handle schema evolution using DynamicFrame and resolveChoice. Option B is wrong because Kinesis Data Firehose is for streaming, not batch CSV files. Option A is wrong because Redshift Spectrum does not handle schema evolution automatically.

Option C is wrong because Athena is an interactive query engine, not an ETL pipeline.

828
MCQeasy

A company wants to build a real-time anomaly detection system for IoT sensor data. The data arrives as a stream of numerical values. The model should adapt to concept drift over time. Which approach is most suitable?

A.Train an online learning model, such as stochastic gradient descent (SGD) with a sliding window
B.Use a static deep learning model trained once on historical data
C.Use a stateful LSTM with fixed weights
D.Batch train a random forest model monthly
AnswerA

Online learning updates the model incrementally, allowing adaptation to concept drift.

Why this answer

Option A is correct because online learning with stochastic gradient descent (SGD) using a sliding window allows the model to continuously update its parameters as new IoT sensor data arrives, adapting to concept drift without retraining from scratch. The sliding window ensures that the model focuses on the most recent data distribution, discarding outdated patterns, which is essential for real-time anomaly detection in streaming environments.

Exam trap

AWS often tests the misconception that stateful recurrent models (like LSTMs) inherently adapt to concept drift, but without weight updates they remain static; the trap here is confusing 'statefulness' (which preserves temporal context across batches) with 'online learning' (which updates model parameters).

How to eliminate wrong answers

Option B is wrong because a static deep learning model trained once on historical data cannot adapt to concept drift; it will become stale as the data distribution changes over time, leading to degraded anomaly detection performance. Option C is wrong because a stateful LSTM with fixed weights does not update its parameters after deployment, so it cannot adapt to evolving patterns in the streaming data, and its statefulness alone does not enable learning from new data. Option D is wrong because batch training a random forest model monthly introduces a significant delay between data arrival and model update, which is unsuitable for real-time anomaly detection and cannot handle gradual or sudden concept drift between retraining intervals.
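The idea behind option A can be sketched in plain Python. This is a minimal illustration, not a production detector: the class name OnlineMeanDetector, the learning rate, the window size, and the threshold are all assumptions chosen for the example. The model is a single parameter (the current signal level) updated with one SGD step per observation, and a sliding window of recent residuals sets the anomaly threshold.

```python
from collections import deque
from statistics import pstdev


class OnlineMeanDetector:
    """Tracks a drifting signal level with SGD and flags large residuals."""

    def __init__(self, learning_rate=0.1, window=50, threshold=4.0):
        self.level = None                       # single model parameter
        self.lr = learning_rate
        self.residuals = deque(maxlen=window)   # sliding window of recent errors
        self.threshold = threshold

    def update(self, x):
        """Score one observation, then take one SGD step. Returns True if anomalous."""
        if self.level is None:
            self.level = x
        residual = x - self.level
        # Only score once the window holds enough history to estimate spread.
        scale = pstdev(self.residuals) if len(self.residuals) >= 10 else 0.0
        is_anomaly = bool(scale > 0 and abs(residual) > self.threshold * scale)
        # SGD step on the loss 0.5 * (x - level)^2, whose gradient is -residual.
        self.level += self.lr * residual
        self.residuals.append(residual)
        return is_anomaly


detector = OnlineMeanDetector()
# Stable phase: readings alternate around 10.
stable_flags = [detector.update(10.0 + (0.5 if i % 2 else -0.5)) for i in range(100)]
# A sudden spike should be flagged.
spike_flag = detector.update(30.0)
# Concept drift: the normal level moves to 20 and the model follows it.
for i in range(300):
    detector.update(20.0 + (0.5 if i % 2 else -0.5))
print(spike_flag, round(detector.level, 2))
```

Because the parameter is updated on every reading, the model follows the new level after the drift, which a model with fixed weights (options B and C) cannot do.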

829
Multi-Selecthard

A data scientist is deploying a model on Amazon SageMaker for real-time inference. The model is a PyTorch model that requires custom inference code. The data scientist needs to handle variable-length inputs and optimize inference latency. Which THREE steps should the data scientist take? (Choose THREE.)

Select 3 answers
A.Enable SageMaker batch transform to process requests in batches.
B.Use the SageMaker PyTorch container without any modifications.
C.Set the endpoint to use multiple variants for A/B testing.
D.Use TorchScript to compile the model for optimized inference.
E.Provide a custom inference script (inference.py) that defines how to load the model and process requests.
AnswersA, D, E

Batching reduces latency for multiple requests.

Why this answer

The answer key lists A, D, and E, but option A does not fit the scenario. SageMaker batch transform groups requests to improve throughput, yet it is designed for offline, asynchronous processing and cannot serve a real-time endpoint. For real-time inference with variable-length inputs and low latency, the steps that apply are B, D, and E: use the prebuilt SageMaker PyTorch container (B), compile the model with TorchScript (D), and supply a custom inference script (inference.py) that loads the model and processes requests (E). Treat the marking of A as a trap.

Exam trap

AWS often tests the distinction between batch transform (offline, asynchronous) and real-time inference (synchronous, low-latency), leading candidates to mistakenly select batch transform for real-time scenarios.

830
MCQmedium

A company is deploying a real-time inference endpoint using Amazon SageMaker. The model is a large deep learning model that requires GPU inference. The company wants to minimize latency and cost. Which instance type and deployment strategy should be used?

A.Use a serverless inference endpoint with a GPU instance.
B.Use a real-time endpoint with a GPU instance and enable multi-model endpoints.
C.Use a batch transform job with a GPU instance.
D.Use an asynchronous inference endpoint with a GPU instance.
AnswerB

Multi-model endpoints reduce cost by sharing GPU across models.

Why this answer

Option B is correct because SageMaker real-time endpoints with multi-model endpoints allow hosting multiple models on a single GPU instance, reducing cost while maintaining low latency. Option C is wrong because batch transform is not real-time. Option A is wrong because Serverless Inference does not support GPUs.

Option D is wrong because asynchronous inference is not real-time.

831
MCQeasy

A company uses Amazon SageMaker to deploy a model for real-time inference. The model is a linear regression model that was trained using the SageMaker built-in Linear Learner algorithm. The endpoint is configured with an ml.m5.large instance. After deployment, the company notices that the endpoint returns incorrect predictions. The training data was normalized, but the inference requests send raw feature values without normalization. What should the company do to fix the issue?

A.Retrain the model using raw data without normalization.
B.Change the endpoint instance type to a GPU instance to handle the raw data.
C.Create a SageMaker inference pipeline that includes a preprocessing step to normalize the input data before passing it to the model.
D.Use a batch transform job to preprocess the data before sending it to the endpoint.
AnswerC

Correct: This ensures real-time raw data is normalized before inference.

Why this answer

The model expects normalized input. The inference pipeline must include a preprocessing step to normalize the data. Using a SageMaker inference pipeline with a preprocessing container (e.g., scikit-learn) before the model container is the correct approach.

Option C is correct. Option A (retrain with raw data) is a viable alternative but would require retraining and may reduce model performance. Option D (batch transform job) is for batch inference, not real-time.

Option B (change instance type) does not address the data mismatch.
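The mismatch can be shown with a small sketch. It assumes a single feature and a hypothetical linear model (weight 2.0, bias 5.0) trained on standardized inputs; the function names are illustrative and do not come from the SageMaker SDK. The point is only that the statistics learned at training time must be applied again at inference time, which is the job of the preprocessing container in an inference pipeline.

```python
from statistics import mean, pstdev


def fit_scaler(values):
    """Learn the mean and standard deviation from the training column."""
    return mean(values), pstdev(values)


def normalize(x, mu, sigma):
    """Apply the training-time statistics to a single raw value."""
    return (x - mu) / sigma


def predict(x_normalized, weight=2.0, bias=5.0):
    # Hypothetical linear model that was trained on normalized inputs.
    return weight * x_normalized + bias


train_feature = [100.0, 200.0, 300.0]
mu, sigma = fit_scaler(train_feature)

raw_request = 200.0
wrong = predict(raw_request)                         # raw value fed directly to the model
right = predict(normalize(raw_request, mu, sigma))   # preprocessing step applied first
print(wrong, right)
```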

832
MCQeasy

A data engineer needs to process streaming data from an IoT fleet and store the results in Amazon S3 for analysis. The solution must be serverless and handle data that arrives at irregular intervals. Which AWS service should be used to ingest the data?

A.Amazon S3
B.AWS IoT Core
C.Amazon Simple Queue Service (SQS)
D.Amazon Kinesis Data Streams
AnswerB

AWS IoT Core provides secure device connectivity, message routing, and integrates with serverless processing.

Why this answer

Option B is correct because AWS IoT Core is designed to ingest data from IoT devices securely and at scale, and it integrates with other AWS services for processing. Option D is wrong because Kinesis Data Streams is for real-time streaming but not specifically for IoT device connectivity. Option C is wrong because SQS is a message queue, not optimized for IoT ingestion.

Option A is wrong because S3 is storage, not ingestion.

833
MCQhard

Refer to the exhibit. An ML engineer applies this bucket policy to an S3 bucket. The SageMaker execution role MySageMakerRole is used to train a model. The training data is located in s3://my-bucket/data/. The SageMaker training job fails with an access error. What is the most likely cause?

A.The policy allows GetObject only from the data/ prefix, but the training job uses a different prefix.
B.The role is not in the same AWS account as the bucket.
C.The Deny statement on s3:ListBucket prevents the role from listing objects in the bucket.
D.The bucket has default encryption enabled, causing a conflict.
AnswerC

SageMaker may need to list objects to iterate over files; the explicit deny blocks this.

Why this answer

Option C is correct because the Deny statement explicitly denies the ListBucket action to all principals, including the SageMaker role. Even though GetObject is allowed, SageMaker often needs to list objects to read data from a prefix. Option B is wrong because the role is in the same account.

Option D is wrong because the bucket is not encrypted. Option A is wrong because the policy does not restrict GetObject.

834
MCQmedium

A company is using Amazon SageMaker to train machine learning models. The training data is stored in Amazon S3, but the data includes personally identifiable information (PII) that must be anonymized before training. What is the most efficient way to anonymize the data?

A.Use an AWS Glue ETL job to read from S3, apply anonymization, and write to another S3 bucket.
B.Use Amazon Athena to query the data and apply anonymization functions.
C.Use Amazon Redshift Spectrum to query and anonymize data in S3.
D.Use a SageMaker Processing job to read from S3 and apply anonymization.
AnswerA

Glue is a serverless ETL service that can efficiently transform large datasets.

Why this answer

Option A is correct because AWS Glue can run a transformation job to anonymize PII before training. Option D is wrong because SageMaker Processing jobs are for feature engineering, not data anonymization from S3. Option B is wrong because Athena is for querying, not transforming.

Option C is wrong because Redshift Spectrum queries data in S3 but does not anonymize efficiently.

835
MCQmedium

A company is training a deep learning model on Amazon SageMaker using a large dataset stored in S3. The training job is failing with an error indicating insufficient memory. The model architecture and hyperparameters are fixed. Which change is MOST likely to resolve the issue without modifying the model code?

A.Enable SageMaker's distributed data parallelism.
B.Use managed Spot training to get cheaper compute.
C.Use a larger instance type with more memory.
D.Use Pipe mode for input data instead of File mode.
AnswerA

Distributed data parallelism splits the minibatch across multiple GPUs/instances, reducing per-device memory footprint.

Why this answer

Option A is correct because enabling data parallelism with SageMaker distributed training splits the data across multiple instances, reducing per-instance memory usage. Option C is wrong because increasing instance memory does not address the root cause if the training script uses memory inefficiently. Option D is wrong because using Pipe mode reduces disk usage but not memory.

Option B is wrong because Spot instances do not affect memory.

836
MCQeasy

A data scientist trains a linear regression model to predict house prices. The model has high bias (underfitting). Which action is most likely to reduce bias?

A.Reduce the number of features
B.Decrease the maximum depth of the tree
C.Increase model complexity
D.Add L1 regularization
AnswerC

More complex models can capture underlying patterns better, reducing bias.

Why this answer

Increasing model complexity (e.g., adding polynomial features or using a more flexible algorithm) can reduce bias. Adding L1 regularization increases bias, reducing features reduces complexity, and lowering max_depth for a tree also increases bias.

837
MCQmedium

A data scientist is using Amazon SageMaker built-in XGBoost algorithm to train a regression model. The training job completes successfully but the model performance on the test set is poor, with high bias. Which hyperparameter adjustment is most likely to help reduce bias?

A.Increase the max_depth parameter.
B.Reduce the num_round parameter.
C.Increase the gamma parameter.
D.Decrease the max_depth parameter.
AnswerA

Increasing max_depth allows trees to learn more complex patterns, reducing bias.

Why this answer

High bias (underfitting) can be reduced by increasing the model complexity. Increasing max_depth allows more complex trees. Decreasing max_depth would increase bias.

Increasing gamma increases regularization and bias. Reducing num_round (number of trees) reduces complexity.

838
Multi-Selecthard

A data scientist is analyzing a dataset with several categorical features and a binary target. The scientist wants to check for association between each categorical feature and the target. Which THREE statistical tests are appropriate?

Select 3 answers
A.ANOVA
B.Pearson correlation coefficient
C.Chi-square test of independence
D.Mutual information
E.Cramér's V
AnswersC, D, E

Tests association between two categorical variables.

Why this answer

Options C, D, and E are correct. Chi-square test of independence is for categorical-categorical association. Cramér's V is a measure of association based on chi-square.

Mutual information is a non-parametric measure that can capture non-linear dependencies. Option A is wrong because ANOVA is for categorical vs continuous. Option B is wrong because Pearson correlation is for continuous variables.
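To make the relationship between the chi-square statistic and Cramér's V concrete, here is a minimal sketch in plain Python. The contingency table is invented for illustration (rows are categories of a feature, columns are the two target classes), and the sketch computes only the statistic and the effect size, not a p-value.

```python
from math import sqrt


def chi_square(table):
    """Return (chi-square statistic, total count) for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat, n


def cramers_v(table):
    """Cramér's V = sqrt(chi2 / (n * (min(rows, cols) - 1)))."""
    stat, n = chi_square(table)
    k = min(len(table), len(table[0])) - 1
    return sqrt(stat / (n * k))


# Hypothetical counts: feature category (rows) vs. binary target (columns).
table = [[10, 20],
         [20, 10]]
stat, n = chi_square(table)
v = cramers_v(table)
print(round(stat, 3), round(v, 3))
```

Cramér's V rescales the chi-square statistic to the range 0 to 1, which makes the strength of association comparable across features with different numbers of categories.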

839
MCQhard

A data scientist is performing EDA on a dataset of customer churn. The dataset includes a categorical feature 'Region' with 100 unique values. What is the best way to encode this feature for a tree-based model?

A.Replace each category with its frequency in the dataset
B.Use the feature as a categorical variable directly in the tree-based model
C.Label encode the feature (assign integers 0-99)
D.One-hot encode the feature
AnswerB

Many tree-based models (e.g., LightGBM, CatBoost) handle high-cardinality categoricals efficiently.

Why this answer

Option B is correct because tree-based models can handle high-cardinality categorical features natively without encoding; many implementations (e.g., LightGBM, CatBoost) support categorical features directly. Option D is wrong because one-hot encoding creates 100 columns, causing sparsity. Option C is wrong because label encoding imposes ordinality.

Option A is wrong because frequency encoding maps categories with the same count to the same value, so distinct regions can become indistinguishable.

840
MCQeasy

A data scientist is training a binary classification model on a highly imbalanced dataset where the positive class represents only 1% of the data. Which metric should be used to evaluate model performance during training to ensure the model is learning to detect the positive class?

A.F1 score
B.Accuracy
C.Precision
D.Recall
AnswerA

F1 score balances precision and recall, suitable for imbalanced classification.

Why this answer

Accuracy is misleading for imbalanced datasets because a model that predicts the majority class all the time can achieve 99% accuracy. F1 score balances precision and recall, making it suitable for imbalanced classification. Precision, recall, and AUC are also useful, but F1 is a common single metric for imbalanced binary classification.

Option B: Accuracy is not suitable. Option C: Precision alone ignores recall. Option A: F1 score is correct.

Option D: Recall alone ignores precision.
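The gap between accuracy and F1 on a 1% positive dataset can be checked with a few lines of plain Python. The data below is synthetic (10 positives, 990 negatives) and the "model" is a stand-in that always predicts the majority class; the helper name metrics is an assumption for this sketch.

```python
def metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall, and F1 for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Synthetic labels with 1% positives, scored against an always-negative predictor.
y_true = [1] * 10 + [0] * 990
y_pred = [0] * 1000
result = metrics(y_true, y_pred)
print(result)
```

The always-negative predictor scores 99% accuracy while its recall and F1 are both zero, which is why accuracy alone cannot show whether the positive class is being learned.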

841
MCQmedium

A data scientist is training a binary classification model on a dataset with 100,000 positive samples and 1,000 negative samples. The model achieves 99% accuracy on the test set but a very low F1 score. What is the most likely cause?

A.The test set contains only positive samples
B.The model is overfitting due to too many features
C.The model is underfitting due to insufficient training
D.The model predicts the majority class most of the time due to class imbalance
AnswerD

Class imbalance causes the model to be biased toward the majority class, leading to high accuracy but low F1.

Why this answer

The accuracy is high because the model predicts the majority class (positive) most of the time, but the F1 score is low because it fails to identify the minority class (negative) correctly. This is a classic symptom of class imbalance where the model is biased toward the majority class.

842
Multi-Selecteasy

Which TWO of the following are common techniques for detecting outliers in a dataset?

Select 2 answers
A.Z-score
B.Interquartile range (IQR) method
C.Principal Component Analysis (PCA)
D.K-means clustering
E.Standard scaling
AnswersA, B

Z-score measures how many standard deviations a point is from the mean; values beyond a threshold (e.g., 3) are outliers.

Why this answer

Z-score identifies outliers based on standard deviations from the mean. IQR method uses quartile ranges to flag points outside 1.5*IQR. Standard scaling, PCA, and K-means are not primarily outlier detection methods.
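Both rules fit in a short sketch using only the Python standard library. The sample data, the threshold of 3 standard deviations, and the 1.5 multiplier are the conventional defaults mentioned above; the function names are assumptions for this example. Quartiles come from statistics.quantiles with its default method, so other tools may place the fences slightly differently.

```python
from statistics import mean, pstdev, quantiles


def zscore_outliers(values, threshold=3.0):
    """Flag values more than `threshold` standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []
    return [v for v in values if abs(v - mu) / sigma > threshold]


def iqr_outliers(values, k=1.5):
    """Flag values below Q1 - k*IQR or above Q3 + k*IQR."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Twenty ordinary readings plus one extreme value.
data = [10, 11, 12, 13] * 5 + [95]
z_flags = zscore_outliers(data)
iqr_flags = iqr_outliers(data)
print(z_flags, iqr_flags)
```

Note that a single extreme value inflates the mean and standard deviation, so on very small samples the z-score rule can miss outliers that the IQR rule still catches.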

843
Multi-Selectmedium

A company is designing a data pipeline to analyze customer behavior. The pipeline must handle real-time streaming data and batch data. The data must be stored in a data lake on Amazon S3 and also made available for interactive queries. Which THREE services should be combined to build this pipeline? (Choose THREE.)

Select 3 answers
A.Amazon Kinesis Data Streams
B.AWS Glue
C.Amazon Redshift
D.Amazon DynamoDB Streams
E.Amazon Athena
AnswersA, B, E

Real-time data ingestion.

Why this answer

Amazon Kinesis Data Streams ingests real-time data. AWS Glue can perform ETL and catalog the data. Amazon Athena allows interactive querying on S3.

Options C and D are not needed or redundant.

844
MCQhard

An IAM policy is attached to a data scientist's role. The scientist is trying to list objects in the 'data-bucket' using Amazon Athena. The query fails with an access denied error. What is the MOST likely reason?

A.The policy does not allow s3:ListBucket on the bucket.
B.The policy has a syntax error.
C.The query is trying to read data from the 'sensitive/' prefix.
D.The s3:GetObject action is explicitly denied for all objects.
AnswerC

Deny overrides Allow for that prefix.

Why this answer

Option C is correct because Athena needs s3:GetObject to read the underlying data, and the Deny statement blocks access to objects under the 'sensitive/' prefix. A query that reads from that prefix therefore fails, since an explicit Deny overrides any Allow. Option A is wrong because ListBucket is allowed.

Option D is wrong because the Deny blocks GetObject only under the 'sensitive/' prefix, not for all objects. Option B is wrong because the policy is valid.

845
MCQmedium

A data scientist is performing EDA on a dataset with a timestamp column. They want to detect seasonality. Which visualization is most appropriate?

A.Box plot of value grouped by month
B.Bar chart of average value per month
C.Line plot of value over time
D.Scatter plot of timestamp vs. value
AnswerC

A line plot of the value over time makes repeating seasonal patterns visible.

Why this answer

Option C is correct because a time series line plot clearly shows seasonal patterns. Option B is wrong because a bar chart of monthly averages may not show seasonality within months. Option D is wrong because a scatter plot of timestamp vs. value may be cluttered.

Option A is wrong because a box plot by month shows distribution, not trend over time.

846
MCQhard

A data scientist is training a deep learning model on Amazon SageMaker and notices that training is taking much longer than expected. The training job uses a single GPU instance. The model is a large transformer with millions of parameters. Which change would most likely reduce training time?

A.Reduce the batch size to fit in memory
B.Use a smaller instance type
C.Switch to a CPU instance
D.Use SageMaker's distributed data parallelism with multiple GPU instances
AnswerD

Data parallelism splits the mini-batch across GPUs, reducing training time.

Why this answer

Using data parallelism with multiple GPU instances can significantly reduce training time for large models by distributing the workload across multiple GPUs. Model parallelism is also possible but data parallelism is more common and easier to implement.

847
Multi-Selecteasy

A machine learning engineer is setting up a training job in Amazon SageMaker. Which THREE components are required to define a training job? (Choose three.)

Select 3 answers
A.VPC configuration for network isolation.
B.Hyperparameters for the algorithm.
C.Output data configuration (e.g., model artifact path).
D.An algorithm or custom container image.
E.Input data configuration (e.g., S3 path).
AnswersC, D, E

Specifies where to save output.

Why this answer

Options C, D, and E are correct. An algorithm or container, input data configuration, and output data configuration are required. Option B is wrong because hyperparameters are optional.

Option A is wrong because an IAM role is required, not a VPC (though VPC is common).

848
Multi-Selecthard

A machine learning team is building a multi-class image classifier using a pre-trained ResNet-50 model in Amazon SageMaker. The dataset has 10 classes but is highly imbalanced, with one class representing 80% of the samples. The team wants to improve model performance on the minority classes. Which TWO of the following approaches are most likely to help? (Select TWO.)

Select 2 answers
A.Oversample the minority classes in the training data.
B.Reduce the batch size to increase the frequency of weight updates.
C.Increase the number of layers in the model.
D.Switch to a focal loss function.
E.Use class weighting in the loss function.
AnswersA, E

Oversampling increases representation of minority classes, balancing the training set.

Why this answer

Oversampling the minority classes (Option A) directly addresses class imbalance by replicating samples from underrepresented classes, giving the model more exposure to them during training. This is a standard data-level technique that helps the ResNet-50 model learn discriminative features for minority classes without altering the loss function or model architecture.

Exam trap

The trap here is that candidates may incorrectly select focal loss (Option D) as a standalone answer, but the question requires exactly two correct options, and class weighting (Option E) is a more straightforward loss-modification technique that is explicitly tested in the MLS-C01 exam as a standard approach for imbalanced classification.

849
Multi-Selecteasy

Which TWO AWS services can be used to transform data in a streaming fashion without using a persistent cluster? (Choose 2.)

Select 2 answers
A.AWS Glue
B.Amazon EMR
C.AWS Lambda
D.Amazon Kinesis Data Analytics
E.AWS Data Pipeline
AnswersC, D

Lambda can process streaming data from Kinesis or DynamoDB Streams serverlessly.

Why this answer

Option C (Lambda) and Option D (Kinesis Data Analytics) are serverless streaming transformation services. Option A (Glue) is serverless but not low-latency streaming. Option B (EMR) requires a cluster.

Option E (Data Pipeline) is for batch.

850
Multi-Selecteasy

Which TWO of the following are common techniques for detecting outliers in a numerical feature?

Select 2 answers
A.Chi-square test
B.Standard deviation
C.Interquartile Range (IQR)
D.Z-score
E.Principal Component Analysis (PCA)
AnswersC, D

Outliers are defined as points beyond 1.5*IQR from Q1 or Q3.

Why this answer

Z-score and IQR are standard outlier detection methods. PCA can detect outliers but is not a common direct method. Chi-square is for categorical association.

Standard deviation alone is not a method.

851
MCQhard

A financial services company needs to build a data lake on Amazon S3 that meets regulatory requirements for data retention and encryption. Data must be encrypted at rest and in transit, and access must be audited. The data lake will be queried by Amazon Athena and Amazon Redshift Spectrum. Which combination of actions should be taken?

A.Enable S3 default encryption with SSE-KMS and enable AWS CloudTrail for S3 data events.
B.Use IAM policies to control access and enable S3 server access logging.
C.Use SSL/TLS for all connections and enable S3 versioning.
D.Enable S3 default encryption with SSE-S3 and use S3 access logs.
AnswerA

SSE-KMS provides encryption with managed keys; CloudTrail logs data events for auditing.

Why this answer

Option A is correct because server-side encryption with KMS (SSE-KMS) provides encryption at rest, and CloudTrail logs S3 API calls for auditing. Option D is wrong because SSE-S3 does not provide key management control. Option C is wrong because SSL/TLS is for in-transit encryption, not at rest.

Option B is wrong because IAM does not provide encryption.

852
MCQmedium

A machine learning engineer is deploying a model using AWS Lambda for real-time inference. The model is a scikit-learn RandomForestClassifier with 100 trees, serialized as a pickle file of 150 MB. The Lambda function has 3 GB memory allocated. However, the inference requests are timing out after 30 seconds. What is the most likely cause?

A.scikit-learn is not compatible with AWS Lambda.
B.The Lambda function does not have enough memory to load the model.
C.The model is loaded from S3 on every invocation, causing high latency.
D.The Lambda function timeout is set too low; increase it to 5 minutes.
AnswerC

Lambda should load the model outside the handler to reuse across invocations, but even then, cold starts with a large model are slow.

Why this answer

Option C is correct because the default behavior of loading a model from S3 on every Lambda invocation introduces significant latency. Each invocation must download the 150 MB pickle file from S3 over the network, deserialize it, and then run inference, which easily exceeds the 30-second timeout. The model should be loaded once outside the handler (in global scope) and reused across invocations to avoid this overhead.

Exam trap

AWS often tests the misconception that Lambda timeouts are always the root cause of slow inference, when in fact the real issue is inefficient resource initialization (like loading large models from S3 on every call) that can be fixed by architectural changes rather than simply increasing the timeout.

How to eliminate wrong answers

Option A is wrong because scikit-learn is fully compatible with AWS Lambda when included in the deployment package or as a Lambda layer. Option B is wrong because 3 GB of memory is more than sufficient to load a 150 MB model; memory is not the bottleneck here. Option D is wrong because increasing the timeout to 5 minutes would mask the underlying issue of inefficient model loading, not solve it; the real problem is the per-invocation S3 download latency, not the timeout value itself.
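The fix described above (load once, reuse across invocations) might look like the sketch below. The bucket and key names, the environment variables, and the event shape are hypothetical, and the fetch parameter exists only so the caching logic can be exercised without AWS access. boto3 is imported lazily inside the fetch function for the same reason. Unpickling is only safe for artifacts you produced yourself.

```python
import os
import pickle

# Hypothetical locations; real values would come from the function's configuration.
MODEL_BUCKET = os.environ.get("MODEL_BUCKET", "example-model-bucket")
MODEL_KEY = os.environ.get("MODEL_KEY", "model.pkl")

# Module-level cache: survives across invocations in a warm execution environment.
_model = None


def fetch_model_bytes():
    """Download the serialized model from S3 (runs only when the cache is empty)."""
    import boto3  # imported lazily so the caching logic can be tested offline

    s3 = boto3.client("s3")
    return s3.get_object(Bucket=MODEL_BUCKET, Key=MODEL_KEY)["Body"].read()


def get_model(fetch=fetch_model_bytes):
    """Return the cached model, downloading and deserializing it at most once."""
    global _model
    if _model is None:
        _model = pickle.loads(fetch())
    return _model


def handler(event, context):
    """Lambda entry point; assumes event["features"] holds one feature vector."""
    model = get_model()
    prediction = model.predict([event["features"]])
    # scikit-learn returns a NumPy array, which is converted for JSON output.
    return {"prediction": prediction.tolist()}
```

Only the first invocation in a fresh execution environment pays the download and deserialization cost; later warm invocations reuse the cached object.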

853
MCQeasy

A data scientist is performing EDA on a dataset with 500,000 rows and 10 columns. The dataset is stored in an S3 bucket as CSV files. The scientist wants to generate summary statistics (mean, median, min, max) for all numeric columns. Which service allows the quickest ad-hoc analysis without provisioning any infrastructure?

A.AWS Glue ETL
B.Amazon Athena
C.Amazon SageMaker Data Wrangler
D.Amazon QuickSight
AnswerB

Serverless SQL query service.

Why this answer

Option B is correct because Amazon Athena can query data in S3 directly using SQL. Option C is wrong because SageMaker Data Wrangler requires setting up a SageMaker Studio environment. Option A is wrong because AWS Glue ETL requires job setup.

Option D is wrong because QuickSight is for visualization, not direct summary statistics.

854
MCQeasy

A company is using Amazon Kinesis Data Firehose to load streaming data into an S3 bucket. The data schema evolves over time, with new columns added. The data must be queryable using Amazon Athena. What is the BEST way to handle schema changes?

A.Manually update the Athena table definition each time a new column is added
B.Configure Firehose to convert the data to Apache JSON format
C.Use AWS Glue Crawlers to automatically detect schema changes and update the table metadata
D.Recreate the Athena table daily to pick up new columns
AnswerC

Glue Crawlers can run on a schedule to discover new columns and update the Data Catalog.

Why this answer

Option C is correct. Using Glue Crawlers to update the schema and partitioning by date allows Athena to handle schema evolution gracefully. Option A (manual updates) is unnecessary because the crawler updates the table definition automatically.

Option B (convert to JSON) is not necessary. Option D (recreate table) is disruptive.

855
MCQeasy

A startup is building a recommendation system for an e-commerce platform using collaborative filtering. They have a dataset of user-item interactions (ratings) with 1 million users and 100,000 items. The data is sparse (99% missing ratings). They need to train a model on Amazon SageMaker that can handle large-scale sparse data efficiently. Which approach should they use?

A.Use PCA to reduce dimensionality and then apply k-nearest neighbors
B.Use the built-in Factorization Machines algorithm in SageMaker
C.Use the built-in XGBoost algorithm with one-hot encoding for user and item IDs
D.Implement a neural network with dense layers using the built-in MXNet framework
AnswerB

Factorization Machines are designed for sparse data and scale well.

Why this answer

SageMaker's Factorization Machines handle sparse data efficiently and are designed for recommendation tasks.

856
MCQeasy

A data engineer needs to run a one-time ETL job to transform 500 GB of data from Amazon RDS to Amazon S3. The job should be cost-effective and require minimal infrastructure management. Which AWS service should be used?

A.AWS Glue
B.Amazon EMR
C.Amazon Athena
D.AWS Data Pipeline
AnswerA

Glue is serverless, cost-effective, and ideal for one-time ETL.

Why this answer

Option A is correct because AWS Glue is serverless and suitable for one-time ETL jobs. Option B is wrong because EMR requires cluster management and is more expensive for one-time jobs. Option D is wrong because Data Pipeline is a managed service but still requires provisioning.

Option C is wrong because Athena is for querying, not ETL.

857
MCQeasy

A machine learning team is using SageMaker to build a model. They need to track hyperparameter tuning experiments, compare results, and visualize metrics. Which SageMaker feature should they use?

A.SageMaker Experiments
B.SageMaker Ground Truth
C.SageMaker Model Monitor
D.SageMaker Hyperparameter Tuning
E.SageMaker Debugger
AnswerA

Experiments provides tracking, comparison, and visualization.

Why this answer

Option A is correct because SageMaker Experiments provides experiment tracking, comparison, and visualization. Option D (Hyperparameter Tuning) only tunes, not tracks. Option E (Debugger) is for debugging.

Option C (Model Monitor) is for monitoring after deployment. Option B (Ground Truth) is for labeling.

858
MCQmedium

A company is building a recommendation system using Amazon SageMaker. The training data includes user-item interactions stored in a DataFrame with over 100 million rows. The data scientist wants to perform feature engineering, including one-hot encoding of categorical features with high cardinality. Which approach is MOST cost-effective and scalable?

A.Use Amazon EMR with Spark and store the processed data in HDFS.
B.Use SageMaker Processing with a Spark container to distribute the encoding job.
C.Use a SageMaker notebook instance with scikit-learn to perform the encoding in memory.
D.Use AWS Glue ETL jobs to perform the encoding and store the result in S3.
AnswerB

SageMaker Processing with Spark provides distributed processing and is cost-effective for large datasets.

Why this answer

Option B is correct because SageMaker Processing with a Spark job can scale horizontally and is cost-effective for large datasets. Option C is wrong because scikit-learn on a single instance may not handle 100M rows. Option D is wrong because Glue is serverless but may be more expensive for large processing.

Option A is wrong because EMR is more complex and costly for a simple job.

859
MCQmedium

A data scientist is training a binary classifier to predict customer churn. The dataset has 10,000 samples, with 500 churners (positive class). The scientist trains a logistic regression model and obtains an F1-score of 0.6. To improve the F1-score, which approach is MOST likely to be effective?

A.Increase the regularization strength (C)
B.Apply PCA to reduce feature dimensionality
C.Apply SMOTE to oversample the minority class
D.Use the original dataset without any modification
AnswerC

SMOTE generates synthetic samples for the minority class, balancing the dataset and often improving F1-score.

Why this answer

The dataset is highly imbalanced (500 churners out of 10,000 samples, a 5% positive rate). Logistic regression trained on such imbalance tends to bias toward the majority class, resulting in low recall for the minority class and a poor F1-score. SMOTE (Synthetic Minority Oversampling Technique) generates synthetic samples for the minority class by interpolating between existing minority instances, which balances the class distribution and allows the model to learn a better decision boundary, directly improving recall and F1-score.

Exam trap

AWS often tests the misconception that regularization (Option A) or dimensionality reduction (Option B) can fix class imbalance, when in fact they address overfitting and noise, not skewed class priors.

How to eliminate wrong answers

Option A is wrong because increasing regularization strength (C) reduces model complexity and can lead to underfitting, which typically worsens performance on imbalanced data by pushing the decision boundary further toward the majority class. Option B is wrong because PCA reduces dimensionality by projecting data onto principal components that maximize variance, but it does not address class imbalance; it may even discard discriminative information for the minority class. Option D is wrong because using the original dataset without modification ignores the severe class imbalance, and the logistic regression model will continue to predict the majority class for most samples, yielding a low F1-score.
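The interpolation step described above can be sketched in a few lines. This is a simplified illustration, not the full SMOTE algorithm: it uses only the single nearest minority neighbour (real SMOTE picks one of k neighbours at random), and the function name `smote_like_sample` and the toy points are assumptions made for the example.

```python
import random

def smote_like_sample(minority, rng):
    """Create one synthetic point between a minority sample and its nearest minority neighbour."""
    base = rng.choice(minority)
    # Nearest other minority point by squared Euclidean distance
    # (simplification: SMOTE normally chooses among k nearest neighbours).
    neighbour = min(
        (p for p in minority if p is not base),
        key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
    )
    gap = rng.random()  # interpolation factor in [0, 1)
    return tuple(a + gap * (b - a) for a, b in zip(base, neighbour))

rng = random.Random(0)  # fixed seed so the sketch is reproducible
minority = [(1.0, 2.0), (1.5, 1.8), (3.0, 3.5), (2.8, 3.9)]  # toy churner features
synthetic = [smote_like_sample(minority, rng) for _ in range(6)]
```

Each synthetic point lies on the line segment between two real minority samples, which is why oversampling this way adds plausible positives rather than exact duplicates.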

860
MCQmedium

A company is using Amazon Kinesis Data Streams to ingest real-time clickstream data. The data must be transformed before being stored in Amazon S3. The transformations include enrichment with reference data from Amazon DynamoDB. Which AWS service should be used to perform the transformation with minimal operational overhead?

A.Amazon Kinesis Data Firehose with data transformation
B.AWS Lambda functions invoked by Kinesis Data Streams
C.Amazon Kinesis Data Analytics for Apache Flink
D.Amazon EMR with Apache Spark Streaming
AnswerC

Managed Flink application can perform complex transformations and enrichments with low operational overhead.

Why this answer

Amazon Kinesis Data Analytics for Apache Flink allows real-time stream processing with Flink, including enrichment from external sources like DynamoDB. Option B (AWS Lambda) is simpler but has a 15-minute timeout and may not handle high throughput well. Option D (Amazon EMR) requires cluster management.

Option A (Amazon Kinesis Data Firehose) is for delivery and can invoke Lambda for transformation, but complex transformations with external lookups are better handled by Data Analytics.

861
MCQeasy

A company wants to use Amazon SageMaker to automatically tune hyperparameters for a XGBoost model. Which built-in SageMaker feature should be used?

A.SageMaker Debugger
B.SageMaker Model Monitor
C.SageMaker Experiments
D.SageMaker Automatic Model Tuning
AnswerD

This is the service for hyperparameter tuning.

Why this answer

SageMaker Automatic Model Tuning performs hyperparameter optimization. Option C (SageMaker Experiments) tracks trials. Option A (SageMaker Debugger) monitors training.

Option B (SageMaker Model Monitor) detects drift.

862
MCQhard

A company uses Amazon EMR to run Spark jobs on a large dataset stored in Amazon S3. The jobs are failing with 'OutOfMemoryError' in the executors. The data is not skewed. Which configuration change will most likely resolve the issue?

A.Enable Kryo serialization
B.Decrease the number of shuffle partitions
C.Increase the spark.executor.memoryOverhead setting
D.Increase the number of executor cores
AnswerC

Memory overhead handles JVM overhead and off-heap memory, preventing OOM errors.

Why this answer

Increasing the executor memory overhead provides additional memory for JVM overhead and off-heap allocations and can prevent OutOfMemoryError. Option D (increasing cores) may increase parallelism but not memory. Option B (decreasing shuffle partitions) makes each partition larger, which tends to increase per-task memory pressure rather than relieve it.

Option A (using Kryo serialization) reduces memory usage but is not as effective as increasing overhead.

863
MCQmedium

A data scientist is performing exploratory data analysis on a dataset with both numerical and categorical features. The scientist wants to visualize the pairwise relationships between numerical features and also see the distribution of each feature. Which type of plot should the scientist use?

A.Pair plot (scatter matrix) with histograms on the diagonal.
B.Box plot for each feature.
C.Heatmap of the correlation matrix.
D.Correlation matrix with numbers.
AnswerA

Shows pairwise scatter plots and distributions.

Why this answer

Option A is correct because a pair plot (scatter matrix) shows pairwise scatter plots and histograms on the diagonal. Option D is wrong because a correlation matrix does not show distributions. Option C is wrong because a heatmap only shows correlation values.

Option B is wrong because a box plot does not show pairwise relationships.

864
MCQeasy

A data scientist needs to profile a large dataset in Amazon S3 to understand its schema, data types, and quality. Which AWS service can automatically generate a data profile with statistics and visualizations?

A.Amazon Athena
B.AWS Glue DataBrew
C.Amazon QuickSight
D.Amazon Redshift
AnswerB

DataBrew can profile data and generate statistics.

Why this answer

AWS Glue DataBrew provides data profiling capabilities. Option A is wrong because Athena is a query service. Option D is wrong because Redshift is a data warehouse.

Option C is wrong because QuickSight is for visualization after data is prepared.

865
MCQeasy

A data scientist wants to visualize the correlation between a continuous feature and a binary target variable. Which plot is most appropriate?

A.Scatter plot with feature on x-axis and target on y-axis
B.Histogram of the feature
C.Box plot of the feature grouped by target class
D.Bar chart of target class counts
AnswerC

Box plot compares distributions across two groups.

Why this answer

Option C is correct because a box plot shows distribution of the continuous feature across categories of the binary target, highlighting differences. Option A is wrong because a scatter plot is for two continuous variables. Option B is wrong because a histogram shows distribution of a single variable.

Option D is wrong because a bar chart is for categorical vs categorical or counts.

866
MCQmedium

A data scientist is working on a regression problem to predict house prices. The dataset has 80 features, including categorical variables with high cardinality (e.g., zip code with 10,000 unique values). The target variable is log-transformed. The data scientist trains a linear regression model and obtains an R² of 0.45 on the test set. To improve performance, the data scientist considers: A) Applying one-hot encoding to all categorical features and using Ridge regression. B) Using target encoding for high-cardinality features and using a tree-based model like XGBoost. C) Removing all categorical features and using polynomial features for numerical features. D) Using principal component analysis (PCA) on all features before training a linear model. Which approach is MOST likely to improve the model's performance?

A.Remove categorical features and use polynomial features
B.Target encoding + XGBoost
C.One-hot encoding + Ridge regression
D.PCA on all features before linear regression
AnswerB

Target encoding reduces dimensionality and XGBoost captures complex patterns.

Why this answer

Target encoding efficiently handles high-cardinality features, and tree-based models like XGBoost can capture non-linear relationships and interactions, likely improving R². One-hot encoding would create too many features, causing sparsity. Removing categories loses information.

PCA may discard important information.
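As a rough illustration of why target encoding keeps dimensionality low, the sketch below replaces each category with a smoothed mean of the target, so a 10,000-value zip code column becomes a single numeric column. The function name `target_encode`, the `smoothing` parameter, and the toy data are assumptions for the example; in practice the encoding should be fitted on training folds only to avoid target leakage.

```python
from collections import defaultdict

def target_encode(categories, targets, smoothing=1.0):
    """Map each category to a smoothed mean of the target.

    Rare categories are pulled toward the global mean, which limits
    overfitting on categories seen only a few times.
    """
    global_mean = sum(targets) / len(targets)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for cat, y in zip(categories, targets):
        sums[cat] += y
        counts[cat] += 1
    return {
        cat: (sums[cat] + smoothing * global_mean) / (counts[cat] + smoothing)
        for cat in counts
    }

# Toy data: zip codes and log-transformed prices (hypothetical values).
zips = ["98101", "98101", "10001", "10001", "10001", "60601"]
log_prices = [13.0, 13.2, 13.8, 14.0, 13.9, 12.5]

encoding = target_encode(zips, log_prices)
encoded_column = [encoding[z] for z in zips]  # one numeric feature instead of one column per zip
```

The encoded column can then be passed to a tree-based model such as XGBoost alongside the numerical features.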

867
Multi-Selecteasy

A data scientist is evaluating a binary classification model. The model's AUC-ROC is 0.95. Which TWO statements are true?

Select 2 answers
A.The model has no false positives
B.The model has excellent discriminative ability
C.The model's performance is independent of the decision threshold
D.The model is well-calibrated
E.The model's accuracy is at least 95%
AnswersB, C

AUC close to 1 indicates strong separation between classes.

Why this answer

AUC-ROC measures the model's ability to distinguish between classes across all thresholds. A high AUC (close to 1) indicates good performance. AUC-ROC is threshold-independent.

It does not directly indicate accuracy or calibration.

868
MCQmedium

A data engineering team is building a pipeline to process terabytes of log data daily using Amazon EMR with Spark. The data arrives in hourly batches and must be processed within 4 hours. The team needs to minimize cost. Which cluster configuration is MOST cost-effective?

A.Use a single large instance with multiple cores to avoid data shuffling.
B.Use a transient cluster with a mix of on-demand and spot instances, terminated after the job completes.
C.Use a long-running cluster of on-demand instances to avoid startup time.
D.Use Amazon EMR Serverless to automatically scale.
AnswerB

Transient clusters reduce idle cost, spot instances lower compute cost.

Why this answer

Option B is correct because spot instances offer significant cost savings and are suitable for fault-tolerant Spark jobs, and a transient cluster avoids paying for idle time. Option C is wrong because a long-running on-demand cluster is more expensive. Option A is wrong because a single large instance reduces parallelism.

Option D is wrong because EMR Serverless may be more expensive for predictable, large workloads.

869
Multi-Selectmedium

A data scientist is training a linear regression model on a dataset with 10 numerical features. After training, the model's R-squared value is 0.99 on the training set but only 0.60 on the test set. Which TWO of the following are appropriate actions to reduce overfitting? (Choose TWO.)

Select 2 answers
A.Normalize the features
B.Add more features to the model
C.Use a subset of the most important features
D.Increase the number of training epochs
E.Apply L2 regularization (Ridge regression)
AnswersC, E

Reducing the number of features reduces model complexity and overfitting.

Why this answer

Regularization (L1 or L2) penalizes large coefficients and reduces overfitting. Reducing model complexity by using fewer features or simplifying the model also helps. Adding more features would increase complexity and overfitting.

Increasing the number of epochs is not relevant for linear regression (which has a closed-form solution).

870
Multi-Selecteasy

Which TWO actions are valid ways to handle missing data in a dataset before training a machine learning model? (Select TWO.)

Select 2 answers
A.Delete rows with missing values
B.Remove all features that have any missing values
C.Replace missing values with the maximum value
D.Ignore missing values and train the model
E.Impute missing values with the mean
AnswersA, E

Row deletion is valid if missingness is random.

Why this answer

Option A is correct because deleting rows with missing values (listwise deletion) is a straightforward and valid approach when the missing data is random and the dataset is large enough that the loss of rows does not significantly reduce statistical power or introduce bias. This method avoids the need to estimate missing values and is commonly used in practice when the proportion of missing data is low.

Exam trap

AWS often tests the misconception that 'ignoring missing values' is acceptable because some algorithms like tree-based models can technically handle missing values internally, but the exam expects explicit data preprocessing steps as part of the modeling pipeline.

871
MCQmedium

A company is preparing a dataset for training a binary classification model. The dataset has a severe class imbalance (1% positive class). The data scientist wants to understand the impact of this imbalance on model performance before sampling. Which exploratory analysis step is MOST critical?

A.Compute the correlation matrix of all features with the target variable.
B.Check for missing values and outliers in the dataset.
C.Perform PCA and visualize the first two principal components colored by class.
D.Plot the distribution of each feature separately for the positive and negative classes.
AnswerD

Overlapping distributions indicate difficulty in classification.

Why this answer

Option D is correct because analyzing the distribution of features across classes can reveal separability and potential issues. Option A is wrong because correlation with target is not the primary concern. Option B is wrong because missing values are not the immediate concern.

Option C is wrong because PCA is not necessary at this stage.

872
MCQmedium

A company uses Amazon SageMaker to train a time-series forecasting model using the built-in DeepAR algorithm. The training data consists of daily sales for 1000 products over 2 years. The model performs well on most products, but for a few products with intermittent demand (sporadic sales), the predictions are poor. Which action should the data scientist take to improve predictions for these products?

A.Create a separate forecasting model specifically for intermittent demand products, using a model designed for such patterns (e.g., Croston's method).
B.Use a linear regression model for all products.
C.Increase the context length of the DeepAR model to capture longer history.
D.Add more training data by including additional product categories.
AnswerA

Intermittent demand requires specialized models like Croston's method or TSB.

Why this answer

Option A is correct. Creating separate models for different demand patterns allows specialized treatment. Option C is wrong because the history is already long enough, so a longer context length does not address sporadic sales.

Option B is wrong because using a linear model may underfit. Option D is wrong because increasing training data does not help with intermittent patterns.

873
MCQeasy

A data scientist is working with a dataset that contains text reviews and a numeric rating (1-5). The goal is to predict the rating from the review text. During EDA, the scientist wants to check if there are any spelling errors or unusual characters. Which tool is BEST suited for this task?

A.Amazon SageMaker Data Wrangler with a custom transform for text cleaning.
B.Amazon Athena with SQL queries to find anomalies.
C.Amazon Comprehend to detect syntax and entities.
D.Amazon QuickSight to create word clouds.
AnswerC

Comprehend can analyze text for structure.

Why this answer

Option C is the keyed answer. Amazon Comprehend detects entities, key phrases, and syntax; it does not flag spelling errors directly, but its output can help identify unusual patterns in the text, and true spell-checking may still need a custom solution. Among the options, Comprehend is the only AWS AI service that processes text.

Option A is wrong because SageMaker Data Wrangler is for tabular data. Option B is wrong because Athena is for SQL. Option D is wrong because QuickSight is for visualization.

874
MCQmedium

A data scientist is analyzing a dataset with missing values in several columns. The dataset contains customer demographic information and purchase history. Which approach should the data scientist take to handle missing values without introducing bias into the dataset?

A.Drop all rows with any missing values.
B.Impute missing values with the mean of each column.
C.Replace missing values with a constant, such as 0.
D.Use multiple imputation to estimate missing values.
AnswerD

Multiple imputation accounts for uncertainty and reduces bias.

Why this answer

Option D is correct because multiple imputation accounts for the uncertainty of missing values by creating multiple imputed datasets and combining results, reducing bias compared to single imputation or deletion methods. Option A is wrong because dropping rows with missing values can introduce bias if the missingness is not completely random. Option B is wrong because mean imputation can reduce variance and bias relationships.

Option C is wrong because using a constant value (e.g., 0) is arbitrary and can distort the data distribution.

875
MCQhard

A team notices that a SageMaker training job using TensorFlow is running slower than expected. The training data is in S3 in TFRecord format. Which action is most likely to improve training throughput?

A.Use Pipe mode for data ingestion
B.Use distributed training with more instances
C.Increase the batch size in the training script
D.Switch from Pipe mode to File mode
AnswerA

Pipe mode streams data, reducing I/O wait time.

Why this answer

Option A is correct because SageMaker Pipe mode streams data directly from S3, eliminating download latency. Option D is wrong because File mode, although the default, is slower. Option C is wrong because increasing batch size may cause memory issues.

Option B is wrong because increasing instances adds complexity but not per-instance throughput.

876
MCQeasy

A data scientist is analyzing a dataset with many features and wants to identify which features are most correlated with the target variable. Which EDA technique should be used?

A.Box plots grouped by target
B.Scatter plot matrix
C.Histogram of each feature
D.Correlation matrix
AnswerD

Correlation matrix provides a compact view of pairwise correlations.

Why this answer

A correlation matrix shows pairwise correlations between all numeric features and the target. Option B is wrong because a scatter plot matrix shows the raw pairs without a compact numeric measure, which becomes unreadable with many features. Option C is wrong because histograms show distributions, not correlations.

Option A is wrong because box plots show distributions per category, not correlations.

877
Matchingmedium

Match each SageMaker built-in metric to its meaning.

Concepts
Matches

Fraction of correct predictions on validation set

Root mean square error on validation set

Area under ROC curve on validation set

Logistic loss on validation set

Harmonic mean of precision and recall on validation set

Why these pairings

These metrics are used to evaluate model performance.

878
MCQhard

A data engineer is using AWS Glue to catalog a dataset with 200 columns. During exploratory data analysis, they run a crawler and then view the table schema in the AWS Glue Data Catalog. They notice that many columns are inferred as 'string' even though they contain numeric values. What is the most likely cause?

A.The data is stored in JSON format, which only supports string types.
B.The crawler sample size is too small, and the sampled rows contain non-numeric values.
C.The data is stored in Parquet format, which does not support numeric types.
D.The column names contain special characters that prevent type inference.
AnswerB

The crawler samples a subset; if the sample includes non-numeric values, it infers string.

Why this answer

Option B is correct because the crawler samples data and may infer 'string' if the sample size is small or if the sampled rows contain non-numeric values (e.g., headers or missing-value markers). Option D is incorrect because the crawler does not rely on column names for type inference. Option C is incorrect because Parquet supports numeric types and stores its schema in the file; type inference applies to formats such as CSV.

Option A is incorrect because JSON also carries type information, although the crawler can still infer types incorrectly.

879
MCQhard

During EDA, a data scientist discovers that two numerical features have a Pearson correlation coefficient of 0.95. Which action should the scientist take to avoid multicollinearity in a linear regression model?

A.Remove one of the features
B.Apply PCA to the two features
C.Use Ridge regression to penalize coefficients
D.Create polynomial features from the correlated pair
E.Apply min-max scaling to both features
AnswerA

Removing one feature eliminates multicollinearity and retains interpretability.

Why this answer

High correlation indicates multicollinearity, which can be addressed by removing one of the correlated features. Option B is wrong because PCA reduces dimensionality but loses interpretability. Option C is wrong because regularization (e.g., Ridge) can handle multicollinearity but does not remove it; removing one feature is simpler.

Option D is wrong because polynomial features introduce more multicollinearity. Option E is wrong because scaling does not address correlation.
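A minimal sketch of the check-and-drop step, using a hand-written Pearson coefficient so the arithmetic is visible. The feature names, the toy values, and the 0.9 cutoff are assumptions for illustration only.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)

# Hypothetical, strongly related features.
features = {
    "sqft": [850, 1200, 1500, 2100, 2600],
    "rooms": [2, 3, 4, 5, 6],
}

r = pearson(features["sqft"], features["rooms"])
if abs(r) > 0.9:           # assumed cutoff for "highly correlated"
    features.pop("rooms")  # keep one feature of the pair
```

Which feature of the pair to keep is a judgment call; keeping the one that is easier to interpret or cheaper to collect is a common choice.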

880
MCQmedium

Refer to the exhibit. A data scientist is unable to query a table in Amazon Athena that is located in the 'my-data-bucket' S3 bucket. The IAM policy shown is attached to the scientist's role. What is the most likely reason for the failure?

A.The policy does not allow decrypting data encrypted with AWS KMS.
B.The policy does not allow athena:StartQueryExecution.
C.The policy does not allow s3:GetObject on the bucket.
D.The policy does not allow s3:PutObject to write query results to an S3 bucket.
AnswerD

Athena writes results to S3, requiring s3:PutObject.

Why this answer

Athena queries also require permission to write query results to an S3 bucket, typically specified as 's3:PutObject' on an output location. The policy lacks that permission. Option C is wrong because the policy allows s3:GetObject and s3:ListBucket.

Option B is wrong because athena:StartQueryExecution is allowed. Option A is wrong because there is no encryption restriction in the policy.

881
Multi-Selectmedium

A data scientist is performing EDA on a dataset with 1,000 features and 10,000 rows. The target is binary. The scientist wants to reduce dimensionality while preserving information related to the target. Which TWO methods are appropriate?

Select 2 answers
A.Principal Component Analysis (PCA)
B.Autoencoders
C.L1-regularized logistic regression
D.Mutual information-based feature selection
E.t-Distributed Stochastic Neighbor Embedding (t-SNE)
AnswersC, D

Can perform feature selection by shrinking coefficients to zero.

Why this answer

Options C and D are correct. Mutual information selection selects features with highest dependency on target, and L1-regularized logistic regression can drive coefficients to zero for feature selection. Option A is wrong because PCA is unsupervised and may discard target-related variance.

Option E is wrong because t-SNE is for visualization only. Option B is wrong because Autoencoders are unsupervised.

882
MCQhard

A data scientist is working with a dataset containing text reviews. The goal is to classify sentiment. During EDA, they compute the word frequency distribution. They notice that the most frequent words are common stop words like 'the', 'and', 'a'. Which action should they take to improve the feature representation for modeling?

A.Use n-grams instead of unigrams to capture phrase patterns.
B.Add more stop words to the default list to remove even more common words.
C.Remove the stop words from the text before creating the bag-of-words representation.
D.Apply stemming to reduce words to their root forms.
AnswerC

Stop words are usually not informative for sentiment; removing them reduces noise.

Why this answer

Option C is correct because removing stop words focuses on content words that carry sentiment. Option B is wrong because adding more stop words would remove even more potentially useful words. Option D is wrong because stemming reduces words to root forms but does not address stop words.

Option A is wrong because n-grams capture phrases but still include stop words.

883
MCQeasy

A company is using AWS Glue to run ETL jobs that transform data from Amazon S3 to Amazon Redshift. The jobs are failing intermittently with timeouts. What is the most likely cause?

A.The S3 bucket policy is too restrictive.
B.The AWS Glue job does not have enough DPUs (Data Processing Units) allocated.
C.The Amazon Redshift cluster is in maintenance mode.
D.The source data is not compressed.
AnswerB

Insufficient resources can cause timeouts.

Why this answer

Insufficient DPU allocation can cause timeouts in Glue jobs. The other options are less likely: S3 bucket policies would cause permission errors, not timeouts; Redshift maintenance would affect all jobs; compression would improve performance.

884
MCQeasy

A data engineer needs to schedule an AWS Glue ETL job to run every hour. The job reads from an Amazon DynamoDB table and writes to Amazon S3. Which AWS service should the engineer use to trigger the Glue job on schedule?

A.Amazon Kinesis Data Streams
B.AWS Step Functions
C.Amazon EventBridge (CloudWatch Events)
D.AWS Lambda
AnswerC

EventBridge can schedule events to trigger Glue jobs.

Why this answer

Option C is correct. Amazon CloudWatch Events (now Amazon EventBridge) can trigger Glue jobs on a schedule. Option D is wrong because AWS Lambda is not a scheduler.

Option A is wrong because Amazon Kinesis is for streaming. Option B is wrong because AWS Step Functions is for orchestrating workflows, but scheduling is typically done via EventBridge.

885
Multi-Selectmedium

Which TWO approaches are valid for handling missing categorical values in a dataset before training a machine learning model?

Select 2 answers
A.Remove all rows with missing values
B.Impute missing values with the mode of the column
C.Impute missing values with the median of the column
D.Impute missing values with the mean of the column
E.Treat missing values as a separate category
AnswersB, E

Mode is appropriate for categorical data.

Why this answer

Option B is correct because the mode (most frequent value) is the only valid measure of central tendency for categorical data, as it identifies the most common category. Imputing with the mode preserves the distribution of categories and is a standard technique for handling missing categorical values in preprocessing pipelines like scikit-learn's SimpleImputer with strategy='most_frequent'.

Exam trap

AWS often tests the distinction between numerical and categorical imputation methods, trapping candidates who apply mean or median imputation to categorical features without recognizing that these statistics are invalid for non-numeric data.
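Both valid approaches (mode imputation and a separate "missing" category) are simple enough to sketch without any library. The helper names, the `"MISSING"` label, and the sample column are assumptions for illustration; scikit-learn's SimpleImputer, mentioned above, covers the mode case in a real pipeline.

```python
from collections import Counter

def impute_mode(values):
    """Replace None with the most frequent observed category."""
    observed = [v for v in values if v is not None]
    mode = Counter(observed).most_common(1)[0][0]
    return [mode if v is None else v for v in values]

def impute_missing_category(values, label="MISSING"):
    """Treat missingness itself as a category (label name is an assumption)."""
    return [label if v is None else v for v in values]

colors = ["red", None, "blue", "red", None]  # hypothetical categorical column

by_mode = impute_mode(colors)
by_category = impute_missing_category(colors)
```

Mean and median have no meaning for values like "red" and "blue", which is why options C and D are ruled out for categorical columns.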

886
MCQmedium

A company uses Amazon SageMaker to train a model. The training job runs successfully but the model artifacts are not saved to the specified S3 output path. What is a likely cause?

A.The training script does not save the model to /opt/ml/model.
B.The model size exceeds the S3 bucket limit.
C.The training job used spot instances.
D.The S3 bucket is in a different AWS Region.
AnswerA

SageMaker uploads contents of /opt/ml/model to S3; saving elsewhere means artifacts are lost.

Why this answer

Option A is correct because Amazon SageMaker expects the training script to save the model artifacts to the `/opt/ml/model` directory. After the training job completes, SageMaker automatically copies the contents of this directory to the specified S3 output path. If the script saves the model elsewhere (e.g., `/tmp` or a custom path), no artifacts will be uploaded, resulting in an empty or missing S3 output.

Exam trap

The trap here is that candidates assume any successful training job automatically saves artifacts, but SageMaker only uploads what is explicitly placed in `/opt/ml/model`, and the exam tests this specific SageMaker convention.

How to eliminate wrong answers

Option B is wrong because S3 bucket limits are based on total bucket size (unlimited) and object size (up to 5 TB per object), not model size; a model exceeding these limits would cause a different error (e.g., upload failure), not a silent missing artifact. Option C is wrong because using spot instances does not affect where the model is saved; spot instances can be preempted, but if the training completes successfully, artifacts are still saved to `/opt/ml/model` and uploaded. Option D is wrong because SageMaker can write to S3 buckets in any region as long as the bucket policy and IAM role grant cross-region access; a region mismatch would cause a permission or access error, not a silent failure to save artifacts.
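A minimal sketch of the saving step in a training script, assuming script mode where SageMaker exposes the model directory through the `SM_MODEL_DIR` environment variable (falling back to `/opt/ml/model`). The `save_model` helper, the `model.json` file name, and the parameter dictionary are hypothetical; a real script would serialize its own model format.

```python
import json
import os
import tempfile

def save_model(params, model_dir=None):
    """Write model parameters into the directory SageMaker uploads to S3."""
    # Assumption: SM_MODEL_DIR is set inside the training container;
    # otherwise use the conventional /opt/ml/model path.
    model_dir = model_dir or os.environ.get("SM_MODEL_DIR", "/opt/ml/model")
    os.makedirs(model_dir, exist_ok=True)
    path = os.path.join(model_dir, "model.json")
    with open(path, "w") as f:
        json.dump(params, f)
    return path

# Local dry run against a temporary directory instead of /opt/ml/model.
with tempfile.TemporaryDirectory() as tmp:
    saved_path = save_model({"weights": [0.1, 0.2]}, model_dir=tmp)
    saved_ok = os.path.exists(saved_path)
```

Inside a training job the call would be `save_model(params)` with no directory argument, so the file lands where SageMaker looks for artifacts.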

887
MCQmedium

A company uses SageMaker to train a time-series forecasting model using Amazon Forecast. The dataset contains historical sales data for 10,000 products over 2 years. Which data format is required for the target time series?

A.A single JSON file with nested arrays
B.A CSV file with columns: timestamp, target_value, item_id
C.A text file with one value per line
D.A Parquet file partitioned by date
AnswerB

This is the required format for target time series in Forecast.

Why this answer

Amazon Forecast requires the target time series data to be in a CSV format with specific columns: timestamp, target_value, and item_id. This structured format allows the service to correctly identify the time series for each product and the target metric to forecast. The CSV format is the standard input for Forecast's built-in algorithms and ensures compatibility with the dataset import process.

Exam trap

The trap here is that candidates may assume Amazon Forecast supports flexible data formats like JSON or Parquet for all dataset types, but the target time series is strictly restricted to CSV to ensure consistent parsing and algorithm compatibility.

How to eliminate wrong answers

Option A is wrong because Amazon Forecast does not accept JSON files for target time series data; it requires CSV format for dataset import. Option C is wrong because a text file with one value per line lacks the necessary metadata (timestamp and item_id) to define multiple time series and their temporal alignment. Option D is wrong because, while Parquet is a supported format for related time series (RTS) or item metadata, the target time series dataset must be in CSV format as per Forecast's documentation.

888
Multi-Selecteasy

A data scientist is evaluating a binary classification model that predicts whether a customer will churn. The model achieves an AUC of 0.85 on the test set. Which TWO statements about AUC are correct? (Choose two.)

Select 2 answers
A.AUC represents the probability that a randomly chosen positive instance is ranked higher than a randomly chosen negative instance.
B.An AUC of 0.85 indicates the model is no better than random guessing.
C.AUC is the average precision across all thresholds.
D.AUC is equivalent to the accuracy of the model at the default threshold of 0.5.
E.AUC is threshold-independent, meaning it evaluates the model's ranking performance across all thresholds.
AnswersA, E

This is the statistical interpretation of AUC.

Why this answer

Options A and E are correct. AUC measures the model's ability to rank positive instances higher than negative ones (A). AUC is threshold-independent (E).

Option B is false because an AUC of 0.85 is better than random guessing (0.5). Option C is false because AUC is not mean precision. Option D is false because AUC is not accuracy.
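The ranking interpretation of AUC can be checked directly by counting positive/negative pairs. The sketch below is a brute-force illustration with made-up scores, not how a library computes the metric at scale; the function name `pairwise_auc` is hypothetical.

```python
def pairwise_auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Illustrative scores: one positive (0.4) is ranked below one negative (0.6).
scores = [0.9, 0.4, 0.6, 0.2]
labels = [1, 1, 0, 0]
print(pairwise_auc(scores, labels))  # 3 of 4 pairs are ordered correctly -> 0.75

# No threshold appears anywhere: any order-preserving rescaling of the
# scores leaves the result unchanged.
print(pairwise_auc([s ** 2 for s in scores], labels))  # 0.75
```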

889
MCQeasy

A data scientist is training a linear regression model using Amazon SageMaker's built-in Linear Learner algorithm. The dataset has 500 features and 1 million rows. After training, the model's training RMSE is 2.5 and validation RMSE is 2.6, which is acceptable. However, the scientist notices that many feature coefficients are very small but non-zero, and the model takes a long time to train. The scientist wants to reduce training time while maintaining similar accuracy. Which action should the scientist take?

A.Increase the mini-batch size
B.Increase the L1 regularization strength
C.Switch to a neural network model
D.Increase the L2 regularization strength
AnswerB

L1 induces sparsity, removing features and speeding training.

Why this answer

Option B (increase L1 regularization) will drive many coefficients to zero, reducing the number of effective features and thus training time. Option D (increase L2) shrinks coefficients but doesn't zero them out, so it has less impact on speed. Option A (increase the mini-batch size) may speed training but could affect convergence.

Option C (switch to a neural network model) is unnecessary.
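The difference between the two penalties can be seen in their single-step shrinkage rules. The sketch below is an assumption-laden illustration, not Linear Learner's internal optimizer: it applies the proximal operator of an L1 penalty (soft-thresholding) and of an L2 penalty (uniform scaling) to made-up weights.

```python
def l1_shrink(weights, lam):
    """Soft-thresholding: proximal step for the penalty lam * |w|.

    Weights with |w| <= lam become exactly zero.
    """
    out = []
    for w in weights:
        if abs(w) <= lam:
            out.append(0.0)
        else:
            out.append(w - lam if w > 0 else w + lam)
    return out


def l2_shrink(weights, lam):
    """Proximal step for the penalty (lam / 2) * w**2: scales every weight."""
    return [w / (1.0 + lam) for w in weights]


# Illustrative coefficients: two large, two "very small but non-zero".
weights = [2.0, -0.05, 0.02, -1.5]
lam = 0.1

l1 = l1_shrink(weights, lam)
l2 = l2_shrink(weights, lam)
print(sum(1 for w in l1 if w == 0.0))  # 2: the small weights are removed
print(sum(1 for w in l2 if w == 0.0))  # 0: L2 keeps every weight non-zero
```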

890
MCQhard

A data scientist is analyzing a dataset where the target variable is highly imbalanced (1% positive class). They are performing EDA. Which metric is most appropriate for evaluating class separation in the feature space?

A.Area Under the ROC Curve (AUC-ROC)
B.Accuracy
C.F1 score
D.Log loss
AnswerA

AUC-ROC measures separability independent of class distribution.

Why this answer

Option A is correct because AUC-ROC is robust to class imbalance and measures separability. Option B is wrong because accuracy is misleading on imbalanced data. Option C is wrong because F1 score is a model evaluation metric, not for EDA.

Option D is wrong because log loss is a probabilistic metric.

891
MCQeasy

A machine learning engineer is deploying a model to an Amazon SageMaker endpoint. The model requires GPU for inference. Which instance type should be selected?

A.ml.p3.2xlarge
B.ml.m5.large
C.ml.c5.xlarge
D.ml.r5.large
AnswerA

GPU instance suitable for inference.

Why this answer

Option A is correct because P3 instances (e.g., ml.p3.2xlarge) provide GPU capabilities. Options B, C, and D are CPU-only instances.

892
MCQmedium

A company is using SageMaker to train a linear learner algorithm. The training log shows that the algorithm converges but the final loss is still high. Which change is most likely to improve the model?

A.Reduce the early stopping tolerance
B.Increase the maximum runtime
C.Add feature crosses or polynomial features
D.Increase the number of training instances
AnswerC

Linear models benefit from feature engineering to capture non-linear relationships.

Why this answer

Option C is correct: feature engineering can help linear models capture non-linear patterns. Option D (increasing the number of training instances) may not help if the model is underfitting. Option A (early stopping tolerance) only changes when training stops, not what the model can represent.

Option B (max runtime) does not affect model quality.
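As a rough illustration of feature crosses and polynomial features, the sketch below expands a feature row with all degree-2 terms. The function name `degree2_features` is hypothetical, and this is a hand-rolled preprocessing step, not a Linear Learner option.

```python
from itertools import combinations_with_replacement


def degree2_features(row):
    """Return the original features followed by all squares and pairwise crosses."""
    indices = range(len(row))
    crosses = [row[i] * row[j] for i, j in combinations_with_replacement(indices, 2)]
    return list(row) + crosses


# For features (x1, x2) the output is (x1, x2, x1*x1, x1*x2, x2*x2).
print(degree2_features([2, 3]))  # [2, 3, 4, 6, 9]
```

A linear model trained on the expanded row can fit quadratic and interaction effects while remaining linear in its parameters.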

893
MCQhard

A data scientist is using Amazon SageMaker to train a deep learning model for image classification. The training job is using a single GPU instance and is taking too long. The scientist wants to reduce training time without sacrificing model accuracy. The dataset contains 100,000 images of size 256x256. Which change would most effectively reduce training time?

A.Reduce the batch size
B.Use a smaller image size (e.g., 128x128)
C.Increase the learning rate
D.Switch to a distributed training setup with multiple GPUs
AnswerB

Fewer pixels mean faster forward/backward passes, significantly reducing training time.

Why this answer

Reducing image resolution (e.g., to 128x128) cuts the pixel count per image by a factor of four and thus the computational cost per epoch, often with minimal impact on accuracy for many tasks. Using a smaller batch size increases the number of iterations per epoch and can actually slow training down. Distributed training with multiple GPUs would also reduce time and is generally safe, although it can sometimes affect convergence.

Reducing resolution is nevertheless a direct and effective method.
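The size of the reduction is simple arithmetic. The sketch below only compares input sizes, assuming 3-channel images (the question does not state the channel count); the actual speed-up depends on the network architecture and the data pipeline.

```python
def input_values(height, width, channels=3):
    """Number of values per image fed to the network (channels=3 is an assumption)."""
    return height * width * channels


full = input_values(256, 256)
small = input_values(128, 128)
print(full / small)  # 4.0: halving each side leaves a quarter of the pixels
```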

894
MCQmedium

A company is streaming e-commerce events to Amazon Kinesis Data Streams. The data science team needs to join events from multiple shards in near real-time and then store the joined results in Amazon S3. Which solution would meet these requirements with the LEAST operational overhead?

A.Use AWS Lambda functions with Kinesis triggers to process each record, join across shards using a DynamoDB table for state, and write to S3.
B.Use Amazon Kinesis Data Firehose to buffer the data and write to S3, then use Amazon Athena to join the data after it is stored.
C.Use AWS Glue ETL jobs that read from the Kinesis stream via the Kinesis connector and write the joined results to S3.
D.Use Amazon Kinesis Data Analytics for Apache Flink to read from the Kinesis stream, perform a join operation using Flink SQL, and write the results to S3 using a sink connector.
AnswerD

Kinesis Data Analytics for Apache Flink supports stateful stream processing and can join across shards natively.

Why this answer

Option D is correct because Amazon Kinesis Data Analytics for Apache Flink can read from a Kinesis stream, perform stateful joins across shards using Flink's SQL or DataStream API, and write results to S3. Option C is wrong because while Glue ETL can process data, it is batch-oriented and not designed for near real-time streaming joins. Option A is wrong because Lambda with Kinesis triggers processes each shard independently; joining across shards would require external state management and is not a typical pattern.

Option B is wrong because Kinesis Data Firehose cannot perform joins; it only writes data to destinations.

895
MCQmedium

A SageMaker training job fails with the failure reason shown in the exhibit. What is the most likely cause?

A.The training instance ran out of memory
B.The S3 bucket with training data is not accessible
C.The SageMaker service limit for the instance type has been exceeded
D.There is an error in the custom training script
AnswerD

ExecuteUserScriptError with ExitCode 1 indicates script error.

Why this answer

Option D is correct: ExitCode 1 from UserScript indicates an error in the training script. Option A (insufficient memory) would show an OutOfMemory error. Option B (S3 access) would show AccessDenied.

Option C (instance limit) would show ResourceLimitExceeded.

896
MCQeasy

A data scientist is training a binary classification model on a highly imbalanced dataset where the positive class represents only 1% of the data. The model achieves 99% accuracy but only identifies 5% of the actual positives. Which metric should the data scientist use to evaluate model performance?

A.Mean squared error
B.Accuracy
C.Recall
D.Precision
AnswerC

Recall measures the proportion of actual positives correctly identified.

Why this answer

Recall (sensitivity) measures the proportion of actual positives correctly identified by the model. With only 5% of positives detected, recall is 0.05, which directly reveals the model's failure to capture the minority class despite high accuracy. In imbalanced datasets, accuracy is misleading because the model can achieve 99% accuracy by simply predicting the majority class (negative) for all instances.

Exam trap

AWS often tests the trap that high accuracy implies good performance on imbalanced datasets, leading candidates to choose accuracy without considering class distribution or the specific failure mode (low recall).

How to eliminate wrong answers

Option A is wrong because mean squared error (MSE) is a regression metric that measures average squared differences between predicted and actual values, not suitable for binary classification evaluation. Option B is wrong because accuracy is misleading in imbalanced datasets; a model predicting all negatives achieves 99% accuracy but fails to identify positives, as seen here. Option D is wrong because precision measures the proportion of positive predictions that are correct, which could be high if the model makes very few positive predictions, but it does not capture the low detection rate of actual positives (recall).
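The gap between accuracy, recall, and precision in this scenario can be reproduced with a small worked example. The confusion-matrix counts below are hypothetical, chosen only to match the figures in the question (1% positives, 99% accuracy, 5% of positives found).

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, recall, and precision from confusion-matrix counts."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
    }


# Hypothetical counts: 10,000 rows, 100 positives, only 5 of them detected.
metrics = classification_metrics(tp=5, fp=5, fn=95, tn=9895)
print(metrics)  # {'accuracy': 0.99, 'recall': 0.05, 'precision': 0.5}
```

With these counts, accuracy looks excellent and precision looks moderate, while recall alone exposes that 95 of the 100 actual positives were missed.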

897
MCQhard

A data engineer is exploring a dataset with a timestamp column and wants to resample the data to a consistent 1-hour frequency. The data is irregularly spaced. Which approach is most efficient using AWS services?

A.Use Amazon EMR with Spark
B.Use AWS Glue with built-in transforms
C.Use Amazon Athena with SQL window functions
D.Use Amazon SageMaker Processing with a custom script
AnswerD

SageMaker Processing allows custom scripts for flexible resampling.

Why this answer

Option D is correct because Amazon SageMaker Processing jobs allow custom scripts (e.g., using pandas resample) to handle irregular time series, and they are fully managed. Option C is wrong because Amazon Athena is a query engine and cannot resample easily. Option B is wrong because AWS Glue is more suited for batch ETL but may be overkill.

Option A is wrong because Amazon EMR requires cluster management and is more complex for simple resampling.
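The resampling logic such a custom script would carry out can be sketched with the standard library alone. This is an illustrative stand-in, assuming hourly means are wanted and that empty hours should be reported as `None`; in a real Processing script, pandas `DataFrame.resample` would express the same idea more compactly. The function name `resample_hourly_mean` is hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def resample_hourly_mean(samples):
    """Bucket (timestamp, value) pairs into 1-hour bins and average each bin.

    Returns one (hour_start, mean) pair per hour between the first and last
    observation; hours with no observations get None.
    """
    buckets = defaultdict(list)
    for ts, value in samples:
        # Truncate each timestamp to the start of its hour.
        buckets[ts.replace(minute=0, second=0, microsecond=0)].append(value)
    if not buckets:
        return []
    hour, last = min(buckets), max(buckets)
    result = []
    while hour <= last:
        values = buckets.get(hour)
        result.append((hour, sum(values) / len(values) if values else None))
        hour += timedelta(hours=1)
    return result


# Irregularly spaced, made-up readings.
readings = [
    (datetime(2024, 1, 1, 10, 5), 1.0),
    (datetime(2024, 1, 1, 10, 50), 3.0),
    (datetime(2024, 1, 1, 12, 10), 5.0),
]
for hour_start, mean in resample_hourly_mean(readings):
    print(hour_start.isoformat(), mean)
```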

898
MCQhard

A data scientist is analyzing a dataset with 100,000 observations and 50 features. The scientist uses a Jupyter notebook on Amazon SageMaker. During EDA, the scientist runs a command to check for missing values and notices that 20% of the data in one feature is missing. The missing values are not random; they are correlated with another feature. Which imputation method is MOST appropriate?

A.Median imputation
B.Listwise deletion (remove rows with missing values)
C.Mean imputation
D.Multiple imputation by chained equations (MICE)
AnswerD

Models missing values using other features.

Why this answer

Option D is correct because MICE uses multiple imputation based on other features, accounting for correlations. Option C is wrong because mean imputation ignores correlation. Option A is wrong because median imputation also ignores correlation.

Option B is wrong because removing rows loses data and may introduce bias.
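The idea of imputing from a correlated feature can be sketched with a single regression step. To be clear, this is not MICE itself: it is one simplified building block (a least-squares line fitted on the complete rows), shown with made-up data. MICE would cycle such models over every feature with missing values and produce several imputed datasets.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def regression_impute(xs, ys):
    """Fill None entries in `ys` using a line fitted on rows where `ys` is known."""
    complete = [(x, y) for x, y in zip(xs, ys) if y is not None]
    slope, intercept = fit_line([x for x, _ in complete], [y for _, y in complete])
    return [y if y is not None else slope * x + intercept for x, y in zip(xs, ys)]


# Made-up data in which y is roughly 2 * x and one y value is missing.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, None, 8.0]
print(regression_impute(xs, ys))  # the gap is filled with a value close to 6.0

# Mean imputation would instead fill in the mean of the observed values
# (about 4.67), ignoring the relationship with x.
```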

899
MCQmedium

A data scientist is exploring a dataset of customer transactions. The dataset has 1 million rows and 50 columns. The target variable is a binary flag indicating whether a customer churned. The data scientist runs a correlation matrix on all numerical features and finds that two features have a correlation coefficient of 0.98. Which action should be taken to improve model performance?

A.Create an interaction term between the two features.
B.Remove one of the two highly correlated features from the dataset.
C.Increase the regularization parameter (e.g., lambda) in the model.
D.Apply mean-centering to both features to reduce correlation.
AnswerB

Removing one feature eliminates multicollinearity, simplifying the model and improving interpretability.

Why this answer

Two features with a correlation coefficient of 0.98 are nearly perfectly multicollinear. This inflates the variance of coefficient estimates in linear models, making them unstable and reducing interpretability. Removing one of the highly correlated features is a standard dimensionality reduction technique that mitigates multicollinearity without significant information loss, as the remaining feature captures almost the same variance.

Exam trap

AWS often tests the misconception that regularization alone fixes multicollinearity, but regularization only penalizes coefficient magnitude, not the linear dependency between features.

How to eliminate wrong answers

Option A is wrong because creating an interaction term between two nearly perfectly correlated features would introduce even more severe multicollinearity (the interaction term will be highly correlated with the original features), worsening model stability. Option C is wrong because increasing the regularization parameter (e.g., lambda in L2 regularization) can shrink coefficients but does not eliminate the underlying multicollinearity; the model remains sensitive to small data changes and coefficient interpretation is still problematic. Option D is wrong because mean-centering only shifts the features' means to zero and does not change the correlation coefficient between them; it has no effect on multicollinearity.
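The claim that mean-centering leaves the correlation coefficient unchanged can be verified numerically. The sketch below uses made-up, nearly collinear data; `pearson` and `center` are hypothetical helper names.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def center(xs):
    """Subtract the mean so the values average to zero."""
    mean = sum(xs) / len(xs)
    return [x - mean for x in xs]


# Made-up features that are almost linearly related.
a = [1.0, 2.0, 3.0, 4.0, 5.0]
b = [2.1, 3.9, 6.2, 8.0, 9.8]

before = pearson(a, b)
after = pearson(center(a), center(b))
print(before > 0.98)                # True: the features are highly correlated
print(abs(before - after) < 1e-12)  # True: centering does not change it
```

Because the formula already subtracts each feature's mean, shifting the features beforehand cannot alter the result.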

900
MCQeasy

A machine learning engineer needs to store and version datasets for reproducibility. Which AWS service is designed for this purpose?

A.AWS CodeCommit
B.Amazon S3
C.SageMaker Feature Store
D.Amazon Redshift
AnswerC

Feature Store is designed for feature storage, versioning, and retrieval.

Why this answer

SageMaker Feature Store stores, manages, and versions features for ML. Option C is correct. Option B is a general object store.

Option A is for code versioning. Option D is for data warehousing.
