AWS Certified Machine Learning Specialty (MLS-C01): Questions 1351-1425

1755 questions total · 24 pages · All types, answers revealed

Page 19 of 24

1351
MCQ · easy

A company wants to deploy a machine learning model that provides real-time inference with low latency. The model is a small ensemble of three tree-based models. Which Amazon SageMaker approach is most appropriate?

A. Use a SageMaker real-time endpoint with a single inference container.
B. Use a SageMaker batch transform job.
C. Use AWS Lambda with the model packaged in a layer.
D. Use a SageMaker Serverless Inference endpoint.
Answer: A

Real-time endpoints provide low-latency inference.

Why this answer

A SageMaker real-time endpoint with a single inference container is the most appropriate approach because it provides persistent, low-latency inference by keeping the model loaded in memory and handling requests synchronously. For a small ensemble of three tree-based models, a single container can host all three (for example, through a custom inference script that loads each model and combines their predictions) and deliver sub-second response times, meeting the real-time requirement.

Exam trap

The trap here is that candidates often confuse 'real-time inference' with 'serverless' or 'batch processing,' assuming that serverless or Lambda are always cheaper or simpler, but they fail to account for cold-start latency and execution limits that break low-latency requirements.

How to eliminate wrong answers

Option B is wrong because SageMaker batch transform jobs are designed for asynchronous, offline inference on large datasets and do not provide real-time, low-latency responses. Option C is wrong because Lambda functions incur cold-start latency whenever a new execution environment is initialized, and a model packaged in a layer must fit within Lambda's deployment package size limits; the 15-minute timeout and 10 GB memory ceiling are not the deciding factors for a small ensemble. Option D is wrong because SageMaker Serverless Inference scales to zero when idle, so requests that arrive after an idle period incur cold-start latency; it is optimized for intermittent or bursty traffic, not sustained low-latency workloads.

1352
MCQ · easy

A data scientist is training a linear regression model to predict house prices. The dataset contains 10 features. After training, the data scientist notices that the model has high bias (underfitting). Which action should the data scientist take to reduce bias?

A. Reduce the amount of training data
B. Add more features, such as polynomial features
C. Increase the regularization strength
D. Use a simpler model, such as ridge regression
Answer: B

Adding features increases model complexity, reducing bias.

Why this answer

High bias (underfitting) means the model is too simple to capture the underlying patterns in the data. Adding more features, such as polynomial features, increases model complexity, allowing the linear regression model to fit non-linear relationships and reduce bias. This directly addresses the underfitting issue by giving the model more expressive power.

Exam trap

AWS often tests the bias-variance tradeoff by making candidates confuse regularization (which reduces variance) with the need to increase model complexity to fix underfitting; the trap here is that increasing regularization or using a simpler model seems like a 'safe' choice, but it actually worsens bias.

How to eliminate wrong answers

Option A is wrong because reducing the amount of training data would increase variance and potentially worsen bias, as the model would have even less information to learn from. Option C is wrong because increasing regularization strength penalizes model complexity, which would further increase bias by forcing the model to be simpler. Option D is wrong because using a simpler model, such as ridge regression (which is a regularized linear model), would also increase bias by constraining the coefficients, making underfitting worse.

1353
MCQ · medium

A company is deploying a machine learning model to production on Amazon SageMaker. The model requires low-latency inference (under 10 ms) for real-time predictions. The data scientist has trained a model using XGBoost and wants to minimize cost while meeting latency requirements. Which SageMaker hosting option should be used?

A. Use a real-time endpoint with a single model
B. Use a serverless inference endpoint
C. Use a real-time endpoint with multi-model hosting
D. Use a batch transform job
E. Use an asynchronous inference endpoint
Answer: A

Real-time endpoints provide low-latency inference.

Why this answer

Option A is correct because SageMaker real-time endpoints provide low-latency inference suitable for real-time predictions. Option B (serverless inference) has cold starts and may not guarantee latency under 10 ms. Option C (multi-model hosting) can reduce cost by sharing resources, but may introduce higher latency when a model has to be loaded. Option D (batch transform) is for offline predictions, not real-time. Option E (asynchronous inference) queues requests and is intended for near-real-time workloads that tolerate higher latency.

1354
MCQ · easy

A data scientist is performing exploratory data analysis on a dataset with missing values. They want to understand the distribution of each feature and identify outliers. Which AWS service can be used to create visualizations such as histograms and box plots without writing any code?

A. Amazon EMR
B. AWS Glue
C. Amazon QuickSight
D. Amazon SageMaker Studio
E. Amazon Athena
Answer: C

QuickSight provides code-free visualizations like histograms and box plots.

Why this answer

Amazon QuickSight is a serverless, machine learning-powered business intelligence service that allows users to create interactive dashboards and visualizations without writing code. Option A is wrong because Amazon EMR is a big data platform not primarily for visualization. Option B is wrong because AWS Glue is used for ETL, not visualization. Option D is wrong because SageMaker Studio requires coding for custom visualizations. Option E is wrong because Amazon Athena is a query service.

1355
MCQ · hard

A data scientist is training a deep learning model on a GPU instance. The training data is stored in S3 and is 50 GB. To reduce I/O bottlenecks, which storage option should be used to cache the data locally on the instance?

A. Attach an Amazon EFS file system to the instance and copy data from S3
B. Mount an Amazon FSx for Lustre file system linked to the S3 bucket
C. Provision an Amazon EBS io2 volume and copy data from S3 using AWS DataSync
D. Use instance store volumes to cache the data from S3
Answer: B

FSx for Lustre provides high throughput and can cache S3 data locally.

Why this answer

Option B is correct because Amazon FSx for Lustre provides a high-performance file system integrated with S3 that can cache data locally. Option A is wrong because EFS is a shared file system with lower throughput. Option C is wrong because EBS with io2 volumes offers high IOPS but is not optimized for S3 caching. Option D is wrong because instance store is ephemeral and not persistent.

1356
MCQ · hard

A machine learning engineer is performing exploratory data analysis on a large dataset stored in Amazon S3 using AWS Glue. The dataset contains a mix of numeric and categorical features. The engineer wants to efficiently compute summary statistics (e.g., mean, median, standard deviation) for the numeric columns. Which AWS service or feature should the engineer use to achieve this with minimal setup?

A. Launch an Amazon EMR cluster and use Spark.
B. Use AWS Glue DataBrew to profile the dataset.
C. Use Amazon Athena to run SQL queries on the data.
D. Use Amazon SageMaker Data Wrangler.
Answer: B

DataBrew provides an easy interface for profiling and statistics.

Why this answer

Option B is correct because AWS Glue DataBrew provides a visual interface to profile data and compute summary statistics without writing code. Option A is wrong because Amazon EMR requires cluster setup and management. Option C is wrong because Amazon Athena requires SQL queries and more manual effort. Option D is wrong because Amazon SageMaker Data Wrangler is a good tool but requires more configuration than DataBrew for simple summary statistics.

1357
MCQ · medium

A data scientist is using Amazon SageMaker to train a deep learning model using a built-in algorithm. The training job uses an ml.p3.2xlarge instance and takes 10 hours to complete. The scientist wants to reduce training time without changing the algorithm or model architecture. The instance's GPU utilization is consistently at 95%, but CPU utilization is only 20%. The data input pipeline uses SageMaker Pipe mode with the 'TrainingInputMode' set to 'Pipe'. The training dataset is 200 GB in CSV format stored in S3. Which approach is most likely to reduce training time?

A. Switch from Pipe mode to File mode to reduce I/O overhead
B. Use Pipe mode with 'S3DataType' as 'AugmentedManifestFile'
C. Use a larger instance type with more GPUs, such as ml.p3.8xlarge
D. Reduce the batch size to improve GPU utilization
Answer: C

More GPUs can parallelize computation and reduce training time.

Why this answer

Option C is correct. Since GPU utilization is high (95%), the GPU is the bottleneck. Moving to an instance with more GPUs (e.g., ml.p3.8xlarge with 4 GPUs) can reduce training time by parallelizing computation. Option A is wrong because File mode may not help and could increase I/O overhead. Option B is wrong because Pipe mode is already being used and the job is not I/O-bound. Option D is wrong because reducing batch size could leave the GPU underutilized.

1358
MCQ · medium

A data analyst is working with a time series dataset that shows increasing variance over time. To stabilize the variance before modeling, which transformation is most appropriate?

A. First-order differencing
B. Box-Cox transformation
C. Log transformation
D. Min-max scaling
Answer: C

Log transformation compresses high values and stabilizes increasing variance.

Why this answer

Option C is correct because the log transformation is commonly used to stabilize variance when variance increases with the mean. Differencing (A) is for trend/seasonality, not variance. Box-Cox (B) is a more general family that contains the log as a special case; it also requires positive data and adds a parameter to estimate, so the log is the simpler standard choice here. Min-max scaling (D) does not stabilize variance.
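
The variance-stabilizing effect can be checked with a small sketch. It assumes a hypothetical series with multiplicative noise, so fluctuations scale with the level; the `noise` values and the two level segments are made up for illustration.

```python
import math
from statistics import pstdev

# Hypothetical series whose fluctuations scale with its level
# (multiplicative noise), so the spread grows as the level grows.
noise = [0.8, 1.0, 1.25, 0.9, 1.1]
early = [10 * n for n in noise]    # low-level segment
late = [1000 * n for n in noise]   # high-level segment

# Ratio of the spread in the late segment to the spread in the early one.
raw_ratio = pstdev(late) / pstdev(early)
log_ratio = pstdev([math.log(v) for v in late]) / pstdev([math.log(v) for v in early])

print(f"spread ratio raw={raw_ratio:.1f}, after log={log_ratio:.3f}")
```

Because log(level × noise) = log(level) + log(noise), the level becomes an additive offset and the spread no longer depends on it.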

1359
MCQ · easy

A data scientist wants to build a binary classifier to predict customer churn. The dataset has 10,000 records with 500 churners (5%). Which technique should the data scientist use to address class imbalance?

A. Randomly undersample the majority class.
B. Use SMOTE (Synthetic Minority Over-sampling Technique) to create synthetic samples.
C. Assign higher class weights to the minority class.
D. Downsample the majority class to match the minority class size.
Answer: B

SMOTE generates synthetic examples, effectively balancing the dataset.

Why this answer

SMOTE generates synthetic samples for the minority class, which is appropriate for imbalanced datasets. Option A (random undersampling) loses data. Option C (upweighting the minority class) is possible but less common. Option D (downsampling the majority class to the minority class size) also loses data: it would discard 9,000 of the 9,500 majority records.
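
The interpolation idea behind SMOTE can be sketched in a few lines. This is a simplified illustration, not the reference implementation (in practice a library such as imbalanced-learn is typically used); the function name `smote_sketch`, its parameters, and the sample points are hypothetical.

```python
import math
import random

def smote_sketch(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` among the other minority samples
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic

# Hypothetical 2-D minority class (e.g. churners)
minority = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.3), (1.1, 1.1)]
new_points = smote_sketch(minority, n_new=6)
print(len(new_points), "synthetic minority samples")
```

Each synthetic point lies on a line segment between two real minority samples, so the minority region is filled in rather than duplicated.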

1360
Multi-Select · medium

A data scientist is exploring a dataset containing customer transaction records. The target variable is 'churn' (1 = churned, 0 = not churned). Which TWO actions should the scientist take to understand the data distribution and prepare for modeling?

Select 2 answers
A. Apply Principal Component Analysis (PCA) to reduce dimensionality.
B. Train a gradient boosting model to identify important features.
C. Plot the frequency of the target variable to check for class imbalance.
D. Check for missing values in each column and decide on an imputation strategy.
E. Convert categorical variables into one-hot encoded vectors.
Answers: C, D

Plotting the target frequency reveals class imbalance, and a missing-value check informs the imputation strategy.

Why this answer

Visualizing class imbalance and identifying missing values are fundamental EDA steps. Option A (PCA) is for dimensionality reduction, not initial EDA. Option B (gradient boosting) is modeling, not EDA. Option E (one-hot encoding) is a preprocessing step for categorical variables, not an EDA action.
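
As a rough illustration of these two EDA checks, the sketch below counts target classes and missing values with the standard library. The records and column names are hypothetical, and `None` is assumed to mark a missing value.

```python
from collections import Counter

# Hypothetical customer records; None marks a missing value.
rows = [
    {"tenure": 12, "plan": "basic", "churn": 0},
    {"tenure": None, "plan": "pro", "churn": 0},
    {"tenure": 3, "plan": None, "churn": 1},
    {"tenure": 24, "plan": "basic", "churn": 0},
]

# Target distribution: a skewed count signals class imbalance.
class_counts = Counter(row["churn"] for row in rows)

# Missing values per column: this drives the imputation decision.
missing = Counter(col for row in rows for col, val in row.items() if val is None)

print("class counts:", dict(class_counts))
print("missing per column:", dict(missing))
```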

1361
MCQ · medium

A team is training a large NLP model using SageMaker. The training job fails with an OutOfMemory error. The instance type is ml.p3.2xlarge, which has a single GPU with 16 GB of GPU memory and 61 GiB of instance memory. Which action should the team take to resolve the issue without changing the model architecture?

A. Switch to a regression model
B. Increase the number of epochs
C. Enable SageMaker Managed Warm Pools
D. Reduce the batch size in the training script
Answer: D

Smaller batch size reduces GPU memory consumption per step.

Why this answer

Reducing the batch size decreases GPU memory usage per iteration, so Option D is correct. Option A changes the problem to regression. Option B lengthens training without reducing memory use per step. Option C only shortens startup time between jobs and is unrelated to GPU memory.

1362
MCQ · easy

A data scientist needs to transform raw JSON data from an S3 bucket into Parquet format using AWS Glue. The job must be cost-effective and run only when new data arrives. Which solution should be used?

A. Create a Glue crawler that runs continuously.
B. Schedule a Glue ETL job to run every hour.
C. Use Glue DataBrew to transform data and schedule it daily.
D. Create a Glue ETL job triggered by an S3 event notification via Lambda.
Answer: D

Event-driven trigger ensures cost-effectiveness.

Why this answer

Option D is correct because it uses an S3 event notification to invoke a Lambda function, which then triggers an AWS Glue ETL job only when new data arrives. This event-driven architecture ensures cost-effectiveness by avoiding continuous or scheduled runs, and it directly transforms raw JSON into Parquet format as required.

Exam trap

The trap here is that candidates may confuse Glue crawlers (which only catalog metadata) with Glue ETL jobs (which transform data), or assume scheduled jobs are always cost-effective without considering event-driven triggers.

How to eliminate wrong answers

Option A is wrong because a Glue crawler only populates the Data Catalog with schema metadata; it does not transform data into Parquet, and running it continuously would incur unnecessary cost. Option B is wrong because scheduling a Glue ETL job every hour runs regardless of whether new data has arrived, leading to wasted compute resources and higher costs. Option C is wrong because Glue DataBrew is a visual data preparation tool, not designed for automated, event-driven ETL transformations; scheduling it daily would also run even without new data and is less cost-effective than an event-triggered approach.
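
The event-driven pattern can be sketched as a small Lambda handler. This is an illustration under stated assumptions: the job name `json-to-parquet` and the `--source_path` job argument are hypothetical, and the optional `glue` parameter exists only so the function can be exercised locally with a stand-in client. The call used is boto3's Glue `start_job_run`.

```python
from urllib.parse import unquote_plus

GLUE_JOB_NAME = "json-to-parquet"  # hypothetical job name

def handler(event, context, glue=None):
    """Start one Glue job run per object named in an S3 event notification."""
    if glue is None:
        import boto3  # imported lazily; provided by the Lambda runtime
        glue = boto3.client("glue")
    run_ids = []
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        response = glue.start_job_run(
            JobName=GLUE_JOB_NAME,
            # "--source_path" is a hypothetical argument the job script would read.
            Arguments={"--source_path": f"s3://{bucket}/{key}"},
        )
        run_ids.append(response["JobRunId"])
    return run_ids
```

Because the function runs only when S3 emits an event, no compute is consumed while no new data arrives.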

1363
MCQ · medium

A company uses Amazon Kinesis Data Firehose to deliver data to Amazon S3. The data must be transformed before delivery using AWS Lambda. The Lambda function adds a timestamp field. The Firehose stream receives up to 10,000 records per second. The transformation currently takes 500 ms per record. What should the team do to ensure the transformation can keep up with the incoming data without data loss?

A. Increase the number of shards in the Kinesis stream.
B. Place the Lambda function in a VPC to improve network performance.
C. Increase the Lambda concurrency limit for the function to handle parallel invocations.
D. Increase the S3 buffer size and buffer interval in the Firehose delivery stream.
Answer: C

More concurrency allows processing more records in parallel.

Why this answer

Option C is correct because increasing the Lambda concurrency limit ensures that multiple Lambda invocations can run in parallel to handle the high throughput. Option A is wrong because increasing the number of shards applies to Kinesis Data Streams, not Firehose. Option B is wrong because placing the Lambda function in a VPC does not improve network performance and is not needed for adding a timestamp. Option D is wrong because increasing the buffer size and interval would delay delivery and does nothing to speed up the transformation.
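
A back-of-the-envelope estimate shows why concurrency matters here. The sketch applies Little's law (work in flight = arrival rate × time per item) and assumes, for illustration, that every record occupies an execution for the full 500 ms; the function name is hypothetical.

```python
import math

def required_concurrency(records_per_second, seconds_per_record):
    """Little's law: in-flight work = arrival rate x time in system."""
    return math.ceil(records_per_second * seconds_per_record)

needed = required_concurrency(10_000, 0.5)
print(needed, "concurrent executions")
```

Under that assumption, batching records into fewer invocations does not change the result, because each invocation then runs proportionally longer.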

1364
Multi-Select · hard

Which THREE of the following are best practices for feature engineering during EDA? (Select THREE.)

Select 3 answers
A. Remove all outliers from the dataset
B. Standardize all features to have zero mean and unit variance
C. Apply log transformation to highly skewed features
D. Create interaction features between numeric variables
E. Encode categorical variables using one-hot encoding
Answers: C, D, E

Log transformation reduces skewness.

Why this answer

Option C is correct because applying a log transformation to highly skewed features makes their distribution more symmetric, reducing the impact of extreme values and making the data more suitable for algorithms that are sensitive to skew. This is a common technique during exploratory data analysis (EDA) to stabilize variance and improve model performance, especially for linear models and neural networks. Options D and E are also correct: interaction features let a model capture the combined effect of two variables, and one-hot encoding turns categorical variables into numeric inputs that most algorithms can use.

Exam trap

AWS often tests the misconception that all preprocessing steps, like outlier removal and standardization, should be performed during EDA, when in fact EDA is for understanding data distributions and relationships, while transformations and scaling are part of data preprocessing that may follow EDA based on insights gained.

1365
MCQ · easy

A data scientist is performing EDA on a dataset with 1,000 features. The goal is to select the most important features for a regression model. Which technique can be used to rank feature importance quickly?

A. Calculate the correlation coefficient of each feature with the target
B. Use t-SNE to visualize feature relationships
C. Run k-means clustering and use cluster centroids
D. Apply Principal Component Analysis (PCA) and examine component loadings
Answer: A

Correlation with the target is quick to compute and yields a direct ranking.

Why this answer

Correlation analysis with the target variable is a quick way to rank features. Option B is wrong because t-SNE is for visualization. Option C is wrong because k-means is a clustering method and does not score features. Option D is wrong because PCA is unsupervised and does not rank features by importance to the target.

1366
MCQ · medium

A data engineer needs to ingest data from an on-premises Apache Kafka cluster into Amazon S3 with minimal latency (under 5 minutes) for real-time analytics. The data volume is approximately 10 MB per second. Which solution is MOST cost-effective and meets the latency requirement?

A. Use Amazon MSK to mirror the on-premises Kafka cluster, then use Kinesis Firehose to write to S3
B. Use Amazon S3 Transfer Acceleration for direct uploads from on-premises
C. Use Amazon Kinesis Data Streams with a Direct Connect connection from on-premises
D. Set up a VPN connection and use AWS Lambda to consume from Kafka and write to S3
Answer: A

MSK provides managed Kafka with low latency, and Firehose can buffer and write to S3 every 60 seconds.

Why this answer

Option A is correct. Amazon MSK (Managed Streaming for Apache Kafka) can replicate the on-premises Kafka topics to the cloud with low latency, and then Kinesis Firehose can deliver to S3 with a 60-second buffer. Option B (S3 Transfer Acceleration) is for file uploads, not streaming. Option C (Kinesis Data Streams) requires additional ingestion from on-premises. Option D (VPN + Lambda) has higher latency.

1367
MCQ · easy

A data scientist is using SageMaker to train a linear regression model. The target variable has a long-tail distribution. Which data transformation is LEAST likely to improve model performance?

A. Add interaction terms between features
B. Apply log transformation to the target variable
C. Normalize all feature values to [0,1]
D. Remove outliers from the target variable
Answer: C

Normalization rescales the coefficients but does not change the predictions of a least-squares linear regression.

Why this answer

Option C (normalizing features) is least likely to help because the predictions of an unregularized least-squares linear regression are invariant to linear rescaling of the features; only the coefficient values change. Option A (adding interaction terms) can capture relationships between features. Option B (log transformation of the target) can reduce skewness. Option D (removing outliers) can improve fit.

1368
MCQ · medium

A data engineering team needs to process streaming data from thousands of IoT devices. They want to aggregate data in 1-minute windows and store results in an S3 data lake for downstream analytics. Which architecture should they use?

A. Use AWS Glue ETL jobs running in streaming mode to read from Kinesis Data Streams, apply window aggregations, and write to S3.
B. Use Kinesis Data Streams with enhanced fan-out and multiple consumers to aggregate windows, then write to S3 via Firehose.
C. Use Kinesis Data Streams, trigger a Lambda function for 1-minute window aggregation using Python, and write results to S3.
D. Use Kinesis Data Analytics for SQL-based windowed aggregations and send results to Kinesis Data Firehose for delivery to S3.
Answer: D

Kinesis Data Analytics supports tumbling windows and continuous queries; Firehose is the natural sink for S3.

Why this answer

Option D is correct because Kinesis Data Analytics provides real-time SQL-based processing with windowing functions, and Kinesis Firehose can deliver aggregated data directly to S3. Option A is less suitable because a Glue streaming ETL job processes data in micro-batches and requires writing and operating a Spark job, which is more setup than a SQL windowed query. Option B is wrong because Kinesis Data Streams alone does not process data; the consumers would need custom aggregation logic. Option C is wrong because Lambda scales well but the windowed aggregation would have to be written and maintained as custom code, which is not ideal for heavy streaming aggregation.
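
For readers unfamiliar with tumbling windows, the sketch below shows the aggregation logic itself in plain Python: each event falls into exactly one non-overlapping 60-second window. It illustrates the concept only and is not how the managed service is implemented; the event layout and function name are assumptions.

```python
from collections import defaultdict

WINDOW_SECONDS = 60

def tumbling_average(events):
    """Average `value` per device per non-overlapping 60-second window.

    `events` is an iterable of (epoch_seconds, device_id, value) tuples.
    """
    sums = defaultdict(lambda: [0.0, 0])  # (window_start, device) -> [sum, count]
    for ts, device, value in events:
        window_start = ts - ts % WINDOW_SECONDS
        bucket = sums[(window_start, device)]
        bucket[0] += value
        bucket[1] += 1
    return {key: total / count for key, (total, count) in sums.items()}

# Hypothetical readings from two devices
events = [(0, "a", 10.0), (30, "a", 20.0), (59, "b", 5.0), (60, "a", 40.0)]
result = tumbling_average(events)
print(result)
```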

1369
MCQ · easy

A company wants to build a model to detect fraudulent transactions. The dataset has a highly imbalanced class distribution. Which technique should be used during training to handle class imbalance?

A. Add more features to the dataset
B. Use SageMaker's built-in fraud detection algorithm that applies random under-sampling
C. Reduce the learning rate
D. Increase the tree depth in XGBoost
Answer: B

The algorithm handles imbalance by under-sampling.

Why this answer

Option B is correct because it is the only option that addresses the imbalance directly, through random under-sampling of the majority class. Option A is wrong because adding more features does not directly handle imbalance. Option C is wrong because reducing the learning rate does not address imbalance. Option D is wrong because increasing tree depth may overfit.

1370
MCQ · hard

A team is training a deep learning model on SageMaker using a custom PyTorch container. Training takes 24 hours on a single ml.p3.2xlarge instance. The team wants to reduce training time using distributed training. Which strategy is MOST appropriate?

A. Use data parallelism with Horovod across multiple instances
B. Use model parallelism to split the model across multiple GPUs
C. Use SageMaker Managed Spot Training to reduce cost
D. Use SageMaker Automatic Model Tuning to find optimal hyperparameters
Answer: B

Model parallelism splits the model across devices, suitable for large models.

Why this answer

Option B (model parallelism) is correct for deep learning models that are too large to fit on a single GPU. Option A (data parallelism) would require the model to fit on each GPU, which may not be the case. Option C (Managed Spot Training) may cause interruptions and does not guarantee a speedup. Option D (Automatic Model Tuning) does not reduce training time directly.

1371
MCQ · medium

A data scientist is working with a dataset that contains a feature with many outliers. Which transformation should the scientist apply to reduce the impact of outliers?

A. Min-max scaling
B. Log transformation
C. Standardization (z-score)
D. Binning
Answer: B

Log transformation reduces skewness and dampens outlier effects.

Why this answer

Log transformation (Option B) compresses the range of values and reduces the impact of outliers. Option A (min-max scaling) is sensitive to outliers because the minimum and maximum define the scale. Option C (standardization) does not reduce outlier impact. Option D (binning) loses information. A square root transformation, which is not among the options, is less effective than log for large outliers.

1372
MCQ · medium

A data scientist is using an IAM role with the policy shown in the exhibit to train a model in SageMaker. The training job fails with a permissions error. What is the missing permission?

A. sagemaker:InvokeEndpoint
B. sagemaker:DescribeTrainingJob
C. s3:ListBucket
D. iam:PassRole
Answer: D

The caller needs iam:PassRole to hand the execution role to SageMaker.

Why this answer

When a training job is created, the caller specifies an IAM execution role that SageMaker assumes to access resources such as S3 buckets. The caller's own policy must allow `iam:PassRole` on that role. Without it, the request to create the training job is rejected, so SageMaker never receives the role and cannot perform actions such as reading training data from S3.

Exam trap

The trap here is that candidates often focus on S3 or SageMaker-specific actions (like `s3:GetObject` or `sagemaker:CreateTrainingJob`) and overlook the prerequisite `iam:PassRole` permission, which the caller needs before SageMaker can be given the execution role.

How to eliminate wrong answers

Option A is wrong because `sagemaker:InvokeEndpoint` is used for invoking a deployed endpoint for inference, not for training jobs. Option B is wrong because `sagemaker:DescribeTrainingJob` is a read-only action that allows viewing training job metadata, not a permission required to launch or execute a training job. Option C is wrong because `s3:ListBucket` is an S3 action that might be needed for listing objects in a bucket, but the core issue is that SageMaker cannot assume the IAM role at all, so S3 permissions are irrelevant until the role is passed.

1373
MCQ · easy

A data analyst wants to check for duplicate rows in a dataset stored in S3. Which AWS service can be used to run a SQL query to count duplicates without moving the data?

A. Amazon Athena
B. Amazon Redshift Spectrum
C. Amazon SageMaker Studio
D. AWS Glue
Answer: A

Athena can run SQL queries on S3 data to count duplicates.

Why this answer

Option A is correct because Amazon Athena allows running SQL queries directly on data in S3, including counting duplicates. Option B is wrong because Amazon Redshift Spectrum can query data in S3 but requires a Redshift cluster. Option C is wrong because Amazon SageMaker Studio is an IDE, not a query service. Option D is wrong because AWS Glue is an ETL service, not a query engine.
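
The duplicate check itself is a standard GROUP BY / HAVING query. The sketch below runs it against an in-memory SQLite database as a local stand-in, on the assumption that the same pattern carries over to Athena; the `orders` table and its columns are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer_id INTEGER, item TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "book", 9.5), (1, "book", 9.5), (2, "pen", 1.0), (1, "book", 9.5), (3, "ink", 4.0)],
)

# Group on every column; any group with more than one row is a duplicate set.
duplicate_query = """
    SELECT customer_id, item, amount, COUNT(*) AS copies
    FROM orders
    GROUP BY customer_id, item, amount
    HAVING COUNT(*) > 1
"""
duplicates = conn.execute(duplicate_query).fetchall()
print(duplicates)
```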

1374
MCQ · hard

A data scientist is training a binary classifier using logistic regression. The dataset has 100,000 samples and 500 features. After training, the model achieves 95% accuracy on the training set but only 70% on the test set. The data scientist suspects overfitting. Which technique would best reduce overfitting while preserving interpretability?

A. Apply L1 regularization (Lasso)
B. Increase the maximum number of iterations
C. Add polynomial features
D. Use a random forest model instead
Answer: A

L1 regularization performs feature selection, reducing overfitting and keeping the model interpretable.

Why this answer

L1 regularization (Lasso) adds a penalty proportional to the sum of the absolute values of the coefficients, which drives many feature weights to exactly zero. This performs automatic feature selection, reducing model complexity and overfitting while keeping the model as a simple linear logistic regression, thus preserving interpretability.

Exam trap

AWS often tests the distinction between regularization techniques that shrink coefficients (L2/Ridge) versus those that zero them out (L1/Lasso), and candidates may mistakenly choose L2 or fail to recognize that L1 directly improves interpretability by removing irrelevant features.

How to eliminate wrong answers

Option B is wrong because increasing the maximum number of iterations only ensures the optimization algorithm converges; it does not address overfitting and may even lead to further overfitting if the model is already fitting noise. Option C is wrong because adding polynomial features increases model complexity and the number of parameters, which would worsen overfitting rather than reduce it. Option D is wrong because while a random forest can reduce overfitting through ensemble averaging, it is a non-linear black-box model that sacrifices the interpretability of logistic regression's coefficient-based explanations.
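
Why L1 produces exact zeros while L2 only shrinks can be seen from the per-coefficient update each penalty induces. The sketch below applies soft-thresholding (the proximal step for an L1 penalty) and the corresponding rescaling step for a penalty of (penalty/2)·w²; the coefficient values and penalty strength are hypothetical.

```python
def soft_threshold(weight, penalty):
    """Proximal step for an L1 penalty: shrink toward zero, clip at zero."""
    if weight > penalty:
        return weight - penalty
    if weight < -penalty:
        return weight + penalty
    return 0.0

def l2_shrink(weight, penalty):
    """Proximal step for (penalty / 2) * w**2: rescale, never exactly zero."""
    return weight / (1.0 + penalty)

weights = [2.0, -0.3, 0.05, -1.5, 0.2]  # hypothetical coefficients
penalty = 0.5

l1_weights = [soft_threshold(w, penalty) for w in weights]
l2_weights = [l2_shrink(w, penalty) for w in weights]

print("L1:", l1_weights)
print("L2:", l2_weights)
```

Small coefficients fall inside the threshold and are removed entirely under L1, which is what leaves a shorter, easier-to-read list of features.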

1375
MCQ · hard

A company is using Amazon SageMaker to train a time series forecasting model using the DeepAR algorithm. The training data contains multiple time series. The model is overfitting. Which action is LEAST likely to reduce overfitting?

A. Decrease the number of layers in the neural network.
B. Increase the dropout rate.
C. Decrease the context length.
D. Reduce the number of time series in the training set.
Answer: D

Less data may worsen overfitting.

Why this answer

Option D is correct because reducing the number of time series in the training set reduces the diversity of training data, which typically increases overfitting rather than reducing it. DeepAR relies on learning patterns across multiple related time series to generalize well; fewer time series mean less shared statistical strength, making the model more likely to memorize noise in the remaining series.

Exam trap

The trap here is that candidates mistakenly think reducing training data always reduces overfitting, but in time series forecasting with DeepAR, fewer time series actually weaken the cross-series learning that regularizes the model, making overfitting worse.

How to eliminate wrong answers

Option A is wrong because decreasing the number of layers reduces the model's capacity, which directly combats overfitting by limiting the complexity of learned representations. Option B is wrong because increasing the dropout rate randomly drops neurons during training, which acts as a regularization technique to prevent co-adaptation and reduce overfitting. Option C is wrong because decreasing the context length shortens the look-back window, forcing the model to rely on fewer historical points and reducing its ability to memorize long-term patterns, which helps mitigate overfitting.

1376
MCQ · hard

A company uses Amazon SageMaker to host a model for real-time inference. The model is a large ensemble of 10 deep learning models, each 500 MB. The total model size is 5 GB, which exceeds the 5 GB limit for SageMaker real-time endpoints. The data scientist wants to reduce the model size without significantly impacting accuracy. The ensemble uses averaging of predictions from all models. The scientist has access to a validation set with 10,000 samples. Which technique should the scientist use to reduce the model size?

A. Use model distillation to train a smaller model that approximates the ensemble
B. Use a more expensive instance type to host the model
C. Use SageMaker Neo to compile and optimize the model
D. Apply weight pruning to each model in the ensemble
Answer: A

Distillation produces a compact model with similar performance.

Why this answer

Option A is correct. Model distillation trains a smaller student model to mimic the ensemble, reducing size while preserving accuracy. Option B is wrong because a more expensive instance type does not reduce model size.

Option C is wrong because SageMaker Neo is for optimization, not size reduction below 5 GB. Option D is wrong because pruning alone may not reduce size enough.

1377
Multi-Select · easy

During EDA, a data scientist generates a pairplot of the dataset and observes that two features have a Pearson correlation coefficient of 0.95. Which TWO conclusions can the scientist draw from this observation? (Choose 2)

Select 2 answers
A. The two features may be multicollinear
B. The two features have a strong linear relationship
C. The two features move in opposite directions
D. The two features are statistically independent
E. One feature causes the other
Answers: A, B

High correlation between features can cause multicollinearity in regression models.

Why this answer

Options A and B are correct because a high correlation indicates a strong linear relationship and suggests multicollinearity. Option E is wrong because correlation does not imply causation. Option C is wrong because a high positive correlation means the features move together, not in opposite directions.

Option D is wrong because a correlation of 0.95 shows strong linear dependence, which is the opposite of statistical independence.
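
To make the interpretation concrete, here is a minimal Python sketch that computes the Pearson coefficient for two made-up feature columns. The data values and the helper name `pearson` are illustrative assumptions, not part of the question. A value near +1 indicates a strong positive linear relationship, which is also a warning sign for multicollinearity; it says nothing about causation.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical feature columns: feature_y rises almost linearly with feature_x.
feature_x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
feature_y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0]

r = pearson(feature_x, feature_y)
print(f"r = {r:.3f}")  # close to +1: strong positive linear relationship

# Negating one feature flips the sign: the features then move in opposite directions.
r_neg = pearson(feature_x, [-y for y in feature_y])
print(f"r (negated) = {r_neg:.3f}")
```

Only a coefficient near -1 would support the "opposite directions" reading; the sign carries the direction and the magnitude carries the strength.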

1378
MCQmedium

A company uses Amazon SageMaker to train a deep learning model for image classification. The training job is taking longer than expected. The data scientist observes that GPU utilization is low (around 30%) and CPU utilization is high. Which action is most likely to reduce training time?

A.Reduce the batch size
B.Increase the batch size
C.Increase the learning rate
D.Increase the number of data loading workers
AnswerD

More data loading workers can parallelize data preprocessing and reduce I/O bottleneck, improving GPU utilization.

Why this answer

Option D is correct because low GPU utilization combined with high CPU utilization indicates that the data pipeline is not feeding data fast enough, causing the GPU to idle. Increasing the number of data loading workers can improve data throughput. Option B is wrong because larger batch sizes may increase memory usage and do not directly address the bottleneck.

Option A is wrong because reducing the batch size may further underutilize the GPU. Option C is wrong because increasing the learning rate does not address the data loading bottleneck.

1379
MCQmedium

A data scientist is building a fraud detection model using a highly imbalanced dataset. The model uses a random forest classifier. The recall for the minority class is 0.6, and precision is 0.9. The business requires recall above 0.8. Which action should the data scientist take to improve recall?

A.Perform feature selection to remove noisy features.
B.Increase the maximum depth of the trees.
C.Increase the class weight for the minority class in the algorithm.
D.Decrease the probability threshold for classifying a transaction as fraudulent.
E.Increase the number of trees in the random forest.
AnswerD

Lower threshold increases true positives (recall) but may reduce precision.

Why this answer

Option D is correct because decreasing the classification threshold for the positive class increases recall (more positives predicted) at the cost of precision. Option E (more trees) reduces variance but may not improve recall. Option C (class weights) can help, but it requires retraining, whereas adjusting the threshold acts directly on the existing model's scores.

Option A (feature selection) may reduce recall. Option B (increase max depth) could lead to overfitting and does not necessarily improve recall.
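
The precision/recall trade-off behind threshold tuning can be sketched in a few lines of Python. The scores and labels below are invented for illustration, and `precision_recall` is a hypothetical helper rather than a SageMaker or scikit-learn API.

```python
def precision_recall(scores, labels, threshold):
    """Precision and recall when scores >= threshold are predicted positive."""
    predicted = [s >= threshold for s in scores]
    tp = sum(1 for p, y in zip(predicted, labels) if p and y == 1)
    fp = sum(1 for p, y in zip(predicted, labels) if p and y == 0)
    fn = sum(1 for p, y in zip(predicted, labels) if not p and y == 1)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Hypothetical fraud probabilities from a classifier and the true labels (1 = fraud).
scores = [0.95, 0.85, 0.70, 0.45, 0.40, 0.35, 0.30, 0.20, 0.10, 0.05]
labels = [1, 1, 1, 1, 0, 1, 0, 0, 0, 0]

for threshold in (0.5, 0.3):
    p, r = precision_recall(scores, labels, threshold)
    print(f"threshold={threshold}: precision={p:.2f}, recall={r:.2f}")
```

On this toy data, lowering the threshold from 0.5 to 0.3 captures the two fraud cases that scored below 0.5, at the price of two additional false positives.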

1380
Multi-Selecthard

Which TWO approaches can reduce inference latency on a SageMaker real-time endpoint? (Choose 2.)

Select 2 answers
A.Attach an Elastic Inference accelerator
B.Increase the batch size
C.Enable SageMaker Model Monitor
D.Use a GPU instance type
E.Compile the model using SageMaker Neo
AnswersA, E

Provides GPU acceleration at lower cost.

Why this answer

Attaching an Elastic Inference accelerator (Option A) and compiling the model with SageMaker Neo (Option E) reduce latency. Option D is wrong because a GPU instance does not always reduce latency; it can add overhead.

Option B is wrong because larger batch sizes increase latency. Option C is wrong because Model Monitor adds overhead and is not a latency optimization.

1381
MCQmedium

A company's ML model is deployed on a SageMaker endpoint. The model's predictions are used in a customer-facing application that requires low latency. Over time, the model's performance degrades due to data drift. What is the most suitable approach to detect this drift automatically?

A.Set up a CloudWatch alarm on the endpoint's invocation latency
B.Periodically retrain the model using all historical data
C.Use Amazon S3 events to trigger a Lambda function that compares distributions
D.Enable Amazon SageMaker Model Monitor to continuously check for data drift
AnswerD

Built-in drift detection.

Why this answer

SageMaker Model Monitor can detect data drift automatically. Option A is wrong because CloudWatch alarms are for infrastructure metrics, not drift. Option C is wrong because S3 events trigger on object changes, not drift.

Option B is wrong because retraining on all data is inefficient and does not detect drift.

1382
MCQhard

A data scientist runs a SageMaker training job and receives the above error. The S3 bucket 'my-bucket' contains a folder 'data' with a file 'data.csv'. What is the MOST likely cause of the error?

A.The instance type ml.m5.large does not have enough memory
B.The VolumeSizeInGB is too small to download the data
C.The S3 URI should be s3://my-bucket/data/data.csv instead of s3://my-bucket/data
D.The S3 bucket and the training job are in different regions
AnswerC

If the training script expects a single file, the S3 URI must point to the file directly.

Why this answer

The error occurs because the SageMaker training job expects a specific S3 object URI (pointing to a file), not a prefix (pointing to a folder). When you specify `s3://my-bucket/data`, SageMaker interprets it as a prefix and attempts to list objects under that prefix, but the training channel requires a direct file reference. Using `s3://my-bucket/data/data.csv` provides the exact object path, allowing SageMaker to download the file correctly.

Exam trap

The trap here is that candidates confuse S3 prefixes (folders) with S3 objects (files), assuming SageMaker can automatically resolve a folder to its contents, when in fact it requires an explicit file path for training data channels.

How to eliminate wrong answers

Option A is wrong because the error is about S3 URI format, not instance memory; ml.m5.large has sufficient memory for typical CSV processing. Option B is wrong because VolumeSizeInGB controls the local storage volume for the training instance, not the download of data from S3; SageMaker downloads data to the volume regardless of its size. Option D is wrong because cross-region S3 access would cause a different error (e.g., 'Access Denied' or 'BucketRegionError'), not a URI parsing error, and SageMaker training jobs can access buckets in different regions if the IAM role allows it.

1383
Multi-Selecthard

Which THREE factors should be considered when selecting the appropriate algorithm for a regression problem? (Choose 3.)

Select 3 answers
A.The number of features relative to the number of samples
B.The interpretability requirements of the business stakeholders
C.The presence of non-linear relationships in the data
D.The time of day the training will occur
E.The color of the data scientist's laptop
AnswersA, B, C

High-dimensional data may require regularization.

Why this answer

Option A is correct because the ratio of features to samples directly impacts model complexity and overfitting risk. In high-dimensional settings (e.g., p >> n), algorithms like linear regression may fail due to singular covariance matrices, while regularized methods (Ridge, Lasso) or tree-based models become necessary. This is a core consideration in the bias-variance tradeoff for regression problems.

Exam trap

AWS often tests the distinction between operational concerns (like training time or hardware) and core modeling factors, expecting candidates to recognize that irrelevant options (time of day, laptop color) are clear distractors while the three correct factors directly influence algorithm performance and business suitability.

1384
MCQmedium

A company is using SageMaker built-in object detection algorithm to detect defects in manufacturing images. The model is trained on 10,000 labeled images and achieves 95% accuracy. However, in production, the model misclassifies many defective items as non-defective (false negatives). The business requires recall > 90% for the defect class. Which action should they take?

A.Use a different algorithm such as semantic segmentation
B.Adjust the decision threshold of the model to increase recall at the expense of precision
C.Use SageMaker's Automatic Model Tuning to find better hyperparameters
D.Retrain the model with more images of non-defective items
AnswerB

Lowering the threshold increases recall for the positive class.

Why this answer

Threshold tuning directly optimizes recall for a given class.

1385
Multi-Selecthard

A company is deploying a machine learning model for fraud detection. The model outputs a probability score. The cost of false negatives is very high. Which TWO metrics should the company focus on optimizing?

Select 2 answers
A.Precision
B.False positive rate (FPR)
C.F1 score
D.Area under the ROC curve (AUC-ROC)
E.Recall
AnswersC, E

F1 = harmonic mean of precision and recall; optimizing F1 also improves recall.

Why this answer

Recall (true positive rate) measures ability to find positives; minimizing false negatives is optimizing recall. AUC-ROC summarizes overall performance but not specific to false negatives. Precision focuses on false positives.

FPR is about false positives. F1 balances precision and recall, but recall directly addresses false negatives.

1386
MCQmedium

A team is deploying a real-time inference endpoint using Amazon SageMaker. The model is a large deep learning model that requires GPU for inference. The endpoint must handle variable traffic patterns with minimal latency. Which deployment strategy should the team use?

A.Deploy a single model endpoint with an auto-scaling policy.
B.Use a SageMaker multi-model endpoint with GPU instance type.
C.Deploy a serverless endpoint using SageMaker Serverless Inference.
D.Use SageMaker Batch Transform to process requests in batches.
AnswerB

Multi-model endpoints allow hosting multiple models on GPU instances, handling variable traffic efficiently.

Why this answer

B is correct because SageMaker multi-model endpoints (MMEs) allow multiple models to be hosted on a single GPU-backed endpoint, dynamically loading and unloading models from disk to GPU memory as needed. This reduces cost and cold-start latency compared to single-model endpoints, while still providing GPU acceleration for deep learning inference. MMEs are ideal for variable traffic patterns because they can scale horizontally and share GPU resources efficiently.

Exam trap

The trap here is that candidates often assume serverless inference (Option C) is suitable for GPU workloads, but AWS SageMaker Serverless Inference only supports CPU instances, making it incompatible with large deep learning models that require GPU acceleration.

How to eliminate wrong answers

Option A is wrong because a single model endpoint with auto-scaling can handle variable traffic but does not optimize GPU utilization for multiple models; it would require separate endpoints for each model, increasing cost and management overhead. Option C is wrong because SageMaker Serverless Inference does not support GPU instances; it uses CPU-based compute, which is unsuitable for large deep learning models requiring GPU acceleration. Option D is wrong because SageMaker Batch Transform is designed for offline, asynchronous batch processing, not real-time inference with minimal latency; it cannot handle variable traffic patterns dynamically.

1387
MCQmedium

A data scientist is training a deep learning model on Amazon SageMaker using a custom Docker container. The training job fails with an error 'OutOfMemoryError: CUDA out of memory'. The instance type is ml.p3.2xlarge (8 GB GPU memory). The model has 50 million parameters. What is the most likely cause and solution?

A.The instance type is insufficient; switch to ml.p3.8xlarge
B.The batch size is too large; reduce batch size
C.Enable gradient checkpointing to reduce memory
D.The model uses FP32 precision; enable mixed precision training
AnswerD

Mixed precision (FP16) halves memory usage, fitting the model into 8 GB.

Why this answer

Option D is correct because 50M parameters likely exceed GPU memory when using full precision. Mixed precision (FP16) reduces memory usage. Option B (batch size) could help but is secondary.

Option A (instance type) may be unnecessary if mixed precision works. Option C (gradient checkpointing) lowers activation memory at the cost of extra computation, so it is likewise a secondary measure here.

1388
MCQeasy

A data scientist is training a binary classification model using Amazon SageMaker. The dataset is highly imbalanced (99% negative class, 1% positive class). The model currently achieves 99% accuracy but fails to detect most positive cases. Which metric should the data scientist primarily use to evaluate model performance?

A.ROC AUC
B.F1 score
C.Recall
D.Accuracy
AnswerB

F1 score balances precision and recall, suitable for imbalanced data.

Why this answer

In highly imbalanced datasets (99% negative, 1% positive), accuracy is misleading because a model can achieve 99% accuracy by simply predicting the majority class for all instances, failing to detect any positive cases. The F1 score (option B) is the harmonic mean of precision and recall, providing a balanced measure that penalizes models that trade off recall for precision or vice versa. This makes it the primary metric for evaluating binary classification performance on imbalanced data, as it directly reflects the model's ability to correctly identify positive cases while minimizing false positives.

Exam trap

The trap here is that candidates see 99% accuracy and assume the model is performing well, failing to recognize that accuracy is meaningless on imbalanced datasets, and they may incorrectly choose ROC AUC because it is commonly used for binary classification without understanding its limitations with extreme class imbalance.

How to eliminate wrong answers

Option A (ROC AUC) is wrong because it measures the model's ability to rank positive instances higher than negative ones across all thresholds, which can be overly optimistic on highly imbalanced datasets and does not directly reflect precision or recall for the minority class. Option C (Recall) is wrong because while it captures the proportion of actual positives correctly identified, it ignores false positives, so a model could achieve high recall by predicting all instances as positive, which is not useful. Option D (Accuracy) is wrong because it is dominated by the majority class; a model that always predicts the negative class achieves 99% accuracy but fails entirely to detect positive cases, making it a poor metric for imbalanced classification.
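
The accuracy paradox described above is easy to reproduce. The following Python sketch uses an invented dataset of 990 negatives and 10 positives; both "models" and the helper `evaluate` are hypothetical and exist only to show how accuracy and F1 diverge.

```python
def evaluate(predictions, labels):
    """Accuracy, precision, recall, and F1 for binary labels (1 = positive)."""
    pairs = list(zip(predictions, labels))
    tp = sum(1 for p, y in pairs if p == 1 and y == 1)
    tn = sum(1 for p, y in pairs if p == 0 and y == 0)
    fp = sum(1 for p, y in pairs if p == 1 and y == 0)
    fn = sum(1 for p, y in pairs if p == 0 and y == 1)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return accuracy, precision, recall, f1

# 1% positive class: the first 10 labels are positive, the remaining 990 negative.
labels = [1] * 10 + [0] * 990

# Model 1 always predicts the majority class.
always_negative = [0] * 1000
# Model 2 (hypothetical) finds 8 of 10 positives but raises 12 false alarms.
some_detector = [1] * 8 + [0] * 2 + [1] * 12 + [0] * 978

for name, preds in (("always_negative", always_negative), ("some_detector", some_detector)):
    acc, prec, rec, f1 = evaluate(preds, labels)
    print(f"{name}: accuracy={acc:.3f} precision={prec:.2f} recall={rec:.2f} f1={f1:.2f}")
```

The majority-class predictor wins on accuracy yet scores an F1 of zero, while the imperfect detector has slightly lower accuracy and a clearly higher F1.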

1389
MCQmedium

A company is building a fraud detection model. The dataset is highly imbalanced (99% legitimate, 1% fraud). The data scientist trains a model using Amazon SageMaker's built-in XGBoost algorithm. The model achieves 99% accuracy but only catches 10% of fraud cases. Which technique should the data scientist apply to improve recall for the minority class?

A.Use random under-sampling of the majority class.
B.Set the scale_pos_weight hyperparameter in XGBoost.
C.Use mean squared error as the objective function.
D.Use SMOTE to oversample the minority class.
AnswerB

This adjusts the weight of positive class to handle imbalance.

Why this answer

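Option B is correct because setting scale_pos_weight increases the weight of the positive class to offset the imbalance. Option D is wrong because SMOTE creates synthetic samples, while XGBoost already has built-in handling for imbalance. Option A is wrong because under-sampling discards majority-class data.

Option C is wrong because mean squared error is an objective for regression, not classification.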
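
A common starting value for scale_pos_weight, suggested in the XGBoost parameter documentation, is the ratio of negative to positive examples. The sketch below computes that ratio in plain Python for an invented label column; the helper name and the dictionary layout are assumptions for illustration, and the value would normally be tuned further.

```python
def suggested_scale_pos_weight(labels):
    """Ratio of negative to positive examples, a typical starting value."""
    positives = sum(1 for y in labels if y == 1)
    negatives = len(labels) - positives
    if positives == 0:
        raise ValueError("no positive examples in labels")
    return negatives / positives

# Hypothetical label column: 1% fraud (1), 99% legitimate (0).
labels = [1] * 100 + [0] * 9900

weight = suggested_scale_pos_weight(labels)

# Illustrative hyperparameter dictionary; only these two keys are shown here.
hyperparameters = {"objective": "binary:logistic", "scale_pos_weight": weight}
print(hyperparameters)
```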

1390
Multi-Selecthard

Which THREE AWS services can be used together to build a serverless data pipeline that ingests streaming data, transforms it, and loads it into Amazon Redshift for analysis?

Select 3 answers
A.Amazon EMR
B.Amazon SQS
C.Amazon Kinesis Data Firehose
D.Amazon Kinesis Data Streams
E.AWS Lambda
AnswersC, D, E

Delivers transformed data directly to Redshift.

Why this answer

Kinesis Data Streams ingests streaming data. Lambda processes it. Firehose delivers to Redshift.

EMR is not serverless (managed). SQS is not ideal for streaming. Glue can also be used but is not the only option.

1391
MCQeasy

During EDA, a data scientist creates a scatter matrix of numerical features and notices that some features have a funnel-shaped pattern (variance increases with the mean). What is the appropriate transformation to stabilize variance?

A.Apply log transformation.
B.Standardize the features using Z-scores.
C.Apply a sine transformation.
D.Apply Box-Cox transformation with lambda=0.
AnswerA

Log transformation stabilizes variance when variance increases with mean.

Why this answer

A funnel-shaped pattern in a scatter matrix indicates heteroscedasticity, where variance increases with the mean. The log transformation is appropriate because it compresses the scale of the data, making the variance more constant across the range of values, which stabilizes variance for right-skewed or multiplicative data.

Exam trap

AWS often tests the distinction between transformations that stabilize variance (log, Box-Cox) versus those that only standardize (Z-scores) or are domain-specific (sine), and candidates may incorrectly choose Box-Cox with lambda=0 thinking it is a separate technique, missing that the log transformation is the canonical answer for funnel-shaped heteroscedasticity.

How to eliminate wrong answers

Option B is wrong because standardizing using Z-scores centers and scales the data to unit variance but does not address the relationship between variance and mean; it assumes homoscedasticity and can amplify heteroscedasticity. Option C is wrong because a sine transformation is periodic and used for cyclical or angular data, not for stabilizing variance in funnel-shaped patterns. Option D is wrong because Box-Cox with lambda=0 is equivalent to the log transformation only when the data is positive, but the Box-Cox transformation is a family of power transformations; specifying lambda=0 directly is redundant and the question asks for the appropriate transformation, not a specific parameterization.
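
As a small illustration of why the log transformation stabilizes variance, the Python sketch below builds three invented groups whose spread grows in proportion to their scale, then compares the standard deviation before and after taking logs. The scales and multiplicative factors are assumptions chosen to make the effect easy to see.

```python
import math
import statistics

# Hypothetical groups: each value is a scale times a factor of 0.8, 1.0, or 1.25,
# so the spread grows in proportion to the scale (a funnel shape).
scales = [10, 100, 1000]
factors = [0.8, 1.0, 1.25]
groups = {s: [s * f for f in factors] for s in scales}

# Population standard deviation of each group on the raw and on the log scale.
raw_spread = {s: statistics.pstdev(v) for s, v in groups.items()}
log_spread = {s: statistics.pstdev([math.log(x) for x in v]) for s, v in groups.items()}

for s in scales:
    print(f"scale={s}: raw std={raw_spread[s]:.3f}, log std={log_spread[s]:.3f}")
```

Because log(s * f) = log(s) + log(f), the scale becomes an additive offset and the spread within each group depends only on the factors, which is why the log-scale standard deviations coincide.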

1392
MCQeasy

A machine learning team is using AWS Glue to prepare data for training. They notice that the ETL job takes a long time to process large datasets. Which change is most likely to improve performance?

A.Increase the number of DPUs for the Glue job.
B.Decrease the number of workers in the Glue job.
C.Disable Spark shuffle operations.
D.Reduce the dataset size by sampling.
AnswerA

More DPUs increase parallelism and speed up processing.

Why this answer

Option A is correct because increasing the number of DPUs (Data Processing Units) in AWS Glue can parallelize processing and reduce job duration. Option B is incorrect as it may reduce parallelism. Option C is incorrect because Spark shuffle is necessary; avoiding it may not be feasible.

Option D is incorrect because reducing data size is not always possible.

1393
MCQeasy

A machine learning team is using SageMaker to train a model with the built-in Linear Learner algorithm. The dataset has 1 million rows and 20 features. The training completes, but the model's mean squared error (MSE) is high. Which parameter adjustment is most likely to reduce MSE?

A.Increase the mini-batch size
B.Change the loss function to cross-entropy
C.Increase the number of epochs
D.Increase the learning rate
AnswerC

More epochs allow the algorithm to converge to a lower loss.

Why this answer

Option C is correct because increasing the number of epochs allows the model to converge better. Option D (learning rate increase) may cause instability. Option A (batch size increase) can slow convergence.

Option B (loss function change) is wrong because cross-entropy is a classification loss and does not apply to a regression problem evaluated with MSE.

1394
MCQhard

A machine learning team is using SageMaker to train a custom TensorFlow model on a dataset that fits in memory. The training job is taking too long. The team wants to reduce training time without changing the model architecture. Which approach is most effective?

A.Switch the input mode from File to Pipe
B.Use SageMaker managed spot training
C.Use Amazon EFS as the input data source instead of S3
D.Use a larger instance type with more vCPUs
AnswerA

Pipe mode streams data directly, reducing I/O wait time and speeding up training.

Why this answer

Switching the input mode from File to Pipe is the most effective approach because it streams data directly from Amazon S3 to the training container, eliminating the need to download the entire dataset to the local storage before training begins. This reduces the I/O bottleneck and significantly cuts down the time spent on data loading, especially for datasets that fit in memory, as the model can start training almost immediately while data is being streamed.

Exam trap

AWS often tests the misconception that larger instances always reduce training time, but the trap here is that the dataset fits in memory, so the bottleneck is typically I/O, not compute, making data streaming optimizations like Pipe mode more effective than scaling up hardware.

How to eliminate wrong answers

Option B is wrong because SageMaker managed spot training reduces cost by using spare EC2 capacity, but it does not inherently reduce training time; in fact, it can increase total time due to potential interruptions and checkpoint restarts. Option C is wrong because Amazon EFS as an input data source typically introduces higher latency and slower throughput compared to S3, and it does not support the Pipe input mode, so it would likely increase training time. Option D is wrong because using a larger instance type with more vCPUs may improve compute parallelism but does not address the data loading bottleneck that is the primary cause of slow training; the dataset fits in memory, so the issue is likely I/O-bound, not compute-bound.

1395
MCQmedium

A data engineer is exploring a dataset with 1 million rows and 50 features. They notice that some features have missing values. The 'Age' column has 5% missingness, and 'Income' has 20% missingness. The target variable is 'LoanDefault' (binary). The engineer wants to impute missing values. Which of the following strategies is most appropriate?

A.Impute missing 'Age' with median and 'Income' with median.
B.Impute missing 'Age' with mode and 'Income' with mode.
C.Use a k-NN model to predict missing values.
D.Drop all rows with missing values.
AnswerA

Median is robust to outliers and suitable for skewed distributions.

Why this answer

Option A is correct because median imputation is robust to outliers and simple for numerical features. Option D is wrong because dropping rows with any missing values could remove up to roughly a quarter of the data. Option B is wrong because mode imputation is meant for categorical features.

Option C is wrong because model-based imputation is complex for initial EDA.
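
The robustness of the median can be seen in a short Python sketch. The 'Income' values below are invented, including one deliberate outlier, and `impute_median` is a hypothetical helper that stands in for whatever imputation tool is actually used.

```python
import statistics

def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    median = statistics.median(observed)
    return [median if v is None else v for v in values]

# Hypothetical 'Income' column with two missing entries and one extreme outlier.
income = [42_000, 38_000, None, 51_000, 45_000, None, 1_200_000, 47_000]

observed = [v for v in income if v is not None]
print("mean of observed:  ", round(statistics.mean(observed), 2))
print("median of observed:", statistics.median(observed))
print("imputed column:    ", impute_median(income))
```

The single outlier drags the mean far above every typical value, while the median stays in the middle of the bulk of the data, so the imputed entries remain plausible.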

1396
Multi-Selectmedium

A company is deploying a SageMaker model for real-time inference. The endpoint must be highly available and cost-effective. Which TWO actions should the company take? (Select TWO.)

Select 2 answers
A.Use managed spot training for inference
B.Deploy the endpoint with at least two instances in different Availability Zones
C.Use GPU instances for all models even if not required
D.Configure automatic scaling based on latency or request count
E.Use a single large instance to handle peak load
AnswersB, D

Multi-AZ deployment provides high availability.

Why this answer

Options B and D are correct. B: Multiple instances across AZs ensure HA. D: Auto-scaling adjusts capacity based on demand, improving cost.

E (single instance) lacks HA. A (managed spot training) lowers training cost but does not apply to real-time HA inference. C (GPU) is not necessarily cost-effective.

1397
MCQhard

Refer to the exhibit. An IAM policy is attached to a SageMaker notebook instance role. A data scientist is trying to train a model using the SageMaker built-in XGBoost algorithm with training data in 'my-bucket/training-data/' and expects output in 'my-bucket/output/'. The training job fails with an access denied error. What is the most likely missing permission?

A.iam:PassRole on the SageMaker execution role.
B.ecr:GetAuthorizationToken on the ECR repository.
C.s3:ListBucket on the S3 bucket.
D.sagemaker:DescribeTrainingJob on the training job.
AnswerA

The policy is missing iam:PassRole, which is required to allow SageMaker to assume the execution role for the training job.

Why this answer

Option A is correct. The policy allows creating the training job, s3:GetObject on training-data, and s3:PutObject on output, but it does not include iam:PassRole. When the training job is created, the caller must pass the SageMaker execution role to the service; without iam:PassRole on that role, the API call fails with an access denied error.

Option B (ecr:GetAuthorizationToken) is needed to authenticate to ECR, but SageMaker typically uses the execution role to pull the built-in algorithm image. Option C (s3:ListBucket) is needed only if the training job lists objects. Option D (sagemaker:DescribeTrainingJob) is not needed to run the job.

1398
MCQmedium

A data engineering team needs to build a data lake on Amazon S3 that will be queried by Amazon Athena and Amazon Redshift Spectrum. The data will be ingested from multiple sources in various formats (CSV, JSON, Parquet). Which partitioning strategy will provide the best query performance for date-range queries?

A.Partition by date with one partition per day in a flat structure.
B.Do not partition; let Athena scan the entire dataset.
C.Partition by year, month, and day in a hierarchical structure.
D.Partition by source system first, then by date.
AnswerC

Hierarchical date partitioning enables partition pruning for date-range queries.

Why this answer

Option C is correct because partitioning by year, month, and day as separate prefixes (e.g., s3://bucket/year=2024/month=01/day=15/) allows Athena and Redshift Spectrum to prune partitions efficiently. Option A is wrong because a single partition per day results in too many partitions for large datasets. Option B is wrong because no partitioning leads to full table scans.

Option D is wrong because partitioning by source first and then by date may be less optimal if queries often filter by date across sources.
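
The hierarchical layout can be sketched with a small Python helper that builds Hive-style prefixes in the same form as the example path in the explanation. The function names are hypothetical, and `s3://bucket` is the placeholder from the explanation rather than a real bucket.

```python
from datetime import date, timedelta

def partition_prefix(base, day):
    """Hive-style year/month/day prefix for one calendar day."""
    return f"{base}/year={day.year}/month={day.month:02d}/day={day.day:02d}/"

def prefixes_for_range(base, start, end):
    """Prefixes a date-range query would need to read (inclusive bounds)."""
    days = (end - start).days + 1
    return [partition_prefix(base, start + timedelta(days=i)) for i in range(days)]

# A three-day range touches exactly three prefixes; everything else can be skipped.
for prefix in prefixes_for_range("s3://bucket", date(2024, 1, 14), date(2024, 1, 16)):
    print(prefix)
```

A query that filters on year, month, and day only needs the listed prefixes, which is the partition pruning the explanation refers to.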

1399
MCQmedium

A company uses Amazon Kinesis Data Analytics for Apache Flink to process real-time clickstream data. The application uses event time and watermarks for windowed aggregations. The team notices that the output from tumbling windows is delayed, and many late records are being dropped. What is the MOST likely cause?

A.The checkpointing interval is too long, causing state to be lost
B.The parallelism is too low, causing backpressure
C.The source is marking itself as idle, causing watermarks to stall
D.The allowed lateness is set too low, causing late records to be discarded
AnswerD

Low allowed lateness means records arriving after the watermark are dropped.

Why this answer

Option D is correct because late records are dropped once the watermark has passed the window's end plus the allowed lateness; increasing the allowed lateness gives more time for late records to arrive. Option C is wrong because idle sources cause watermarks to stall, not drop late records. Option A is wrong because the checkpointing interval does not affect watermark progress.

Option B is wrong because parallelism affects throughput, not watermark behavior.
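
The interaction between watermarks and allowed lateness can be approximated with a toy model in Python. This is a deliberately simplified sketch and not Flink's API: the helper names, the record tuples, and the exact drop rule (watermark at or past window end plus allowed lateness) are assumptions made for illustration.

```python
def window_end(event_time, window_size):
    """End (exclusive) of the tumbling window that contains event_time."""
    return (event_time // window_size + 1) * window_size

def is_dropped(event_time, watermark, window_size, allowed_lateness):
    """Simplified rule: a record is discarded once the watermark has passed
    the end of its window plus the allowed lateness."""
    return watermark >= window_end(event_time, window_size) + allowed_lateness

# Hypothetical records as (event_time, watermark_on_arrival), in seconds.
records = [(5, 4), (58, 61), (59, 75), (30, 130)]
WINDOW = 60  # tumbling window size in seconds

for lateness in (0, 30):
    dropped = [r for r in records if is_dropped(r[0], r[1], WINDOW, lateness)]
    print(f"allowed_lateness={lateness}s -> dropped {len(dropped)}: {dropped}")
```

With no allowed lateness, every record that arrives after the watermark crosses the window end is lost; a 30-second allowance keeps the two records that were only slightly late and still discards the one that arrived far too late.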

1400
MCQhard

An e-commerce company uses Amazon Redshift for analytics. The data engineering team needs to load daily sales data from an S3 bucket that receives new files every hour. The data must be loaded into Redshift with minimal impact on query performance during the day, and they need to handle late-arriving data (files that appear after the daily load). Which approach should they use?

A.Use AWS Glue ETL to copy the data from S3 to Redshift, overwriting the existing data each day.
B.Use a staging table to load data incrementally with a MERGE operation, and schedule a late-arriving data job to merge files that arrive after the daily load.
C.Stream the data from S3 using Amazon Kinesis Firehose to load into Redshift continuously.
D.Use Amazon Redshift Spectrum to query data directly from S3 and create external tables.
AnswerB

Staging tables allow incremental upserts and handling of late data without blocking queries.

Why this answer

Option B is correct because a staging table supports incremental loads with upsert (MERGE) logic, and a late-arriving data job can merge the additional records later without blocking queries. Option D (Redshift Spectrum) queries data in S3 without loading it, which may be slower for frequent queries. Option C (Kinesis Firehose) is real-time streaming, suitable for near-real-time delivery but not for batched daily loads with late data handling.

Option A (Glue ETL with overwrite) overwrites data, losing late-arriving data.

1401
MCQmedium

An ML engineer runs the AWS CLI command above to list files in a training data bucket. The engineer notices that the three CSV files have different sizes but the same number of columns. What is the MOST likely cause of the size variation?

A.The files are compressed with different algorithms.
B.Some files have duplicate headers.
C.The files contain a different number of rows.
D.The files have different column data types.
AnswerC

Row count directly affects file size.

Why this answer

Option C is correct because the number of rows can vary between files, leading to different file sizes. Option D is wrong because different column data types would point to inconsistent schemas, whereas the files share the same number of columns. Option A is wrong because compression would be applied uniformly.

Option B is wrong because S3 does not add headers multiple times.

1402
Multi-Selecteasy

Which TWO AWS services are suitable for real-time stream processing?

Select 2 answers
A.Amazon Athena
B.AWS Glue
C.Amazon EMR
D.Amazon Kinesis Data Analytics
E.AWS Lambda
AnswersD, E

Kinesis Data Analytics processes streaming data in real-time.

Why this answer

Amazon Kinesis Data Analytics and AWS Lambda can process streams in real-time. AWS Glue is batch-oriented, Amazon EMR can process streams but is more batch, and Amazon Athena is for ad-hoc SQL queries on S3.

1403
MCQmedium

A data scientist is training a gradient boosting model using SageMaker's built-in XGBoost algorithm. The model is overfitting on the training data. Which hyperparameter adjustment is most likely to reduce overfitting?

A.Increase learning rate (eta)
B.Increase max_depth
C.Increase num_round
D.Increase lambda (L2 regularization)
AnswerD

Higher lambda penalizes large weights, reducing overfitting.

Why this answer

Increasing the L2 regularization term (lambda) penalizes large weights, reducing overfitting. Option B is wrong because increasing max_depth increases model complexity. Option C is wrong because increasing num_round can increase overfitting.

Option A is wrong because increasing the learning rate may cause overfitting if not paired with regularization.
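
The shrinking effect of lambda can be shown with the closed-form optimal leaf weight from the XGBoost objective, w* = -G / (H + lambda), where G and H are the sums of first and second derivatives of the loss over the samples in a leaf (with the L1 term alpha set to 0). The gradient statistics below are invented for illustration.

```python
def leaf_weight(grad_sum, hess_sum, reg_lambda):
    """Optimal leaf weight with alpha = 0: w* = -G / (H + lambda)."""
    return -grad_sum / (hess_sum + reg_lambda)

# Hypothetical gradient statistics for a leaf that holds only a few samples.
G, H = -6.0, 2.0

for reg_lambda in (0.0, 1.0, 10.0):
    print(f"lambda={reg_lambda}: leaf weight={leaf_weight(G, H, reg_lambda):.3f}")
```

Larger values of lambda pull the leaf weight toward zero, so sparsely populated leaves contribute less to the prediction and the model fits noise less aggressively.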

1404
MCQmedium

A machine learning engineer needs to deploy a model that performs real-time inference with strict latency requirements of under 100 milliseconds. The model is a large ensemble of 10 deep learning models. Which SageMaker deployment strategy is MOST appropriate?

A.Use batch transform and cache predictions.
B.Deploy each model as a separate endpoint and route traffic using Application Load Balancer.
C.Use a SageMaker Inference Pipeline with serial inference within a single endpoint.
D.Use a multi-model endpoint to host all models.
AnswerC

Inference Pipelines allow chaining containers in a single endpoint, reducing latency.

Why this answer

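Large ensemble models can be deployed using SageMaker Inference Pipelines, which chain multiple containers behind a single real-time endpoint so that requests pass between models without extra network hops. Option D is wrong because multi-model endpoints are for hosting many independent models, not for combining models into an ensemble.

Option A is wrong because batch transform is for offline inference. Option B is wrong because separate endpoints add a network round trip per model, which works against a 100-millisecond budget. Multi-variant endpoints, which are not among the options, are for A/B testing.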

1405
MCQmedium

A data engineer needs to transfer 50 TB of historical data from an on-premises Hadoop cluster to Amazon S3. The company has a 1 Gbps internet connection and wants to complete the transfer within 5 days. What is the MOST cost-effective and reliable solution?

A.Use AWS Snowball Edge device to physically ship the data
B.Use S3 multipart upload over the internet
C.Set up AWS Direct Connect and transfer over the dedicated line
D.Use S3 Transfer Acceleration to speed up the transfer
AnswerA

Snowball can transfer 50 TB in a few days, cost-effective for large data.

Why this answer

AWS Snowball Edge is a physical device that can transfer large amounts of data faster than the internet connection can. Option B is wrong because the transfer would take about 5 days at full bandwidth over the 1 Gbps connection, leaving no margin, and the connection is unreliable and may incur high costs. Option C is wrong because Direct Connect requires setup time and ongoing costs.

Option D is wrong because S3 Transfer Acceleration may help but still relies on the internet connection.
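
The "about 5 days at full bandwidth" estimate can be checked with a short calculation. This is a minimal sketch that assumes decimal units (1 TB = 10^12 bytes); the 70% utilization figure is a hypothetical illustration, not a value from the question.

```python
def transfer_days(data_tb: float, link_gbps: float, utilization: float = 1.0) -> float:
    """Days needed to move data_tb terabytes over a link_gbps link."""
    bits = data_tb * 1e12 * 8                      # decimal terabytes -> bits
    seconds = bits / (link_gbps * 1e9 * utilization)
    return seconds / 86_400                        # seconds per day


full = transfer_days(50, 1.0)              # ideal case: link fully saturated
realistic = transfer_days(50, 1.0, 0.7)    # assumed 70% sustained utilization
print(f"{full:.1f} days at 100%, {realistic:.1f} days at 70%")
```

Even the ideal case leaves almost no margin inside a 5-day window, and any drop in sustained throughput pushes the transfer past the deadline.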

1406
MCQhard

A company uses Amazon Kinesis Data Streams to ingest clickstream data. The stream has 8 shards. A Lambda function processes each record and writes to Amazon DynamoDB. The Lambda function sometimes fails due to DynamoDB write throttling, causing duplicate processing of records after retries. The data engineering team needs to ensure exactly-once processing semantics for the DynamoDB writes. What should the team do?

A.Use an Amazon SQS FIFO queue between Kinesis and Lambda to deduplicate records.
B.Configure the Lambda event source mapping with a maximum retry count of 0 and a DLQ.
C.Increase the DynamoDB write capacity units to avoid throttling.
D.Use DynamoDB conditional writes with the Kinesis sequence number as a unique attribute to make writes idempotent.
AnswerD

Conditional writes based on the sequence number ensure each record is written only once.

Why this answer

Option D is correct because DynamoDB supports conditional writes. Using the Kinesis sequence number as a condition check ensures idempotency: if a record with that sequence number already exists, the write is skipped. Option C is wrong because increasing write capacity may reduce throttling but does not guarantee exactly-once processing; duplicates can still occur.

Option B is wrong because disabling retries and routing failures to a DLQ does not make the writes idempotent; the failed records still have to be reprocessed later. Option A is wrong because FIFO queues support exactly-once delivery but the Lambda function would still need to deduplicate within the batch.
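
The idempotent write can be sketched as a PutItem request guarded by a condition expression. This is an illustrative sketch only: the table name, attribute names, and sequence number are hypothetical, and it assumes `sequence_number` is the table's partition key. Sequence numbers are assigned per shard, so a real table might key on shard ID plus sequence number.

```python
def build_idempotent_put(table_name: str, record: dict) -> dict:
    """Build PutItem arguments that succeed only once per sequence number."""
    return {
        "TableName": table_name,
        "Item": {
            "sequence_number": {"S": record["sequence_number"]},
            "payload": {"S": record["payload"]},
        },
        # Reject the write if an item with this key already exists.
        "ConditionExpression": "attribute_not_exists(sequence_number)",
    }


request = build_idempotent_put(
    "clickstream_events",  # hypothetical table name
    {"sequence_number": "49590338271490256608559692538361571095921575989136588898",
     "payload": "{}"},
)
```

With boto3, the dictionary would be passed as `boto3.client("dynamodb").put_item(**request)`. A retried record then fails with `ConditionalCheckFailedException`, which the handler can treat as "already written" instead of an error.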

1407
MCQeasy

A machine learning engineer needs to deploy a model that performs real-time fraud detection. The model must be highly available and scalable. Which AWS service should be used to host the model?

A.AWS Lambda
B.Amazon ECS with a custom container
C.Amazon SageMaker batch transform
D.Amazon SageMaker real-time endpoint
AnswerD

Purpose-built for real-time inference with auto-scaling.

Why this answer

Option D is correct because Amazon SageMaker real-time endpoints are designed for low-latency, scalable, and highly available model hosting. Option A is wrong because AWS Lambda has limited execution time and is not suitable for heavy inference. Option B is wrong because Amazon ECS can host containers but requires more management; SageMaker is purpose-built.

Option C is wrong because SageMaker batch transform is for offline predictions.

1408
MCQmedium

An ML engineer is deploying a model to a SageMaker endpoint for real-time inference. The model requires a custom inference script that preprocesses input data and postprocesses predictions. Which SageMaker feature should be used to implement this custom logic?

A.Use SageMaker Ground Truth to transform inference requests
B.Use SageMaker Processing jobs to preprocess data before inference
C.Use a built-in SageMaker algorithm with the default inference code
D.Create a SageMaker model with a custom inference script that includes pre- and post-processing functions
AnswerD

Custom inference scripts allow full control over request handling.

Why this answer

Option D is correct because a custom inference script is packaged with the inference code and used by SageMaker to handle requests. Option C (a built-in algorithm with the default inference code) does not allow custom logic. Option B (SageMaker Processing) is for batch jobs.

Option A (SageMaker Ground Truth) is for labeling.
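
A custom inference script for a SageMaker framework container typically defines handler functions named `input_fn`, `predict_fn`, and `output_fn`. The sketch below shows where pre- and post-processing would go; the scaling step, the 0.5 threshold, and the stand-in model are hypothetical, and a real script would also define `model_fn(model_dir)` to load the trained artifact.

```python
import json


def input_fn(request_body, request_content_type):
    """Preprocess: parse the JSON request and scale the raw features."""
    if request_content_type != "application/json":
        raise ValueError(f"Unsupported content type: {request_content_type}")
    features = json.loads(request_body)["features"]
    return [float(x) / 100.0 for x in features]  # hypothetical scaling step


def predict_fn(input_data, model):
    """Run the loaded model on the preprocessed features."""
    return model(input_data)


def output_fn(prediction, accept):
    """Postprocess: turn the raw score into a labelled JSON response."""
    label = "positive" if prediction >= 0.5 else "negative"  # hypothetical threshold
    return json.dumps({"score": prediction, "label": label})


# Local dry run with a stand-in model that averages its inputs.
def dummy_model(xs):
    return sum(xs) / len(xs)


body = json.dumps({"features": [80, 60]})
response = output_fn(
    predict_fn(input_fn(body, "application/json"), dummy_model),
    "application/json",
)
```

Keeping the three stages in separate functions lets each one be tested locally before the script is deployed behind an endpoint.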

1409
MCQeasy

A machine learning engineer is analyzing feature distributions in a dataset and notices that one feature has a long tail. Which transformation is most appropriate to reduce skewness and make the distribution more normal?

A.Apply one-hot encoding
B.Apply a log transformation
C.Apply min-max normalization
D.Apply standardization (Z-score)
AnswerB

Log transformation compresses the long tail and reduces skewness.

Why this answer

Option B is correct because log transformation is commonly used to reduce right skewness. Option C is wrong because min-max scaling does not change distribution shape. Option A is wrong because one-hot encoding is for categorical variables.

Option D is wrong because standardization does not reduce skewness.
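
The difference between rescaling and reshaping a distribution can be seen on synthetic data. This is a small sketch using a log-normal sample as a stand-in for a long-tailed feature; the sample size and seed are arbitrary choices.

```python
import math
import random
import statistics


def skewness(values):
    """Population skewness: third central moment over the cubed standard deviation."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


random.seed(0)
raw = [random.lognormvariate(0.0, 1.0) for _ in range(10_000)]  # long right tail

mean, std = statistics.fmean(raw), statistics.pstdev(raw)
standardized = [(v - mean) / std for v in raw]  # Z-score: shifts and rescales only
logged = [math.log(v) for v in raw]             # log transform: compresses the tail

skew_raw = skewness(raw)
skew_standardized = skewness(standardized)
skew_logged = skewness(logged)
print(f"raw={skew_raw:.2f} standardized={skew_standardized:.2f} log={skew_logged:.2f}")
```

Standardization leaves the skewness unchanged because it is a linear transformation, while the log transform brings it close to zero for this sample.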

1410
MCQhard

A machine learning engineer is building a binary classification model to predict customer churn. The dataset is highly imbalanced (5% churn). The engineer wants to use Amazon SageMaker's built-in XGBoost algorithm. Which combination of hyperparameters is most appropriate for this scenario?

A.scale_pos_weight=19, subsample=0.8
B.scale_pos_weight=0.05, subsample=0.8
C.scale_pos_weight=19, subsample=1.0
D.scale_pos_weight=1, subsample=1.0
AnswerA

Correct ratio and subsample for regularization.

Why this answer

In a highly imbalanced dataset with only 5% churn, the ratio of negative to positive classes is 95:5, or 19:1. The `scale_pos_weight` hyperparameter in XGBoost should be set to this ratio (19) to penalize misclassifications of the minority class more heavily. A `subsample` of 0.8 introduces stochasticity and helps prevent overfitting, which is especially important when the minority class is small.

Exam trap

The trap here is that candidates often confuse `scale_pos_weight` with a simple class weight or mistakenly think a value less than 1 is needed for the minority class, when in fact it should be the ratio of majority to minority class counts.

How to eliminate wrong answers

Option B is wrong because `scale_pos_weight=0.05` would actually down-weight the minority class, making the model ignore churn cases entirely. Option C is wrong because `subsample=1.0` uses the full dataset for every tree, which increases the risk of overfitting on the minority class without any regularization from row sampling. Option D is wrong because `scale_pos_weight=1` treats both classes equally, failing to address the 19:1 class imbalance, and `subsample=1.0` again provides no overfitting protection.
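
The 19:1 ratio can be derived directly from the label counts. The sketch below uses a hypothetical 1,000-row label column with 5% positives; `objective` is included only to make the hyperparameter dictionary look complete and is an assumption, not part of the question.

```python
def scale_pos_weight(n_negative: int, n_positive: int) -> float:
    """Ratio of negative to positive examples."""
    return n_negative / n_positive


labels = [1] * 50 + [0] * 950        # hypothetical dataset with 5% churn
n_pos = sum(labels)
n_neg = len(labels) - n_pos

hyperparameters = {
    "objective": "binary:logistic",  # assumed objective for binary churn prediction
    "scale_pos_weight": scale_pos_weight(n_neg, n_pos),
    "subsample": 0.8,                # row sampling per tree
}
print(hyperparameters["scale_pos_weight"])
```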

1411
MCQmedium

A company uses SageMaker to deploy a real-time inference endpoint for a fraud detection model. The model is an XGBoost model trained on 50 features. The endpoint receives 100 requests per second, but latency is higher than the required 200 ms. The team wants to reduce latency without retraining. What should they do?

A.Increase the number of instances behind the endpoint
B.Use SageMaker's batch transform instead of real-time endpoint
C.Reduce the number of features by selecting the most important ones
D.Use SageMaker's Elastic Inference to attach an acceleration to the endpoint
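AnswerC

Fewer features reduce inference time.

Why this answer

Reducing the number of features directly lowers the per-request computation and therefore latency. Option D is wrong because Elastic Inference accelerates deep learning frameworks and does not apply to XGBoost. Option B is wrong because batch transform is for offline inference, not real-time requests.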

1412
MCQeasy

A company is using Amazon SageMaker to train a model. The training data is stored in an S3 bucket in a different AWS account. Which IAM policy configuration is required to allow SageMaker to access the data?

A.Add a bucket policy that allows s3:GetObject for the SageMaker execution role's ARN.
B.Add a bucket policy allowing access from the SageMaker execution role ARN, and ensure the SageMaker execution role has an IAM policy allowing s3:GetObject on the bucket.
C.Create an IAM user in the data owner's account and use its credentials in SageMaker.
D.Use the data owner's IAM role as the SageMaker execution role.
AnswerB

Both policies are needed for cross-account access.

Why this answer

Option B is correct because cross-account access requires the SageMaker execution role to have an IAM policy allowing access to the S3 bucket, and the S3 bucket policy must grant access to that role. Option A is wrong because a bucket policy alone is not enough; the execution role also needs its own IAM policy allowing s3:GetObject. Option D is wrong because the data owner's role cannot be used directly; SageMaker cannot assume a role in another account without a proper trust policy.

Option C is wrong because SageMaker does not use the data owner's IAM user credentials.
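
The two policies involved in cross-account access can be sketched as follows. The account ID, role name, and bucket name are hypothetical placeholders, and `s3:ListBucket` is added on the assumption that the job needs to list objects under a prefix.

```python
import json

ROLE_ARN = "arn:aws:iam::111111111111:role/SageMakerExecutionRole"  # hypothetical
BUCKET = "example-training-data"                                    # hypothetical

ACTIONS = ["s3:GetObject", "s3:ListBucket"]
RESOURCES = [f"arn:aws:s3:::{BUCKET}", f"arn:aws:s3:::{BUCKET}/*"]

# Resource-based policy, attached to the bucket in the data owner's account.
bucket_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"AWS": ROLE_ARN},
        "Action": ACTIONS,
        "Resource": RESOURCES,
    }],
}

# Identity-based policy, attached to the execution role in the SageMaker account.
role_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ACTIONS,
        "Resource": RESOURCES,
    }],
}

print(json.dumps(bucket_policy, indent=2))
```

The bucket policy names the role as its principal, while the role's own policy has no principal because it is attached to the identity itself.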

1413
MCQhard

A machine learning engineer is deploying a PyTorch model to SageMaker. The model requires custom inference logic. Which approach should the engineer use?

A.Use a SageMaker built-in PyTorch container as-is
B.Use SageMaker Ground Truth to deploy the model
C.Use SageMaker Processing to run inference
D.Create a custom inference script and use the SageMaker PyTorch container
AnswerD

SageMaker PyTorch container supports custom entry points.

Why this answer

Option D is correct because SageMaker allows you to provide a custom inference script (entry point) for PyTorch. Option A is wrong because using the built-in PyTorch container as-is runs only its default inference handlers, with no custom logic. Option C is wrong because SageMaker Processing is for data processing, not inference.

Option B is wrong because SageMaker Ground Truth is for labeling.

1414
MCQmedium

Refer to the exhibit. A data scientist runs the above CLI command to create a SageMaker training job. The job fails with an error 'Unable to read data from s3://bucket/train/'. What is the MOST likely cause?

A.The training image is not accessible
B.The instance type does not support the required memory
C.The IAM role does not have permissions to read from the S3 bucket
D.The training job is in a different region than the S3 bucket
AnswerC

The role must have s3:GetObject permission for the training data.

Why this answer

The error 'Unable to read data from s3://bucket/train/' indicates that the SageMaker training job cannot access the S3 input data. The most common cause is that the IAM role specified in the command does not have the necessary s3:GetObject permission on the S3 bucket or objects. SageMaker uses the IAM role to assume permissions for reading training data, and without proper S3 read access, the job fails at the data loading stage.

Exam trap

The trap here is that candidates may confuse the error message with a network or region issue, but the 'Unable to read data' error is almost always an IAM permissions problem, not a connectivity or resource constraint issue.

How to eliminate wrong answers

Option A is wrong because if the training image were not accessible, the error would typically be 'Unable to pull image' or 'Image not found', not a data read error from S3. Option B is wrong because insufficient memory would cause an out-of-memory or resource-exhausted error, not a failure to read data from S3. Option D is wrong because nothing in the scenario points to a region mismatch, and missing S3 read permissions on the role are the far more common cause of this error.

1415
Multi-Selectmedium

A company uses Amazon Kinesis Data Streams to ingest clickstream data. They need to archive raw data to S3 every hour and also enable real-time processing with sub-second latency. Which TWO actions should they take? (Choose two.)

Select 2 answers
A.Use Kinesis Data Analytics to write output to S3.
B.Configure a Lambda function as a consumer of the stream for real-time processing.
C.Use S3 events to trigger a Lambda function that reads from the stream.
D.Create a Kinesis Data Firehose delivery stream with S3 as destination and set a buffer interval of 3600 seconds.
E.Install the Kinesis Agent on an EC2 instance to write data to S3.
AnswersB, D

Lambda can process records with low latency.

Why this answer

Options B and D are correct. Option B provides real-time processing via Lambda. Option D archives data to S3 using Firehose with hourly buffering.

Option E is wrong because the Kinesis Agent is for shipping log files, not archiving stream data. Option A is wrong because Data Analytics does not write to S3 directly. Option C is wrong because S3 events are not used for archiving.

1416
MCQhard

A data scientist is using SageMaker to train a model with a custom algorithm. The training script uses TensorFlow and runs on GPU instances. The training job fails with 'CUDA_ERROR_OUT_OF_MEMORY'. What is the most likely cause?

A.The S3 bucket is in a different region
B.The batch size is too large for the GPU memory
C.The GPU driver is outdated
D.The training script has a memory leak on CPU
E.The instance type does not have enough CPU cores
AnswerB

Large batch sizes can exceed GPU memory, causing out-of-memory errors.

Why this answer

Option B is correct because the error indicates GPU memory exhaustion, often due to the batch size being too large. Option E (insufficient CPU cores) would not cause CUDA errors. Option A (S3 bucket region) is unrelated.

Option D (CPU memory leak) is not the issue. Option C (outdated GPU driver) would give a different error.

1417
MCQhard

A data scientist is working with a dataset containing geospatial coordinates (latitude and longitude) of customer locations. The scientist wants to engineer features such as distance to the nearest store, and cluster customers into regions. Which AWS service is best suited for performing geospatial analysis and clustering during exploratory data analysis?

A.Amazon SageMaker with custom Python scripts using scikit-learn and Geopy
B.Amazon Athena with PostGIS extensions
C.AWS Glue with geospatial transforms
D.Amazon Location Service
AnswerA

SageMaker allows custom code for distance calculations and clustering using libraries like scikit-learn.

Why this answer

Option A is correct because Amazon SageMaker provides built-in algorithms like K-Means for clustering, and the scientist can use custom code with libraries like Geopy to compute distances. Option B is wrong because Amazon Athena with PostGIS is for querying geospatial data, not for clustering. Option D is wrong because Amazon Location Service is for maps and location tracking, not for analytical clustering.

Option C is wrong because AWS Glue is for ETL, not for analysis and clustering.
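
The "distance to the nearest store" feature can be sketched in plain Python. This example swaps in a hand-written haversine formula in place of Geopy so that it runs without extra dependencies; the coordinates are hypothetical, and the mean Earth radius of 6371 km is an approximation.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (approximation)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def distance_to_nearest_store(customer, stores):
    """Smallest haversine distance from one customer to any store."""
    return min(haversine_km(*customer, *store) for store in stores)


stores = [(0.0, 0.0), (10.0, 10.0)]  # hypothetical store locations
customer = (0.0, 1.0)                # hypothetical customer location
nearest = distance_to_nearest_store(customer, stores)
print(f"{nearest:.1f} km")
```

The resulting column, together with the raw coordinates, could then be passed to a clustering algorithm such as K-Means to group customers into regions.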

1418
MCQeasy

A data scientist is training a TensorFlow model on a single GPU instance. The training is taking too long. Which AWS service should be used to reduce training time by distributing the workload across multiple GPUs?

A.Amazon SageMaker
B.AWS Glue
C.Amazon EMR
D.AWS Batch
AnswerA

SageMaker provides built-in distributed training libraries for multi-GPU training.

Why this answer

Option A is correct: Amazon SageMaker supports distributed training across multiple GPUs using the SageMaker distributed training libraries. Option D is wrong because AWS Batch is for batch computing, not specifically optimized for GPU training.

Option C is wrong because Amazon EMR is for big data processing. Option B is wrong because AWS Glue is for ETL jobs.

1419
MCQhard

A company uses AWS Glue ETL jobs to transform CSV data from an S3 bucket into Parquet. The jobs often fail with memory errors when processing large datasets. They want to minimize cost and improve reliability. What should they do?

A.Use G.1X or G.2X worker types and increase the number of DPUs per worker.
B.Use Amazon Athena with CTAS queries to convert the data to Parquet.
C.Switch to S3 Batch Operations with AWS Lambda to process the files individually.
D.Increase the number of workers in the Glue job configuration.
AnswerA

G.1X workers provide more memory and vCPU per worker, reducing OOM errors for memory-intensive transformations.

Why this answer

Option A is correct because using G.1X or G.2X workers with more DPUs per worker provides more memory and compute per task, reducing out-of-memory errors. Option D is wrong because Spark's shuffle behavior can still cause memory issues even with more workers. Option C is wrong because S3 Batch Operations are not suitable for complex transformations.

Option B is wrong because Athena, which is built on Presto, is interactive and not designed for scheduled ETL.

1420
MCQhard

A data scientist is using Amazon SageMaker to train a custom TensorFlow model. The training job is failing with the error: 'OutOfRangeError: End of sequence'. The input data is stored in TFRecord format in S3. What is the most likely cause?

A.The TFRecord files are corrupted.
B.The number of training steps or epochs specified exceeds the dataset size.
C.The instance type does not have enough memory.
D.The shuffle buffer size is too large.
AnswerB

The training loop continues beyond available data, causing the error.

Why this answer

The 'OutOfRangeError: End of sequence' error in TensorFlow occurs when the training loop attempts to read more data than is available in the dataset. This typically happens when the number of training steps or epochs specified exceeds the total number of records in the TFRecord files, causing the iterator to reach the end of the dataset prematurely.

Exam trap

The trap here is that candidates often confuse 'OutOfRangeError' with data corruption or memory issues, but the error specifically indicates the dataset has been fully iterated, not that the data is damaged or resources are insufficient.

How to eliminate wrong answers

Option A is wrong because corrupted TFRecord files would typically cause parsing errors (e.g., 'DataLossError' or 'InvalidArgumentError'), not an 'End of sequence' error which indicates the iterator has exhausted valid data. Option C is wrong because insufficient memory would manifest as an 'OutOfMemoryError' or a resource exhaustion error, not a dataset iteration boundary error. Option D is wrong because a large shuffle buffer size may increase memory usage but does not cause an 'End of sequence' error; it only affects the randomness of data ordering within the available dataset.
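
Whether a requested step count overruns the dataset can be checked with simple arithmetic before the job is launched. The record count, batch size, and step count below are hypothetical, and the sketch assumes the dataset is read once per epoch without being repeated.

```python
import math


def available_steps(num_records: int, batch_size: int, epochs: int = 1,
                    drop_remainder: bool = False) -> int:
    """Number of batches a non-repeating dataset can yield over the given epochs."""
    if drop_remainder:
        per_epoch = num_records // batch_size
    else:
        per_epoch = math.ceil(num_records / batch_size)
    return per_epoch * epochs


available = available_steps(10_000, 32, epochs=5)  # hypothetical dataset and batch size
requested = 2_000                                  # hypothetical training step count
runs_out_of_data = requested > available
print(available, runs_out_of_data)
```

When the requested steps exceed the available batches, common remedies are lowering the step count or repeating the dataset (for example with `tf.data.Dataset.repeat()`).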

1421
MCQhard

A company wants to automate the retraining of a model weekly using new data. The training script is in a SageMaker notebook. Which implementation is most maintainable?

A.Set up a cron job on an EC2 instance to run the training script
B.Schedule the notebook to run via a SageMaker Lifecycle Configuration script
C.Convert the notebook to a Python script, create a Docker container, and use SageMaker Pipelines with a schedule
D.Use AWS CloudFormation to provision a training job on a schedule
AnswerC

Pipelines provide a robust, scheduled workflow for training.

Why this answer

Option C is correct: convert the notebook to a Python script, package it in a Docker container, and schedule SageMaker Pipeline runs. Option B (running the notebook via a Lifecycle Configuration) is brittle. Option D (CloudFormation) is infrastructure as code, meant for provisioning infrastructure rather than running training.

Option A (cron job on EC2) is less managed.

1422
Multi-Selecthard

A data scientist is using Amazon SageMaker to train a deep learning model. The training job is taking too long. Which THREE actions can reduce training time?

Select 3 answers
A.Use incremental training to continue from a previous model
B.Use Spot Instances to reduce cost
C.Use Pipe input mode to stream data directly from Amazon S3
D.Decrease the batch size to reduce memory usage
E.Use a GPU instance type for faster computation
AnswersA, C, E

Incremental training starts from an existing model, requiring fewer epochs.

Why this answer

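Incremental training allows you to start from a previously trained model, which reduces training time because the model does not need to learn from scratch. SageMaker's incremental training loads the existing model artifacts and continues training on new data, significantly cutting down the time required to converge compared to full retraining. Pipe input mode streams data from Amazon S3 as the job runs instead of downloading the full dataset first, which shortens startup time, and a GPU instance type speeds up the computation in deep learning workloads.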

Exam trap

The trap here is that candidates often confuse cost-saving techniques (like Spot Instances) with performance-improving techniques, or they mistakenly think decreasing batch size always speeds up training, when in fact it can slow it down due to increased overhead.

1423
MCQmedium

A company is building a data pipeline to process streaming data from IoT devices. The data is ingested via Amazon Kinesis Data Streams. Each record is about 1 KB. The company wants to use AWS Lambda for real-time transformations and then store the results in Amazon DynamoDB. The expected throughput is 10,000 records per second. The Lambda function currently runs in about 200 ms. The company is concerned about Lambda concurrency limits and wants to ensure there are no throttling errors. The default concurrency limit for Lambda is 1,000. Which approach should the team take to handle the expected throughput without throttling?

A.Increase the Lambda function memory to 3,000 MB to reduce the execution time below 100 ms.
B.Use Amazon Kinesis Data Firehose instead of Lambda to load data directly into DynamoDB.
C.Reduce the Lambda batch size to 10 so that each invocation processes fewer records, reducing the time per invocation.
D.Increase the number of shards in the Kinesis Data Stream to 10 and set the Lambda batch size to 100.
AnswerD

With 10 shards and a batch size of 100, there are at most 10 concurrent Lambda invocations, well within limits.

Why this answer

Option D is correct because increasing the shard count to 10 ensures that each shard can trigger a Lambda invocation concurrently, and with a batch size of 100, the number of concurrent Lambda executions is at most 10 (10 shards * 1 batch per shard). This stays well within the concurrency limit. Option C is incorrect because reducing the batch size increases the number of invocations per second (10,000 / 10 = 1,000 invocations per second) and adds per-invocation overhead without addressing throttling.

Option B is incorrect because Kinesis Data Firehose does not support DynamoDB as a delivery destination. Option A is incorrect because increasing memory does not affect concurrency limits.
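
The shard count and the resulting concurrency bound can be worked out as below. This is a sketch under stated assumptions: per-shard ingest limits of 1,000 records per second and 1 MB per second, and the default of one concurrent batch per shard (a parallelization factor of 1). It checks only the concurrency ceiling, not whether processing keeps up with the stream.

```python
import math


def shards_needed(records_per_sec: int, record_kb: float,
                  shard_records_per_sec: int = 1000,
                  shard_mb_per_sec: float = 1.0) -> int:
    """Minimum shards to ingest the stream, by record count and by volume."""
    by_count = math.ceil(records_per_sec / shard_records_per_sec)
    by_volume = math.ceil(records_per_sec * record_kb / 1024 / shard_mb_per_sec)
    return max(by_count, by_volume)


def max_lambda_concurrency(shards: int, parallelization_factor: int = 1) -> int:
    """Upper bound on concurrent invocations for a Kinesis event source."""
    return shards * parallelization_factor


shards = shards_needed(10_000, 1)             # 10,000 records/s at about 1 KB each
concurrency = max_lambda_concurrency(shards)
print(shards, concurrency)
```

Because the bound depends on the shard count rather than the batch size, it stays far below the default account limit of 1,000 concurrent executions.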

1424
Multi-Selectmedium

A data scientist is building a deep learning model using Amazon SageMaker. The model is overfitting the training data. Which THREE actions can help reduce overfitting?

Select 3 answers
A.Add L2 regularization to the loss function.
B.Use data augmentation to increase the training dataset size.
C.Increase the number of layers in the network.
D.Reduce the learning rate.
E.Use dropout layers in the network.
AnswersA, B, E

L2 regularization penalizes large weights, reducing overfitting.

Why this answer

Overfitting can be reduced by regularization (L2), dropout, data augmentation (which increases the effective data size), early stopping, and reducing model complexity. Increasing model complexity with more layers (Option C) would increase overfitting, and reducing the learning rate (Option D) is not a regularization technique. The correct answers are therefore A, B, and E.

1425
MCQeasy

A data scientist is visualizing the distribution of a numerical feature that is heavily right-skewed. Which visualization technique is most appropriate?

A.Histogram with linear scale
B.Scatter plot
C.Box plot with log scale
D.Q-Q plot
AnswerC

Box plot with log scale handles skewness and shows outliers.

Why this answer

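A box plot with log scale is effective for skewed data because it shows outliers and the distribution shape after transformation. A histogram (or KDE plot) with log scale would also work, but Option A uses a linear scale, which squeezes most values into a few bins.

Option D is wrong because a Q-Q plot checks normality. Option B is wrong because a scatter plot is for two variables.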
