AWS Certified Machine Learning Engineer Associate MLA-C01 Questions 226–300 | Page 4/7

226

MCQeasy

Which of the following is a recommended practice for preparing training data in Amazon SageMaker?

A.Store training data in Amazon DynamoDB

B.Compress data using gzip to reduce transfer time

C.Convert data to RecordIO format for built-in algorithms

D.Use Amazon S3 with public read access

AnswerC

RecordIO is a binary format that SageMaker built-in algorithms use for efficient data loading.

Why this answer

RecordIO format is recommended for built-in algorithms as it improves I/O performance. Storing data in Amazon DynamoDB is not suitable for training datasets. Public read access on S3 is insecure.

While gzip compression is common, RecordIO is a specific best practice for SageMaker.


227

MCQmedium

A company is using SageMaker endpoints for inference. To reduce costs, they want to use Automatic Scaling. However, they observe that scaling up takes several minutes, causing latency spikes during traffic bursts. What should they do to mitigate this?

A.Optimize the model to reduce inference time.

B.Use larger instance types to handle more requests per instance.

C.Configure the endpoint with a target tracking scaling policy and pre-warm additional instances during expected traffic surges.

D.Set the endpoint to scale down slowly to maintain capacity.

AnswerC

Pre-warming ensures instances are ready, minimizing cold start latency.

Why this answer

Option C is correct because pre-warming the endpoint ahead of expected traffic surges reduces cold start time. Option A (optimizing the model) reduces computational load but not the scaling delay. Option B (larger instances) might help but is not cost-effective.

Option D (scale down slowly) addresses scale-down, not up.


228

MCQmedium

A company runs an online retail business and wants to build a product recommendation system. They have a dataset of customer purchases stored in Amazon S3 as CSV files. The dataset includes columns: 'customer_id', 'product_id', 'purchase_date', 'quantity', 'price', and 'category'. The data science team plans to use Amazon SageMaker to train a factorization machines model. During data exploration, they discover that the 'category' column has 1,200 unique values, and many categories appear only a few times. The 'product_id' column has 50,000 unique values. They want to include both features in the model. The team is concerned about the high cardinality of these features. Which approach should they take to prepare these features for the factorization machines model?

A.Apply one-hot encoding to both 'product_id' and 'category' columns.

B.Drop the 'category' column and only use 'product_id' since it has more granularity.

C.Encode both columns as integer indices and feed them directly to the factorization machines algorithm as categorical features.

D.Apply principal component analysis (PCA) to reduce the dimensionality of the categorical features.

AnswerC

Factorization machines natively handle sparse categorical data via feature interactions and do not require one-hot expansion.

Why this answer

Option C is correct because Amazon SageMaker's factorization machines algorithm natively supports categorical features encoded as integer indices (0-based). This avoids the explosion of features from one-hot encoding (which would create 51,200 columns) and leverages the algorithm's ability to learn interactions between high-cardinality features via factorized parameters, making it both memory-efficient and effective for sparse data.
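As a rough illustration of the 0-based integer index encoding described above, the following plain-Python sketch maps each distinct value to an index and compares that with the width one-hot encoding would need. The helper `build_index` and the sample rows are hypothetical, and the sketch does not show how the algorithm itself ingests the data.

```python
def build_index(values):
    """Map each distinct value to a 0-based integer index, in first-seen order."""
    index = {}
    for value in values:
        if value not in index:
            index[value] = len(index)
    return index


# Hypothetical sample rows: (product_id, category)
purchases = [("P100", "shoes"), ("P200", "hats"), ("P100", "shoes"), ("P300", "hats")]

product_index = build_index(p for p, _ in purchases)
category_index = build_index(c for _, c in purchases)

# Each row becomes a pair of integer indices instead of a wide binary vector.
encoded = [(product_index[p], category_index[c]) for p, c in purchases]

# One-hot encoding would need one column per distinct value across both features.
one_hot_width = len(product_index) + len(category_index)

print(encoded, one_hot_width)
```

With the cardinalities from the question, the same sum gives 50,000 + 1,200 = 51,200 one-hot columns, versus two integer columns for the index encoding.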

Exam trap

The trap here is that candidates default to one-hot encoding (Option A) as the standard categorical encoding technique, not realizing that factorization machines are specifically designed to avoid that explosion by accepting raw integer indices as categorical features.

How to eliminate wrong answers

Option A is wrong because one-hot encoding 1,200 categories and 50,000 products would create 51,200 binary columns, causing extreme sparsity and memory blowup, which undermines the factorization machine's efficiency and can lead to poor generalization. Option B is wrong because dropping the 'category' column discards valuable hierarchical information (e.g., product type) that could improve recommendation quality; factorization machines are designed to handle high-cardinality features, so there is no need to drop it. Option D is wrong because PCA is a linear dimensionality reduction technique for continuous features, not suitable for categorical data; applying PCA to one-hot encoded categories would destroy the interpretability of interactions and is not a standard preprocessing step for factorization machines.


229

Multi-Selecteasy

A company ingests daily log data into an S3 bucket. They need to update the existing ML training dataset with new data without reprocessing the entire history. Which two strategies should they adopt? (Choose two.)

Select 2 answers

A.Store all data in a single large file and use append operations

B.Use AWS Glue to incrementally process new partitions

C.Use a partition key such as date to add new partitions

D.Manually copy new files to the same S3 bucket

E.Overwrite the entire existing dataset with the new data

AnswersB, C

Glue can process only new partitions using job bookmarks.

Why this answer

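AWS Glue can perform incremental processing by using job bookmarks to track previously processed data and only process new partitions or files. This avoids reprocessing the entire historical dataset, making it efficient for updating ML training datasets with daily log data. Partitioning the data by a key such as date (Option C) complements this: each day's logs land in their own partition, so new data is added without touching existing partitions.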

Exam trap

AWS often tests the misconception that S3 supports append operations or that simply copying new files to the same bucket constitutes an incremental update strategy, when in reality S3 objects are immutable and a proper processing framework like AWS Glue with job bookmarks is required.
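The idea of date partitions plus "process only what is new" can be sketched in plain Python. This is only a conceptual stand-in for what a job bookmark tracks, not the AWS Glue API; the Hive-style `year=/month=/day=` layout, the `logs` prefix, and both helper functions are assumptions made for illustration.

```python
from datetime import date


def partition_prefix(base_prefix, day):
    """Build a Hive-style date partition prefix for one day of data."""
    return f"{base_prefix}/year={day.year:04d}/month={day.month:02d}/day={day.day:02d}/"


def new_partitions(all_days, processed_days):
    """Return the days whose partitions have not been processed yet, oldest first."""
    return sorted(set(all_days) - set(processed_days))


# Hypothetical state: two days of logs exist, one has already been processed.
days_in_bucket = [date(2024, 1, 14), date(2024, 1, 15)]
already_processed = [date(2024, 1, 14)]

pending = [partition_prefix("logs", d) for d in new_partitions(days_in_bucket, already_processed)]
print(pending)
```

Only the unprocessed day's prefix is returned, so the historical partitions are never re-read.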


230

MCQhard

A healthcare company uses Amazon SageMaker to deploy a real-time inference endpoint for a diagnostic model. The endpoint is configured with a single ml.p3.2xlarge instance. The model processes patient data and returns a risk score. Recently, the endpoint has been experiencing intermittent 504 errors along with increased latency. The team uses Amazon CloudWatch to monitor the endpoint's InvocationsPerInstance and ModelLatency metrics. They observe that InvocationsPerInstance is well below the throttling threshold, but ModelLatency shows periodic spikes lasting 5-10 seconds. The endpoint's CPU utilization remains below 60%, but memory utilization occasionally spikes to 90% during those spikes. The team has checked the inference code and found no obvious memory leaks or performance bottlenecks in the custom logic. The model itself is a deep neural network hosted using Apache MXNet. The team suspects that the issue might be related to resource contention or an external dependency. What should the team do FIRST to diagnose and resolve the issue?

A.Implement request batching to increase throughput and reduce the number of inference requests.

B.Increase the instance type to a more memory-intensive instance like ml.p3.8xlarge to handle memory spikes.

C.Set up SageMaker Model Monitor to track data drift and model quality metrics.

D.Enable SageMaker Debugger rules and profiling to monitor memory and CPU utilization at a fine-grained level during inference.

AnswerD

Debugger can provide detailed profiling to pinpoint resource contention or memory issues in the model or framework.

Why this answer

Option D is correct because the symptoms point to a possible memory contention issue, and enabling detailed profiling for memory and CPU can identify the root cause. Option B is wrong because increasing the instance size might mask the problem without identifying it. Option A is wrong because request batching can increase memory usage and may worsen the issue.

Option C is wrong because Model Monitor is for data drift, not performance diagnostics.


231

MCQhard

A team deploys a machine learning model using a SageMaker endpoint with an ML.T4 instance. After a week, they notice that the endpoint's CPU utilization is consistently below 10% and latency is low. However, the endpoint is incurring high costs. Which action should the team take to reduce costs while maintaining the ability to serve traffic?

A.Switch to a multi-model endpoint to share instances across models

B.Reduce the number of instances to one

C.Migrate to a SageMaker Serverless Inference endpoint

D.Implement an asynchronous inference endpoint

AnswerC

Serverless endpoints scale to zero when idle, reducing cost.

Why this answer

Option C is correct because a serverless inference endpoint scales to zero when not in use, reducing cost. Option A is wrong because multi-model endpoints still have always-on instances. Option B is wrong because reducing instances may cause throttling.

Option D is wrong because asynchronous inference queues requests and is intended for large payloads or long-running inferences, not low-latency real-time responses.


232

MCQmedium

A machine learning model is deployed on SageMaker and its predictions are used in a production application. The model's accuracy has degraded over time. What is the most likely cause?

A.The training data was not shuffled properly.

B.The model was not compiled for inference.

C.The model experienced concept drift.

D.The endpoint instance type is too small.

AnswerC

Concept drift is a common cause of accuracy degradation in production.

Why this answer

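Concept drift occurs when the relationship between the input features and the target variable changes over time, so a model trained on historical data gradually becomes less accurate. Improper shuffling of the training data would have affected the model from the start rather than causing degradation over time, while model compilation and instance size affect inference latency, not accuracy after deployment.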


233

MCQmedium

A company's SageMaker endpoint is experiencing increased latency during peak hours. The endpoint uses a single ml.m5.large instance. The deployment is critical and must maintain low latency. Which action is MOST effective to reduce latency without sacrificing cost efficiency?

A.Deploy multiple variants with A/B testing

B.Use Elastic Inference to attach an accelerator

C.Switch to a ml.c5.large instance

D.Add an auto-scaling policy based on request count

E.Enable SageMaker Model Monitor

AnswerD

Auto-scaling adjusts instance count to match demand, reducing latency during spikes while minimizing cost.

Why this answer

Adding auto-scaling based on request count allows the endpoint to handle spikes without over-provisioning, balancing cost and latency.
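For orientation, a request-count-based policy is usually expressed as a target tracking configuration on the predefined metric `SageMakerVariantInvocationsPerInstance`. The sketch below only builds that configuration as a Python dictionary; the target value and cooldowns are hypothetical placeholders, and registering the policy with Application Auto Scaling is not shown.

```python
import json

# Hypothetical numbers: tune the target and cooldowns for the actual workload.
target_tracking_config = {
    # Desired average invocations per instance per minute (assumed value).
    "TargetValue": 100.0,
    "PredefinedMetricSpecification": {
        "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
    },
    # Scale out quickly during spikes, scale in more conservatively.
    "ScaleOutCooldown": 60,
    "ScaleInCooldown": 300,
}

print(json.dumps(target_tracking_config, indent=2))
```

A shorter scale-out cooldown than scale-in cooldown is one common design choice for latency-sensitive endpoints, since adding capacity late costs more than removing it late.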


234

MCQmedium

A machine learning engineer has configured a SageMaker Model Monitor schedule for data quality monitoring as shown in the exhibit. The schedule is set to run hourly. However, the engineer notices that the monitoring jobs are not producing output in the specified S3 bucket. What is the most likely cause?

A.The output_path is incorrectly placed; it should be under the MonitoringOutputConfig.

B.The DataAnalysisStartTime and DataAnalysisEndTime are set to a past date, so no data is analyzed.

C.The MonitoringType should be 'ModelQuality' to enable data quality monitoring.

D.The cron expression is incorrectly formatted for an hourly schedule.

AnswerB

The monitoring job looks for data within the specified time range; if it's in the past and no data exists, no output is produced.

Why this answer

Option B is correct because the DataAnalysisStartTime and DataAnalysisEndTime parameters define the time window of data that SageMaker Model Monitor analyzes. When both are set to a past date, the monitoring job finds no new data within that window, so nothing is written to the S3 bucket. The schedule still runs hourly, but because the analysis window is fixed to a historical period, each execution produces no results.

Exam trap

The trap here is that candidates often overlook the significance of the DataAnalysisStartTime and DataAnalysisEndTime parameters, assuming they are optional or default to the current time, when in fact they strictly define the data range and can cause silent failures if set to a past date.

How to eliminate wrong answers

Option A is wrong because the output_path is correctly placed under the MonitoringOutputConfig in the exhibit; SageMaker Model Monitor requires the output location to be specified within MonitoringOutputConfig, not as a separate top-level parameter. Option C is wrong because the MonitoringType should be 'DataQuality' for data quality monitoring, not 'ModelQuality'; 'ModelQuality' is used for model quality monitoring (e.g., accuracy, precision), which is a different monitoring type. Option D is wrong because the cron expression 'cron(0 * * * ? *)' is correctly formatted for an hourly schedule at the start of each hour; there is no syntax error in the expression.


235

MCQeasy

Refer to the exhibit. A data scientist is trying to use AWS Glue to read data from the S3 bucket `ml-data-bucket`. The Glue job fails with an access denied error. What is the most likely cause?

A.The policy allows s3:PutObject but the job only reads

B.The policy does not specify the bucket ARN without /*

C.The Glue job role does not have the required permissions

D.The policy does not include s3:ListBucket permission on the bucket

AnswerD

Glue needs ListBucket to discover objects in the bucket.

Why this answer

The error occurs because the IAM policy attached to the Glue job role grants s3:GetObject on the bucket objects (via the `arn:aws:s3:::ml-data-bucket/*` resource) but does not include the s3:ListBucket permission on the bucket itself (`arn:aws:s3:::ml-data-bucket`). When AWS Glue reads data from S3, it first performs a ListBucket operation to enumerate objects in the bucket or prefix, and without that permission, the request is denied even if GetObject is allowed.

Exam trap

AWS often tests the subtle distinction between bucket-level permissions (like s3:ListBucket) and object-level permissions (like s3:GetObject), where candidates assume that granting GetObject on objects is sufficient for reading data, forgetting that listing the bucket is a prerequisite for discovering those objects.

How to eliminate wrong answers

Option A is wrong because the error is an access denied on a read operation, not a write operation; s3:PutObject is irrelevant to reading data. Option B is wrong because the ARN format is not the root problem: the bucket ARN without `/*` is only needed as the resource for s3:ListBucket, and that action is missing from the policy entirely. Option C is wrong because the Glue job role does have some permissions (as shown in the exhibit), but the specific missing permission is s3:ListBucket, not a general lack of permissions.
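Since the exhibit is not reproduced here, the following is only a sketch of what a read policy with both permission levels typically looks like, built as a Python dictionary and serialized to JSON. The bucket name comes from the question; everything else is a minimal assumed policy, not the one in the exhibit.

```python
import json

BUCKET = "ml-data-bucket"

read_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            # Bucket-level action: the resource is the bucket ARN itself (no /*).
            "Effect": "Allow",
            "Action": "s3:ListBucket",
            "Resource": f"arn:aws:s3:::{BUCKET}",
        },
        {
            # Object-level action: the resource is the objects inside the bucket.
            "Effect": "Allow",
            "Action": "s3:GetObject",
            "Resource": f"arn:aws:s3:::{BUCKET}/*",
        },
    ],
}

policy_json = json.dumps(read_policy, indent=2)
print(policy_json)
```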


236

MCQmedium

A team is evaluating classification models for a medical diagnosis application. The cost of a false negative is much higher than the cost of a false positive. Which metric should be optimized during model selection?

A.Recall

B.Accuracy

C.F1 score

D.Precision

AnswerA

Recall minimizes false negatives, directly addressing the high cost of missed diagnoses.

Why this answer

Option A is correct because recall (true positive rate) focuses on minimizing false negatives, which is the priority when a missed diagnosis is costly. Option D (precision) minimizes false positives. Option B (accuracy) treats all errors equally.

Option C (F1 score) balances precision and recall but does not emphasize recall over precision.
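A small worked example may help show how recall and precision pull in different directions. The confusion counts below are invented for illustration and do not come from any real model.

```python
def recall(tp, fn):
    """Fraction of actual positives that the model catches."""
    return tp / (tp + fn)


def precision(tp, fp):
    """Fraction of predicted positives that are actually positive."""
    return tp / (tp + fp)


# Hypothetical confusion counts for two candidate models on the same validation set.
model_a = {"tp": 90, "fp": 40, "fn": 10}  # flags more patients, misses few cases
model_b = {"tp": 70, "fp": 5, "fn": 30}   # rarely wrong when it flags, misses more

for name, m in (("model_a", model_a), ("model_b", model_b)):
    print(name, "recall:", recall(m["tp"], m["fn"]), "precision:", precision(m["tp"], m["fp"]))
```

When false negatives are the costly error, model_a would be preferred here despite its lower precision, because it misses 10 positive cases instead of 30.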


237

MCQhard

What is the most likely cause of this error?

A.The account has not requested a service limit increase for the specified instance type.

B.The training job is configured to use Managed Spot Training and the spot market is unavailable.

C.The IAM role used for the training job does not have sufficient permissions to launch the instance.

D.The instance type specified is not available in the current AWS Region.

AnswerA

A ResourceLimitExceeded error with a limit of 0 means the account needs to request a service limit increase from AWS Support.

Why this answer

The error indicates that the service limit for the specified instance type is set to 0, meaning the account has not been granted a limit increase for that instance. The instance might be available in the region, but the limit is zero. IAM permissions would give an access denied error, not ResourceLimitExceeded.

Spot unavailability generates a different error.


238

MCQeasy

A data scientist is using Amazon SageMaker to train a linear regression model. After training, the scientist notices that the training and validation errors are both low, but the model performs poorly on new test data. What is the MOST likely cause?

A.There is data leakage from the validation set into the training set

B.The features are not scaled properly

C.The model is overfitting the training data

D.The model has high bias

AnswerA

Data leakage artificially inflates performance on validation but fails on true unseen data.

Why this answer

Option A is correct because data leakage from the validation set into the training set would allow the model to learn patterns that are not present in truly unseen data, leading to artificially low training and validation errors but poor generalization to new test data. In SageMaker, this can occur if the dataset is not properly split before feature engineering or if preprocessing (e.g., scaling or imputation) is applied to the entire dataset before splitting, causing the validation set to influence the training process.
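The preprocessing form of leakage mentioned above can be illustrated with a scaler that is fitted on the training split only. This is a minimal standard-library sketch with made-up numbers; the helper names are hypothetical and not part of any SageMaker API.

```python
import statistics


def fit_scaler(train_values):
    """Compute scaling statistics from the training split only."""
    return statistics.mean(train_values), statistics.pstdev(train_values)


def apply_scaler(values, mean, std):
    """Scale any split using statistics learned from the training split."""
    return [(v - mean) / std for v in values]


values = [10.0, 12.0, 14.0, 16.0, 100.0, 120.0]
train, validation = values[:4], values[4:]  # split BEFORE fitting any statistics

mean, std = fit_scaler(train)
train_scaled = apply_scaler(train, mean, std)
validation_scaled = apply_scaler(validation, mean, std)

# Leaky variant for comparison: statistics computed on all rows, validation included.
leaky_mean, leaky_std = fit_scaler(values)
print(mean, leaky_mean)
```

The leaky mean differs from the training-only mean because validation rows influenced it, which is exactly the kind of information flow that makes validation results look better than performance on new data.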

Exam trap

The trap here is that candidates often confuse overfitting (low training error, high validation error) with data leakage (low training and validation errors, but poor test performance), so they incorrectly select Option C without recognizing that the validation error is also low.

How to eliminate wrong answers

Option B is wrong because improper feature scaling typically leads to slow convergence or suboptimal performance during training, but it would not cause low training and validation errors with poor test performance; scaling issues usually affect both training and validation errors similarly. Option C is wrong because overfitting would result in low training error but high validation error, not low validation error as described in the scenario. Option D is wrong because high bias (underfitting) would cause both training and validation errors to be high, not low.


239

MCQeasy

A healthcare startup is building a model to predict patient readmission within 30 days. The data is stored in Amazon Redshift and includes patient demographics, admission history, lab results, and medication records. The data scientist extracts a sample of 10,000 records to Amazon S3 as CSV files for initial prototyping. During exploratory data analysis, they find that the 'age' column has values like '150', '0', and negative numbers. The 'diagnosis_code' column contains codes like 'E11', 'E11.9', and 'e11' (inconsistent formatting). The 'readmitted' target column has 60% 'Yes' and 40% 'No'. The data scientist wants to use AWS Glue DataBrew for data cleaning. Which combination of steps should they use?

A.In AWS Glue DataBrew: 1) Filter age between 0 and 120 to remove invalid values. 2) Standardize diagnosis_code to uppercase using a formula. 3) Apply Random Oversampling to balance the target column.

B.In AWS Glue DataBrew: 1) Impute age with the mean. 2) Apply Standard Scaler to all numeric columns. 3) Use Random Oversampling to balance the target column.

C.In AWS Glue DataBrew: 1) Replace age with median. 2) Convert diagnosis_code to uppercase. 3) Apply SMOTE to balance the target column.

D.In AWS Glue DataBrew: 1) Remove rows where age is outside 0-120. 2) Drop diagnosis_code column. 3) Use Random Undersampling to balance the target column.

AnswerA

Filtering removes invalid ages, standardizing codes ensures consistency, and oversampling addresses imbalance.

Why this answer

Option A is correct because it uses AWS Glue DataBrew's built-in capabilities to filter invalid age values (0–120), standardize the diagnosis_code to uppercase via a formula, and apply Random Oversampling to address the 60/40 class imbalance. DataBrew supports filtering, formula-based transformations, and built-in ML transforms like Random Oversampling, making this combination valid and efficient for data cleaning.
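The first two cleaning steps can be expressed in plain Python to make their effect concrete. This is not DataBrew recipe syntax; the function name and sample records are hypothetical, and the 0 to 120 bounds are the ones given in Option A.

```python
def clean_records(records, min_age=0, max_age=120):
    """Drop rows with out-of-range ages and standardize diagnosis codes to uppercase."""
    cleaned = []
    for record in records:
        if not (min_age <= record["age"] <= max_age):
            continue  # invalid age: filter the row out
        cleaned.append({**record, "diagnosis_code": record["diagnosis_code"].strip().upper()})
    return cleaned


# Hypothetical sample rows showing the issues described in the question.
sample = [
    {"age": 45, "diagnosis_code": "e11"},
    {"age": 150, "diagnosis_code": "E11.9"},
    {"age": -3, "diagnosis_code": "E11"},
    {"age": 67, "diagnosis_code": "E11.9"},
]

cleaned = clean_records(sample)
print(cleaned)
```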

Exam trap

The trap here is that candidates may assume SMOTE or Standard Scaler are available in DataBrew, but AWS Glue DataBrew has a limited set of built-in ML transforms (e.g., Random Oversampling, Random Undersampling) and does not include SMOTE or Standard Scaler, which are typically handled in Amazon SageMaker or custom scripts.

How to eliminate wrong answers

Option B is wrong because imputing age with the mean is inappropriate when values include '150', '0', and negative numbers, which would skew the mean and introduce bias; also, Standard Scaler should be applied after cleaning and splitting, not during initial prototyping, and DataBrew does not natively support Standard Scaler as a built-in transform. Option C is wrong because replacing age with the median does not address the invalid values themselves (e.g., negative numbers), and DataBrew does not support SMOTE (Synthetic Minority Oversampling Technique) as a built-in transform; SMOTE is typically applied in SageMaker or custom scripts. Option D is wrong because dropping the diagnosis_code column removes potentially predictive information without attempting to standardize it, and Random Undersampling would discard a third of the majority class (20% of all records) to balance a 60/40 split, which loses valuable data and is less preferred than oversampling here.


240

MCQeasy

A data scientist trained a model using SageMaker and wants to automate the retraining process when new data becomes available. Which AWS service is best suited to trigger a SageMaker training job based on an S3 event?

A.AWS Step Functions with a scheduled trigger.

B.Amazon Simple Workflow Service (SWF) decider.

C.Amazon EventBridge with a rule matching S3 object creation.

D.Amazon Simple Queue Service (SQS) with a polling script.

AnswerC

EventBridge can invoke a Lambda function that starts the training job.

Why this answer

Option C is correct because Amazon EventBridge can listen to S3 events (e.g., PutObject) and trigger a Lambda function to start a training job. Option D (SQS) is a queue, not a trigger for direct training jobs. Option B (SWF) is a workflow service not typically used for this simple pattern.

Option A (Step Functions) can orchestrate but is not directly triggered by S3 events without EventBridge.
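An EventBridge rule matching S3 object creation is defined by an event pattern. The sketch below builds such a pattern as a Python dictionary; the bucket name is a hypothetical placeholder, and creating the rule and its Lambda target is not shown. It also assumes the bucket has EventBridge notifications enabled.

```python
import json

BUCKET = "example-training-data"  # hypothetical bucket name

# Matches "Object Created" events that S3 sends to EventBridge for this bucket.
event_pattern = {
    "source": ["aws.s3"],
    "detail-type": ["Object Created"],
    "detail": {"bucket": {"name": [BUCKET]}},
}

pattern_json = json.dumps(event_pattern)
print(pattern_json)
```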


241

Multi-Selectmedium

A data science team uses SageMaker Studio to collaborate. They need to restrict access to certain SageMaker Studio applications (e.g., only JupyterLab, no RStudio). Which THREE steps should they take? (Choose THREE.)

Select 3 answers

A.Use an S3 bucket policy to prevent access to RStudio application artifacts.

B.Attach an IAM policy to users that denies sagemaker:CreateApp for specific app types.

C.Enable CloudTrail to log user activity and monitor for prohibited app usage.

D.Define a SageMaker Studio domain-level policy that specifies allowed apps.

E.Create a custom lifecycle configuration that disables unauthorized apps.

AnswersB, D, E

IAM policies can deny creation of specific apps like RStudio.

Why this answer

Options B, D, and E are correct. A domain-level policy can restrict apps, a lifecycle configuration can enforce settings at launch, and IAM policies can limit specific applications. Option A (S3 bucket policy) doesn't control Studio apps.

Option C (CloudTrail) is for auditing, not restriction.


242

MCQmedium

What will the debugger do with this configuration?

A.It will only capture gradients and not run any rules because the rule name is misspelled.

B.It will capture gradients every 10 steps and trigger a rule if loss does not decrease for 500 epochs.

C.It will capture gradients every 500 steps and trigger a rule if loss does not decrease for 10 steps with a threshold of 0.001.

D.It will capture gradients every 500 steps and trigger a rule if loss does not decrease for 500 iterations with a patience of 10.

AnswerC

The collection captures gradients every 500 steps; the rule parameters (patience=10, threshold=0.001) define when to alert.

Why this answer

The 'save_interval' in the collection captures gradients every 500 steps. The rule 'LossNotDecreasing' checks whether the loss fails to decrease for 'patience' consecutive steps (10) within a tolerance of 'threshold' (0.001). Option A is wrong because the rule name is valid, so the rule does run. Option B misreads the timing, treating 10 as the save interval and 500 as the rule's window. Option D swaps the values, applying the 500-step save interval to the rule's window.


243

MCQhard

Refer to the exhibit. A SageMaker training job logs show training AUC increasing but validation AUC plateauing at 0.880. What is the most likely issue?

A.Overfitting

B.Learning rate too high

C.Underfitting

D.Insufficient training data

AnswerA

The model is memorizing training data (train AUC up) but not generalizing (validation AUC flat).

Why this answer

Training AUC continues to increase while validation AUC stops improving and even drops slightly, indicating overfitting. Underfitting would show both low, high learning rate would cause erratic behavior, and insufficient data typically causes high variance but not this pattern.


244

MCQeasy

A company deploys a deep learning model to a real-time SageMaker endpoint. After deployment, users report high inference latency. Which action is the MOST effective first step to reduce latency?

A.Switch to a larger instance type with more GPU memory.

B.Compile the model using SageMaker Neo to optimize for the target instance.

C.Enable SageMaker Model Monitor to capture inference data.

D.Increase the number of instances in the endpoint to handle more requests.

AnswerB

Neo optimizes the model for the specific hardware, reducing inference latency with minimal accuracy loss.

Why this answer

SageMaker Neo compiles the trained model to optimize it for the target instance hardware, reducing inference latency without requiring additional resources. This is the most effective first step because it directly addresses model execution efficiency, often yielding significant speedups for deep learning models.

Exam trap

The trap here is that candidates often confuse latency reduction with throughput improvement, incorrectly choosing horizontal scaling (Option D) or vertical scaling (Option A) as the first step, when model optimization via compilation is the most direct and cost-effective approach.

How to eliminate wrong answers

Option A is wrong because switching to a larger instance type with more GPU memory may reduce latency if the model is memory-bound, but it is not the most effective first step—it increases cost and does not address software-level inefficiencies. Option C is wrong because SageMaker Model Monitor is used for capturing inference data to detect data drift and model quality issues, not for reducing latency. Option D is wrong because increasing the number of instances (horizontal scaling) improves throughput and handles more concurrent requests, but it does not reduce the latency of individual inference requests; it may even add network overhead.


245

MCQeasy

A company is training a binary classifier in SageMaker and observes that the training loss decreases but validation loss increases after a few epochs. What is the most likely issue?

A.Learning rate too high

B.Overfitting

C.Underfitting

D.Data imbalance

AnswerB

Correct: Overfitting occurs when the model performs well on training data but poorly on validation data.

Why this answer

Option B is correct because overfitting occurs when the model performs well on training data but poorly on validation data. Option C is wrong because underfitting would show high loss on both datasets. Option A is wrong because a high learning rate may cause divergence but not necessarily a validation loss increase while training loss keeps decreasing.

Option D is wrong because data imbalance typically affects both training and validation metrics.


246

MCQmedium

Which feature scaling method is most robust to outliers in the data?

A.Normalization (L2)

B.Standardization (Z-score)

C.Robust scaling

D.Min-max scaling

AnswerC

Robust scaling uses median and IQR, thus resilient to outliers.

Why this answer

Robust scaling uses median and interquartile range, making it less sensitive to outliers than standardization (mean and standard deviation) or min-max scaling (range dependent on extremes).
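To see the difference on data with an outlier, the sketch below compares robust scaling with standardization using only the Python standard library. The data is invented, and the quartile method (`"inclusive"`) is one of several conventions, chosen here as an assumption.

```python
import statistics


def robust_scale(values):
    """Center on the median and divide by the interquartile range (IQR)."""
    median = statistics.median(values)
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return [(v - median) / (q3 - q1) for v in values]


def standardize(values):
    """Center on the mean and divide by the (population) standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


data = [1, 2, 3, 4, 5, 1000]  # hypothetical feature with one extreme outlier

robust = robust_scale(data)
zscores = standardize(data)

# Spread of the five ordinary points after each scaling.
print("robust spread:", robust[4] - robust[0])
print("z-score spread:", zscores[4] - zscores[0])
```

The outlier inflates the mean and standard deviation, so standardization squeezes the five ordinary points into a narrow band, while the median and IQR are barely affected and keep them well separated.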


247

MCQeasy

A company has a dataset of 2 billion records stored as text files in Amazon S3. The data is partitioned by year and month. The data science team wants to read only the last 6 months of data for model training using SageMaker. To minimize data scanned and reduce costs, which approach should the team use?

A.Use S3 Select to retrieve only the last 6 months of data by applying an SQL expression on each object.

B.Use AWS Glue to create a catalog table with partitions, then query with Athena to create a filtered dataset in S3.

C.Use SageMaker Processing with a script that lists all objects in the bucket and reads only those with the desired prefixes.

D.Use SageMaker Processing with Input Mode 'File' and specify the S3 prefix for the last 6 months.

AnswerB

Partition pruning ensures only relevant data is scanned.

Why this answer

Option B is correct because AWS Glue can crawl the S3 data to create a catalog table with partitions by year and month. Athena can then query only the partitions corresponding to the last 6 months, scanning minimal data and writing the filtered results back to S3 for SageMaker training. This approach leverages partition pruning to reduce costs and avoids loading or processing the full 2 billion records.

Exam trap

AWS often tests the misconception that SageMaker's Input Mode 'File' or S3 Select can efficiently filter partitioned data, but the key trap is that partition pruning requires a catalog service (like Glue) and a query engine (like Athena) to avoid scanning all objects or listing the entire bucket.

How to eliminate wrong answers

Option A is wrong because S3 Select operates on a single object at a time and cannot filter across multiple objects or partitions; applying it to 2 billion records would require iterating over all objects, negating cost savings. Option C is wrong because listing all objects in the bucket and reading only those with desired prefixes still requires enumerating the entire bucket, which incurs significant API costs and does not minimize the work done. Option D is wrong because SageMaker Processing with Input Mode 'File' downloads every object under the specified S3 prefix to the processing instance, and because the data is partitioned by year and month, no single prefix covers exactly the last 6 months; the team would have to configure each monthly prefix explicitly, which is less efficient than Glue with Athena.
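To make partition pruning concrete, the sketch below lists the six monthly partitions that a pruned query over the last 6 months would be limited to. The Hive-style `year=/month=` layout, the bucket path, and the function name are assumptions for illustration; the actual layout is not given in the question.

```python
def last_n_month_prefixes(base_prefix, year, month, n=6):
    """Return the n most recent monthly partition prefixes, oldest first."""
    prefixes = []
    for _ in range(n):
        prefixes.append(f"{base_prefix}/year={year:04d}/month={month:02d}/")
        month -= 1
        if month == 0:  # roll back into December of the previous year
            year, month = year - 1, 12
    return list(reversed(prefixes))


# Hypothetical bucket path and "current" month (February 2024).
for prefix in last_n_month_prefixes("s3://example-bucket/data", 2024, 2):
    print(prefix)
```

A query engine with a partitioned catalog table performs this selection from a `WHERE` clause on the partition columns, so only data under these prefixes is scanned.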


248

Multi-Selecteasy

A company uses SageMaker Model Monitor to detect drift. They want to receive notifications when drift is detected. Which TWO services can be used together to send notifications? (Choose TWO.)

Select 2 answers

A.Amazon SNS topics to send email or SMS.

B.Amazon EventBridge to trigger a notification.

C.AWS Lambda to process the drift and send an email via SES.

D.Amazon SQS to queue the notification.

E.Amazon CloudWatch Alarms set on the drift metric.

AnswersA, E

SNS is used for sending notifications.

Why this answer

Options A and E are correct. An Amazon CloudWatch alarm can trigger on a Model Monitor metric and publish to an SNS topic, which sends the email or SMS. Option C (Lambda) could be used but is not a direct notification service.

Option D (SQS) is a message queue, not a notification channel. Option B (EventBridge) routes events to targets rather than delivering notifications itself; it usually triggers Lambda.
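
The CloudWatch alarm plus SNS pairing can be sketched as the parameter set one might pass to the CloudWatch `put_metric_alarm` API. This is only a sketch: the namespace, the `feature_baseline_drift_<feature>` metric name, and the dimension names follow the pattern Model Monitor is understood to emit, but treat them as assumptions and check them against the metrics in your own account. All argument values are hypothetical.

```python
def drift_alarm_params(endpoint: str, schedule: str, feature: str,
                       sns_topic_arn: str, threshold: float) -> dict:
    """Build keyword arguments for CloudWatch put_metric_alarm."""
    return {
        "AlarmName": f"{endpoint}-{feature}-drift",
        # Assumed namespace and metric naming for Model Monitor data metrics.
        "Namespace": "aws/sagemaker/Endpoints/data-metrics",
        "MetricName": f"feature_baseline_drift_{feature}",
        "Dimensions": [
            {"Name": "Endpoint", "Value": endpoint},
            {"Name": "MonitoringSchedule", "Value": schedule},
        ],
        "Statistic": "Average",
        "Period": 3600,            # one evaluation per hourly monitoring run
        "EvaluationPeriods": 1,
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
        # The alarm publishes to this SNS topic, which fans out email or SMS.
        "AlarmActions": [sns_topic_arn],
    }


params = drift_alarm_params(
    endpoint="my-endpoint",                      # hypothetical names
    schedule="my-monitoring-schedule",
    feature="age",
    sns_topic_arn="arn:aws:sns:us-east-1:123456789012:drift-alerts",
    threshold=0.1,
)
# With boto3 installed and credentials configured, one would then call:
#   boto3.client("cloudwatch").put_metric_alarm(**params)
print(params["AlarmName"])
```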


249

MCQmedium

A company is building a binary classifier for credit default prediction. The dataset is highly imbalanced (98% no default). They want to maximize recall for the minority class while maintaining reasonable precision. Which metric should be optimized during hyperparameter tuning?

A.AUC-ROC

B.F1 score

C.Accuracy

D.Precision

AnswerB

F1 score is the harmonic mean of precision and recall, addressing both metrics.

Why this answer

F1 score balances precision and recall, making it suitable for imbalanced datasets when both metrics are important. Other options are less appropriate because accuracy is misleading due to imbalance, precision ignores recall, and AUC-ROC does not directly optimize recall at a decision threshold.
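
To make the contrast with accuracy concrete, here is a small worked sketch. The confusion-matrix counts are invented for illustration (10,000 records, 2% positives) and do not come from the question.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Compute precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical model: finds 150 of 200 defaults, with 100 false alarms.
p, r, f1 = precision_recall_f1(tp=150, fp=100, fn=50)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")

# A model that always predicts "no default" is 98% accurate on this data,
# yet it has no true positives, so its F1 is zero.
_, _, f1_trivial = precision_recall_f1(tp=0, fp=0, fn=200)
print(f"trivial model f1={f1_trivial:.2f}")
```

With these counts precision is 0.60, recall is 0.75, and F1 is roughly 0.67, while the always-negative model scores an F1 of 0 despite its 98% accuracy.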


250

MCQhard

A team is using SageMaker Pipelines to train a model. The pipeline has multiple steps: data processing, training, evaluation, and registration. They use a Condition step to evaluate the model's accuracy and if it exceeds a threshold, register the model. They run the pipeline and the training step succeeds, but the pipeline fails at the Condition step with an error: 'Unable to evaluate condition: the property 'Accuracy' does not exist.' The evaluation step output is a JSON file with key 'accuracy'. What is the most likely cause?

A.The evaluation step did not produce the output correctly.

B.The training step output is being used instead of the evaluation step output.

C.The pipeline definition has a syntax error.

D.The Condition step is referencing the wrong property name.

AnswerD

Correct: 'Accuracy' vs 'accuracy' case mismatch causes the error.

Why this answer

Option D is correct because the Condition step references 'Accuracy' (capital A) but the evaluation output uses 'accuracy' (lowercase). Property names are case-sensitive. Option A is wrong because the evaluation step produced its output correctly.

Option B is wrong because even if the training step output were used, the property name would still be mismatched. Option C is wrong because the error is specific to the property name, not to pipeline syntax.
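
The case-sensitivity problem can be reproduced with plain JSON handling. The snippet below is a simplified stand-in for how a property lookup behaves, not the SageMaker Pipelines implementation; the report content and the helper `get_property` are hypothetical.

```python
import json

# Hypothetical evaluation output with a lowercase key, as in the question.
evaluation_report = json.loads('{"accuracy": 0.93}')


def get_property(report: dict, key: str) -> float:
    """Look up a key exactly as written; JSON keys are case-sensitive."""
    if key not in report:
        raise KeyError(f"the property '{key}' does not exist")
    return report[key]


print(get_property(evaluation_report, "accuracy"))  # matches the key

try:
    get_property(evaluation_report, "Accuracy")     # capital A does not match
except KeyError as err:
    print("lookup failed:", err)
```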


251

Multi-Selecthard

A company is building a real-time inference pipeline for an ML model. The raw data arrives in JSON format via Amazon Kinesis Data Streams. Before invoking the SageMaker endpoint, the data must be preprocessed to match the training data format. Which THREE steps should be included in the preprocessing function? (Select THREE)

Select 3 answers

A.Ensure that missing values are handled consistently with the training phase

B.Convert the data to a CSV string for model input

C.Apply the same feature engineering transformations (e.g., scaling, encoding) that were used during training

D.Re-train the model periodically using new data

E.Parse the JSON payload

AnswersA, C, E

Missing value handling must be identical to training to avoid errors.

Why this answer

Option A is correct because the preprocessing function must handle missing values identically to how they were handled during training to maintain data consistency. If the training phase used mean imputation for a numeric feature, the inference pipeline must apply the same mean value; otherwise, the model will receive unexpected input distributions, degrading prediction accuracy.
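
The three selected steps (parse the JSON, handle missing values as in training, apply the training-time transformations) might look like the sketch below. The feature names and the saved statistics in `TRAINING_STATS` are hypothetical; in practice they would be loaded from an artifact written during training.

```python
import json

# Hypothetical statistics captured during training and shipped with the model.
TRAINING_STATS = {
    "age": {"mean": 40.0, "std": 10.0},
    "income": {"mean": 50000.0, "std": 20000.0},
}
FEATURE_ORDER = ["age", "income"]  # must match the column order used in training


def preprocess(raw_payload: bytes) -> list[float]:
    """Turn one JSON record from the stream into a scaled feature vector."""
    record = json.loads(raw_payload)           # step 1: parse the JSON payload
    features = []
    for name in FEATURE_ORDER:
        stats = TRAINING_STATS[name]
        value = record.get(name)
        if value is None:                      # step 2: same imputation as training
            value = stats["mean"]
        # step 3: same standardization as training (z-score with saved stats)
        features.append((float(value) - stats["mean"]) / stats["std"])
    return features


print(preprocess(b'{"age": 50, "income": null}'))
```

Reusing the saved training statistics, rather than recomputing them on live traffic, is what keeps the inference inputs on the same scale the model saw during training.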

Exam trap

The trap here is that candidates confuse the preprocessing function's scope with broader MLOps tasks like model retraining, or assume a specific serialization format like CSV is required when JSON is natively supported by SageMaker endpoints.


252

MCQmedium

A team uses SageMaker Clarify to monitor bias drift in production. They schedule weekly analysis. After a month, Clarify reports a significant increase in a bias metric. What should the team do first?

A.Disable the bias monitor because the metric may be noisy.

B.Immediately retrain the model with a balanced dataset.

C.Increase the frequency of analysis to daily.

D.Review the analysis report to understand which feature and segment contributed to the drift.

AnswerD

The Clarify report provides details on which features and segments are driving the bias, guiding appropriate action.

Why this answer

Option D is correct because reviewing the report shows which feature and segment contributed to the drift. Option A ignores the issue. Option B is premature without understanding the cause.

Option C does not address the drift source.


253

MCQmedium

A company uses SageMaker Pipelines to train and register models. They want to automate the deployment of approved models from the model registry to a staging endpoint. Which service should they use to orchestrate the deployment workflow?

A.AWS Step Functions

B.AWS CloudFormation

C.Amazon EventBridge

D.AWS CodePipeline

AnswerA

Step Functions can orchestrate SageMaker API calls and integrate with Model Registry.

Why this answer

AWS Step Functions is the correct choice because it is a serverless orchestration service designed to coordinate multiple AWS services into flexible, event-driven workflows. For SageMaker Pipelines, Step Functions can trigger model deployment from the registry to a staging endpoint by chaining actions like invoking a Lambda function for approval checks, calling SageMaker's CreateEndpoint API, and handling rollback logic on failure.

Exam trap

AWS often tests the distinction between orchestration (Step Functions) and event routing (EventBridge) or CI/CD (CodePipeline), leading candidates to pick EventBridge because they confuse event-driven triggers with the need for sequential workflow coordination.

How to eliminate wrong answers

Option B (AWS CloudFormation) is wrong because it is an Infrastructure as Code (IaC) service for provisioning and managing AWS resources declaratively, not for orchestrating event-driven deployment workflows with conditional logic and error handling. Option C (Amazon EventBridge) is wrong because it is a serverless event bus for routing events between services, but it lacks built-in workflow orchestration capabilities like sequencing, branching, and human approval steps required for deployment pipelines. Option D (AWS CodePipeline) is wrong because it is a CI/CD service focused on source code build, test, and deploy stages, but it does not natively integrate with SageMaker model registry approval workflows or provide the granular orchestration needed for ML model deployment from registry to endpoint.


254

MCQmedium

A media company uses SageMaker to host a real-time video recommendation model. The model is deployed on a single ml.c5.xlarge endpoint. During a major live event, traffic surges to 10 times the normal load, and the endpoint becomes unresponsive, causing high latency and errors. The team had set up an Application Auto Scaling target tracking policy based on CPU utilization with a target of 70%. However, scaling did not trigger quickly enough. After the event, the team reviews CloudWatch metrics and notices that CPU utilization never exceeded 70% during the surge, but memory utilization peaked at 95%. The model is memory-bound. The team wants to ensure the endpoint scales automatically before performance degrades during future events. What should the team do?

A.Change the target tracking metric to memory utilization and set a target of 70%

B.Increase the target CPU utilization to 90% so that scaling triggers at higher load

C.Change the endpoint instance type to ml.c5.4xlarge to provide more memory per instance

D.Create a scheduled scaling policy to add instances during the known event time

AnswerA

Memory is the bottleneck; scaling on memory utilization will trigger before memory runs out.

Why this answer

Option A is correct because the model is memory-bound, and the current CPU-based target tracking policy failed to trigger scaling since CPU utilization never exceeded 70% during the surge. By switching to a memory utilization metric with a target of 70%, scaling will activate based on the actual resource constraint (memory), preventing performance degradation before the endpoint becomes unresponsive.
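
A memory-based target tracking policy could be expressed roughly as follows. The dictionary mirrors the shape of the `TargetTrackingScalingPolicyConfiguration` argument of the Application Auto Scaling `put_scaling_policy` API; the endpoint name, variant name, and cooldown values are assumptions chosen for illustration.

```python
def memory_target_tracking_config(endpoint_name: str, variant_name: str,
                                  target_percent: float = 70.0) -> dict:
    """Build a target tracking configuration keyed on memory utilization."""
    return {
        "TargetValue": target_percent,
        # There is no predefined memory metric type for endpoints, so the
        # policy points at the instance-level CloudWatch metric directly.
        "CustomizedMetricSpecification": {
            "MetricName": "MemoryUtilization",
            "Namespace": "/aws/sagemaker/Endpoints",
            "Dimensions": [
                {"Name": "EndpointName", "Value": endpoint_name},
                {"Name": "VariantName", "Value": variant_name},
            ],
            "Statistic": "Average",
            "Unit": "Percent",
        },
        "ScaleOutCooldown": 60,    # assumed: react quickly to surges
        "ScaleInCooldown": 300,    # assumed: release capacity more slowly
    }


config = memory_target_tracking_config("video-recs", "AllTraffic")
print(config["CustomizedMetricSpecification"]["MetricName"])
```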

Exam trap

The trap here is that candidates assume CPU utilization is always the correct metric for scaling, but the question explicitly states the model is memory-bound, so the scaling policy must match the actual bottleneck to be effective.

How to eliminate wrong answers

Option B is wrong because increasing the CPU target to 90% does not address the root cause: CPU utilization never exceeded 70% during the surge, so the policy would still not trigger scaling. Option C is wrong because changing to a larger instance type (ml.c5.4xlarge) provides more memory per instance but does not enable automatic scaling; the endpoint would still be a single instance and could become overwhelmed under similar traffic spikes. Option D is wrong because scheduled scaling only helps when the timing and size of a surge are known in advance. Even if the event time is known, the team wants a mechanism that reacts automatically to actual load before performance degrades, not a fixed schedule that may not match real traffic patterns.


255

MCQhard

A company is training a deep learning model on Amazon SageMaker using a dataset stored in Amazon S3. The training job is taking a long time due to I/O bottlenecks. The data is in JSON lines format. Which data preparation step combined with SageMaker's best practices would most effectively reduce training time?

A.Convert the JSON lines files to CSV format and use SageMaker's File mode for training.

B.Compress the JSON lines files using gzip and use File mode with local caching.

C.Convert the data to RecordIO-Protobuf format and use SageMaker's Pipe mode for training.

D.Split the data into multiple smaller files and use multiple training instances to parallelize.

AnswerC

RecordIO-Protobuf allows streaming data to the algorithm, minimizing I/O wait.

Why this answer

Option C is correct because converting JSON lines data to RecordIO-Protobuf format allows SageMaker's Pipe mode to stream data directly from Amazon S3 to the training algorithm without writing to disk, eliminating I/O bottlenecks. Pipe mode uses a FIFO pipe (named pipe) to feed data sequentially, which significantly reduces training time for deep learning models that iterate over the dataset multiple times.

Exam trap

The trap here is that candidates assume File mode is always faster because it caches data locally, but they overlook that Pipe mode eliminates the initial download latency entirely, which is the primary cause of I/O bottlenecks in large-scale deep learning training.

How to eliminate wrong answers

Option A is wrong because converting to CSV does not address the I/O bottleneck; File mode still downloads the entire dataset to the training instance's local storage before training begins, causing high latency. Option B is wrong because gzip compression reduces file size but File mode with local caching still requires a full download to disk, and decompression adds CPU overhead without eliminating the I/O bottleneck. Option D is wrong because splitting data into smaller files and using multiple instances parallelizes computation but does not reduce per-instance I/O latency; each instance still uses File mode by default, so the bottleneck persists.


256

MCQmedium

A data science team is using Amazon SageMaker to train and deploy a binary classification model. They want to continuously monitor the model for data drift in production. Which combination of AWS services and SageMaker features should they use to implement automated drift detection with minimal operational overhead?

A.SageMaker Debugger and Amazon SNS

B.SageMaker Pipelines and AWS Lambda

C.SageMaker Clarify and AWS Config

D.SageMaker Model Monitor and Amazon CloudWatch

AnswerD

SageMaker Model Monitor detects drift and sends metrics to CloudWatch for alerting.

Why this answer

SageMaker Model Monitor is the native SageMaker feature designed specifically for continuously monitoring deployed models for data drift, bias drift, and feature attribution drift. It automatically captures inference requests and responses, computes statistics, and publishes metrics to Amazon CloudWatch, which can trigger alarms for drift detection. This combination provides automated drift detection with minimal operational overhead because it requires no custom infrastructure or manual scheduling.

Exam trap

The trap here is that candidates confuse SageMaker Debugger (training debugging) with SageMaker Model Monitor (production drift detection), or they overcomplicate the solution by adding unnecessary services like Lambda or Config when the native integration with CloudWatch already provides automated alerting.

How to eliminate wrong answers

Option A is wrong because SageMaker Debugger is used for debugging training jobs (e.g., monitoring gradients, weights, and loss during training), not for monitoring data drift in production inference. Option B is wrong because SageMaker Pipelines is a CI/CD orchestration tool for building and managing ML workflows, not a continuous monitoring service; while AWS Lambda could be used to process drift alerts, the core drift detection capability is missing. Option C is wrong because SageMaker Clarify is designed for bias detection and explainability (SHAP values) on datasets or during training, not for real-time drift monitoring of production endpoints; AWS Config tracks resource configuration changes, not model performance or data drift.


257

MCQmedium

A data scientist is using Amazon SageMaker Data Wrangler to prepare a dataset. The dataset contains a column with date strings in the format 'YYYY-MM-DD'. The data scientist wants to extract the year, month, and day as separate features. Which Data Wrangler transform should be used?

A.Encode categorical transform.

B.Scale values transform.

C.Parse date transform.

D.Handle missing transform.

AnswerC

Parse date allows extracting date components from date strings.

Why this answer

The 'Parse date' transform in Amazon SageMaker Data Wrangler is specifically designed to convert date strings into structured datetime components. By applying this transform to the 'YYYY-MM-DD' column, the data scientist can automatically extract year, month, and day as separate features, enabling downstream feature engineering without manual string parsing.
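
For readers who want to see what this kind of parsing amounts to, the sketch below does the equivalent in plain Python. It illustrates the operation only and is not Data Wrangler's own code; the function name `split_date` is hypothetical.

```python
from datetime import datetime


def split_date(date_string: str) -> dict:
    """Parse a 'YYYY-MM-DD' string and return its components as features."""
    parsed = datetime.strptime(date_string, "%Y-%m-%d")
    return {"year": parsed.year, "month": parsed.month, "day": parsed.day}


print(split_date("2024-03-09"))
```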

Exam trap

The trap here is that candidates may confuse 'Parse date' with 'Encode categorical' because dates can be treated as categorical features, but the question specifically asks for extracting year, month, and day as separate features, which requires parsing the date string into its components, not encoding the entire date as a category.

How to eliminate wrong answers

Option A is wrong because 'Encode categorical' transform is used to convert categorical variables into numerical representations (e.g., one-hot encoding), not to parse date strings. Option B is wrong because 'Scale values' transform normalizes or standardizes numerical features (e.g., min-max scaling, z-score), which is irrelevant for extracting date components. Option D is wrong because 'Handle missing' transform addresses null or missing values through imputation or deletion, not date parsing.


258

MCQhard

Refer to the exhibit. A data scientist uses a SageMaker notebook instance to read a model file from S3 bucket 'my-bucket'. The bucket uses SSE-KMS encryption with a KMS key. The IAM role attached to the notebook has the above policy. However, reading the file fails. What is the MOST likely reason?

A.The resource ARN for S3 does not include the bucket itself (only objects inside).

B.The policy allows s3:GetObject only if server-side encryption is AES256, but the bucket uses KMS.

C.The condition requires encryption to be AES256, which is SSE-S3, but the bucket uses KMS.

D.The kms key ARN is incorrect.

E.The policy allows kms:Decrypt but does not allow kms:GenerateDataKey.

AnswerC

Correct. The condition checks for 'AES256' header, but SSE-KMS uses 'aws:kms', so the condition fails and access is denied.

Why this answer

The S3 condition requires server-side encryption to be AES256 (SSE-S3), but the bucket uses SSE-KMS, so the S3 request does not satisfy the condition, denying access.


259

MCQhard

A company is preparing a dataset with a categorical feature that has over 1000 unique values. They need to create features for a random forest model. Which feature engineering approach is most scalable and effective in AWS for high-cardinality categories?

A.Hash encoding using Apache Spark on Amazon EMR

B.One-hot encoding using SageMaker Processing with scikit-learn

C.Label encoding using Pandas in a SageMaker notebook

D.Target encoding with smoothing using SageMaker Data Wrangler

AnswerD

Target encoding reduces cardinality and is effective for tree models; Data Wrangler integrates natively.

Why this answer

Target encoding with smoothing in SageMaker Data Wrangler is the most scalable and effective approach because it replaces each high-cardinality category with the mean of the target variable, smoothed by a global prior to prevent overfitting. SageMaker Data Wrangler handles datasets with over 1000 unique values efficiently without exploding feature dimensions, unlike one-hot encoding, and avoids the ordinal bias of label encoding.
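
One common way to write the smoothing described above is a weighted blend of the category mean and the global mean: encoded = (n * category_mean + m * global_mean) / (n + m), where n is the category count and m is the smoothing weight. The sketch below implements that formula; it is an assumption about the general technique, and Data Wrangler's exact formula may differ.

```python
from collections import defaultdict


def smoothed_target_encoding(categories: list[str], targets: list[float],
                             smoothing: float = 10.0) -> dict[str, float]:
    """Map each category to a target mean shrunk toward the global mean."""
    global_mean = sum(targets) / len(targets)
    sums: dict[str, float] = defaultdict(float)
    counts: dict[str, int] = defaultdict(int)
    for category, target in zip(categories, targets):
        sums[category] += target
        counts[category] += 1
    # (n * category_mean + m * global_mean) / (n + m), with n * mean == sum
    return {
        category: (sums[category] + smoothing * global_mean)
        / (counts[category] + smoothing)
        for category in counts
    }


# Toy data: category "b" appears once, so it is pulled toward the global mean.
encoding = smoothed_target_encoding(
    ["a", "a", "a", "a", "b"], [1, 1, 1, 0, 1], smoothing=1.0
)
print(encoding)
```

To avoid target leakage, the mapping should be fitted on training data only and then applied unchanged to validation and inference data.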

Exam trap

AWS often tests the misconception that one-hot encoding is always safe for categorical features, but the trap here is that high-cardinality categories require a dimensionality-reduction technique like target encoding, not a naive expansion that breaks scalability.

How to eliminate wrong answers

Option A is wrong because hash encoding can cause collisions (different categories mapping to the same hash value), which degrades model performance, and using Apache Spark on Amazon EMR adds unnecessary complexity and cost for a task that SageMaker Data Wrangler handles natively. Option B is wrong because one-hot encoding with over 1000 unique values creates over 1000 sparse binary columns, leading to the curse of dimensionality, memory issues, and poor performance in random forests. Option C is wrong because label encoding assigns arbitrary integer values (e.g., 1, 2, 3) that imply ordinal relationships, which random forests can misinterpret as meaningful order, introducing bias and reducing model accuracy.


260

MCQmedium

A company is building a machine learning model on customer transaction data stored in Amazon S3. The data includes columns with missing values in the 'age' field. The data scientist wants to impute missing values with the median age across all customers. Which approach is MOST efficient for preparing the data at scale?

A.Use AWS Glue Transform with the FillMissingValues transform specifying the median strategy

B.Use a custom Python script with pandas to compute median and fill missing values, then upload to S3

C.Use a custom PySpark script in AWS Glue to compute median and fill missing values

D.Use Amazon Athena SQL query to compute median and update the table

AnswerC

PySpark provides the scalability of Spark with the ability to compute median (e.g., using approxQuantile) and fill missing values, making it efficient for large datasets.

Why this answer

Option C is correct because AWS Glue with PySpark provides a distributed, scalable environment that can efficiently compute the median and fill missing values across large datasets stored in S3. PySpark's DataFrame API handles the median computation natively, and the Glue job runs on a managed Spark cluster, making it the most efficient approach for data preparation at scale without moving data out of the AWS ecosystem.
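
The imputation logic itself is simple. The single-machine sketch below shows it with the standard library only, on an invented `ages` list. In a Glue PySpark job the same two steps would typically use `DataFrame.approxQuantile` to estimate the median and `DataFrame.fillna` to apply it, as noted in the comments.

```python
from statistics import median


def impute_median(values: list) -> list:
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill_value = median(observed)
    return [fill_value if v is None else v for v in values]


# Invented sample; None marks a missing age.
ages = [25, None, 40, 31, None, 58]
print(impute_median(ages))

# Rough PySpark equivalent inside a Glue job (not executed here):
#   median_age = df.approxQuantile("age", [0.5], 0.01)[0]
#   df = df.fillna({"age": median_age})
```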

Exam trap

The trap here is that candidates often assume AWS Glue's FillMissingValues transform offers a median strategy. It does not: it imputes values with a machine learning model rather than a user-selected statistic, so choosing Option A without verifying the transform's behavior leads to the wrong answer.

How to eliminate wrong answers

Option A is wrong because AWS Glue's FillMissingValues transform does not support a 'median' strategy; it predicts missing values with machine learning methods instead of filling them with a chosen statistic. Option B is wrong because a custom Python script with pandas runs on a single machine, which cannot scale to handle large datasets efficiently and requires manual upload to S3, introducing unnecessary latency and complexity. Option D is wrong because Amazon Athena has no exact median function; it can approximate one with approx_percentile, but Athena is primarily an interactive query service and is not designed for in-place updates or large-scale data transformation.


261

Multi-Selecteasy

Which TWO actions are recommended best practices for securing an Amazon SageMaker notebook instance? (Select TWO.)

Select 2 answers

A.Use network ACLs to restrict API calls to the SageMaker API.

B.Enable Multi-AZ deployment for the notebook instance.

C.Use AWS KMS to encrypt the notebook instance's storage volume.

D.Associate the notebook instance with a public subnet that has an internet gateway.

E.Disable direct internet access for the notebook instance.

AnswersC, E

KMS encryption protects data at rest.

Why this answer

Option C is correct because encrypting the notebook instance's storage volume with AWS KMS ensures data-at-rest protection, which is a fundamental security best practice. SageMaker notebook instances use Amazon EBS volumes for storage, and KMS encryption safeguards sensitive code, datasets, and model artifacts stored on that volume against unauthorized access.

Exam trap

The trap here is that candidates often confuse network-level controls (network ACLs) with API-level controls (IAM/VPC endpoints), or they mistakenly think Multi-AZ applies to all AWS services, when in fact it is specific to database and high-availability services.


262

MCQhard

A financial services company is developing a real-time fraud detection model using XGBoost on SageMaker. They have millions of transactions daily and train a model weekly on 6 months of historical data. The training dataset is 500 GB in CSV format stored in S3. The training job uses an ml.p3.16xlarge instance with 8 GPUs, but training takes over 12 hours, which is too long for the weekly cadence. The data scientist notices that GPU utilization averages only 15% during training. The training script uses the SageMaker XGBoost container with default hyperparameters. Which combination of actions would MOST likely reduce training time? (Choose the best answer.)

A.Increase the instance type to ml.p3dn.24xlarge and use EFA networking.

B.Tune hyperparameters using SageMaker Automatic Model Tuning to reduce training epochs.

C.Use SageMaker Debugger to profile the training and adjust the batch size to maximize GPU memory usage.

D.Convert the training data to Parquet format, use Pipe input mode in the training job, and increase the instance count to run distributed training.

AnswerD

Parquet reduces data size and improves I/O; Pipe mode streams data efficiently; distributed training scales out to reduce time.

Why this answer

The low GPU utilization suggests an I/O bottleneck (data loading) or an inefficient data format. Converting the CSV data to Parquet reduces data size and speeds up I/O, and Pipe mode streams the data from S3 instead of downloading it first.

Adding instances for distributed training helps further once the I/O problem is resolved, so Option D directly addresses the root cause. Option A (a larger instance) is unlikely to help while the existing GPUs are underutilized.

Option B focuses on hyperparameters, which are probably not the primary bottleneck. Option C profiles the job and adjusts batch size, but it leaves the slow CSV data loading in place.


263

MCQmedium

A retail company is preparing a dataset for a machine learning model to predict customer churn. The dataset includes customer_id, signup_date, last_purchase_date, total_purchases, average_order_value, and churn_label. The data scientist notices that the 'total_purchases' column has missing values for 15% of the records. The company wants to use AWS Glue for data preparation. Which approach should the data scientist take to handle the missing values while minimizing bias and preserving data integrity?

A.Use AWS Glue DataBrew to fill missing values with the median of total_purchases.

B.Drop all records with missing total_purchases values.

C.Use AWS Glue DynamicFrame to perform model-based imputation, predicting missing total_purchases using other features like average_order_value and signup_date.

D.Replace missing total_purchases with the mean of the non-missing values.

AnswerC

Model-based imputation leverages correlated features to estimate missing values more accurately, reducing bias.

Why this answer

Option C is correct because model-based imputation uses relationships between features (e.g., average_order_value and signup_date) to predict missing total_purchases values, minimizing bias compared to simple mean/median imputation. AWS Glue DynamicFrames support custom transformation logic, allowing you to implement a predictive model (e.g., using Spark MLlib) directly within the Glue ETL job. This approach preserves data integrity by leveraging existing data patterns rather than discarding records or introducing arbitrary constants.

Exam trap

The trap here is that candidates often choose simple imputation (mean/median) or deletion without considering the bias introduced when missing data is not MCAR, and they overlook that AWS Glue DynamicFrames can support custom model-based imputation within the ETL pipeline.

How to eliminate wrong answers

Option A is wrong because filling with the median is a univariate imputation method that ignores correlations with other features, potentially introducing bias when missingness is not completely at random (MCAR). Option B is wrong because dropping 15% of records reduces sample size and can introduce selection bias, especially if missingness is related to churn behavior. Option D is wrong because replacing with the mean is sensitive to outliers and also ignores feature relationships, leading to distorted distributions and biased model predictions.


264

Multi-Selectmedium

A data team is preparing data for a machine learning pipeline. Which TWO practices are best for ensuring data quality and reproducibility? (Choose two.)

Select 2 answers

A.Use a fixed random seed when sampling data to ensure repeatability.

B.Shuffle the dataset before splitting into train and test sets.

C.Implement automated data validation checks to catch anomalies in new data.

D.Manually inspect and clean data to remove outliers.

E.Save cleaned and transformed datasets to S3 with versioning enabled.

AnswersC, E

Automated validation ensures data quality by catching issues early.

Why this answer

Option C is correct because automated data validation checks (e.g., using AWS Glue DataBrew or Deequ on Amazon EMR) proactively catch schema drift, missing values, and distribution anomalies in new data, ensuring that only high-quality data enters the ML pipeline. This practice is essential for maintaining data quality at scale without manual intervention.

Exam trap

AWS often tests the distinction between practices that improve data quality (automated validation, versioning) versus practices that improve model training stability (fixed seed, shuffling), leading candidates to mistakenly select options that only address repeatability of random processes.


265

Multi-Selectmedium

A machine learning team is preparing a dataset for a regression model. The dataset contains numerical features that are on different scales (e.g., age 0-100, income 0-1,000,000). The team plans to use Amazon SageMaker to train a linear regression model. Which THREE data preparation steps should the team take to ensure the model performs well? (Select THREE.)

Select 3 answers

A.Apply feature selection to reduce the number of features.

B.Remove outliers from the dataset.

C.Handle missing values by imputation or removal.

D.Encode categorical features using one-hot encoding.

E.Scale numerical features using standardization (z-score) or normalization (min-max scaling).

AnswersC, D, E

Missing values can cause errors or biased models; handling them is necessary.

Why this answer

Option C is correct because missing values can cause errors or biased estimates in linear regression models. Amazon SageMaker's built-in linear regression algorithm does not handle missing data automatically, so imputation (e.g., mean/median) or removal is necessary to ensure the training process completes and produces reliable coefficients.

Exam trap

AWS often tests the misconception that feature selection or outlier removal are mandatory preprocessing steps for linear regression, when in fact scaling and handling missing values are the core requirements for model convergence and performance.
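
The two scaling methods named in option E can be sketched in a few lines. The sample values are invented to echo the age range in the question, and the helpers assume the feature is not constant (otherwise the divisor would be zero).

```python
from statistics import mean, pstdev


def standardize(values: list[float]) -> list[float]:
    """Z-score scaling: subtract the mean, divide by the standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def min_max(values: list[float]) -> list[float]:
    """Min-max scaling: map the observed range onto [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


ages = [0, 50, 100]  # invented sample on the 0-100 scale
print(min_max(ages))
print(standardize(ages))
```

After either transform, age and income occupy comparable ranges, so neither feature dominates the gradient updates of a linear model merely because of its units.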


266

MCQeasy

A company uses Amazon Rekognition to moderate user-generated images. They want to set up a monitoring system that alerts the team if the number of inappropriate images flagged by the model exceeds a threshold. Which combination of AWS services should they use?

A.Amazon CloudWatch Logs to store inference logs and create a metric filter.

B.Amazon CloudWatch to publish custom metrics and create an alarm, and AWS Lambda to process images and publish metrics.

C.AWS Config to track resource changes and trigger an SNS notification.

D.Amazon Simple Notification Service (SNS) to send alerts when threshold is exceeded.

AnswerB

Lambda can publish custom metrics to CloudWatch, which can trigger alarms.

Why this answer

Option B is correct because AWS Lambda can process the images and publish a custom metric (the number of flagged images) to Amazon CloudWatch, where an alarm fires when the threshold is exceeded. Option A is incomplete because a CloudWatch Logs metric filter on its own alerts no one without an alarm.

Option C is incorrect because AWS Config is for configuration tracking. Option D is incorrect because SNS alone does not monitor metrics.
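
The metric-publishing half of this design could look like the sketch below, which builds the arguments a Lambda function might pass to the CloudWatch `put_metric_data` API. The namespace `ContentModeration` and metric name `FlaggedImages` are hypothetical names chosen for illustration.

```python
def flagged_image_metric(flagged_count: int) -> dict:
    """Build keyword arguments for CloudWatch put_metric_data."""
    return {
        "Namespace": "ContentModeration",       # hypothetical custom namespace
        "MetricData": [
            {
                "MetricName": "FlaggedImages",  # hypothetical metric name
                "Value": float(flagged_count),
                "Unit": "Count",
            }
        ],
    }


payload = flagged_image_metric(3)
# Inside the Lambda handler, with boto3 available, one would then call:
#   boto3.client("cloudwatch").put_metric_data(**payload)
print(payload["MetricData"][0]["Value"])
```

A CloudWatch alarm on the sum of this metric over a chosen period then provides the threshold alert.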


267

MCQeasy

A data science team uses SageMaker notebooks to develop models. They want to automate the process of training and registering models whenever new data arrives in an S3 bucket. The team has limited DevOps experience and needs a solution that requires minimal maintenance. Which approach should the team use?

A.Configure an S3 event notification to trigger an AWS Step Functions state machine that runs a SageMaker Pipeline.

B.Use AWS Glue to detect new data and trigger a SageMaker training job via a Lambda function.

C.Write a Python script that runs on a scheduled EC2 instance to check S3 for new data and trigger training.

D.Use Amazon EventBridge to schedule a SageMaker training job every hour, regardless of whether new data exists.

AnswerA

Step Functions orchestrates training and model registration serverlessly, triggered by new data.

Why this answer

Option A is correct because an S3 event notification (for example, delivered through Amazon EventBridge) can start an AWS Step Functions state machine, which orchestrates a SageMaker Pipeline to automate model training and registration when new data arrives. This serverless approach requires minimal maintenance and aligns with the team's limited DevOps experience, as Step Functions handles retries, error handling, and workflow coordination without custom infrastructure.

Exam trap

The trap here is that candidates often choose a scheduled approach (Option D) or a Lambda-based trigger (Option B) because they seem simpler, but the exam tests the ability to select the fully managed, event-driven orchestration (Step Functions + SageMaker Pipeline) that minimizes operational burden while ensuring conditional execution based on new data.

How to eliminate wrong answers

Option B is wrong because AWS Glue is primarily an ETL service, not designed to detect new S3 objects; using it for this purpose adds unnecessary complexity and cost, and the Lambda trigger for training jobs would still require custom orchestration. Option C is wrong because running a Python script on a scheduled EC2 instance introduces manual maintenance overhead (patching, scaling, monitoring) and violates the 'minimal maintenance' requirement. Option D is wrong because scheduling a training job every hour with EventBridge ignores the condition of new data, leading to wasteful training runs and potential model versioning issues when no new data exists.
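To make the event-driven trigger more concrete, here is a minimal sketch of an EventBridge event pattern that matches new objects in a bucket. It assumes the bucket has EventBridge notifications enabled; the bucket name `example-training-bucket` and the prefix `incoming/` are hypothetical.

```python
import json


def s3_object_created_pattern(bucket, prefix=""):
    """Return an EventBridge event pattern matching new objects in a bucket."""
    pattern = {
        "source": ["aws.s3"],
        "detail-type": ["Object Created"],
        "detail": {"bucket": {"name": [bucket]}},
    }
    if prefix:
        # Restrict to keys under a prefix so unrelated uploads do not retrain.
        pattern["detail"]["object"] = {"key": [{"prefix": prefix}]}
    return pattern


pattern = s3_object_created_pattern("example-training-bucket", "incoming/")
print(json.dumps(pattern, indent=2))
```

A rule with this pattern would list the state machine as its target, so training runs only when matching data actually arrives.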

Full explanation →

268

MCQhard

A financial services company uses a custom container on Amazon SageMaker to serve a fraud detection model. The model's inference latency has recently increased, causing timeouts for some requests. The team reviews the SageMaker logs and finds that the container is consuming more memory than allocated. What should the team do to maintain service quality while ensuring cost-effectiveness?

A.Decrease the model's batch size to reduce memory usage

B.Increase the number of instances in the endpoint to distribute the load

C.Implement an auto-scaling policy based on memory utilization

D.Change the instance type to a memory-optimized instance, such as r5.large

AnswerD

Switching to a memory-optimized instance provides more memory per instance, resolving the issue cost-effectively.

Why this answer

The correct answer is D because the root cause is that the container is consuming more memory than allocated, leading to increased latency and timeouts. Switching to a memory-optimized instance like r5.large directly addresses the memory constraint by providing more memory per vCPU, which resolves the performance issue without over-provisioning compute resources. This approach is cost-effective because it targets the specific bottleneck (memory) rather than scaling out or changing unrelated parameters.

Exam trap

The trap here is that candidates often confuse scaling out (adding instances) with scaling up (choosing a larger instance type), and they may incorrectly assume that auto-scaling based on memory utilization will prevent timeouts, when in fact it only reacts after the problem occurs.

How to eliminate wrong answers

Option A is wrong because decreasing the batch size reduces throughput and may lower memory usage per request, but it does not fix the underlying memory allocation issue; it could also increase latency due to more frequent inference calls. Option B is wrong because increasing the number of instances distributes the load but does not solve the per-instance memory shortage; each container would still run out of memory, leading to continued timeouts and higher costs from additional instances. Option C is wrong because implementing auto-scaling based on memory utilization would only add more instances after the memory is already exhausted, causing intermittent failures and unpredictable costs; it does not prevent the memory exhaustion in the first place.

Full explanation →

269

Multi-Selectmedium

A company uses SageMaker Pipelines for model training and wants to incorporate model evaluation before deployment into production. Which THREE components are essential? (Choose three.)

Select 3 answers

A.A model registry approval step

B.A batch transform step for evaluation

C.A condition step in the pipeline

D.A human review step

E.A SageMaker Processing step for evaluation

AnswersA, C, E

Approval step creates a model version with approval status to gate deployment.

Why this answer

A model registry approval step is essential because it gates the deployment of a model based on its evaluation results. In SageMaker Pipelines, you register the model to the Model Registry after training, and the approval status (e.g., Approved or Rejected) determines whether downstream deployment steps execute. This ensures only models meeting quality thresholds are promoted to production.

Exam trap

The trap here is that candidates confuse batch transform (used for inference) with model evaluation (which requires a Processing step to compute metrics), and they overlook that a condition step is the core decision-making component, not a human review step.

Full explanation →

270

MCQeasy

A team wants to automatically retrain a model when new labeled data arrives. Which SageMaker feature can orchestrate this workflow?

A.SageMaker Pipelines

B.SageMaker Model Monitor

C.SageMaker Debugger

D.SageMaker Autopilot

AnswerA

Pipelines can orchestrate a retraining workflow when triggered.

Why this answer

SageMaker Pipelines is a workflow orchestration service that can automate retraining pipelines. Model Monitor detects drift, Debugger debugs training, and Autopilot automates model building.

Full explanation →

271

MCQhard

Refer to the exhibit. A SageMaker training job using this IAM role fails with an access denied error when trying to read a file from s3://my-bucket/training-data/model_input.csv. However, a different file at s3://my-bucket/training-data/input/data.csv can be read successfully. What is the most likely reason?

A.The file model_input.csv is encrypted with a KMS key that the role does not have access to.

B.The IAM policy restricts access to objects only under the 'training-data/' prefix using an incorrect condition key.

C.The file name contains special characters that are not encoded correctly.

D.The S3 bucket has a bucket policy that denies access to the specific file.

AnswerB

The 's3:prefix' condition key is meant for list operations; it does not restrict GetObject by object key, which should instead be scoped through the statement's Resource ARN. This misconfiguration causes the GetObject request to not match the condition, resulting in access denied for some objects.

Why this answer

Option B is correct because 's3:prefix' is a condition key for list operations such as ListObjects: it is evaluated against the prefix parameter supplied in the request, not against the key of the object being read. The object key 'training-data/model_input.csv' does start with 'training-data/', but that is not what the condition tests.

A GetObject request carries no prefix parameter, so a policy that relies on 's3:prefix' to restrict object reads is misconfigured and does not behave as its author intended.

For GetObject, the restriction should be expressed against the object key itself, typically by scoping the statement's Resource ARN (for example, arn:aws:s3:::my-bucket/training-data/*) rather than by a condition on 's3:prefix'. Option A (KMS encryption) is plausible but less likely, since the other file in the same bucket is read successfully and nothing indicates the two files are encrypted differently.

Option C is wrong because the file name model_input.csv contains no special characters that need encoding. Option D is wrong because nothing in the scenario points to a bucket policy that denies this specific object.
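The difference between scoping by Resource ARN and by 's3:prefix' can be illustrated with a small, simplified sketch. The policy statement below is hypothetical (the exhibit is not reproduced here), and the matcher only imitates IAM wildcard matching with `fnmatchcase`; it is not how AWS evaluates policies internally.

```python
from fnmatch import fnmatchcase

# Hypothetical statement: object access is scoped through the Resource ARN,
# with no s3:prefix condition involved.
GET_OBJECT_STATEMENT = {
    "Effect": "Allow",
    "Action": "s3:GetObject",
    "Resource": "arn:aws:s3:::my-bucket/training-data/*",
}


def resource_matches(statement, bucket, key):
    """Simplified check: does the object's ARN match the statement's Resource?"""
    arn = f"arn:aws:s3:::{bucket}/{key}"
    return fnmatchcase(arn, statement["Resource"])


for key in ("training-data/model_input.csv", "training-data/input/data.csv"):
    print(key, resource_matches(GET_OBJECT_STATEMENT, "my-bucket", key))
```

Under this form of scoping, both objects from the question fall under the same statement, which is the behavior the policy author presumably wanted.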

Full explanation →

272

MCQeasy

A company has a SageMaker endpoint that uses a trained model to classify images. The endpoint is experiencing high latency and the team suspects it is due to the model size. Which action can the team take to reduce latency without significantly impacting accuracy?

A.Switch to a compute-optimized instance type

B.Use SageMaker Neo to compile the model for the target instance

C.Reduce the batch size of inference requests

D.Convert the model to ONNX format

AnswerB

Neo optimizes model inference for specific hardware, reducing latency.

Why this answer

SageMaker Neo compiles trained models into an optimized binary for the target hardware, applying techniques like operator fusion, memory layout optimization, and quantization. This reduces model size and inference latency while preserving accuracy, making it the correct choice for addressing high latency caused by model size.

Exam trap

AWS often tests the misconception that converting to an open format like ONNX inherently optimizes performance, when in reality it is just a serialization format and requires a separate compilation step (e.g., Neo) to reduce latency.

How to eliminate wrong answers

Option A is wrong because switching to a compute-optimized instance (e.g., c5) may improve CPU-bound processing but does not reduce model size or memory footprint; the latency issue stems from the model itself, not insufficient compute. Option C is wrong because reducing batch size can lower throughput and increase per-request overhead, potentially worsening latency; it does not address the root cause of model size. Option D is wrong because converting to ONNX format alone does not guarantee latency reduction; ONNX is an interchange format that requires a compatible runtime (e.g., ONNX Runtime) and may still need optimization like Neo to achieve performance gains.

Full explanation →

273

MCQmedium

A company uses Amazon SageMaker Pipelines to automate its ML workflow. The pipeline includes a training step and a model evaluation step. If the evaluation step fails, the pipeline should stop and notify the team. How should the company configure the pipeline?

A.Define a ConditionStep that checks the evaluation metric and fail the pipeline if the metric is below a threshold.

B.Use Amazon SageMaker Model Monitor to detect failures in the evaluation step.

C.Create an AWS Step Function state machine that monitors the pipeline and stops it on failure.

D.Configure an Amazon CloudWatch alarm on the evaluation step's execution time to stop the pipeline.

AnswerA

A ConditionStep can be used to evaluate metrics and fail the pipeline if conditions are not met.

Why this answer

Option A is correct because SageMaker Pipelines natively supports a ConditionStep that can evaluate a metric (e.g., model accuracy) and branch the pipeline execution. By configuring the ConditionStep to check if the evaluation metric falls below a threshold, you can explicitly fail the pipeline and trigger a notification (e.g., via SNS) when the condition is not met. This is the idiomatic, pipeline-native way to halt execution on evaluation failure without external dependencies.

Exam trap

The trap here is that candidates confuse SageMaker Pipelines' built-in conditional branching (ConditionStep) with external monitoring services like Model Monitor or Step Functions, assuming that pipeline failures must be handled outside the pipeline itself.

How to eliminate wrong answers

Option B is wrong because Amazon SageMaker Model Monitor is designed for detecting data drift and model quality degradation in production endpoints, not for halting a pipeline execution step. Option C is wrong because while AWS Step Functions can orchestrate SageMaker Pipelines, creating a separate state machine to monitor and stop the pipeline adds unnecessary complexity and latency; the pipeline itself should handle conditional failures internally. Option D is wrong because a CloudWatch alarm on execution time would only stop the pipeline based on a timeout, not on the actual evaluation metric result, and it cannot directly fail the pipeline step based on model performance.

Full explanation →

274

Multi-Selectmedium

A data scientist is using SageMaker Data Wrangler to prepare features for a classification model. Which TWO statements about feature engineering in Data Wrangler are correct?

Select 2 answers

A.Data Wrangler only supports CSV and Parquet input formats

B.Data Wrangler enables writing custom PySpark transformations

C.Transformations created in Data Wrangler can be exported as a SageMaker Processing script

D.Data Wrangler automatically scales features for XGBoost models

E.Data Wrangler can export features to SageMaker Feature Store

AnswersC, E

Data Wrangler can generate a processing script for reuse.

Why this answer

Option C is correct because SageMaker Data Wrangler allows you to export the entire data flow, including all transformations, as a SageMaker Processing script. This script can be run at scale on managed infrastructure, enabling you to operationalize the feature engineering pipeline for training or inference without manual rework.

Exam trap

The trap here is that candidates assume Data Wrangler supports custom PySpark transformations (Option B) because it integrates with Spark, but in reality, custom code must be written outside the visual interface, and only built-in transforms are available within Data Wrangler itself.

Full explanation →

275

MCQmedium

A data scientist is using SageMaker Data Wrangler to prepare a large dataset. The data contains duplicate rows, which could bias the model. Which built-in step in Data Wrangler can automatically detect and remove duplicates?

A.Amazon QuickSight duplicate detection

B.Handle Duplicates transform in Data Wrangler

C.AWS Glue Studio FindDuplicates transform

D.Amazon DataZone catalog

AnswerB

Data Wrangler provides a built-in transform to drop duplicate rows.

Why this answer

The Handle Duplicates transform is a built-in step in SageMaker Data Wrangler specifically designed to detect and remove duplicate rows from a dataset. It provides configurable options such as selecting a subset of columns for duplicate detection and choosing whether to keep the first or last occurrence, directly addressing the bias risk from duplicate rows in ML training data.

Exam trap

The trap here is that candidates confuse AWS Glue Studio transforms (like FindDuplicates) with SageMaker Data Wrangler's built-in steps, as both are AWS data preparation services but operate in different environments and have distinct feature sets.

How to eliminate wrong answers

Option A is wrong because Amazon QuickSight is a business intelligence (BI) service for visualization and dashboards, not a data preparation tool with built-in duplicate detection for ML pipelines. Option C is wrong because AWS Glue Studio FindDuplicates is a transform available in AWS Glue Studio (a separate ETL service), not within SageMaker Data Wrangler's interface or step library. Option D is wrong because Amazon DataZone is a data catalog and governance service for managing data assets across an organization, not a data preparation tool that detects or removes duplicates.

Full explanation →

276

MCQhard

A financial services company is developing a fraud detection model using Amazon SageMaker. They have a dataset with 10 million transactions, each with 300 features. The dataset is highly imbalanced (0.1% fraud). They have performed feature engineering and now need to split the data for training, validation, and test sets. The data is stored in CSV files in Amazon S3. They plan to use SageMaker's built-in XGBoost algorithm. To ensure proper evaluation and avoid data leakage, which data splitting strategy should they use?

A.Randomly shuffle the entire dataset and then split into 80% training, 10% validation, 10% test.

B.Use k-fold cross-validation on the entire dataset and average the results.

C.Perform a stratified split on the target variable to ensure each set has the same fraud ratio.

D.Apply SMOTE to balance the dataset first, then split randomly into training, validation, and test sets.

AnswerC

Stratified splitting preserves class proportions, enabling reliable evaluation.

Why this answer

Option C is correct because a stratified split preserves the original 0.1% fraud ratio across training, validation, and test sets, which is critical for imbalanced datasets. This ensures each subset is representative of the population, allowing SageMaker's XGBoost to be evaluated fairly without data leakage. Random splits (Option A) could accidentally create a validation or test set with zero fraud cases, making evaluation meaningless.

Exam trap

The trap here is that candidates often choose random splitting (Option A) out of habit, forgetting that imbalanced datasets require stratified sampling to avoid evaluation sets with zero positive cases, which would render metrics like precision and recall undefined.

How to eliminate wrong answers

Option A is wrong because random shuffling and splitting an imbalanced dataset (0.1% fraud) risks producing validation or test sets with no fraud examples, leading to misleading accuracy metrics and inability to detect model overfitting. Option B is wrong because k-fold cross-validation on the entire dataset would leak information from future folds into training when used for final model selection, and it does not provide a held-out test set for unbiased final evaluation. Option D is wrong because applying SMOTE before splitting introduces synthetic data that can leak information across the split boundaries, causing data leakage and overly optimistic performance estimates; SMOTE should only be applied to the training set after splitting.
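A stratified split can be sketched in plain Python as follows. This is an illustrative implementation using only the standard library, not SageMaker's own tooling; the function name `stratified_split` and the toy class ratio (1% positives rather than the 0.1% in the scenario) are assumptions made to keep the example small.

```python
import random
from collections import defaultdict


def stratified_split(labels, fractions=(0.8, 0.1, 0.1), seed=0):
    """Return (train, val, test) index lists preserving each class's ratio."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train, val, test = [], [], []
    for indices in by_class.values():
        # Shuffle within each class, then cut the class into three slices.
        rng.shuffle(indices)
        n_train = int(len(indices) * fractions[0])
        n_val = int(len(indices) * fractions[1])
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]
    return train, val, test


# Toy data: 100 positives among 10,000 rows.
labels = [1] * 100 + [0] * 9900
train, val, test = stratified_split(labels)
for name, part in (("train", train), ("val", val), ("test", test)):
    print(name, len(part), sum(labels[i] for i in part))
```

Because each class is split separately, every subset keeps the same positive rate, so none of them can end up with zero fraud cases by chance.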

Full explanation →

277

Multi-Selecthard

A team is using SageMaker Pipelines to automate a training workflow. They need to ensure that if a step fails, the pipeline can resume from the failed step without reprocessing prior steps. Which TWO configurations are necessary? (Choose TWO.)

Select 2 answers

A.Set the Pipeline's parallel flag to True

B.Set a retry policy on the step

C.Use a Lambda step for retry logic

D.Store intermediate artifacts in S3

E.Enable caching on each step

AnswersB, E

Retry policies automatically retry a step upon failure.

Why this answer

Options B and E are correct. Setting a retry policy (B) allows the pipeline to retry the failed step. Enabling caching on each step (E) allows outputs to be reused from previous runs.

Option A is wrong because parallelism does not affect resumption. Option C is wrong because Lambda steps are for custom processing, not resumption. Option D is wrong because storing artifacts is common but not sufficient for resumption without caching.

Full explanation →

278

Multi-Selecteasy

A team uses SageMaker Ground Truth to create labeled datasets. They need to ensure labeling jobs are cost-effective. Which TWO measures should they take? (Select TWO.)

Select 2 answers

A.Use a smaller instance type for the labeling job.

B.Use a smaller workforce type.

C.Set up a labeling workflow with 'Incremental training'.

D.Enable the 'Consolidated billing' for labeling costs.

E.Use the 'Automated data labeling' feature.

AnswersC, E

Incremental training leverages existing models to reduce labeling needs.

Why this answer

Automated data labeling reduces manual labeling cost by using model predictions to label data, and incremental training reduces the number of items that need manual labeling by starting from an existing model.

Full explanation →

279

MCQhard

A machine learning team is building a model to predict customer churn. They have historical data that includes customer activity logs, each with a timestamp. The team wants to ensure that the training data does not contain any data leakage from the future. Which approach should they take when preparing the training and validation datasets?

A.Use stratified sampling based on churn label

B.Randomly split the data 80/20 for training and validation

C.Use k-fold cross-validation with shuffling

D.Split the data by time, using data before a certain date for training and after for validation

AnswerD

Time-based split ensures no future data influences training.

Why this answer

Option D is correct because splitting by time (chronological split) prevents data leakage by ensuring that the validation set contains only future data relative to the training set. In time-series or timestamped data, random splits can allow the model to learn from future patterns, artificially inflating performance. This approach respects the temporal dependency inherent in customer churn prediction.

Exam trap

AWS often tests the concept of data leakage in time-series contexts, where candidates mistakenly choose random splits or cross-validation with shuffling, overlooking that temporal order must be preserved to avoid future data leaking into training.

How to eliminate wrong answers

Option A is wrong because stratified sampling based on churn label preserves class distribution but does not address temporal leakage; it can still mix future and past data. Option B is wrong because random splitting ignores the timestamp order, allowing future data to leak into the training set and causing the model to learn from events that haven't occurred yet. Option C is wrong because k-fold cross-validation with shuffling randomly reorders the data, which breaks the time sequence and introduces future information into training folds.
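A chronological split can be sketched in a few lines. The record layout (`customer_id`, `timestamp`, `churned`) and the cutoff date are hypothetical and serve only to illustrate the idea.

```python
from datetime import date


def time_based_split(records, cutoff):
    """Split records into (train, validation) around a cutoff date."""
    train = [r for r in records if r["timestamp"] < cutoff]
    validation = [r for r in records if r["timestamp"] >= cutoff]
    return train, validation


records = [
    {"customer_id": 1, "timestamp": date(2023, 1, 15), "churned": 0},
    {"customer_id": 2, "timestamp": date(2023, 3, 2), "churned": 1},
    {"customer_id": 3, "timestamp": date(2023, 7, 20), "churned": 0},
    {"customer_id": 4, "timestamp": date(2023, 9, 5), "churned": 1},
]
train, validation = time_based_split(records, cutoff=date(2023, 6, 1))
print(len(train), len(validation))
```

Every training record precedes every validation record, so the model is always evaluated on data from after the period it learned from.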

Full explanation →

280

MCQeasy

A data scientist wants to train an XGBoost model using the SageMaker Python SDK with a custom training script. Which estimator class should be used?

A.sagemaker.sklearn.SKLearnEstimator.

B.sagemaker.tensorflow.TensorFlowEstimator.

C.sagemaker.xgboost.estimator.XGBoost with a script mode entry point.

D.sagemaker.xgboost.XGBoostEstimator with the built-in algorithm mode.

AnswerC

Framework estimator allows custom scripts and leverages the XGBoost container.

Why this answer

Option C is correct because the SageMaker XGBoost framework estimator allows users to bring their own training script while using the optimized XGBoost container. Option A is wrong because the scikit-learn estimator does not natively support XGBoost training. Option B is wrong because the TensorFlow estimator is for TensorFlow models.

Option D is wrong because the built-in XGBoost algorithm mode does not support custom scripts; it expects a specific input format.

Full explanation →

281

MCQeasy

A team uses SageMaker Experiments to track multiple training runs. They need to register the best-performing model in the model registry for approval. Which method ensures the model artifacts and metadata are captured correctly?

A.Write an AWS Lambda function to copy the best model to a specific S3 prefix.

B.Manually download the best model artifact and upload to S3, then create a model in SageMaker.

C.Use the SageMaker Model Registry's create_model_package_from_estimator or equivalent API to register the model.

D.Use Experiment analytics to view results and then create a model package using the Run's artifact URI.

AnswerC

Model Registry captures artifacts, metrics, and supports approval workflow.

Why this answer

Option C is correct because SageMaker Model Registry provides a centralized catalog for model versions with associated metadata, metrics, and approval status. Option A is wrong because Lambda is not a direct mechanism for model registration. Option B is wrong because manually downloading and re-uploading the artifact is error-prone and does not capture metadata.

Option D is wrong because Experiments track runs but do not natively register models.

Full explanation →

282

MCQmedium

A company is using SageMaker to train a neural network for image classification. The training job is taking too long. The team wants to reduce training time without sacrificing model accuracy. Which approach should they recommend?

A.Increase the batch size to the maximum possible

B.Use a GPU-based instance such as ml.p3.2xlarge

C.Use a learning rate scheduler that reduces the learning rate over time

D.Add more convolutional layers to the model

AnswerB

GPUs accelerate matrix operations in neural networks, reducing training time.

Why this answer

Option B is correct because GPU-based instances like ml.p3.2xlarge are specifically designed for parallel processing of matrix operations, which are fundamental to neural network training. By offloading compute-intensive tensor operations to GPU cores, training time can be significantly reduced without altering the model architecture or data, thus preserving accuracy.

Exam trap

AWS often tests the misconception that any change to hyperparameters or architecture can reduce training time without side effects, but the trap here is that candidates confuse 'reducing training time' with 'improving convergence speed'—only hardware acceleration (GPU) directly reduces wall-clock time without risking accuracy degradation.

How to eliminate wrong answers

Option A is wrong because increasing batch size to the maximum possible can lead to degraded model accuracy due to reduced gradient noise, causing the model to converge to sharp minima or even fail to converge; it also risks out-of-memory errors. Option C is wrong because a learning rate scheduler that reduces the learning rate over time helps with convergence stability and final accuracy, but it does not directly reduce training time—it may even extend it if the learning rate becomes too small too early. Option D is wrong because adding more convolutional layers increases model complexity and the number of parameters, which typically increases training time and can lead to overfitting without guaranteeing improved accuracy.

Full explanation →

283

MCQhard

A model deployed on SageMaker is returning inaccurate predictions for certain customer segments. The team suspects data drift. Which SageMaker feature should they use to continuously monitor input data distribution?

A.SageMaker Clarify

B.SageMaker Debugger

C.SageMaker Model Monitor

D.SageMaker Feature Store

AnswerC

Model Monitor can track input data distributions and alert on drift.

Why this answer

SageMaker Model Monitor continuously monitors input data for drift, alerting when distributions change. Other features serve different purposes: Clarify for bias and explainability, Debugger for training, Feature Store for feature management.

Full explanation →

284

MCQeasy

A team wants to apply a custom container for inference on SageMaker. The container needs to implement a web server that responds to API requests. Which protocol and port must the container listen on to be compatible with SageMaker hosting?

A.The container must listen on port 8080 and use HTTPS protocol.

B.The container must listen on port 8080 and use HTTP protocol.

C.The container can listen on any port as long as the port is specified in the endpoint configuration.

D.The container must listen on port 8000 and use HTTP protocol.

AnswerB

SageMaker expects HTTP on port 8080 for /invocations and /ping.

Why this answer

SageMaker requires custom inference containers to listen on port 8080 and communicate over HTTP (not HTTPS). The SageMaker hosting service uses a proxy that terminates HTTPS and forwards plain HTTP requests to the container on port 8080. This ensures compatibility with the built-in model serving infrastructure.

Exam trap

The trap here is that candidates assume SageMaker requires HTTPS for security, but the service actually handles encryption externally, so the container must use plain HTTP on port 8080.

How to eliminate wrong answers

Option A is wrong because SageMaker's proxy handles TLS termination, so the container must use HTTP, not HTTPS; using HTTPS would cause a protocol mismatch and connection failure. Option C is wrong because SageMaker mandates port 8080 for custom containers; the endpoint configuration does not allow overriding this port. Option D is wrong because the required port is 8080, not 8000; port 8000 is not recognized by SageMaker's hosting proxy.
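The container contract described above (plain HTTP on port 8080, with /ping and /invocations) can be sketched with Python's standard library. This is a minimal illustration, not a production serving stack: the "model" is a placeholder that echoes the payload size, and `InferenceHandler` and `make_server` are hypothetical names.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


class InferenceHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Health check: /ping should answer with HTTP 200.
        self._reply(200 if self.path == "/ping" else 404, b"")

    def do_POST(self):
        # Inference requests arrive as POST /invocations.
        if self.path != "/invocations":
            self._reply(404, b"")
            return
        length = int(self.headers.get("Content-Length", 0))
        payload = self.rfile.read(length)
        # Placeholder "model": report the payload size instead of predicting.
        body = json.dumps({"received_bytes": len(payload)}).encode()
        self._reply(200, body, "application/json")

    def _reply(self, status, body, content_type="text/plain"):
        self.send_response(status)
        self.send_header("Content-Type", content_type)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # silence per-request logging in this sketch


def make_server(port=8080):
    """Bind plain HTTP on the given port (8080 by default, no TLS)."""
    return HTTPServer(("0.0.0.0", port), InferenceHandler)
```

Calling `make_server().serve_forever()` as the container's entry point would start the server; TLS is deliberately absent because encryption is handled outside the container.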

Full explanation →

285

MCQmedium

A data scientist is training a logistic regression model and wants to use L1 regularization to create a sparse model. Which parameter should be adjusted?

A.alpha

B.lambda

C.penalty

D.C (inverse of regularization strength)

AnswerC

Setting penalty='l1' enables L1 regularization, which induces sparsity.

Why this answer

In scikit-learn's LogisticRegression, the 'penalty' parameter can be set to 'l1' to use L1 regularization. 'C' is the inverse of regularization strength and controls how strong the penalty is, but it does not select the penalty type, so changing 'C' alone does not produce L1 regularization. 'alpha' and 'lambda' are regularization parameters in other estimators (for example, 'alpha' in scikit-learn's Lasso and Ridge, 'lambda' in XGBoost), not in LogisticRegression.

Full explanation →

286

Multi-Selectmedium

An ML team is running multiple SageMaker endpoints for various models. The monthly cost is higher than expected. Which TWO actions would help reduce costs without negatively impacting performance?

Select 2 answers

A.Consolidate multiple small models into a single Multi-Model Endpoint on a larger instance.

B.Increase the number of minimum instances to handle traffic spikes without scaling.

C.Right-size the instances by analyzing CloudWatch metrics and reducing instance size for underutilized endpoints.

D.Limit the maximum number of concurrent invocations per endpoint.

E.Use a scheduled scaling to turn off endpoints during non-business hours.

AnswersA, C

Multi-Model Endpoints reduce cost by sharing an instance among multiple models.

Why this answer

Option A is correct because SageMaker Multi-Model Endpoints allow you to host multiple small models on a single endpoint behind a common serving container, sharing the underlying instance resources. This reduces the number of endpoints and instances needed, lowering costs without degrading performance, as models are loaded and unloaded dynamically based on traffic.

Exam trap

The trap here is that candidates may confuse cost reduction with availability or scaling strategies, incorrectly assuming that reducing instance count or limiting concurrency is always beneficial, without considering the impact on performance or the specific capabilities of SageMaker Multi-Model Endpoints.

Full explanation →

287

MCQmedium

A machine learning engineer is using SageMaker Processing to run a scikit-learn preprocessing script. The script reads a CSV file from S3, applies a StandardScaler, and writes the output. The job fails with a 'MemoryError'. Which change should the engineer make to the data preparation process?

A.Use a SageMaker Spark container instead of scikit-learn

B.Increase the instance memory size for the processing job

C.Write the output as Parquet instead of CSV

D.Standardize the features before loading into the DataFrame

AnswerB

More memory allows larger datasets to be processed in memory.

Why this answer

The MemoryError indicates that the processing job's instance does not have enough RAM to hold the dataset and the intermediate results of the StandardScaler (which computes mean and variance in memory). Increasing the instance memory size (Option B) directly resolves this by providing more RAM for the scikit-learn operations. SageMaker Processing jobs allow you to choose instances with larger memory, such as the r5 or r6i families, to accommodate larger datasets.

Exam trap

The trap here is that candidates may confuse a memory error with a storage or format issue, leading them to choose Parquet (Option C) or Spark (Option A), when the actual fix is to allocate more RAM to the processing instance.

How to eliminate wrong answers

Option A is wrong because switching to a Spark container does not inherently fix a memory error; Spark also requires sufficient memory per executor and may introduce overhead without addressing the root cause of insufficient RAM. Option C is wrong because writing output as Parquet instead of CSV reduces disk I/O and storage size but does not reduce the memory footprint of the in-memory DataFrame or the StandardScaler computation. Option D is wrong because standardizing features before loading into the DataFrame is not a valid operation—standardization requires the entire dataset's statistics (mean and variance), which must be computed in memory after loading.

Full explanation →

288

MCQmedium

A data engineer is using Amazon SageMaker Data Wrangler to prepare a dataset. The dataset contains a column 'review_date' with timestamps. The engineer wants to extract the day of the week as a new feature. How should this transformation be performed in Data Wrangler?

A.Write a custom Python script using pandas dt.day_name()

B.Use one-hot encoding on the timestamp

C.Use the 'extract' transform with format '%A'

D.Use the 'day_of_week' transform on the 'review_date' column

AnswerD

Built-in transform extracts day of week (Monday=0, etc.).

Why this answer

Option D is correct because Amazon SageMaker Data Wrangler includes a built-in 'day_of_week' transform that directly extracts the day of the week (e.g., Monday, Tuesday) from a timestamp column without requiring custom code or additional formatting. This transform is optimized for Data Wrangler's visual interface and integrates seamlessly with its processing pipeline.

Exam trap

AWS often tests the distinction between built-in transforms and custom scripting, and the trap here is that candidates may assume they need to write a Python script (Option A) because they are familiar with pandas, overlooking Data Wrangler's native 'day_of_week' transform that is simpler and more appropriate for the visual workflow.

How to eliminate wrong answers

Option A is wrong because while a custom Python script using pandas dt.day_name() could technically extract the day of the week, Data Wrangler provides a native transform that avoids the overhead of writing and maintaining custom code, and the question asks how the transformation 'should be performed' in Data Wrangler, implying use of its built-in features. Option B is wrong because one-hot encoding is a technique for converting categorical variables into binary columns, not for extracting temporal features like the day of the week from a timestamp. Option C is wrong because the 'extract' transform in Data Wrangler is used to extract substrings or patterns from text columns using regular expressions, not to interpret timestamps; the format '%A' is a Python strftime directive, but Data Wrangler's 'extract' transform does not support strftime-style parsing for timestamps.
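
For reference, this is what a day-of-week feature looks like when computed by hand. The sketch uses only Python's standard library rather than Data Wrangler, and the sample timestamp is made up; it shows the Monday=0 integer convention alongside the day name produced by the '%A' directive.

```python
from datetime import datetime


def day_of_week_features(timestamp: str) -> dict:
    """Return the weekday index (Monday=0) and day name for an ISO 8601 timestamp."""
    dt = datetime.fromisoformat(timestamp)
    return {
        "day_index": dt.weekday(),      # Monday=0 ... Sunday=6
        "day_name": dt.strftime("%A"),  # full day name in the current locale
    }


# Hypothetical 'review_date' value; 4 March 2024 fell on a Monday.
features = day_of_week_features("2024-03-04T10:15:00")
print(features)
```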

289

MCQeasy

A company uses Amazon SageMaker to deploy a real-time inference endpoint. They notice increased latency in predictions during peak hours. Which should they investigate first to address the issue?

A.Review the endpoint auto-scaling policy

B.Check the data labeling job status

C.Modify the training instance type

D.Increase the model artifact size

AnswerA

Auto-scaling policy determines how instances are added/removed; insufficient capacity causes high latency.

Why this answer

Option A is correct because latency typically increases when the endpoint is under-provisioned, and the auto-scaling policy controls how capacity is added during peak hours. Option C concerns training, not inference. Option B is unrelated to inference latency.

Option D may affect latency but is not the first thing to investigate.
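
When reviewing the policy, it helps to know what a typical configuration looks like. The sketch below builds the keyword arguments for the Application Auto Scaling calls `register_scalable_target` and `put_scaling_policy` without contacting AWS. The endpoint name, capacity limits, target value, and cooldowns are placeholder assumptions, not recommended settings.

```python
def build_scaling_requests(endpoint_name, variant_name="AllTraffic",
                           min_instances=1, max_instances=4,
                           invocations_per_instance=100.0):
    """Build kwargs for register_scalable_target and put_scaling_policy."""
    common = {
        "ServiceNamespace": "sagemaker",
        "ResourceId": f"endpoint/{endpoint_name}/variant/{variant_name}",
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
    }
    target = {**common, "MinCapacity": min_instances, "MaxCapacity": max_instances}
    policy = {
        **common,
        "PolicyName": f"{endpoint_name}-invocations-target",
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            # Assumed target: average invocations per instance per minute.
            "TargetValue": invocations_per_instance,
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
            },
            "ScaleOutCooldown": 60,   # seconds to wait after adding capacity
            "ScaleInCooldown": 300,   # seconds to wait after removing capacity
        },
    }
    return target, policy


# "churn-endpoint" is a hypothetical endpoint name.
target, policy = build_scaling_requests("churn-endpoint")
```

With boto3 installed and credentials configured, these dictionaries could be passed as `client.register_scalable_target(**target)` and `client.put_scaling_policy(**policy)` on an `application-autoscaling` client. A maximum capacity that is too low or a target value that is too high are common reasons an endpoint fails to scale out under peak load.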

290

Multi-Selectmedium

A company is using an Amazon SageMaker pipeline for automated retraining. The pipeline fails intermittently due to transient errors in the training job. Which steps should the team take to ensure the pipeline completes successfully? (Choose THREE.)

Select 3 answers

A.Enable managed spot training for cost savings and use checkpointing to resume from interruptions.

B.Use a larger instance type for the training job to reduce the chance of failure.

C.Implement automatic model checkpointing by setting the CheckpointConfig in the pipeline step.

D.Configure the SageMaker pipeline step to retry on failure with a maximum number of attempts.

E.Add exponential backoff in any custom Python code that makes API calls to AWS services.

AnswersA, D, E

Spot instances can be interrupted; checkpointing lets training resume from the last saved state.

Why this answer

Options A, D, and E are correct. A: Use managed spot training with checkpointing so that interrupted jobs resume instead of restarting. D: Add a retry policy to the training step so that transient failures are retried up to a maximum number of attempts.

E: Implement exponential backoff in custom code for API calls. Option B is wrong because a larger instance type does not solve transient errors; it only adds cost. Option C is wrong because SageMaker does not checkpoint automatically across retries; CheckpointConfig only sets where checkpoints are stored, and the training code must save and restore them.
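
The exponential backoff mentioned for Option E can be sketched in a few lines of standard-library Python. The function name, default delays, and the use of full jitter are illustrative assumptions rather than anything prescribed by AWS; the `sleep` parameter exists only so the example can be exercised without real waiting.

```python
import random
import time


def call_with_backoff(fn, max_attempts=5, base_delay=1.0, max_delay=30.0,
                      retry_on=(Exception,), sleep=time.sleep):
    """Call fn(), retrying on failure with exponential backoff and full jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # The delay ceiling doubles each attempt, capped at max_delay.
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, ceiling))


# Usage: a stand-in for a flaky API call that fails twice, then succeeds.
calls = {"count": 0}


def flaky_call():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("transient failure")
    return "ok"


delays = []
result = call_with_backoff(flaky_call, sleep=delays.append)
print(result, len(delays))
```

In real code, `retry_on` would normally be narrowed to the throttling or connection errors that are actually safe to retry.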

291

Multi-Selecthard

A company uses SageMaker to train a model. They want to ensure that training data is encrypted at rest and in transit, and that only authorized users can access the training artifacts. Which three steps should they take? (Choose three.)

Select 3 answers

A.Configure IAM policies to restrict access to SageMaker resources

B.Use SageMaker Model Monitor

C.Use a VPC with private subnets and VPC endpoints

D.Enable S3 server-side encryption for training data

E.Use SageMaker Network Isolation

AnswersA, C, D

Controls who can create, modify, and access SageMaker resources.

Why this answer

Option A is correct because IAM policies allow you to define fine-grained permissions to control which users or roles can create, describe, or delete SageMaker resources (e.g., training jobs, endpoints). By restricting access via IAM, you ensure that only authorized principals can interact with training artifacts, such as model output in S3 or logs in CloudWatch. This directly addresses the requirement of limiting access to authorized users.

Exam trap

The trap here is that candidates often confuse network isolation (Option E) with encryption or access control, but network isolation only restricts network connectivity, not data encryption or authorization.
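
The other two correct options (C and D) translate into request settings roughly as follows. This is a sketch that only builds request dictionaries: one for the S3 `put_bucket_encryption` call and one fragment to merge into a SageMaker `create_training_job` request. The bucket name, KMS key ID, subnet IDs, and security group IDs are hypothetical placeholders.

```python
def bucket_encryption_request(bucket, kms_key_id):
    """Build put_bucket_encryption kwargs that default new objects to SSE-KMS."""
    return {
        "Bucket": bucket,
        "ServerSideEncryptionConfiguration": {
            "Rules": [
                {
                    "ApplyServerSideEncryptionByDefault": {
                        "SSEAlgorithm": "aws:kms",
                        "KMSMasterKeyID": kms_key_id,
                    }
                }
            ]
        },
    }


def private_training_network(subnet_ids, security_group_ids):
    """Build the network-related fields of a CreateTrainingJob request."""
    return {
        "VpcConfig": {
            "Subnets": list(subnet_ids),
            "SecurityGroupIds": list(security_group_ids),
        },
        # Encrypts traffic between containers in distributed training.
        "EnableInterContainerTrafficEncryption": True,
    }


# All identifiers below are made-up examples.
encryption = bucket_encryption_request("example-training-data", "example-kms-key-id")
network = private_training_network(["subnet-aaaa1111"], ["sg-bbbb2222"])
```

Access control itself (Option A) is handled separately through IAM policies attached to users and roles.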

292

MCQeasy

A retail company is building a machine learning model to predict customer churn. The data engineering team has extracted customer transaction data from Amazon Aurora and stored it as CSV files in Amazon S3. The data includes customer IDs, transaction amounts, timestamps, and product categories. A data scientist discovers that the dataset contains several missing values in the 'transaction_amount' column for about 15% of the records. The data scientist also notices that the 'customer_id' column has some duplicate entries. The team wants to prepare the data for training a churn model using Amazon SageMaker. The data is approximately 50 GB in size. What should the data scientist do to handle the missing values and duplicates efficiently while preparing the data for training?

A.Use a SageMaker notebook instance with Pandas to load the entire dataset into memory, fill missing values with the median, and drop duplicate customer IDs.

B.Use an AWS Glue ETL job to read the data from S3, apply transformations to fill missing values with the mean or median, and drop duplicate customer IDs, then write the cleaned data back to S3.

C.Drop all records with missing values in the transaction_amount column and remove duplicate customer IDs using an Athena SQL query, then store the result in S3.

D.Use an Amazon EMR cluster with Spark to read the CSV files, impute missing transaction amounts with the mean or median, and remove duplicate customers.

AnswerB

Glue is serverless, scales automatically, and is suitable for 50 GB. It can efficiently handle missing value imputation and deduplication.

Why this answer

Option B is correct because AWS Glue ETL jobs are serverless and designed to handle large-scale data transformations (like 50 GB) without requiring manual cluster management. Glue can read CSV files from S3, apply transformations to impute missing values with the mean or median, drop duplicate customer IDs, and write the cleaned data back to S3, all while scaling automatically to handle the data volume efficiently.

Exam trap

The trap here is that candidates often choose Option A (Pandas in a notebook) because it seems simple, but they overlook the memory limitations of a single-instance notebook when processing 50 GB of data, which is a classic 'scale vs. simplicity' trick in the MLA-C01 exam.

How to eliminate wrong answers

Option A is wrong because loading a 50 GB dataset into memory using Pandas in a SageMaker notebook instance is inefficient and likely to cause out-of-memory errors, as Pandas is single-threaded and not designed for distributed processing of large datasets. Option C is wrong because dropping all records with missing values (15% of data) would discard a significant portion of the dataset, potentially biasing the model, and Athena SQL queries do not natively support imputation of missing values with mean or median without complex workarounds. Option D is wrong because while Amazon EMR with Spark could handle the task, it requires provisioning and managing a cluster, which is more complex and less cost-effective than the serverless AWS Glue approach for this specific data preparation task.
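
The transformation logic itself is simple; the difficulty at 50 GB is scale. The toy sketch below shows median imputation and deduplication on a handful of in-memory records using only the standard library. It is not Glue code, and the sample rows are invented; a Glue job would apply the same two steps with a distributed engine.

```python
from statistics import median


def clean_transactions(rows):
    """Fill missing transaction_amount with the median, then keep the first row per customer_id."""
    known = [r["transaction_amount"] for r in rows if r["transaction_amount"] is not None]
    fill = median(known)  # computed over all known amounts, before deduplication
    seen, cleaned = set(), []
    for r in rows:
        if r["customer_id"] in seen:
            continue  # drop later duplicates of the same customer
        seen.add(r["customer_id"])
        amount = r["transaction_amount"]
        cleaned.append({**r, "transaction_amount": fill if amount is None else amount})
    return cleaned


# Invented sample data: one missing amount, one duplicate customer.
sample = [
    {"customer_id": "c1", "transaction_amount": 10.0},
    {"customer_id": "c2", "transaction_amount": None},
    {"customer_id": "c1", "transaction_amount": 30.0},
    {"customer_id": "c3", "transaction_amount": 50.0},
]
cleaned = clean_transactions(sample)
print(cleaned)
```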

293

MCQhard

A data scientist trained a logistic regression model on a dataset with 100 features. After training, the training accuracy is 0.99 but validation accuracy is 0.75. Which action is MOST likely to reduce overfitting?

A.Increase the number of features

B.Increase the regularization strength

C.Use a more complex model like XGBoost

D.Use stratified cross-validation

AnswerB

Stronger regularization (e.g., higher L2 penalty) shrinks coefficients and reduces overfitting.

Why this answer

The model shows high training accuracy (0.99) but significantly lower validation accuracy (0.75), which is a classic sign of overfitting. Increasing the regularization strength (e.g., L1 or L2 penalty) in logistic regression directly penalizes large coefficients, reducing the model's complexity and improving generalization. This is the most direct way to address overfitting in a logistic regression model.

Exam trap

AWS often tests the misconception that adding more features or using more complex models always improves performance, but here the correct answer is to increase regularization strength, which directly counters overfitting in a logistic regression model.

How to eliminate wrong answers

Option A is wrong because increasing the number of features would give the model more parameters to fit the training data even more closely, worsening overfitting rather than reducing it. Option C is wrong because using a more complex model like XGBoost would increase the model's capacity to memorize noise, which typically exacerbates overfitting unless accompanied by strong regularization or pruning. Option D is wrong because stratified cross-validation ensures class distribution balance across folds but does not directly reduce overfitting; it improves the reliability of validation metrics but does not change the model's tendency to overfit.
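
To make the effect of regularization strength concrete, the sketch below trains a tiny logistic regression by batch gradient descent with and without an L2 penalty and compares the size of the learned weights. The four-point dataset, learning rate, epoch count, and the `l2` parameter name are all assumptions chosen for illustration.

```python
import math


def train_logreg(X, y, l2=0.0, lr=0.1, epochs=500):
    """Fit logistic regression by batch gradient descent with an L2 penalty on the weights."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + b
            p = 1.0 / (1.0 + math.exp(-z))  # predicted probability of class 1
            err = p - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        # Gradient of mean log loss plus (l2 / 2) * ||w||^2; the bias is not penalized.
        w = [wj - lr * (gw / n + l2 * wj) for wj, gw in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def l2_norm(w):
    """Euclidean length of the weight vector."""
    return math.sqrt(sum(wj * wj for wj in w))


# Invented, linearly separable toy data.
X = [[0.0, 1.0], [1.0, 0.5], [2.0, 1.5], [3.0, 1.0]]
y = [0, 0, 1, 1]
weak, _ = train_logreg(X, y, l2=0.0)
strong, _ = train_logreg(X, y, l2=1.0)
print(f"no penalty: {l2_norm(weak):.3f}, strong penalty: {l2_norm(strong):.3f}")
```

The penalized model ends with a smaller weight norm, which is the shrinkage that limits how closely the model can fit noise in the training set.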

294

MCQeasy

A data scientist is preparing a large dataset for training a machine learning model. The dataset contains missing values in several columns. Which approach is the MOST efficient for handling missing values in a large dataset using AWS services?

A.Use AWS Glue ETL to write a custom Python script that imputes missing values with the mean.

B.Use Amazon SageMaker Data Wrangler to impute missing values using built-in transforms.

C.Use pandas in a SageMaker notebook to impute missing values with the median.

D.Remove all rows with missing values from the dataset.

AnswerB

Data Wrangler provides efficient, scalable, and visual data preparation without custom code.

Why this answer

Amazon SageMaker Data Wrangler provides a visual interface and built-in transforms for handling missing values efficiently at scale, without writing custom code. Glue ETL is more code-heavy, and imputation with pandas is not scalable for large datasets. Removing all rows with missing values is not always optimal and may not be efficient.

295

MCQmedium

A machine learning team deploys a custom container image for an Amazon SageMaker training job. The container needs to access an S3 bucket that contains sensitive data. The team wants to follow the principle of least privilege. How should the team grant access?

A.Create an IAM role with S3 access and assign it as the SageMaker execution role for the training job.

B.Attach an IAM instance profile to the training instance with permissions to the bucket.

C.Configure an S3 bucket policy that grants access to the training job's ARN.

D.Store AWS access keys in the container image and use them to access the bucket.

AnswerA

The execution role is the standard, secure way to grant a training job access to S3.

Why this answer

Option A is correct because assigning a SageMaker execution role to the training job is the best practice. Option D is wrong because hardcoding access keys in the image is insecure. Option B is wrong because instance profiles are for EC2, not for SageMaker training jobs directly; SageMaker uses execution roles.

Option C is wrong because an S3 bucket policy cannot name the training job ARN as its principal; access is granted through the execution role.
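
A least-privilege policy for the execution role might look like the sketch below, which grants read-only access to a single prefix in one bucket. The bucket name and prefix are hypothetical, and a real training job would also need permissions that are omitted here (for example, writing model output and logs).

```python
import json


def least_privilege_s3_policy(bucket, prefix):
    """Build an IAM policy document allowing read-only access to one S3 prefix."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ListTrainingPrefix",
                "Effect": "Allow",
                "Action": "s3:ListBucket",
                "Resource": f"arn:aws:s3:::{bucket}",
                # Restrict listing to keys under the training prefix.
                "Condition": {"StringLike": {"s3:prefix": [f"{prefix}/*"]}},
            },
            {
                "Sid": "ReadTrainingObjects",
                "Effect": "Allow",
                "Action": "s3:GetObject",
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}/*",
            },
        ],
    }


# Hypothetical bucket and prefix names.
policy_doc = least_privilege_s3_policy("example-sensitive-bucket", "training-data")
print(json.dumps(policy_doc, indent=2))
```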

296

MCQhard

A company is deploying a large model (10GB) for real-time inference. The inference latency is too high. What optimization technique can help?

A.Increase the endpoint's memory allocation

B.Switch to a batch transform job

C.Use SageMaker Neo to compile the model for the target instance

D.Reduce the model size by quantization

AnswerC

Neo optimizes the model for inference speed on specific hardware.

Why this answer

SageMaker Neo compiles the model to optimize it for the target instance hardware, reducing inference latency without sacrificing accuracy. This is especially effective for large models (e.g., 10GB) where runtime performance gains come from hardware-specific optimizations like instruction set tuning and memory access pattern improvements.

Exam trap

The trap here is that candidates often assume quantization (Option D) is the only way to reduce latency for large models, but they overlook SageMaker Neo's compilation, which optimizes without accuracy loss and is specifically designed for deployment scenarios.

How to eliminate wrong answers

Option A is wrong because increasing memory allocation may help with out-of-memory errors but does not directly reduce inference latency; latency is more dependent on compute efficiency and model size. Option B is wrong because batch transform jobs are designed for offline, asynchronous processing, not real-time inference, and switching to batch would increase latency due to queuing and processing delays. Option D is wrong because quantization reduces model size and can improve latency, but it may degrade accuracy and is not a SageMaker-specific optimization; SageMaker Neo provides a more targeted, hardware-aware compilation that preserves accuracy while reducing latency.

297

MCQmedium

A company is deploying a multi-model endpoint using SageMaker to serve multiple models from a single endpoint. They notice that one model consumes excessive memory and impacts others. What is the BEST practice to isolate resource usage?

A.Configure instance type with more memory.

B.Use separate endpoints for each model.

C.Use SageMaker Model Parallelism.

D.Use multi-model endpoint with model cache size limit.

AnswerB

Separate endpoints provide complete isolation of compute resources.

Why this answer

Option B is correct because using separate endpoints for each model ensures complete resource isolation at the instance level. When one model consumes excessive memory, it cannot impact others because each model runs on its own dedicated endpoint with its own compute resources. This is the best practice for isolating resource usage in production environments where memory-intensive models are deployed.

Exam trap

The trap here is that candidates often assume multi-model endpoints are designed for resource isolation, but in reality they share memory and compute, so the correct answer is to use separate endpoints for strict isolation.

How to eliminate wrong answers

Option A is wrong because simply configuring an instance type with more memory does not isolate resource usage; all models on the same multi-model endpoint still share the same memory pool, so a memory spike in one model can still starve others. Option C is wrong because SageMaker Model Parallelism is designed for splitting large models across multiple GPUs for training, not for isolating resource usage during inference on a multi-model endpoint. Option D is wrong because setting a model cache size limit only controls how many models are cached in memory, but does not prevent a single model from consuming excessive memory once loaded; the memory usage of an individual model is not capped by this setting.

298

Multi-Selecteasy

A machine learning team has deployed a model using Amazon SageMaker and wants to set up continuous monitoring for data drift. Which TWO actions are essential for ongoing data drift detection?

Select 2 answers

A.Set up Amazon CloudWatch alarms on the endpoint's invocation latency metric.

B.Enable data capture on the SageMaker endpoint to store inference data in Amazon S3.

C.Configure Amazon SageMaker Model Monitor to run hourly monitoring schedules.

D.Deploy a shadow endpoint to compare predictions from the current model and a challenger model.

E.Create a baseline from the training data to serve as a reference distribution.

AnswersB, C

Data capture is necessary to collect the inference data for monitoring.

Why this answer

Option B (enable data capture) is essential because data capture collects inference requests and responses, which are required for monitoring. Option C (configure Model Monitor to run hourly) is essential because Model Monitor analyzes the captured data against a baseline to detect drift. Option E is a prerequisite but not an ongoing action.

Options A and D are unrelated to data drift monitoring.
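
The two ongoing pieces can be sketched as configuration values: the `DataCaptureConfig` block that goes into a `create_endpoint_config` request, and the cron expression used for an hourly monitoring schedule. The S3 URI is a hypothetical placeholder, and capturing 100% of requests is an assumption; a lower sampling percentage may be preferable for busy endpoints.

```python
def data_capture_config(s3_uri, sampling_percentage=100):
    """Build the DataCaptureConfig block of a CreateEndpointConfig request."""
    return {
        "EnableCapture": True,
        "InitialSamplingPercentage": sampling_percentage,
        "DestinationS3Uri": s3_uri,
        # Capture both the request payload and the model response.
        "CaptureOptions": [{"CaptureMode": "Input"}, {"CaptureMode": "Output"}],
    }


# Schedule expression for a monitoring job that runs at the top of every hour.
HOURLY_SCHEDULE = "cron(0 * ? * * *)"

# Hypothetical destination for captured inference data.
capture = data_capture_config("s3://example-bucket/data-capture")
```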

299

MCQeasy

A company is deploying a real-time inference endpoint for a natural language processing model using Amazon SageMaker. The model is a fine-tuned BERT variant. The endpoint has been running for two weeks with acceptable latency (average 200 ms). However, over the past 24 hours, the latency has increased to an average of 800 ms, and the number of simultaneous requests has doubled. The team expects traffic to continue to grow. The current endpoint configuration uses a single ml.m5.large instance. The model is loaded into memory once, and the inference framework is PyTorch. The team needs to maintain latency under 500 ms. Which course of action should the team take to address the latency increase while minimizing cost?

A.Switch to ml.c5.large instances because CPU-optimized instances provide better inference performance for NLP models.

B.Increase the instance size to ml.m5.xlarge and keep a single instance.

C.Enable automatic scaling for the endpoint with a target average latency of 500 ms and use multiple ml.m5.large instances.

D.Implement a multi-model endpoint with multiple ml.m5.large instances and use Amazon Elastic Inference (EI) accelerators.

AnswerC

Auto scaling adds instances based on latency, distributing load to keep latency under 500 ms, and minimizes cost by scaling only when needed.

Why this answer

With increased traffic, a single instance is overloaded. Auto scaling with a latency target dynamically adds instances to handle load, maintaining latency. Option B scales up but doesn't add redundancy; Option A switches instance family but doesn't address scaling; Option D suggests a multi-model endpoint, which is for hosting multiple models, not scaling a single model, and EI may not be cost-effective.

Therefore C is correct.
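
A latency-based target is not one of the predefined SageMaker scaling metrics, so one way to express it is a target tracking policy with a customized metric on the endpoint's `ModelLatency` CloudWatch metric. The sketch below only builds the `put_scaling_policy` arguments; the endpoint name and policy name are hypothetical, and using average `ModelLatency` as the tracked metric is an assumption about how the 500 ms goal would be implemented.

```python
def latency_target_policy(endpoint_name, variant_name="AllTraffic", target_ms=500):
    """Build put_scaling_policy kwargs that track average ModelLatency."""
    return {
        "PolicyName": f"{endpoint_name}-latency-target",
        "ServiceNamespace": "sagemaker",
        "ResourceId": f"endpoint/{endpoint_name}/variant/{variant_name}",
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            # ModelLatency is reported in microseconds, so convert from milliseconds.
            "TargetValue": float(target_ms * 1000),
            "CustomizedMetricSpecification": {
                "MetricName": "ModelLatency",
                "Namespace": "AWS/SageMaker",
                "Dimensions": [
                    {"Name": "EndpointName", "Value": endpoint_name},
                    {"Name": "VariantName", "Value": variant_name},
                ],
                "Statistic": "Average",
                "Unit": "Microseconds",
            },
        },
    }


# "bert-nlp-endpoint" is a made-up endpoint name.
latency_policy = latency_target_policy("bert-nlp-endpoint")
```

The endpoint variant must also be registered as a scalable target with minimum and maximum instance counts before the policy takes effect.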

300

MCQeasy

A data scientist wants to automate retraining of a model weekly and deploy the new model automatically after passing validation. Which AWS service combination is best?

A.SageMaker Pipelines + AWS Step Functions

B.Amazon EventBridge + SageMaker training job

C.Amazon SageMaker Autopilot

D.AWS Lambda + SageMaker training job

AnswerA

SageMaker Pipelines manages training and validation, Step Functions can orchestrate deployment on approval.

Why this answer

SageMaker Pipelines orchestrates the ML workflow including training and validation, and Step Functions can trigger deployment. SageMaker alone lacks native scheduling, and Lambda cannot orchestrate complex workflows.
