Knowledge + Practice

AWS Certified Machine Learning Engineer Associate (MLA-C01): Questions 301–375

507 questions total · 7 pages · All types, answers revealed

301

MCQ · medium

During model training on Amazon SageMaker, the training job fails with a 'ResourceLimitExceeded' error. What is the most likely cause?

A. The algorithm's learning rate is too high

B. The dataset is too large for the instance

C. The training script has a syntax error

D. The account's instance limit for the chosen instance type has been reached

Answer: D

ResourceLimitExceeded indicates the account has exceeded the allowed number of instances for that instance type.

Why this answer

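The ResourceLimitExceeded error indicates that the AWS account has reached its service quota for the requested resource, here the number of training instances of the chosen type. A syntax error in the training script would instead surface as an algorithm error in the job's failure reason and logs. A dataset that is too large for the instance might cause an out-of-memory or out-of-disk failure, but not this specific error. The learning rate affects model convergence and has no bearing on resource quotas.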

302

MCQ · easy

A team wants to track and compare multiple machine learning experiments, including hyperparameters, metrics, and artifacts. They are using Amazon SageMaker. Which AWS service or feature should they use to achieve this?

A. AWS CloudTrail

B. Amazon SageMaker Experiments

C. Amazon SageMaker Model Registry

D. Amazon SageMaker Studio

Answer: B

Experiments is the correct service for tracking.

Why this answer

Amazon SageMaker Experiments is the correct service because it is specifically designed to track and compare machine learning experiments, including hyperparameters, metrics, and artifacts. It provides a structured way to log, organize, and analyze multiple runs, enabling teams to identify the best-performing model configurations.

Exam trap

The trap here is that candidates confuse SageMaker Studio (the IDE) with SageMaker Experiments (the tracking service), assuming Studio alone provides experiment tracking, but Studio is merely the interface that can visualize experiment data stored by Experiments.

How to eliminate wrong answers

Option A is wrong because AWS CloudTrail records API activity for auditing and governance, not for tracking ML experiment metadata like hyperparameters or metrics. Option C is wrong because Amazon SageMaker Model Registry is used for cataloging and managing approved model versions, not for tracking the iterative experiments that produce them. Option D is wrong because Amazon SageMaker Studio is an integrated development environment (IDE) for ML workflows; while it can display experiment data, it is not the service that tracks experiments itself.

303

MCQ · medium

A machine learning team is deploying a fraud detection model using SageMaker. They use the SageMaker Model Registry to track model versions. They want to automatically deploy the latest approved model to a production endpoint whenever a new model version is approved. The team uses a CI/CD pipeline with AWS CodePipeline. The pipeline currently includes a source stage (S3), a build stage (CodeBuild), and a deploy stage (manual approval). They want to automate the deployment of approved models. Which solution will meet these requirements with the least operational overhead?

A. Add a custom action to CodePipeline that uses a SageMaker deployment step.

B. Create a Lambda function that triggers on Model Registry approval events and updates the endpoint using the boto3 SDK.

C. Configure an EventBridge rule to trigger a CodePipeline execution when the model approval status changes.

D. Use SageMaker Pipelines to deploy the model directly upon training completion.

Answer: C

EventBridge natively integrates with Model Registry events and triggers the pipeline automatically.

Why this answer

Option C is correct because it directly integrates SageMaker Model Registry approval events with CodePipeline via EventBridge, enabling fully automated deployment of the latest approved model to a production endpoint with minimal operational overhead. This approach avoids custom code or additional pipeline stages, leveraging native AWS event-driven architecture to trigger the pipeline only when a model version is approved.

Exam trap

AWS often tests the misconception that you must build a custom Lambda or pipeline action to integrate SageMaker Model Registry with CodePipeline, when in fact EventBridge provides a native, low-overhead solution for event-driven pipeline triggers.

How to eliminate wrong answers

Option A is wrong because adding a custom action to CodePipeline that uses a SageMaker deployment step would require significant custom development and maintenance, increasing operational overhead compared to a native EventBridge trigger. Option B is wrong because creating a Lambda function to poll or react to Model Registry approval events and update the endpoint directly bypasses the existing CodePipeline CI/CD process, losing pipeline visibility, approval gates, and rollback capabilities. Option D is wrong because SageMaker Pipelines are designed for orchestrating training and deployment workflows upon training completion, not for reacting to Model Registry approval events in a CI/CD pipeline, and would require additional integration to trigger on approval rather than training.
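
As a rough illustration of the EventBridge approach in Option C, the sketch below builds an event pattern that matches Model Registry approval events. The model package group name `fraud-detection-models` is a hypothetical placeholder, and the pattern only covers the matching side; the rule's target (the CodePipeline pipeline) and its IAM role are assumed to be configured separately.

```python
import json


def model_approval_event_pattern(model_package_group_name: str) -> dict:
    """Build an EventBridge pattern matching newly approved model versions."""
    return {
        "source": ["aws.sagemaker"],
        "detail-type": ["SageMaker Model Package State Change"],
        "detail": {
            # Hypothetical group name supplied by the caller.
            "ModelPackageGroupName": [model_package_group_name],
            # Only match when a version's status is Approved.
            "ModelApprovalStatus": ["Approved"],
        },
    }


pattern = model_approval_event_pattern("fraud-detection-models")
# The JSON string is what an EventBridge rule takes as its event pattern.
print(json.dumps(pattern, indent=2))
```

Filtering on `ModelApprovalStatus` keeps rejected or pending versions from starting the pipeline.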

304

Multi-Select · medium

A company uses Amazon SageMaker to deploy a model for real-time inference. They want to perform A/B testing between two model versions. Which TWO actions should the company take to set up A/B testing? (Choose TWO.)

Select 2 answers

A. Create an endpoint configuration with multiple production variants, each with a different model.

B. Use Amazon CloudWatch Evidently to split traffic between models.

C. Set the initial weight of each production variant to the desired traffic split.

D. Enable auto scaling for each production variant individually.

E. Set the second production variant's weight to 0 and update later to 100.

Answers: A, C

Production variants allow multiple models on the same endpoint.

Why this answer

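Option A is correct because in SageMaker, A/B testing between two model versions is achieved by creating an endpoint configuration with multiple production variants, each pointing to a different model. This allows the endpoint to host both models simultaneously and route traffic between them based on assigned weights. Option C is correct because each variant's initial weight sets its share of traffic: SageMaker routes requests to a variant in proportion to its weight divided by the sum of all variant weights.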

Exam trap

The trap here is that candidates confuse the separate service Amazon CloudWatch Evidently with SageMaker's native traffic splitting, or think that auto scaling or zero-weight strategies are prerequisites for A/B testing.
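
A minimal sketch of how Options A and C fit together, assuming two hypothetical model names (`fraud-model-v1`, `fraud-model-v2`) that already exist in SageMaker. The dictionaries follow the shape of the `ProductionVariants` list accepted when creating an endpoint configuration; `traffic_share` is an illustrative helper, not an AWS API.

```python
def ab_test_variants(model_a: str, model_b: str,
                     weight_a: float, weight_b: float,
                     instance_type: str = "ml.m5.large") -> list:
    """Return a ProductionVariants list with one variant per model."""
    return [
        {"VariantName": "VariantA", "ModelName": model_a,
         "InstanceType": instance_type, "InitialInstanceCount": 1,
         "InitialVariantWeight": weight_a},
        {"VariantName": "VariantB", "ModelName": model_b,
         "InstanceType": instance_type, "InitialInstanceCount": 1,
         "InitialVariantWeight": weight_b},
    ]


def traffic_share(variants: list) -> dict:
    """Fraction of requests per variant: its weight over the sum of weights."""
    total = sum(v["InitialVariantWeight"] for v in variants)
    return {v["VariantName"]: v["InitialVariantWeight"] / total for v in variants}


variants = ab_test_variants("fraud-model-v1", "fraud-model-v2",
                            weight_a=9.0, weight_b=1.0)
print(traffic_share(variants))
```

Weights are relative, so 9 and 1 give the same split as 0.9 and 0.1.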

305

MCQ · easy

A company wants to maintain multiple versions of a trained model in a central repository and track metadata such as training metrics, hyperparameters, and approval status. Which SageMaker feature should they use?

A. SageMaker Pipelines

B. SageMaker Feature Store

C. SageMaker Model Registry

D. SageMaker Experiments

E. SageMaker Studio

Answer: C

Model Registry provides a central repository for model versions, metadata, and approval status.

Why this answer

SageMaker Model Registry is designed for model versioning, metadata tracking, and approval workflows.

306

MCQ · easy

A company trained a model using SageMaker and wants to deploy it with low latency for real-time inference. Which SageMaker feature is MOST suitable?

A. SageMaker Endpoint with Auto Scaling

B. SageMaker Serverless Inference

C. SageMaker Real-Time Endpoint

D. SageMaker Batch Transform

Answer: C

Real-time endpoints provide low-latency inference suitable for online predictions.

Why this answer

SageMaker Real-Time Endpoint is the most suitable feature for low-latency real-time inference because it provisions dedicated, persistent instances that respond to requests synchronously with predictable latency. This option directly meets the requirement for serving individual predictions with minimal delay, unlike batch or serverless alternatives that introduce higher latency or are designed for asynchronous processing.

Exam trap

The trap here is that candidates confuse 'Auto Scaling' (a scaling mechanism) with a separate deployment option, or they assume 'Serverless' always provides low latency, ignoring the cold start penalty that makes it unsuitable for real-time inference.

How to eliminate wrong answers

Option A is wrong because SageMaker Endpoint with Auto Scaling is not a distinct feature; it is a configuration applied to a Real-Time Endpoint to adjust capacity based on load, but the core requirement for low-latency real-time inference is already met by the Real-Time Endpoint itself, and Auto Scaling does not change the fundamental synchronous nature. Option B is wrong because SageMaker Serverless Inference automatically scales from zero and incurs cold start latency (often seconds) when there is no prior traffic, making it unsuitable for applications requiring consistently low latency for real-time inference. Option D is wrong because SageMaker Batch Transform is designed for asynchronous, offline inference on large datasets where latency is not a concern, processing data in batches and writing results to S3, not for real-time, synchronous requests.

307

MCQ · hard

Refer to the exhibit. A data scientist configured SageMaker Debugger to monitor training for overfitting. However, the rule never triggers even though the model appears to be overfitting. What is the most likely reason?

A. The debug hook is not collecting the validation loss

B. The instance type for the rule is too small

C. The S3 output path is not writable

D. The rule evaluator image is incorrect

Answer: A

The hook only collects 'losses' and 'gradients', lacking a validation loss collection needed to detect overfitting.

Why this answer

The DebugHookConfig only collects losses and gradients. The overfitting rule likely compares training loss to validation loss, but validation loss is not being collected. Without a collection for validation loss (e.g., validation:loss), the rule cannot evaluate the condition for overfitting.

The rule evaluator image, instance type, and S3 path are less likely causes; the image is a placeholder but might be valid.

308

MCQ · hard

A financial services company uses Amazon SageMaker to deploy a fraud detection model for real-time inference. The model is deployed on an ml.m5.large instance with a SageMaker real-time endpoint. The endpoint has an auto scaling policy configured using a custom scaling policy based on average CPU utilization, with scale out threshold at 70% and scale in threshold at 30%. During a flash sale event, the traffic to the endpoint spikes tenfold within minutes. The endpoint fails to handle the load, resulting in increased latency and timeouts. The data science team needs to improve the scalability of the endpoint to handle sudden traffic spikes. Which solution should the team implement?

A. Implement a SageMaker Model Ensemble with two additional models to balance the load.

B. Replace the custom scaling policy with a target tracking scaling policy based on the number of invocations per instance, with a target value of 1000.

C. Implement a SageMaker Inference Pipeline with a pre-processing step to reduce model input size.

D. Switch to a GPU instance type, such as ml.p3.2xlarge, to increase compute capacity.

Answer: B

Target tracking on request count provides faster reaction to traffic spikes because it directly measures the traffic, whereas CPU utilization is a lagging indicator.

Why this answer

Option B is correct because a target tracking scaling policy based on invocations per instance responds faster to traffic spikes than CPU-based scaling, which lags behind the actual request volume. Option A is incorrect because a model ensemble adds compute load per request rather than capacity. Option C is incorrect because an inference pipeline adds a processing step, and therefore latency, without adding capacity. Option D is incorrect because a GPU instance does not address the lag in the scaling policy.
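
For illustration, the target tracking configuration described in Option B could look like the sketch below. It only builds the configuration dictionary and the resource ID in the form Application Auto Scaling expects for SageMaker variants; the endpoint name `fraud-endpoint`, the variant name `AllTraffic`, and the cooldown values are assumptions, and registering the scalable target and submitting the policy are left out.

```python
def invocations_tracking_policy(target_per_instance: float,
                                scale_out_cooldown: int = 60,
                                scale_in_cooldown: int = 300) -> dict:
    """Target tracking configuration keyed on invocations per instance."""
    return {
        "TargetValue": float(target_per_instance),
        "PredefinedMetricSpecification": {
            # Predefined SageMaker metric: invocations per instance.
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
        # Cooldowns are in seconds; these values are assumed, not prescribed.
        "ScaleOutCooldown": scale_out_cooldown,
        "ScaleInCooldown": scale_in_cooldown,
    }


def variant_resource_id(endpoint_name: str, variant_name: str) -> str:
    """Resource ID format used for a SageMaker production variant."""
    return f"endpoint/{endpoint_name}/variant/{variant_name}"


policy = invocations_tracking_policy(1000)
resource_id = variant_resource_id("fraud-endpoint", "AllTraffic")
print(resource_id)
print(policy)
```

A short scale-out cooldown paired with a longer scale-in cooldown is one common way to react quickly to spikes while avoiding capacity flapping.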

309

MCQ · hard

A model has high training accuracy but low validation accuracy. Which action is least likely to reduce overfitting?

A. Use dropout

B. Increase regularization strength

C. Add more training data

D. Increase model complexity

Answer: D

Increasing complexity makes the model more prone to overfitting.

Why this answer

Increasing model complexity (e.g., adding more layers or parameters) makes the model more flexible, which typically exacerbates overfitting by allowing it to memorize noise in the training data. Since the goal is to reduce overfitting, this action is counterproductive and therefore the least likely to help.

Exam trap

AWS often tests the misconception that 'more complex models always perform better,' leading candidates to incorrectly select increasing model complexity as a solution to overfitting rather than recognizing it as a cause.

How to eliminate wrong answers

Option A is wrong because dropout randomly deactivates neurons during training, which forces the network to learn redundant representations and reduces co-adaptation, directly combating overfitting. Option B is wrong because increasing regularization strength (e.g., L1/L2 penalty) adds a cost for large weights, shrinking the hypothesis space and preventing the model from fitting noise. Option C is wrong because adding more training data provides the model with more diverse examples, reducing the chance of memorizing spurious patterns and improving generalization.

310

MCQ · easy

A company is using Amazon SageMaker to train a model on sensitive customer data. The security team requires that all data be encrypted in transit and at rest, and that the training job does not have internet access. Which configuration should the team use to meet these requirements?

A. Configure the training job to run in a public subnet with a security group that blocks outbound traffic

B. Configure the training job to run in a private subnet, but disable encryption to reduce latency

C. Configure the training job to run in a private subnet with no internet access, and use a KMS key for encryption

D. Configure the training job to run in a VPC with a NAT gateway, and use default SageMaker encryption

Answer: C

Private subnet restricts internet; KMS encrypts data.

Why this answer

Option C is correct because running the SageMaker training job in a private subnet with no internet access ensures the job cannot reach the public internet, satisfying the no-internet-access requirement. Using an AWS KMS key for encryption at rest (for the S3 bucket and EBS volumes) and enforcing encryption in transit (via HTTPS/TLS for SageMaker and S3 endpoints) meets the encryption requirements. SageMaker training jobs in a private subnet use VPC endpoints (e.g., S3 and SageMaker API endpoints) to communicate securely without internet access.

Exam trap

The trap here is that candidates often confuse a private subnet with a NAT gateway as providing no internet access, but a NAT gateway actually enables outbound internet connectivity, which violates the requirement.

How to eliminate wrong answers

Option A is wrong because a public subnet inherently provides internet access via an internet gateway, violating the no-internet-access requirement; blocking outbound traffic with a security group does not prevent the instance from having a public IP or being reachable from the internet. Option B is wrong because disabling encryption violates the requirement that all data be encrypted in transit and at rest; encryption does not inherently increase latency in a meaningful way for SageMaker training jobs. Option D is wrong because a NAT gateway provides outbound internet access for instances in a private subnet, which violates the no-internet-access requirement; default SageMaker encryption uses AWS-managed keys, not a customer-managed KMS key, which may not satisfy the security team's requirement for explicit encryption control.
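
The settings behind Option C can be sketched as a fragment of a training job request. The subnet ID, security group ID, KMS key alias, and S3 path below are hypothetical placeholders, and the fragment is meant to be merged with the rest of the request (algorithm, role, instance type and count), which is omitted here.

```python
def secure_training_settings(subnet_ids, security_group_ids,
                             kms_key_id: str, s3_output_path: str) -> dict:
    """Network and encryption fields for a training job request (fragment)."""
    return {
        # Private subnets with no NAT or internet gateway route.
        "VpcConfig": {
            "Subnets": list(subnet_ids),
            "SecurityGroupIds": list(security_group_ids),
        },
        # Encrypt traffic between training containers in distributed jobs.
        "EnableInterContainerTrafficEncryption": True,
        # Encrypt model artifacts written to S3 with the KMS key.
        "OutputDataConfig": {"S3OutputPath": s3_output_path,
                             "KmsKeyId": kms_key_id},
        # Encrypt the attached training volume; merge with instance settings.
        "ResourceConfig": {"VolumeKmsKeyId": kms_key_id},
    }


settings = secure_training_settings(
    subnet_ids=["subnet-0example"],           # hypothetical ID
    security_group_ids=["sg-0example"],       # hypothetical ID
    kms_key_id="alias/training-key",          # hypothetical key alias
    s3_output_path="s3://example-bucket/output/",
)
print(settings)
```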

311

MCQ · medium

A SageMaker Processing job fails with the error: 'Unable to parse CSV file due to inconsistent number of columns'. The data is stored as CSV in S3. What is the most likely cause?

A. The CSV file is missing a header row

B. The file uses a different delimiter like tab

C. Some fields contain quoted commas

D. Some rows have missing values causing fewer columns

Answer: D

If some values are missing, the row may have fewer commas, leading to column count mismatch.

Why this answer

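Option D is correct because an inconsistent number of columns usually results from rows in which some fields, together with their delimiters, were omitted. Option A (missing header) would cause the first data row to be read as column names but would not change the number of columns per row. Option B (a different delimiter) would affect every row in the same way, producing a consistent parsing problem rather than an inconsistent column count. Option C (quoted commas) is handled correctly by standard CSV parsers.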
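
A quick way to confirm this diagnosis before rerunning the job is to count the fields in each row. The sketch below uses Python's standard `csv` module on a small in-memory sample (the data is made up for illustration); `find_ragged_rows` is a hypothetical helper name.

```python
import csv
import io
from collections import Counter


def find_ragged_rows(text: str, delimiter: str = ",") -> list:
    """Return (row_number, field_count) for rows that deviate from the norm."""
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    if not rows:
        return []
    # Treat the most common field count as the expected width.
    expected = Counter(len(row) for row in rows).most_common(1)[0][0]
    return [(i, len(row)) for i, row in enumerate(rows, start=1)
            if len(row) != expected]


sample = (
    "id,name,amount\n"
    '1,"Smith, Jane",10.5\n'   # quoted comma: still three fields
    "2,Lee\n"                  # missing value and delimiter: two fields
    "3,Kim,7.0\n"
)
print(find_ragged_rows(sample))
```

The quoted comma in the second row does not change its field count, while the row with an omitted value is flagged.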

312

MCQ · hard

A company uses SageMaker Ground Truth to create a labeled dataset, then trains a model using SageMaker Training. They want to automate the pipeline so that whenever a labeling job is completed, it triggers the training job. Which architecture meets this requirement with minimal latency?

A. Use AWS Step Functions to poll the labeling job status and then start training.

B. Configure an S3 event notification on the labeling job output bucket to trigger a Lambda function that starts training.

C. Use Amazon CloudWatch Events (EventBridge) to detect the completed labeling job and trigger a SageMaker Pipeline execution.

D. Set up a scheduled cron job in EventBridge to check for completed labeling jobs every hour and start training if found.

Answer: C

EventBridge directly supports SageMaker events and can start a pipeline execution with minimal latency.

Why this answer

Option C is correct because Amazon EventBridge can natively capture SageMaker job state changes (e.g., a `SageMaker Ground Truth Labeling Job State Change` event whose status is `Completed`) and directly trigger a SageMaker Pipeline execution. This event-driven approach eliminates polling overhead and provides the lowest latency by reacting immediately when the labeling job finishes.

Exam trap

The trap here is that candidates often assume S3 event notifications are the simplest event-driven trigger, but they overlook the fact that S3 events can fire on intermediate writes (e.g., partial output files) rather than waiting for the labeling job's definitive `Completed` state, leading to data integrity issues.

How to eliminate wrong answers

Option A is wrong because polling the labeling job status with AWS Step Functions introduces unnecessary latency and cost from repeated API calls, and it is not a true event-driven architecture. Option B is wrong because S3 event notifications on the labeling job output bucket may fire before the labeling job is fully complete (e.g., partial writes) and do not guarantee that the job has transitioned to the `Completed` state, risking training on incomplete data. Option D is wrong because a scheduled cron job running every hour introduces up to 60 minutes of latency, which fails the 'minimal latency' requirement and is inefficient compared to an event-driven trigger.

313

MCQ · hard

A financial services company uses SageMaker to train and deploy models. They must ensure that all model artifacts stored in S3 are encrypted at rest using customer-managed KMS keys. Additionally, only the SageMaker service role should have access to the encryption key for decrypting artifacts during inference. Which IAM policy configuration meets these requirements?

A. Set the S3 bucket policy to require aws:SourceArn to match the SageMaker endpoint and allow kms:GenerateDataKey and kms:Decrypt.

B. Create a KMS grant to allow the SageMaker service to use the key on behalf of the role, and set the S3 bucket to use AWS-managed SSE-S3.

C. Configure the KMS key policy to allow s3:PutObject and s3:GetObject for the SageMaker role, and enable S3 default encryption with the KMS key.

D. Use envelope encryption by generating a data key and storing it alongside the model artifact.

E. Attach a policy to the SageMaker role that allows kms:Decrypt on the KMS key, and set an S3 bucket policy that denies all access unless the request uses server-side encryption with the KMS key.

Answer: E

The role can decrypt, and the bucket policy enforces SSE-KMS, preventing unencrypted access.

Why this answer

The role must have kms:Decrypt permission, and the S3 bucket policy must enforce SSE-KMS to ensure encryption with the correct key.
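
The two policies in Option E can be sketched as plain JSON documents. This is an illustrative minimum, not a complete production policy: the bucket name and key ARN are hypothetical, and the deny statement shown covers uploads (`s3:PutObject`) that do not request SSE-KMS with the expected key.

```python
import json


def require_sse_kms_bucket_policy(bucket: str, kms_key_arn: str) -> dict:
    """Bucket policy denying uploads that do not use SSE-KMS with the given key."""
    resource = f"arn:aws:s3:::{bucket}/*"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {"Sid": "DenyUploadsWithoutSseKms", "Effect": "Deny",
             "Principal": "*", "Action": "s3:PutObject", "Resource": resource,
             "Condition": {"StringNotEquals": {
                 "s3:x-amz-server-side-encryption": "aws:kms"}}},
            {"Sid": "DenyUploadsWithOtherKeys", "Effect": "Deny",
             "Principal": "*", "Action": "s3:PutObject", "Resource": resource,
             "Condition": {"StringNotEquals": {
                 "s3:x-amz-server-side-encryption-aws-kms-key-id": kms_key_arn}}},
        ],
    }


def role_decrypt_policy(kms_key_arn: str) -> dict:
    """Identity policy letting the SageMaker role decrypt with the key."""
    return {
        "Version": "2012-10-17",
        "Statement": [{"Effect": "Allow", "Action": "kms:Decrypt",
                       "Resource": kms_key_arn}],
    }


key_arn = "arn:aws:kms:us-east-1:111122223333:key/example-key-id"  # hypothetical
bucket_policy = require_sse_kms_bucket_policy("example-model-artifacts", key_arn)
role_policy = role_decrypt_policy(key_arn)
print(json.dumps(bucket_policy, indent=2))
print(json.dumps(role_policy, indent=2))
```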

314

Multi-Select · hard

A healthcare company deploys a model to predict patient readmission risk. The model was trained on historical data and is now showing signs of concept drift. The team needs to implement a monitoring solution that can detect drift and automatically retrain the model when drift is detected. Which THREE steps should the team take to build this solution? (Choose THREE.)

Select 3 answers

A. Deploy SageMaker Model Monitor to track prediction quality over time

B. Disable the existing endpoint to prevent stale predictions during retraining

C. Set up a process to collect ground truth labels from patient outcomes

D. Manually compare the model's predictions against a holdout validation set each week

E. Use AWS Lambda to invoke a SageMaker training job when drift is detected

Answers: A, C, E

Model Monitor can detect drift using ground truth.

Why this answer

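A is correct because Amazon SageMaker Model Monitor can continuously track prediction quality metrics (e.g., accuracy, precision) over time by analyzing data captured from the endpoint. This allows the team to detect concept drift by comparing live prediction quality against a baseline, triggering alerts when performance degrades. It provides a managed, automated way to monitor model quality without manual intervention. C is correct because quality metrics can only be computed once the actual patient outcomes are collected as ground truth labels and matched to the captured predictions. E is correct because a Lambda function invoked by the drift alert can start a SageMaker training job, which makes retraining automatic.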

Exam trap

The trap here is that candidates might think disabling the endpoint (Option B) is necessary to prevent stale predictions, but AWS best practice is to keep the endpoint live and use a separate pipeline (e.g., Lambda triggering a training job) to retrain and then update the endpoint without downtime.

315

MCQ · easy

A data science team wants to deploy a real-time inference endpoint on Amazon SageMaker for a model that requires low latency (under 100 ms). The model is a small ensemble of three tree-based models, each about 50 MB. The team expects around 1000 requests per minute, with occasional spikes to 5000 requests per minute. Which instance type and deployment strategy would be MOST cost-effective while meeting the latency requirement?

A. Deploy a single model endpoint on an ml.c5.large instance with Auto Scaling configured using a target tracking policy based on invocations per minute

B. Deploy a single model endpoint on an ml.c5.large instance with a Multi-Model endpoint

C. Use SageMaker batch transform with multiple ml.c5.large instances to process all requests offline

D. Deploy a single model endpoint on an ml.c5.xlarge instance with provisioned concurrency

Answer: A

The ml.c5.large provides sufficient compute for the latency requirement, and Auto Scaling scales out during spikes. This is the most cost-effective approach.

Why this answer

Option A is correct because deploying a single model endpoint on an ml.c5.large instance with Auto Scaling based on invocations per minute provides the necessary compute capacity for the expected 1000 requests per minute while scaling up to handle spikes up to 5000 requests per minute. The ml.c5.large instance offers sufficient memory (4 GB) and compute for three 50 MB tree-based models, and the target tracking policy ensures low latency by maintaining a buffer of capacity without over-provisioning, keeping inference under 100 ms.

Exam trap

The trap here is that candidates might assume provisioned concurrency applies to instance-based endpoints (it is a setting for serverless compute such as AWS Lambda and SageMaker Serverless Inference), or incorrectly assume Multi-Model endpoints are suitable for ensemble models, leading to choosing B or D without considering the real-time latency constraint.

How to eliminate wrong answers

Option B is wrong because Multi-Model endpoints are designed to host multiple independent models on a single instance, but here the ensemble is a single model composed of three sub-models that must be loaded together for each inference; using a Multi-Model endpoint would require loading each sub-model separately, increasing latency and complexity. Option C is wrong because SageMaker batch transform is an asynchronous, offline processing method that does not support real-time inference with sub-100 ms latency; it is designed for large-scale batch jobs, not low-latency endpoints. Option D is wrong because provisioned concurrency is a setting for serverless compute (AWS Lambda and SageMaker Serverless Inference), not for instance-based SageMaker endpoints, which scale through Auto Scaling or manual changes to the instance count; in addition, an ml.c5.xlarge instance would be over-provisioned for the baseline load, increasing cost unnecessarily.

316

MCQ · easy

A machine learning engineer is using SageMaker Data Wrangler to perform data validation. Which step should be added to the pipeline to ensure data quality before training?

A. Write a custom SageMaker Processing job for validation

B. Apply a 'Data Quality' transformation in Data Wrangler to validate column statistics

C. Use AWS Glue DataBrew to profile the dataset

D. Add a SageMaker Pipeline step to check data quality after Data Wrangler

Answer: B

Data Wrangler provides built-in data quality checks.

Why this answer

Option B is correct because SageMaker Data Wrangler includes a built-in 'Data Quality' transformation that allows you to validate column statistics (e.g., missing values, min/max, distinct counts) directly within the visual pipeline. This step ensures data quality without requiring custom code or external services, integrating seamlessly with the Data Wrangler workflow for pre-training validation.

Exam trap

The trap here is that candidates often overcomplicate the solution by choosing a custom Processing job or external service, missing that Data Wrangler's built-in 'Data Quality' transformation is the most direct and efficient way to validate data quality within the same pipeline.

How to eliminate wrong answers

Option A is wrong because writing a custom SageMaker Processing job for validation is unnecessary overhead; Data Wrangler already provides native data quality checks that are simpler and more integrated. Option C is wrong because AWS Glue DataBrew is a separate service for data preparation, not a step within a SageMaker Data Wrangler pipeline, and using it would break the pipeline's continuity. Option D is wrong because adding a SageMaker Pipeline step to check data quality after Data Wrangler is redundant; Data Wrangler itself can perform validation inline, and a post-hoc step would not catch issues before training in the same streamlined flow.

317

Multi-Select · hard

A company has deployed a model to a SageMaker endpoint. The security team wants to ensure that all traffic between the endpoint and the client application is encrypted and that the endpoint is not accessible from the internet. Which TWO actions should the company take? (Choose TWO.)

Select 2 answers

A. Place the endpoint behind an API Gateway and call it from the client.

B. Configure the SageMaker endpoint to be VPC-only by setting the endpoint's VPC configuration.

C. Create the endpoint with a public endpoint and allow only the client's IP address via security group.

D. Enable HTTPS on the endpoint by using a custom certificate from ACM.

E. Use AWS KMS to encrypt data in transit between the client and the endpoint.

Answers: B, D

VPC-only endpoints are not publicly accessible.

Why this answer

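Options B and D are correct. A VPC endpoint (PrivateLink) enables private connectivity from clients in the same VPC, and using HTTPS ensures that traffic is encrypted in transit. Option C (public endpoint) is wrong because it leaves the endpoint reachable from the internet. Option E (AWS KMS) is wrong because KMS manages keys for encryption at rest and does not directly encrypt network traffic. Option A (API Gateway) is unnecessary if the clients are in the VPC.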

318

MCQ · easy

An ML engineer needs to split a dataset into training, validation, and test sets. The dataset has a time-based column that should not be leaked. Which split method is most appropriate?

A. Stratified split based on target

B. Temporal split based on date

C. Random split with 70/20/10

D. K-fold cross-validation

Answer: B

Temporal split respects chronology by using earlier data for training and later data for testing.

Why this answer

Option B is correct because a temporal split ensures that the time-based column is not leaked by preserving the chronological order of the data. This method uses the date column to assign earlier records to the training set and later records to the validation and test sets, preventing future information from influencing the model during training.

Exam trap

AWS often tests the concept of data leakage by presenting random or stratified splits as viable options, trapping candidates who overlook the time-based column and assume standard splitting methods are always safe.

How to eliminate wrong answers

Option A is wrong because a stratified split based on the target variable preserves class proportions but does not account for time order, leading to potential data leakage when time-dependent patterns exist. Option C is wrong because a random split ignores the temporal structure entirely, allowing future data points to appear in the training set and causing leakage. Option D is wrong because K-fold cross-validation shuffles data randomly across folds, which breaks the time sequence and introduces leakage; it is unsuitable for time-series or time-sensitive data.
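
A minimal sketch of a temporal split, assuming each record is a dictionary with a date field. The field name `day`, the toy records, and the two cutoff dates are made up for illustration; cutoff dates are used instead of row fractions so that records from the same day never straddle two sets.

```python
from datetime import date


def temporal_split(records, date_key, train_end, val_end):
    """Split by date: train before train_end, validation before val_end, test after."""
    train = [r for r in records if r[date_key] < train_end]
    val = [r for r in records if train_end <= r[date_key] < val_end]
    test = [r for r in records if r[date_key] >= val_end]
    return train, val, test


# Ten toy daily records, 2024-01-01 through 2024-01-10.
records = [{"day": date(2024, 1, d), "label": d % 2} for d in range(1, 11)]
train, val, test = temporal_split(
    records, "day", train_end=date(2024, 1, 8), val_end=date(2024, 1, 10)
)
print(len(train), len(val), len(test))
```

Every training record predates every validation record, which in turn predates every test record, so no future information reaches the model during training.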

319

MCQ · medium

A data scientist is training a binary classification model using Amazon SageMaker. The dataset has a severe class imbalance (95% negative, 5% positive). The model achieves 99% accuracy but fails to identify positive cases correctly. Which action should the data scientist take to improve the model's ability to detect positive cases?

A. Switch to a logistic regression model with balanced class weights.

B. Use accuracy as the evaluation metric and retrain the model.

C. Apply SMOTE (Synthetic Minority Over-sampling Technique) to the training data.

D. Use the F1 score as the evaluation metric and adjust the classification threshold based on the precision-recall curve.

Answer: D

F1 score and threshold tuning directly address the imbalance.

Why this answer

Option D is correct because in a severely imbalanced dataset (95% negative, 5% positive), accuracy is misleading. The F1 score balances precision and recall, and adjusting the classification threshold based on the precision-recall curve allows the model to prioritize recall for the minority class, directly improving detection of positive cases. This approach is recommended in SageMaker when using built-in algorithms or custom models with imbalanced data.

Exam trap

The trap here is that candidates often think oversampling (SMOTE) or changing the model type is the primary fix, but the exam tests understanding that evaluation metrics and threshold tuning are critical for imbalanced classification, not just data preprocessing.

How to eliminate wrong answers

Option A is wrong because switching to logistic regression with balanced class weights may help, but it is not the best action; the question asks for a single action to improve detection, and adjusting the threshold and metric (D) is more direct and effective than changing the model type. Option B is wrong because using accuracy as the evaluation metric will continue to favor the majority class and fail to reflect poor positive detection, reinforcing the original problem. Option C is wrong because applying SMOTE to the training data can introduce synthetic samples, but it does not address the need to evaluate and tune the model's decision threshold; SMOTE alone may not fix the detection issue if the threshold remains at 0.5.
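
To make the threshold argument concrete, the sketch below computes precision, recall, and F1 at several thresholds on a small made-up dataset (2 positives, 38 negatives) and picks the threshold with the highest F1. The scores and candidate thresholds are illustrative assumptions, not output from a real model.

```python
def precision_recall_f1(y_true, scores, threshold):
    """Precision, recall, and F1 when scores >= threshold count as positive."""
    pairs = list(zip(y_true, scores))
    tp = sum(1 for y, s in pairs if y == 1 and s >= threshold)
    fp = sum(1 for y, s in pairs if y == 0 and s >= threshold)
    fn = sum(1 for y, s in pairs if y == 1 and s < threshold)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


def best_threshold(y_true, scores, candidates):
    """Candidate threshold with the highest F1 score."""
    return max(candidates, key=lambda t: precision_recall_f1(y_true, scores, t)[2])


# Toy data: the two positives score below the default 0.5 cutoff.
y_true = [1, 1] + [0] * 38
scores = [0.45, 0.35, 0.32, 0.10] + [0.05] * 36

for t in (0.3, 0.4, 0.5):
    print(t, precision_recall_f1(y_true, scores, t))
print("best:", best_threshold(y_true, scores, [0.3, 0.4, 0.5]))
```

At the default 0.5 cutoff this toy model is 95% accurate yet finds no positives; lowering the threshold trades one false positive for full recall.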

Full explanation →

320

MCQhard

A company deploys a model using SageMaker real-time endpoint with auto scaling. They observe that during a traffic spike, the endpoint quickly scales up to 10 instances, but after the spike, it takes a long time to scale down, leading to high costs. The scaling policy is based on a simple average CPU utilization threshold. Which adjustment would optimize the scaling down behavior?

A.Increase the scale-in cooldown period to prevent premature scale-down.

B.Decrease the scale-in cooldown period to allow the endpoint to scale down faster when utilization drops.

C.Use a step scaling policy with a larger step adjustment for scale-in.

D.Change the scaling policy to use memory utilization instead of CPU.

AnswerB

Reducing the scale-in cooldown lets Application Auto Scaling remove instances sooner.

Why this answer

The correct answer is B because decreasing the scale-in cooldown period allows the endpoint to respond more quickly to sustained drops in CPU utilization. By default, SageMaker auto scaling uses cooldown periods to prevent rapid fluctuations; a long scale-in cooldown delays the termination of instances after utilization falls, keeping costs high. Reducing this cooldown lets the endpoint scale down faster when the spike subsides, directly addressing the problem.

Exam trap

The trap here is that candidates often confuse cooldown periods with step adjustments, thinking that larger scale-in steps will speed up the process, when in fact the cooldown period controls the timing of when scaling actions can occur.

How to eliminate wrong answers

Option A is wrong because increasing the scale-in cooldown period would make the problem worse, not better—it would cause the endpoint to wait even longer before scaling down, increasing costs. Option C is wrong because step scaling policies control the magnitude of scaling adjustments (e.g., adding or removing multiple instances at once), but they do not affect the timing or delay of scale-in actions; the cooldown period is the key parameter for timing. Option D is wrong because changing the metric to memory utilization does not address the core issue of slow scale-down timing; the problem is with the cooldown period, not the metric choice.
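As a rough sketch of where the cooldown lives, the dictionary below follows the shape of a target tracking configuration for Application Auto Scaling. It assumes a target tracking policy on a custom CPU metric; the endpoint name, variant name, target value, and cooldown values are hypothetical, and the question's actual policy may be configured differently.

```python
def target_tracking_config(endpoint_name, variant_name, target_cpu,
                           scale_in_cooldown, scale_out_cooldown):
    """Build a target tracking configuration with explicit cooldowns (seconds)."""
    return {
        "TargetValue": float(target_cpu),
        "CustomizedMetricSpecification": {
            "MetricName": "CPUUtilization",
            "Namespace": "/aws/sagemaker/Endpoints",
            "Dimensions": [
                {"Name": "EndpointName", "Value": endpoint_name},
                {"Name": "VariantName", "Value": variant_name},
            ],
            "Statistic": "Average",
        },
        # A shorter scale-in cooldown lets instances be removed sooner after a spike.
        "ScaleInCooldown": scale_in_cooldown,
        "ScaleOutCooldown": scale_out_cooldown,
    }


# Hypothetical endpoint and values, for illustration only.
config = target_tracking_config("example-endpoint", "AllTraffic",
                                target_cpu=60, scale_in_cooldown=120,
                                scale_out_cooldown=60)

# With boto3, this dict would typically be passed to the
# "application-autoscaling" client as:
#   put_scaling_policy(PolicyName=..., ServiceNamespace="sagemaker",
#       ResourceId="endpoint/example-endpoint/variant/AllTraffic",
#       ScalableDimension="sagemaker:variant:DesiredInstanceCount",
#       PolicyType="TargetTrackingScaling",
#       TargetTrackingScalingPolicyConfiguration=config)
```

Keeping the scale-out cooldown short and tuning the scale-in cooldown separately is a common design choice, since the two control different risks (slow response to spikes versus paying for idle instances).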

Full explanation →

321

MCQmedium

An MLOps engineer is setting up a SageMaker endpoint for a model that performs inference on large images. The model is containerized and expects input in a specific format. The team wants to preprocess the images (resize and normalize) before passing them to the model. What is the most efficient way to implement this?

A.Configure SageMaker to use a preprocessing container as the first step of an inference pipeline, followed by the model container.

B.Use Amazon API Gateway to perform request transformation before forwarding to the endpoint.

C.Package the preprocessing logic into the same Docker container as the model.

D.Use a Lambda function as a proxy to preprocess requests before calling the SageMaker endpoint.

AnswerA

Inference pipeline allows separation of concerns and efficient processing.

Why this answer

Option A is correct because SageMaker Inference Pipelines allow you to chain multiple containers in a serial fashion, where the output of one container becomes the input of the next. By placing a preprocessing container as the first step, you can resize and normalize large images before passing them to the model container, which keeps the model container focused on inference and avoids unnecessary data transfer or custom code. This is the most efficient and natively supported approach within SageMaker for multi-step inference workflows.

Exam trap

The trap here is that candidates often choose Option C (packaging everything into one container) because it seems simpler, but they overlook the fact that SageMaker Inference Pipelines are specifically designed for this exact use case and provide better modularity, maintainability, and efficiency.

How to eliminate wrong answers

Option B is wrong because Amazon API Gateway is designed for request routing and transformation at the HTTP level, not for heavy image preprocessing (e.g., resizing and normalization); it lacks the computational capability and libraries needed for such tasks, and it would introduce latency without any benefit. Option C is wrong because packaging preprocessing logic into the same container as the model violates the separation of concerns principle and makes the container larger and harder to maintain; it also prevents independent updating of preprocessing steps. Option D is wrong because using a Lambda function as a proxy adds cold-start latency and is subject to Lambda's 6 MB synchronous invocation payload limit, which is problematic for large images, and it does not integrate as seamlessly with SageMaker's inference pipeline features.
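To make the serial chaining concrete, here is a minimal sketch of the container list for an inference pipeline model. It assumes the boto3 SageMaker `create_model` call with its `Containers` parameter; the image URIs, bucket, and function name are hypothetical placeholders.

```python
def inference_pipeline_containers(preprocess_image_uri, model_image_uri,
                                  model_data_url):
    """Container definitions in execution order: preprocessing, then model."""
    return [
        # Step 1: resizes and normalizes the incoming image.
        {"Image": preprocess_image_uri},
        # Step 2: receives the output of step 1 and runs inference.
        {"Image": model_image_uri, "ModelDataUrl": model_data_url},
    ]


# Hypothetical URIs, for illustration only.
containers = inference_pipeline_containers(
    "123456789012.dkr.ecr.us-east-1.amazonaws.com/example-preprocess:latest",
    "123456789012.dkr.ecr.us-east-1.amazonaws.com/example-model:latest",
    "s3://example-bucket/model/model.tar.gz",
)

# With boto3 this list would be passed along the lines of:
#   boto3.client("sagemaker").create_model(
#       ModelName=..., ExecutionRoleArn=..., Containers=containers)
```

The order of the list is what defines the pipeline: each container's response becomes the next container's request.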

Full explanation →

322

MCQhard

An e-commerce company uses Amazon SageMaker to train a model that predicts click-through rates. The training data includes a timestamp column 'click_time' and a categorical feature 'device_type' (8 values). They notice that the model's performance degrades over time because the data distribution shifts. They want to ensure the training data represents the most recent behavior. The data is stored in a daily partitioned S3 bucket (e.g., s3://bucket/data/2024-01-01/). The total dataset size is 500 GB. Which approach should they take to prepare the training data while minimizing bias and cost?

A.Select only the data from the last 30 days to train the model.

B.Take a random sample of 10% of the rows from the entire dataset.

C.Use all historical data and let the model learn the temporal patterns.

D.Downsample older data exponentially so that recent data is overrepresented.

AnswerA

Using a recent window captures current patterns, reduces volume, and mitigates drift.

Why this answer

Option A is correct because selecting only the last 30 days of data directly addresses the data distribution shift by focusing on the most recent user behavior, which is critical for click-through rate prediction. This approach minimizes bias from outdated patterns and reduces training cost by using a smaller, relevant dataset (approximately 500 GB / 365 * 30 ≈ 41 GB, assuming the 500 GB covers roughly one year of daily partitions). SageMaker training jobs benefit from this reduced volume through faster data loading and lower compute costs.

Exam trap

AWS often tests the misconception that more data always improves model performance, but in the presence of concept drift, recent data is more valuable than historical data, making a time-window selection the most cost-effective and bias-minimizing strategy.

How to eliminate wrong answers

Option B is wrong because random sampling from the entire dataset would include outdated data from months or years ago, failing to capture the recent distribution shift and introducing bias from stale patterns. Option C is wrong because using all historical data would force the model to learn temporal patterns that may no longer be valid, leading to degraded performance on current data and higher training costs due to the full 500 GB dataset. Option D is wrong because exponential downsampling of older data is an overly complex approach that may still retain some outdated data, and it does not guarantee that the training set reflects the most recent behavior as cleanly as a simple time-window cut; it also adds unnecessary preprocessing overhead.
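Selecting a 30-day window from daily partitions can be sketched as follows. The base URI follows the example path in the question; the end date and the helper name are hypothetical, and the size estimate assumes the 500 GB is spread evenly over about one year.

```python
from datetime import date, timedelta


def recent_partition_prefixes(base_uri, end_day, window_days):
    """List daily partition prefixes for the last `window_days` days, oldest first."""
    return [
        f"{base_uri}/{(end_day - timedelta(days=offset)).isoformat()}/"
        for offset in range(window_days - 1, -1, -1)
    ]


# Hypothetical end date; the path layout follows s3://bucket/data/2024-01-01/.
prefixes = recent_partition_prefixes("s3://bucket/data", date(2024, 1, 30), 30)

# Rough size of the window, assuming ~365 equally sized daily partitions.
estimated_gb = 500 / 365 * 30

print(prefixes[0], prefixes[-1], round(estimated_gb, 1))
```

Only the listed prefixes would then be supplied as training input, so older partitions are never read.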

Full explanation →

323

MCQmedium

An ML team is using SageMaker Model Registry to manage model versions. After training a new model version, they register it with an 'Approved' status. The CI/CD pipeline automatically deploys the latest approved model to a staging endpoint. However, the pipeline fails with an error: 'Cannot deploy model because the model version is not approved.' The model version is clearly approved in the registry. What is the most likely cause?

A.The pipeline is using the model package ARN instead of the model version ARN.

B.The model version is approved but the pipeline uses a different version that is still pending.

C.The SageMaker endpoint configuration does not have the necessary IAM permissions to read the registry.

D.The approval status was set on the model package group, not on the specific model version.

AnswerD

Approval is per model version; if only the group is approved, individual versions may not inherit.

Why this answer

Option D is correct because in SageMaker Model Registry, approval is a property of a specific model version within a model package group, not of the model package group itself. The error indicates the pipeline is likely referencing the model package group ARN or a version that lacks explicit approval, even though the team believes the model is approved. The CI/CD pipeline must use the exact model version ARN that has the 'Approved' status to deploy successfully.

Exam trap

The trap here is that candidates confuse model package group approval with model version approval, assuming that approving the group automatically approves all versions, whereas AWS requires explicit approval on each version individually.

How to eliminate wrong answers

Option A is wrong because in SageMaker Model Registry each model version is itself a versioned model package, so the ARN of a versioned model package is the model version ARN; this distinction would not by itself produce a 'not approved' error. Option B is wrong because the question states the model version is clearly approved in the registry, so the pipeline using a different pending version would imply a misconfiguration in the pipeline's version selection logic, but the error message directly contradicts the approval status of the intended version. Option C is wrong because missing IAM permissions to read the registry would cause an 'AccessDenied' or authorization error, not a 'not approved' error; the error is about approval status, not permissions.

Full explanation →

324

MCQhard

A data scientist is using SageMaker built-in linear learner algorithm for a regression problem. The dataset has 10 features, some have missing values, and the target variable is right-skewed. The data scientist wants to handle missing values and transform the target variable to improve model performance. Which data preparation steps should the data scientist take?

A.Apply one-hot encoding to all features and remove missing values by dropping rows.

B.Standardize all features to have zero mean and unit variance, then apply a box-cox transformation to the target.

C.Impute missing values with the median of each feature and apply a log transformation to the target variable.

D.Remove rows with missing values and normalize the target to range [0,1].

AnswerC

Handles missing values and skew appropriately.

Why this answer

Option C is correct because imputing missing values with the median is robust to outliers and keeps every row, so no training data is lost. Applying a log transformation to the right-skewed target variable compresses its long tail and makes its distribution closer to symmetric, which better suits a linear model and typically improves convergence and prediction accuracy.

Exam trap

The trap here is that candidates may assume standardizing features (Option B) is always required, but for a right-skewed target, transforming the target itself (e.g., log transform) is more critical than scaling features, and imputation is essential to avoid data loss.

How to eliminate wrong answers

Option A is wrong because one-hot encoding all features, including numeric ones, would dramatically increase dimensionality and is inappropriate for features that are not categorical; dropping rows with missing values reduces the dataset size and can introduce bias. Option B is wrong because, although standardizing features is beneficial, it does nothing about the missing values, and a Box-Cox transformation requires all target values to be strictly positive (which may not hold); a log transformation is the more common choice for right-skewed targets. Option D is wrong because removing rows with missing values discards potentially valuable data and can lead to biased models; normalizing the target to [0,1] does not address skewness and may compress the variance, harming regression performance.
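A minimal sketch of median imputation plus a log-transformed target, in plain Python. It assumes missing values are represented as None and that the target is non-negative; the helper names and sample values are hypothetical.

```python
import math
from statistics import median


def impute_median(column):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in column if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in column]


def log_transform_target(targets):
    """Compress a right-skewed, non-negative target with log(1 + y)."""
    return [math.log1p(t) for t in targets]


def invert_log_target(predictions):
    """Map predictions made on the log scale back to the original scale."""
    return [math.expm1(p) for p in predictions]


feature = [3.0, None, 5.0, 100.0, None, 4.0]   # the outlier 100.0 barely moves the median
imputed = impute_median(feature)

target = [10.0, 12.0, 15.0, 400.0]
log_target = log_transform_target(target)
restored = invert_log_target(log_target)
```

In practice the median would be computed on the training split only and reused for validation and inference data, and predictions would be mapped back with the inverse transform before reporting errors.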

Full explanation →

325

MCQhard

A machine learning team is building a model using a dataset that contains a mix of numerical and categorical features. The categorical features have high cardinality (e.g., zip code with thousands of unique values). The team wants to use Amazon SageMaker for training. Which technique should the team use to encode the high-cardinality categorical features effectively?

A.Apply hash encoding to map categories to a fixed number of buckets.

B.Apply target encoding (mean encoding) to the high-cardinality features.

C.Apply one-hot encoding to all categorical features.

D.Apply label encoding to assign integer values to each category.

AnswerB

Target encoding reduces dimensionality and captures target-related information.

Why this answer

For high-cardinality categorical features, target encoding (mean encoding) replaces each category with the mean of the target variable for that category, which captures information without creating a large number of dummy variables. One-hot encoding would create too many features. Label encoding implies ordinal relationships.

Hash encoding can cause collisions.
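Target (mean) encoding can be sketched as below. This is an illustrative plain-Python version, assuming a binary target and a fallback to the global mean for categories not seen during fitting; the zip codes and function names are made up.

```python
from collections import defaultdict


def fit_target_encoding(categories, targets):
    """Learn the mean target per category, plus the global mean as a fallback."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for category, y in zip(categories, targets):
        sums[category] += y
        counts[category] += 1
    mapping = {category: sums[category] / counts[category] for category in sums}
    global_mean = sum(targets) / len(targets)
    return mapping, global_mean


def apply_target_encoding(categories, mapping, global_mean):
    """Replace each category with its learned mean; unseen ones get the global mean."""
    return [mapping.get(category, global_mean) for category in categories]


# Fit on TRAINING rows only, so validation/test targets never leak into the feature.
train_zip = ["98101", "98101", "10001", "10001", "10001", "60601"]
train_y = [1, 0, 1, 1, 0, 0]
mapping, global_mean = fit_target_encoding(train_zip, train_y)

encoded = apply_target_encoding(["98101", "10001", "99999"], mapping, global_mean)
```

Each category becomes a single numeric column regardless of how many distinct values exist, which is why the technique scales to thousands of zip codes. Smoothing toward the global mean for rare categories is a common refinement that this sketch omits.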

Full explanation →

326

MCQmedium

A data scientist is training a deep learning model on Amazon SageMaker and notices that the training loss decreases but the validation loss starts increasing after a certain number of epochs. The model is likely overfitting. Which SageMaker feature can they use to detect and diagnose this issue during training?

A.SageMaker Model Monitor

B.SageMaker Automatic Model Tuning

C.SageMaker Experiments

D.SageMaker Debugger

AnswerD

SageMaker Debugger provides built-in rules such as overfit to monitor training and detect issues like overfitting in real time.

Why this answer

SageMaker Debugger includes built-in rules such as overfit that compare training and validation losses during training and flag the job when overfitting is detected. SageMaker Experiments tracks runs but does not diagnose training problems, Model Monitor is for deployed endpoints, and Automatic Model Tuning optimizes hyperparameters.
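The kind of check such a rule performs can be illustrated with a small plain-Python function. This is only a conceptual sketch, not Debugger's actual implementation; the function name, the `patience` parameter, and the loss values are hypothetical.

```python
def overfitting_started_at(train_loss, val_loss, patience=3):
    """Return the first epoch of a run of `patience` consecutive epochs in which
    validation loss rose while training loss fell, or None if no such run exists."""
    streak = 0
    for epoch in range(1, len(val_loss)):
        diverging = (val_loss[epoch] > val_loss[epoch - 1]
                     and train_loss[epoch] < train_loss[epoch - 1])
        streak = streak + 1 if diverging else 0
        if streak >= patience:
            return epoch - patience + 1
    return None


# Illustrative loss curves: validation loss bottoms out at epoch 3, then climbs.
train = [1.00, 0.80, 0.65, 0.55, 0.47, 0.40, 0.35]
val = [1.05, 0.90, 0.80, 0.78, 0.81, 0.85, 0.90]

print(overfitting_started_at(train, val))
```

Requiring several consecutive diverging epochs avoids reacting to a single noisy validation measurement.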

Full explanation →

327

MCQmedium

A company is building a fraud detection model on an imbalanced dataset (99% legitimate, 1% fraudulent). To improve recall on the minority class, they want to resample data. Which combination of techniques should they use?

A.SMOTE on entire dataset before train/test split

B.Random oversampling of minority class before train/test split

C.Random undersampling of majority class

D.SMOTE on training set only

AnswerD

Correct: SMOTE generates synthetic minority samples on the training set without affecting the test distribution.

Why this answer

SMOTE should be applied only to the training set to avoid data leakage; evaluation must reflect the original distribution. Random undersampling may discard useful majority samples; random oversampling before split leaks information.
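The "split first, then resample the training set only" rule can be sketched as follows. The interpolation function is a simplified SMOTE-like stand-in written in plain Python, not the reference SMOTE implementation, and it assumes at least two minority rows; all data points and names are made up.

```python
import random


def smote_like(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating toward a near neighbor."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbor = rng.choice(others[:k])
        gap = rng.random()  # position along the segment from base to neighbor
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbor)))
    return synthetic


def resample_training_only(train_rows, train_labels, seed=0):
    """Balance the training set with synthetic minority rows; label 1 is the minority."""
    minority = [r for r, y in zip(train_rows, train_labels) if y == 1]
    n_majority = sum(1 for y in train_labels if y == 0)
    synthetic = smote_like(minority, n_majority - len(minority), seed=seed)
    return train_rows + synthetic, train_labels + [1] * len(synthetic)


# The split happens FIRST (written out by hand here); the test set is never resampled.
train_rows = [(0.1, 0.2), (0.3, 0.1), (0.2, 0.4), (0.5, 0.3), (0.4, 0.1),
              (0.2, 0.2), (1.0, 1.0), (2.0, 3.0), (1.5, 2.0)]
train_labels = [0, 0, 0, 0, 0, 0, 1, 1, 1]
test_rows = [(0.3, 0.3), (0.1, 0.4), (1.8, 2.5)]
test_labels = [0, 0, 1]

balanced_rows, balanced_labels = resample_training_only(train_rows, train_labels)
```

Because the synthetic rows are derived only from training minority rows, nothing about the held-out examples influences them, and the test set keeps its original class ratio.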

Full explanation →

328

MCQhard

A machine learning engineer is preparing a dataset for a binary classification model. The dataset has a severe class imbalance (95% class A, 5% class B). The engineer wants to use Amazon SageMaker to train the model. Which data preparation technique should the engineer apply to the training dataset to address the imbalance and improve model performance?

A.Apply data augmentation to the majority class by adding noise.

B.Apply Synthetic Minority Over-sampling Technique (SMOTE) to generate synthetic samples for the minority class.

C.Use a weighted loss function during training to penalize misclassifications of the minority class.

D.Apply random under-sampling to reduce the majority class to match the minority class size.

AnswerB

SMOTE creates synthetic samples, balancing the dataset without losing data.

Why this answer

Option B is correct because SMOTE generates synthetic samples for the minority class by interpolating between existing minority instances, which directly addresses the severe class imbalance (95% class A, 5% class B) by creating a more balanced training dataset. This technique is particularly effective for tabular data in Amazon SageMaker, as it increases the representation of the minority class without simply duplicating existing samples, thereby reducing overfitting and improving the model's ability to learn decision boundaries for the minority class.

Exam trap

The trap here is that candidates confuse data preparation techniques (like SMOTE) with training-time strategies (like weighted loss functions), leading them to select option C even though the question explicitly specifies applying a technique to the training dataset before training.

How to eliminate wrong answers

Option A is wrong because applying data augmentation by adding noise to the majority class does not address the imbalance; it only increases the size of the already dominant class, potentially worsening the imbalance and introducing irrelevant variance. Option C is wrong because using a weighted loss function is a training-time technique, not a data preparation technique; the question explicitly asks for a data preparation technique to apply to the training dataset before training. Option D is wrong because random under-sampling to match the minority class size would discard about 95% of the majority class data (90 of every 95 majority examples), leading to significant information loss and a high risk of underfitting, especially with a severe 95:5 imbalance.

Full explanation →

329

MCQmedium

A data scientist is preparing a large dataset for training a binary classification model. The dataset has a severe class imbalance (95% negative, 5% positive). Which data preparation technique should the scientist use to address this imbalance without losing too much data?

A.SMOTE (Synthetic Minority Over-sampling Technique)

B.Random undersampling of the majority class

C.Random oversampling of the minority class

D.Apply class weights during model training

AnswerA

Generates synthetic samples for the minority class.

Why this answer

SMOTE (Synthetic Minority Over-sampling Technique) is the best choice because it generates synthetic examples for the minority class by interpolating between existing minority instances and their k-nearest neighbors, rather than simply duplicating data. This addresses the severe 95:5 class imbalance without losing data (as undersampling would) and without the overfitting risk of naive random oversampling. The synthetic samples help the model learn a more general decision boundary for the positive class.

Exam trap

AWS often tests the distinction between data-level techniques (like SMOTE, oversampling, undersampling) and algorithm-level techniques (like class weights), and the trap here is that candidates confuse class weighting as a data preparation method when it is actually a model training adjustment, not a data transformation step.

How to eliminate wrong answers

Option B is wrong because random undersampling of the majority class discards a large portion of the dataset (up to 95% of the negative examples), which leads to significant information loss and can degrade model performance due to reduced training data. Option C is wrong because random oversampling of the minority class simply duplicates existing positive examples, which does not introduce new variability and often causes overfitting, especially when the minority class is very small (5%). Option D is wrong because applying class weights during model training is a cost-sensitive learning technique, not a data preparation technique; it adjusts the loss function to penalize misclassifications of the minority class more heavily, but the question specifically asks for a data preparation technique to address imbalance without losing data.

Full explanation →

330

MCQmedium

Refer to the exhibit. A data scientist receives the above error when running a SageMaker training job. Which action will resolve the issue?

A.Change the training instance type to ml.m5.xlarge

B.Add an S3 bucket policy granting s3:GetObject to the SageMaker role

C.Use s3g:// instead of s3:// in the data source URI

D.Increase the volume size in the ResourceConfig

AnswerB

Granting the missing permission allows SageMaker to read the training data.

Why this answer

Option B is correct because the error indicates that the SageMaker execution role lacks s3:GetObject permission on the training data. Granting that permission to the role through the S3 bucket policy resolves the issue. Option A (changing the instance type) is unrelated to S3 access.

Option D (increasing the volume size) does not affect S3 access. Option C (s3g://) is not a valid S3 URI scheme.
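A bucket policy that grants s3:GetObject to an execution role might look like the sketch below. The bucket name, prefix, account ID, and role ARN are hypothetical placeholders; depending on how the job lists its input, s3:ListBucket on the bucket may also be needed, which this sketch does not include.

```python
import json


def training_data_bucket_policy(bucket, role_arn, prefix=""):
    """Build a bucket policy allowing the given role to read objects under prefix."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowSageMakerRoleRead",
                "Effect": "Allow",
                "Principal": {"AWS": role_arn},
                "Action": "s3:GetObject",
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}*",
            }
        ],
    }


# Hypothetical names, for illustration only.
policy = training_data_bucket_policy(
    "example-training-bucket",
    "arn:aws:iam::123456789012:role/ExampleSageMakerRole",
    prefix="data/",
)
policy_json = json.dumps(policy, indent=2)
print(policy_json)
```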

Full explanation →

331

MCQmedium

A financial services company is building a fraud detection model using historical transaction data stored in Amazon S3. The data includes features such as transaction amount, merchant category, time of day, and user location. The data scientist observes that the 'merchant_category' column is a text attribute with over 200 unique values. Additionally, the 'transaction_amount' column has a long-tail distribution with extreme outliers. The dataset is 200 GB in size, and the company wants to use Amazon SageMaker for model training. The data scientist needs to engineer features that capture the high-cardinality category and reduce the impact of outliers. What is the MOST efficient and effective approach to prepare this data?

A.Use AWS Glue ETL to apply one-hot encoding to merchant_category and min-max scaling to transaction_amount.

B.Use Amazon EMR with Spark to apply ordinal encoding to merchant_category based on frequency, and log-transform the transaction_amount to reduce skewness.

C.Use Amazon Athena to bin transaction_amount into 10 equal-width bins and replace merchant_category with its count encoding.

D.Use AWS Glue DataBrew to apply a one-hot encoding on merchant_category and a standard scaler on transaction_amount after removing outliers.

AnswerB

Ordinal encoding handles high cardinality efficiently, and log transformation compresses extreme values, both reducing dimensionality and improving model performance.

Why this answer

Option B is correct because ordinal encoding based on frequency handles high-cardinality categorical features efficiently without exploding dimensionality, and log-transform is a standard technique to reduce skewness in long-tail distributions. Using Amazon EMR with Spark provides distributed processing for the 200 GB dataset, making it scalable and cost-effective compared to single-node alternatives.

Exam trap

The trap here is that candidates often default to one-hot encoding for categorical data without considering cardinality, and assume scaling methods like min-max or standard scaling are always appropriate, ignoring the impact of outliers on these transformations.

How to eliminate wrong answers

Option A is wrong because one-hot encoding on a column with over 200 unique values would create over 200 sparse columns, dramatically increasing memory and training time, and min-max scaling is sensitive to outliers, which would compress the majority of values into a narrow range. Option C is wrong because equal-width binning on a long-tail distribution will result in most data falling into the first few bins, losing information, and count encoding alone may not capture the ordinal relationship implied by frequency. Option D is wrong because one-hot encoding again suffers from high dimensionality, standard scaling is not robust to outliers (it uses mean and standard deviation), and removing outliers arbitrarily can discard valuable fraud signals.
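The two transformations can be illustrated on a handful of rows. The sketch below uses plain Python rather than Spark on EMR, so it only shows the logic, not the distributed execution the question calls for; the category names, amounts, and function names are made up.

```python
import math
from collections import Counter


def frequency_rank_encoding(values):
    """Map each category to its rank by frequency (1 = most frequent)."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda c: (-counts[c], c))  # ties broken by name
    return {category: rank for rank, category in enumerate(ordered, start=1)}


def log_amounts(amounts):
    """Compress a long-tailed, non-negative amount column with log(1 + x)."""
    return [math.log1p(a) for a in amounts]


categories = ["grocery", "fuel", "grocery", "travel", "grocery", "fuel"]
ranks = frequency_rank_encoding(categories)
encoded = [ranks[c] for c in categories]

amounts = [12.5, 40.0, 8.0, 25000.0]   # one extreme outlier
logged = log_amounts(amounts)
```

The category column stays a single integer column however many merchant categories exist, and the outlier's distance from typical amounts shrinks from a factor of thousands to a factor of a few.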

Full explanation →

332

MCQeasy

A marketing company is preparing a dataset to train a logistic regression model to predict whether a customer will click on an online ad. The dataset includes 1 million records with features: customer_age (numeric), income (numeric), education_level (ordinal: high school, bachelor, master, PhD), and ad_category (categorical: 50 unique values). The data is stored in a CSV file in Amazon S3. The data scientist plans to use Amazon SageMaker's built-in linear learner algorithm. The data scientist needs to preprocess the data before training. What is the correct sequence of data preparation steps that should be applied to this dataset to ensure optimal model performance?

A.Drop any duplicate records, apply min-max scaling to all numeric features, and use target encoding for ad_category based on click rates.

B.Apply PCA to all numeric and categorical features after converting categories to numeric indices, then standardize the principal components.

C.Apply min-max scaling to customer_age and income, label encode education_level and ad_category, then use recursive feature elimination to reduce dimensionality.

D.Standardize customer_age and income to have zero mean and unit variance, one-hot encode ad_category, ordinal encode education_level (e.g., map to 1-4), then combine all features into a feature matrix.

AnswerD

Standardization helps linear models converge faster; one-hot encoding is standard for a nominal feature such as ad_category (50 values); ordinal encoding preserves the natural order of education_level.

Why this answer

Option D is correct because it applies appropriate preprocessing for a logistic regression model using SageMaker's linear learner. Standardizing numeric features (zero mean, unit variance) is essential for linear models to ensure convergence and equal feature influence. One-hot encoding the categorical ad_category (50 unique values) avoids imposing ordinal relationships, while ordinal encoding education_level respects its natural order.

This combination prepares a feature matrix suitable for the linear learner's optimization.

Exam trap

The trap here is that candidates often choose label encoding for all categorical features (Option C) or target encoding (Option A) without considering the ordinal nature of education_level or the risk of data leakage, leading to suboptimal model performance.

How to eliminate wrong answers

Option A is wrong because min-max scaling is not optimal for linear models (it does not center data, which can slow convergence), and target encoding ad_category based on click rates introduces data leakage (future information) and risks overfitting. Option B is wrong because applying PCA to categorical features after converting to numeric indices is inappropriate (PCA assumes linear relationships and continuous data), and standardizing principal components is redundant since PCA already produces uncorrelated components. Option C is wrong because label encoding ad_category (50 unique values) imposes false ordinal relationships, and recursive feature elimination is computationally expensive and unnecessary for this dataset size; min-max scaling also lacks centering for linear models.
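The preprocessing sequence can be sketched end to end on a few records. This plain-Python version is illustrative only: the records and the three-value ad vocabulary are made up (the real feature has 50 values), and in practice the mean and standard deviation would be fitted on the training split.

```python
from statistics import mean, pstdev

EDUCATION_ORDER = {"high school": 1, "bachelor": 2, "master": 3, "PhD": 4}


def standardize(values):
    """Scale values to zero mean and unit variance."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def one_hot(value, vocabulary):
    """Return a 0/1 vector with a single 1 at the position of `value`."""
    return [1.0 if value == v else 0.0 for v in vocabulary]


def build_feature_matrix(records, ad_vocabulary):
    """Combine standardized numerics, ordinal education, and one-hot ad_category."""
    ages = standardize([r["customer_age"] for r in records])
    incomes = standardize([r["income"] for r in records])
    rows = []
    for record, age, income in zip(records, ages, incomes):
        education = float(EDUCATION_ORDER[record["education_level"]])
        rows.append([age, income, education]
                    + one_hot(record["ad_category"], ad_vocabulary))
    return rows


records = [
    {"customer_age": 25, "income": 40000, "education_level": "bachelor", "ad_category": "sports"},
    {"customer_age": 40, "income": 85000, "education_level": "PhD", "ad_category": "travel"},
    {"customer_age": 55, "income": 60000, "education_level": "high school", "ad_category": "food"},
]
ad_vocabulary = ["food", "sports", "travel"]
matrix = build_feature_matrix(records, ad_vocabulary)
```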

Full explanation →

333

MCQhard

A team is deploying a deep learning model on a SageMaker real-time endpoint. The model has high memory requirements, and the team wants to minimize instance cost while ensuring the endpoint can handle up to 10 concurrent requests. They plan to use a single ml.p3.2xlarge instance (8 vCPUs, 61 GB memory). Which SageMaker endpoint configuration will allow the endpoint to handle 10 concurrent requests without errors?

A.Disable ModelServerWorkers to reduce overhead.

B.Set the initial instance count to 1 and configure the container to use multiple ModelServerWorkers.

C.Set the initial variant weight to 10.

D.Set the initial instance count to 10 in the production variant.

AnswerB

Multiple workers allow the instance to handle multiple requests concurrently, up to the CPU/memory limit.

Why this answer

Option B is correct because SageMaker's ModelServerWorkers (MSWs) allow a single container to handle multiple inference requests concurrently by running multiple worker processes. With 8 vCPUs on ml.p3.2xlarge, configuring multiple MSWs (e.g., 8 workers) lets the endpoint process several requests in parallel; each worker handles one request at a time, and requests beyond the worker count wait briefly in the model server's queue rather than failing. This minimizes cost by using a single instance while meeting the requirement of 10 concurrent requests.

Exam trap

The trap here is confusing concurrency mechanisms: candidates often think increasing instance count (Option D) is the only way to handle concurrent requests, but SageMaker's ModelServerWorkers allow a single instance to serve multiple requests in parallel, which is more cost-effective.

How to eliminate wrong answers

Option A is wrong because disabling ModelServerWorkers would force the container to use a single worker, limiting concurrency to 1 request at a time, which cannot handle 10 concurrent requests. Option C is wrong because initial variant weight controls traffic distribution across multiple variants, not concurrency or instance count; setting it to 10 does not increase the number of instances or workers. Option D is wrong because setting the initial instance count to 10 would deploy 10 instances, which is unnecessary and costly for handling 10 concurrent requests, and does not address the goal of minimizing cost.
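One way the worker count is commonly set is through a container environment variable. The sketch below assumes a serving container that honors SAGEMAKER_MODEL_SERVER_WORKERS (as AWS framework containers built on the SageMaker inference toolkit generally do); a custom container may ignore it. The image URI, bucket, and function name are hypothetical.

```python
def container_with_workers(image_uri, model_data_url, workers):
    """Container definition that requests a specific number of model server workers."""
    return {
        "Image": image_uri,
        "ModelDataUrl": model_data_url,
        "Environment": {
            # Assumption: the serving stack in this image reads this variable.
            "SAGEMAKER_MODEL_SERVER_WORKERS": str(workers),
        },
    }


# Hypothetical values; 8 workers mirrors the 8 vCPUs mentioned in the question.
container = container_with_workers(
    "123456789012.dkr.ecr.us-east-1.amazonaws.com/example-model:latest",
    "s3://example-bucket/model/model.tar.gz",
    workers=8,
)

# With boto3 this would typically be passed as PrimaryContainer=container to
# boto3.client("sagemaker").create_model(...), with InitialInstanceCount=1
# set on the production variant in the endpoint configuration.
```

Each worker loads its own copy of the model, so the worker count is bounded by available memory as well as by vCPUs.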

Full explanation →

334

MCQmedium

A financial services company uses SageMaker to train a fraud detection model. They have imbalanced data with 1% fraud. They trained a Gradient Boosting model using SMOTE for oversampling and achieved 99% accuracy on the test set, but the fraud recall is only 10%. The data scientist is concerned about the model's performance. Which change is most likely to improve fraud recall without sacrificing too much precision?

A.Use a different evaluation metric like F1-score during training.

B.Increase the weight of the fraud class in the loss function.

C.Reduce the SMOTE sampling ratio to create more synthetic samples.

D.Use a random undersampling of the majority class.

AnswerB

Correct: Class weighting focuses the model on the minority class, improving recall.

Why this answer

Option B is correct because increasing the weight of the fraud class in the loss function penalizes missed fraud cases more heavily, which improves recall. Option A is wrong because reporting F1-score as an evaluation metric does not by itself change the training objective. Option C is wrong because reducing the SMOTE sampling ratio means less oversampling, which would likely reduce recall further.

Option D is wrong because random undersampling may discard important majority-class data, reducing precision.
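The effect of class weighting on the loss can be shown with a small weighted log-loss function. This is a plain-Python sketch with made-up probabilities; the weight of 99 simply mirrors the 99:1 class ratio in the question. In gradient boosting libraries such as XGBoost, a similar effect is usually obtained with a parameter like `scale_pos_weight`.

```python
import math


def weighted_log_loss(labels, probs, positive_weight=1.0):
    """Mean binary log loss where errors on positive (fraud) rows are scaled up."""
    eps = 1e-12
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1 - eps)  # guard against log(0)
        if y == 1:
            total += -positive_weight * math.log(p)
        else:
            total += -math.log(1 - p)
    return total / len(labels)


# One fraud case that the model scores at only 0.1, plus three legitimate rows.
labels = [1, 0, 0, 0]
probs = [0.1, 0.1, 0.1, 0.1]

unweighted = weighted_log_loss(labels, probs)
weighted = weighted_log_loss(labels, probs, positive_weight=99.0)
print(round(unweighted, 3), round(weighted, 3))
```

With the weight applied, the missed fraud case dominates the loss, so the optimizer is pushed to raise fraud scores even at the cost of some false positives.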

Full explanation →

335

MCQeasy

A company deployed a machine learning model on an Amazon SageMaker real-time endpoint. Over several weeks, they notice that inference latency has been gradually increasing, especially during peak business hours. The model and instance type have remained unchanged. What is the most likely cause of the increased latency?

A.The inference script is not using batch processing.

B.The SageMaker endpoint auto scaling is not configured to scale out quickly enough under increasing traffic.

C.The model size is too large for the instance type.

D.The endpoint has data capture enabled, causing additional overhead.

AnswerB

If auto scaling policies are too conservative, the endpoint may not add instances fast enough during traffic spikes, leading to increased latency.

Why this answer

Option B is correct because latency that increases gradually and peaks during business hours suggests that the endpoint is not scaling out adequately as traffic grows. Option C is incorrect because the model and instance type have not changed, so model size cannot explain a gradual increase. Option D is incorrect because data capture does not inherently cause latency to increase over time.

Option A is incorrect because batch processing is not used for real-time endpoints.

336

MCQeasy

A data science team deploys a regression model to Amazon SageMaker for real-time inference. After one month, the model's prediction errors increase significantly, but data distributions remain unchanged. Which monitoring approach is MOST suitable for detecting this issue?

A.Set up Amazon SageMaker Model Monitor to track model performance metrics against ground truth labels as they arrive.

B.Use Amazon SageMaker Clarify to monitor feature attribution drift.

C.Enable Amazon CloudWatch to monitor model endpoint latency.

D.Configure Amazon SageMaker Model Monitor to track data drift on the input features.

AnswerA

Model performance monitoring directly detects concept drift by comparing predictions to actuals.

Why this answer

Option A is correct because the model's prediction errors increased despite no data drift, indicating concept drift. Model performance monitoring compares predictions against ground truth. Option D is wrong because data drift monitoring tracks input distribution changes, not concept drift.

Option B is wrong because feature attribution drift is a type of data drift. Option C is wrong because endpoint latency is an operational metric, not a measure of accuracy.

337

Multi-Selecteasy

A machine learning engineer is setting up an Amazon SageMaker notebook instance. The instance needs to access a private S3 bucket that contains training data. The notebook instance is in a VPC. Which combination of steps will grant access to the S3 bucket? (Choose TWO.)

Select 2 answers

A.Create a VPC endpoint for S3 in the same VPC and subnet.

B.Assign a public IP address to the notebook instance.

C.Set up a NAT gateway in the public subnet.

D.Create an IAM role with S3 access permissions and attach it to the notebook instance.

E.Attach an internet gateway to the VPC.

AnswersA, D

Allows private connectivity to S3.

Why this answer

Options A and D are correct. The notebook needs an IAM role with S3 permissions (D) and a VPC endpoint for S3 (A) to access the bucket privately. Option E is wrong because an internet gateway is not needed when using a VPC endpoint.

Option B is wrong because assigning a public IP is not necessary for private access. Option C is wrong because a NAT gateway is not required when using a VPC endpoint and would add complexity.

338

MCQeasy

A machine learning engineer is preparing a dataset that contains both numerical and categorical features. The categorical features have high cardinality (e.g., zip code with thousands of unique values). Which technique is most appropriate for encoding these high-cardinality categorical features?

A.Label encoding

B.One-hot encoding

C.Frequency encoding

D.Target encoding

AnswerD

Encodes using target mean, handles high cardinality well.

Why this answer

Target encoding is the most appropriate technique for high-cardinality categorical features because it replaces each category with the mean of the target variable for that category, effectively capturing the predictive signal while keeping the feature as a single numeric column. This avoids the dimensionality explosion of one-hot encoding and the arbitrary ordinality of label encoding, making it a common choice in gradient boosting frameworks like XGBoost or LightGBM for datasets with thousands of unique categories.

Exam trap

AWS often tests the misconception that one-hot encoding is always the safest choice for categorical data, but candidates fail to recognize that high cardinality makes it impractical, leading them to overlook target encoding as a more efficient alternative.

How to eliminate wrong answers

Option A is wrong because label encoding assigns arbitrary integer values to categories, which introduces a false ordinal relationship that can mislead tree-based models into treating high-cardinality features as ordered, degrading performance. Option B is wrong because one-hot encoding creates a binary column for each unique category, which with thousands of categories leads to an extremely high-dimensional and sparse feature space, causing memory issues and overfitting. Option C is wrong because frequency encoding replaces categories with their occurrence counts, which loses the relationship between the category and the target variable, often resulting in weaker predictive power compared to target encoding.
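
A minimal sketch of target encoding in plain Python follows. The function names, the sample zip codes, and the `smoothing` parameter are assumptions made for illustration (smoothing toward the global mean is a common refinement, not something the question requires).

```python
from collections import defaultdict


def fit_target_encoding(categories, targets, smoothing=10.0):
    """Map each category to a (smoothed) mean of the target for that category."""
    global_mean = sum(targets) / len(targets)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for cat, target in zip(categories, targets):
        sums[cat] += target
        counts[cat] += 1
    # Rare categories are pulled toward the global mean as smoothing grows.
    encoding = {
        cat: (sums[cat] + smoothing * global_mean) / (counts[cat] + smoothing)
        for cat in counts
    }
    return encoding, global_mean


def transform(categories, encoding, default):
    """Replace categories with their encoded value; unseen ones get the default."""
    return [encoding.get(cat, default) for cat in categories]


zip_codes = ["10001", "10001", "94105", "94105", "94105", "60601"]
target = [1, 0, 1, 1, 1, 0]

# smoothing=0.0 gives the plain per-category mean described above.
encoding, global_mean = fit_target_encoding(zip_codes, target, smoothing=0.0)
encoded = transform(["10001", "94105", "99999"], encoding, default=global_mean)
```

The feature stays a single numeric column no matter how many zip codes exist. In practice the encoding should be fitted on training folds only, since computing it on rows that are later scored leaks the target into the feature.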

339

MCQmedium

An ML team is developing a regression model using Amazon SageMaker. They have a 100 GB CSV dataset stored in Amazon S3. The data is contained in a single large file. They launch a SageMaker training job with an ml.p3.8xlarge instance using a custom Docker container. The training script loads the data using pandas' read_csv from S3 directly. The team observes that the training job takes over 24 hours, and CloudWatch metrics show: GPU utilization is consistently above 90%, but CPU utilization is below 30%. Network I/O is moderate, and disk I/O is low. The team has already tried switching to a larger instance type (ml.p3.16xlarge) with no significant improvement. They need to reduce training time. Which action is MOST likely to achieve this?

A.Use SageMaker Pipe Mode to stream data directly from S3 to the algorithm, bypassing the local file system.

B.Split the CSV file into multiple smaller files (e.g., 100 MB each) and update the training script to read from a list of files in S3.

C.Use Amazon SageMaker Managed Spot Training to reduce cost, then use the savings to rent a larger instance.

D.Increase the number of training instances by using a distributed training configuration with Horovod.

AnswerB

This allows SageMaker to parallelize data loading across multiple instances or even multiple processes within one instance, improving I/O throughput.

Why this answer

The bottleneck is data loading. The single large CSV file prevents parallelism; SageMaker's Pipe mode streams data directly to the algorithm, but custom containers must support it. However, a simpler and effective approach is to split the data into multiple smaller files, enabling SageMaker's distributed data loading across instances and improving I/O parallelism.

Increasing the instance count with a single file doesn't help because each instance still reads the same file. Changing the instance type was already tried without significant improvement. Spot instances reduce cost, not training time.

EBS volume doesn't matter.
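
The splitting step can be sketched with the standard library alone. This is an illustrative local version under stated assumptions: `split_csv`, the `part-00000.csv` naming, and the tiny demo file are hypothetical, and a real 100 GB file would be split by size and uploaded back to S3 instead of being written to a temp directory.

```python
import csv
import os
import tempfile


def split_csv(src_path, out_dir, rows_per_shard):
    """Split one CSV into shards, repeating the header in every shard."""
    os.makedirs(out_dir, exist_ok=True)
    shard_paths = []
    out = None
    writer = None
    with open(src_path, newline="") as src:
        reader = csv.reader(src)
        header = next(reader)
        for i, row in enumerate(reader):
            if i % rows_per_shard == 0:  # time to start a new shard
                if out is not None:
                    out.close()
                path = os.path.join(out_dir, f"part-{len(shard_paths):05d}.csv")
                out = open(path, "w", newline="")
                writer = csv.writer(out)
                writer.writerow(header)
                shard_paths.append(path)
            writer.writerow(row)
    if out is not None:
        out.close()
    return shard_paths


def count_data_rows(path):
    """Number of rows in a shard, excluding its header."""
    with open(path, newline="") as f:
        return sum(1 for _ in csv.reader(f)) - 1


# Demo: a 10-row file split into shards of at most 4 rows.
work_dir = tempfile.mkdtemp()
src_path = os.path.join(work_dir, "data.csv")
with open(src_path, "w", newline="") as f:
    w = csv.writer(f)
    w.writerow(["feature", "label"])
    w.writerows([[i, i % 2] for i in range(10)])

shards = split_csv(src_path, os.path.join(work_dir, "shards"), rows_per_shard=4)
shard_rows = [count_data_rows(p) for p in shards]
```

Because every shard carries its own header, each one can be read independently by a separate worker or process.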

340

MCQmedium

A trained model needs to be deployed for real-time inference with low latency. Which AWS service is best suited for this?

A.SageMaker Batch Transform

B.SageMaker endpoints

C.SageMaker Hyperparameter Tuning

D.AWS Lambda with model packaged

AnswerB

Endpoints are designed for real-time inference with automatic scaling and low latency.

Why this answer

SageMaker endpoints provide managed, scalable, and low-latency real-time inference. Batch Transform is for offline inference, Hyperparameter Tuning is for training, and Lambda is for serverless but lacks native ML optimizations.

341

MCQhard

A financial services company is training a large natural language processing (NLP) model using PyTorch on a SageMaker distributed training job. The cluster consists of 4 ml.p3.16xlarge instances (8 GPUs each). The training job runs successfully but takes 72 hours, exceeding the allotted 48-hour window. The team must reduce training time without sacrificing model quality. The model architecture has 1.5 billion parameters and currently uses the SageMaker data parallel library with Horovod for all-reduce. Observing CloudWatch metrics, the team notices that GPU utilization averages only 45% and network throughput is near maximum. Which action will most effectively reduce training time?

A.Enable Elastic Fabric Adapter (EFA) for faster inter-node connectivity.

B.Increase the batch size to improve GPU utilization.

C.Increase the number of instances from 4 to 8 to add more GPUs.

D.Switch to SageMaker model parallel library with pipeline parallelism to reduce communication overhead.

AnswerD

Model parallelism partitions the model across devices, reducing communication volume and improving utilization.

Why this answer

Option D is correct because with low GPU utilization and high network bandwidth consumption, the bottleneck is likely communication overhead. Model parallelism splits the model across GPUs, reducing the need for frequent all-reduce of large gradients, thus improving GPU utilization. Option C is wrong because increasing the instance count would increase communication overhead and likely not improve utilization.

Option B is wrong because data parallelism already uses the GPUs; increasing the batch size may cause memory overflow. Option A is wrong because enabling EFA improves the network, but the network is already near maximum; the bottleneck is not network speed but the frequency of communication.

342

MCQeasy

Refer to the exhibit. The Glue job reads a CSV file and attempts to write to a Parquet table. What is the most likely cause of this error?

A.The 'price' column is missing from some rows

B.The schema inference incorrectly detected the column as String

C.The 'price' column contains non-numeric values in some rows

D.The CSV file is compressed and not properly decompressed

AnswerC

Non-numeric strings like 'N/A' or commas cause conversion errors.

Why this answer

Option C is correct because the error message indicates a 'NumberFormatException' when parsing the 'price' column, which occurs when Spark attempts to convert a string value to a numeric type. Since the Glue job's schema inference likely detected 'price' as a numeric column based on the majority of rows, any row containing a non-numeric value (e.g., 'N/A', 'null', or a currency symbol) will cause this parsing failure during the write to Parquet.

Exam trap

AWS often tests the distinction between schema inference behavior and runtime type conversion errors, where candidates mistakenly attribute the error to missing data or schema detection rather than the actual parsing failure caused by malformed values.

How to eliminate wrong answers

Option A is wrong because missing values in a column would result in a null value, not a NumberFormatException; Spark can handle nulls in numeric columns without throwing a parsing error. Option B is wrong because if the schema inference had incorrectly detected the column as String, the write to Parquet would succeed without any type conversion error; the error occurs only when Spark tries to parse a string as a number. Option D is wrong because compressed CSV files are automatically decompressed by Spark/Glue based on the file extension (e.g., .gz, .bz2), and a decompression issue would produce an IOException or a different error, not a NumberFormatException.
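
To make the failure mode concrete, here is a small plain-Python sketch of defensive numeric parsing. It is not Glue or Spark code; `parse_price` and the sample values are hypothetical, and the choice to strip `$` and thousands separators is an assumption about what "cleanable" means for this column.

```python
def parse_price(raw):
    """Return the value as a float, or None when it is not a readable number."""
    if raw is None:
        return None
    cleaned = raw.strip().replace("$", "").replace(",", "")
    try:
        return float(cleaned)
    except ValueError:
        # Values such as 'N/A' or an empty string land here instead of
        # aborting the whole job with a parsing exception.
        return None


rows = ["19.99", "N/A", "$1,250.00", "", "7"]
parsed = [parse_price(r) for r in rows]

# Keep the offending raw values so they can be inspected or quarantined.
bad_rows = [r for r, p in zip(rows, parsed) if p is None]
```

Mapping unparseable values to nulls (and logging them) before the typed write is one way to avoid the exception while keeping the column numeric.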

343

MCQmedium

A data scientist is using Amazon SageMaker Processing to run a feature engineering job. The job requires installing additional Python libraries not included in the default SageMaker containers. Which approach should the data scientist use to include these libraries?

A.Add the libraries to the `requirements.txt` file in the same S3 bucket as the script

B.Create a custom Docker image with the libraries installed and specify its image URI for the processing job

C.Use Amazon EFS to store the libraries and mount them to the processing container

D.Use the `pip install` command within the processing script at runtime

AnswerB

A custom image ensures dependencies are available without runtime installation.

Why this answer

Option B is correct because SageMaker Processing jobs run in isolated containers that cannot install packages at runtime via pip without internet access or custom images. Creating a custom Docker image with the required libraries pre-installed ensures the environment is consistent, reproducible, and avoids dependency resolution failures during job execution. This approach aligns with SageMaker's best practice for custom dependencies.

Exam trap

The trap here is that candidates assume SageMaker containers have internet access by default or that a `requirements.txt` in S3 is automatically processed, but in reality, SageMaker Processing jobs often run in isolated subnets without outbound internet, making pip install impossible without a pre-built custom image.

How to eliminate wrong answers

Option A is wrong because a `requirements.txt` file in S3 is not automatically processed by SageMaker Processing; the container does not read it unless explicitly handled in a custom entry point or lifecycle script, and even then, pip install requires network access or a pre-built wheel. Option C is wrong because Amazon EFS is a file system for shared storage, not for distributing Python libraries; mounting EFS to a processing container would require custom network configuration and does not integrate with Python's import system without additional setup. Option D is wrong because `pip install` inside the processing script at runtime will fail if the container lacks internet access (common in VPC-only modes) or if the required build tools are missing, and it violates the principle of immutable infrastructure.

344

MCQhard

A data science team is using Amazon SageMaker Pipelines to orchestrate a multi-step workflow that includes data preprocessing, training, and model evaluation. They want to reuse the preprocessed data across multiple pipeline executions without re-running the preprocessing step if the source data hasn't changed. What should they configure?

A.Use SageMaker Training steps with checkpointing

B.Use SageMaker Processing steps with caching

C.Use SageMaker Feature Store to store the preprocessed features

D.Use SageMaker Data Wrangler for the preprocessing

AnswerB

Caching in SageMaker Pipelines reuses step outputs when inputs are identical, avoiding redundant computation.

Why this answer

SageMaker Pipelines supports step caching, which allows reusing the output of a step if its inputs and parameters are unchanged. SageMaker Feature Store is for feature storage and serving, not for pipeline step reuse. Checkpointing is for training resumption, not step caching.

Data Wrangler can perform the preprocessing, but reusing step outputs across executions is a caching feature of Pipelines, not of Data Wrangler.

345

MCQmedium

A data scientist is training a large model on SageMaker and wants to reduce training time by using multiple GPUs. The model is small enough to fit on a single GPU but training is slow. Which SageMaker feature should be used?

A.Data parallelism using SageMaker's Distributed Data Parallel

B.Use a larger instance with more vCPUs

C.Model parallelism using SageMaker's Model Parallel

D.Use Elastic Inference

AnswerA

Data parallelism distributes the training across multiple GPUs, reducing training time for models that fit on a single GPU.

Why this answer

Option A is correct because SageMaker's Distributed Data Parallel (DDP) replicates the model across multiple GPUs and splits mini-batches, speeding up training for models that fit on a single GPU. Option C (Model Parallel) is for models too large for one GPU. Option B (larger instance) may provide more vCPUs but not necessarily more GPUs.

Option D (Elastic Inference) accelerates inference, not training.

346

MCQhard

A data science team at a financial services company is deploying a real-time fraud detection model using Amazon SageMaker. The model is a gradient boosting classifier trained on historical transaction data. The model is deployed to a SageMaker endpoint with an ml.m5.large instance for real-time inference. After deployment, the team observes that the endpoint's latency spikes to over 2 seconds during peak hours (10:00-12:00 and 14:00-16:00), causing timeouts for client applications. The average latency during off-peak hours is 200 ms. The team has enabled auto-scaling with a target average CPU utilization of 70%, but the endpoint still experiences high latency during peak hours. The instance count never scales beyond 2 instances during peaks. The model size is 500 MB, and each request includes 200 features. The team needs to reduce latency to under 500 ms at the 99th percentile during peak hours without increasing costs beyond the current budget. Which course of action should the team take?

A.Configure SageMaker batch transform for the real-time endpoint to process requests asynchronously.

B.Increase the auto-scaling maximum instance count to 10 and set target CPU utilization to 50%.

C.Switch the endpoint instance type to a GPU instance such as ml.g4dn.xlarge to accelerate inference.

D.Enable data compression on the endpoint to reduce payload size and network latency.

AnswerC

GPU instances can accelerate inference for gradient boosting models by parallelizing computations, reducing per-request latency significantly.

Why this answer

Option C is correct because GPU instances like ml.g4dn.xlarge are optimized for compute-intensive workloads such as gradient boosting inference, which involves numerous matrix operations. By offloading the computation to the GPU, the model can process each request faster, reducing latency from over 2 seconds to under 500 ms at the 99th percentile without increasing the instance count or budget. This directly addresses the root cause—CPU-bound inference during peak hours—while keeping costs stable.

Exam trap

The trap here is that candidates assume auto-scaling or instance count adjustments will solve latency issues, but the real bottleneck is per-instance compute capacity, which GPU acceleration directly addresses without increasing costs.

How to eliminate wrong answers

Option A is wrong because SageMaker batch transform is designed for offline, asynchronous processing of large datasets, not for real-time inference; it would introduce unacceptable delays and cannot meet the sub-500 ms latency requirement. Option B is wrong because increasing the maximum instance count to 10 and lowering CPU target to 50% would significantly increase costs (more instances running) and still not guarantee sub-500 ms latency if each instance is CPU-bound; the current scaling limit of 2 instances suggests the bottleneck is per-instance compute capacity, not scaling policy. Option D is wrong because data compression reduces payload size and network latency, but the primary latency spike is due to compute time (model inference), not network transfer; the 500 MB model and 200 features are already moderate, and compression would offer minimal improvement for the compute-bound bottleneck.

347

MCQhard

A company uses SageMaker training jobs that need to access data in an S3 bucket in a different AWS account. The bucket uses a bucket policy that allows access only from a specific VPC. How should they configure the training job?

A.Use AWS DataSync to copy data to the training account's S3.

B.Create an IAM role in the source account and assume it from the training account.

C.Use an S3 VPC endpoint in the training job's VPC and attach a bucket policy that allows the VPC.

D.Use cross-account access with an IAM role and add a bucket policy allowing the training job's VPC.

AnswerD

This combines IAM role assumption and VPC condition to meet both requirements.

Why this answer

The training job should be launched in a VPC with an S3 VPC endpoint, and the bucket policy must allow that VPC. Additionally, an IAM role in the source account with cross-account trust is needed. Option D combines both requirements.

348

MCQmedium

A machine learning engineer is training a deep learning model on SageMaker and notices that the training loss decreases rapidly in the first few epochs but then plateaus. The validation loss starts increasing after 10 epochs. Which action should the engineer take to improve generalization?

A.Add more layers to the model

B.Use early stopping with validation loss monitoring

C.Increase the learning rate

D.Decrease the batch size

AnswerB

Early stopping halts training when validation loss stops decreasing, reducing overfitting.

Why this answer

Early stopping is the correct action because the validation loss increasing after 10 epochs while the training loss no longer improves is a classic sign of overfitting. By monitoring validation loss and halting training when it stops improving (e.g., using a patience parameter), the engineer prevents the model from memorizing noise in the training data, thereby improving generalization. SageMaker's built-in training job features or an `EarlyStopping` callback such as the one in Keras or PyTorch Lightning can implement this directly.

Exam trap

AWS often tests the distinction between underfitting and overfitting symptoms, and the trap here is that candidates mistake a plateauing training loss for a need to increase model complexity or learning rate, when the rising validation loss clearly signals overfitting that early stopping can mitigate.

How to eliminate wrong answers

Option A is wrong because adding more layers increases model capacity, which would exacerbate overfitting and likely cause validation loss to rise even sooner, not improve generalization. Option C is wrong because increasing the learning rate would make training more unstable, potentially causing the loss to diverge or oscillate, and would not address the overfitting indicated by the rising validation loss. Option D is wrong because decreasing the batch size introduces more noise into gradient estimates, which can sometimes help escape local minima but does not directly prevent overfitting; it may even slow convergence and does not target the core issue of validation loss increasing.
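
The patience logic behind early stopping can be sketched in a few lines of plain Python. This is an illustrative helper, not a framework callback; `early_stopping_epoch`, `patience`, and `min_delta` are hypothetical names, and the loss values are made up to mimic a curve that starts rising.

```python
def early_stopping_epoch(val_losses, patience=2, min_delta=0.0):
    """Return (stop_epoch, best_epoch) for a sequence of validation losses.

    Training stops once the loss has failed to improve by more than
    min_delta for `patience` consecutive epochs.
    """
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            best_loss = loss
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch
    # Never triggered: training ran to the last epoch.
    return len(val_losses) - 1, best_epoch


# Validation loss improves, bottoms out at epoch 3, then rises.
val_losses = [0.90, 0.70, 0.60, 0.55, 0.57, 0.60, 0.66]
stop_epoch, best_epoch = early_stopping_epoch(val_losses, patience=2)
```

Here training halts at epoch 5, and the weights saved at epoch 3 (the best validation loss) are the ones to keep.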

349

MCQmedium

A company uses an Amazon SageMaker endpoint with auto-scaling. They notice that during traffic bursts, new instances take several minutes to become healthy, causing 503 errors. What is the BEST way to reduce the time to serve requests during scaling events?

A.Set up a scheduled scaling policy to pre-warm instances before known traffic bursts.

B.Decrease the cooldown period for the scaling policy to add instances faster.

C.Use a larger instance type so that fewer instances are needed, and the scaling threshold is triggered less often.

D.Increase the maximum number of instances to allow more capacity.

AnswerC

Larger instances can serve more traffic, reducing scaling events.

Why this answer

Option C is correct because using a larger instance type with more compute resources can handle more requests per instance, reducing the need to scale as aggressively. Option A is wrong because proactive scaling with a schedule can help but doesn't reduce the time to become healthy. Option B is wrong because decreasing the cooldown period could cause thrashing.

Option D is wrong because increasing the maximum number of instances doesn't speed up each instance's startup.

350

MCQhard

A company operates an IoT platform that ingests sensor data from thousands of devices. Data is streamed via Amazon Kinesis Data Streams and stored in an S3 bucket using a Kinesis Firehose delivery stream, which writes data in 5-minute windows. The data is then used to train a machine learning model for anomaly detection. Recently, the data science team noticed that the training dataset is always missing the last 5 minutes of events from the end of each day. The S3 objects show that the last delivery stream buffer window is incomplete. The data engineer checked the Kinesis Firehose metrics and found no delivery errors or data loss, but the 'IncomingBytes' and 'IncomingRecords' metrics show consistent data for all periods. The S3 bucket has Lifecycle policies that do not delete objects. The team suspects the issue is related to the data preparation pipeline. Which course of action would correctly resolve the missing data problem?

A.Increase the buffer size to 10 MB and reduce the buffer interval to 60 seconds in the Firehose delivery stream configuration

B.Reprocess the Kinesis stream data from the beginning using a custom application

C.Modify the data preparation pipeline to use AWS Lambda to write data to S3 directly from Kinesis

D.Increase the buffer interval to 600 seconds to allow more time for data to accumulate

AnswerA

Reducing the buffer interval to 60 seconds ensures that data is flushed every minute, preventing incomplete windows from being missed at the end of the day.

Why this answer

Option A is correct because the issue is that the last 5-minute buffer window at the end of each day never completes, so Firehose never delivers that final object to S3. By reducing the buffer interval to 60 seconds and increasing the buffer size to 10 MB, Firehose will flush data more frequently, ensuring that even small residual data at the end of the day is delivered before the stream stops. This directly addresses the incomplete last window without requiring reprocessing or changing the pipeline architecture.

Exam trap

The trap here is that candidates assume the missing data is due to data loss or pipeline errors, but the real issue is that Firehose's buffer window never completes when data stops arriving, so no S3 object is created for that final period.

How to eliminate wrong answers

Option B is wrong because reprocessing the entire Kinesis stream from the beginning is unnecessary and inefficient; the data is not lost, it is simply never delivered due to the buffer window not closing. Option C is wrong because switching to a Lambda-based direct write from Kinesis to S3 would bypass Firehose entirely, adding complexity and potential for data loss or duplication, and does not fix the root cause of the incomplete buffer window. Option D is wrong because increasing the buffer interval to 600 seconds would make the problem worse, as it would extend the time needed for a buffer window to complete, increasing the likelihood of incomplete windows at day boundaries.

351

Multi-Selecteasy

A company wants to monitor its Amazon SageMaker real-time endpoint for data quality issues. Which TWO actions should the company take?

Select 2 answers

A.Create a baseline from the training data to compare against live data.

B.Use SageMaker Debugger to analyze training jobs.

C.Set up an AWS Lambda function to preprocess incoming requests.

D.Configure Amazon S3 bucket notifications for model artifacts.

E.Enable data capture on the SageMaker endpoint.

AnswersA, E

A baseline provides the expected statistics and constraints for the data.

Why this answer

To monitor data quality with SageMaker Model Monitor, you need to enable data capture on the endpoint and create a baseline from the training data. The other options are not directly required for data quality monitoring.

352

MCQeasy

A data science team deploys a machine learning model to a SageMaker endpoint for real-time inference. They need to monitor the model for feature distribution drift over time to ensure the model's predictions remain accurate. Which AWS service should they use?

A.Amazon CloudWatch Evidently

B.AWS Glue DataBrew

C.SageMaker Clarify

D.SageMaker Model Monitor

E.SageMaker Debugger

AnswerD

Correct. SageMaker Model Monitor monitors data and model quality, including drift detection.

Why this answer

SageMaker Model Monitor is specifically designed to detect drift in feature distributions and prediction quality over time.

353

Multi-Selectmedium

A data scientist is training a binary classification model using Amazon SageMaker. The dataset is highly imbalanced (95% negative class, 5% positive class). The model is evaluated on a held-out test set, and the F1 score is 0.12. The data scientist wants to improve the F1 score. Which two actions should the data scientist take? (Choose two.)

Select 2 answers

A.Reduce the model complexity by decreasing the number of layers in a deep neural network.

B.Apply SMOTE (Synthetic Minority Oversampling Technique) to the training data using a preprocessing script in SageMaker Processing.

C.Increase the decision threshold to reduce false positives.

D.Use recall as the primary evaluation metric instead of F1.

E.Set the `scale_pos_weight` parameter in the SageMaker XGBoost estimator to the ratio of negative to positive samples.

AnswersB, E

Correct: SMOTE generates synthetic samples of the minority class, balancing the dataset and improving F1.

Why this answer

Setting scale_pos_weight in XGBoost adjusts the loss function to penalize misclassification of the minority class more heavily, improving recall and F1. SMOTE oversamples the minority class to balance the dataset. Option A reduces model capacity, which may not help; Option D changes the metric but doesn't fix the problem; Option C increases the threshold and likely reduces recall further.
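
For the 95%/5% split in this question, the ratio of negative to positive samples and a low F1 score like the one reported can be worked through in plain Python. The helper names and the prediction vector are hypothetical; only the class ratio comes from the question.

```python
def scale_pos_weight(labels):
    """Ratio of negative to positive samples, as described for XGBoost."""
    positives = sum(1 for y in labels if y == 1)
    negatives = len(labels) - positives
    return negatives / positives


def f1_score(y_true, y_pred):
    """F1 = 2*TP / (2*TP + FP + FN) for the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    return 2 * tp / (2 * tp + fp + fn)


# 5 positives and 95 negatives, matching the imbalance in the question.
y_true = [1] * 5 + [0] * 95
spw = scale_pos_weight(y_true)

# Made-up predictions: 1 of 5 positives found, plus 10 false alarms.
y_pred = [1] + [0] * 4 + [1] * 10 + [0] * 85
f1 = f1_score(y_true, y_pred)
```

The ratio works out to 19, which is the value that would be passed as `scale_pos_weight`. The illustrative predictions give an F1 of 0.125, showing how poor recall on the minority class keeps F1 low even when most rows are classified correctly.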

354

MCQeasy

A company is using SageMaker Pipelines to automate a multi-step ML workflow. The pipeline includes data preprocessing, training, and model evaluation. The team wants to ensure that if the evaluation step fails, the pipeline stops and sends an alert to the operations team. Which SageMaker Pipelines feature should they use?

A.Configure an Amazon CloudWatch Events rule to monitor the pipeline execution status and stop it if the evaluation step fails

B.Register the model in the Model Registry only if evaluation passes, and configure the pipeline to stop if registration fails

C.Add a Lambda step after the evaluation step that checks the evaluation metrics and sends an SNS notification if the metrics are below a threshold

D.Use a Condition step to check the evaluation result and route to a Fail step if the result indicates failure

AnswerD

Condition step allows branching; a Fail step terminates the pipeline and can trigger notifications via SNS.

Why this answer

Option D is correct because SageMaker Pipelines provides a built-in Condition step that evaluates a boolean expression (e.g., checking if evaluation metrics meet a threshold) and then routes execution to different steps. If the condition fails, you can direct the pipeline to a Fail step, which immediately stops the pipeline and marks it as failed. This is the native, event-driven way to halt a pipeline based on step output without relying on external services.

Exam trap

The trap here is that candidates often confuse external monitoring (CloudWatch) or post-step actions (Lambda) with native pipeline control flow, missing that SageMaker Pipelines has a dedicated Condition step for conditional branching and halting execution.

How to eliminate wrong answers

Option A is wrong because CloudWatch Events rules can monitor pipeline state changes but cannot stop a running pipeline; they can only trigger notifications or invoke other actions after the fact. Option B is wrong because registering a model in the Model Registry is an optional downstream step, not a mechanism to stop the pipeline; if registration fails, the pipeline would still continue to subsequent steps unless explicitly handled. Option C is wrong because a Lambda step can send SNS notifications but does not have the ability to halt the pipeline execution; it would only alert after the step completes, not prevent further steps from running.

355

Multi-Selecthard

Which THREE steps should be taken to optimize a large-scale distributed training job on SageMaker? (Choose 3.)

Select 3 answers

A.Attach multiple EBS volumes with throughput provisioning.

B.Use GPU instances with high bandwidth and memory (e.g., ml.p4d.24xlarge).

C.Enable batch transform for offline inference after training.

D.Use Elastic Fabric Adapter (EFA) for low-latency inter-node communication.

E.Select the appropriate distributed training strategy (e.g., Horovod, SageMaker data parallel, or model parallel).

AnswersB, D, E

GPU instances are necessary for large model training.

Why this answer

Options B, D, and E are correct. Using EFA (Elastic Fabric Adapter) reduces network latency, choosing the right distribution strategy (e.g., data parallelism vs. model parallelism) improves scaling, and using GPU-optimized instances provides high compute. Option A is wrong because attaching additional EBS volumes does not directly help distributed training performance.

Option C is wrong because batch transform is for inference, not training.

356

MCQmedium

A company is building a time series forecasting model using SageMaker DeepAR. The raw data is a CSV with columns: timestamp, item_id, and value. What is the correct data format required for DeepAR training?

A.JSON Lines files with 'start', 'target', and optional fields per time series

B.A wide-format CSV where each column is a different time series

C.Parquet files with a schema containing timestamp, item_id, and value

D.A single CSV file with columns: timestamp, item_id, value

AnswerA

DeepAR's training data format is JSON Lines with start timestamp and target array.

Why this answer

DeepAR requires time series data to be provided in JSON Lines format, where each line represents a single time series with a 'start' timestamp (in ISO 8601 format), a 'target' array of values, and optional fields like 'cat' for categorical features. This structured format allows DeepAR to handle variable-length sequences and missing values natively, which is not possible with simple CSV or wide-format data.

Exam trap

The trap here is that candidates assume DeepAR can accept raw CSV data like other SageMaker built-in algorithms (e.g., XGBoost), but DeepAR is a specialized time series algorithm that requires a specific JSON Lines structure with 'start' and 'target' fields, not a simple tabular format.

How to eliminate wrong answers

Option B is wrong because wide-format CSV (each column as a separate time series) is not supported by DeepAR; it expects each time series to be a separate JSON object, not a column. Option C is wrong because the schema, not the file type, is the problem: DeepAR expects one record per time series with 'start' and 'target' fields (its documented input formats are JSON Lines and Parquet), whereas a timestamp, item_id, value schema stores one row per observation. Option D is wrong because a single CSV with timestamp, item_id, and value columns does not provide the 'start' and 'target' structure DeepAR needs; it would require preprocessing to group rows by item_id and convert each group to the required JSON Lines format.
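
The preprocessing that a long-format CSV would need can be sketched in a few lines. This is a minimal sketch, not a definitive implementation: the sample rows, the item IDs, and the function name csv_to_deepar_jsonl are hypothetical, and it assumes each series is already evenly spaced with no gaps.

```python
import csv
import io
import json
from collections import defaultdict

# Hypothetical sample standing in for the raw CSV (timestamp, item_id, value).
RAW_CSV = """timestamp,item_id,value
2024-01-01 00:00:00,sku_1,10
2024-01-02 00:00:00,sku_1,12
2024-01-03 00:00:00,sku_1,9
2024-01-01 00:00:00,sku_2,3
2024-01-02 00:00:00,sku_2,4
"""


def csv_to_deepar_jsonl(csv_text):
    """Group rows by item_id and emit one JSON object per time series."""
    series = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        series[row["item_id"]].append((row["timestamp"], float(row["value"])))

    lines = []
    for item_id in sorted(series):
        # Timestamps in this fixed-width format sort chronologically as strings.
        points = sorted(series[item_id])
        record = {
            "start": points[0][0],                      # first timestamp of the series
            "target": [value for _, value in points],   # values in time order
        }
        lines.append(json.dumps(record))
    return "\n".join(lines)


jsonl = csv_to_deepar_jsonl(RAW_CSV)
print(jsonl)
```

Each output line is one time series, which is why the original row-per-observation layout cannot be passed to the algorithm unchanged.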

357

MCQeasy

A data scientist is preparing a dataset for a machine learning model that predicts customer churn. The dataset contains a column 'CustomerID' that is a unique identifier. What should the data scientist do with this column before training the model?

A.Keep the column as a feature because it uniquely identifies each customer.

B.Use the column as the target variable.

C.Remove the column from the feature set.

D.Encode the column using one-hot encoding.

AnswerC

Removing unique identifiers prevents overfitting and is standard practice.

Why this answer

Option C is correct because 'CustomerID' is a unique identifier with no predictive power for churn. Including it as a feature would cause the model to memorize individual customers rather than learn generalizable patterns, leading to overfitting and poor performance on unseen data. In machine learning, such columns should be removed during data preparation to ensure the model learns from meaningful features.

Exam trap

The trap here is that candidates may think unique identifiers are useful for tracking or that they can be encoded as categorical features, but the exam tests the principle that identifiers with no predictive relationship to the target must be removed to avoid overfitting and data leakage.

How to eliminate wrong answers

Option A is wrong because keeping 'CustomerID' as a feature introduces a high-cardinality categorical variable with no correlation to the target, which can cause overfitting and degrade model generalization. Option B is wrong because the target variable for churn prediction should be a binary or categorical label indicating churn status, not a unique identifier that has no relationship to the outcome. Option D is wrong because one-hot encoding a unique identifier like 'CustomerID' would create thousands of sparse binary columns, dramatically increasing dimensionality without adding any predictive value, and is computationally wasteful.

358

MCQmedium

Refer to the exhibit. A SageMaker Processing job fails with the following error log. Which change during data preparation would resolve the issue?

A.In SageMaker Data Wrangler, set the 'age' column type to 'number'

B.Drop rows with missing values in the 'age' column before training

C.Remove the 'age' column from the dataset entirely

D.Modify the preprocessing script to cast 'age' to float using astype(float)

AnswerD

Casting the column ensures numeric operations work.

Why this answer

Option D is correct because the error log indicates a type mismatch when processing the 'age' column, likely due to mixed data types (e.g., strings and numbers) in a column expected to be numeric. By explicitly casting the column to float using astype(float) in the preprocessing script, you ensure consistent numeric type handling, which resolves the failure during SageMaker Processing job execution.

Exam trap

The trap here is that candidates often assume missing value handling (Option B) or column removal (Option C) is the fix, when the actual issue is a data type inconsistency that requires explicit type casting in the preprocessing code.

How to eliminate wrong answers

Option A is wrong because setting the 'age' column type to 'number' in SageMaker Data Wrangler only affects the visual interface and exported recipe, but does not enforce type casting in the actual processing script, so the underlying data type mismatch persists. Option B is wrong because dropping rows with missing values does not address the core issue of mixed data types (e.g., strings like 'N/A' or 'unknown') in the 'age' column; the error is about type conversion, not missing values. Option C is wrong because removing the 'age' column entirely discards potentially valuable feature data and does not solve the type mismatch problem; it is an overly aggressive workaround that reduces model performance.

359

MCQeasy

A company wants to deploy a machine learning model that was trained on-premises using TensorFlow. The model is a TensorFlow SavedModel. The company uses AWS and wants to minimize operational overhead. Which deployment option meets these requirements?

A.Deploy the model on Amazon ECS using a custom Docker image.

B.Deploy the model as an AWS Lambda function with the TensorFlow runtime.

C.Deploy the model using Amazon SageMaker Studio.

D.Deploy the model using Amazon SageMaker with a TensorFlow inference container.

AnswerD

SageMaker provides pre-built TensorFlow containers and manages the endpoint, reducing operational overhead.

Why this answer

Amazon SageMaker provides a fully managed TensorFlow inference container that directly supports TensorFlow SavedModel format, enabling deployment without any custom infrastructure management. This minimizes operational overhead compared to self-managed options like ECS or Lambda, as SageMaker handles scaling, load balancing, and model updates automatically.

Exam trap

AWS often tests the distinction between SageMaker Studio (an IDE) and SageMaker hosting (deployment endpoints), leading candidates to mistakenly select Studio as a deployment option when it is only for development and experimentation.

How to eliminate wrong answers

Option A is wrong because deploying on Amazon ECS with a custom Docker image requires you to build, maintain, and scale the container infrastructure yourself, increasing operational overhead. Option B is wrong because AWS Lambda has a maximum deployment package size limit (250 MB unzipped) and a 15-minute timeout, making it unsuitable for large TensorFlow SavedModels or inference requests that require significant compute. Option C is wrong because Amazon SageMaker Studio is an integrated development environment (IDE) for building, training, and debugging models, not a deployment target; the actual deployment would still require creating an endpoint, which is covered by Option D.

360

Multi-Selecthard

A company is deploying a machine learning model using Amazon SageMaker. The model is a large deep learning model that requires GPU for inference. The company expects unpredictable traffic patterns with occasional bursts. They want to minimize cost while ensuring low latency during bursts. Which TWO actions should they take? (Select TWO.)

Select 2 answers

A.Use a serverless endpoint configuration to automatically scale.

B.Use a multi-model endpoint with a mix of CPU and GPU instances to handle variable traffic.

C.Use Spot instances for the endpoint to reduce cost.

D.Provision multiple on-demand GPU instances behind a load balancer.

E.Use Amazon SageMaker Elastic Inference to attach GPU acceleration to a CPU instance.

AnswersB, E

Multi-model endpoints allow efficient resource utilization and cost savings.

Why this answer

Option B is correct because a multi-model endpoint with a mix of CPU and GPU instances allows the company to host multiple models on the same endpoint, reducing cost by sharing underlying instances. By including GPU instances, the endpoint can handle the GPU-intensive deep learning inference for the large model, while the CPU instances can serve lighter loads or fallback traffic, ensuring low latency during unpredictable bursts without over-provisioning.

Exam trap

The trap here is that candidates often confuse serverless endpoints with GPU support, not realizing that SageMaker serverless endpoints are CPU-only, and they may overlook that multi-model endpoints can mix instance types to balance cost and performance for bursty GPU workloads.

361

Multi-Selectmedium

A company has deployed a SageMaker endpoint for real-time inference. The security team needs to monitor for potential security threats such as unauthorized access attempts and tampering with the model configuration. Which TWO actions should the team take? (Choose TWO.)

Select 2 answers

A.Enable AWS CloudTrail for the SageMaker endpoint API calls

B.Enable AWS Config to monitor endpoint configuration changes

C.Enable SageMaker Data Capture on the endpoint

D.Enable SageMaker Model Monitor for the endpoint

E.Enable Amazon GuardDuty for the endpoint

AnswersA, B

CloudTrail logs all API calls, providing an audit trail for security analysis.

Why this answer

CloudTrail logs all API calls to the SageMaker endpoint, including who made the call and from where, which helps identify unauthorized access. AWS Config continuously monitors endpoint configuration changes and can trigger alerts when changes are made without authorization. SageMaker Model Monitor is for data drift, not security.

Data Capture captures input/output for monitoring model performance, not security. GuardDuty is a threat detection service for AWS accounts and workloads, but it does not directly monitor SageMaker endpoints specifically.

362

MCQeasy

A machine learning engineer is building a regression model to predict house prices. The feature 'square_footage' has values ranging from 500 to 10,000, while 'num_bedrooms' ranges from 1 to 10. Which preprocessing step is most critical before training a model that uses gradient descent?

A.Standardize both features to have zero mean and unit variance.

B.Apply a logarithmic transformation to both features.

C.Encode the 'num_bedrooms' feature using one-hot encoding.

D.Impute missing values using the mean of the feature.

AnswerA

Standardization brings features to a common scale, crucial for gradient descent.

Why this answer

Gradient descent is sensitive to the scale of features because it updates weights proportionally to the feature values. With 'square_footage' (500–10,000) and 'num_bedrooms' (1–10), the large range difference causes the loss function's contours to be elongated, leading to slow or unstable convergence. Standardizing both features to zero mean and unit variance ensures each feature contributes equally to the gradient updates, enabling faster and more reliable optimization.

Exam trap

AWS often tests the distinction between scaling for gradient-based optimizers versus other preprocessing steps like encoding or transformation, trapping candidates who confuse feature scaling with handling outliers or categorical data.

How to eliminate wrong answers

Option B is wrong because applying a logarithmic transformation is not the most critical step for gradient descent; it is used to handle skewed distributions or multiplicative relationships, not to address feature scale differences. Option C is wrong because one-hot encoding is for categorical features, and 'num_bedrooms' is ordinal (integer-valued), not nominal; encoding it would create unnecessary sparsity and lose the natural ordering. Option D is wrong because imputing missing values is a general data cleaning step, but the question does not mention any missing data; the core issue here is feature scaling for gradient descent, not missingness.
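
A small worked example makes the effect of standardization concrete. This is a sketch under stated assumptions: the sample values are hypothetical (chosen only to fall inside the ranges in the question), and the helper name standardize is illustrative.

```python
import statistics

# Hypothetical feature values within the ranges given in the question.
square_footage = [500.0, 1500.0, 2500.0, 6000.0, 10000.0]
num_bedrooms = [1.0, 2.0, 3.0, 5.0, 10.0]


def standardize(values):
    """Return z-scores: (x - mean) / population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(x - mean) / std for x in values]


scaled_sqft = standardize(square_footage)
scaled_beds = standardize(num_bedrooms)

# After scaling, both features have mean 0 and standard deviation 1,
# so neither dominates the gradient because of its units.
for name, scaled in (("square_footage", scaled_sqft), ("num_bedrooms", scaled_beds)):
    print(name, [round(z, 2) for z in scaled])
```

In practice the mean and standard deviation would be computed on the training split only and then reused for validation and production data.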

363

MCQeasy

A company is building a recommendation system and has trained a matrix factorization model using SageMaker. They want to evaluate the model's performance using precision at k (P@k) and recall at k (R@k). They have a test set of user-item interactions. The data scientist implements a custom evaluation script that computes these metrics, but the precision values are consistently zero. What is the most likely cause?

A.The model outputs are not being ranked correctly.

B.The model is overfitting.

C.The test set contains only positive interactions.

D.The k value is too large.

AnswerC

Correct: Without negative examples, precision is undefined or zero if no test items are in the recommendation list.

Why this answer

Option C is correct because if the test set contains only positive interactions (items the user interacted with), there are no negative examples. Precision at k counts how many of the top-k recommended items appear among the user's test items, so if the recommended items do not match the test set items (which is likely), precision will be zero. Options A and D are incorrect because ranking errors or a large k value would not cause consistently zero precision unless there were no overlap at all. Option B is incorrect because overfitting would cause high training accuracy, not zero precision.
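
The two metrics can be written out directly to show why missing overlap drives precision to zero. This is a minimal sketch with hypothetical item IDs; the function names and the sample lists are assumptions, not the data scientist's actual script.

```python
def precision_at_k(recommended, relevant, k):
    """Fraction of the top-k recommended items that appear in the relevant set."""
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    return hits / k


def recall_at_k(recommended, relevant, k):
    """Fraction of the relevant items that appear in the top-k recommendations."""
    if not relevant:
        return 0.0
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    return hits / len(relevant)


# Hypothetical user: items held out in the test set vs. two ranked lists.
test_items = {"item_7", "item_9"}
overlapping = ["item_9", "item_2", "item_7", "item_4", "item_5"]
disjoint = ["item_1", "item_2", "item_3", "item_4", "item_5"]

print(precision_at_k(overlapping, test_items, 5))  # 2 hits out of 5
print(recall_at_k(overlapping, test_items, 5))     # both test items found
print(precision_at_k(disjoint, test_items, 5))     # no overlap, so zero
```

When a script reports zero for every user, checking whether the recommended lists and the test items ever intersect is a quick first diagnostic.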

364

MCQmedium

A machine learning engineer is troubleshooting a model that is producing unexpectedly low accuracy in production. The engineer examines the model's training data and finds that the distribution of the target variable in production is significantly different from the training set. What type of drift is the model experiencing?

A.Prior probability shift

B.Concept drift

C.Data drift

D.Covariate shift

AnswerB

Concept drift is a change in the statistical properties of the target variable.

Why this answer

Option B is correct because a change in the target variable distribution is concept drift. Option D is wrong because covariate shift is a change in the input distribution. Option A is wrong because prior probability shift is a type of concept drift, but not the best answer here. Option C is wrong because data drift is a general term.

365

MCQmedium

An e-commerce company uses Amazon SageMaker to deploy a real-time inference endpoint for product recommendations. The endpoint receives bursty traffic, with occasional spikes. The company wants to minimize cost while ensuring that latency remains under 100 ms. Which approach should the company take?

A.Use an elastic inference accelerator to reduce latency instead of scaling.

B.Use a scheduled scaling plan based on historical traffic patterns.

C.Deploy the model on one large instance to handle peak load.

D.Deploy the model on a multi-model endpoint with automatic scaling and configure a warm-up period for new instances.

AnswerD

Multi-model endpoint with scaling and warm-up can handle bursts cost-effectively.

Why this answer

Option D is correct because a multi-model endpoint with automatic scaling allows multiple models to share a single endpoint, reducing cost while handling bursty traffic. Configuring a warm-up period ensures new instances are fully initialized before receiving traffic, preventing cold-start latency spikes and keeping inference under 100 ms.

Exam trap

The trap here is that candidates confuse latency optimization techniques (like elastic inference) with scaling strategies, overlooking that bursty traffic requires dynamic scaling with warm-up to prevent cold-start latency spikes.

How to eliminate wrong answers

Option A is wrong because elastic inference accelerators reduce per-inference latency but do not address the need to scale out during traffic spikes; they add cost without solving the bursty traffic problem. Option B is wrong because scheduled scaling based on historical patterns cannot react to unpredictable spikes, leading to either over-provisioning or latency violations during unexpected bursts. Option C is wrong because deploying on one large instance creates a single point of failure and is cost-inefficient for bursty traffic; it either underutilizes resources during low traffic or fails to handle peak load without latency degradation.

366

MCQmedium

A media company uses SageMaker endpoints to serve a model that predicts video engagement. They have two production variants: Variant A (ml.c5.large) for regular traffic and Variant B (ml.c5.xlarge) for burst traffic. They use weighted routing (90% to A, 10% to B). Recently, during peak hours, Variant A's latency increase causes many requests to time out. The metrics show that both variants are under similar CPU load, but the number of concurrent requests to Variant A is very high. The team wants to ensure that burst traffic is handled properly without manual intervention. What should they do?

A.Increase the traffic weight to Variant B to 70% and reduce Variant A to 30%.

B.Configure Application Auto Scaling for each variant with a target tracking scaling policy based on the number of concurrent requests per instance.

C.Set a CloudWatch alarm on Variant A's p99 latency and trigger a step scaling policy to add instances.

D.Create a separate endpoint for burst traffic and route peak traffic to it via DNS.

AnswerB

Autoscaling adjusts capacity based on load, preventing timeouts.

Why this answer

Option B is correct because a target tracking scaling policy based on the number of concurrent requests (or InvocationsPerInstance) ensures each variant scales with its own load, without manual intervention. Option A (shifting the traffic weights) doesn't fix scaling because it adds no capacity. Option C (p99 latency alarm) might trigger too late. Option D (separate endpoint) is not necessary.
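
As an illustration, the request parameters for such a policy can be assembled per variant. This is a sketch under assumptions: the endpoint name, policy name, target value, and cooldowns are hypothetical, and the helper only builds the parameters; it does not call AWS.

```python
import json


def target_tracking_policy_kwargs(endpoint_name, variant_name, target_value):
    """Build keyword arguments for an Application Auto Scaling PutScalingPolicy request."""
    return {
        "PolicyName": f"{variant_name}-target-tracking",  # hypothetical name
        "ServiceNamespace": "sagemaker",
        "ResourceId": f"endpoint/{endpoint_name}/variant/{variant_name}",
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            "TargetValue": target_value,  # assumed target, tune per workload
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance",
            },
            "ScaleOutCooldown": 60,   # assumed value, in seconds
            "ScaleInCooldown": 300,   # assumed value, in seconds
        },
    }


# One policy per production variant, so each scales on its own load.
policies = [
    target_tracking_policy_kwargs("video-engagement", variant, 100.0)
    for variant in ("VariantA", "VariantB")
]
print(json.dumps(policies[0], indent=2))
```

With boto3, each dictionary would typically be passed to the application-autoscaling client's put_scaling_policy call after the variant has been registered with register_scalable_target.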

367

MCQmedium

A company has a SageMaker endpoint that was deployed successfully and is in service. However, when the team sends test inferences using the InvokeEndpoint API, they receive a 500 internal server error. The endpoint logs in CloudWatch show a stack trace indicating 'OutOfMemoryError: Java heap space'. The model is a large XGBoost model (2 GB) and the endpoint is using an ml.m5.large instance with 8 GB of memory. What is the MOST likely cause and solution?

A.The endpoint needs to have a smaller batch size configured in the real-time inference request.

B.The instance type has insufficient memory for the model size; use a larger instance type like ml.m5.xlarge (16 GB) or ml.m5.2xlarge.

C.The model is a Transformer model and requires a GPU instance; use ml.g4dn.xlarge instead.

D.The SageMaker container is not compatible with XGBoost; switch to a framework container.

AnswerB

A 2 GB model plus runtime overhead (e.g., Java heap for XGBoost) can exceed 8 GB. Increasing instance memory resolves the out-of-memory error.

Why this answer

The OutOfMemoryError in Java heap space indicates that the model (2 GB) plus the runtime overhead of the XGBoost container and Java-based inference code exceed the available memory on the ml.m5.large instance (8 GB total, but not all is available for the Java heap). The most direct fix is to use a larger instance type, such as ml.m5.xlarge (16 GB) or ml.m5.2xlarge, to provide sufficient heap space for the model and inference operations.

Exam trap

The trap here is that candidates may incorrectly attribute the OutOfMemoryError to batch size or container compatibility, rather than recognizing that the instance's memory is insufficient for the model size and Java heap overhead.

How to eliminate wrong answers

Option A is wrong because batch size configuration is not applicable to real-time InvokeEndpoint requests (which are single inference calls), and reducing batch size would not resolve a Java heap space error caused by model size and overhead. Option C is wrong because the model is explicitly stated as XGBoost, not a Transformer model, and XGBoost runs efficiently on CPU instances; GPU instances are not required. Option D is wrong because SageMaker provides native support for XGBoost via built-in containers, and the error is a memory issue, not a compatibility issue with the container.

368

Multi-Selecthard

A data scientist is developing a gradient boosting model and observes that the model is overfitting to the training data. Which three techniques can help reduce overfitting? (Select THREE.)

Select 3 answers

A.Reduce the learning rate

B.Apply early stopping

C.Increase the maximum depth of trees

D.Increase the regularization parameters (e.g., lambda, alpha)

E.Add subsampling of data or features

F.Increase the number of trees

AnswersA, B, E

A lower learning rate makes the model more robust and reduces overfitting.

Why this answer

Reducing the learning rate slows down learning and helps reduce overfitting. Subsampling (data/feature sampling) adds randomness and reduces overfitting. Early stopping stops training before overfitting occurs.

Increasing the number of trees or tree depth increases model complexity, worsening overfitting. Increasing regularization parameters (like lambda, alpha) also helps reduce overfitting, but the three most common for gradient boosting are reducing learning rate, subsampling, and early stopping.
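
Early stopping is the easiest of these techniques to show in isolation. The sketch below is a library-independent illustration, not how any particular boosting framework implements it: the function name, the patience value, and the validation losses are all hypothetical.

```python
def best_round_with_early_stopping(val_losses, patience):
    """Return (best_round, stopped_round) for per-round validation losses.

    Training stops once the loss has failed to improve for `patience`
    consecutive rounds after the best round seen so far.
    """
    best_loss = float("inf")
    best_round = -1
    for round_idx, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss = loss
            best_round = round_idx
        elif round_idx - best_round >= patience:
            return best_round, round_idx
    return best_round, len(val_losses) - 1


# Hypothetical validation losses: improving at first, then rising as the
# model starts to overfit the training data.
val_losses = [0.70, 0.55, 0.48, 0.45, 0.46, 0.47, 0.49, 0.52]
best_round, stopped_round = best_round_with_early_stopping(val_losses, patience=3)
print(best_round, stopped_round)
```

The model from the best round is kept, so the later trees that only fit noise in the training data are never used.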

369

MCQmedium

A company uses AWS Glue to run ETL jobs that prepare data for machine learning. The data is stored in Amazon S3 in Parquet format. A data engineer notices that the Glue job is running slowly and consuming a lot of resources. What is the MOST cost-effective way to improve the performance of the Glue job?

A.Use the G.1X worker type, which provides more memory per worker compared to the Standard worker type.

B.Use partition pruning on the source data to reduce the amount of data processed.

C.Switch the output format from Parquet to CSV to reduce processing overhead.

D.Use a larger instance type for the Glue job by increasing the number of DPUs.

AnswerA

G.1X offers more memory, reducing memory-related bottlenecks without increasing DPU count.

Why this answer

Option A is correct because the G.1X worker type provides more memory per worker, which can improve performance and resource utilization without increasing the DPU count. Option D is wrong because increasing the number of DPUs (Data Processing Units) can improve parallelism and reduce job runtime, but it increases cost. Option C is wrong because switching to CSV may degrade performance. Option B is wrong because partition pruning on the source data can reduce the data scanned but may not address resource consumption.

370

MCQhard

A company deploys a SageMaker model using AWS KMS for encryption at rest. They have a compliance requirement to rotate the KMS key every year without causing downtime for the inference endpoint. Which approach should they take?

A.Use AWS Certificate Manager (ACM) for encryption

B.Create a new KMS key and update the endpoint configuration

C.Manually rotate the key by recreating the endpoint

D.Enable automatic key rotation on the existing KMS key

AnswerD

Automatic rotation rotates the key material without changing the key ID, causing no downtime.

Why this answer

Option D is correct because AWS KMS supports automatic key rotation, which rotates the key material yearly without requiring any changes to the endpoint. Options A, B, and C would cause downtime or are unnecessary.

371

Multi-Selecteasy

Which TWO data storage options are commonly used by Amazon SageMaker Feature Store for offline and online storage?

Select 2 answers

A.Amazon Redshift

B.Amazon RDS

C.Amazon ElastiCache

D.Amazon S3

E.Amazon DynamoDB

AnswersD, E

S3 is the default offline store for large historical feature data.

Why this answer

Amazon SageMaker Feature Store uses Amazon S3 as the default offline storage layer because it provides durable, scalable, and cost-effective object storage for large volumes of historical feature data. Amazon DynamoDB is used as the default online storage layer because it offers low-latency, single-digit millisecond read/write performance required for real-time inference serving.

Exam trap

The trap here is that candidates often confuse Amazon ElastiCache (a caching layer) with the primary online storage service, or assume Amazon Redshift is used for offline storage due to its analytical capabilities, but SageMaker Feature Store specifically integrates DynamoDB for online and S3 for offline storage as first-class options.

372

MCQhard

A data science team is building a model to predict fraudulent transactions. The dataset has 1 million legitimate transactions and only 1,000 fraudulent ones. They plan to use Amazon SageMaker to train a model. Which data preparation technique should they apply to address the severe class imbalance before training?

A.Apply data augmentation using image transformations because fraud detection is like image classification.

B.Randomly oversample the fraudulent class to match the legitimate count by duplicating existing fraud records.

C.Use SMOTE (Synthetic Minority Oversampling Technique) to generate synthetic fraudulent samples.

D.Randomly undersample the legitimate class to 1,000 samples to create a balanced dataset.

AnswerC

SMOTE creates synthetic examples by interpolating between existing minority instances, reducing overfitting risk.

Why this answer

Option C is correct because SMOTE generates synthetic samples for the minority class, addressing the imbalance without simply duplicating data. Option B is wrong because oversampling by duplication can lead to overfitting. Option D is wrong because undersampling discards too much legitimate data, losing valuable patterns. Option A is wrong because the data is already in a tabular format, not images.
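
The interpolation idea behind SMOTE can be sketched without any ML library. This is a simplified, SMOTE-like illustration rather than the reference algorithm: the function name, the two features, and the sample fraud records are hypothetical.

```python
import math
import random


def smote_like_oversample(minority, n_synthetic, k=2, seed=0):
    """Create synthetic points by interpolating between a minority sample
    and one of its k nearest minority-class neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # k nearest neighbours of `base` among the other minority points.
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


# Hypothetical fraud records with two numeric features (amount, hour of day).
fraud = [(900.0, 2.0), (950.0, 3.0), (1200.0, 1.0), (1100.0, 4.0)]
new_points = smote_like_oversample(fraud, n_synthetic=6)
for point in new_points:
    print(tuple(round(v, 1) for v in point))
```

Because every synthetic record lies between two real fraud records, the new samples are not exact duplicates, which is the property that separates SMOTE from random oversampling. Oversampling should be applied to the training split only, so that evaluation still reflects the real class ratio.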

373

MCQhard

A data scientist is trying to create a SageMaker endpoint configuration with 6 instances of ml.c5.large for a production variant. The creation fails with the error shown in the exhibit. Which action should the data scientist take to resolve this issue?

A.Create two separate endpoint configurations, each with 3 instances, and distribute traffic between them.

B.Request a service quota increase for ml.c5.large for real-time endpoints from the AWS Service Quotas console.

C.Use a different instance type, such as ml.m5.large, which has a higher limit.

D.Delete unused endpoints to free up resources.

AnswerB

Increasing the quota allows provisioning the requested number of instances.

Why this answer

The error indicates that the requested number of instances exceeds the service quota for ml.c5.large for real-time endpoints. AWS enforces default limits on instance counts per instance type per region. Requesting a quota increase via the Service Quotas console is the correct action to raise the limit and allow the deployment of 6 instances.

Exam trap

The trap here is that candidates may confuse service quotas with resource availability, thinking that deleting unused endpoints or splitting configurations will free up capacity, when in fact the quota is a hard limit that must be explicitly increased.

How to eliminate wrong answers

Option A is wrong because creating two separate endpoint configurations does not bypass the service quota; the total instance count across all endpoints still counts against the same quota. Option C is wrong because using a different instance type like ml.m5.large does not inherently have a higher limit; each instance type has its own default quota, and the limit for ml.m5.large may also be insufficient or unknown without checking. Option D is wrong because deleting unused endpoints does not increase the quota for ml.c5.large; it only frees up currently used instances, but the quota itself remains unchanged.

374

Multi-Selecthard

A machine learning team is setting up Model Monitor for a deployed model. Which THREE factors should they consider when configuring the monitoring schedule? (Select three.)

Select 3 answers

A.The monitoring job can be configured to send notifications via Amazon SNS.

B.The frequency of monitoring should be at least daily.

C.The monitoring job should analyze a sufficient sample size to be statistically significant.

D.The monitoring job should run on a schedule that aligns with data arrival patterns.

E.The constraints file must be updated after each monitoring run.

AnswersA, C, D

SNS notifications can alert teams when violations are detected.

Why this answer

Options A, C, and D are correct. A: SNS notifications can be set up for alerts. C: A sufficient sample size ensures statistical significance. D: The schedule should align with data arrival patterns to detect drift promptly. B is not necessarily correct; the right frequency depends on data volume. E is incorrect because constraints are updated manually or via baseline jobs, not automatically after each run.

375

Multi-Selectmedium

A data scientist is performing feature engineering for a dataset with both numerical and categorical features. The data scientist wants to apply transformations that preserve the interpretability of the features. Which TWO transformations should the data scientist use? (Select TWO)

Select 2 answers

A.Log transformation of skewed numerical features

B.Target encoding of high-cardinality categorical features

C.Standard scaling of numerical features

D.PCA dimensionality reduction

E.One-hot encoding of categorical features

AnswersA, C

Log transformation reduces skewness while keeping feature order.

Why this answer

Log transformation is correct because it reduces skewness in numerical features by compressing the scale of large values, making the distribution more normal while preserving the original feature's interpretability (e.g., a log-transformed income value still relates to income). This is a monotonic transformation, so the order of values is maintained, and the feature remains directly understandable.
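
The monotonic property is easy to check on a few values. This is a minimal sketch with hypothetical income figures; log1p is used here as one common choice of log transform, not the only one.

```python
import math

# Hypothetical right-skewed income values, already in ascending order.
incomes = [20_000.0, 35_000.0, 50_000.0, 120_000.0, 2_000_000.0]

# log1p(x) = log(1 + x); it compresses large values and is defined at zero.
log_incomes = [math.log1p(x) for x in incomes]

# expm1 is the inverse, so a transformed value maps back to an income.
recovered = [math.expm1(v) for v in log_incomes]

print([round(v, 2) for v in log_incomes])
```

The transformed list keeps the same ordering as the original, and each value can be converted back to the original units, which is what keeps the feature interpretable.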

Exam trap

AWS often tests the misconception that one-hot encoding always preserves interpretability (it does, but the question pairs it with target encoding as a distractor), leading candidates to select one-hot encoding instead of recognizing that standard scaling is the correct second choice for numerical features.
