Knowledge + Practice

AWS Certified Machine Learning Engineer Associate (MLA-C01): Questions 376–450

507 questions total · 7 pages · All types, answers revealed

376

MCQ · medium

A company uses Amazon SageMaker Ground Truth to create labeled datasets for object detection. The output must be in COCO format for downstream model training. How should the data preparation process be configured?

A. Use a built-in transformation to convert from Ground Truth JSON to COCO after labeling

B. Use a pre-built AWS Lambda function to transform annotations to COCO

C. Write a custom SageMaker Processing script to convert the output to COCO

D. Select 'Object Detection' task type and specify 'COCO' as the output format in the labeling job configuration

Answer: D

Ground Truth supports COCO output for object detection tasks.

Why this answer

Option D is correct because Amazon SageMaker Ground Truth natively supports outputting object detection labeling jobs in COCO format. When you select 'Object Detection' as the task type, the labeling job configuration includes an option to specify 'COCO' as the output format, which automatically structures the labeled data into the required COCO JSON schema without any post-processing.

Exam trap

The trap here is that candidates assume post-processing is always required for format conversion, overlooking that Ground Truth can directly output COCO format when the correct task type and output format are selected in the labeling job configuration.

How to eliminate wrong answers

Option A is wrong because Ground Truth does not provide a built-in transformation to convert its default JSON output to COCO format; the conversion must be handled externally. Option B is wrong because while AWS Lambda can be used for custom transformations, it is not a pre-built solution for this specific conversion; using a Lambda function would require writing custom code and is not the recommended or simplest approach. Option C is wrong because writing a custom SageMaker Processing script is an unnecessary extra step; Ground Truth can directly output COCO format, eliminating the need for any post-labeling transformation.

377

MCQ · easy

A data engineer needs to convert a JSON dataset to Parquet format for efficient querying with Amazon Athena. The JSON files are in an S3 bucket. Which service can perform this conversion with minimal coding?

A. Amazon SageMaker Processing

B. Amazon EMR

C. AWS Lambda

D. AWS Glue Studio with a visual job

Answer: D

Glue Studio's drag-and-drop interface enables JSON to Parquet conversion with minimal coding.

Why this answer

AWS Glue Studio with a visual job is the correct choice because it provides a no-code, drag-and-drop interface to create ETL jobs that can read JSON from S3 and write it as Parquet, with built-in schema inference and transformation capabilities. This minimizes coding effort while leveraging Glue's serverless Spark engine for efficient conversion, making it ideal for preparing data for Athena queries.

Exam trap

The trap here is that candidates often confuse AWS Glue Studio with AWS Glue DataBrew or assume that any AWS service with 'processing' in its name (like SageMaker Processing) is suitable for simple ETL tasks, overlooking the specific no-code visual job capability of Glue Studio.

How to eliminate wrong answers

Option A is wrong because Amazon SageMaker Processing is designed for data preprocessing, postprocessing, and model evaluation within an ML workflow, not for simple file format conversion; it requires writing a custom processing script and choosing an instance configuration, which adds unnecessary complexity. Option B is wrong because Amazon EMR is a managed Hadoop/Spark cluster that can perform the conversion, but it requires provisioning and configuring a cluster, writing Spark or Hive code, and managing the cluster lifecycle, which is far more coding and operational overhead than a visual job. Option C is wrong because AWS Lambda has a maximum execution time of 15 minutes and a deployment package size limit, making it impractical for converting large JSON datasets to Parquet; it also requires custom Python code with libraries like PyArrow or Pandas, which is not minimal coding.

378

MCQ · medium

A company uses SageMaker endpoints with auto-scaling. The endpoint is experiencing high latency during peak hours. The metrics show CPU utilization is low but memory is high. What is the most likely cause?

A. The model is not optimized for inference, causing memory leaks.

B. The auto-scaling policy is based on CPU utilization, which does not trigger scaling.

C. The instance type has insufficient network bandwidth.

D. The endpoint is deployed in a VPC without a NAT gateway.

Answer: B

CPU utilization is low, so scaling is not triggered, but high memory usage indicates that more instances are needed.

Why this answer

The auto-scaling policy is likely based on CPU utilization, which does not trigger scaling during memory pressure. Memory leaks could be a secondary cause, but the primary issue is the scaling metric.
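One way to address the mismatch is to track memory instead of CPU. The sketch below only builds the request for Application Auto Scaling's `put_scaling_policy`; it does not call AWS. The endpoint name `my-endpoint`, the variant name `AllTraffic`, and the 70% target are assumptions chosen for illustration.

```python
def memory_scaling_policy(endpoint_name, variant_name, target_percent=70.0):
    """Build keyword arguments for a target-tracking policy on memory utilization."""
    return {
        "PolicyName": f"{endpoint_name}-memory-target-tracking",
        "ServiceNamespace": "sagemaker",
        "ResourceId": f"endpoint/{endpoint_name}/variant/{variant_name}",
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            "TargetValue": target_percent,
            # Custom metric: per-instance memory published for the endpoint variant.
            "CustomizedMetricSpecification": {
                "MetricName": "MemoryUtilization",
                "Namespace": "/aws/sagemaker/Endpoints",
                "Dimensions": [
                    {"Name": "EndpointName", "Value": endpoint_name},
                    {"Name": "VariantName", "Value": variant_name},
                ],
                "Statistic": "Average",
            },
        },
    }


# Hypothetical endpoint and variant names.
policy = memory_scaling_policy("my-endpoint", "AllTraffic")
```

With credentials configured, the dictionary could be passed as `boto3.client("application-autoscaling").put_scaling_policy(**policy)` after the variant has been registered as a scalable target.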

379

Multi-Select · easy

Which TWO actions are recommended best practices when preparing training data for a machine learning model in AWS? (Choose two.)

Select 2 answers

A. Remove all outliers from the dataset.

B. Train the model on the entire dataset to maximize data usage.

C. Check for and handle missing values appropriately.

D. Split the data into training, validation, and test sets.

E. Always normalize all features to a [0,1] range.

Answers: C, D

Missing values can cause errors or bias if not addressed.

Why this answer

Option C is correct because missing values can introduce bias or cause algorithms to fail, so handling them (e.g., via imputation or removal) is a critical data preparation step in Amazon SageMaker. Option D is correct because splitting data into training, validation, and test sets allows you to evaluate model performance on unseen data and detect overfitting, which is standard practice with SageMaker's built-in algorithms and training jobs.

Exam trap

The trap here is that candidates assume all outliers must be removed (Option A) or that normalization is always required (Option E), but the exam tests nuanced understanding that these steps depend on the algorithm and data characteristics, not blanket rules.

380

MCQ · medium

A team is using SageMaker Pipelines to automate retraining and deployment. They want to trigger the pipeline automatically when new training data is available in an S3 bucket. Which approach should they use?

A. Create an Amazon EventBridge rule that triggers the pipeline execution on S3 PutObject events

B. Register the pipeline as a model package in SageMaker Model Registry

C. Configure a cron job to run the pipeline every hour

D. Use AWS Step Functions to poll the S3 bucket and start the pipeline when a new object appears

Answer: A

EventBridge can detect S3 events and start pipeline executions.

Why this answer

Option A is correct because Amazon EventBridge can directly capture S3 PutObject events and invoke a SageMaker Pipeline execution as a target. This provides a fully event-driven, serverless integration without polling or manual intervention, aligning with best practices for automating ML workflows when new data arrives.

Exam trap

The trap here is that candidates may overcomplicate the solution by choosing Step Functions (Option D) for orchestration, not realizing that EventBridge provides a simpler, event-driven trigger without the need for polling or additional state machines.

How to eliminate wrong answers

Option B is wrong because registering a pipeline as a model package in SageMaker Model Registry is for versioning and managing trained models, not for triggering pipeline executions based on S3 events. Option C is wrong because a cron job runs on a fixed schedule, which is inefficient and may miss data arrivals or run unnecessarily, whereas the requirement is to trigger only when new data appears. Option D is wrong because using AWS Step Functions to poll S3 introduces latency, cost, and complexity compared to the native event-driven approach with EventBridge, which reacts instantly to S3 events.
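The event-driven trigger can be sketched as an EventBridge event pattern plus a pipeline target. This sketch assumes the bucket has EventBridge notifications enabled, so uploads arrive as "Object Created" events; the bucket name, prefix, ARNs, and the `InputPrefix` pipeline parameter are all hypothetical. It only builds the payloads and does not call AWS.

```python
import json


def s3_object_created_pattern(bucket, prefix):
    """Event pattern matching new objects under a prefix in one bucket."""
    return {
        "source": ["aws.s3"],
        "detail-type": ["Object Created"],
        "detail": {
            "bucket": {"name": [bucket]},
            "object": {"key": [{"prefix": prefix}]},
        },
    }


def pipeline_target(pipeline_arn, role_arn, parameters):
    """EventBridge target entry that starts a SageMaker pipeline execution."""
    return {
        "Id": "start-retraining-pipeline",
        "Arn": pipeline_arn,
        "RoleArn": role_arn,
        "SageMakerPipelineParameters": {
            "PipelineParameterList": [
                {"Name": name, "Value": value} for name, value in parameters.items()
            ]
        },
    }


# Hypothetical names and ARNs for illustration only.
pattern = s3_object_created_pattern("my-training-bucket", "incoming/")
target = pipeline_target(
    "arn:aws:sagemaker:us-east-1:111122223333:pipeline/retrain",
    "arn:aws:iam::111122223333:role/EventBridgeStartPipeline",
    {"InputPrefix": "incoming/"},
)
event_pattern_json = json.dumps(pattern)
```

The JSON string would be supplied as `EventPattern` to the EventBridge `put_rule` call and the target entry to `put_targets`; the role needs permission to start the pipeline execution.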

381

MCQ · hard

A data scientist is preparing a dataset for a regression model that predicts house prices. The dataset includes a `neighborhood` feature with 500 distinct categories. The data scientist wants to encode this feature without increasing dimensionality too much and while capturing the target relationship. Which encoding technique should be used?

A. Target encoding (mean encoding)

B. One-hot encoding

C. Frequency encoding

D. Label encoding

Answer: A

Target encoding captures target relationship with low dimensionality.

Why this answer

Target encoding (mean encoding) is the correct choice because it replaces each of the 500 neighborhood categories with the mean of the target variable (house price) for that category. This captures the relationship between the neighborhood and the target while adding only one new feature column, thus avoiding the massive dimensionality explosion that would occur with one-hot encoding (which would create 500 binary columns).

Exam trap

AWS often tests the trade-off between dimensionality and information retention, and the trap here is that candidates may choose one-hot encoding out of habit, failing to recognize that 500 categories make it impractical, or choose label encoding because it seems simple, ignoring the ordinal assumption it imposes.

How to eliminate wrong answers

Option B (One-hot encoding) is wrong because it would create 500 binary columns, drastically increasing dimensionality and leading to the curse of dimensionality, sparsity, and overfitting. Option C (Frequency encoding) is wrong because it replaces categories with their count/frequency, which does not capture the relationship with the target variable (house price) and loses predictive signal. Option D (Label encoding) is wrong because it assigns arbitrary integer labels (e.g., 1, 2, 3) that imply an ordinal relationship, which is inappropriate for a nominal feature like neighborhood and can mislead the regression model into assuming a false order.
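A minimal sketch of target encoding in plain Python, using a tiny made-up dataset rather than the 500 neighborhoods from the question. Unseen categories fall back to the global mean, which is one common choice and an assumption here. In practice the mapping should be fitted on the training split only, so the target does not leak into validation data.

```python
from collections import defaultdict


def fit_target_encoding(categories, targets):
    """Return per-category target means and the global mean as a fallback."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for category, target in zip(categories, targets):
        sums[category] += target
        counts[category] += 1
    means = {category: sums[category] / counts[category] for category in sums}
    global_mean = sum(targets) / len(targets)
    return means, global_mean


def transform(categories, means, global_mean):
    """Replace each category with its mean target; unseen categories get the global mean."""
    return [means.get(category, global_mean) for category in categories]


# Toy data: prices in thousands.
neighborhoods = ["north", "north", "south", "east"]
prices = [300.0, 500.0, 200.0, 400.0]

means, fallback = fit_target_encoding(neighborhoods, prices)
encoded = transform(["north", "south", "west"], means, fallback)
```

The result is a single numeric column regardless of how many distinct neighborhoods exist.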

382

MCQ · easy

A data scientist is working on a time series forecasting problem. The dataset contains a column 'sales' with occasional negative values due to returns. The model expects non-negative input. Which data preparation step should be taken?

A. Clip negative sales values to zero

B. Apply log transformation after adding a constant

C. Remove all rows with negative sales values

D. Impute negative values with the mean

Answer: A

Sets returns to zero, which is appropriate for sales data.

Why this answer

Option A is correct because clipping negative sales values to zero directly addresses the model's requirement for non-negative input while preserving the data's temporal structure. This approach is appropriate for time series forecasting where returns cause occasional negative values, as it treats returns as zero sales rather than removing or distorting the data points.

Exam trap

AWS often tests the misconception that removing or imputing negative values is safe in time series, but the trap here is that these actions break temporal dependencies and introduce bias, whereas clipping preserves the sequence structure.

How to eliminate wrong answers

Option B is wrong because applying a log transformation after adding a constant does not guarantee non-negative values; it only compresses the scale and can introduce bias, especially with negative values that require arbitrary shifting. Option C is wrong because removing all rows with negative sales values disrupts the time series continuity and can lead to loss of important temporal patterns, such as seasonality or trends. Option D is wrong because imputing negative values with the mean introduces statistical bias and distorts the underlying distribution, which is particularly problematic in time series where data points are sequentially dependent.
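Clipping can be sketched in a few lines. The sales figures below are made up; the point is that the series keeps its length and ordering, so every time step is still present.

```python
def clip_negative(values):
    """Replace negative values with zero, keeping every time step in place."""
    return [max(0.0, value) for value in values]


# Toy daily sales with two return-driven negatives.
daily_sales = [120.0, -15.0, 80.0, 0.0, -3.5]
clipped = clip_negative(daily_sales)
```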

383

MCQ · easy

Refer to the exhibit. A user is unable to invoke a SageMaker endpoint. The IAM policy shown is attached to the user. Which permission is missing to allow invocation?

A. sagemaker:InvokeEndpoint

B. sagemaker:DescribeEndpoint

C. sagemaker:CreateEndpoint

D. sagemaker:ListEndpoints

Answer: A

InvokeEndpoint is required to send inference requests.

Why this answer

To invoke a SageMaker endpoint, the user needs the `sagemaker:InvokeEndpoint` permission. The IAM policy shown lacks this action, which is required for making real-time inference requests to the endpoint. Without it, any attempt to call the endpoint via the SDK or CLI will fail with an access denied error.

Exam trap

AWS often tests the distinction between read-only permissions (like `DescribeEndpoint` or `ListEndpoints`) and the specific action required to perform an operation, leading candidates to confuse metadata access with actual invocation capability.

How to eliminate wrong answers

Option B is wrong because `sagemaker:DescribeEndpoint` only allows retrieving metadata about an endpoint, not invoking it for inference. Option C is wrong because `sagemaker:CreateEndpoint` is for creating new endpoints, not for sending inference requests to an existing one. Option D is wrong because `sagemaker:ListEndpoints` only lists endpoints in the account, which does not grant the ability to invoke them.
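The exhibit is not reproduced here, so the following is only a sketch of what a policy granting invocation might look like. The endpoint ARN is hypothetical.

```python
import json


def invoke_endpoint_policy(endpoint_arn):
    """Identity-based policy document allowing invocation of one endpoint."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "sagemaker:InvokeEndpoint",
                "Resource": endpoint_arn,
            }
        ],
    }


# Hypothetical account ID and endpoint name.
policy_json = json.dumps(
    invoke_endpoint_policy("arn:aws:sagemaker:us-east-1:111122223333:endpoint/my-endpoint"),
    indent=2,
)
```

Scoping `Resource` to a single endpoint ARN, rather than `*`, keeps the grant limited to the endpoint the user actually needs.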

384

MCQ · medium

A company uses Amazon SageMaker Data Wrangler to create a data flow for a classification model. The dataset contains a high-cardinality categorical feature 'product_id' with 50,000 unique values. The data scientist wants to reduce dimensionality while preserving predictive power. Which approach is most effective?

A. Apply one-hot encoding to the 'product_id' column.

B. Perform target encoding by replacing each product ID with the average target value for that product.

C. Use feature hashing to map product IDs to a fixed number of buckets (e.g., 100).

D. Drop the 'product_id' column entirely.

Answer: B

Target encoding condenses information into a single numerical feature while retaining predictive signals.

Why this answer

Target encoding is the most effective approach for high-cardinality categorical features because it replaces each category with the mean of the target variable, preserving predictive signal while drastically reducing dimensionality. In SageMaker Data Wrangler, this can be implemented using the 'Encode categorical' transform with the 'Target encoding' option, which avoids the explosion of features caused by one-hot encoding and retains the relationship between product IDs and the target.

Exam trap

AWS often tests the misconception that feature hashing is always safe for high-cardinality features, but the trap here is that hash collisions can degrade model performance, making target encoding a better choice when the target variable is available and predictive.

How to eliminate wrong answers

Option A is wrong because one-hot encoding on a feature with 50,000 unique values would create 50,000 binary columns, leading to extreme dimensionality and sparsity, which degrades model performance and increases computational cost. Option C is wrong because feature hashing maps product IDs to a fixed number of buckets (e.g., 100), which can cause hash collisions and loss of information, reducing predictive power compared to target encoding. Option D is wrong because dropping the column entirely discards all predictive information contained in the product IDs, which is likely to harm model accuracy.
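The collision concern with feature hashing can be illustrated with a small sketch. The product IDs are synthetic, and MD5 is used only as a stable hash for the demonstration; it is not a claim about what Data Wrangler uses internally.

```python
import hashlib


def hash_bucket(value, n_buckets):
    """Map a string to one of n_buckets using a stable hash."""
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets


# 50,000 synthetic product IDs hashed into 100 buckets.
product_ids = [f"product-{i}" for i in range(50_000)]
buckets = [hash_bucket(product_id, 100) for product_id in product_ids]

distinct_buckets = len(set(buckets))
# With at most 100 buckets, hundreds of unrelated products share each bucket.
avg_ids_per_bucket = len(product_ids) / distinct_buckets
```

Because at least 500 products land in each bucket on average, the model cannot distinguish between them, which is the information loss the explanation refers to.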

385

MCQ · hard

A financial services company deploys a credit risk model using an Amazon SageMaker endpoint with data capture enabled. The model uses a custom container. The compliance team requires that all inference requests and responses are logged to an S3 bucket with server-side encryption using AWS KMS. The IAM role for the endpoint has the following policy. What must be added to meet the compliance requirement?

A. Add kms:GenerateDataKey and kms:Decrypt permissions to the IAM role.

B. Add s3:PutObjectAcl permission to the IAM role.

C. Enable S3 default encryption on the bucket.

D. Modify the container to handle encryption internally.

Answer: A

These permissions are necessary to write to a KMS-encrypted bucket.

Why this answer

The correct answer is A because the IAM role for the SageMaker endpoint needs kms:GenerateDataKey to obtain the data key that encrypts captured data written to the S3 bucket, and kms:Decrypt, which S3 requires for multipart uploads of KMS-encrypted objects. Without these, the endpoint cannot use the customer-managed KMS key for server-side encryption, even if the bucket policy allows it.

Exam trap

The trap here is that candidates often assume enabling S3 default encryption (Option C) is sufficient, but SageMaker data capture requires explicit KMS permissions in the endpoint's IAM role to use the customer-managed key.

How to eliminate wrong answers

Option B is wrong because s3:PutObjectAcl is not required for server-side encryption with KMS; it is used for managing object-level access control lists, not encryption. Option C is wrong because enabling S3 default encryption on the bucket does not satisfy the requirement for server-side encryption using AWS KMS for data captured by SageMaker; the endpoint must explicitly use the KMS key via the IAM role. Option D is wrong because modifying the container to handle encryption internally would bypass the managed data capture feature and is not necessary; SageMaker data capture already supports KMS encryption natively.
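As a sketch, the two pieces involved are the data capture settings on the endpoint configuration and the KMS statement added to the endpoint's role. The bucket URI and key ARN below are hypothetical, and the exhibit's existing policy is not reproduced here. The code only builds the dictionaries.

```python
def data_capture_config(s3_uri, kms_key_arn, sampling_percent=100):
    """DataCaptureConfig block for an endpoint configuration, capturing requests and responses."""
    return {
        "EnableCapture": True,
        "InitialSamplingPercentage": sampling_percent,
        "DestinationS3Uri": s3_uri,
        "KmsKeyId": kms_key_arn,
        "CaptureOptions": [{"CaptureMode": "Input"}, {"CaptureMode": "Output"}],
    }


def kms_statement(kms_key_arn):
    """IAM statement granting the KMS actions named in the answer."""
    return {
        "Effect": "Allow",
        "Action": ["kms:GenerateDataKey", "kms:Decrypt"],
        "Resource": kms_key_arn,
    }


# Hypothetical key ARN and capture location.
KEY_ARN = "arn:aws:kms:us-east-1:111122223333:key/example-key-id"
capture = data_capture_config("s3://my-capture-bucket/credit-risk/", KEY_ARN)
statement = kms_statement(KEY_ARN)
```

A 100% sampling rate matches the requirement that all requests and responses are logged.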

386

Multi-Select · medium

An MLOps team is designing a CI/CD pipeline for deploying machine learning models to production on Amazon SageMaker. They want to ensure that the deployment process is automated and that models are automatically rolled back if performance degrades. Which of the following AWS services or features should they use to achieve this? (Choose THREE.)

Select 3 answers

A. Amazon SageMaker Model Registry

B. Amazon SageMaker Ground Truth

C. Amazon CloudWatch

D. Amazon SageMaker Pipelines

E. AWS CloudTrail

Answers: A, C, D

Model Registry manages model versions and approvals.

Why this answer

Amazon SageMaker Model Registry is correct because it provides a centralized catalog for managing, versioning, and approving ML models. It enables automated deployment by triggering CI/CD pipelines when a model version is approved, and supports automatic rollback by allowing you to revert to a previous approved version if performance degrades, as detected by monitoring metrics.

Exam trap

The trap here is that candidates may confuse SageMaker Ground Truth (a data labeling service) or CloudTrail (an auditing service) with the core MLOps components needed for automated deployment and rollback, overlooking that Model Registry, Pipelines, and CloudWatch are the precise services that form the CI/CD and monitoring backbone.

387

MCQ · medium

A healthcare company is developing a predictive model to identify patients at risk of readmission within 30 days after discharge. The dataset contains electronic health record (EHR) data from multiple hospitals, stored as Parquet files in Amazon S3. The data includes patient demographics, diagnoses (ICD-10 codes), medications, lab results, and length of stay. A data scientist notices that the 'lab_result' column has a high number of null values (over 60%) because some tests are not applicable to all patients. Additionally, the 'diagnosis_code' column has over 10,000 unique ICD-10 codes. The company wants to build a model that complies with HIPAA and performs well. The data scientist must prepare the features efficiently using AWS services. Which combination of steps should the data scientist take? (Assume the company can use any AWS service.)

A. Use AWS Glue ETL to impute missing lab results with a value predicted from other features using a model like XGBoost, and apply count encoding to diagnosis codes based on their frequency of occurrence.

B. Replace missing lab results with the overall mean, and use a binary flag for nullness. For diagnosis codes, apply one-hot encoding after grouping codes into 20 categories based on clinical relevance.

C. Drop all records where lab_result is null, and use one-hot encoding for diagnosis codes.

D. Use Amazon SageMaker Data Wrangler's built-in 'Fill missing' with KNN imputation for lab results, and apply ordinal encoding to diagnosis codes based on the order of ICD-10 chapters.

Answer: A

Predictive imputation leverages other features to estimate missing values, retaining data. Count encoding reduces the cardinality of diagnosis codes.

Why this answer

Option A is correct because it uses AWS Glue ETL to impute missing lab results with a predictive model (XGBoost), which is appropriate for high missingness (>60%) where simple imputation would bias the model, and applies count encoding to the high-cardinality diagnosis codes (10,000+ unique values) to avoid the dimensionality explosion of one-hot encoding while preserving frequency information. This approach balances HIPAA compliance (data stays within AWS) with model performance.

Exam trap

The trap here is that candidates often choose simple mean imputation (Option B) or dropping rows (Option C) without considering the impact of high missingness on bias and data loss, or they overcomplicate encoding (Option D) without recognizing that ordinal encoding implies a false order for categorical codes.

How to eliminate wrong answers

Option B is wrong because replacing 60%+ missing lab results with the overall mean ignores the non-random missingness (tests not applicable to all patients) and introduces severe bias, and grouping 10,000+ ICD-10 codes into only 20 categories based on clinical relevance loses granularity and may not reflect readmission risk patterns. Option C is wrong because dropping all records with null lab results would discard over 60% of the data, leading to massive data loss and a non-representative dataset, and one-hot encoding 10,000+ diagnosis codes creates an unmanageable feature space (sparse matrix) that degrades model performance. Option D is wrong because KNN imputation on a dataset with >60% missingness in the same column is computationally expensive and unreliable (neighbors themselves may have missing values), and ordinal encoding based on ICD-10 chapter order imposes an arbitrary ordinal relationship that does not reflect clinical risk or readmission likelihood.
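Count encoding, as named in Option A, can be sketched in plain Python. The ICD-10 codes below are a toy sample, and mapping unseen codes to 0 is an assumption made for this illustration.

```python
from collections import Counter


def fit_count_encoding(values):
    """Count how often each category appears in the training data."""
    return Counter(values)


def transform(values, counts):
    """Replace each category with its training count; unseen categories map to 0."""
    return [counts.get(value, 0) for value in values]


# Toy training column of diagnosis codes.
codes = ["I10", "E11.9", "I10", "J44.1", "I10", "E11.9"]
counts = fit_count_encoding(codes)
encoded = transform(["I10", "E11.9", "Z99.9"], counts)
```

The 10,000+ distinct codes collapse into one numeric column, at the cost that codes with equal frequency become indistinguishable.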

388

MCQ · easy

An ML engineer runs the CLI command shown in the exhibit. However, the training job fails immediately with an error: 'Unable to assume role'. What is the most likely cause?

A. The IAM role 'SageMakerExecutionRole' does not have permission to create the training job.

B. The training image in ECR does not exist.

C. The S3 bucket 'my-bucket' does not exist.

D. The IAM role's trust policy does not grant SageMaker permission to assume the role.

Answer: D

Without proper trust policy, SageMaker cannot assume the role, causing immediate failure.

Why this answer

The 'Unable to assume role' error indicates that SageMaker cannot assume the IAM role specified in the CLI command. This is a trust policy issue: the role's trust policy must include SageMaker as a trusted service (i.e., `"Service": "sagemaker.amazonaws.com"`). Without this, SageMaker is not authorized to assume the role, regardless of the role's permissions.

Exam trap

AWS often tests the distinction between IAM role permissions (what the role can do) and trust policies (who can assume the role), leading candidates to mistakenly select a permission-related option when the error is about trust.

How to eliminate wrong answers

Option A is wrong because the error is about assuming the role, not about the role's permissions to create the training job; permission errors would appear as 'AccessDenied' or similar, not 'Unable to assume role'. Option B is wrong because a missing ECR image would cause an error like 'Image not found' or 'RepositoryNotFoundException', not an assume role error. Option C is wrong because a non-existent S3 bucket would result in an error like 'NoSuchBucket' or 'AccessDenied' when SageMaker tries to access it, not an assume role failure.
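For reference, a trust policy that lets SageMaker assume the role can be sketched as follows. The exhibit's CLI command is not reproduced here; the code only builds the standard trust document.

```python
import json


def sagemaker_trust_policy():
    """Trust policy naming the SageMaker service as a principal allowed to assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "sagemaker.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }
        ],
    }


trust_policy_json = json.dumps(sagemaker_trust_policy(), indent=2)
```

This document is attached to the role as its trust relationship and is separate from the permission policies that govern what the role can do once assumed.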

389

Multi-Select · easy

A company is using Amazon SageMaker to deploy a model for real-time inference. The model requires access to a private S3 bucket that contains reference data. The company wants to ensure that the endpoint can access the S3 bucket without using a public internet connection. Which TWO actions should they take? (Select TWO.)

Select 2 answers

A. Configure the endpoint's security group to allow outbound traffic to the S3 bucket's IP range.

B. Attach the endpoint to a VPC that has a VPC endpoint for S3.

C. Ensure the SageMaker execution role has an IAM policy that grants s3:GetObject access to the bucket.

D. Attach the endpoint to a VPC with an internet gateway and route the S3 traffic through the internet gateway.

E. Attach the endpoint to a VPC with a NAT gateway to route traffic to S3.

Answers: B, C

VPC endpoints allow private connectivity to S3 without internet.

Why this answer

Option B is correct because attaching the SageMaker endpoint to a VPC with a VPC endpoint for S3 (Gateway type) allows the endpoint to access the S3 bucket using AWS's private network, bypassing the public internet. This ensures traffic stays within the AWS backbone, meeting the requirement for no public internet connection. Option C is also correct because the SageMaker execution role must have an IAM policy with s3:GetObject permissions to authorize the read access to the private S3 bucket, which is a prerequisite for any S3 operation.

Exam trap

The trap here is that candidates often confuse VPC endpoints (which keep traffic private) with NAT gateways or internet gateways (which route traffic over the public internet), and they may overlook the mandatory IAM permissions required for S3 access even when using a VPC endpoint.

390

MCQ · hard

A team is deploying a model that requires low-latency inference for real-time predictions. They are using a SageMaker endpoint with a single instance. During testing, they observe high latency. Which change would most effectively reduce latency?

A. Use a multi-model endpoint

B. Add Elastic Inference

C. Enable SageMaker Batch Transform

D. Switch to a larger instance type

Answer: D

Larger instances provide more CPU or GPU capacity for faster inference.

Why this answer

Option D is correct because switching to a larger instance type provides more compute capacity, reducing inference latency. Option A is wrong because multi-model endpoints may increase latency due to on-demand model loading. Option B is wrong because Elastic Inference adds GPU acceleration but may not reduce latency as much as a compute upgrade. Option C is wrong because Batch Transform is for batch inference, not real-time predictions.

391

MCQ · medium

A data scientist trains a neural network on SageMaker using the TensorFlow framework. The training accuracy is lower than expected, and the scientist suspects vanishing gradients. How can the scientist leverage SageMaker Debugger to diagnose this?

A. Increase the number of training epochs to allow gradients to propagate.

B. Export model summaries to TensorBoard for manual inspection.

C. Reduce the learning rate to prevent gradient explosion.

D. Use a built-in Debugger rule to monitor gradient magnitudes during training.

Answer: D

Built-in rules like VanishingGradient can detect and alert when gradients become too small.

Why this answer

Option D is correct because SageMaker Debugger includes built-in rules such as vanishing_gradient and exploding_tensor that automatically monitor tensors during training. Option A is wrong because adding more epochs does not diagnose vanishing gradients and may not solve them. Option B is wrong because exporting summaries to TensorBoard relies on manual inspection and does not provide Debugger's rule-based alerts. Option C is wrong because reducing the learning rate can help in some cases but does not diagnose the issue.
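To make the idea of a gradient-magnitude rule concrete, here is a plain-Python illustration of the kind of check involved. It is not Debugger's implementation; the layer names, gradient values, and the 1e-7 threshold are all assumptions for the example.

```python
def vanishing_layers(gradients_by_layer, threshold=1e-7):
    """Return layer names whose mean absolute gradient falls below the threshold."""
    flagged = []
    for name, grads in gradients_by_layer.items():
        mean_abs = sum(abs(g) for g in grads) / len(grads)
        if mean_abs < threshold:
            flagged.append(name)
    return flagged


# Toy gradients captured at one training step (hypothetical layers).
grads = {
    "dense_1": [1e-9, -2e-9, 5e-10],
    "dense_2": [1e-4, -3e-4, 2e-4],
    "output": [0.02, -0.01, 0.03],
}
flagged = vanishing_layers(grads)
```

A managed rule runs this sort of check automatically against tensors saved during the training job, so no manual inspection is required.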

392

MCQ · medium

A team uses AWS Auto Scaling for a SageMaker real-time endpoint. They notice that when scaling in, the latest instance is always terminated first, causing disruption to recent requests. How can they configure the scaling policy to terminate the oldest instance first?

A. Configure the termination policy as 'OldestInstance'

B. No action needed; this is the default behavior

C. Use lifecycle hooks

D. Use AWS CloudFormation to manage the endpoint

Answer: A

You can set the termination policy to 'OldestInstance' in the scaling policy configuration.

Why this answer

Option A is correct because Application Auto Scaling allows setting a termination policy. Option B is incorrect; the default may not be oldest instance. Option C is wrong because lifecycle hooks are for cleanup actions. Option D is wrong because CloudFormation is for infrastructure as code.

393

MCQ · medium

A team is using Amazon SageMaker to train a neural network. They want to minimize training time while effectively exploring the hyperparameter space. Which approach should they use?

A. Random search

B. Bayesian optimization

C. Grid search

D. Manual tuning

Answer: B

Bayesian optimization uses past evaluations to focus on promising regions, reducing training time.

Why this answer

Bayesian optimization is efficient as it builds a probabilistic model and selects hyperparameters that are likely to improve performance, making it faster than exhaustive methods like grid search.

394

Multi-Select · medium

A data science team deploys a TensorFlow model for real-time inference using the Amazon SageMaker model configuration shown. They observe high latency during the first few requests after deployment. Which TWO actions would reduce cold start latency? (Choose two.)

Select 2 answers

A. Enable data capture on the endpoint

B. Set the SAGEMAKER_PROGRAM environment variable to a more optimized entry point

C. Add a secondary container for model ensemble

D. Configure a Production Variant with an initial instance count greater than zero

E. Use Amazon SageMaker Multi-Model Endpoints

Answers: D, E

Setting an initial instance count ensures that instances are always running, preventing cold start.

Why this answer

Options D and E are correct. Setting an initial instance count greater than zero ensures that the endpoint always has at least one instance running, eliminating cold starts. Using Multi-Model Endpoints allows the endpoint to stay warm and reduces the time to load a model on demand.

Option A (data capture) adds overhead without reducing cold start latency. Option B (changing the environment variable) does not affect model loading time. Option C (adding another container) increases cold start latency.

395

MCQ · hard

Refer to the exhibit. A SageMaker execution role has the IAM policy shown. The team attempts to run a training job that writes results to 's3://my-bucket/training/output/model.tar.gz'. What will happen?

A. The training job will fail because the Deny statement blocks all PutObject actions.

B. The training job will succeed and write the model artifact.

C. The training job will fail because the Deny statement overrides the Allow.

D. The training job will succeed, but the output file will be encrypted with a different key.

Answer: B

The Deny does not affect this resource.

Why this answer

Option B is correct. The Deny statement blocks PutObject only on the specific object 'sensitive-data.csv', and the write to 'model.tar.gz' is allowed by the second statement. There is no explicit deny on 'model.tar.gz'.

Option A is wrong because the Deny is specific to one object rather than all PutObject actions. Option C is wrong because there is no conflict; the Deny applies only to that one object. Option D is wrong because the Deny does not affect this write.
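The evaluation logic can be sketched with a simplified model: an explicit Deny wins only when its resource matches the request, otherwise a matching Allow decides. The exhibit is not reproduced here, so the policy below is a hypothetical reconstruction from the explanation, including the path chosen for 'sensitive-data.csv'. Real IAM evaluation has more inputs (conditions, resource policies, boundaries) that this sketch ignores.

```python
from fnmatch import fnmatchcase

# Hypothetical policy statements inferred from the explanation.
POLICY = [
    {
        "Effect": "Deny",
        "Action": "s3:PutObject",
        "Resource": "arn:aws:s3:::my-bucket/training/sensitive-data.csv",
    },
    {
        "Effect": "Allow",
        "Action": "s3:PutObject",
        "Resource": "arn:aws:s3:::my-bucket/training/*",
    },
]


def is_allowed(action, resource, statements):
    """Explicit Deny on a matching statement wins; otherwise require a matching Allow."""
    matching = [
        s for s in statements
        if s["Action"] == action and fnmatchcase(resource, s["Resource"])
    ]
    if any(s["Effect"] == "Deny" for s in matching):
        return False
    return any(s["Effect"] == "Allow" for s in matching)


model_ok = is_allowed(
    "s3:PutObject", "arn:aws:s3:::my-bucket/training/output/model.tar.gz", POLICY
)
sensitive_ok = is_allowed(
    "s3:PutObject", "arn:aws:s3:::my-bucket/training/sensitive-data.csv", POLICY
)
```

The model artifact write matches only the Allow statement, while the write to the sensitive file matches both and is denied.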

396

MCQhard

A data scientist is using Amazon SageMaker Data Wrangler for feature engineering on a large dataset stored in S3. The dataset has a column 'ProductCategory' with 1000+ unique values. To reduce dimensionality, they want to group categories that appear less than 1% of the time into an 'Other' category. Which Data Wrangler transform should they use?

A.Group similar categories

B.Custom transform with Python

C.Handle rare values

D.One-hot encode with threshold

AnswerC

This built-in transform can group categories below a frequency threshold into an 'Other' value.

Why this answer

The 'Handle rare values' transform in SageMaker Data Wrangler is specifically designed to group infrequent category values into a single 'Other' bucket based on a frequency threshold (e.g., less than 1%). This directly addresses the need to reduce dimensionality by consolidating rare categories without requiring custom code or manual grouping.

Exam trap

The trap here is that candidates may confuse the 'Handle rare values' transform with the 'One-hot encode with threshold' transform, mistakenly thinking the threshold in one-hot encoding serves the same purpose as grouping rare categories, when in fact it limits the number of one-hot columns created, not the grouping of infrequent values.

How to eliminate wrong answers

Option A is wrong because 'Group similar categories' is a manual grouping transform that requires the user to explicitly define which categories to combine, not an automated threshold-based grouping of rare values. Option B is wrong because while a custom Python transform could technically achieve this, it is unnecessary and less efficient when a built-in, optimized transform ('Handle rare values') exists for this exact purpose. Option D is wrong because 'One-hot encode with threshold' applies to one-hot encoding (creating binary columns) and its threshold controls the maximum number of one-hot features, not the grouping of rare categories into an 'Other' bucket.
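
The idea behind grouping rare categories can be sketched in a few lines of plain Python. This is only a conceptual illustration of threshold-based grouping, not Data Wrangler's implementation; the function name and the 'Other' label are assumptions.

```python
from collections import Counter


def group_rare(values, threshold=0.01, other="Other"):
    """Replace categories whose relative frequency is below `threshold`."""
    counts = Counter(values)
    total = len(values)
    return [v if counts[v] / total >= threshold else other for v in values]
```

For example, in a column of 200 rows where one category appears only once (0.5%), that single row would be relabeled 'Other' with the default 1% threshold.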

397

MCQeasy

A data engineer is preparing a large dataset of 10 TB for ML training on Amazon SageMaker. The data is stored in Amazon S3 as CSV files. To reduce training time and cost, the engineer wants to use a columnar format that is optimized for analytical queries. Which format should the engineer convert the data to?

A.XML

B.Parquet

C.ORC

D.JSON Lines

AnswerB

Parquet is a columnar format that speeds up data access and reduces storage costs.

Why this answer

Parquet is a columnar storage format that is highly optimized for analytical queries and is natively supported by Amazon SageMaker for efficient data loading. By converting the 10 TB of CSV data to Parquet, the data engineer can reduce I/O and storage costs because columnar formats allow SageMaker to read only the columns needed for training, rather than scanning entire rows. This directly addresses the goal of reducing training time and cost for ML workloads.

Exam trap

AWS often tests the distinction between columnar formats (Parquet vs. ORC) by making both appear correct, but the trap here is that ORC is tightly coupled with Hive and less commonly used with SageMaker, while Parquet is the de facto standard for AWS-native ML and analytics services.

How to eliminate wrong answers

Option A (XML) is wrong because XML is a verbose, row-oriented text format that is not optimized for analytical queries; it would increase storage size and I/O overhead, making training slower and more expensive. Option C (ORC) is also a columnar format optimized for analytical queries, but it is primarily designed for and tightly integrated with the Apache Hive ecosystem, whereas Parquet is the more universally supported and recommended format for Amazon SageMaker and AWS analytics services. Option D (JSON Lines) is wrong because it is a row-oriented, text-based format that lacks the compression and columnar pruning benefits of Parquet, leading to higher storage costs and slower data access for ML training.

398

MCQmedium

A team is tuning hyperparameters for a neural network using SageMaker's HyperparameterTuningJob with Bayesian optimization. After several trials, the objective metric has not improved significantly. Which action is most likely to help continue making progress?

A.Expand the hyperparameter ranges

B.Switch to random search strategy

C.Use a warm start with previous tuning results

D.Switch to Bayesian search

AnswerB

Random search introduces exploration and can discover new promising regions beyond the current exploitation focus.

Why this answer

Option B is correct because switching to random search introduces exploration and can help escape local optima that Bayesian optimization might be stuck exploiting. Option D (switch to Bayesian) is already in use. Option C (warm start) uses previous results but does not change the search strategy.

Option A (expand ranges) might help if the optimum lies outside current ranges, but stagnation often requires more exploration.

399

MCQeasy

A data scientist is using SageMaker to train a linear regression model. After training, they evaluate the model on the test set and get an R² of 0.95. However, when they deploy the model to a SageMaker endpoint and run predictions on new data, the predictions are far off. What is the most likely cause?

A.The endpoint is using a different inference script.

B.The test set is not representative of the production data distribution.

C.The model was trained with a wrong algorithm.

D.The model is overfitting the training data.

AnswerB

Data drift causes the model to perform poorly on new data despite good test metrics.

Why this answer

Option B is correct because the test set is not representative of the production data distribution (data drift). The high R² on the test set suggests the model fits well, but production data differs. Option D is wrong because overfitting would show lower test R².

Option A is wrong because different inference scripts would cause errors, not just poor predictions. Option C is wrong because the algorithm is appropriate.

400

MCQhard

A healthcare startup has deployed a machine learning model on Amazon SageMaker that predicts patient readmission risks. The model uses sensitive health data stored in an S3 bucket encrypted with AWS KMS. The SageMaker endpoint is configured with an IAM role that has the following policy attached: { "Version": "2012-10-17", "Statement": [ { "Effect": "Allow", "Action": "s3:*", "Resource": "arn:aws:s3:::healthcare-data/*", "Condition": { "Bool": { "aws:SecureTransport": "true" } } }, { "Effect": "Allow", "Action": "kms:Decrypt", "Resource": "*" } ] }. During a security audit, the team discovers that the IAM role's KMS permission is too permissive because it allows decryption of any KMS key in the account. The team needs to modify the policy to follow the principle of least privilege while still allowing the SageMaker endpoint to read the encrypted data. Which modification should the team make?

A.Change the KMS statement Action to "kms:DescribeKey" instead of "kms:Decrypt"

B.Add a condition to the KMS statement: "Condition": { "StringEquals": { "kms:ViaService": "s3.us-east-1.amazonaws.com" } }

C.Remove the KMS statement entirely, as S3 bucket policies with SSE-KMS do not require KMS permissions

D.Change the KMS statement to: "Action": "kms:Decrypt", "Resource": "arn:aws:kms:us-east-1:123456789012:key/1234abcd-12ab-34cd-56ef-1234567890ab"

AnswerD

Restricting the Resource to the specific KMS key ARN ensures that the role can only decrypt the key used for the healthcare data, adhering to least privilege.

Why this answer

The current policy allows kms:Decrypt on any KMS key (*). To follow least privilege, the team should restrict the Resource to the specific KMS key used to encrypt the S3 bucket. Option D (keep the Action as kms:Decrypt and restrict Resource to the specific key ARN) is correct.

Option C (remove the KMS statement entirely) would break the endpoint because it cannot decrypt the data. Option B (add a kms:ViaService condition) is good practice but still allows decryption with any key when the condition is met, so it is not least privilege. Option A (use kms:DescribeKey instead of kms:Decrypt) does not allow decryption.
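
As an illustration, the change can be expressed as a small transformation of the policy document. The policy and key ARN below are copied from the question; the helper name `scope_kms_decrypt` is an assumption made for this sketch.

```python
import copy
import json

# Policy text copied from the question.
POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "s3:*",
            "Resource": "arn:aws:s3:::healthcare-data/*",
            "Condition": {"Bool": {"aws:SecureTransport": "true"}},
        },
        {"Effect": "Allow", "Action": "kms:Decrypt", "Resource": "*"},
    ],
}

# Key ARN taken from the correct answer option.
KEY_ARN = "arn:aws:kms:us-east-1:123456789012:key/1234abcd-12ab-34cd-56ef-1234567890ab"


def scope_kms_decrypt(policy, key_arn):
    """Return a copy in which wildcard kms:Decrypt is limited to one key."""
    scoped = copy.deepcopy(policy)
    for stmt in scoped["Statement"]:
        if stmt["Action"] == "kms:Decrypt" and stmt["Resource"] == "*":
            stmt["Resource"] = key_arn
    return scoped


if __name__ == "__main__":
    print(json.dumps(scope_kms_decrypt(POLICY, KEY_ARN), indent=2))
```

Only the Resource of the KMS statement changes; the S3 statement and the kms:Decrypt action are left as they were.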

401

MCQhard

Refer to the exhibit. A team receives an error when running a SageMaker Model Monitor schedule for data quality. What should they do to resolve this issue?

A.Update the IAM role to allow S3 access

B.Restart the monitoring schedule

C.Enable data capture on the endpoint

D.Create a baseline job using the training dataset

AnswerD

A baseline must be generated from training data to compare inference data against.

Why this answer

Option D is correct because Model Monitor requires a baseline (constraints and statistics) generated from the training data. The error indicates the baseline is missing. Option C enables capture but does not create a baseline.

Option B is incorrect because restarting the schedule does not create the missing baseline. Option A is about permissions, not the baseline.

402

MCQeasy

A data scientist is preparing a dataset for a linear regression model. The dataset has a few missing values in a numerical feature with a normal distribution and no outliers. Which imputation method is most appropriate?

A.Impute with mode

B.Impute with mean

C.Impute with median

D.Drop rows with missing values

AnswerB

Mean is appropriate for normally distributed numerical data without outliers.

Why this answer

Option B is correct because mean imputation is suitable for normally distributed data without outliers. Option D (drop rows) reduces sample size. Option C (median) is robust to outliers but not needed.

Option A (mode) is for categorical data.
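
A minimal sketch of mean imputation, assuming missing values are represented as `None` in a plain Python list (the function name is hypothetical):

```python
from statistics import mean


def impute_mean(column):
    """Fill missing entries (None) with the mean of the observed values."""
    observed = [x for x in column if x is not None]
    fill = mean(observed)
    return [fill if x is None else x for x in column]
```

In practice the fill value would be computed on the training split only and then reused for validation, test, and inference data.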

403

MCQmedium

A company wants to deploy a machine learning model that makes real-time predictions for a mobile app. The model is a deep neural network with a large model size (500 MB). Which SageMaker endpoint configuration is most cost-effective while meeting low-latency requirements?

A.Multi-model endpoint

B.Serverless inference

C.Real-time endpoint with a single instance

D.Batch transform

AnswerC

Ensures low latency and is cost-effective for a single model with sustained traffic.

Why this answer

A real-time endpoint with a single instance provides consistent low latency and is cost-effective for a single large model. Other options either do not meet latency requirements or are designed for multiple models.

404

MCQhard

A machine learning engineer is deploying a PyTorch model for real-time inference on SageMaker. The model requires GPU for low-latency predictions. The deployment fails with the error: 'The primary container does not support the requested instance type.' The instance type is ml.p3.2xlarge. Which action should the engineer take to resolve the issue?

A.Use SageMaker Neo to compile the model for the target instance type

B.Request a service quota increase for the ml.p3.2xlarge instance type

C.Verify that the PyTorch framework version specified in the SageMaker estimator matches a version that supports GPU instances

D.Create a custom inference container and use it with the SageMaker model

AnswerC

Older PyTorch versions may not support GPU; using a supported version resolves the error.

Why this answer

Option C is correct because the error 'The primary container does not support the requested instance type' typically occurs when the specified PyTorch framework version in the SageMaker estimator does not include GPU support for the chosen instance type (ml.p3.2xlarge). SageMaker's prebuilt PyTorch containers are version-specific and only certain versions are compiled with CUDA and GPU libraries; using a version that lacks GPU support causes the container to reject GPU instance types. Verifying and selecting a PyTorch version that explicitly supports GPU instances resolves the mismatch.

Exam trap

The trap here is that candidates often assume the error is due to resource limits (quota) or hardware incompatibility (Neo), rather than recognizing it as a framework version and container image mismatch specific to GPU support.

How to eliminate wrong answers

Option A is wrong because SageMaker Neo compiles models for edge devices or optimized inference on specific hardware, but it does not fix a container-instance type compatibility error; the error occurs before model compilation. Option B is wrong because a service quota increase addresses insufficient capacity or account limits for the instance type, not a container-level compatibility error; the error indicates the container rejects the instance type, not that the instance is unavailable. Option D is wrong because creating a custom inference container is unnecessary when the issue is simply a version mismatch in the prebuilt container; the error can be resolved by selecting a supported PyTorch version without custom container overhead.

405

MCQeasy

A data scientist is training a regression model in Amazon SageMaker. The dataset contains missing values in several features. The scientist wants to handle missing values as part of the training pipeline to ensure consistency between training and inference. Which approach should the scientist use?

A.Impute missing values in a separate Jupyter notebook and save the cleaned data.

B.Use SageMaker Autopilot to automatically handle missing values.

C.Drop all rows with missing values before training.

D.Use a scikit-learn container in SageMaker to create a preprocessing step that imputes missing values and include it in the inference pipeline.

AnswerD

Keeping imputation inside the inference pipeline applies the same preprocessing at training and inference time.

Why this answer

Option D is correct because it uses a scikit-learn container within SageMaker to create a preprocessing step that imputes missing values, then includes that step in the inference pipeline. This ensures the same imputation logic (e.g., mean, median, or custom strategy) is applied consistently during both training and inference, preventing training-serving skew and maintaining reproducibility. SageMaker Pipelines or the built-in scikit-learn container allow the preprocessing to be serialized as part of the model artifact, so inference requests automatically undergo the same transformation.

Exam trap

The trap here is that candidates often assume SageMaker Autopilot (Option B) is the correct choice because it automates preprocessing, but they miss that the question specifically requires a custom, reproducible pipeline that ensures consistency between training and inference, which Autopilot does not expose for custom control.

How to eliminate wrong answers

Option A is wrong because handling missing values in a separate Jupyter notebook and saving the cleaned data breaks the training-inference consistency; the imputation logic is not captured in a reusable pipeline, leading to potential mismatch when new data arrives during inference. Option B is wrong because SageMaker Autopilot is an automated machine learning service that handles missing values internally during model selection, but it does not allow the data scientist to control the imputation method or integrate a custom preprocessing step into a production inference pipeline. Option C is wrong because dropping all rows with missing values can discard valuable data, reduce model performance, and is not feasible when missing values appear in inference-time data, as the pipeline would have no strategy to handle them.

406

MCQhard

A machine learning engineer is training a deep learning model using TensorFlow in SageMaker. The training runs on an ml.p3.16xlarge instance (8 GPUs). The engineer notices that GPU utilization is low (~30%) and time per epoch is high. The model uses a custom training loop. Which configuration change is most likely to improve GPU utilization?

A.Increase the batch size to match GPU memory

B.Reduce the number of data loading workers

C.Use mixed precision training

D.Enable SageMaker Managed Warm Pools

AnswerA

Larger batch size increases the amount of computation per step, keeping GPUs more fully utilized.

Why this answer

Option A is correct because increasing the batch size increases the computational work per GPU, keeping them busier and improving utilization. Option C (mixed precision) can improve throughput but not necessarily utilization if batch size remains small. Option D (SageMaker Managed Warm Pools) reduces startup time between consecutive training jobs and does not affect GPU utilization during a job.

Option B (reducing data loading workers) could worsen data starvation, decreasing utilization.

407

MCQhard

A data scientist is running a SageMaker training job with a custom PyTorch image. The training script loads a large dataset into memory, and the job fails with an out-of-memory error after a few minutes. The instance type is ml.m5.xlarge (16 GB RAM). What should the data scientist do to resolve this issue without changing the instance type?

A.Enable SageMaker Managed Spot Training to free memory

B.Implement data loading with multiprocessing and increase the number of workers

C.Reduce the batch size in the training script

D.Use SageMaker Pipe mode to stream data from S3

AnswerC

Smaller batch sizes reduce memory consumption per step, helping to fit within the available RAM.

Why this answer

Reducing the batch size decreases the memory required for each training step, which can prevent out-of-memory errors. Pipe mode can help by streaming data, but if the entire dataset is loaded, it may not be sufficient. Multiprocessing can increase memory usage.

Spot training does not free memory.

408

MCQeasy

A company has deployed a SageMaker real-time endpoint for a model that predicts customer churn. The endpoint uses a single ml.m5.large instance. After deployment, the team notices that during peak hours, the endpoint returns 5xx errors for about 20% of requests. The endpoint has not been configured with any scaling policy. The team needs to resolve this issue with minimal cost increase. Which solution should the team implement?

A.Deploy the model to a multi-model endpoint to reduce resource utilization.

B.Enable Auto Scaling for the endpoint with a target tracking policy based on the average InvocationsPerInstance metric.

C.Increase the instance type to ml.m5.xlarge to handle more concurrent requests.

D.Use SageMaker batch transform instead of real-time inference to process peak traffic asynchronously.

AnswerB

Auto Scaling adds instances only when needed, minimizing cost while handling peak load.

Why this answer

Option B is correct because enabling Auto Scaling with a target tracking policy based on the average InvocationsPerInstance metric dynamically adjusts the number of instances in response to traffic spikes, preventing 5xx errors during peak hours without over-provisioning. This approach minimizes cost by scaling only when needed, unlike manual instance upgrades or batch transforms that either increase baseline cost or introduce latency.

Exam trap

The trap here is that candidates often confuse 'scaling up' (increasing instance size) with 'scaling out' (adding more instances), and overlook that Auto Scaling with a target tracking policy is the most cost-effective way to handle variable traffic, as it matches capacity to demand in real time.

How to eliminate wrong answers

Option A is wrong because deploying to a multi-model endpoint reduces resource utilization by sharing a single container across multiple models, but it does not address the root cause of insufficient capacity for a single model under peak load; it may even exacerbate contention. Option C is wrong because increasing the instance type to ml.m5.xlarge provides more compute per instance but incurs a fixed higher cost regardless of traffic, failing the 'minimal cost increase' requirement and not dynamically adapting to variable load. Option D is wrong because SageMaker batch transform is designed for asynchronous, offline inference on large datasets, not for real-time requests; it would introduce unacceptable latency and cannot serve interactive predictions, thus not resolving the immediate 5xx errors during peak hours.
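
A target tracking policy of this kind is registered through Application Auto Scaling. The sketch below only builds the two request payloads; the endpoint name, variant name, target value, and capacity limits are hypothetical and would need tuning for a real workload.

```python
def build_scaling_requests(endpoint_name, variant_name, target_invocations, max_instances):
    """Build payloads for register_scalable_target and put_scaling_policy."""
    resource_id = f"endpoint/{endpoint_name}/variant/{variant_name}"
    dimension = "sagemaker:variant:DesiredInstanceCount"
    target = {
        "ServiceNamespace": "sagemaker",
        "ResourceId": resource_id,
        "ScalableDimension": dimension,
        "MinCapacity": 1,
        "MaxCapacity": max_instances,
    }
    policy = {
        "PolicyName": "invocations-target-tracking",  # hypothetical name
        "ServiceNamespace": "sagemaker",
        "ResourceId": resource_id,
        "ScalableDimension": dimension,
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            "TargetValue": float(target_invocations),
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
            },
        },
    }
    return target, policy


# Usage (requires AWS credentials, so it is left as a comment):
#   import boto3
#   client = boto3.client("application-autoscaling")
#   target, policy = build_scaling_requests("churn-endpoint", "AllTraffic", 100, 4)
#   client.register_scalable_target(**target)
#   client.put_scaling_policy(**policy)
```

Keeping the minimum capacity at one instance preserves the current baseline cost, and extra instances are added only while the per-instance invocation rate exceeds the target.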

409

Multi-Selectmedium

An MLOps engineer is designing a CI/CD pipeline for deploying machine learning models to a production SageMaker endpoint. The pipeline should include automated testing, approval gates, and rollback capability. Which THREE components should be included in the pipeline? (Select THREE.)

Select 3 answers

A.A step to register the model in SageMaker Model Registry.

B.A CloudFormation template to deploy the endpoint infrastructure, enabling rollback via stack update.

C.A separate staging endpoint to validate the model before production deployment.

D.A manual approval step after staging testing.

E.A step to run SageMaker Debugger to monitor training.

AnswersB, C, D

Infrastructure as code allows precise rollback by redeploying a previous CloudFormation stack.

Why this answer

Option B is correct because using a CloudFormation template to deploy the SageMaker endpoint infrastructure enables rollback via stack update. If a deployment fails, CloudFormation can automatically roll back the stack to the previous known good state, ensuring infrastructure consistency and reducing downtime.

Exam trap

The trap here is that candidates confuse model registry steps (Option A) or training monitoring tools (Option E) with deployment pipeline components, but the question specifically asks for components that enable automated testing, approval gates, and rollback capability in the CI/CD pipeline for deploying to a production SageMaker endpoint.

410

MCQhard

A machine learning engineer is deploying a pre-trained NLP model on Amazon SageMaker for real-time inference. The model expects input sequences of variable length, and performance is critical. The engineer wants to minimize latency while handling the variable-length inputs efficiently. Which approach should the engineer choose?

A.Reduce the model size by pruning and quantization.

B.Pad all input sequences to the maximum length in the batch.

C.Use dynamic batching with a custom inference script that groups requests by sequence length.

D.Process each request individually to avoid padding overhead.

AnswerC

Dynamic batching reduces padding and latency.

Why this answer

Option C is correct because dynamic batching with a custom inference script that groups requests by sequence length minimizes padding overhead and maximizes hardware utilization. By batching similar-length sequences together, the model avoids excessive padding to the maximum length in the batch, which reduces wasted computation and latency. This approach is particularly effective for variable-length NLP inputs on SageMaker, where the inference container can be customized to implement the grouping logic.

Exam trap

AWS often tests the misconception that padding to the maximum length is always necessary or efficient, but the trap here is that dynamic batching with length-based grouping is a more sophisticated technique that balances batching efficiency with minimal padding overhead.

How to eliminate wrong answers

Option A is wrong because pruning and quantization reduce model size and can improve latency, but they do not address the core issue of efficiently handling variable-length input sequences; they are orthogonal optimizations. Option B is wrong because padding all sequences to the maximum length in the batch introduces significant wasted computation and memory, especially when sequence lengths vary widely, leading to higher latency. Option D is wrong because processing each request individually eliminates batching benefits, resulting in lower throughput and higher per-request latency due to underutilized hardware accelerators.
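
The grouping logic that a custom inference script might apply can be sketched as follows. This is a simplified illustration, not a SageMaker API: it assumes requests have already been tokenized into lists of integers and that 0 is the padding token.

```python
def batch_by_length(sequences, max_batch_size, pad_token=0):
    """Sort by length, cut into batches, and pad only within each batch."""
    ordered = sorted(sequences, key=len)
    batches = []
    for start in range(0, len(ordered), max_batch_size):
        chunk = ordered[start:start + max_batch_size]
        width = len(chunk[-1])  # longest sequence in this chunk (list is sorted)
        batches.append([seq + [pad_token] * (width - len(seq)) for seq in chunk])
    return batches
```

Because similar-length sequences land in the same batch, short inputs are never padded up to the length of the longest request in the whole queue.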

411

MCQmedium

An ML team at a financial services company has developed a fraud detection model using Amazon SageMaker. The model is currently deployed to a production endpoint with a single variant using the previous model version. The team wants to deploy a new model version with a canary deployment where 10% of traffic goes to the new version and 90% remains on the old version for 30 minutes before shifting all traffic to the new version if no issues are detected. Which step is essential to achieve this safe rollout?

A.Use the 'Deploy' method on the model object with the 'mode' parameter set to 'canary' within the built-in XGBoost algorithm container.

B.Update the endpoint with a new production variant for the new model version and set the 'InitialVariantWeight' to 10 for the new variant and 90 for the old variant, specifying a 'BlueGreenUpdatePolicy' with a 'TrafficRoutingConfiguration' for canary.

C.Ensure the endpoint is hosted on at least two instances to enable load balancing, then deploy the new model version as a separate variant and manually adjust the endpoint's DNS to split traffic.

D.Deploy the new model as a separate endpoint and use a SageMaker predictor to randomly route 10% of inference requests to the new endpoint.

AnswerB

This configuration uses SageMaker's blue/green deployment with canary traffic shifting, which is the correct approach.

Why this answer

Option B is correct because SageMaker canary deployments are configured by setting the 'BlueGreenUpdatePolicy' in an endpoint update. Option D is incorrect because SageMaker does not support traffic splitting through the predictor directly. Option A is incorrect because SageMaker does not provide a built-in canary deployment mode via the built-in algorithms.

Option C is incorrect because hosting the endpoint on multiple instances does not by itself enable canary routing.
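
For the scenario in the question (10% canary, 30-minute wait), the deployment configuration passed to an endpoint update might look like the sketch below. The alarm name and the termination wait are hypothetical values chosen for illustration.

```python
def canary_deployment_config(canary_percent, wait_seconds, alarm_names):
    """Build a DeploymentConfig dict for a blue/green canary rollout."""
    return {
        "BlueGreenUpdatePolicy": {
            "TrafficRoutingConfiguration": {
                "Type": "CANARY",
                "CanarySize": {"Type": "CAPACITY_PERCENT", "Value": canary_percent},
                "WaitIntervalInSeconds": wait_seconds,
            },
            "TerminationWaitInSeconds": 300,  # assumed value
        },
        "AutoRollbackConfiguration": {
            "Alarms": [{"AlarmName": name} for name in alarm_names]
        },
    }


# 10% canary held for 30 minutes; "fraud-endpoint-5xx" is a hypothetical alarm.
config = canary_deployment_config(10, 30 * 60, ["fraud-endpoint-5xx"])

# Usage (requires AWS credentials, so it is left as a comment):
#   import boto3
#   boto3.client("sagemaker").update_endpoint(
#       EndpointName="...", EndpointConfigName="...", DeploymentConfig=config)
```

If one of the listed CloudWatch alarms fires during the wait interval, the rollout is rolled back to the old fleet instead of shifting the remaining traffic.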

412

Multi-Selecthard

A data scientist is working with a dataset containing customer demographics and purchase history. The dataset includes categorical variables with high cardinality (e.g., ZIP code, product ID). The data scientist wants to perform feature engineering to improve model performance. Which THREE feature engineering techniques should the data scientist consider? (Choose three.)

Select 3 answers

A.Principal Component Analysis (PCA) to reduce dimensionality of numerical features.

B.Domain-specific feature engineering based on business rules.

C.Target encoding for high-cardinality categorical variables.

D.Frequency encoding to represent categories by their occurrence count.

E.One-hot encoding all categorical features.

AnswersA, C, D

PCA can reduce noise and multicollinearity.

Why this answer

Principal Component Analysis (PCA) is a dimensionality reduction technique that transforms correlated numerical features into a smaller set of uncorrelated principal components, capturing the maximum variance in the data. This is correct because the dataset includes numerical features (e.g., purchase amounts, age) where PCA can reduce noise and multicollinearity, improving model performance without losing critical information.

Exam trap

AWS often tests the distinction between techniques that are universally applicable (like PCA for numerical features) versus those that are specifically designed to handle high-cardinality categorical variables (like target encoding and frequency encoding), tempting candidates to choose one-hot encoding without considering its impracticality for high cardinality.
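
The two encodings aimed at high-cardinality categoricals can be sketched in plain Python. The function names are assumptions, and the target encoder here is the naive per-category mean without smoothing.

```python
from collections import Counter, defaultdict


def frequency_encode(categories):
    """Replace each category with the number of times it occurs."""
    counts = Counter(categories)
    return [counts[c] for c in categories]


def target_encode(categories, targets):
    """Replace each category with the mean target value observed for it."""
    sums, counts = defaultdict(float), Counter()
    for c, y in zip(categories, targets):
        sums[c] += y
        counts[c] += 1
    return [sums[c] / counts[c] for c in categories]
```

Both produce a single numeric column regardless of how many distinct ZIP codes or product IDs exist, which is why they scale better than one-hot encoding. Target encoding should be fitted on training data only, since computing it over the full dataset leaks label information.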

413

MCQeasy

Refer to the exhibit. A data scientist reviews the output of a SageMaker training job. The model has 95% training accuracy and 92% validation accuracy. Which statement is true?

A.The model has acceptable performance with a small generalization gap

B.The model is underfitting because the validation accuracy is too low

C.The model needs more epochs to improve validation accuracy

D.The model is overfitting because the training accuracy is higher than validation accuracy

AnswerA

The 3% gap is typical and the accuracy values are high.

Why this answer

Option A is correct because a 3% gap between training and validation accuracy is typically considered a small generalization gap and indicates acceptable performance. Option D (overfitting) would show a larger gap. Option B (underfitting) would show low accuracy on both sets.

Option C (more epochs) may not help if the model is already converging.

414

Multi-Selectmedium

A machine learning engineer is training a neural network using Amazon SageMaker. The training job uses a single GPU instance. To improve training speed using distributed training, which two steps should they take? (Select TWO.)

Select 2 answers

A.Split the dataset into smaller files

B.Use SageMaker's distributed data parallelism library

C.Modify the training script to use Horovod or PyTorch DistributedDataParallel

D.Enable automatic mixed precision

E.Increase the number of worker instances in the training job

AnswersC, E

These frameworks enable multi-GPU communication and are necessary for distributed training.

Why this answer

Distributed training requires both modifying the training script to use a distributed framework (e.g., Horovod, PyTorch DDP) and increasing the number of instances. Splitting the dataset into smaller files can improve I/O but is not about distribution. SageMaker's distributed data parallelism library is one option, but modifying the script with a framework is the general step.

Automatic mixed precision improves speed on a single GPU but does not enable distributed training.

415

MCQeasy

An e-commerce company uses a SageMaker endpoint to serve a product recommendation model. The model is retrained every month using batch transforms. The ML team has set up a retraining pipeline using SageMaker Processing jobs and Step Functions. Recently, the Step Functions workflow has been failing at the retraining step with an error: 'AccessDeniedException: User: arn:aws:sts::123456789012:assumed-role/RetrainingRole/abc123 is not authorized to perform: s3:GetObject on resource: arn:aws:s3:::training-data/processed/latest.parquet'. The team confirms that the S3 bucket exists and the object is present. The retraining role has the following policy: { "Version": "2012-10-17", "Statement": [ { "Effect": "Allow", "Action": [ "s3:GetObject", "s3:PutObject" ], "Resource": "arn:aws:s3:::training-data/*" } ] }. The team also verifies that the bucket policy does not explicitly deny access. What is the MOST likely cause of the AccessDenied error?

A.The Step Functions execution role does not have permission to invoke the SageMaker Processing job

B.The path in the error message is misspelled; the actual object is at a different key

C.The S3 bucket has a bucket policy that denies access to the retraining role based on a condition like aws:SourceIp

D.The training data object uses server-side encryption with AWS KMS (SSE-KMS), and the retraining role lacks kms:Decrypt permission on the KMS key

AnswerD

If the object is encrypted with SSE-KMS, the role needs both s3:GetObject and kms:Decrypt. The current IAM policy does not include KMS permissions.

Why this answer

The error indicates that the retraining role is not authorized to call GetObject on the specific object. Even though the policy allows s3:GetObject on 'arn:aws:s3:::training-data/*', if the object is encrypted with SSE-KMS, the role also needs kms:Decrypt permission on the KMS key. The bucket policy might also require encryption.

Option D is the most likely cause. Option A (the Step Functions execution role lacking permission to invoke the Processing job) would produce a different error, raised against the Step Functions role rather than against s3:GetObject. Option C (a conditional deny in the bucket policy) is ruled out because the team verified that the bucket policy does not explicitly deny access.

Option B (path typo) is ruled out because the team confirmed that the object is present at that key.

416

MCQmedium

A data engineer needs to join two large datasets from Amazon S3: one containing customer demographics and another containing transaction history. The join key is `customer_id`. To minimize data shuffling and improve performance, the engineer decides to use Amazon SageMaker Processing with Spark. Which configuration should the engineer use?

A.Use a bucketed join with the same number of buckets

B.Broadcast join the larger dataset

C.Use a bucketed join with the same number of buckets and co-location

D.Use a repartition on the join key before join

AnswerC

Bucketing with co-location allows Spark to perform the join without shuffling.

Why this answer

Option C is correct because bucketed joins with the same number of buckets and co-location ensure that data with the same `customer_id` hash is physically stored together on the same nodes. This eliminates the need for expensive shuffles during the join, as Spark can perform the join locally within each executor, dramatically improving performance for large datasets in SageMaker Processing.

Exam trap

The trap here is that candidates assume bucketing alone (same number of buckets) is sufficient, but without co-location, Spark still performs a shuffle to align the data, so both conditions are required for a shuffle-free join.

How to eliminate wrong answers

Option A is wrong because bucketed joins require both datasets to have the same number of buckets AND co-location (data physically stored together); without co-location, Spark still shuffles data to align partitions, negating the performance benefit. Option B is wrong because broadcast join is only efficient when one dataset is small enough to fit in memory (typically <100 MB); the transaction history dataset is large, so broadcasting it would cause out-of-memory errors or severe performance degradation. Option D is wrong because repartitioning on the join key before the join adds an extra shuffle step, increasing overhead rather than reducing it; bucketing with co-location avoids shuffles entirely.
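
The idea behind a bucketed join can be illustrated in plain Python (this is not Spark code, and the sample rows are made up). Both datasets are hashed on `customer_id` into the same number of buckets, so bucket i of one side only ever needs to meet bucket i of the other. In Spark the analogous setup is writing both tables with `DataFrameWriter.bucketBy` on `customer_id` using the same bucket count.

```python
from collections import defaultdict

NUM_BUCKETS = 4  # both datasets must use the same bucket count


def bucketize(rows, key, num_buckets=NUM_BUCKETS):
    """Group rows into buckets by hashing the join key."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[hash(row[key]) % num_buckets].append(row)
    return buckets


def bucketed_join(left, right, key):
    """Join bucket i of `left` only against bucket i of `right`."""
    left_b, right_b = bucketize(left, key), bucketize(right, key)
    joined = []
    for i, left_rows in left_b.items():
        # Build a lookup for the matching bucket on the right side only.
        lookup = defaultdict(list)
        for r in right_b.get(i, []):
            lookup[r[key]].append(r)
        for l in left_rows:
            for r in lookup[l[key]]:
                joined.append({**l, **r})
    return joined


# Hypothetical sample data.
demographics = [{"customer_id": 1, "age": 34}, {"customer_id": 2, "age": 51}]
transactions = [
    {"customer_id": 1, "amount": 20.0},
    {"customer_id": 2, "amount": 5.5},
    {"customer_id": 1, "amount": 7.25},
]
result = bucketed_join(demographics, transactions, "customer_id")
print(result)
```

Because matching keys always hash to the same bucket index, no row has to be compared against a different bucket, which is the property that lets a distributed engine skip the shuffle.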

417

Multi-Selecteasy

A data scientist is using SageMaker Autopilot to automatically build a model. Which TWO aspects does Autopilot handle? (Choose TWO.)

Select 2 answers

A.Data ingestion

B.Model deployment

C.Feature engineering

D.Data labeling

E.Hyperparameter tuning

AnswersC, E

Correct: Autopilot automatically explores different feature transformations.

Why this answer

Options C and E are correct because Autopilot performs automated feature engineering and hyperparameter tuning. Option B is wrong because Autopilot does not deploy the model; that is a separate step. Option D is wrong because data labeling is not part of Autopilot.

Option A is wrong because Autopilot does not handle data ingestion beyond reading the dataset it is given.

418

MCQhard

A data scientist is preparing data for a regression model. The target variable has a skewed distribution. The scientist wants to apply a log transformation to make it closer to normal. Which step should be taken before applying log transformation?

A.Standardize the data to zero mean and unit variance

B.Remove outliers using IQR

C.Ensure all values are positive

D.Center the data by subtracting the mean

AnswerC

Log is undefined for zero and negative values. If present, add a constant or use other transformations.

Why this answer

The log transformation is defined only for positive real numbers; applying it to zero or negative values results in undefined or complex outputs. Therefore, before applying a log transformation, you must ensure all values in the target variable are positive, typically by adding a constant (e.g., log(x + 1)) if zeros are present. This step is a fundamental data preparation requirement for log transformations in regression modeling.
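
A minimal sketch of the domain check followed by a log(x + 1) transform, assuming the target values are plain Python floats. The function name `safe_log_transform` and the sample values are illustrative only.

```python
import math


def safe_log_transform(values):
    """Apply log(x + 1) after checking that every value is non-negative."""
    if any(v < 0 for v in values):
        # This sketch rejects negatives; shift the data or pick another transform.
        raise ValueError("negative values found; cannot apply log(x + 1) safely")
    return [math.log1p(v) for v in values]


targets = [0.0, 9.0, 99.0, 999.0]
transformed = safe_log_transform(targets)
print(transformed)
```

Using `math.log1p` handles zeros without a separate shift, since log(0 + 1) is 0.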

Exam trap

AWS often tests the assumption that candidates will confuse data normalization or centering with the domain restriction of the log function, leading them to pick standardization or mean-centering as a preparatory step.

How to eliminate wrong answers

Option A is wrong because standardizing to zero mean and unit variance (z-score normalization) does not guarantee all values become positive; it centers data around zero, which can produce negative values, making log transformation invalid. Option B is wrong because removing outliers using IQR is not a prerequisite for log transformation; while outliers can affect model performance, the log transformation itself can help mitigate skewness and reduce the influence of outliers, and removing them beforehand is an optional, separate step. Option D is wrong because centering data by subtracting the mean shifts values to have a mean of zero, which inevitably introduces negative values, directly contradicting the requirement for positive inputs for log transformation.

419

Multi-Selectmedium

Which TWO options are recommended best practices for monitoring model performance in production on SageMaker? (Choose 2.)

Select 2 answers

A.Retrain the model daily based on recent data without evaluation.

B.Use SageMaker Clarify for bias monitoring and feature importance drift.

C.Enable SageMaker Model Monitor to capture data drift and model quality metrics.

D.Set up a CloudWatch alarm on the endpoint's Invocations metric.

E.Manually compare prediction distributions weekly.

AnswersB, C

Clarify can monitor bias and explainability over time.

Why this answer

Options B and C are correct. SageMaker Model Monitor can track data quality and model quality drift, and SageMaker Clarify can monitor bias and feature importance drift. Option D is wrong because CloudWatch alarms can notify but are not a complete monitoring solution.

Option E is wrong because manual review is not scalable. Option A is wrong because retraining daily without evaluation is not monitoring and may be premature.

420

MCQeasy

A data science team has trained a model using SageMaker and wants to deploy it to a production endpoint with automatic scaling based on request volume. Which SageMaker feature should they use to configure scaling?

A.SageMaker Endpoint Autoscaling

B.SageMaker Debugger

C.SageMaker Model Registry

D.SageMaker Pipelines

AnswerA

Endpoint Autoscaling automatically adjusts the number of instances based on demand.

Why this answer

SageMaker Endpoint Autoscaling is the correct feature because it automatically adjusts the number of instances behind a SageMaker hosted endpoint based on a target metric (e.g., requests per minute, CPU utilization) using Application Auto Scaling. This allows the endpoint to handle varying request volumes without manual intervention, ensuring cost efficiency and performance.
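
The sketch below builds the two Application Auto Scaling requests that a target tracking setup typically needs. The endpoint name, variant name, capacities, and target value are all assumptions for illustration, and the code only constructs the request dictionaries; nothing is sent to AWS.

```python
def build_autoscaling_requests(endpoint_name, variant_name,
                               min_instances, max_instances,
                               target_invocations):
    """Build kwargs for registering a scalable target and a scaling policy."""
    resource_id = f"endpoint/{endpoint_name}/variant/{variant_name}"
    dimension = "sagemaker:variant:DesiredInstanceCount"
    target = {
        "ServiceNamespace": "sagemaker",
        "ResourceId": resource_id,
        "ScalableDimension": dimension,
        "MinCapacity": min_instances,
        "MaxCapacity": max_instances,
    }
    policy = {
        "PolicyName": f"{endpoint_name}-invocations-target",  # hypothetical name
        "ServiceNamespace": "sagemaker",
        "ResourceId": resource_id,
        "ScalableDimension": dimension,
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            "TargetValue": float(target_invocations),
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
            },
        },
    }
    return target, policy


# Hypothetical endpoint and limits.
target, policy = build_autoscaling_requests("recs-endpoint", "AllTraffic", 1, 4, 100)

# With boto3 installed and credentials configured, these could be sent as:
#   client = boto3.client("application-autoscaling")
#   client.register_scalable_target(**target)
#   client.put_scaling_policy(**policy)
print(target["ResourceId"])
```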

Exam trap

The trap here is that candidates may confuse SageMaker Debugger (a training debugger) or SageMaker Pipelines (a workflow tool) with scaling features, when only Endpoint Autoscaling directly manages production instance count based on request volume.

How to eliminate wrong answers

Option B (SageMaker Debugger) is wrong because it is a monitoring and debugging tool for training jobs, not for scaling production endpoints. Option C (SageMaker Model Registry) is wrong because it is a catalog for versioning and managing trained models, not a scaling mechanism. Option D (SageMaker Pipelines) is wrong because it is a workflow orchestration service for building and automating ML pipelines, not for configuring endpoint scaling.

421

MCQhard

A data scientist creates a feature group as shown in the exhibit. When ingesting data with an 'age' column of integer values, the ingestion fails. What is the most likely cause?

A.The role does not have permissions to write to the feature store.

B.The `age` feature type should be `Integral`, not `String`.

C.The `OnlineStoreConfig` must include a `SecurityConfig`.

D.The `EventTimeFeatureName` is incorrectly spelled.

AnswerB

The feature type must match the ingested data type.

Why this answer

Option B is correct because the feature group definition specifies the 'age' column as a `String` type, but the ingested data contains integer values. Amazon SageMaker Feature Store requires that the data types of ingested records match the schema defined in the feature group. When a mismatch occurs, such as providing an integer for a string field, the ingestion fails with a type conversion error.

Exam trap

AWS often tests the distinction between schema definition and actual data types, trapping candidates who overlook that the feature group schema must exactly match the ingested data's types, not just the column names.

How to eliminate wrong answers

Option A is wrong because the question states the ingestion fails specifically due to a data type mismatch, not a permissions issue; a permissions error would typically occur at the API call level, not during data parsing. Option C is wrong because `SecurityConfig` is an optional field of `OnlineStoreConfig` (it specifies a KMS key for encrypting the online store); omitting it does not cause ingestion to fail. Option D is wrong because the `EventTimeFeatureName` is spelled correctly as 'EventTime' in the exhibit, and a misspelling would cause a different error (e.g., 'InvalidParameterValue') rather than a data type mismatch.
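
The type-matching rule can be sketched as a local pre-check run before ingestion. The exhibit is not reproduced here, so the feature list below is an assumption (only `age` and `EventTime` are named in the question), and `mismatched_features` is a hypothetical helper, not part of the Feature Store API.

```python
# Feature Store feature types mapped to the Python types this sketch accepts.
FEATURE_TYPES = {"Integral": int, "Fractional": float, "String": str}

# Assumed corrected schema: `age` declared as Integral instead of String.
feature_definitions = [
    {"FeatureName": "customer_id", "FeatureType": "String"},  # hypothetical feature
    {"FeatureName": "age", "FeatureType": "Integral"},
    {"FeatureName": "EventTime", "FeatureType": "String"},
]


def mismatched_features(record, definitions):
    """Return names of features whose value type differs from the declared type."""
    bad = []
    for d in definitions:
        expected = FEATURE_TYPES[d["FeatureType"]]
        value = record[d["FeatureName"]]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            bad.append(d["FeatureName"])
    return bad


record = {"customer_id": "c-1", "age": 42, "EventTime": "2024-01-01T00:00:00Z"}
print(mismatched_features(record, feature_definitions))
```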

422

Multi-Selectmedium

A dataset for binary classification has a severe class imbalance (5% positive class). Which two data preparation techniques can help address this imbalance? (Choose two.)

Select 2 answers

A.Remove outliers from the minority class

B.Apply PCA to reduce dimensionality

C.Use stratified splitting for train/test sets

D.Undersample the majority class

E.Oversample the minority class using SMOTE

AnswersD, E

Reduces majority class size to balance with minority class.

Why this answer

Option D is correct because undersampling the majority class reduces the number of instances from the dominant class, helping to balance the dataset and prevent the model from being biased toward the majority class. This technique is straightforward and can be effective when the majority class has redundant or noisy samples, though it risks losing valuable information.

Exam trap

AWS often tests the distinction between techniques that change the dataset distribution (like undersampling and oversampling) versus those that only affect model training or evaluation (like stratified splitting), leading candidates to mistakenly select stratified splitting as a balancing technique.
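
Random undersampling of the majority class can be sketched in a few lines of standard-library Python. The column name `churned` and the 95/5 toy dataset are assumptions made for the example.

```python
import random


def undersample_majority(rows, label_key, seed=0):
    """Randomly downsample every class to the size of the smallest class."""
    rng = random.Random(seed)
    by_label = {}
    for row in rows:
        by_label.setdefault(row[label_key], []).append(row)
    minority_size = min(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(rng.sample(group, minority_size))
    rng.shuffle(balanced)
    return balanced


# Toy dataset: 5 positive rows and 95 negative rows.
data = [{"id": i, "churned": 1 if i < 5 else 0} for i in range(100)]
balanced = undersample_majority(data, "churned")
print(len(balanced))
```

The balanced set keeps all 5 positives and only 5 of the 95 negatives, which shows the information loss that undersampling trades for balance.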

423

MCQhard

An ML team uses SageMaker Pipelines to automate retraining. After a pipeline failure, they need to reprocess only the failed step without rerunning the entire pipeline. What should they do?

A.Create a new pipeline version for each run.

B.Use SageMaker Model Monitor to detect drift and trigger retraining.

C.Use SageMaker Pipelines Cache with step-level caching.

D.Manually rerun the pipeline with updated parameters.

AnswerC

Caching enables the pipeline to skip completed steps and resume from the failed step.

Why this answer

SageMaker Pipelines Cache with step-level caching allows you to reuse outputs from previous successful runs of unchanged steps. When a pipeline fails, only the failed step and any downstream steps that depend on it need to be re-executed, because cached results from prior successful steps are automatically retrieved. This avoids rerunning the entire pipeline, saving time and compute resources.

Exam trap

The trap here is that candidates confuse SageMaker Pipelines Cache with Model Monitor's drift detection, assuming that monitoring automatically handles retraining failures, when in fact caching is the correct mechanism for step-level reuse.

How to eliminate wrong answers

Option A is wrong because creating a new pipeline version for each run does not address step-level reuse; it creates an entirely new pipeline execution history, forcing a full rerun. Option B is wrong because SageMaker Model Monitor is designed for detecting data drift and model quality degradation, not for caching or resuming failed pipeline steps. Option D is wrong because manually rerunning the pipeline with updated parameters still executes all steps from scratch, ignoring any previously successful step outputs.

424

Multi-Selecthard

A company needs to secure a SageMaker notebook instance that contains sensitive data. Which THREE of the following are effective security measures? (Select THREE.)

Select 3 answers

A.Use IAM policies to restrict who can access the notebook instance.

B.Disable direct internet access and use a VPC with a NAT gateway for outbound.

C.Attach a lifecycle configuration that runs a script to download data from a public S3 bucket.

D.Enable AWS CloudTrail to log all notebook API calls.

E.Encrypt the notebook instance's EBS volume using AWS KMS.

AnswersA, D, E

IAM policies can limit which users can create presigned URLs for the notebook.

Why this answer

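Encrypting the EBS volume with KMS protects data at rest, IAM policies control access, and CloudTrail provides auditing. Disabling direct internet access (Option B) is also good practice, but it is not among the three keyed answers. Option C is wrong because downloading data from a public S3 bucket does nothing to secure the instance.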

425

MCQmedium

A company collects sensor data from IoT devices. The data arrives with missing timestamps due to network issues. For anomaly detection, the engineer needs to create features that capture rolling statistics over fixed windows. Which data preprocessing step is essential before feature generation?

A.Remove missing timestamps

B.Resample data to a fixed frequency

C.Sort data by device ID

D.Impute missing values with forward fill

AnswerB

Resampling ensures consistent time intervals, which is required for rolling windows.

Why this answer

Resampling the data to a fixed frequency is essential because rolling window statistics require a consistent time index to compute accurate aggregations over fixed windows. Without a uniform timestamp grid, the window boundaries become ambiguous and the resulting features will be misaligned or incomplete, undermining the anomaly detection model.

Exam trap

AWS often tests the distinction between handling missing values (imputation) and handling irregular timestamps (resampling), leading candidates to confuse forward-fill as a solution for time alignment when it only addresses missing data points, not the underlying time index irregularity.

How to eliminate wrong answers

Option A is wrong because simply removing missing timestamps discards valuable data and does not address the need for a consistent time index; the remaining timestamps remain irregularly spaced. Option C is wrong because sorting by device ID organizes data by device but does not fix the irregular timestamp spacing required for fixed-window rolling statistics. Option D is wrong because forward-fill imputation fills missing values but does not create a uniform time grid; the timestamps themselves remain irregular, so rolling windows cannot be applied consistently.
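
Resampling onto a fixed grid can be sketched with the standard library alone. The 5-minute step and the sample readings are assumptions. Each reading is averaged into the bin that contains its timestamp, and bins with no readings are left as `None` so a later imputation step can handle them.

```python
from datetime import datetime, timedelta


def resample_mean(readings, start, end, step):
    """Average readings into fixed-width bins [t, t + step); empty bins get None."""
    n_bins = (end - start) // step
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for ts, value in readings:
        idx = (ts - start) // step  # floor division of timedeltas gives an int
        if 0 <= idx < n_bins:
            sums[idx] += value
            counts[idx] += 1
    return [
        (start + i * step, sums[i] / counts[i] if counts[i] else None)
        for i in range(n_bins)
    ]


# Irregularly spaced sample readings (hypothetical).
day = datetime(2024, 1, 1)
readings = [
    (day + timedelta(minutes=1), 10.0),
    (day + timedelta(minutes=3), 12.0),
    (day + timedelta(minutes=7), 20.0),
    (day + timedelta(minutes=16), 30.0),
]
resampled = resample_mean(readings, day, day + timedelta(minutes=20), timedelta(minutes=5))
for ts, value in resampled:
    print(ts.time(), value)
```

Once every bin has the same width, a rolling window of N bins always covers the same span of time, which is what fixed-window statistics rely on.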

426

MCQeasy

A startup is building a serverless inference API using AWS Lambda. They have a TensorFlow model that is 400 MB in size. They packaged the model and inference code into a Lambda function using a container image. When they test the function with a small input, it consistently times out after 3 seconds. The Lambda function has 512 MB of memory and a timeout of 30 seconds. The business requirement is that inference must complete in less than 5 seconds under normal conditions. What is the most likely cause of the slow performance, and which change should they make?

A.The function timeout is too low; increase the timeout to 60 seconds.

B.The function is experiencing a cold start; use provisioned concurrency to keep the container warm.

C.The Lambda function memory is insufficient for the model size; increase memory to 1024 MB or higher.

D.Use a Lambda function with a GPU container to accelerate inference.

AnswerC

Lambda allocates CPU proportionally to memory. More memory speeds up computation and relieves memory pressure.

Why this answer

The most likely cause is that the Lambda function's memory (512 MB) is insufficient to load the 400 MB TensorFlow model alongside the runtime and framework, causing memory pressure or out-of-memory errors that drastically slow or abort inference. Increasing memory to 1024 MB or higher provides more CPU and memory resources, allowing the model to fit and inference to complete within the required 5 seconds.

Exam trap

The trap here is that candidates confuse cold start latency with runtime performance issues, assuming provisioned concurrency (Option B) fixes all slow Lambda functions, when in fact memory/CPU insufficiency is the root cause for large model inference.

How to eliminate wrong answers

Option A is wrong because the function already has a 30-second timeout, and the issue is not timeout-related—the function consistently times out after 3 seconds due to resource constraints, not because the timeout is too low. Option B is wrong because provisioned concurrency addresses cold starts (initialization latency), but the problem here is runtime performance after the function is already warm; the 3-second timeout occurs consistently, not just on first invocation. Option D is wrong because Lambda does not support GPU containers; GPU acceleration is not available in AWS Lambda, and the inference time is dominated by memory/CPU bottlenecks, not lack of GPU.

427

MCQhard

Refer to the exhibit. A company configures a SageMaker Model Monitor Data Quality monitoring schedule as shown. The schedule runs every hour. However, the team notices that the monitoring job fails intermittently with an AccessDenied error when accessing the S3 bucket for output. The IAM role SageMakerMonitorRole has permissions to write to s3://my-bucket/monitor-output. What is the MOST likely cause of the failure?

A.The S3UploadMode is set to Continuous, which is only supported for batch transform jobs.

B.The monitoring job runs in a VPC that does not have an S3 VPC endpoint, and the bucket policy denies requests from outside the VPC.

C.The cron expression is invalid; it should use rate(1 hour) instead.

D.The baseline constraints and statistics files are missing from the S3 bucket.

AnswerB

VPC restrictions can cause AccessDenied even if the IAM role allows.

Why this answer

The intermittent AccessDenied error when SageMaker Model Monitor attempts to write to the S3 output bucket strongly indicates a network or policy restriction. If the monitoring job runs inside a VPC (common for security compliance) that lacks an S3 VPC endpoint, its requests to S3 do not arrive through an endpoint. If the S3 bucket policy explicitly denies requests that do not come through the expected VPC endpoint or VPC (using a condition like `aws:SourceVpce` or `aws:SourceVpc`), those requests are denied even though the IAM role allows the write.

Exam trap

AWS often tests the interaction between VPC networking and S3 bucket policies, where candidates overlook that a VPC without an S3 endpoint will cause AccessDenied errors even if the IAM role has full S3 permissions, because the bucket policy itself blocks non-VPC-endpoint traffic.

How to eliminate wrong answers

Option A is wrong because `S3UploadMode` only controls when results are uploaded (continuously or at the end of the job); it has no bearing on authorization, and the error is about access permissions, not upload mode. Option C is wrong because the cron expression `cron(0 * * * ? *)` is valid for hourly execution and is the correct format for SageMaker schedules; `rate(1 hour)` is used for EventBridge rules, not for SageMaker monitoring schedule expressions. Option D is wrong because missing baseline files would cause a different error (e.g., `NoSuchKey` or validation failure), not an intermittent AccessDenied error, and the error specifically points to S3 write access.
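
For illustration, a bucket policy of the kind described might look like the sketch below. The bucket name comes from the question; the VPC endpoint ID is a placeholder, and `vpce_only_bucket_policy` is a hypothetical helper that only builds the JSON document.

```python
import json


def vpce_only_bucket_policy(bucket, vpce_id):
    """Build a bucket policy that denies any request not sent via the given VPC endpoint."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyUnlessFromVpcEndpoint",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                "Resource": [
                    f"arn:aws:s3:::{bucket}",
                    f"arn:aws:s3:::{bucket}/*",
                ],
                # Requests that do not carry this endpoint ID match the Deny.
                "Condition": {"StringNotEquals": {"aws:SourceVpce": vpce_id}},
            }
        ],
    }


# Placeholder endpoint ID; the question does not give one.
bucket_policy = vpce_only_bucket_policy("my-bucket", "vpce-0123456789abcdef0")
print(json.dumps(bucket_policy, indent=2))
```

An explicit Deny in a bucket policy overrides any Allow in the role's identity policy, which is why the role's write permission does not help here.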

428

Multi-Selecteasy

A data engineer is using AWS Glue to prepare a dataset for machine learning. The dataset has several columns with outliers. The engineer wants to detect and handle outliers in a scalable manner. Which TWO approaches should the engineer consider? (Select TWO.)

Select 2 answers

A.Manually remove outliers by inspecting the data in Amazon S3.

B.Train a neural network to identify anomalies and remove them.

C.Use pandas in a SageMaker notebook to calculate z-scores and filter outliers.

D.Use AWS Glue DynamicFrame with Apache Spark to compute interquartile range (IQR) and filter outliers.

E.Use Amazon SageMaker Data Wrangler to apply an outlier detection transform.

AnswersD, E

Spark can handle large-scale data and IQR is a standard method.

Why this answer

Option D is correct because AWS Glue DynamicFrames, built on Apache Spark, provide a scalable, distributed computing environment to compute statistical measures like the interquartile range (IQR) across large datasets. This allows the engineer to programmatically filter outliers without manual intervention, leveraging Spark's parallel processing for efficient handling of data at scale.

Exam trap

The trap here is that candidates may assume that only a single AWS service can handle outlier detection at scale, but the question requires selecting two approaches, and both Glue DynamicFrames and SageMaker Data Wrangler are valid, scalable, and managed AWS solutions for this task.
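
The IQR rule itself is simple enough to sketch in plain Python (this is not Glue or Spark code, and the sample values are made up). In a Spark job the quartiles would more likely come from `DataFrame.approxQuantile`, with the same bounds applied as a filter.

```python
import statistics


def iqr_bounds(values, k=1.5):
    """Return (low, high) bounds at k * IQR beyond the first and third quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def filter_outliers(values, k=1.5):
    """Keep only values that fall inside the IQR bounds."""
    low, high = iqr_bounds(values, k)
    return [v for v in values if low <= v <= high]


# Hypothetical column with one obvious outlier.
column = [10, 12, 11, 13, 12, 11, 10, 300]
cleaned = filter_outliers(column)
print(cleaned)
```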

429

MCQmedium

An e-commerce company uses a machine learning model to predict customer churn. They notice that the model's performance degrades after a major marketing campaign changes customer behavior. Which approach is MOST effective to detect and respond to this type of concept drift?

A.Deploy an A/B test to compare the current model with a baseline.

B.Use SageMaker Model Monitor to track prediction distribution and trigger retraining.

C.Manually review model accuracy each month.

D.Set up a weekly batch transform job to compute accuracy against historical data.

E.Increase the number of instances for the endpoint.

AnswerB

Correct. Model Monitor continuously checks for drift and can initiate automated retraining.

Why this answer

SageMaker Model Monitor can automatically detect drift in prediction distributions and trigger retraining pipelines.

430

MCQhard

An e-commerce company uses a multi-model endpoint on Amazon SageMaker to serve several deep learning models. After a new model version is deployed, the endpoint starts returning 503 errors for some models. Monitoring shows that the endpoint's memory utilization is near 100%. What should the team do to resolve this issue while minimizing operational overhead?

A.Increase the number of instances for the endpoint and configure an auto-scaling policy based on memory utilization.

B.Deploy each model on its own separate endpoint to isolate memory usage.

C.Use Amazon SageMaker Model Monitor to detect memory leaks and send alerts.

D.Use SageMaker's built-in model scaling feature to allocate more memory to the affected model.

AnswerA

Adds capacity and auto-scales.

Why this answer

Option A is correct because increasing the endpoint's instance count spreads the memory load, and a target tracking auto scaling policy then adjusts the instance count based on memory utilization. Option D is wrong because SageMaker does not support per-model scaling; scaling is per endpoint. Option B is wrong because moving to single-model endpoints would increase operational overhead and cost.

Option C is wrong because Model Monitor doesn't help with scaling.

431

MCQeasy

A data scientist is preparing a dataset for a binary classification model to predict customer churn. The dataset contains a timestamp column 'signup_date' that is not relevant for the prediction. What is the most appropriate action to handle this column?

A.Apply one-hot encoding to the year, month, and day components.

B.Convert the timestamp to a numeric feature (e.g., days since signup) and include it.

C.Use leave-one-out encoding based on the target variable.

D.Drop the 'signup_date' column from the dataset.

AnswerD

Irrelevant columns should be removed to prevent noise.

Why this answer

Option D is correct because the 'signup_date' column is explicitly stated as not relevant for the prediction. In binary classification for customer churn, including an irrelevant timestamp can introduce noise, increase dimensionality, and potentially cause overfitting. Dropping the column is the most appropriate action to maintain model simplicity and focus on predictive features.

Exam trap

AWS often tests the misconception that all timestamp data must be transformed into numeric features, but the key is to first assess relevance—if the column is explicitly not relevant, dropping it is the correct action, not engineering features from it.

How to eliminate wrong answers

Option A is wrong because one-hot encoding the year, month, and day components would create multiple sparse features from an irrelevant column, adding unnecessary complexity and potentially misleading the model with temporal patterns that have no causal relationship with churn. Option B is wrong because converting the timestamp to a numeric feature like 'days since signup' would still retain irrelevant temporal information, which could introduce a spurious correlation or bias, especially if the dataset has a time-based split that leaks future information. Option C is wrong because leave-one-out encoding based on the target variable would leak target information into the feature, causing data leakage and overfitting, as the encoding uses the target value of other rows to encode the current row, which is inappropriate for an irrelevant column.

432

MCQeasy

A data scientist is preparing a dataset for binary classification using SageMaker. The dataset has 100 features and 10,000 rows, but the target variable is highly imbalanced (95% negative, 5% positive). Which technique should the data scientist apply during data preparation to address the imbalance?

A.Oversampling the minority class by duplicating examples

B.Collect more data to match the number of samples in both classes

C.Random undersampling of the majority class

D.Apply SMOTE to generate synthetic samples for the minority class

AnswerD

SMOTE creates synthetic examples along the line segments of minority class nearest neighbors, addressing imbalance.

Why this answer

SMOTE (Synthetic Minority Oversampling Technique) is the most appropriate technique because it generates synthetic samples for the minority class by interpolating between existing minority instances, which avoids the overfitting risk of simple duplication (oversampling) and the information loss from undersampling. In SageMaker, SMOTE can be applied during data preparation using libraries like imbalanced-learn before training, or via SageMaker Data Wrangler's built-in transform, making it a robust choice for handling class imbalance without discarding data.

Exam trap

AWS often tests the distinction between oversampling by duplication and synthetic oversampling (SMOTE), where candidates mistakenly choose simple duplication (Option A) because they think 'more data is always better,' failing to recognize that SMOTE generates diverse synthetic samples to reduce overfitting.

How to eliminate wrong answers

Option A is wrong because oversampling the minority class by duplicating examples leads to overfitting, as the model sees the same exact data points repeatedly, which does not introduce new variance and can cause poor generalization. Option B is wrong because collecting more data to match the number of samples in both classes is often impractical, costly, or impossible in real-world scenarios, and the question specifically asks for a technique to apply during data preparation, not a data collection strategy. Option C is wrong because random undersampling of the majority class discards potentially valuable data, which can lead to loss of important patterns and reduce model performance, especially when the dataset is already limited to 10,000 rows.
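
The interpolation step at the heart of SMOTE can be sketched as follows. This is a simplified illustration rather than the imbalanced-learn implementation; the function name `smote_like`, the neighbour count `k=2`, and the sample points are assumptions.

```python
import random


def smote_like(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points between minority points and their nearest neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority points by squared Euclidean distance to base.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(p, base)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbour)))
    return synthetic


# Hypothetical minority-class feature vectors.
minority_points = [(1.0, 1.0), (2.0, 1.5), (1.5, 3.0), (3.0, 2.0)]
new_points = smote_like(minority_points, n_new=6)
print(new_points)
```

Every synthetic point lies on a line segment between two real minority points, so the new samples are varied rather than exact copies.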

433

Multi-Selecteasy

A data science team needs to track and compare multiple ML training runs, including hyperparameters, metrics, and output artifacts. Which TWO AWS services can be used together to meet this requirement? (Choose two.)

Select 2 answers

A.Amazon SageMaker Experiments

B.Amazon S3

C.Amazon SageMaker Studio notebooks

D.Amazon SageMaker Model Registry

E.Amazon CloudWatch Logs

AnswersA, D

SageMaker Experiments captures and compares training runs, metrics, and parameters.

Why this answer

Amazon SageMaker Experiments provides experiment tracking and comparison. Amazon SageMaker Model Registry helps manage model versions and artifacts. SageMaker Studio notebooks alone lack tracking; Amazon S3 provides storage but no tracking; Amazon CloudWatch Logs is for log monitoring, not run comparison.

434

MCQhard

A model deployed on SageMaker uses custom inference code. The endpoint is showing intermittent 500 errors. CloudWatch logs reveal 'TimeoutError: Request timed out after 60 seconds'. The model takes on average 55 seconds to process. What is the most effective solution?

A.Increase the invocation timeout in the SageMaker API call.

B.Increase the SageMaker endpoint's model container timeout setting.

C.Optimize the inference code to reduce latency.

D.Increase the endpoint's instance count.

AnswerC

Reducing inference latency below the timeout threshold is the most direct and effective solution, as it addresses the root cause.

Why this answer

Option C is correct because the average processing time (55 seconds) sits just under the 60-second timeout, so normal variation pushes some requests over the limit; optimizing the code addresses that root cause. Option D might help with load but not per-request latency. Option A does not help: the invocation timeout is set by the client, but the server-side timeout of 60 seconds (the default) still applies.

Option B is weaker: the container timeout can be increased from its 60-second default, but doing so might mask performance issues.

435

MCQhard

A team uses SageMaker Neo to compile a model for deployment on a target device. After compilation, they deploy the compiled model to a SageMaker endpoint using the Neo-optimized container. The endpoint fails to start with error "RuntimeError: Unable to load model". What could be the issue?

A.The compiled model was not uploaded to the correct S3 path.

B.The Neo compilation job failed silently.

C.The endpoint instance type does not support Neo.

D.The target device architecture during compilation does not match the endpoint instance architecture.

AnswerD

Neo models are compiled for specific architectures; mismatch causes load failure.

Why this answer

Option D is correct because SageMaker Neo compiles a model for a specific target architecture (e.g., ARM, x86, GPU). When deploying the compiled model to a SageMaker endpoint, the endpoint instance type must have a CPU or accelerator architecture that matches the target device specified during compilation. If they do not match, the Neo-optimized runtime cannot load the compiled binary, resulting in a 'RuntimeError: Unable to load model'.

Exam trap

AWS often tests the misconception that Neo compilation is a generic optimization that works on any endpoint instance, when in fact the target architecture must exactly match the deployment instance's hardware.

How to eliminate wrong answers

Option A is wrong because if the compiled model were not uploaded to the correct S3 path, the endpoint would fail with an 'Unable to find model artifact' or S3 access error, not a runtime model loading error. Option B is wrong because if the Neo compilation job failed silently, no compiled model artifact would be produced, and the deployment would fail earlier with a missing artifact error, not a runtime load error. Option C is wrong because all SageMaker endpoint instance types support Neo-optimized containers; Neo does not restrict which instance types can host compiled models—the restriction is on the architecture match between the compilation target and the endpoint instance.

436

MCQeasy

A company uses SageMaker to train a model. The training job is failing with an error "ResourceLimitExceeded". What is the most likely cause?

A.The account has reached the limit for number of training instances

B.The model artifact is too large to upload

C.Invalid hyperparameters

D.The training data size exceeds the available instance storage

AnswerA

ResourceLimitExceeded occurs when you exceed a service quota, such as instance count.

Why this answer

The error indicates that a service limit has been reached, commonly the number of concurrent training instances. Other options would produce different error messages.


437

MCQeasy

A company wants to automate the deployment of a SageMaker model into production whenever a new model version is approved in the Model Registry. Which service can be used to trigger the deployment pipeline?

A.AWS Lambda

B.Amazon CloudWatch Events (EventBridge)

C.Amazon S3 Events

D.Amazon SNS

E.AWS Config

AnswerB

Correct. EventBridge can capture Model Registry events and trigger downstream actions like CodePipeline.

Why this answer

Amazon EventBridge can respond to Model Registry events (e.g., approval status change) and start an automated pipeline.
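
As a rough sketch, the rule that reacts to an approval could use an event pattern like the one below. The field names reflect the Model Registry state-change event as commonly documented, but treat them as assumptions and verify them against the events your account emits. The `matches` helper is hypothetical and implements only a simplified subset of EventBridge matching, for local illustration.

```python
# Event pattern for an EventBridge rule (field names are assumptions to verify).
APPROVAL_PATTERN = {
    "source": ["aws.sagemaker"],
    "detail-type": ["SageMaker Model Package State Change"],
    "detail": {"ModelApprovalStatus": ["Approved"]},
}


def matches(pattern: dict, event: dict) -> bool:
    """Simplified matcher: nested dicts recurse, leaf lists mean 'any of'."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        if isinstance(expected, dict):
            if not isinstance(event[key], dict) or not matches(expected, event[key]):
                return False
        elif event[key] not in expected:
            return False
    return True


# Hypothetical event, trimmed to the fields the pattern inspects.
sample_event = {
    "source": "aws.sagemaker",
    "detail-type": "SageMaker Model Package State Change",
    "detail": {"ModelApprovalStatus": "Approved"},
}
```

The rule's target would then be the deployment pipeline, for example a CodePipeline execution.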


438

Multi-Selectmedium

A company is preparing data for a time-series forecasting model. The data is collected from IoT sensors at irregular intervals. Which TWO steps are necessary to prepare the data? (Choose 2.)

Select 2 answers

A.Normalize the data to a 0-1 range

B.Resample the data to a fixed frequency

C.Fill missing values using forward fill or interpolation

D.Remove outlier data points

E.Encode categorical features

AnswersB, C

Resampling creates regular time intervals required by most forecasting models.

Why this answer

Time-series forecasting models require data at consistent time intervals to capture temporal patterns and seasonality. Resampling the irregular IoT sensor data to a fixed frequency (e.g., every 5 minutes) creates a uniform time index, which is essential for algorithms like ARIMA, Prophet, or LSTM. This step ensures the model can learn from a structured sequence rather than being confused by variable time gaps.

Exam trap

AWS often tests the misconception that data normalization or outlier removal is a universal first step, but for time-series with irregular intervals, the critical preparatory steps are resampling and handling missing values to create a regular time grid.
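
The two steps can be sketched with pandas. The readings, timestamps, and the 5-minute frequency below are made-up values for illustration; the right frequency and fill strategy depend on the sensor and the model.

```python
import pandas as pd

# Hypothetical sensor readings arriving at irregular intervals.
readings = pd.Series(
    [20.0, 21.0, 23.0],
    index=pd.to_datetime(
        ["2024-01-01 00:00:40", "2024-01-01 00:07:10", "2024-01-01 00:21:30"]
    ),
)

# Step 1: resample onto a fixed 5-minute grid (mean of the readings in each bin).
# Bins with no reading become NaN.
regular = readings.resample("5min").mean()

# Step 2: fill the empty bins, either by carrying the last value forward...
filled_ffill = regular.ffill()

# ...or by interpolating linearly in time between neighbouring readings.
filled_interp = regular.interpolate(method="time")
```

Forward fill suits values that hold until the next reading (for example a setpoint), while interpolation suits quantities that change smoothly between readings.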


439

MCQmedium

An ML team is deploying a model using SageMaker. The model requires GPU inference and must be available in multiple AWS regions for low latency. The team has created a multi-model endpoint with GPU instances. After deployment, they notice high latency spikes when a new model is loaded. What is the most likely cause?

A.The team is using a multi-model endpoint, which loads models on demand; loading a model into GPU memory causes latency spikes.

B.The endpoint is configured with a single production variant, causing all traffic to overload one instance.

C.The endpoint is using the wrong instance type that lacks sufficient GPU memory.

D.The model is too large for the specified container memory, causing swap to disk.

AnswerA

Multi-model endpoints load and unload models from memory, causing latency spikes when a new model is accessed.

Why this answer

A multi-model endpoint (MME) loads models on demand from Amazon S3 into the instance's memory. When a new model is requested and not already cached, SageMaker must download the model artifacts and load them into GPU memory, which is a time-consuming operation that causes a latency spike for the first inference request. This cold-start behavior is inherent to MMEs and explains the observed spikes.

Exam trap

The trap here is that candidates may confuse multi-model endpoint cold-start latency with general endpoint misconfiguration (like instance type or variant count), but the key clue is the timing of the spikes—only when a new model is loaded—which directly points to the on-demand loading behavior of MMEs.

How to eliminate wrong answers

Option B is wrong because a single production variant does not inherently cause latency spikes when loading new models; it would cause consistent high latency under load, not spikes tied to model loading. Option C is wrong because the question states the team is using GPU instances, and insufficient GPU memory would cause out-of-memory errors or failures, not latency spikes. Option D is wrong because swap to disk would cause severe performance degradation for all inferences, not just when a new model is loaded, and SageMaker containers typically do not use swap for GPU memory.
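
The cold-start behavior can be illustrated with a toy cache. This is a simulation only: the `ModelCache` class is hypothetical, and the least-recently-used eviction policy is an assumption made for the example, not a statement about how SageMaker chooses which model to unload.

```python
from collections import OrderedDict


class ModelCache:
    """Toy stand-in for the set of models held in memory on one instance."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self._loaded = OrderedDict()
        self.cold_loads = 0

    def invoke(self, model_name: str) -> str:
        if model_name in self._loaded:
            # Warm path: model already in memory, no loading cost.
            self._loaded.move_to_end(model_name)
            return "warm"
        # Cold path: a real endpoint would download the artifact from S3
        # and load it into (GPU) memory here, which is the latency spike.
        self.cold_loads += 1
        if len(self._loaded) >= self.capacity:
            self._loaded.popitem(last=False)  # evict least recently used
        self._loaded[model_name] = True
        return "cold"
```

In this model, only the first request for a model (or the first after eviction) pays the loading cost, which matches the pattern of spikes appearing only when a new model is loaded.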


440

Multi-Selecthard

A financial services company must ensure that all data used by Amazon SageMaker training jobs is encrypted at rest. The company wants to use a customer-managed key (CMK) for the encryption. Which steps are necessary to achieve this? (Choose TWO.)

Select 2 answers

A.Enable SageMaker's default encryption for the training job by setting the EnableDefaultEncryption flag.

B.Create a CMK in AWS KMS and add the SageMaker service principal to the key policy to allow it to use the key.

C.Enable S3 default encryption using the CMK on all buckets containing training data.

D.Specify the CMK's ARN in the VolumeKmsKeyId parameter when creating the training job.

E.Use CloudWatch Logs encryption to protect the training logs.

AnswersB, D

SageMaker needs permission to use the CMK.

Why this answer

Options B and D are correct. B: SageMaker must be granted permission to use the CMK through the key policy. D: Specifying the CMK's ARN in the VolumeKmsKeyId parameter encrypts the storage volume attached to the training instances with that key.

Option A is wrong because SageMaker already encrypts data at rest by default with AWS-managed keys; using a CMK requires specifying it explicitly. Option C is wrong because S3 default encryption protects objects in S3 and does not cover the storage volumes that SageMaker attaches to training instances. Option E is wrong because CloudWatch Logs encryption protects training logs, not the training data.
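
A minimal sketch of where the key appears in a `CreateTrainingJob` request is shown below. The helper name, instance type, volume size, and runtime limit are placeholder assumptions. The `KmsKeyId` under `OutputDataConfig` goes beyond what the question asks and is included only to show that output artifacts take a separate key setting.

```python
def training_job_request(
    job_name: str,
    role_arn: str,
    image_uri: str,
    kms_key_arn: str,
    input_s3_uri: str,
    output_s3_uri: str,
) -> dict:
    """Build a CreateTrainingJob-style request that uses a customer-managed key."""
    return {
        "TrainingJobName": job_name,
        "RoleArn": role_arn,
        "AlgorithmSpecification": {
            "TrainingImage": image_uri,
            "TrainingInputMode": "File",
        },
        "InputDataConfig": [
            {
                "ChannelName": "train",
                "DataSource": {
                    "S3DataSource": {
                        "S3DataType": "S3Prefix",
                        "S3Uri": input_s3_uri,
                        "S3DataDistributionType": "FullyReplicated",
                    }
                },
            }
        ],
        # Separate setting: encrypts the model artifacts written to S3.
        "OutputDataConfig": {"S3OutputPath": output_s3_uri, "KmsKeyId": kms_key_arn},
        "ResourceConfig": {
            "InstanceType": "ml.m5.xlarge",  # placeholder
            "InstanceCount": 1,
            "VolumeSizeInGB": 50,  # placeholder
            # Encrypts the storage volume attached to the training instances.
            "VolumeKmsKeyId": kms_key_arn,
        },
        "StoppingCondition": {"MaxRuntimeInSeconds": 3600},  # placeholder
    }
```

The resulting dictionary could be passed to `boto3.client("sagemaker").create_training_job(**request)`, provided the key policy already allows the key to be used.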


441

MCQeasy

A team wants to automate the retraining and deployment of an ML model whenever new labeled data arrives in S3. The workflow includes data preprocessing, training, evaluation, and conditional deployment. Which AWS service is best suited for orchestrating this end-to-end pipeline?

A.AWS Step Functions with Lambda functions for each step.

B.AWS Glue workflows with triggers based on S3 events.

C.AWS CodePipeline with source from S3 and build from CodeBuild.

D.Amazon SageMaker Pipelines triggered by S3 events via EventBridge.

AnswerD

SageMaker Pipelines is designed for ML workflows and supports S3 event triggers.

Why this answer

Amazon SageMaker Pipelines is purpose-built for ML workflows, offering native integration with SageMaker for training, evaluation, and conditional deployment steps. Triggered by S3 events via Amazon EventBridge, it automates the end-to-end pipeline from data preprocessing to conditional model deployment without requiring custom orchestration code.

Exam trap

The trap here is that candidates often choose AWS Step Functions (Option A) because it is a general-purpose orchestrator, but they overlook that SageMaker Pipelines provides tighter integration with ML-specific steps and reduces custom code overhead.

How to eliminate wrong answers

Option A is wrong because AWS Step Functions with Lambda functions would require you to manually implement each ML step (e.g., training, evaluation) and manage SageMaker API calls, lacking native ML-specific features like built-in model evaluation and conditional deployment logic. Option B is wrong because AWS Glue workflows are designed for ETL and data preparation, not for orchestrating ML training, evaluation, and deployment steps; they lack native support for SageMaker training jobs or model endpoints. Option C is wrong because AWS CodePipeline is a CI/CD service for application code, not optimized for ML workflows; it does not natively handle model evaluation, conditional deployment, or SageMaker-specific resources like training jobs and endpoints.


442

MCQhard

A large enterprise has multiple SageMaker endpoints serving models for different business units. Each endpoint uses a separate instance type and scaling policy. The enterprise wants to implement a unified monitoring and logging solution to track endpoint health, latency, and errors across all endpoints. They also want to set up alerts when the error rate exceeds 5% over a 5-minute period. The solution must be centralized and use AWS-native services. Which solution should the team implement?

A.Enable SageMaker Model Monitor data capture on each endpoint and stream captured data to Amazon Kinesis for analysis.

B.Use AWS CloudTrail to audit all API calls to SageMaker and set up alarms on error responses.

C.Use Amazon CloudWatch Logs to collect logs from each endpoint, and use a Lambda function to parse logs and calculate error rates, then publish custom metrics.

D.Use Amazon CloudWatch dashboards to aggregate metrics from all endpoints, and create a composite alarm based on the Sum of 5xx error counts across endpoints.

AnswerD

CloudWatch natively aggregates metrics and composite alarms can alert on the combined error rate.

Why this answer

Option D is correct because Amazon CloudWatch can natively ingest SageMaker endpoint metrics (e.g., 5xx error counts, latency, invocation counts) without additional configuration. By creating a CloudWatch dashboard, you aggregate metrics from all endpoints into a single view, and a composite alarm using the Sum statistic across endpoints over a 5-minute period directly triggers when the error rate exceeds 5%. This approach is fully centralized, uses only AWS-native services, and requires no custom code or data streaming.

Exam trap

The trap here is that candidates confuse SageMaker Model Monitor (data quality) with endpoint monitoring (operational health), or assume CloudWatch Logs are required when SageMaker endpoints already emit rich metrics directly to CloudWatch.

How to eliminate wrong answers

Option A is wrong because SageMaker Model Monitor is designed to detect data drift and data quality issues, not to track endpoint health, latency, or error rates; its data capture writes inference data to Amazon S3, not to Kinesis, and it does not provide real-time error-rate alerts. Option B is wrong because AWS CloudTrail records API calls for auditing and does not capture per-invocation latency or 5xx error metrics. Option C is wrong because SageMaker endpoints already publish invocation, latency, and error metrics directly to CloudWatch, so parsing logs with a Lambda function to recompute error rates adds custom code that the built-in metrics make unnecessary.
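
The 5% threshold is a ratio of two endpoint metrics, which can be sketched as plain arithmetic. The function names and the sample counts are hypothetical; the inputs stand for the 5-minute Sum of the `Invocations` and `Invocation5XXErrors` metrics for each endpoint.

```python
def error_rate(invocations: float, errors_5xx: float) -> float:
    """Fraction of invocations that returned a 5xx error."""
    if invocations <= 0:
        return 0.0  # no traffic in the window, so nothing to alert on
    return errors_5xx / invocations


def should_alert(per_endpoint: dict, threshold: float = 0.05) -> bool:
    """per_endpoint maps endpoint name -> (invocations, 5xx errors) for one window."""
    total_invocations = sum(inv for inv, _ in per_endpoint.values())
    total_errors = sum(err for _, err in per_endpoint.values())
    return error_rate(total_invocations, total_errors) > threshold
```

For example, 70 errors across 1,200 invocations gives a rate of about 5.8%, which would cross the threshold, while 30 errors in 1,000 invocations (3%) would not.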


443

MCQeasy

A data scientist wants to version control trained models and manage approvals for deployment. Which SageMaker feature should they use?

A.SageMaker Model Registry.

B.SageMaker Experiments.

C.SageMaker Feature Store.

D.SageMaker Ground Truth.

AnswerA

Model Registry provides version control for models and supports approval workflows for deployment.

Why this answer

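Option A is correct because SageMaker Model Registry is purpose-built for model versioning and approval workflows. Option B is wrong because SageMaker Experiments is for experiment tracking, not deployment approval. Option C is wrong because SageMaker Feature Store is for feature storage.

Option D is wrong because SageMaker Ground Truth is for labeling data.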


444

MCQhard

A team is using Amazon SageMaker's Automatic Model Tuning (AMT) to optimize hyperparameters for a random forest model. After 10 training jobs, the best objective metric value plateaus. The team wants to explore the search space more broadly. Which AMT strategy should they use?

A.Grid search

B.Random search

C.Bayesian optimization

D.Hyperband

AnswerB

Random search explores the entire search space uniformly, increasing the chance of finding new promising regions.

Why this answer

Random search samples hyperparameters randomly and covers the search space more broadly, which can help escape a plateau. Bayesian optimization focuses on promising regions, which may not explore broadly. Grid search is exhaustive but expensive.

Hyperband uses early stopping to allocate resources efficiently but still may not explore broadly if the plateau persists.
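
What random search does can be sketched in a few lines: each trial draws every hyperparameter independently from its range. The hyperparameter names and ranges below are illustrative assumptions for a random forest, not values taken from the question.

```python
import random


def sample_config(rng: random.Random) -> dict:
    """Draw one configuration uniformly from hypothetical ranges."""
    return {
        "n_estimators": rng.randint(50, 500),
        "max_depth": rng.randint(2, 20),
        "max_features": rng.uniform(0.1, 1.0),
    }


def random_search(n_trials: int, seed: int = 0) -> list:
    """Return n_trials independent configurations; seeded for reproducibility."""
    rng = random.Random(seed)
    return [sample_config(rng) for _ in range(n_trials)]
```

Because each draw ignores the results of earlier trials, the samples spread over the whole space instead of concentrating around the current best region.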


445

MCQmedium

A company has a batch transform job in Amazon SageMaker that processes large datasets every night. Recently, the job has been failing sporadically with an out-of-memory error. The data size has not increased. What is the MOST likely cause?

A.The custom inference code has a memory leak that gradually consumes available memory.

B.The data distribution has shifted, causing different memory usage patterns.

C.The instance type is not large enough to handle the dataset.

D.The batch transform input data has increased in size.

AnswerA

A memory leak can cause OOM even with same data size.

Why this answer

Option A is correct because a memory leak in custom inference code causes memory usage to grow over time within a single job, eventually leading to an out-of-memory error. Option C is wrong because the instance type has not changed; if it handled the dataset before, it is not the issue. Option D is wrong because the question states that the data size has not increased.

Option B is wrong because a shift in data distribution does not directly cause OOM; it might change processing behavior but would not necessarily exhaust memory.


446

MCQmedium

A data scientist is using Amazon SageMaker Studio to develop a model. The training job is taking longer than expected. The data scientist suspects that the data is being downloaded from Amazon S3 each time the training starts. What is the BEST way to reduce data loading time?

A.Use SageMaker Pipe Input mode to stream data directly from S3.

B.Enable S3 transfer acceleration and cache the data in S3.

C.Use a larger instance type with more network bandwidth.

D.Use Amazon FSx for Lustre to mount a high-performance file system.

AnswerA

Pipe mode streams data without downloading, reducing start time.

Why this answer

SageMaker Pipe input mode streams data directly from S3 into the training algorithm without first downloading it to the training instance's local storage. This eliminates the bottleneck of copying entire datasets, reducing startup time and disk usage. It is the most direct and efficient way to address the issue of repeated downloads from S3.

Exam trap

The trap here is that candidates often choose a 'bigger instance' (Option C) as a brute-force fix, overlooking that Pipe mode fundamentally changes the data access pattern to eliminate the download bottleneck entirely.

How to eliminate wrong answers

Option B is wrong because S3 Transfer Acceleration speeds up uploads to S3 over long distances, not downloads during training, and caching in S3 does not change the fact that data must still be transferred to the instance. Option C is wrong because while a larger instance with more network bandwidth may reduce transfer time, it does not eliminate the fundamental overhead of downloading the entire dataset to local storage before training begins. Option D is wrong because Amazon FSx for Lustre provides a high-performance file system that can be mounted to SageMaker, but it still requires data to be loaded from S3 into the file system (e.g., via `lustre` import), adding complexity and not directly solving the repeated download issue as efficiently as Pipe mode.
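
In a `CreateTrainingJob`-style request, the input mode is a single setting. The sketch below assumes a request dictionary shaped like the API payload; the `use_pipe_mode` helper and the placeholder image URI are hypothetical, and the training algorithm must itself support reading from a pipe.

```python
import copy


def use_pipe_mode(request: dict) -> dict:
    """Return a copy of the request that streams input from S3 instead of downloading it."""
    updated = copy.deepcopy(request)
    updated.setdefault("AlgorithmSpecification", {})["TrainingInputMode"] = "Pipe"
    return updated


file_mode_request = {
    "AlgorithmSpecification": {
        "TrainingImage": "<training-image-uri>",  # placeholder
        "TrainingInputMode": "File",  # downloads the full dataset before training
    }
}
pipe_mode_request = use_pipe_mode(file_mode_request)
```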


447

MCQmedium

Refer to the exhibit. A SageMaker endpoint is failing health checks. What is the most likely cause?

A.The endpoint is not correctly configured with VPC settings.

B.The model is too large for the instance memory.

C.The inference code has a file descriptor leak.

D.The model server is using an incorrect port.

AnswerC

The error explicitly indicates too many open files, which is a classic symptom of a file descriptor leak.

Why this answer

Option C is correct because the error 'Too many open files' indicates a file descriptor leak in the inference code. Option A would show network errors. Option B would show memory errors.

Option D would show port or connection errors.


448

Multi-Selectmedium

A data scientist is using Amazon SageMaker Data Wrangler for data preparation. Which two tasks can be performed using Data Wrangler's built-in transforms? (Choose two.)

Select 2 answers

A.Running a SQL query on the data

B.Encoding categorical variables

C.Handling missing values

D.Creating an ensemble of models

E.Training a custom machine learning model

AnswersB, C

Data Wrangler has built-in transforms for encoding categorical variables (one-hot or label encoding) and for handling missing values.

Why this answer

Data Wrangler includes built-in transforms for handling missing values and encoding categorical variables. Running SQL queries is possible via custom import, but not a built-in transform. Training models and creating ensembles are not part of Data Wrangler.


449

MCQeasy

Refer to the exhibit. A team configured a SageMaker Model Monitor schedule for data quality. The baseline was created from a training dataset. After running for a day, the monitoring results show frequent violations. What is the most likely cause?

A.The baseline was created from a dataset that does not represent production data.

B.The environment variable max_runtime_in_seconds is too low.

C.The schedule runs too often (every hour), causing overload.

D.The monitoring output destination is incorrect.

AnswerA

If the baseline does not reflect real-world data, constraints will be frequently violated.

Why this answer

Option A is correct because a baseline built from training data may not represent the production data distribution, causing frequent violations. Option C is unlikely because hourly monitoring is typical. Option D would cause job failures, not violations.

Option B would cause timeouts, not violations.


450

MCQhard

Refer to the exhibit. A data scientist runs a SageMaker training job with the above configuration. The training completes but the model performance is poor. Which change to the hyperparameters is most likely to improve the model's AUC?

A.Increase max_depth to 10

B.Increase subsample to 1.0

C.Increase num_round to 200

D.Decrease eta to 0.1

AnswerD

A lower learning rate improves generalization by taking smaller steps, often yielding better AUC.

Why this answer

Option D is correct because reducing eta (learning rate) from 0.3 to 0.1 allows the model to converge more carefully, often improving generalization and AUC. Option C (increase num_round) may cause overfitting, especially with a high learning rate. Option A (increase max_depth) can also lead to overfitting.

Option B (increase subsample to 1.0) uses all data per round, which may reduce regularization and exacerbate overfitting.
