AWS Certified Machine Learning Engineer - Associate (MLA-C01): Questions 1–75

507 questions total · 7 pages · All types, answers revealed

1

MCQ · hard

Refer to the exhibit. A data scientist used a SageMaker training job with a custom Scikit-learn script. The training job failed with the error shown. What is the most likely cause of this failure?

A. The training script is reading the CSV file incorrectly, causing a shape mismatch.

B. The InputDataConfig specifies ContentType text/csv but the actual file is not CSV.

C. The SageMaker training image is outdated and does not support Scikit-learn 1.0.

D. The training data contains missing values that need to be imputed.

Answer: A

Correct: The error indicates a shape issue, and SageMaker's CSV loading can produce 1D arrays for single-column data, which the script must handle.

Why this answer

The error 'Expected 2D array, got 1D array' indicates the input data is being interpreted as a single-dimensional array. In SageMaker, when reading a CSV file with one column, the default behavior may produce a 1D array. The script likely expects a 2D array.

Option A is correct because the script is incorrectly reading or processing the CSV, causing a shape mismatch.
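The shape mismatch can be reproduced outside SageMaker. The following is a minimal sketch, assuming the script loads a single-column CSV with NumPy; the CSV contents and variable names are illustrative and are not taken from the exhibit.

```python
import io
import numpy as np

# Hypothetical single-column CSV, standing in for the training channel file.
csv_text = "1.5\n2.0\n3.5\n"

# np.loadtxt returns a 1D array when the file has only one column.
features = np.loadtxt(io.StringIO(csv_text), delimiter=",")
print(features.shape)  # (3,)

# Scikit-learn estimators expect X with shape (n_samples, n_features),
# so reshape the single feature into one column.
features_2d = features.reshape(-1, 1)
print(features_2d.shape)  # (3, 1)
```

Reshaping with `reshape(-1, 1)` treats the data as many samples of one feature; `reshape(1, -1)` would instead treat it as one sample with many features.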

2

MCQ · hard

A company uses AWS Glue to run ETL jobs that prepare data for machine learning. The source data in S3 has a schema that evolves over time (new columns are added occasionally). The Glue job schema is defined as a fixed schema in the job script. After an update to the source data, the Glue job fails with an error about mismatched schemas. How should the data engineer modify the data preparation process to handle schema evolution?

A. Modify the Glue job to use a dynamic frame and enable schema updates with an 'applyMapping' that includes new columns

B. Run a Glue crawler before each job to update the Data Catalog, but keep the fixed schema in the job

C. Store the schema definition in a separate file in S3 and read it at runtime

D. Manually update the Glue job script each time the schema changes

Answer: A

Dynamic frames with schema detection can adapt to schema changes.

Why this answer

Option A is correct because AWS Glue DynamicFrames natively handle schema evolution by allowing you to apply a mapping that can include new columns. By using `applyMapping` with `resolveChoice`, you can define how to handle new fields (e.g., cast to a type or keep as a struct), preventing job failures when the source schema changes. This avoids the rigidity of a fixed schema in the job script.

Exam trap

The trap here is that candidates often assume updating the Data Catalog via a crawler is sufficient, but they miss that the job script's fixed schema must also be updated or made dynamic to avoid mismatches.

How to eliminate wrong answers

Option B is wrong because running a Glue crawler updates the Data Catalog but does not automatically adapt the fixed schema defined in the job script; the job will still fail if the script expects a specific schema. Option C is wrong because storing the schema in a separate S3 file and reading it at runtime still requires manual updates to that file when the schema changes, which does not provide dynamic adaptation. Option D is wrong because manually updating the job script each time the schema changes is error-prone, not scalable, and defeats the purpose of automated ETL processing.

3

MCQ · hard

Refer to the exhibit. A data scientist ran a SageMaker training job using a built-in algorithm. The job failed with the above error. What is the most likely cause?

A. The S3 bucket lacks proper permissions for SageMaker to read the training data.

B. The input CSV file has missing or mismatched column headers.

C. The built-in algorithm does not support CSV input format.

D. The training instance ran out of memory.

Answer: B

The failure reason indicates the CSV headers do not match the training schema.

Why this answer

Option B is correct because the error explicitly states the data format is incorrect; the CSV headers do not match the expected schema. Option A is wrong because the error is about data format, not permissions. Option C is wrong because the algorithm is built-in and should support CSV if the headers match.

Option D is wrong because the error does not mention memory.

4

Multi-Select · hard

A company is building a CI/CD pipeline for ML models using AWS CodePipeline and SageMaker. The pipeline should include steps to automatically retrain, evaluate, and deploy models. Which THREE components are essential for this pipeline? (Choose three.)

Select 3 answers

A. SageMaker Pipelines to orchestrate training and evaluation steps.

B. Amazon S3 bucket to store training data and model artifacts.

C. Amazon CloudWatch to log API calls.

D. SageMaker Model Registry to store and version models.

E. AWS Lambda function to trigger evaluation.

Answers: A, B, D

Pipelines define the sequence of steps and conditional logic for retraining and evaluation.

Why this answer

SageMaker Pipelines is essential because it provides a native orchestration service to define, automate, and manage the end-to-end ML workflow, including training, evaluation, and conditional deployment steps. It integrates directly with other SageMaker components and CodePipeline, enabling a seamless CI/CD pipeline without requiring custom orchestration logic.

Exam trap

The trap here is that candidates often confuse monitoring services like CloudWatch with essential pipeline components, or assume that a serverless function like Lambda is required for evaluation when SageMaker Pipelines already provides native evaluation capabilities.

5

MCQ · medium

A team used the above config to create an endpoint. However, the endpoint fails to invoke because of a "ModelError". What is the most likely cause?

A. The instance type is not available in the region.

B. The IAM role does not have permission to access the S3 bucket.

C. The model data URL points to a non-existent file.

D. The ECR image URI is incorrect for the region.

Answer: B

Without s3:GetObject, the endpoint cannot load the model artifact.

Why this answer

The most likely cause of a ModelError when invoking a SageMaker endpoint is that the IAM role associated with the endpoint does not have the necessary permissions to access the S3 bucket containing the model artifacts. SageMaker downloads the model data from S3 during endpoint creation, and if the role lacks s3:GetObject permission on the bucket, the model fails to load, resulting in a ModelError.

Exam trap

AWS often tests the distinction between errors that occur during model creation (e.g., invalid S3 URI, missing file) versus errors that occur at invocation time (ModelError), leading candidates to incorrectly choose Option C when the actual cause is a permissions issue that prevents the model from being loaded.

How to eliminate wrong answers

Option A is wrong because an unavailable instance type would cause an 'InsufficientInstanceCapacity' or 'ResourceLimitExceeded' error, not a ModelError. Option C is wrong because a non-existent model data file would cause a 'ModelError' only if the file path is syntactically valid but missing; however, the question states the endpoint fails to invoke, and a missing file typically raises a 'ValidationError' during model creation, not a runtime ModelError. Option D is wrong because an incorrect ECR image URI would cause an 'ImageNotFoundException' or 'AccessDeniedException' during model creation, not a ModelError at invocation time.

6

MCQ · hard

A company uses a SageMaker endpoint with production variants for canary deployments. The team wants to gradually shift traffic from the old model variant (variant A) to the new model variant (variant B) over a period of 10 minutes. After the shift, if the new variant's error rate increases by more than 5%, they want to roll back automatically. Which solution meets these requirements with minimal manual intervention?

A. Use AWS Cloud Map to register the new variant and perform a slow rollout.

B. Deploy variant B as a separate endpoint and use Route 53 weighted routing to shift traffic.

C. Use the SageMaker UpdateEndpoint API with a linear traffic shift from variant A to variant B over 10 minutes, and configure a CloudWatch alarm on the new variant's error rate that triggers a Lambda function to revert the traffic weights.

D. Use AWS CodeDeploy with a deployment group to shift traffic and automatically roll back if CloudWatch alarms trigger.

Answer: C

This approach automates both the gradual shift and the rollback based on error rates.

Why this answer

Option C is correct because the SageMaker UpdateEndpoint API supports a linear traffic shift between production variants, allowing you to gradually route traffic from variant A to variant B over a specified time period (here, 10 minutes). By attaching a CloudWatch alarm on the new variant's error rate that triggers a Lambda function to revert the traffic weights, you achieve automatic rollback with minimal manual intervention when the error rate exceeds the 5% threshold.

Exam trap

The trap here is that candidates often assume AWS CodeDeploy (Option D) can manage SageMaker endpoints because it supports canary deployments for other services, but SageMaker has its own native traffic shifting and rollback mechanisms that are not integrated with CodeDeploy.

How to eliminate wrong answers

Option A is wrong because AWS Cloud Map is a service for service discovery and does not provide traffic shifting or canary deployment capabilities for SageMaker endpoints. Option B is wrong because deploying variant B as a separate endpoint and using Route 53 weighted routing would shift traffic at the DNS level, which introduces latency due to DNS caching and does not integrate with SageMaker's native variant monitoring or automatic rollback mechanisms. Option D is wrong because AWS CodeDeploy does not natively support SageMaker endpoints; it is designed for EC2, Lambda, and ECS deployments, and cannot directly manage traffic shifting or rollback for SageMaker production variants.

7

MCQ · easy

A team is building a machine learning model for natural language processing using SageMaker BlazingText. The data preparation step must format the training data correctly. What format does BlazingText require for supervised text classification?

A. One-hot encoded feature vectors stored in CSV

B. JSON lines with a 'text' and 'label' field

C. Tokenized words separated by spaces, with text and labels combined in a single line (e.g., '__label__positive great product')

D. TFRecord files with sequence features

Answer: C

BlazingText expects this format for supervised learning.

Why this answer

BlazingText for supervised text classification expects the training data in a specific format where each line contains the text and its labels, with labels prefixed by '__label__'. This format allows BlazingText to efficiently parse and process the data for training the word2vec or classification model without additional preprocessing. Option C correctly describes this format, where the label and text are space-separated on a single line.
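As an illustration of the line format described above, here is a minimal sketch of a formatter for one training example. The function name `to_blazingtext_line` is hypothetical; only the '__label__' prefix and the space-separated layout come from the explanation.

```python
def to_blazingtext_line(text, labels):
    """Build one supervised-mode training line: label tokens, then the text."""
    # Tokens are space-separated, so collapse any internal whitespace/newlines.
    tokens = text.split()
    label_tokens = [f"__label__{label}" for label in labels]
    return " ".join(label_tokens + tokens)

print(to_blazingtext_line("great product", ["positive"]))
# __label__positive great product
```

Writing one such line per example into a plain text file gives a file in the layout that option C describes.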

Exam trap

The trap here is that candidates often assume a generic JSON Lines layout with 'text' and 'label' fields, when BlazingText's supervised text classification expects plain text lines that use the '__label__' prefix format, leading them to select option B.

How to eliminate wrong answers

Option A is wrong because BlazingText does not accept one-hot encoded feature vectors in CSV; it requires raw text with inline labels for supervised classification. Option B is wrong because the supervised training file is plain text in the '__label__' prefix format, not JSON objects with 'text' and 'label' fields. Option D is wrong because TFRecord files are used by TensorFlow-based algorithms, not by BlazingText, which expects plain text files with the label-prefixed format.

8

MCQ · hard

During a blue/green deployment of a SageMaker endpoint, the team notices that traffic is not being fully shifted to the new variant after the update. The endpoint has two variants with equal initial weights (50% each). The team wants to shift 100% traffic to the new variant. What is the most likely cause?

A. The new variant is using a different instance type that is not supported in the same endpoint

B. The new variant's model container is failing health checks, so traffic is not routed to it

C. The new variant's weight was set to 100 but the maximum weight per variant is 50

D. The endpoint's load balancer is misconfigured and not forwarding traffic to the new variant

Answer: B

SageMaker performs health checks; if the new variant fails, it stays in 'Creating' state and no traffic is routed.

Why this answer

Option B is correct because SageMaker endpoints route traffic only to variants that pass health checks. If the new variant's model container fails health checks (e.g., due to a misconfigured inference script or incompatible dependencies), SageMaker will not send any traffic to it, regardless of the weight setting. This explains why traffic remains stuck at 50% on the old variant despite the intended shift to 100%.

Exam trap

The trap here is that candidates assume weight settings alone control traffic distribution, overlooking that SageMaker enforces health checks as a prerequisite for routing traffic to any variant.

How to eliminate wrong answers

Option A is wrong because SageMaker endpoints support multiple instance types across variants; a different instance type does not prevent traffic routing. Option C is wrong because SageMaker allows a single variant's weight to be set to 100 (the maximum is 100, not 50), so this would not block the shift. Option D is wrong because SageMaker endpoints use an internal application load balancer managed by the service; there is no customer-accessible load balancer to misconfigure.

9

MCQ · hard

A company uses AWS Glue ETL jobs to transform data for machine learning. They have a dataset with a column 'income' that is heavily right-skewed. Which transformation should be applied to make the distribution more Gaussian-like?

A. Log transformation (natural log)

B. Standardization (z-score)

C. Min-max scaling to [0,1]

D. Equal-width binning

Answer: A

Reduces right skewness, makes distribution more symmetric.

Why this answer

A log transformation is appropriate for heavily right-skewed data because it compresses the long tail by applying a concave function, pulling extreme values closer to the mean and making the distribution more symmetric. In AWS Glue ETL, you can apply this using Spark SQL's `LOG` function or a Python UDF with `numpy.log`, which directly addresses the skewness to better approximate a Gaussian distribution for downstream ML models.

Exam trap

The trap here is that candidates confuse scaling (standardization or min-max) with shape-changing transformations, assuming any normalization makes data Gaussian, when in fact only non-linear transformations like log or Box-Cox address skewness.

How to eliminate wrong answers

Option B is wrong because standardization (z-score) centers and scales data to have mean 0 and standard deviation 1, but it does not change the shape of the distribution—it only rescales, so right-skewness remains. Option C is wrong because min-max scaling to [0,1] linearly compresses the data into a fixed range, which preserves the relative distances and does not alter skewness or make the distribution Gaussian-like. Option D is wrong because equal-width binning discretizes the continuous 'income' column into fixed intervals, which loses granularity and does not transform the distribution toward Gaussian—it creates a categorical or ordinal feature instead.
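The difference between rescaling and reshaping a distribution can be checked numerically. This is a small standard-library sketch, assuming a made-up income column in which each value doubles the previous one; it is plain Python rather than Glue or Spark code.

```python
import math
import statistics

def skewness(values):
    """Population skewness: mean cubed deviation divided by the cubed std dev."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return sum((v - mean) ** 3 for v in values) / (len(values) * std ** 3)

# Hypothetical right-skewed incomes: each value doubles the previous one.
incomes = [10_000 * 2 ** k for k in range(7)]

mean, std = statistics.fmean(incomes), statistics.pstdev(incomes)
standardized = [(v - mean) / std for v in incomes]  # z-score: linear rescale
logged = [math.log(v) for v in incomes]             # natural log: non-linear

print(round(skewness(incomes), 2))       # clearly positive
print(round(skewness(standardized), 2))  # same as the raw skewness
print(round(skewness(logged), 2))        # close to zero
```

Because the z-score is a linear map, the skewness is unchanged; the log turns this multiplicative spacing into even spacing, so the skew disappears for this particular dataset. Real income data would typically end up less skewed rather than perfectly symmetric.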

10

MCQ · hard

A machine learning engineer deploys a model to an Amazon SageMaker endpoint with data capture enabled. The endpoint uses a production variant with initial instance count of 2. After a week, they notice that the captured data is not being sent to the specified Amazon S3 bucket. The IAM role used by the endpoint has the following policy attached. What is the MOST likely reason for the failure?

A. The S3 bucket does not exist.

B. The S3 bucket uses AWS KMS encryption and the role lacks kms:Decrypt permission.

C. The IAM role does not have permission to write to the correct S3 prefix.

D. The IAM role does not have s3:ListBucket permission.

Answer: C

The policy restricts writes to 'captures/' prefix, but the endpoint may use a different prefix.

Why this answer

Option C is correct because the IAM role attached to the SageMaker endpoint must have write permissions to the exact S3 prefix where data capture is configured. The policy shown likely grants access to a broader bucket or a different prefix, but not the specific path (e.g., s3://bucket-name/prefix/) that the endpoint's DataCaptureConfig specifies. Without s3:PutObject on that exact prefix, the captured data fails to upload silently.
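To make the prefix mismatch concrete, here is a rough sketch that checks an object ARN against a policy's Resource patterns. The policy, bucket name, and `allows_put` helper are all hypothetical (the exhibit's policy is not shown), and the matching is a simplification of IAM evaluation: it ignores Deny statements and conditions.

```python
from fnmatch import fnmatchcase

def as_list(value):
    """IAM allows a single string or a list for Action and Resource."""
    return value if isinstance(value, list) else [value]

def allows_put(policy, object_arn):
    """Rough check: does any Allow statement grant s3:PutObject on this ARN?"""
    for stmt in policy["Statement"]:
        if stmt["Effect"] != "Allow":
            continue
        if "s3:PutObject" not in as_list(stmt["Action"]):
            continue
        if any(fnmatchcase(object_arn, p) for p in as_list(stmt["Resource"])):
            return True
    return False

# Hypothetical policy: writes are allowed only under the captures/ prefix.
policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": "s3:PutObject",
        "Resource": "arn:aws:s3:::example-bucket/captures/*",
    }],
}

print(allows_put(policy, "arn:aws:s3:::example-bucket/captures/2024/01/a.jsonl"))     # True
print(allows_put(policy, "arn:aws:s3:::example-bucket/datacapture/2024/01/a.jsonl"))  # False
```

If the endpoint's data capture destination uses any prefix other than the one in the Resource pattern, the second case applies and the write is denied.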

Exam trap

The trap here is that candidates often assume any S3 write permission on the bucket is sufficient, but SageMaker data capture requires explicit permission on the exact prefix path, not just the bucket or a wildcard that doesn't match the configured prefix.

How to eliminate wrong answers

Option A is wrong because if the S3 bucket did not exist, SageMaker would fail at endpoint creation or deployment time, not after a week of operation. Option B is wrong because the question does not mention KMS encryption being enabled on the bucket, and the policy shown does not include kms:Decrypt; if KMS were used, the role would need kms:GenerateDataKey and kms:Decrypt, but the absence of those is not the issue here. Option D is wrong because s3:ListBucket is not required for writing captured data; SageMaker only needs s3:PutObject on the specific prefix, not ListBucket on the bucket.

11

MCQ · easy

A company has a model that receives low traffic but needs to handle sudden spikes. Which deployment option is most cost-effective?

A. SageMaker Serverless Inference

B. SageMaker Real-Time Endpoint with Auto Scaling

C. SageMaker Multi-Model Endpoint

D. SageMaker Batch Transform

Answer: A

Serverless scales to zero during idle periods and handles spikes, minimizing cost.

Why this answer

SageMaker Serverless Inference is the most cost-effective option for low-traffic models with sudden spikes because it automatically scales to zero when not in use and scales up instantly to handle bursts, charging only for the compute time consumed per inference request. This eliminates the cost of idle provisioned infrastructure, making it ideal for unpredictable or intermittent traffic patterns.

Exam trap

AWS often tests the misconception that auto-scaling (Option B) is the most cost-effective for spikes, but the trap is that auto-scaling still requires a baseline of provisioned instances that incur cost even when idle, whereas serverless inference scales to zero and charges only for active compute time.

How to eliminate wrong answers

Option B (SageMaker Real-Time Endpoint with Auto Scaling) is wrong because it requires always-on provisioned instances, incurring costs even during idle periods, and auto-scaling has a lag that may not handle sudden spikes as quickly as serverless. Option C (SageMaker Multi-Model Endpoint) is wrong because it still uses provisioned instances that run continuously, and while it shares resources across models, it does not scale to zero or handle sudden spikes without pre-provisioned capacity. Option D (SageMaker Batch Transform) is wrong because it is designed for offline, asynchronous batch processing on a complete dataset, not for real-time inference with low latency or handling live traffic spikes.

12

MCQ · hard

A machine learning team is training a large natural language processing model on Amazon SageMaker using the SageMaker Hugging Face container. The training job runs on multiple instances and uses Managed Spot Training to reduce costs. However, the job frequently gets interrupted by Spot interruptions, causing long training times. What should the team do to mitigate this issue?

A. Use a reserved capacity with Savings Plans

B. Use a larger instance type to finish faster

C. Enable checkpointing and increase the number of save intervals

D. Disable Managed Spot Training and use On-Demand instances

Answer: C

Checkpointing saves model state so training can resume after a Spot interruption; more frequent saves reduce the amount of work lost.

Why this answer

Enabling checkpointing and saving intermediate model states at appropriate intervals allows the training job to resume from the last checkpoint after a Spot interruption, significantly reducing wasted time. Increasing the number of save intervals means checkpoints are written more frequently, which reduces the work lost per interruption. Reserved capacity does not help with interruptions, using larger instances does not prevent interruptions, and disabling Spot increases cost.
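The resume-from-checkpoint behavior can be sketched with a toy loop. This is an illustration only: the `train` function, the JSON checkpoint file, and the simulated interruption are assumptions, and a real job would save model and optimizer state to the checkpoint directory that SageMaker syncs to S3.

```python
import json
import os
import tempfile

def train(total_steps, checkpoint_dir, save_every, interrupt_at=None):
    """Toy training loop that resumes from the latest checkpoint if one exists."""
    path = os.path.join(checkpoint_dir, "checkpoint.json")
    step = 0
    if os.path.exists(path):
        with open(path) as f:
            step = json.load(f)["step"]  # resume instead of restarting at 0
    start = step
    while step < total_steps:
        if interrupt_at is not None and step == interrupt_at:
            return start, step, False  # simulated Spot interruption
        step += 1  # stand-in for one real training step
        if step % save_every == 0:
            with open(path, "w") as f:
                json.dump({"step": step}, f)
    return start, step, True

with tempfile.TemporaryDirectory() as ckpt_dir:
    print(train(100, ckpt_dir, save_every=10, interrupt_at=47))  # (0, 47, False)
    print(train(100, ckpt_dir, save_every=10))                   # (40, 100, True)
```

With a checkpoint every 10 steps, the interruption at step 47 costs 7 steps of rework; a smaller `save_every` would shrink that loss at the price of more frequent writes.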

13

MCQ · hard

A data engineer is processing a large dataset in Amazon S3 with AWS Glue ETL. The dataset contains timestamps in multiple time zones. The engineer needs to create a feature for hour-of-day consistent across all records. Which approach ensures correctness?

A. Convert all timestamps to UTC in the ETL script using Spark's from_utc_timestamp

B. Use AWS Glue's built-in transform to parse timestamps with timezone offsets

C. Use Python's datetime.strptime with tzlocal

D. Convert all timestamps to UTC during the ETL process, then extract hour

Answer: D

Normalizing to UTC before extracting hour guarantees consistency across time zones.

Why this answer

Option D is correct because converting all timestamps to UTC during the ETL process ensures a consistent time zone reference before extracting the hour-of-day feature. This avoids ambiguity from mixed time zones and aligns with best practices for machine learning feature engineering. AWS Glue ETL with Apache Spark provides built-in functions like `to_utc_timestamp()` to perform this conversion reliably.
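The normalize-then-extract order can be illustrated with the standard library. This is a sketch in plain Python rather than Spark, assuming each record carries an ISO 8601 timestamp with an explicit offset; the `utc_hour` helper is hypothetical.

```python
from datetime import datetime, timezone

def utc_hour(iso_timestamp):
    """Parse an ISO 8601 timestamp with an explicit offset; return its UTC hour."""
    local = datetime.fromisoformat(iso_timestamp)
    if local.tzinfo is None:
        raise ValueError("timestamp must carry a UTC offset")
    return local.astimezone(timezone.utc).hour

# The same instant recorded in two different zones.
print(utc_hour("2024-03-01T23:30:00-05:00"))  # 4
print(utc_hour("2024-03-02T06:30:00+02:00"))  # 4
```

Extracting the hour before normalizing would give 23 and 6 for the same instant, which is exactly the inconsistency the question asks to avoid.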

Exam trap

AWS often tests the confusion between `from_utc_timestamp` and `to_utc_timestamp` in Spark, where candidates mistakenly choose the function that converts away from UTC instead of to UTC, leading to incorrect hour-of-day features.

How to eliminate wrong answers

Option A is wrong because `from_utc_timestamp` in Spark converts a UTC timestamp to a specified time zone, not to UTC, which would introduce inconsistency. Option B is wrong because AWS Glue's built-in transforms (e.g., `ResolveChoice`) do not provide a dedicated transform to parse timestamps with timezone offsets and normalize them to a single time zone; they only handle schema resolution. Option C is wrong because Python's `datetime.strptime` with `tzlocal` relies on the local system time zone, which is not deterministic in a distributed ETL environment like AWS Glue and can vary across workers, leading to incorrect hour extraction.

14

MCQ · hard

A machine learning team is developing a deep learning model for image classification. They observe that the training loss decreases rapidly but the validation loss starts increasing after a few epochs. Which strategy should they implement to address this issue?

A. Increase the batch size

B. Add more convolutional layers

C. Increase the learning rate

D. Apply dropout regularization

Answer: D

Dropout prevents co-adaptation of neurons, acting as a regularizer to reduce overfitting.

Why this answer

Increasing validation loss indicates overfitting. Dropout regularization randomly drops neurons during training, which reduces overfitting. Increasing learning rate would make training unstable.

Adding more layers increases capacity and likely worsens overfitting. Increasing batch size can have a regularizing effect but is not as direct as dropout.
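For readers who want to see what "randomly drops neurons" means mechanically, here is a minimal sketch of inverted dropout on a list of activations. It is a teaching example in plain Python, not the implementation used by any particular framework, and the `dropout` function and its parameters are illustrative.

```python
import random

def dropout(activations, rate, training, rng=random):
    """Inverted dropout: zero each unit with probability `rate` during training
    and scale the survivors by 1 / (1 - rate); at inference, pass through."""
    if not training or rate == 0.0:
        return list(activations)
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)
hidden = [1.0] * 10
print(dropout(hidden, rate=0.5, training=True, rng=rng))  # mix of 0.0 and 2.0
print(dropout(hidden, rate=0.5, training=False))          # unchanged
```

Scaling the surviving units keeps the expected activation the same in training and inference, so no adjustment is needed when dropout is switched off for validation.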

15

Multi-Select · medium

A company uses Amazon SageMaker Model Monitor to track data quality. The monitoring job triggers an alert indicating that the data distribution has shifted beyond the configured threshold. Which TWO actions should the team take? (Choose TWO.)

Select 2 answers

A. Update the Model Monitor baseline if the drift is acceptable

B. Delete the monitoring schedule

C. Retrain the model with updated training data

D. Increase the instance count of the endpoint

E. Evaluate the data quality report

Answers: A, C

If the drift reflects a new normal, updating the baseline prevents false alerts.

Why this answer

Options A and C are correct because the team should retrain the model on the new data distribution if the drift is significant, or update the baseline if the drift is acceptable and represents expected behavior. Options B, D, and E are not appropriate immediate actions.

16

Multi-Select · medium

A data scientist needs to prepare a dataset for a binary classification model. The dataset contains 100,000 records with 50 features, including categorical variables with high cardinality, missing values in 30% of records for a key numeric feature, and a severe class imbalance (5% positive class). The data is stored in an Amazon S3 bucket. Which TWO actions should the data scientist take to improve model performance and ensure robust data preparation? (Choose two.)

Select 2 answers

A. Use stratified sampling to split the dataset into training and test sets, preserving the class imbalance ratio.

B. Delete all records with missing values to ensure data integrity.

C. Apply one-hot encoding to all categorical features regardless of cardinality.

D. Randomly undersample the majority class to balance the dataset before training.

E. Use scikit-learn's StandardScaler inside an AWS Glue job to standardize numeric features.

Answers: A, E

Stratified sampling ensures that both training and test sets have the same class distribution, which is critical for imbalanced data.

Why this answer

Option A is correct because stratified sampling preserves the class distribution in the train/test split, and option E is correct because standard scaling is important for distance-based models. Option B is wrong because deleting records with missing values would discard 30% of the data, leading to loss of information and potential bias. Option C is wrong because one-hot encoding high-cardinality features creates too many dummy variables, causing the curse of dimensionality.

Option D is wrong because random undersampling can discard valuable majority class examples, reducing model performance.
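A stratified split can be sketched in a few lines. In practice scikit-learn's `train_test_split` with its `stratify` argument is the usual tool; the standard-library version below is only an illustration, and the `stratified_split` helper and the synthetic labels are assumptions.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction, seed=0):
    """Return (train_idx, test_idx) with each class split at the same fraction."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx

# Hypothetical labels with the 5% positive rate described in the question.
labels = [1] * 50 + [0] * 950
train_idx, test_idx = stratified_split(labels, test_fraction=0.2)
print(sum(labels[i] for i in test_idx) / len(test_idx))    # 0.05
print(sum(labels[i] for i in train_idx) / len(train_idx))  # 0.05
```

A purely random split of the same data could easily leave the test set with noticeably more or fewer than 5% positives, which distorts evaluation metrics on the minority class.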

17

MCQ · hard

Refer to the exhibit. A SageMaker Pipeline fails with 'Invalid output reference' at the TrainingStep. What is the most likely cause?

A. The TuningStep output name is misspelled

B. The pipeline role lacks permissions

C. The TrainingStep expects a single artifact but TuningStep produces multiple

D. The instance type is incompatible

Answer: C

Tuning step outputs multiple models; directly passing to training step causes ambiguity.

Why this answer

Option C is correct because in SageMaker Pipelines, a `TrainingStep` that expects a single artifact as input will fail with 'Invalid output reference' if the preceding `TuningStep` produces multiple artifacts (e.g., from multiple training jobs). The pipeline cannot resolve which specific artifact to pass, causing the error.

Exam trap

AWS often tests the subtle distinction between output reference errors caused by naming mismatches versus those caused by cardinality mismatches, where candidates mistakenly focus on permissions or spelling instead of the pipeline's inability to handle multiple artifacts.

How to eliminate wrong answers

Option A is wrong because a misspelled output name would cause a different error (e.g., 'Property not found'), not 'Invalid output reference', and the pipeline would fail at the step referencing the name, not at the TrainingStep. Option B is wrong because insufficient pipeline role permissions typically result in an 'AccessDenied' or 'UnauthorizedOperation' error, not 'Invalid output reference'. Option D is wrong because an incompatible instance type causes an 'InsufficientInstanceCapacity' or 'ResourceLimitExceeded' error during step execution, not an output reference validation error.

18

MCQ · medium

A retail company uses SageMaker to train a multi-class image classification model with a custom ResNet-50 implemented in TensorFlow. The training data is 500 GB of images stored in S3. The data scientist uses an ml.p3.2xlarge instance with a single GPU. The training takes 10 hours per epoch, and the model does not converge after 5 epochs. The scientist needs to accelerate training and improve model accuracy. The current implementation loads images individually from S3 using TensorFlow's tf.data API. The scientist also notices high I/O wait time. Which combination of actions should the scientist take? (Assume the scientist is aware of best practices.) The answer is a single choice from A-D.

A. Increase the number of epochs to 20 and enable early stopping with patience 5.

B. Convert images to RecordIO format and store them on Amazon EFS for faster access.

C. Deploy the model on a SageMaker endpoint and use batch transform for offline predictions.

D. Use SageMaker Pipe mode for data ingestion and upgrade to an ml.p3.8xlarge instance.

Answer: D

Pipe mode reduces I/O wait by streaming data; more GPUs parallelize training.

Why this answer

Option D is correct because using SageMaker Pipe mode streams data directly from S3 to the training container, reducing I/O bottlenecks. Additionally, switching to a multi-GPU instance like ml.p3.8xlarge speeds up computation. Option A is wrong because increasing epochs does not address I/O or speed.

Option C is wrong because batch transform is for inference. Option B is wrong because RecordIO is not natively supported by TensorFlow's tf.data without conversion, and EFS adds network latency.

19

MCQ · easy

A data scientist is performing feature engineering on a dataset containing a categorical feature with high cardinality (over 1000 unique values). Which encoding method is most appropriate to use as input for a tree-based model?

A. One-hot encoding

B. Label encoding

C. Target encoding

D. Binary encoding

Answer: B

Label encoding converts categories to integers, which tree-based models can handle without expanding feature space.

Why this answer

Option B is correct because label encoding assigns integer labels to categories, and tree-based models can effectively split on these ordinal-like values without creating a large number of features. Option A (one-hot encoding) would produce too many features. Option C (target encoding) risks data leakage.

Option D (binary encoding) creates fewer features than one-hot but still many and may not be as interpretable for trees.
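Label encoding itself is a simple mapping from category to integer. The sketch below uses a hypothetical city column and two illustrative helpers; the `-1` code for unseen categories is an assumption of this example, not a standard.

```python
def fit_label_encoder(values):
    """Map each distinct category to an integer code, in sorted order."""
    return {category: code for code, category in enumerate(sorted(set(values)))}

def encode(values, mapping, unknown=-1):
    """Encode categories; values unseen during fitting get the `unknown` code."""
    return [mapping.get(value, unknown) for value in values]

# Hypothetical categorical column.
cities = ["lima", "oslo", "cairo", "oslo", "lima"]
mapping = fit_label_encoder(cities)
print(mapping)                              # {'cairo': 0, 'lima': 1, 'oslo': 2}
print(encode(cities + ["quito"], mapping))  # [1, 2, 0, 2, 1, -1]
```

The encoded feature stays a single column no matter how many categories exist, which is the property the explanation relies on for high-cardinality inputs to tree-based models.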

20

Multi-Select · hard

A machine learning engineer is deploying a custom PyTorch model to a SageMaker endpoint for real-time inference. The model requires GPU acceleration. The engineer wants to minimize latency and cost. Which THREE actions should the engineer take? (Select THREE.)

Select 3 answers

A. Use an ml.c5.2xlarge instance with CPU only

B. Use SageMaker Batch Transform for inference

C. Compile the model with SageMaker Neo

D. Use SageMaker Elastic Inference (EI) instead of a full GPU instance

E. Use an ml.p3.2xlarge instance for the endpoint

Answers: C, D, E

Neo optimizes the model for faster inference on target hardware.

Why this answer

SageMaker Neo compiles the PyTorch model into an optimized runtime binary that is specifically tuned for the target hardware (e.g., GPU instances like ml.p3). This reduces inference latency by applying graph-level optimizations, operator fusion, and memory layout transformations without changing the model's accuracy, while also lowering compute resource usage and cost.

Exam trap

AWS often tests the distinction between real-time vs. batch inference and the trade-off between full GPU instances and lighter acceleration options like Elastic Inference, expecting candidates to recognize that Batch Transform is not suitable for low-latency endpoints and that CPU-only instances cannot meet GPU requirements.

Full explanation →

21

MCQmedium

A machine learning engineer sees the above error in Amazon CloudWatch Logs for a SageMaker endpoint. What is the most likely cause?

A.The model file is corrupted during deployment.

B.The data capture configuration is incorrectly set to capture only the response body.

C.The inference code in the Docker container outputs a different response format than expected by the endpoint.

D.The endpoint is overloaded and dropping requests.

AnswerC

The inference script (e.g., in a SageMaker inference container) must output the exact JSON structure the endpoint expects. This error shows a mismatch.

Why this answer

The error indicates the model returned a response with an unexpected structure. The expected format was a JSON with a 'predictions' array, but the model output a single 'prediction' field. This mismatch is typically due to a bug in the inference code within the Docker container, not corruption, overload, or misconfiguration of data capture.
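A minimal sketch of the fix, assuming the contract described above (a JSON body with a 'predictions' array). The function names are hypothetical and not part of any SageMaker API; the point is only that the serializer in the container must emit the structure the caller expects.

```python
import json


def format_response_buggy(score):
    """Returns a single 'prediction' field, which the caller does not expect."""
    return json.dumps({"prediction": float(score)})


def format_response_fixed(scores):
    """Returns the assumed contract: a 'predictions' array, one entry per input."""
    return json.dumps({"predictions": [float(s) for s in scores]})


body = json.loads(format_response_fixed([0.25, 0.75]))
print(body)
```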

Full explanation →

22

Multi-Selectmedium

A data engineer is preparing a dataset for a classification model. The dataset contains duplicate rows. Which TWO approaches are appropriate to handle duplicates in AWS? (Choose 2.)

Select 2 answers

A.Use the RemoveDuplicates built-in feature in Amazon QuickSight

B.Use the DistinctRows transform in Amazon SageMaker Data Wrangler

C.Use the DropDuplicates transform in AWS Glue

D.Use a SQL query with SELECT DISTINCT in Amazon Athena to create a deduplicated table

E.Use the pandas drop_duplicates() method in a SageMaker notebook

AnswersC, D

Glue's DropDuplicates removes duplicate rows in a distributed manner.

Why this answer

Option C is correct because AWS Glue provides a DropDuplicates transform within its DynamicFrame API, which is designed for ETL operations on large-scale datasets. This transform efficiently removes duplicate rows by comparing all columns or a specified subset, making it a native and scalable solution for deduplication in AWS.

Exam trap

The trap here is that candidates confuse the existence of a feature name (e.g., 'DistinctRows' in Data Wrangler) with the actual available transform, or they incorrectly assume that any Python code in a SageMaker notebook qualifies as an 'AWS approach' rather than a custom script.

Full explanation →

23

MCQeasy

A company has 10 TB of log data in compressed JSON format stored in Amazon S3. The data needs to be processed and transformed into a structured format for machine learning. The processing requires complex transformations, including parsing nested JSON and joining with a reference table. The company wants to minimize infrastructure management. Which approach should the company use?

A.Use SageMaker Processing jobs to run custom scripts.

B.Use Amazon Athena to query and transform the data.

C.Use Amazon EMR with Apache Spark.

D.Use AWS Glue ETL with PySpark.

AnswerC

EMR is designed for large-scale data processing with Spark.

Why this answer

Option C is correct because Amazon EMR with Apache Spark is designed for large-scale data processing (10 TB) and can handle complex transformations like parsing nested JSON and joining with reference tables efficiently. Spark's in-memory processing and support for structured data via DataFrames make it ideal for this workload, while EMR minimizes infrastructure management by providing a managed Hadoop/Spark cluster.

Exam trap

The trap here is that candidates confuse AWS Glue ETL with PySpark (Option D) as the default managed ETL service, but for large-scale complex transformations, EMR offers better performance and cost control, while Glue is more suited for smaller, simpler workloads or serverless needs.

How to eliminate wrong answers

Option A is wrong because SageMaker Processing jobs are optimized for ML-specific tasks like training data preprocessing, not for general-purpose ETL on 10 TB of data; they lack native support for complex joins and nested JSON parsing at scale. Option B is wrong because Amazon Athena is a serverless query engine that excels at ad-hoc SQL queries but struggles with complex transformations like parsing deeply nested JSON and joining large reference tables due to its per-query pricing and lack of native procedural logic. Option D is wrong because AWS Glue ETL with PySpark is a valid alternative for ETL, but it is less performant and more expensive than EMR for large-scale (10 TB) data processing due to its auto-scaling overhead and limited tuning capabilities; EMR provides finer control over cluster configuration and cost optimization for batch jobs.

Full explanation →

24

Multi-Selecteasy

A company is adopting Amazon SageMaker Pipelines to automate their ML workflow. They want to choose three key benefits that SageMaker Pipelines provides over traditional manual scripts and ad-hoc steps. Which THREE benefits are correct?

Select 3 answers

A.Model lineage tracking from raw data to trained model artifacts.

B.Automated deployment of models to endpoints upon pipeline completion.

C.Event-driven execution when new data arrives in S3.

D.Automatic scaling of compute resources based on data volume.

E.Reproducible execution through a directed acyclic graph (DAG) of steps with re-run capabilities.

AnswersA, C, E

Pipelines automatically capture lineage metadata.

Why this answer

Option A is correct because SageMaker Pipelines automatically captures and tracks the lineage of every artifact, including datasets, processing jobs, training jobs, and model versions. This lineage is stored in SageMaker's metadata store, enabling full traceability from raw data to the final model artifact, which is critical for auditability and compliance in ML workflows.

Exam trap

AWS often tests the distinction between orchestration features (like SageMaker Pipelines) and infrastructure management features (like auto-scaling), leading candidates to confuse pipeline benefits with SageMaker's broader managed service capabilities.

Full explanation →

25

MCQmedium

A company is using SageMaker Model Registry to manage model versions. They want to automatically deploy the latest approved model to production after retraining. Which approach is best?

A.Manually deploy the approved model using the SageMaker console

B.Use AWS Lambda to update the endpoint whenever a new model version is created

C.Create a SageMaker Pipeline that includes a model approval step and deployment step

D.Schedule a CloudWatch Event to invoke a SageMaker update endpoint API daily

AnswerC

SageMaker Pipelines can model the entire workflow including conditional deployment based on approval.

Why this answer

Option C is correct because a SageMaker Pipeline can orchestrate the entire workflow from retraining to deployment, including a model approval step that gates deployment to production only when the model is approved. This automates the process end-to-end, ensuring that only approved models are deployed, which aligns with the requirement to automatically deploy the latest approved model after retraining.

Exam trap

The trap here is that candidates may choose Option B because it sounds automated, but they overlook the critical requirement for model approval before deployment, which Lambda alone cannot enforce without additional logic.

How to eliminate wrong answers

Option A is wrong because manual deployment via the SageMaker console does not automate the process and violates the requirement for automatic deployment after retraining. Option B is wrong because using AWS Lambda to update the endpoint whenever a new model version is created would deploy models without waiting for approval, bypassing the model approval step and potentially deploying unapproved models. Option D is wrong because scheduling a CloudWatch Event to invoke a SageMaker update endpoint API daily does not tie deployment to model approval or retraining events; it deploys on a fixed schedule regardless of model status.

Full explanation →

26

MCQmedium

A data scientist needs to ensure that the same train/test split is used across multiple experiments for reproducibility in SageMaker. Which approach should they take?

A.Use the same SageMaker instance type

B.Use the same hyperparameter values

C.Use the same dataset version

D.Set a random seed in the training script

AnswerD

Correct: Setting a random seed ensures reproducibility of random operations like data splits.

Why this answer

Option D is correct because setting a random seed in the training script ensures reproducibility of the data split. Options A, B, and C are incorrect because instance type, hyperparameter values, and dataset version do not control the random split.
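The effect of a fixed seed can be sketched with the standard library alone. The function name and the 80/20 split are assumptions for illustration; a real training script would more likely pass the same seed to whatever split utility it already uses.

```python
import random


def train_test_split_seeded(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of `rows` with a seeded generator, then slice off the test set."""
    rng = random.Random(seed)  # private generator, unaffected by other random calls
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


rows = list(range(100))
train_a, test_a = train_test_split_seeded(rows, seed=7)
train_b, test_b = train_test_split_seeded(rows, seed=7)
print(test_a == test_b)  # same seed, same split
```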

Full explanation →

27

Multi-Selecthard

A company is using an AWS Step Functions state machine to orchestrate a multi-step ML deployment. The workflow includes: training a model, evaluating it, registering the model, and deploying to a staging endpoint. They need to implement an approval gate before deploying to production. Which THREE components are necessary to achieve this? (Choose three.)

Select 3 answers

A.An AWS CodePipeline pipeline with approval stage

B.A task in the state machine that pauses and waits for manual approval via SNS or Lambda

C.Model Registry to store the approved model version after evaluation

D.An Amazon SNS topic for notification of approval status

E.An API call to SageMaker to create or update the production endpoint

AnswersB, C, E

Step Functions can use 'Wait for Task Token' to implement human approval.

Why this answer

Option B is correct because Step Functions can use a task with a callback pattern (`.waitForTaskToken`) to pause the workflow and wait for external manual approval. When combined with an SNS topic or Lambda function that sends a task success or failure signal back to Step Functions, this creates a reliable approval gate. This pattern allows the state machine to halt execution until a human approves or rejects the deployment, which is essential for production deployment control.
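The callback pattern described above can be sketched as a single Amazon States Language task state, built here as a Python dictionary. The state names and the Lambda function name are hypothetical. The approver side, not shown, would call the Step Functions `SendTaskSuccess` or `SendTaskFailure` API with the token it received.

```python
import json

# Hypothetical approval state: the workflow pauses here until the token is returned.
wait_for_approval_state = {
    "Type": "Task",
    "Resource": "arn:aws:states:::lambda:invoke.waitForTaskToken",
    "Parameters": {
        "FunctionName": "notify-approver",  # hypothetical Lambda function
        "Payload": {
            # The context object supplies the token the approver must send back.
            "token.$": "$$.Task.Token",
            "modelInfo.$": "$",
        },
    },
    "Next": "DeployToProduction",  # hypothetical next state
}

definition_fragment = json.dumps({"WaitForApproval": wait_for_approval_state}, indent=2)
print(definition_fragment)
```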

Exam trap

AWS often tests the distinction between a notification-only service (like SNS) and a service that can actively pause and resume a workflow (like Step Functions with task tokens), leading candidates to mistakenly select SNS as a sufficient approval gate component.

Full explanation →

28

MCQeasy

A company stores its model training data in Amazon S3. To meet compliance requirements, all data in transit between the S3 bucket and SageMaker must be encrypted. What should the company enforce?

A.Enable S3 versioning

B.Enable S3 access logging

C.Enforce HTTPS for all S3 access

D.Use S3 server-side encryption (SSE-S3)

AnswerC

HTTPS provides encryption in transit.

Why this answer

Option C is correct because enforcing HTTPS for all S3 access ensures that data in transit between the S3 bucket and SageMaker is encrypted using TLS. This meets the compliance requirement for encrypting data in transit, as HTTPS uses TLS to protect data as it travels over the network.

Exam trap

The trap here is that candidates often confuse encryption at rest (SSE-S3) with encryption in transit (HTTPS/TLS), leading them to select Option D when the question explicitly asks about data in transit.

How to eliminate wrong answers

Option A is wrong because S3 versioning is a data protection feature that preserves, retrieves, and restores every version of an object stored in a bucket; it does not encrypt data in transit. Option B is wrong because S3 access logging provides detailed records of requests made to a bucket for auditing purposes, but it does not enforce or provide encryption for data in transit. Option D is wrong because S3 server-side encryption (SSE-S3) encrypts data at rest within S3, not data in transit between S3 and SageMaker.
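One common way to enforce HTTPS is a bucket policy that denies any request made without TLS, using the `aws:SecureTransport` condition key. The sketch below only builds the policy document; the bucket name is a placeholder, and attaching the policy to a bucket is left out.

```python
import json


def deny_insecure_transport_policy(bucket_name):
    """Build a bucket policy that rejects requests not sent over HTTPS."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyInsecureTransport",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                "Resource": [
                    f"arn:aws:s3:::{bucket_name}",
                    f"arn:aws:s3:::{bucket_name}/*",
                ],
                # Matches requests that did not arrive over TLS.
                "Condition": {"Bool": {"aws:SecureTransport": "false"}},
            }
        ],
    }


policy = deny_insecure_transport_policy("example-training-data")  # placeholder name
policy_json = json.dumps(policy, indent=2)
print(policy_json)
```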

Full explanation →

29

MCQhard

A company is running multiple SageMaker endpoints for different models, each serving a separate business unit. The total cost is growing rapidly. The ML engineering team wants to reduce costs without sacrificing performance or isolation. They are considering either consolidating models into a Multi-Model Endpoint (MME) or onto a Multi-Container Endpoint (MCE). The models vary in size from 100 MB to 5 GB, and traffic patterns are unpredictable. Which recommendation is MOST appropriate?

A.Use a Multi-Model Endpoint with a single large instance type to host all models, and enable SageMaker inference pipelines if pre-processing is needed.

B.Use Multi-Container Endpoints to deploy multiple models on a single instance.

C.Migrate all models to AWS Lambda functions for serverless inference.

D.Keep individual endpoints but switch to Graviton-based instances for cost savings.

AnswerA

Multi-Model Endpoints load models on demand, allowing many small models to share an instance, reducing cost. They support isolation through model directories and can be combined with inference pipelines.

Why this answer

A Multi-Model Endpoint (MME) is the most appropriate choice because it allows hosting multiple models on a single instance while keeping them isolated in separate memory spaces, which reduces cost by sharing the underlying infrastructure. MME dynamically loads and unloads models based on traffic, making it ideal for unpredictable patterns and model sizes ranging from 100 MB to 5 GB. Inference pipelines can be added for pre-processing without breaking the multi-model architecture, preserving performance and isolation.

Exam trap

The trap here is that candidates confuse Multi-Model Endpoints with Multi-Container Endpoints, assuming both provide similar isolation and cost benefits, but MCE is designed for microservices-like architectures where all containers must be active, not for dynamic model loading based on traffic.

How to eliminate wrong answers

Option B is wrong because Multi-Container Endpoints (MCE) run multiple containers on the same instance but all containers are always active, which does not optimize for unpredictable traffic and can lead to higher memory usage and cost. Option C is wrong because AWS Lambda has a maximum deployment package size of 250 MB (unzipped, including layers) and a 15-minute timeout, making it unsuitable for models up to 5 GB and real-time inference with unpredictable latency. Option D is wrong because keeping individual endpoints, even with Graviton instances, does not address the core issue of cost from multiple endpoints; it only reduces per-instance cost marginally, while consolidation is needed for significant savings.

Full explanation →

30

MCQmedium

A company deploys a real-time inference endpoint on SageMaker for a customer-facing application. Traffic patterns are unpredictable and sometimes spike. The endpoint must scale automatically to handle load while minimizing cost. Which approach should the company take?

A.Switch to batch transform for all inference requests.

B.Use a larger instance type to handle peak traffic.

C.Configure a target tracking scaling policy on the endpoint using CloudWatch metrics.

D.Deploy multiple models behind an Application Load Balancer.

AnswerC

Target tracking scales instances in/out based on a metric threshold, matching demand.

Why this answer

Option C is correct because a target tracking scaling policy with a metric such as invocations per instance scales the endpoint automatically based on demand. Option B is wrong because a larger instance type without auto scaling leads to over-provisioning. Option D is wrong because placing multiple models behind an Application Load Balancer does not address scaling.

Option A is wrong because batch transform is for offline batch inference, not real-time requests.
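A target tracking policy for an endpoint variant is configured through Application Auto Scaling. The sketch below only assembles the request arguments. The endpoint name, variant name, policy name, target value, and cooldowns are illustrative assumptions, and the variant would first need to be registered as a scalable target.

```python
def build_target_tracking_policy(endpoint_name, variant_name, target_per_instance):
    """Assemble keyword arguments for Application Auto Scaling's PutScalingPolicy."""
    return {
        "PolicyName": f"{endpoint_name}-invocations-target",  # hypothetical name
        "ServiceNamespace": "sagemaker",
        "ResourceId": f"endpoint/{endpoint_name}/variant/{variant_name}",
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            "TargetValue": float(target_per_instance),
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
            },
            "ScaleOutCooldown": 60,   # assumed values, tune per workload
            "ScaleInCooldown": 300,
        },
    }


policy_args = build_target_tracking_policy("recs-endpoint", "AllTraffic", 100)
# With boto3 installed and credentials configured, these could be passed as
# boto3.client("application-autoscaling").put_scaling_policy(**policy_args)
print(policy_args["ResourceId"])
```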

Full explanation →

31

MCQhard

In SageMaker Data Wrangler, you have a flow that imports data from Amazon S3 and needs to join it with a table from Amazon Redshift. The data volumes are large (hundreds of GB). Which approach is most efficient within Data Wrangler?

A.Use Amazon Athena federated query to join in place and import the result

B.Export the Redshift table to S3 as Parquet, then import both datasets into Data Wrangler and join

C.Use AWS Glue to join the datasets and output to S3, then import the joined result into Data Wrangler

D.Import the Redshift table directly using a Data Wrangler source step and apply a join transform

AnswerD

Data Wrangler can connect to Redshift natively and perform joins efficiently.

Why this answer

Option D is correct because SageMaker Data Wrangler natively supports Amazon Redshift as a source via a direct connection, allowing you to import the Redshift table as a source step and then apply a join transform within the same visual flow. This approach avoids unnecessary data movement or intermediate exports, which is critical for hundreds of GB of data, as it leverages Data Wrangler's optimized in-memory and Spark-based processing to perform the join efficiently.

Exam trap

The trap here is that candidates assume large-scale joins must be offloaded to external services like AWS Glue or Athena, but Data Wrangler's native Redshift source and join transform are designed for this exact use case, making the direct approach the most efficient.

How to eliminate wrong answers

Option A is wrong because Amazon Athena federated query is designed for querying across data sources, but it does not integrate directly as a join step within Data Wrangler; you would need to export the result to S3 and re-import, adding latency and complexity. Option B is wrong because exporting the Redshift table to S3 as Parquet introduces an extra data movement step that is inefficient for large volumes, and Data Wrangler can directly import from Redshift without this intermediate export. Option C is wrong because using AWS Glue to join the datasets and output to S3 adds an unnecessary orchestration layer and data duplication, whereas Data Wrangler can perform the join natively without external services.

Full explanation →

32

MCQhard

A data scientist is preparing a large dataset (50 GB) for training a TensorFlow model on SageMaker. The dataset consists of many small CSV files. Training is slow due to I/O bottlenecks. Which data preparation strategy most effectively accelerates training?

A.Convert the dataset to TFRecord format and use tf.data pipeline with prefetching

B.Convert the dataset to Parquet format and use Apache Arrow for loading

C.Compress the CSV files and decompress during data loading

D.Use a larger instance type with more vCPUs

AnswerA

TFRecord combines many records into a few large files, and prefetching improves data pipeline efficiency.

Why this answer

Option A is correct because TFRecord format stores data in a binary, row-oriented format that TensorFlow's tf.data API can read efficiently, especially with prefetching to overlap data loading with model computation. This eliminates the per-file open/parse overhead of many small CSV files, which is the primary cause of I/O bottlenecks in this scenario.

Exam trap

The trap here is that candidates often choose larger instances (Option D) as a brute-force fix, failing to recognize that the root cause is the small-file I/O pattern, which requires a format change (TFRecord) rather than more compute resources.

How to eliminate wrong answers

Option B is wrong because Parquet is a columnar storage format optimized for analytical queries and selective column reads, not for sequential row-by-row training loops typical in deep learning; Apache Arrow adds overhead without solving the small-file problem. Option C is wrong because compressing CSV files reduces storage size but increases CPU load during decompression, often worsening I/O bottlenecks due to the many small files still requiring individual decompression. Option D is wrong because increasing vCPUs does not fix the fundamental I/O bottleneck caused by many small files; it may even exacerbate contention on shared storage without addressing the file access pattern.

Full explanation →

33

MCQeasy

A machine learning engineer is training a model using SageMaker's built-in XGBoost algorithm. The training job fails with an error indicating insufficient memory. Which parameter should be adjusted to reduce memory usage?

A.subsample

B.num_round

C.max_depth

D.colsample_bytree

AnswerC

Reducing max_depth decreases tree depth and memory usage.

Why this answer

Option C is correct because decreasing max_depth reduces the size of each tree, which lowers memory consumption during training. Option A (subsample) reduces the number of rows sampled per iteration but does not directly address tree size. Option D (colsample_bytree) reduces the number of features per tree but is less effective than max_depth.

Option B (num_round) controls the number of trees and does not directly reduce per-tree memory.

Full explanation →

34

MCQmedium

A team deploys a model with SageMaker and notices that the model returns inconsistent results during inference. They suspect a mismatch in feature transformation between the training pipeline and the inference pipeline. Which SageMaker feature can help compare the feature distributions?

A.Amazon SageMaker Model Monitor

B.Amazon SageMaker Autopilot

C.Amazon SageMaker Clarify

D.Amazon SageMaker Debugger

AnswerA

Model Monitor can compare inference data statistics against a baseline to detect drift.

Why this answer

Option A is correct because SageMaker Model Monitor can track feature distributions over time and detect drift against a baseline. Option D (Debugger) is for debugging training jobs. Option C (Clarify) is for explainability and bias detection.

Option B (Autopilot) is for automated model building.

Full explanation →

35

MCQeasy

A data engineer needs to prepare a large dataset for machine learning. The data is stored in an Amazon RDS MySQL database and needs to be transformed and moved to an S3 bucket in Parquet format for use with SageMaker. Which AWS service is most suitable for this extraction, transformation, and loading (ETL) task?

A.Use AWS Glue ETL jobs with PySpark to read from RDS, apply transformations, and write to S3 as Parquet.

B.Use Amazon Athena CTAS statements to copy data from RDS to S3.

C.Use SageMaker Data Wrangler to connect to RDS and export transformed data to S3.

D.Use Amazon EMR with Spark to read from RDS, transform, and write to S3.

AnswerA

Glue is purpose-built for this workload.

Why this answer

AWS Glue ETL jobs with PySpark are the most suitable service for this task because Glue is a fully managed, serverless ETL service that can natively connect to Amazon RDS MySQL via JDBC, apply transformations using PySpark, and write the output directly to S3 in Parquet format. This aligns perfectly with the requirement to extract, transform, and load a large dataset into a machine-learning-ready format without managing infrastructure.

Exam trap

The trap here is that candidates may confuse SageMaker Data Wrangler's ability to connect to RDS and export data with a full ETL capability, overlooking that it is an interactive tool for data preparation within SageMaker Studio rather than a serverless batch ETL service like AWS Glue.

How to eliminate wrong answers

Option B is wrong because Amazon Athena CTAS statements cannot read directly from Amazon RDS; Athena only queries data already in S3 or other data sources via federated queries, but CTAS itself requires the source to be in S3 or a cataloged table, not a live RDS database. Option C is wrong because SageMaker Data Wrangler is designed for interactive data preparation and feature engineering within SageMaker Studio, not for running serverless ETL jobs at scale; it can import data from RDS but lacks the native ability to schedule or run large-scale batch transformations and write to S3 as Parquet without additional infrastructure. Option D is wrong because while Amazon EMR with Spark can technically perform this task, it requires provisioning and managing a cluster, which adds operational overhead; AWS Glue is more suitable as a serverless, cost-effective alternative for this specific ETL workload without the need to manage EC2 instances or cluster lifecycle.

Full explanation →

36

MCQhard

A retail company has deployed a real-time recommendation model on a SageMaker endpoint. The model is trained daily using SageMaker Pipelines that process user interaction data from a large S3 bucket. Recently, the operations team noticed that the endpoint's predictions have become stale; users are seeing recommendations based on data from days ago. The pipeline runs successfully every day at 2 AM UTC, but the endpoint continues to serve the old model version. The team checks the pipeline and finds no errors. The model registry contains multiple model versions approved automatically. The endpoint is configured with production variants, but only one variant is active. The team suspects the issue is with the deployment step in the pipeline. They want to automatically deploy new model versions to the endpoint as soon as they are registered and approved. What should they do?

A.Set up the SageMaker Model Registry to trigger a Lambda function on approval that updates the endpoint using the new model version.

B.Configure the pipeline to stop automatic approval and require manual approval before deployment.

C.Modify the pipeline to run batch transforms on the new model and compare metrics, then update the endpoint.

D.Create a daily cron job that checks for new model versions and manually updates the endpoint configuration.

AnswerA

Event-driven deployment ensures immediate update on approval.

Why this answer

Option A is correct because an approval event from SageMaker Model Registry, delivered through Amazon EventBridge to a Lambda function, can update the endpoint as soon as a new model version is approved. Option B (manual approval) removes automation instead of adding it. Option C (batch transform comparison) evaluates the new model but does not by itself deploy it to the endpoint.

Option D (a daily cron job with manual endpoint updates) is not event-driven and still depends on a manual step, so new versions would not be deployed as soon as they are approved.
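An event-driven trigger of this kind can be sketched as an EventBridge event pattern plus a small filter that a Lambda handler might apply. The detail field names follow the model package state change event as commonly documented, but treat them, the function name, and the sample event as assumptions to verify. The endpoint update itself is not shown.

```python
# Event pattern an EventBridge rule could use to match approvals (assumed fields).
APPROVAL_EVENT_PATTERN = {
    "source": ["aws.sagemaker"],
    "detail-type": ["SageMaker Model Package State Change"],
    "detail": {"ModelApprovalStatus": ["Approved"]},
}


def approved_model_package_arn(event):
    """Return the model package ARN if the event is an approval, else None."""
    if event.get("source") != "aws.sagemaker":
        return None
    detail = event.get("detail", {})
    if detail.get("ModelApprovalStatus") != "Approved":
        return None
    return detail.get("ModelPackageArn")


# Hypothetical sample event for illustration only.
sample_event = {
    "source": "aws.sagemaker",
    "detail-type": "SageMaker Model Package State Change",
    "detail": {
        "ModelApprovalStatus": "Approved",
        "ModelPackageArn": "arn:aws:sagemaker:us-east-1:111122223333:model-package/recs/5",
    },
}
print(approved_model_package_arn(sample_event))
```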

Full explanation →

37

MCQmedium

A company uses Amazon SageMaker Pipelines for automated retraining. The pipeline includes a processing step that runs a Python script. The script uses the boto3 library to call an AWS service, but the calls are being throttled. What is the MOST effective way to address this within the pipeline?

A.Increase the instance count for the processing step to distribute the API calls.

B.Modify the Python script to include retry with exponential backoff when receiving throttling exceptions.

C.Request a service quota increase for the throttling limit.

D.Add a wait step in the pipeline before the processing step.

AnswerB

Standard best practice for handling API throttling.

Why this answer

Option B is correct because implementing retry logic with exponential backoff in the script is the standard approach to handling throttling. Option C is wrong because a service quota increase is a longer-term measure and is not always available. Option A is wrong because increasing the instance count does not reduce throttling and can add to the request rate.

Option D is wrong because a wait step only delays the start of the processing step; backoff must be implemented in the code that makes the calls.
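Retry with exponential backoff can be sketched in a few lines. `ThrottlingError` is a stand-in exception defined for this example. In a real boto3 script the equivalent would be catching `botocore.exceptions.ClientError` and checking for a throttling error code, or relying on the SDK's own retry configuration.

```python
import random
import time


class ThrottlingError(Exception):
    """Stand-in for a throttling response from a service (hypothetical)."""


def call_with_backoff(operation, max_attempts=5, base_delay=0.5, max_delay=8.0,
                      sleep=time.sleep):
    """Call `operation`, retrying on throttling with capped exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except ThrottlingError:
            if attempt == max_attempts - 1:
                raise  # out of attempts, surface the error
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, cap))  # jitter spreads out concurrent retries


# Demo: an operation that is throttled twice and then succeeds.
calls = {"count": 0}
delays = []


def flaky_operation():
    calls["count"] += 1
    if calls["count"] <= 2:
        raise ThrottlingError("rate exceeded")
    return "ok"


result = call_with_backoff(flaky_operation, sleep=delays.append)  # record, don't sleep
print(result, calls["count"])
```

Passing `sleep` as a parameter keeps the demo fast and makes the delay schedule easy to inspect.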

Full explanation →

38

MCQhard

A team uses SageMaker Pipelines for CI/CD. The training step fails due to insufficient memory. How can the team fix this without rewriting code?

A.Modify the training algorithm to use less memory

B.Reduce the batch size in the training script

C.Increase the instance type in the pipeline step configuration

D.Enable managed spot training

AnswerC

Changing instance type is a configuration change, not a code change.

Why this answer

Option C is correct because SageMaker Pipelines allows you to specify the instance type for each training step in the pipeline definition. By increasing the instance type (e.g., from ml.m5.large to ml.m5.xlarge or a memory-optimized instance like ml.r5.large), you allocate more memory to the training container without modifying the training script or algorithm. This directly resolves the out-of-memory error while preserving the existing code.

Exam trap

AWS often tests the distinction between infrastructure-level fixes (changing instance type in pipeline config) and code-level fixes (modifying script or algorithm), trapping candidates who think reducing batch size or enabling spot instances solves memory issues without considering the 'no code rewrite' constraint.

How to eliminate wrong answers

Option A is wrong because modifying the training algorithm to use less memory requires rewriting code, which violates the constraint of fixing the issue without rewriting code. Option B is wrong because reducing the batch size in the training script also requires modifying the training code, and while it may reduce memory usage, it does not meet the 'without rewriting code' condition. Option D is wrong because enabling managed spot training does not increase memory; it only reduces cost by using spare EC2 capacity and can cause interruptions, but it does not address insufficient memory for the training step.

Full explanation →

39

MCQeasy

A data science team needs to deploy a frequently updated PyTorch model for real-time inference. The model is retrained weekly and versioned using SageMaker Model Registry. Which deployment strategy minimizes downtime and allows easy rollback?

A.Deploy the model on an EC2 instance behind an Application Load Balancer and manually update the instance with the new model version.

B.Deploy the model using AWS Lambda with a container image and trigger via API Gateway.

C.Configure SageMaker endpoints with multiple production variants and use canary deployment to shift traffic gradually.

D.Use SageMaker hosting with a single production variant and update the endpoint with a new model configuration each week.

AnswerC

Canary deployment allows gradual traffic shift, minimizing downtime and enabling rollback.

Why this answer

Option C is correct because SageMaker endpoints with multiple production variants enable canary deployment, which shifts traffic gradually from the old model to the new one. This minimizes downtime by keeping both variants active during the transition and allows easy rollback by simply redirecting all traffic back to the previous variant if issues arise.

Exam trap

The trap here is that candidates often assume a single production variant with endpoint updates is sufficient, overlooking the downtime and rollback limitations, while the canary deployment pattern with multiple variants directly addresses the requirements for minimal downtime and easy rollback.

How to eliminate wrong answers

Option A is wrong because manually updating an EC2 instance behind an ALB introduces downtime during the update process and lacks automated rollback capabilities, making it unsuitable for a frequently updated model requiring minimal downtime. Option B is wrong because AWS Lambda has a maximum invocation duration of 15 minutes and is designed for stateless, short-lived functions, not for hosting real-time inference workloads that require persistent, low-latency serving. Option D is wrong because using a single production variant and updating the endpoint configuration each week requires a full endpoint update, which causes downtime during the deployment and does not support gradual traffic shifting or easy rollback without redeploying the previous version.

Full explanation →

40

Multi-Selecthard

A company operates multiple AWS accounts with SageMaker workloads. They need to implement governance and security controls for model monitoring and maintenance. Which THREE actions should they take to meet compliance requirements?

Select 3 answers

A.Deploy a SageMaker model registry in a centralized account.

B.Use AWS CloudTrail to log all API calls to SageMaker and S3.

C.Enable VPC Flow Logs for SageMaker notebooks.

D.Use IAM roles with cross-account trust policies for all SageMaker endpoints.

E.Use AWS Config rules to enforce encryption of model artifacts.

AnswersA, B, E

Correct. A central registry ensures model approval and version control across accounts.

Why this answer

CloudTrail logging, AWS Config rules for encryption, and a centralized model registry help enforce governance across accounts.

41

MCQeasy

A company uses an Amazon SageMaker endpoint for real-time inference. The security team requires that all traffic between the endpoint and the client application be encrypted in transit. Which configuration ensures this?

A.Deploy the endpoint in a VPC and use VPC Endpoints.

B.Use AWS Key Management Service (KMS) to encrypt the data in transit.

C.The endpoint is automatically served over HTTPS; no additional configuration is needed.

D.Attach an AWS Certificate Manager (ACM) certificate to the endpoint.

AnswerC

SageMaker endpoints use HTTPS by default.

Why this answer

Option C is correct because SageMaker endpoints are HTTPS-enabled by default, providing encryption in transit via TLS. Option A is wrong because a VPC does not provide encryption in transit by default; it provides network isolation. Option D is wrong because SageMaker does not directly use AWS Certificate Manager for endpoint encryption; it uses AWS-managed certificates.

Option B is wrong because AWS KMS is for encryption at rest, not in transit.

42

Multi-Selectmedium

A company is deploying a machine learning model using SageMaker hosting. They need to support multiple versions of the model for A/B testing. Which TWO actions are required to set up the A/B test? (Choose two.)

Select 2 answers

A.Enable shadow variants to capture traffic for the new model without affecting users

B.Set up a batch transform job to compare performance offline

C.Configure the endpoint to route a percentage of traffic to each variant using initial variant weight

D.Register both models in SageMaker Model Registry

E.Create an endpoint with two production variants, each serving a different model version

AnswersC, E

Traffic splitting is achieved via variant weights.

Why this answer

Option C is correct because SageMaker endpoints use `initial variant weight` to distribute traffic among production variants. By setting this weight, you can route a specific percentage of inference requests to each model version, enabling A/B testing without changing the endpoint configuration.
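
The weight-based split described above can be sketched in a few lines. This is a minimal illustration, not a deployment script: the model names, variant names, and instance type are hypothetical, and the helper functions are invented for the example. Only the dictionary keys follow the `ProductionVariants` structure of an endpoint configuration.

```python
def build_production_variants(model_weights):
    """Build a ProductionVariants list from {model_name: relative_weight}."""
    variants = []
    for index, (model_name, weight) in enumerate(model_weights.items()):
        variants.append({
            "VariantName": f"variant-{index}",      # hypothetical naming scheme
            "ModelName": model_name,
            "InstanceType": "ml.m5.large",          # assumed instance type
            "InitialInstanceCount": 1,
            "InitialVariantWeight": float(weight),
        })
    return variants


def traffic_shares(variants):
    """Each variant's share of traffic is its weight divided by the sum of all weights."""
    total = sum(v["InitialVariantWeight"] for v in variants)
    return {v["VariantName"]: v["InitialVariantWeight"] / total for v in variants}


# Hypothetical model names: 90% of requests to the current model, 10% to the candidate.
variants = build_production_variants({"churn-model-v1": 9, "churn-model-v2": 1})
print(traffic_shares(variants))
```

In practice the list would be passed as `ProductionVariants` when creating the endpoint configuration; the weights are relative, so 9 and 1 behave the same as 0.9 and 0.1.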

Exam trap

The trap here is that candidates confuse shadow variants (which are for passive monitoring) with production variants (which are for active traffic splitting), leading them to select Option A instead of understanding that A/B testing requires explicit traffic routing via variant weights.

43

MCQeasy

A data scientist is building a model to predict customer churn based on historical data. The dataset has 10 features and 100,000 records, and the target is binary. Which algorithm is most appropriate for this binary classification problem?

A.Principal component analysis

B.K-means clustering

C.Linear regression

D.Logistic regression

AnswerD

Logistic regression is designed for binary classification and handles large datasets efficiently.

Why this answer

Logistic regression is a standard algorithm for binary classification, providing probabilistic outputs and interpretability. Linear regression is for regression, K-means is for clustering, and PCA is for dimensionality reduction.

44

MCQhard

A healthcare company is deploying a model for predicting patient outcomes. The model must be deployed across multiple AWS accounts to meet compliance requirements. Each account has its own Amazon SageMaker endpoint. The company wants to centralize monitoring of model performance without exposing data across accounts. Which solution should the company use?

A.Establish VPC peering between accounts and call the endpoints from a central monitoring service.

B.Replicate the inference data to a central S3 bucket in the management account using cross-account replication, then run Model Monitor centrally.

C.Use SageMaker Model Monitor in each account and publish custom metrics to a central CloudWatch account using cross-account observability.

D.Create a shared SageMaker Model Registry across accounts and aggregate monitoring.

AnswerC

This allows centralized monitoring without moving data across accounts.

Why this answer

Option C is correct because it uses SageMaker Model Monitor in each account to detect data drift and model degradation locally, then publishes custom metrics to a central CloudWatch account via cross-account observability. This approach centralizes monitoring without moving raw inference data across accounts, satisfying the compliance requirement of not exposing data.

Exam trap

The trap here is confusing data replication (which exposes raw data) with metric aggregation (which exposes only statistical summaries), leading candidates to pick Option B despite its compliance violation.

How to eliminate wrong answers

Option A is wrong because VPC peering enables network connectivity but does not provide a mechanism to centralize monitoring metrics or avoid exposing inference data across accounts; it would require direct data transfer to a central service, violating compliance. Option B is wrong because replicating inference data to a central S3 bucket using cross-account replication exposes raw data across accounts, which directly violates the requirement to not expose data. Option D is wrong because a shared SageMaker Model Registry aggregates model metadata and versions, not real-time monitoring metrics or data drift detection; it does not provide centralized performance monitoring.

45

MCQhard

A machine learning engineer is setting up automated retraining for a model using SageMaker Pipelines. The pipeline should trigger when a data drift alert is received from Model Monitor. Which event source should the engineer use to initiate the pipeline?

A.Amazon CloudWatch Events (Amazon EventBridge) rule that captures Model Monitor outcome.

B.AWS Lambda function that polls CloudWatch logs.

C.S3 event notification on the monitoring output bucket.

D.SageMaker model monitor webhook.

AnswerA

Model Monitor publishes violation events to EventBridge, which can trigger a pipeline execution reliably.

Why this answer

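Option A is correct because SageMaker Model Monitor publishes drift results as Amazon CloudWatch Events (now Amazon EventBridge), and an EventBridge rule can start the pipeline. Option C is wrong because S3 events on the monitoring output are not directly linked to drift alerts. Option D is wrong because no Model Monitor webhook exists.

Option B is wrong because polling CloudWatch logs from a Lambda function is inefficient.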

46

MCQhard

A team is using SageMaker to run a large-scale distributed training job for a language model. They are using SageMaker's Pipe mode to stream data from S3 to reduce IO. They observe that the training throughput is lower than expected, and the CPU utilization is high while GPU utilization is low. The training script uses PyTorch's DataLoader with num_workers=0. The data preprocessing is minimal. Which change is most likely to improve GPU utilization?

A.Increase the number of data loading workers (num_workers).

B.Use a larger instance with more vCPUs.

C.Increase the number of GPUs per instance.

D.Switch from Pipe mode to File mode.

AnswerA

Correct: Increasing num_workers parallelizes data loading, reducing CPU bottleneck and improving GPU utilization.

Why this answer

Option A is correct because setting num_workers=0 forces the main process to load data, causing a CPU bottleneck. Increasing num_workers parallelizes data loading, reducing GPU idle time. Option C is wrong because adding GPUs does not address the data loading bottleneck.

Option B is wrong because more vCPUs without more workers does not help. Option D is wrong because switching to File mode would increase IO overhead, worsening the problem.

47

Multi-Selectmedium

A data scientist is training a deep learning model using SageMaker and wants to use distributed training across multiple GPUs to reduce training time. Which TWO actions should the scientist take to configure distributed training? (Select TWO.)

Select 2 answers

A.Reduce the number of epochs to match the number of GPUs

B.Use the SageMaker distributed data parallelism library

C.Manually split the training data into shards and upload to S3

D.Configure the SageMaker estimator with a distribution parameter

E.Set the instance count to 1 with a multi-GPU instance

AnswersB, D

The library automatically distributes data across GPUs.

Why this answer

The SageMaker distributed data parallelism library (option B) automatically partitions training data and synchronizes gradients across multiple GPUs, reducing training time without manual data splitting. Configuring the SageMaker estimator with a distribution parameter (option D) enables this library by specifying the distribution strategy (e.g., `smdistributed` with `dataparallel` enabled), which is required to activate distributed training.
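
As a rough sketch of how the two actions fit together, the distribution setting is just a nested dictionary handed to the estimator. The helper name and the default instance type below are assumptions for illustration; the resulting keyword arguments would be merged with the usual estimator arguments (entry point, role, framework version) when constructing a framework estimator such as the PyTorch one.

```python
def data_parallel_estimator_kwargs(instance_count, instance_type="ml.p4d.24xlarge"):
    """Return estimator keyword arguments that switch on SageMaker data parallelism.

    The instance type default is an assumption; the library only runs on
    specific multi-GPU instance types.
    """
    return {
        "instance_count": instance_count,
        "instance_type": instance_type,
        # Option D: the distribution parameter that activates option B's library.
        "distribution": {"smdistributed": {"dataparallel": {"enabled": True}}},
    }


kwargs = data_parallel_estimator_kwargs(instance_count=2)
print(kwargs["distribution"])
```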

Exam trap

The trap here is that candidates confuse single-instance multi-GPU training (option E) with true distributed training across multiple instances, or assume manual data sharding (option C) is required when SageMaker automates it.

48

MCQmedium

A machine learning engineer is configuring auto-scaling for a SageMaker real-time endpoint. The endpoint is expected to have steady traffic during business hours and low traffic at night. The engineer wants to minimize costs by scaling in during low traffic, but the model container has a long start-up time (about 5 minutes). Which scaling policy should the engineer use to prevent request drops during sudden traffic spikes?

A.Use a step scaling policy based on invocations per minute with a step that adds two instances at a time.

B.Use a target tracking scaling policy based on average invocations per minute with a warm-up of 300 seconds.

C.Use a scheduled scaling action to add instances before business hours and remove them after.

D.Use a simple scaling policy based on average CPU utilization with a cooldown period of 5 minutes.

AnswerB

Target tracking with a warm-up period ensures that newly launched instances are not included in the metric until they are ready, preventing traffic loss.

Why this answer

Option B is correct because target tracking scaling policies in SageMaker automatically adjust capacity to maintain a target metric value, and the warm-up time of 300 seconds accounts for the 5-minute container start-up latency. This prevents request drops during sudden traffic spikes by ensuring new instances are fully initialized before they receive traffic, while still allowing the endpoint to scale in during low traffic to minimize costs.
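
A target tracking policy for an endpoint variant might be assembled as below. This is a sketch under stated assumptions: the endpoint and variant names are hypothetical, the scale-in cooldown of 600 seconds is an arbitrary choice, and the 300-second warm-up from the question is mapped onto the `ScaleOutCooldown` field, which is the closest setting Application Auto Scaling exposes for this policy type.

```python
def target_tracking_policy(endpoint_name, variant_name, target_invocations, warmup_seconds):
    """Build arguments for an Application Auto Scaling target tracking policy."""
    return {
        "PolicyName": f"{endpoint_name}-{variant_name}-invocations",  # hypothetical name
        "ServiceNamespace": "sagemaker",
        "ResourceId": f"endpoint/{endpoint_name}/variant/{variant_name}",
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            "TargetValue": float(target_invocations),
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance",
            },
            # Assumption: the question's warm-up is expressed as the scale-out cooldown.
            "ScaleOutCooldown": warmup_seconds,
            "ScaleInCooldown": 600,  # assumed value; scale in more cautiously
        },
    }


policy = target_tracking_policy("demand-endpoint", "AllTraffic", 1000, 300)
print(policy["ResourceId"])
```

The dictionary could be unpacked into the `put_scaling_policy` call of the Application Auto Scaling client once the variant has been registered as a scalable target.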

Exam trap

The trap here is that candidates often choose a step scaling policy (Option A) because they think adding multiple instances at once handles spikes faster, but they overlook the critical need for a warm-up period to account for container start-up latency, which target tracking with warm-up explicitly addresses.

How to eliminate wrong answers

Option A is wrong because step scaling policies add instances in fixed increments (e.g., two at a time) without considering the long start-up time; this can lead to over-provisioning or under-provisioning during sudden spikes, and the lack of a warm-up period means new instances may not be ready to handle incoming requests, causing drops. Option C is wrong because scheduled scaling actions only handle predictable traffic patterns (e.g., business hours) and cannot react to sudden, unplanned traffic spikes, leaving the endpoint vulnerable to request drops. Option D is wrong because simple scaling policies based on average CPU utilization with a cooldown period of 5 minutes do not account for the model container's start-up latency; the cooldown prevents further scaling actions during the start-up period, but the policy itself cannot pre-warm instances, so traffic spikes during the cooldown can still cause request drops.

49

Multi-Selecthard

A company is preparing a large dataset for a SageMaker built-in XGBoost model. The dataset has missing values in both numeric and categorical features, and some categorical features have high cardinality. Which THREE data preparation steps should the company take to optimize model performance? (Choose three.)

Select 3 answers

A.Remove any rows with outlier values.

B.Split the data into training, validation, and test sets before any imputation.

C.Impute missing numeric values with median or mean.

D.For categorical features, use one-hot encoding for low cardinality and target encoding for high cardinality.

E.Apply target encoding to all categorical features regardless of cardinality.

AnswersB, C, D

Splitting first prevents data leakage from imputation statistics.

Why this answer

Option B is correct because splitting the data into training, validation, and test sets before any imputation prevents data leakage. If imputation statistics (e.g., mean, median) were computed on the full dataset, information from the validation and test sets would influence the training data, leading to overly optimistic performance estimates and poor generalization to new data.
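
The split-first ordering can be illustrated with a small standard-library sketch. The function name and sample values are made up, and for brevity it uses a single held-out split rather than separate validation and test sets; the point is only that the median comes from the training rows and is then reused for the held-out rows.

```python
import random
import statistics


def split_then_impute(values, train_fraction=0.8, seed=0):
    """Split first, then fill missing values (None) using the training median only."""
    rows = list(values)
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * train_fraction)
    train, held_out = rows[:cut], rows[cut:]

    # The imputation statistic never sees the held-out rows.
    train_median = statistics.median(v for v in train if v is not None)

    def fill(part):
        return [train_median if v is None else v for v in part]

    return fill(train), fill(held_out), train_median


train, held_out, median = split_then_impute(
    [1.0, 2.0, None, 4.0, 100.0, None, 3.0, 5.0, 6.0, 7.0]
)
print(len(train), len(held_out), median)
```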

Exam trap

AWS often tests the misconception that all data cleaning (including imputation) should be done on the full dataset before splitting, but the correct order is to split first to preserve the independence of the test set and avoid data leakage.

50

Multi-Selecthard

A team is preparing text data for a natural language processing (NLP) model. They have a corpus of customer reviews. Which THREE preprocessing steps are essential to reduce noise and improve model performance?

Select 3 answers

A.Apply one-hot encoding to each word

B.Remove punctuation and special characters

C.Compute TF-IDF vectors

D.Perform stemming or lemmatization

E.Convert all text to lowercase

AnswersB, D, E

Removes noise that does not contribute to meaning.

Why this answer

Option B is correct because punctuation and special characters (e.g., commas, exclamation marks) introduce irrelevant noise that does not carry semantic meaning for most NLP models. Removing them reduces vocabulary size and prevents the model from treating 'hello!' and 'hello' as distinct tokens, which improves generalization and reduces overfitting.
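
The three cleaning steps can be sketched as follows. The suffix stripper is a deliberately naive stand-in for a real stemmer or lemmatizer, and the regular expression assumes English text in ASCII; both are simplifications for illustration.

```python
import re

# Toy suffix list: a stand-in for a real stemming or lemmatization library.
_SUFFIXES = ("ing", "ed", "s")


def naive_stem(token):
    """Strip one known suffix if at least three characters remain."""
    for suffix in _SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(review):
    """Lowercase, drop punctuation and special characters, then stem each token."""
    lowered = review.lower()
    cleaned = re.sub(r"[^a-z0-9\s]", " ", lowered)
    return [naive_stem(token) for token in cleaned.split()]


print(preprocess("Hello! HELLO, hello"))
```

After cleaning, 'Hello!', 'HELLO,' and 'hello' all map to the same token, which is the vocabulary reduction the explanation describes.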

Exam trap

AWS often tests the distinction between preprocessing steps (cleaning) and feature engineering steps (vectorization), so the trap here is that candidates mistake TF-IDF or one-hot encoding as essential preprocessing for noise reduction when they are actually downstream representation techniques.

51

Multi-Selecteasy

A company stores training data in Amazon S3 and uses Amazon SageMaker for model training. They need to ensure data is encrypted at rest. Which THREE encryption options are supported by SageMaker for data stored in S3? (Choose THREE.)

Select 3 answers

A.SSE-C (customer-provided keys)

B.Client-side encryption

C.SSE-KMS (KMS-managed keys)

D.Amazon CloudFront encryption

E.SSE-S3 (S3-managed keys)

AnswersA, C, E

SageMaker supports SSE-C, but the user must provide the key during training.

Why this answer

Options A, C, and E are correct because SageMaker supports all three Amazon S3 server-side encryption options. Option B (client-side encryption) is not supported for automatic decryption by SageMaker. Option D is for content delivery, not storage encryption.

52

MCQeasy

A machine learning engineer is deploying a model using AWS Lambda for inference. The model is a small scikit-learn classifier with a size of 50 MB. The Lambda function is invoked by an API Gateway REST API. The engineer notices that cold starts are causing high latency. Which action would most effectively reduce cold start latency without increasing costs significantly?

A.Store the model in Amazon EFS and load it at runtime.

B.Increase the Lambda function memory to the maximum of 10,240 MB.

C.Configure provisioned concurrency for the Lambda function.

D.Package the model in a container image and deploy using Lambda container support.

AnswerC

Provisioned concurrency keeps instances initialized and ready to respond immediately.

Why this answer

Option C is correct because provisioned concurrency pre-initializes the Lambda execution environment, keeping it warm and ready to handle requests immediately. This eliminates the cold start overhead for the first request, directly reducing latency without incurring the ongoing costs of a larger memory allocation or the complexity of EFS/container management.

Exam trap

The trap here is that candidates often confuse 'reducing cold start latency' with 'reducing compute time' or 'improving model loading speed', leading them to choose options like increasing memory or using EFS, which do not address the fundamental issue of environment initialization.

How to eliminate wrong answers

Option A is wrong because Amazon EFS adds network latency for each invocation to load the model, which can actually increase cold start time and does not address the root cause of cold starts. Option B is wrong because increasing memory to the maximum (10,240 MB) increases cost significantly (Lambda pricing scales linearly with memory) and does not eliminate cold starts; it only reduces compute time for the same workload. Option D is wrong because deploying as a container image does not inherently reduce cold start latency; container images can actually increase cold start time due to image pull overhead unless combined with provisioned concurrency.

53

Multi-Selecteasy

A data engineer needs to provide the data science team with access to various data sources for machine learning. The team uses Amazon SageMaker Studio. Which TWO data sources can be accessed directly from SageMaker Studio notebooks without additional infrastructure? (Choose two.)

Select 2 answers

A.Amazon S3.

B.Amazon Redshift.

C.Amazon DynamoDB.

D.Amazon RDS (MySQL).

E.Amazon Athena.

AnswersA, E

S3 is natively integrated with SageMaker.

Why this answer

Amazon SageMaker Studio notebooks have a built-in SageMaker SDK that can directly read from and write to Amazon S3 using the `s3fs` filesystem or the SageMaker `s3_utils` module. This integration requires no additional infrastructure because S3 is the default storage backend for SageMaker, and the notebook environment is pre-configured with the necessary IAM roles and boto3 libraries to access S3 buckets directly.

Exam trap

The trap here is that candidates often assume any AWS database service (like Redshift, DynamoDB, or RDS) can be accessed 'directly' from SageMaker Studio, but the exam specifically tests the distinction between services that require additional infrastructure (VPC, endpoints, or client libraries) and those that are natively integrated without extra setup.

54

MCQeasy

A machine learning team needs to deploy a model that was built using scikit-learn. They want to use SageMaker for hosting. Which approach should they take?

A.Create a Jupyter notebook that loads the model and runs predictions on the SageMaker notebook instance

B.Create a custom Docker container with scikit-learn and deploy it on SageMaker

C.Launch a SageMaker training job with the model and use the training instance as an endpoint

D.Package the model artifacts and use the SageMaker built-in scikit-learn container for inference

AnswerD

Built-in container supports scikit-learn models; simply point to model artifacts.

Why this answer

Option D is correct because SageMaker provides a pre-built, optimized Docker container for scikit-learn that supports inference. By packaging the model artifacts (e.g., a joblib or pickle file) and deploying them using the built-in container, the team avoids the overhead of custom container creation while ensuring compatibility with SageMaker's hosting infrastructure, including automatic scaling and load balancing.
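
Packaging the artifacts usually means producing a gzipped tar archive that is then uploaded to S3. The sketch below shows only that packaging step with the standard library; the function name, the member name `model.pkl`, and the dictionary standing in for a fitted estimator are all assumptions made so the example runs without scikit-learn installed.

```python
import io
import pickle
import tarfile


def package_model(model, archive_path, member_name="model.pkl"):
    """Serialize a model object and wrap it in a gzipped tar archive."""
    payload = pickle.dumps(model)
    info = tarfile.TarInfo(name=member_name)
    info.size = len(payload)
    with tarfile.open(archive_path, "w:gz") as archive:
        archive.addfile(info, io.BytesIO(payload))
    return archive_path


# Stand-in for a fitted scikit-learn estimator (hypothetical contents).
placeholder_model = {"coef": [0.4, -1.2], "intercept": 0.1}
package_model(placeholder_model, "model.tar.gz")
```

With the prebuilt container, an inference script is typically supplied alongside the archive so the container knows how to load the serialized file.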

Exam trap

The trap here is that candidates often overcomplicate the solution by assuming a custom Docker container is always required for scikit-learn, overlooking the fact that SageMaker provides a fully managed, built-in container specifically for this framework.

How to eliminate wrong answers

Option A is wrong because a Jupyter notebook on a notebook instance is designed for interactive development and testing, not for production hosting; it lacks the necessary endpoint management, scaling, and availability features of SageMaker hosting. Option B is wrong because while a custom Docker container is a valid approach, it is unnecessary when SageMaker provides a built-in scikit-learn container that already includes the required dependencies and is optimized for inference, making this option over-engineered and more complex than needed. Option C is wrong because a SageMaker training job is ephemeral and intended for model training, not for serving inference requests; using a training instance as an endpoint is not supported, as training instances lack the persistent endpoint infrastructure (e.g., HTTPS endpoints, auto-scaling groups) required for production hosting.

55

MCQeasy

A machine learning engineer at a retail company is monitoring a production model that predicts inventory demand. The model's prediction accuracy has dropped significantly over the past week. The engineer checks the model's input data and notices a new product category was introduced with a different distribution. Which concept is most likely causing the performance degradation?

A.Concept drift

B.Covariate shift

C.Data leakage

D.Model decay

AnswerB

Covariate shift occurs when the distribution of input features changes over time.

Why this answer

B is correct because covariate shift occurs when the distribution of the input features changes while the relationship between features and the target remains the same. In this scenario, the introduction of a new product category with a different distribution alters the input data distribution, causing the model to encounter unseen patterns and degrade in prediction accuracy.
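
One simple way to spot this kind of shift is to compare how often each category appears in the training data versus recent production inputs. The sketch below is illustrative only: the function name and category values are invented, and a real monitor would also compare numeric feature distributions.

```python
from collections import Counter


def category_shift(train_categories, live_categories):
    """Compare category frequencies between training data and live inputs.

    Returns a {category: (train_share, live_share)} report and the share of
    live rows whose category never appeared in training.
    """
    train_freq = Counter(train_categories)
    live_freq = Counter(live_categories)
    n_train, n_live = len(train_categories), len(live_categories)

    report = {
        category: (train_freq[category] / n_train, live_freq[category] / n_live)
        for category in set(train_freq) | set(live_freq)
    }
    unseen = sum(count for cat, count in live_freq.items() if cat not in train_freq)
    return report, unseen / n_live


train = ["toys"] * 5 + ["books"] * 5
live = ["toys"] * 4 + ["books"] * 4 + ["garden"] * 2   # "garden" is the new category
report, unseen_share = category_shift(train, live)
print(unseen_share)
```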

Exam trap

AWS often tests the distinction between covariate shift and concept drift, and the trap here is that candidates confuse a change in input distribution (covariate shift) with a change in the relationship between inputs and outputs (concept drift), leading them to incorrectly select concept drift.

How to eliminate wrong answers

Option A is wrong because concept drift refers to a change in the underlying relationship between input features and the target variable over time, not a change in the input distribution itself. Option C is wrong because data leakage involves the accidental inclusion of future information or target data in the training features, which is not indicated by a new product category with a different distribution. Option D is wrong because model decay is a general term for performance degradation over time, but it does not specifically describe the cause as a shift in input distribution; covariate shift is the precise technical concept here.

56

MCQhard

An ML team trained a model using SageMaker and stored the model artifacts in S3 with server-side encryption using AWS KMS (SSE-KMS). They need to deploy the model to a SageMaker endpoint that uses a different KMS key for inference data encryption. What must they do to ensure the endpoint can decrypt the model artifacts?

A.Provide the same KMS key for both model artifacts and inference data.

B.Use a customer-managed key (CMK) with the same key material.

C.Grant the SageMaker execution role access to both KMS keys.

D.Configure the endpoint to use SSE-S3 instead of SSE-KMS.

AnswerC

The role needs decrypt on the artifact key and encrypt/decrypt on the inference key.

Why this answer

The SageMaker execution role must have kms:Decrypt permission on the KMS key used for model artifacts, and also kms:Encrypt and kms:Decrypt on the key for inference data. Providing the same key is not required.
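
The permissions named above could be expressed as an IAM policy document along these lines. The key ARNs and statement IDs are placeholders, and a real role may need additional KMS actions depending on how the keys are used; this sketch covers only the actions the explanation lists.

```python
import json


def execution_role_kms_policy(artifact_key_arn, inference_key_arn):
    """Build an IAM policy granting the execution role access to both KMS keys."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DecryptModelArtifacts",
                "Effect": "Allow",
                "Action": ["kms:Decrypt"],
                "Resource": artifact_key_arn,
            },
            {
                "Sid": "UseInferenceDataKey",
                "Effect": "Allow",
                "Action": ["kms:Encrypt", "kms:Decrypt"],
                "Resource": inference_key_arn,
            },
        ],
    }


# Placeholder ARNs for illustration only.
policy = execution_role_kms_policy(
    "arn:aws:kms:us-east-1:111122223333:key/artifact-key-id",
    "arn:aws:kms:us-east-1:111122223333:key/inference-key-id",
)
print(json.dumps(policy, indent=2))
```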

57

MCQeasy

A data scientist is training a binary classification model using a dataset that has a severe class imbalance (90% negative, 10% positive). Which technique should be used to address the imbalance during model training?

A.Use a larger batch size

B.Use L2 regularization

C.Apply random oversampling of the minority class

D.Increase the learning rate

AnswerC

Random oversampling balances the class distribution by replicating minority class samples.

Why this answer

Random oversampling of the minority class (Option C) directly addresses class imbalance by duplicating examples from the positive class, which balances the training distribution and keeps the model from becoming biased toward the majority class. It mitigates the skewed gradient updates that occur when the minority class is underrepresented, which typically improves recall for the positive class in binary classification tasks.
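
A minimal standard-library sketch of random oversampling is shown below. The function name and the toy data are assumptions; the resampling would normally be applied to the training split only, so that evaluation data keeps its original class balance.

```python
import random


def random_oversample(features, labels, minority_label=1, seed=0):
    """Duplicate randomly chosen minority rows until both classes are the same size."""
    rng = random.Random(seed)
    rows = list(zip(features, labels))
    minority = [row for row in rows if row[1] == minority_label]
    majority = [row for row in rows if row[1] != minority_label]

    # Sample with replacement to make up the difference between the classes.
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    balanced = majority + minority + extra
    rng.shuffle(balanced)

    xs, ys = zip(*balanced)
    return list(xs), list(ys)


# Toy dataset mirroring the 90% negative / 10% positive split in the question.
X = list(range(100))
y = [0] * 90 + [1] * 10
X_bal, y_bal = random_oversample(X, y)
print(y_bal.count(0), y_bal.count(1))
```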

Exam trap

AWS often tests the misconception that hyperparameter tuning (like batch size or learning rate) can fix data imbalance, when in fact only data-level or algorithm-level techniques (e.g., oversampling, undersampling, or cost-sensitive learning) directly address the skewed class distribution.

How to eliminate wrong answers

Option A is wrong because using a larger batch size does not correct class imbalance; it may even exacerbate the issue by making each batch more likely to contain only majority-class samples, reducing the model's exposure to minority examples. Option B is wrong because L2 regularization is a technique to prevent overfitting by penalizing large weights, but it has no effect on the class distribution or the imbalance between positive and negative samples. Option D is wrong because increasing the learning rate can cause unstable training or divergence, and it does not address the underlying data imbalance; it may lead to the model ignoring minority class patterns altogether.

58

MCQmedium

A team is using Amazon SageMaker for feature engineering. They have a dataset with a column 'TransactionDate' in string format (e.g., '2023-01-15 10:30:00'). They need to create features: year, month, day, hour, and day_of_week. What is the most efficient way to do this in a SageMaker processing job?

A.Use pandas datetime functions and then split

B.Use SageMaker built-in first party algorithms

C.Use AWS Glue for transformation

D.Use SQL query in Athena on S3 data

AnswerA

Pandas provides built-in datetime accessors for extracting components efficiently.

Why this answer

Option A is correct because using pandas datetime functions within a SageMaker processing job is the most efficient approach for this task. SageMaker processing jobs run custom Python scripts, and pandas provides vectorized operations (e.g., `pd.to_datetime()`, `.dt.year`, `.dt.month`, `.dt.day`, `.dt.hour`, `.dt.dayofweek`) that parse the string column and extract all required features in a single pass without external dependencies or data movement.
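
The accessors named above might be used like this inside a processing script. The function name and the sample rows are illustrative, and the explicit format string assumes every value matches the example shown in the question.

```python
import pandas as pd


def add_datetime_features(df, column="TransactionDate"):
    """Parse a string timestamp column and add calendar features as new columns."""
    parsed = pd.to_datetime(df[column], format="%Y-%m-%d %H:%M:%S")
    out = df.copy()
    out["year"] = parsed.dt.year
    out["month"] = parsed.dt.month
    out["day"] = parsed.dt.day
    out["hour"] = parsed.dt.hour
    out["day_of_week"] = parsed.dt.dayofweek  # Monday=0 ... Sunday=6
    return out


frame = pd.DataFrame({"TransactionDate": ["2023-01-15 10:30:00", "2023-03-01 23:05:10"]})
features = add_datetime_features(frame)
print(features)
```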

Exam trap

AWS often tests the misconception that SageMaker built-in algorithms can handle feature engineering, but they are strictly for training and inference, not data preprocessing — the trap here is assuming 'first-party algorithms' include data transformation capabilities.

How to eliminate wrong answers

Option B is wrong because SageMaker built-in first-party algorithms (e.g., XGBoost, Linear Learner) are designed for model training, not for feature engineering or data transformation tasks like datetime parsing. Option C is wrong because AWS Glue is an ETL service that introduces additional overhead (e.g., Spark cluster startup, schema inference) and is less efficient for a simple in-memory pandas operation within a SageMaker processing job. Option D is wrong because using SQL in Athena on S3 data requires querying the raw data from S3, which incurs scan costs and latency, and Athena's SQL functions for datetime extraction (e.g., `EXTRACT`) are less flexible and slower than pandas for this specific transformation.

59

Multi-Selecteasy

A company wants to deploy a model on SageMaker serverless inference. Which TWO of the following are limitations of serverless endpoints compared to real-time endpoints? (Choose two.)

Select 2 answers

A.Cold starts can cause increased latency for infrequent requests

B.Cannot deploy multiple containers in the same endpoint

C.No support for GPU instances

D.Maximum memory configuration is 6 GB

E.No automatic scaling – must be configured manually

AnswersC, D

Serverless endpoints only support CPU.

Why this answer

Option C is correct because SageMaker serverless inference does not support GPU instances; it only runs on CPU-based instances. This is a fundamental limitation for workloads requiring GPU acceleration, such as deep learning models. In contrast, real-time endpoints support both CPU and GPU instance types.
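
For context, a serverless variant is configured with a memory size and a concurrency limit instead of an instance type, which is where both limitations show up. The helper below is a sketch: the function name, variant name, and model name are hypothetical, and the allowed memory values reflect the 1 GB steps up to the 6 GB cap mentioned in the options.

```python
# Memory sizes in MB, in 1 GB steps up to the 6 GB cap noted in option D.
ALLOWED_MEMORY_MB = (1024, 2048, 3072, 4096, 5120, 6144)


def serverless_variant(model_name, memory_mb, max_concurrency):
    """Build a production variant that uses ServerlessConfig instead of instances."""
    if memory_mb not in ALLOWED_MEMORY_MB:
        raise ValueError(f"memory_mb must be one of {ALLOWED_MEMORY_MB}")
    return {
        "VariantName": "AllTraffic",        # hypothetical variant name
        "ModelName": model_name,
        # No InstanceType here: there is no way to request a GPU instance.
        "ServerlessConfig": {
            "MemorySizeInMB": memory_mb,
            "MaxConcurrency": max_concurrency,
        },
    }


variant = serverless_variant("churn-model", memory_mb=4096, max_concurrency=5)
print(variant["ServerlessConfig"])
```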

Exam trap

The trap here is that candidates may confuse cold starts (option A) as a limitation unique to serverless endpoints, but the question asks for limitations compared to real-time endpoints, and cold starts are inherent to serverless, not a comparative limitation; the two correct answers are the specific technical constraints of no GPU support and the 6 GB memory cap.

60

MCQhard

A company uses Amazon SageMaker Ground Truth to create a labeled dataset. They want to monitor the accuracy of human labelers during the labeling process. Which metric should they track?

A.Labeling job cost

B.Number of tasks completed

C.Accuracy against blinded ground truth

D.Task acceptance rate

AnswerC

Ground Truth inserts known ground truth tasks to audit labelers; tracking accuracy on these tasks measures labeler performance.

Why this answer

Option C is correct because tracking accuracy against blinded ground truth (known as audit tasks) is the standard way to measure labeler performance. Options A and B are operational metrics not directly measuring accuracy. Option D is not directly accuracy.

61

MCQeasy

A data science team deploys a PyTorch model on Amazon SageMaker for real-time inference. The model requires GPU for low latency. Which instance type is MOST cost-effective while meeting the GPU requirement?

A.ml.m5.2xlarge

B.ml.p4d.24xlarge

C.ml.p3.2xlarge

D.ml.c5.2xlarge

AnswerC

ml.p3.2xlarge provides a GPU at a cost-effective price point.

Why this answer

Option C (ml.p3.2xlarge) is correct because it provides a GPU (NVIDIA V100) necessary for low-latency PyTorch inference on SageMaker, while being the most cost-effective among GPU options. The ml.p3.2xlarge offers a single GPU with sufficient compute for many real-time inference workloads, avoiding the higher cost of larger instances like ml.p4d.24xlarge.

Exam trap

The trap here is that candidates may assume any GPU instance is equally cost-effective, overlooking that ml.p4d.24xlarge is overprovisioned for typical inference, while CPU-only instances like ml.m5 and ml.c5 are tempting but fail the explicit GPU requirement.

How to eliminate wrong answers

Option A (ml.m5.2xlarge) is wrong because it is a general-purpose CPU instance with no GPU, failing to meet the GPU requirement for low-latency PyTorch inference. Option B (ml.p4d.24xlarge) is wrong because, while it provides powerful GPUs (NVIDIA A100), it is significantly more expensive than necessary for typical real-time inference, making it not the most cost-effective choice. Option D (ml.c5.2xlarge) is wrong because it is a compute-optimized CPU instance with no GPU, which cannot satisfy the GPU requirement for low-latency inference.

62

MCQeasy

A machine learning engineer wants to encrypt model artifacts stored in Amazon S3. The artifacts are created and used by SageMaker training jobs and endpoints. What is the simplest way to ensure encryption at rest?

A.Create an S3 bucket with default encryption using SSE-S3 and allow SageMaker access.

B.Use SageMaker's default encryption with an AWS managed key.

C.Enable S3 bucket versioning and MFA delete.

D.Use a custom KMS key and grant SageMaker permission to use it.

AnswerA

SSE-S3 provides encryption at rest with no additional configuration, and SageMaker can read/write objects without any extra setup.

Why this answer

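Option A is correct because SSE-S3 is the simplest encryption method and works seamlessly with SageMaker. Option B is wrong because SageMaker does not have a default encryption setting for artifacts. Option C enables versioning and MFA delete, which protect against accidental deletion but do not encrypt anything. Option D works, but creating a custom KMS key and granting SageMaker permission to use it is not the simplest approach.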

63

MCQhard

A data engineer is using Amazon SageMaker Processing to run a data preprocessing script on a dataset with 500 million rows. The script runs out of memory on a single ml.r5.24xlarge instance. The engineer needs to modify the processing job to handle the dataset size. Which approach is most cost-effective and scalable?

A.Configure the Processing job with multiple instances and use ShardedByS3Key for data splitting.

B.Write the script to process data in chunks and write intermediate results to local ephemeral storage.

C.Increase the instance type to a larger one like ml.p3dn.24xlarge with more memory.

D.Reduce the number of instances to one and increase the volume size for swap space.

AnswerA

This distributes the data across instances, leveraging parallel processing and reducing memory per instance.

Why this answer

Option A is correct because SageMaker Processing with ShardedByS3Key splits the input dataset by S3 object boundaries across multiple instances, allowing distributed processing of the 500 million rows without exceeding memory on any single instance. This approach is cost-effective as it uses multiple smaller instances (e.g., ml.r5.xlarge) rather than a single oversized instance, and scales linearly with data size.
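
As a rough sketch of what option A looks like in practice, the fragment below builds the input and cluster portions of a `CreateProcessingJob` request as plain Python dicts. The bucket name, prefix, instance type, instance count, and object count are hypothetical placeholders; only the field names and the `ShardedByS3Key` value come from the SageMaker API.

```python
# Hypothetical values: bucket, prefix, instance type, and counts are placeholders.
INSTANCE_COUNT = 8

processing_inputs = [
    {
        "InputName": "raw-data",
        "S3Input": {
            "S3Uri": "s3://example-bucket/raw/",  # assumed location of the input objects
            "LocalPath": "/opt/ml/processing/input",
            "S3DataType": "S3Prefix",
            "S3InputMode": "File",
            # Each instance receives a subset of the objects under the prefix
            # instead of a full copy (the default is "FullyReplicated").
            "S3DataDistributionType": "ShardedByS3Key",
        },
    }
]

processing_resources = {
    "ClusterConfig": {
        "InstanceCount": INSTANCE_COUNT,
        "InstanceType": "ml.r5.xlarge",
        "VolumeSizeInGB": 100,
    }
}


def objects_per_instance(total_objects: int, instance_count: int) -> float:
    """Approximate number of S3 objects each instance sees when sharding by key."""
    return total_objects / instance_count


print(objects_per_instance(4000, INSTANCE_COUNT))
```

Sharding works on whole S3 objects, so it only helps when the dataset is stored as many files; a single very large file would land on one instance.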

Exam trap

AWS often tests the misconception that increasing instance size or using swap space is the primary solution for memory issues, whereas the correct approach is to distribute the workload horizontally using SageMaker's built-in data sharding feature.

How to eliminate wrong answers

Option B is wrong because writing intermediate results to local ephemeral storage does not solve the out-of-memory issue; the script still loads the entire dataset into memory before chunking, and local storage is limited and not designed for large-scale intermediate data. Option C is wrong because ml.p3dn.24xlarge has the same 768 GiB of host memory as ml.r5.24xlarge, so it adds no headroom, and its GPUs add cost without helping memory-bound preprocessing; this approach is neither cost-effective nor scalable. Option D is wrong because reducing to a single instance and increasing volume size for swap space relies on disk-based swapping, which is orders of magnitude slower than RAM and will cause severe performance degradation or job failure due to I/O bottlenecks.

64

MCQhard

Refer to the exhibit. A SageMaker Processing job configured as above fails with a timeout error. The input data is 100 GB of CSV files. The processing script performs standard data cleaning operations. What is the most likely cause?

A.The processing job does not have enough memory for the data volume

B.The container entrypoint is missing the full path to the script

C.The S3Input S3CompressionType is set to "None" but the file is compressed

D.The IAM role does not have permission to write to the output bucket

AnswerA

ml.m5.large has 8 GB memory; 100 GB data likely causes memory exhaustion and slow disk swapping.

Why this answer

Option A is correct because the SageMaker Processing job is configured with a single `ml.m5.large` instance, which has 8 GiB of memory. The input data is 100 GB of CSV files, and the processing script performs standard data cleaning operations that typically load the entire dataset into memory (e.g., using pandas). With only 8 GiB of RAM, the instance cannot hold 100 GB of data, causing the job to run out of memory and eventually fail with a timeout error as the OS kills the process or the job hangs.

Exam trap

The trap here is that candidates may overlook the memory-to-data ratio and assume a timeout error always indicates a network or permission issue, rather than recognizing that an undersized instance with insufficient RAM for the dataset volume causes the job to stall and eventually time out.

How to eliminate wrong answers

Option B is wrong because if the container entrypoint were missing the full path to the script, the job would fail immediately with a 'No such file or directory' error, not a timeout error. Option C is wrong because `S3CompressionType` set to 'None' means the input files are not compressed; if the files were actually compressed, the job would fail with a decompression error, not a timeout. Option D is wrong because if the IAM role lacked write permission to the output bucket, the job would fail with an access denied error during the output write phase, not a timeout error.

65

MCQmedium

A company is using AWS Glue to prepare data for a machine learning pipeline. The source data is in an Amazon S3 bucket in CSV format. The data scientist wants to convert the data to Parquet format and partition it by date. Which AWS Glue feature should be used to optimize the data for query performance and reduce storage costs?

A.Use Amazon Athena to convert the data to JSON format and store it in S3.

B.Use AWS Glue DynamicFrame to repartition the data and write it as Parquet.

C.Use AWS Glue to convert the data to Apache Hive format.

D.Use Apache Spark DataFrame to write the data as CSV with Snappy compression.

AnswerB

DynamicFrame supports efficient partitioning and columnar format conversion.

Why this answer

Option B is correct because AWS Glue DynamicFrames can write data directly in columnar formats like Parquet, which improves query performance through predicate pushdown and column pruning and reduces storage costs through efficient encoding and compression. The DynamicFrame's `repartition()` method controls the number of output files, while the `partitionKeys` write option lays the output out by date so that queries can skip irrelevant partitions. Writing Parquet directly from Glue also avoids intermediate conversions, making it the most efficient choice for this task.

Exam trap

The trap here is that candidates confuse 'file format' with 'query engine' (e.g., Hive) or choose a format like JSON that is human-readable but inefficient for analytics, missing that Parquet is the industry standard for performance and cost in data lakes.

How to eliminate wrong answers

Option A is wrong because converting to JSON format would increase storage costs and degrade query performance compared to Parquet, as JSON is a verbose, row-based format with no built-in compression or columnar optimization. Option C is wrong because Apache Hive format is not a specific file format; Hive is a query engine that can read various formats, and the question asks for a format conversion, not a query engine. Option D is wrong because writing as CSV with Snappy compression still results in a row-based format that lacks the columnar storage benefits of Parquet, such as predicate pushdown and efficient compression, and Snappy compression on CSV does not match Parquet's storage efficiency.

66

MCQeasy

A team uses SageMaker Pipelines to automate model retraining. After a successful pipeline run, they want to register the new model version in the SageMaker Model Registry so that it can be reviewed for approval. Which step type should they add to the pipeline?

A.RegisterModelStep

B.ConditionStep

C.TransformStep

D.TrainingStep

AnswerA

RegisterModelStep is specifically for registering a model version in the Model Registry.

Why this answer

Option A is correct because RegisterModelStep registers a model version in the SageMaker Model Registry, where it can be reviewed for approval. Option B (ConditionStep) is for branching. Option C (TransformStep) is for batch inference. Option D (TrainingStep) only trains.

67

MCQmedium

A gaming company uses a SageMaker endpoint for real-time player churn prediction. The model is updated weekly. After a recent retraining, the team notices that the endpoint's predicted probabilities for churn have shifted dramatically: the average predicted probability dropped from 0.3 to 0.05. The team suspects concept drift (the relationship between features and target changed) rather than data drift. They have SageMaker Model Monitor set up for data drift and quality metrics, but not for bias or explainability. The team needs to confirm concept drift and take corrective action. Which approach should the team take FIRST?

A.Configure SageMaker Model Monitor's model quality monitoring to compare predictions against actual outcomes collected from a week of production traffic

B.Immediately retrain the model using the most recent month of data and redeploy to the endpoint

C.Use Amazon SageMaker Clarify to compute SHAP values and understand which features are driving the new predictions

D.Investigate data drift by reviewing the Model Monitor feature distribution constraints and comparing recent input data to the baseline

AnswerA

Model quality monitoring tracks metrics like accuracy, precision, recall over time if ground truth labels are available. A significant drop in these metrics would confirm concept drift.

Why this answer

To detect concept drift, the team needs to compare the model's predictions against actual observed outcomes (ground truth). SageMaker Model Monitor's model quality monitoring can track prediction accuracy over time if ground truth is provided. Option A (set up Model Monitor's model quality monitoring) is the correct first step. Option B (retrain with more recent data) might help but does not confirm drift. Option C (use Clarify for SHAP values) is for feature importance, not drift detection. Option D (data drift monitoring) checks feature distributions, not concept drift.
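
The comparison that model quality monitoring relies on can be illustrated with a small standalone sketch. This is not the Model Monitor API; it only shows the idea of scoring predictions against ground truth for two periods. The records, the 0.5 threshold, and the function names are made up for illustration.

```python
def recall(predictions, labels):
    """Fraction of actual positives (label 1) that the model predicted as 1."""
    true_pos = sum(1 for p, y in zip(predictions, labels) if p == 1 and y == 1)
    actual_pos = sum(labels)
    return true_pos / actual_pos if actual_pos else 0.0


# Toy, made-up records: (predicted churn probability, observed outcome).
baseline_week = [(0.8, 1), (0.7, 1), (0.2, 0), (0.1, 0), (0.6, 1)]
current_week = [(0.1, 1), (0.05, 1), (0.02, 0), (0.04, 0), (0.6, 1)]

THRESHOLD = 0.5  # assumed decision threshold


def weekly_recall(records):
    """Threshold the probabilities, then score them against observed outcomes."""
    preds = [1 if prob >= THRESHOLD else 0 for prob, _ in records]
    labels = [label for _, label in records]
    return recall(preds, labels)


baseline = weekly_recall(baseline_week)
current = weekly_recall(current_week)
print(f"baseline recall={baseline:.2f}, current recall={current:.2f}")
```

A drop in a ground-truth metric such as recall while the input feature distributions stay stable is the kind of evidence that points to concept drift rather than data drift.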

68

MCQmedium

A team uses SageMaker Model Monitor to track data quality. They notice that the monitor's constraint violations are increasing but the model performance remains good. What should they do?

A.Disable the monitor because it is not affecting performance.

B.Relax the constraint thresholds to reduce alerts.

C.Retrain the model using the latest data.

D.Investigate the specific features that are violating constraints to see if they are still relevant.

AnswerD

Feature distributions may have naturally shifted without harming model performance; investigating helps decide if constraints need updating.

Why this answer

Option D is correct because investigating the specific violating features helps determine whether the constraints are still relevant. Option A ignores the issue. Option B is reactive without analysis. Option C may be unnecessary if performance is still good.

69

MCQhard

A team is deploying a machine learning model for real-time fraud detection. The model must have inference latency under 10 ms and handle up to 1000 requests per second. The model is a gradient boosting model using XGBoost. Which SageMaker hosting configuration is MOST cost-effective while meeting the requirements?

A.Use SageMaker Batch Transform with multiple instances

B.Use a SageMaker Multi-Model Endpoint (MME) on an ml.c5.4xlarge instance with auto scaling

C.Deploy on a single ml.c5.xlarge instance with a real-time endpoint

D.Deploy separate real-time endpoints for each model on ml.m5.large instances

AnswerB

MME allows multiple models to share a container, reducing cost while scaling to meet demand.

Why this answer

Option B is correct because a Multi-Model Endpoint (MME) on a single ml.c5.4xlarge instance allows multiple models to share the same endpoint, reducing cost while still meeting the latency (<10 ms) and throughput (1000 req/s) requirements. The ml.c5.4xlarge provides sufficient compute (16 vCPUs, 32 GB memory) for XGBoost inference, and auto scaling ensures capacity adjusts to handle peak load without over-provisioning.

Exam trap

The trap here is that candidates often assume a single large instance is insufficient for high throughput, but MME allows efficient resource sharing across models, making a single ml.c5.4xlarge cost-effective when the model is small and CPU-bound.

How to eliminate wrong answers

Option A is wrong because SageMaker Batch Transform is designed for offline, asynchronous inference on large datasets, not real-time fraud detection with sub-10 ms latency. Option C is wrong because a single ml.c5.xlarge instance (4 vCPUs, 8 GB memory) cannot handle 1000 requests per second with <10 ms latency for an XGBoost model; it would be CPU-bound and cause request throttling or timeouts. Option D is wrong because deploying separate real-time endpoints on ml.m5.large instances (2 vCPUs, 8 GB memory each) is cost-inefficient and would require many instances to meet throughput, increasing cost without latency benefit; also, ml.m5 instances are general-purpose, whereas XGBoost inference is CPU-intensive, making compute-optimized ml.c5 instances more suitable.

70

MCQeasy

A machine learning team at a retail company has deployed a product recommendation model using Amazon SageMaker. The model is updated weekly with new data. Recently, the team noticed that the model's accuracy on a holdout evaluation set has been declining over the past month. The data pipeline that feeds the training job has not changed. The team suspects data drift. They have SageMaker Model Monitor enabled on the inference endpoint and have set up Amazon CloudWatch metrics for feature distribution distances. Upon reviewing the CloudWatch dashboards, they see that the feature distribution distance metric for the most important feature 'product_category' has increased significantly. However, the team is unsure if this is the root cause. Which remediation step should the team take FIRST?

A.Retrain the model using the most recent week of data and redeploy to the endpoint

B.Investigate the data pipeline that feeds the training job to ensure consistent data collection and encoding of the 'product_category' feature

C.Rebuild the SageMaker endpoint with a different instance type to improve performance

D.Reduce the number of features in the model by removing 'product_category'

AnswerB

The first step should be to confirm that the data pipeline is not introducing errors. If the data is correct, then retraining might be appropriate.

Why this answer

Before retraining the model or deploying a new endpoint, the team should investigate the source of the data drift by checking the input data pipeline. The data pipeline might have introduced a systematic error, such as a change in how 'product_category' is encoded or collected. Option A (retrain the model with more recent data) might not help if the data itself is corrupted. Option C (rebuild the endpoint) would not address the data drift. Option D (remove 'product_category') would ignore the problem rather than fix it. Therefore, the first step is to investigate the data pipeline.

71

Multi-Selectmedium

A machine learning engineer is preparing a dataset for a binary classification model. The dataset has 10,000 rows and 200 features, with 5% positive class. The engineer suspects class imbalance may affect model performance. Which TWO actions should the engineer take to mitigate imbalance? (Choose 2.)

Select 2 answers

A.Perform PCA to reduce dimensions

B.Remove features with low variance

C.Use k-fold cross-validation

D.Apply SMOTE only to training data

E.Use class weights in the algorithm

AnswersD, E

SMOTE generates synthetic minority samples, helping the model learn the minority class better.

Why this answer

Option D is correct because SMOTE (Synthetic Minority Oversampling Technique) generates synthetic samples for the minority class by interpolating between existing minority instances, which helps balance the class distribution. Applying SMOTE only to the training data is critical to avoid data leakage, as the test set must remain untouched to provide an unbiased evaluation of model performance on the original class distribution.
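
Option E, class weights, can be sketched with the common "balanced" heuristic, which weights each class by `n_samples / (n_classes * class_count)` (the same formula scikit-learn documents for `class_weight="balanced"`). The helper name below is hypothetical, and the label list simply mirrors the 10,000-row, 5% positive dataset from the question.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Dataset shape from the question: 10,000 rows with a 5% positive class.
labels = [1] * 500 + [0] * 9500
weights = balanced_class_weights(labels)
print(weights)
```

With these weights, an error on a positive example costs roughly 19 times as much as an error on a negative one, which offsets the 19:1 class ratio without changing the data itself.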

Exam trap

The trap here is that candidates may confuse techniques for handling class imbalance with general data preprocessing or evaluation methods, leading them to select PCA or cross-validation as solutions, when in fact only resampling (SMOTE) and cost-sensitive learning (class weights) directly address the imbalance problem.

72

MCQmedium

An ML engineer is using Amazon SageMaker Automatic Model Tuning (AMT) to optimize hyperparameters for a gradient boosting model. The tuning job is taking a long time and has completed many training jobs. The engineer wants to stop training jobs that are unlikely to improve the objective metric. What should they configure?

A.Reduce the number of hyperparameter ranges

B.Use a random search strategy instead of Bayesian

C.Increase the maximum number of training jobs

D.Enable early stopping in the hyperparameter tuning job

AnswerD

Early stopping terminates training jobs that are not meeting an improvement threshold, reducing overall tuning time.

Why this answer

Early stopping in AMT automatically stops training jobs that are not improving the objective, saving time. Reducing hyperparameter ranges may narrow the search but does not stop unpromising jobs. Switching to random search changes how hyperparameter combinations are chosen but does not by itself stop unpromising jobs. Increasing the maximum number of training jobs would prolong the process.
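
A minimal sketch of where early stopping is switched on, written as the `HyperParameterTuningJobConfig` portion of a `CreateHyperParameterTuningJob` request. The objective metric and job limits are assumed placeholder values, and the required `ParameterRanges` section is omitted for brevity.

```python
tuning_job_config = {
    "Strategy": "Bayesian",
    "HyperParameterTuningJobObjective": {
        "Type": "Minimize",
        "MetricName": "validation:rmse",  # assumed objective metric
    },
    "ResourceLimits": {
        "MaxNumberOfTrainingJobs": 50,  # placeholder
        "MaxParallelTrainingJobs": 5,  # placeholder
    },
    # "Auto" lets the tuning job stop unpromising training jobs early;
    # the default value is "Off".
    "TrainingJobEarlyStoppingType": "Auto",
}

print(tuning_job_config["TrainingJobEarlyStoppingType"])
```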

73

MCQmedium

A company trains a model daily using Amazon SageMaker and uses the model for real-time inference. They want to detect data drift between the training data and the inference data to decide when to retrain. Which AWS service should they use for this purpose?

A.Amazon Athena

B.Amazon SageMaker Model Monitor

C.AWS Glue

D.AWS Lambda

AnswerB

SageMaker Model Monitor is designed to detect data drift and model quality degradation.

Why this answer

Option B is correct because Amazon SageMaker Model Monitor can detect data drift by comparing inference data against a baseline created from training data. Option A (Athena) is for querying data, not monitoring drift. Option C (Glue) is for ETL, not drift detection. Option D (Lambda) is for serverless compute.

74

MCQhard

A company uses SageMaker to train a model with a large dataset stored in S3. They notice that the training job is taking longer than expected and the GPU utilization is low. Which action would most likely improve GPU utilization?

A.Increase the batch size

B.Disable distributed training

C.Use a smaller instance type

D.Decrease the batch size

AnswerA

Correct: Larger batch sizes better utilize GPU memory and compute.

Why this answer

Option A is correct because increasing the batch size can help saturate GPU compute. Option B is wrong because disabling distributed training reduces parallelism. Option C is wrong because a smaller instance type provides less compute. Option D is wrong because decreasing the batch size would lower GPU utilization.

75

MCQeasy

A data engineer needs to split a time-series dataset into training and validation sets for a forecasting model. Which split method should be used to avoid data leakage?

A.Use k-fold cross-validation with random shuffling.

B.Use feature importance scores to weight the splitting process.

C.Random split with 80% training and 20% validation.

D.Temporal split where training uses data up to a cutoff date and validation uses later data.

AnswerD

Time-series data must be split chronologically to preserve the temporal dependencies.

Why this answer

Option D is correct because for time-series data, a temporal split ensures that validation data comes from a later time period than training data. Option A is wrong because standard k-fold cross-validation with random shuffling is not appropriate; cross-validation is useful only when done in a time-aware manner (e.g., rolling origin). Option B is wrong because feature importance is not a splitting method. Option C is wrong because random splits can cause future data to leak into training.
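
A temporal split can be sketched in a few lines of plain Python. The function name, the cutoff date, and the toy series are all hypothetical; the point is only that every training row precedes every validation row.

```python
from datetime import date


def temporal_split(rows, cutoff):
    """Split (date, value) rows: training is on or before cutoff, validation is after."""
    ordered = sorted(rows, key=lambda row: row[0])
    train = [row for row in ordered if row[0] <= cutoff]
    validation = [row for row in ordered if row[0] > cutoff]
    return train, validation


# Toy series with made-up values, one observation per day.
series = [(date(2024, 1, day), float(day)) for day in range(1, 11)]
train, validation = temporal_split(series, cutoff=date(2024, 1, 8))
print(len(train), len(validation))
```

Because the split is driven by the timestamp rather than by row position or random sampling, no observation from after the cutoff can influence training.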
