CCNA ML Solution Monitoring, Maintenance and Security Questions

1
MCQhard

A machine learning engineer deploys a model to an Amazon SageMaker endpoint with data capture enabled. The endpoint uses a production variant with initial instance count of 2. After a week, they notice that the captured data is not being sent to the specified Amazon S3 bucket. The IAM role used by the endpoint has the following policy attached. What is the MOST likely reason for the failure?

A.The S3 bucket does not exist.
B.The S3 bucket uses AWS KMS encryption and the role lacks kms:Decrypt permission.
C.The IAM role does not have permission to write to the correct S3 prefix.
D.The IAM role does not have s3:ListBucket permission.
AnswerC

The policy restricts writes to 'captures/' prefix, but the endpoint may use a different prefix.

Why this answer

Option C is correct because the IAM role attached to the SageMaker endpoint must have write permissions to the exact S3 prefix where data capture is configured. The policy shown likely grants access to a different prefix, but not to the specific path (e.g., s3://bucket-name/prefix/) that the endpoint's DataCaptureConfig specifies. Without s3:PutObject on that exact prefix, the captured data silently fails to upload.

Exam trap

The trap here is that candidates often assume any S3 write permission on the bucket is sufficient, but SageMaker data capture requires explicit permission on the exact prefix path, not just the bucket or a wildcard that doesn't match the configured prefix.

How to eliminate wrong answers

Option A is wrong because if the S3 bucket did not exist, SageMaker would fail at endpoint creation or deployment time, not after a week of operation. Option B is wrong because the question does not mention KMS encryption on the bucket; if KMS were used, the role would need kms:GenerateDataKey and kms:Decrypt, but nothing in the scenario suggests that this is the issue. Option D is wrong because s3:ListBucket is not required for writing captured data; SageMaker only needs s3:PutObject on the specific prefix.
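
The prefix mismatch described above can be illustrated with a small local check. This is only a sketch: the bucket name and both prefixes are hypothetical, and `allows_put` merely mimics IAM wildcard matching on the resource ARN; it is not a real policy evaluator.

```python
import json
from fnmatch import fnmatchcase

# Hypothetical names, used only for illustration.
BUCKET = "example-capture-bucket"
CONFIGURED_PREFIX = "datacapture/"  # assumed prefix in the endpoint's DestinationS3Uri


def put_object_policy(bucket: str, prefix: str) -> dict:
    """Build an IAM policy that allows s3:PutObject under a single prefix."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "s3:PutObject",
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}*",
            }
        ],
    }


def allows_put(policy: dict, bucket: str, key: str) -> bool:
    """Rough check: does an Allow statement for s3:PutObject match this object ARN?"""
    arn = f"arn:aws:s3:::{bucket}/{key}"
    for stmt in policy["Statement"]:
        actions = stmt["Action"] if isinstance(stmt["Action"], list) else [stmt["Action"]]
        resources = stmt["Resource"] if isinstance(stmt["Resource"], list) else [stmt["Resource"]]
        if stmt["Effect"] == "Allow" and "s3:PutObject" in actions:
            if any(fnmatchcase(arn, pattern) for pattern in resources):
                return True
    return False


# The role only allows writes under 'captures/', as in the scenario.
policy = put_object_policy(BUCKET, "captures/")
print(json.dumps(policy, indent=2))
print(allows_put(policy, BUCKET, "captures/2024/01/data.jsonl"))
print(allows_put(policy, BUCKET, CONFIGURED_PREFIX + "2024/01/data.jsonl"))
```

The second check fails because the object key falls outside the `captures/` pattern, which is the situation Option C describes.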

2
Multi-Selectmedium

A company uses Amazon SageMaker Model Monitor to track data quality. The monitoring job triggers an alert indicating that the data distribution has shifted beyond the configured threshold. Which TWO actions should the team take? (Choose TWO.)

Select 2 answers
A.Update the Model Monitor baseline if the drift is acceptable
B.Delete the monitoring schedule
C.Retrain the model with updated training data
D.Increase the instance count of the endpoint
E.Evaluate the data quality report
AnswersA, C

If the drift reflects a new normal, updating the baseline prevents false alerts.

Why this answer

Options A and C are correct because the team should retrain the model on the new data distribution if the drift is significant (C), or update the baseline if the drift is acceptable and represents expected behavior (A). Options B, D, and E are not appropriate immediate actions.

3
MCQmedium

A machine learning engineer sees the above error in Amazon CloudWatch Logs for a SageMaker endpoint. What is the most likely cause?

A.The model file is corrupted during deployment.
B.The data capture configuration is incorrectly set to capture only the response body.
C.The inference code in the Docker container outputs a different response format than expected by the endpoint.
D.The endpoint is overloaded and dropping requests.
AnswerC

The inference script (e.g., in a SageMaker inference container) must output the exact JSON structure the endpoint expects. This error shows a mismatch.

Why this answer

The error indicates the model returned a response with an unexpected structure. The expected format was a JSON with a 'predictions' array, but the model output a single 'prediction' field. This mismatch is typically due to a bug in the inference code within the Docker container, not corruption, overload, or misconfiguration of data capture.

4
MCQeasy

A company stores its model training data in Amazon S3. To meet compliance requirements, all data in transit between the S3 bucket and SageMaker must be encrypted. What should the company enforce?

A.Enable S3 versioning
B.Enable S3 access logging
C.Enforce HTTPS for all S3 access
D.Use S3 server-side encryption (SSE-S3)
AnswerC

HTTPS provides encryption in transit.

Why this answer

Option C is correct because enforcing HTTPS for all S3 access ensures that data in transit between the S3 bucket and SageMaker is encrypted using TLS. This meets the compliance requirement for encrypting data in transit, as HTTPS uses TLS to protect data as it travels over the network.

Exam trap

The trap here is that candidates often confuse encryption at rest (SSE-S3) with encryption in transit (HTTPS/TLS), leading them to select Option D when the question explicitly asks about data in transit.

How to eliminate wrong answers

Option A is wrong because S3 versioning is a data protection feature that preserves, retrieves, and restores every version of an object stored in a bucket; it does not encrypt data in transit. Option B is wrong because S3 access logging provides detailed records of requests made to a bucket for auditing purposes, but it does not enforce or provide encryption for data in transit. Option D is wrong because S3 server-side encryption (SSE-S3) encrypts data at rest within S3, not data in transit between S3 and SageMaker.
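
One common way to enforce HTTPS for S3 access is a bucket policy that denies any request made without TLS, using the `aws:SecureTransport` condition key. The sketch below builds such a policy as a Python dictionary; the bucket name is a placeholder, and the policy would still need to be attached to the bucket separately.

```python
import json


def deny_insecure_transport_policy(bucket: str) -> dict:
    """Build a bucket policy that denies all S3 requests not sent over TLS."""
    bucket_arn = f"arn:aws:s3:::{bucket}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyInsecureTransport",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                # Cover both bucket-level and object-level operations.
                "Resource": [bucket_arn, f"{bucket_arn}/*"],
                # aws:SecureTransport is "false" for plain-HTTP requests.
                "Condition": {"Bool": {"aws:SecureTransport": "false"}},
            }
        ],
    }


# "example-training-data" is a hypothetical bucket name.
print(json.dumps(deny_insecure_transport_policy("example-training-data"), indent=2))
```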

5
MCQmedium

A team deploys a model with SageMaker and notices that the model returns inconsistent results during inference. They suspect a mismatch in feature transformation between the training pipeline and the inference pipeline. Which SageMaker feature can help compare the feature distributions?

A.Amazon SageMaker Model Monitor
B.Amazon SageMaker Autopilot
C.Amazon SageMaker Clarify
D.Amazon SageMaker Debugger
AnswerA

Model Monitor can compare inference data statistics against a baseline to detect drift.

Why this answer

Option A is correct because SageMaker Model Monitor can track feature distributions over time and detect drift. Option B (Autopilot) is for automated model building. Option C (Clarify) is for explainability and bias detection. Option D (Debugger) is for debugging training jobs.

6
MCQhard

A retail company has deployed a real-time recommendation model on a SageMaker endpoint. The model is trained daily using SageMaker Pipelines that process user interaction data from a large S3 bucket. Recently, the operations team noticed that the endpoint's predictions have become stale; users are seeing recommendations based on data from days ago. The pipeline runs successfully every day at 2 AM UTC, but the endpoint continues to serve the old model version. The team checks the pipeline and finds no errors. The model registry contains multiple model versions approved automatically. The endpoint is configured with production variants, but only one variant is active. The team suspects the issue is with the deployment step in the pipeline. They want to automatically deploy new model versions to the endpoint as soon as they are registered and approved. What should they do?

A.Set up the SageMaker Model Registry to trigger a Lambda function on approval that updates the endpoint using the new model version.
B.Configure the pipeline to stop automatic approval and require manual approval before deployment.
C.Modify the pipeline to run batch transforms on the new model and compare metrics, then update the endpoint.
D.Create a daily cron job that checks for new model versions and manually updates the endpoint configuration.
AnswerA

Event-driven deployment ensures immediate update on approval.

Why this answer

Option A is correct because pairing the SageMaker Model Registry with an event-driven deployment (via EventBridge or a Lambda function triggered by approval) ensures new models are deployed as soon as they are approved. Option B (manual approval) makes the process less automatic, not more. Option C (testing with batch transform) does not by itself deploy the new version to the endpoint when it is approved. Option D (a daily cron job with manual updates to the endpoint configuration) is not event-driven and still relies on a manual step, while the team needs full automation.

7
MCQmedium

A company uses Amazon SageMaker Pipelines for automated retraining. The pipeline includes a processing step that runs a Python script. The script uses the boto3 library to call an AWS service, but the calls are being throttled. What is the MOST effective way to address this within the pipeline?

A.Increase the instance count for the processing step to distribute the API calls.
B.Modify the Python script to include retry with exponential backoff when receiving throttling exceptions.
C.Request a service quota increase for the throttling limit.
D.Add a wait step in the pipeline before the processing step.
AnswerB

Standard best practice for handling API throttling.

Why this answer

Option B is correct because implementing retry logic with exponential backoff in the script is the standard approach to handling throttling. Option C is wrong because a service quota increase is a longer-term solution and is not always available. Option A is wrong because adding instances does not reduce throttling; it can increase the number of concurrent API calls. Option D is wrong because there is no built-in pipeline step for backoff; it must be implemented in code.
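
A minimal sketch of retry with exponential backoff and jitter is shown below. `ThrottlingError` and `flaky_call` are stand-ins invented for this example. In a real boto3 script you would more likely catch `botocore.exceptions.ClientError` and inspect the error code, or rely on botocore's built-in retry modes configured through `botocore.config.Config(retries=...)`.

```python
import random
import time


class ThrottlingError(Exception):
    """Hypothetical stand-in for a throttling exception from an AWS client call."""


def call_with_backoff(func, max_attempts=5, base_delay=0.5, max_delay=8.0, sleep=time.sleep):
    """Call func(), retrying on ThrottlingError with exponential backoff and full jitter."""
    for attempt in range(max_attempts):
        try:
            return func()
        except ThrottlingError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # The delay cap doubles on each attempt, bounded by max_delay.
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, cap))


# Usage: a fake call that is throttled twice before succeeding.
calls = {"count": 0}


def flaky_call():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ThrottlingError("rate exceeded")
    return "ok"


result = call_with_backoff(flaky_call, sleep=lambda seconds: None)  # skip real sleeping here
print(result, calls["count"])
```

Passing `sleep` as a parameter keeps the example fast to run and easy to test; production code would use the default `time.sleep`.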

8
Multi-Selecthard

A company operates multiple AWS accounts with SageMaker workloads. They need to implement governance and security controls for model monitoring and maintenance. Which THREE actions should they take to meet compliance requirements?

Select 3 answers
A.Deploy a SageMaker model registry in a centralized account.
B.Use AWS CloudTrail to log all API calls to SageMaker and S3.
C.Enable VPC Flow Logs for SageMaker notebooks.
D.Use IAM roles with cross-account trust policies for all SageMaker endpoints.
E.Use AWS Config rules to enforce encryption of model artifacts.
AnswersA, B, E

Correct. A central registry ensures model approval and version control across accounts.

Why this answer

CloudTrail logging, AWS Config rules for encryption, and a centralized model registry help enforce governance across accounts.

9
MCQeasy

A company uses an Amazon SageMaker endpoint for real-time inference. The security team requires that all traffic between the endpoint and the client application be encrypted in transit. Which configuration ensures this?

A.Deploy the endpoint in a VPC and use VPC Endpoints.
B.Use AWS Key Management Service (KMS) to encrypt the data in transit.
C.The endpoint is automatically served over HTTPS; no additional configuration is needed.
D.Attach an AWS Certificate Manager (ACM) certificate to the endpoint.
AnswerC

SageMaker endpoints use HTTPS by default.

Why this answer

Option C is correct because SageMaker endpoints are already HTTPS-enabled by default, providing encryption in transit via TLS. Option A is wrong because a VPC does not provide encryption in transit by default; it provides network isolation. Option D is wrong because SageMaker does not directly use AWS Certificate Manager for endpoint encryption; it uses AWS-managed certificates. Option B is wrong because AWS KMS is for encryption at rest, not in transit.

10
MCQhard

A machine learning engineer is setting up automated retraining for a model using SageMaker Pipelines. The pipeline should trigger when a data drift alert is received from Model Monitor. Which event source should the engineer use to initiate the pipeline?

A.Amazon CloudWatch Events (Amazon EventBridge) rule that captures Model Monitor outcome.
B.AWS Lambda function that polls CloudWatch logs.
C.S3 event notification on the monitoring output bucket.
D.SageMaker model monitor webhook.
AnswerA

Model Monitor publishes violation events to EventBridge, which can trigger a pipeline execution reliably.

Why this answer

Option A is correct because SageMaker Model Monitor publishes drift results as Amazon CloudWatch Events (now EventBridge). Option C: S3 events on monitoring output are not directly linked to drift alerts. Option D: no webhook exists. Option B: polling is inefficient.

11
Multi-Selecteasy

A company stores training data in Amazon S3 and uses Amazon SageMaker for model training. They need to ensure data is encrypted at rest. Which THREE encryption options are supported by SageMaker for data stored in S3? (Choose THREE.)

Select 3 answers
A.SSE-C (customer-provided keys)
B.Client-side encryption
C.SSE-KMS (KMS-managed keys)
D.Amazon CloudFront encryption
E.SSE-S3 (S3-managed keys)
AnswersA, C, E

SageMaker supports SSE-C, but the user must provide the key during training.

Why this answer

Options A, C, and E are correct because SageMaker supports all three Amazon S3 server-side encryption options (SSE-C, SSE-KMS, and SSE-S3). Option B (client-side encryption) is not supported for automatic decryption by SageMaker. Option D (CloudFront) is for content delivery, not storage encryption.

12
MCQeasy

A machine learning engineer at a retail company is monitoring a production model that predicts inventory demand. The model's prediction accuracy has dropped significantly over the past week. The engineer checks the model's input data and notices a new product category was introduced with a different distribution. Which concept is most likely causing the performance degradation?

A.Concept drift
B.Covariate shift
C.Data leakage
D.Model decay
AnswerB

Covariate shift occurs when the distribution of input features changes over time.

Why this answer

B is correct because covariate shift occurs when the distribution of the input features changes while the relationship between features and the target remains the same. In this scenario, the introduction of a new product category with a different distribution alters the input data distribution, causing the model to encounter unseen patterns and degrade in prediction accuracy.

Exam trap

AWS often tests the distinction between covariate shift and concept drift, and the trap here is that candidates confuse a change in input distribution (covariate shift) with a change in the relationship between inputs and outputs (concept drift), leading them to incorrectly select concept drift.

How to eliminate wrong answers

Option A is wrong because concept drift refers to a change in the underlying relationship between input features and the target variable over time, not a change in the input distribution itself. Option C is wrong because data leakage involves the accidental inclusion of future information or target data in the training features, which is not indicated by a new product category with a different distribution. Option D is wrong because model decay is a general term for performance degradation over time, but it does not specifically describe the cause as a shift in input distribution; covariate shift is the precise technical concept here.
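
To make the idea of a shifted input distribution concrete, the sketch below compares the category frequencies of a baseline sample and a recent sample using total variation distance. The category names and counts are invented for illustration, and this is not the statistic SageMaker Model Monitor itself computes; it only shows how a new category changes P(X).

```python
from collections import Counter


def category_distribution(values):
    """Return the relative frequency of each category in a sample."""
    counts = Counter(values)
    total = sum(counts.values())
    return {category: count / total for category, count in counts.items()}


def total_variation_distance(p, q):
    """Half the L1 distance between two discrete distributions (0 = identical, 1 = disjoint)."""
    categories = set(p) | set(q)
    return 0.5 * sum(abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in categories)


# Hypothetical samples: the recent data contains a category unseen in the baseline.
baseline = ["grocery"] * 50 + ["apparel"] * 50
recent = ["grocery"] * 40 + ["apparel"] * 40 + ["electronics"] * 20

distance = total_variation_distance(
    category_distribution(baseline), category_distribution(recent)
)
print(f"total variation distance: {distance:.2f}")
```

A distance near 0 means the input distribution is essentially unchanged; here the new category moves it to roughly 0.2, which signals a change in the inputs rather than in the feature-to-target relationship.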

13
MCQhard

An ML team trained a model using SageMaker and stored the model artifacts in S3 with server-side encryption using AWS KMS (SSE-KMS). They need to deploy the model to a SageMaker endpoint that uses a different KMS key for inference data encryption. What must they do to ensure the endpoint can decrypt the model artifacts?

A.Provide the same KMS key for both model artifacts and inference data.
B.Use a customer-managed key (CMK) with the same key material.
C.Grant the SageMaker execution role access to both KMS keys.
D.Configure the endpoint to use SSE-S3 instead of SSE-KMS.
AnswerC

The role needs decrypt on the artifact key and encrypt/decrypt on the inference key.

Why this answer

The SageMaker execution role must have kms:Decrypt permission on the KMS key used for model artifacts, and also kms:Encrypt and kms:Decrypt on the key for inference data. Providing the same key is not required.
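
The permissions described above can be sketched as an IAM policy document built in Python. Both key ARNs are placeholders, and the statement layout is one reasonable arrangement rather than a required one. The KMS key policies must also permit the role to use each key.

```python
import json


def execution_role_kms_policy(artifact_key_arn: str, inference_key_arn: str) -> dict:
    """Build an IAM policy covering the two KMS keys used by the endpoint."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DecryptModelArtifacts",
                "Effect": "Allow",
                "Action": ["kms:Decrypt"],
                "Resource": artifact_key_arn,
            },
            {
                "Sid": "UseInferenceDataKey",
                "Effect": "Allow",
                "Action": ["kms:Encrypt", "kms:Decrypt"],
                "Resource": inference_key_arn,
            },
        ],
    }


# Hypothetical key ARNs for illustration only.
policy = execution_role_kms_policy(
    "arn:aws:kms:us-east-1:111122223333:key/EXAMPLE-ARTIFACT-KEY",
    "arn:aws:kms:us-east-1:111122223333:key/EXAMPLE-INFERENCE-KEY",
)
print(json.dumps(policy, indent=2))
```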

14
MCQhard

A company uses Amazon SageMaker Ground Truth to create a labeled dataset. They want to monitor the accuracy of human labelers during the labeling process. Which metric should they track?

A.Labeling job cost
B.Number of tasks completed
C.Accuracy against blinded ground truth
D.Task acceptance rate
AnswerC

Ground Truth inserts known ground truth tasks to audit labelers; tracking accuracy on these tasks measures labeler performance.

Why this answer

Option C is correct because tracking accuracy against blinded ground truth (known as audit tasks) is the standard way to measure labeler performance. Options A and B are operational metrics not directly measuring accuracy. Option D is not directly accuracy.

15
MCQeasy

A machine learning engineer wants to encrypt model artifacts stored in Amazon S3. The artifacts are created and used by SageMaker training jobs and endpoints. What is the simplest way to ensure encryption at rest?

A.Create an S3 bucket with default encryption using SSE-S3 and allow SageMaker access.
B.Use SageMaker's default encryption with an AWS managed key.
C.Enable S3 bucket versioning and MFA delete.
D.Use a custom KMS key and grant SageMaker permission to use it.
AnswerA

SSE-S3 provides encryption at rest with no additional configuration, and SageMaker can read/write objects without any extra setup.

Why this answer

Option A is correct because SSE-S3 is the simplest encryption method and works seamlessly with SageMaker. Option B is not a thing; SageMaker does not have default encryption for artifacts. Option D (a custom KMS key) is possible but not the simplest. Option C is for versioning, not encryption.

16
MCQeasy

A team uses SageMaker Pipelines to automate model retraining. After a successful pipeline run, they want to register the new model version in the SageMaker Model Registry so that it can be reviewed for approval. Which step type should they add to the pipeline?

A.RegisterModelStep
B.ConditionStep
C.TransformStep
D.TrainingStep
AnswerA

RegisterModelStep is specifically for registering a model version in the Model Registry.

Why this answer

Option A is correct because the RegisterModel step in SageMaker Pipelines registers a model in the Model Registry. Option D (TrainingStep) only trains. Option C (TransformStep) is for batch inference. Option B (ConditionStep) is for branching.

17
MCQmedium

A gaming company uses a SageMaker endpoint for real-time player churn prediction. The model is updated weekly. After a recent retraining, the team notices that the endpoint's predicted probabilities for churn have shifted dramatically: the average predicted probability dropped from 0.3 to 0.05. The team suspects concept drift (the relationship between features and target changed) rather than data drift. They have SageMaker Model Monitor set up for data drift and quality metrics, but not for bias or explainability. The team needs to confirm concept drift and take corrective action. Which approach should the team take FIRST?

A.Configure SageMaker Model Monitor's model quality monitoring to compare predictions against actual outcomes collected from a week of production traffic
B.Immediately retrain the model using the most recent month of data and redeploy to the endpoint
C.Use Amazon SageMaker Clarify to compute SHAP values and understand which features are driving the new predictions
D.Investigate data drift by reviewing the Model Monitor feature distribution constraints and comparing recent input data to the baseline
AnswerA

Model quality monitoring tracks metrics like accuracy, precision, recall over time if ground truth labels are available. A significant drop in these metrics would confirm concept drift.

Why this answer

To detect concept drift, the team needs to compare the model's predictions against actual observed outcomes (ground truth). SageMaker Model Monitor's quality monitoring can track prediction accuracy over time if ground truth is provided. Option A (set up Model Monitor's model quality monitoring) is the correct first step.

Option B (retrain with more recent data) might help but does not confirm drift. Option D (data drift monitoring) checks feature distribution, not concept drift. Option C (use Clarify for SHAP values) is for feature importance, not drift detection.

18
MCQmedium

A team uses SageMaker Model Monitor to track data quality. They notice that the monitor's constraint violations are increasing but the model performance remains good. What should they do?

A.Disable the monitor because it is not affecting performance.
B.Relax the constraint thresholds to reduce alerts.
C.Retrain the model using the latest data.
D.Investigate the specific features that are violating constraints to see if they are still relevant.
AnswerD

Feature distributions may have naturally shifted without harming model performance; investigating helps decide if constraints need updating.

Why this answer

Option D is correct because investigating the specific violating features helps determine whether the constraints are still relevant. Option B (relaxing thresholds) is reactive without analysis. Option A (disabling the monitor) ignores the issue. Option C (retraining) may be unnecessary if performance is still good.

19
MCQeasy

A machine learning team at a retail company has deployed a product recommendation model using Amazon SageMaker. The model is updated weekly with new data. Recently, the team noticed that the model's accuracy on a holdout evaluation set has been declining over the past month. The data pipeline that feeds the training job has not changed. The team suspects data drift. They have SageMaker Model Monitor enabled on the inference endpoint and have set up Amazon CloudWatch metrics for feature distribution distances. Upon reviewing the CloudWatch dashboards, they see that the feature distribution distance metric for the most important feature 'product_category' has increased significantly. However, the team is unsure if this is the root cause. Which remediation step should the team take FIRST?

A.Retrain the model using the most recent week of data and redeploy to the endpoint
B.Investigate the data pipeline that feeds the training job to ensure consistent data collection and encoding of the 'product_category' feature
C.Rebuild the SageMaker endpoint with a different instance type to improve performance
D.Reduce the number of features in the model by removing 'product_category'
AnswerB

The first step should be to confirm that the data pipeline is not introducing errors. If the data is correct, then retraining might be appropriate.

Why this answer

Before retraining the model or deploying a new endpoint, the team should investigate the source of the data drift by checking the input data pipeline. The data pipeline might have introduced a systematic error, such as a change in how 'product_category' is encoded or collected. Option A (retrain the model with more recent data) might not help if the data itself is corrupted.

Option D (reduce the number of features) could ignore the problem. Option C (rebuild the endpoint) would not address the data drift. Therefore, the first step is to investigate the data pipeline (Option B).

20
MCQmedium

A company trains a model daily using Amazon SageMaker and uses the model for real-time inference. They want to detect data drift between the training data and the inference data to decide when to retrain. Which AWS service should they use for this purpose?

A.Amazon Athena
B.Amazon SageMaker Model Monitor
C.AWS Glue
D.AWS Lambda
AnswerB

SageMaker Model Monitor is designed to detect data drift and model quality degradation.

Why this answer

Option B is correct because Amazon SageMaker Model Monitor can detect data drift by comparing inference data against a baseline created from training data. Option A (Athena) is for querying data, not monitoring drift. Option C (Glue) is for ETL, not drift detection. Option D (Lambda) is for serverless compute.

21
MCQhard

A team is deploying a model that requires GPU acceleration for inference. They are using an Amazon SageMaker real-time endpoint. The model is a large language model (LLM) that does not fit on a single GPU. Which configuration should they use to minimize latency while fitting the model?

A.Use data parallelism with Horovod to distribute inference across GPUs.
B.Use SageMaker's model parallelism library to shard the model across multiple GPUs in a single instance.
C.Optimize the model with SageMaker Neo to reduce its size.
D.Deploy the model across multiple endpoints and use a load balancer.
AnswerB

Hardware and software support for large model inference.

Why this answer

Option B is correct because SageMaker supports model parallelism, allowing a model to be sharded across multiple GPUs in the same instance. Option D is wrong because SageMaker does not support multi-endpoint model parallelism. Option A is wrong because data parallelism is for training, not inference. Option C is wrong because SageMaker Neo is for optimization, not model parallelism across GPUs.

22
MCQeasy

A company wants to ensure that only authorized users and services can invoke a SageMaker real-time endpoint. Which AWS service can be used to manage access control?

A.Amazon CloudWatch
B.AWS Identity and Access Management (IAM)
C.AWS CloudTrail
D.AWS Config
AnswerB

IAM policies can grant or deny access to invoke SageMaker endpoints.

Why this answer

Option B is correct because AWS IAM is used to control access to AWS resources, including SageMaker endpoints. Options A, C, and D are for monitoring, auditing, and configuration compliance, not access control.
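
As an illustration, an identity-based policy that lets a principal invoke one specific endpoint might look like the sketch below. The region, account ID, and endpoint name are placeholders.

```python
import json


def invoke_endpoint_policy(region: str, account_id: str, endpoint_name: str) -> dict:
    """Build an IAM policy allowing sagemaker:InvokeEndpoint on a single endpoint."""
    endpoint_arn = f"arn:aws:sagemaker:{region}:{account_id}:endpoint/{endpoint_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "sagemaker:InvokeEndpoint",
                "Resource": endpoint_arn,
            }
        ],
    }


# Hypothetical values for illustration only.
policy = invoke_endpoint_policy("us-east-1", "111122223333", "example-endpoint")
print(json.dumps(policy, indent=2))
```

Principals without an Allow for `sagemaker:InvokeEndpoint` on this ARN are denied by default.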

23
MCQmedium

A team deploys a PyTorch model on Amazon SageMaker for real-time inference. They notice that inference latency is higher than expected. They suspect the serialization format used for input data is inefficient. Which approach would MOST likely reduce latency?

A.Use Amazon SageMaker Batch Transform instead of real-time inference.
B.Change the input serialization format to Protocol Buffers.
C.Enable automatic scaling on the endpoint.
D.Increase the instance type to a compute-optimized instance.
AnswerB

Protocol Buffers reduce serialization time compared to JSON/CSV.

Why this answer

Protocol Buffers (protobuf) are a binary serialization format that is significantly more compact and faster to parse than text-based formats like JSON or CSV. By reducing the size of the input data and the CPU overhead of deserialization, switching to protobuf directly addresses the root cause of high inference latency on SageMaker real-time endpoints.

Exam trap

The trap here is that candidates often confuse throughput improvements (scaling, larger instances) with latency reduction, or mistakenly think Batch Transform can substitute for real-time inference, when the question specifically targets the serialization format as the suspected bottleneck.

How to eliminate wrong answers

Option A is wrong because Batch Transform is designed for offline, asynchronous processing of large datasets and does not reduce latency for real-time inference; it actually increases end-to-end time by batching. Option C is wrong because automatic scaling adjusts the number of instances to handle traffic volume, not the per-request latency caused by serialization inefficiency. Option D is wrong because, while a compute-optimized instance might improve raw processing speed, it does not fix the underlying serialization bottleneck and is a more expensive, indirect solution compared to changing the serialization format.

24
MCQeasy

Refer to the exhibit. A team has configured data capture for a SageMaker endpoint. The endpoint is returning predictions but no captured data appears in the S3 bucket. What is the most likely cause?

A.The InitialSamplingPercentage is too low.
B.The IAM role for the endpoint does not have s3:PutObject permission.
C.The capture status is 'Configured' but not 'Running'.
D.The endpoint is not receiving any traffic.
AnswerB

Without write permission, captured data cannot be written to S3.

Why this answer

The data capture configuration shows CurrentCaptureStatus as 'Running' and sampling percentage at 50%. Most likely the IAM role attached to the endpoint does not have s3:PutObject permission for the destination bucket.
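
The relationship between the capture destination and the required permission can be sketched as follows. The helper names and the S3 URI are hypothetical; the dictionary mirrors the shape of `DataCaptureConfig` as passed when creating an endpoint configuration, with a 50% sampling rate as in the exhibit.

```python
def data_capture_config(destination_s3_uri: str, sampling_percentage: int = 50) -> dict:
    """Build a DataCaptureConfig-style dictionary for an endpoint configuration."""
    if not destination_s3_uri.startswith("s3://"):
        raise ValueError("destination must be an s3:// URI")
    if not 0 <= sampling_percentage <= 100:
        raise ValueError("sampling percentage must be between 0 and 100")
    return {
        "EnableCapture": True,
        "InitialSamplingPercentage": sampling_percentage,
        "DestinationS3Uri": destination_s3_uri,
        "CaptureOptions": [{"CaptureMode": "Input"}, {"CaptureMode": "Output"}],
    }


def required_put_object_resource(destination_s3_uri: str) -> str:
    """Derive the S3 resource ARN pattern the endpoint role needs s3:PutObject on."""
    bucket, _, prefix = destination_s3_uri[len("s3://"):].partition("/")
    prefix = prefix.rstrip("/")
    return f"arn:aws:s3:::{bucket}/{prefix}/*" if prefix else f"arn:aws:s3:::{bucket}/*"


# Hypothetical destination for illustration only.
config = data_capture_config("s3://example-bucket/datacapture")
print(required_put_object_resource(config["DestinationS3Uri"]))
```

If the role's policy does not allow `s3:PutObject` on a resource matching this pattern, captured records cannot be written even though the endpoint keeps serving predictions.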

25
Multi-Selectmedium

A company wants to secure access to a SageMaker real-time endpoint. Which TWO actions should be taken? (Select two.)

Select 2 answers
A.Use an IAM role with sts:AssumeRole for invocation.
B.Attach a resource-based policy to the endpoint.
C.Enable AWS WAF on the endpoint.
D.Use AWS CloudTrail to log all invocations.
E.Configure the endpoint to be private within a VPC and use VPC endpoints.
AnswersB, E

Resource-based policies on SageMaker endpoints allow you to specify which IAM principals can invoke the endpoint.

Why this answer

Options B and E are correct. B: Attaching a resource-based policy allows fine-grained control over who can invoke the endpoint. E: Configuring the endpoint within a VPC and using VPC endpoints improves network security.

A is incorrect because IAM roles with sts:AssumeRole are not typically used for endpoint invocation. C is incorrect because AWS WAF does not integrate with SageMaker endpoints. D is about logging, not access control.

26
Multi-Selecteasy

A machine learning engineer is monitoring a production SageMaker endpoint using Amazon CloudWatch. They want to set up alarms for anomalous behavior. Which TWO CloudWatch metrics are MOST appropriate for detecting a sudden increase in request latency?

Select 2 answers
A.ModelLatency
B.5XXError
C.MemoryUtilization
D.Invocations
E.CPUUtilization
AnswersA, E

Correct. This metric measures the time taken for the model to process a request.

Why this answer

ModelLatency directly measures request latency, and CPUUtilization can indicate resource saturation leading to latency increases.
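
A latency alarm on ModelLatency could be defined along the lines of the sketch below. The endpoint name, variant name, threshold, and evaluation settings are assumptions chosen for illustration. The resulting dictionary is meant to be passed to CloudWatch's `put_metric_alarm` call (for example `boto3.client("cloudwatch").put_metric_alarm(**params)`), which is not executed here.

```python
def model_latency_alarm_params(endpoint_name: str, variant_name: str, threshold_ms: float) -> dict:
    """Build keyword arguments for a CloudWatch alarm on SageMaker ModelLatency."""
    return {
        "AlarmName": f"{endpoint_name}-{variant_name}-model-latency-high",
        "Namespace": "AWS/SageMaker",
        "MetricName": "ModelLatency",
        "Dimensions": [
            {"Name": "EndpointName", "Value": endpoint_name},
            {"Name": "VariantName", "Value": variant_name},
        ],
        "Statistic": "Average",
        "Period": 60,  # seconds per datapoint
        "EvaluationPeriods": 3,  # alarm after three consecutive breaching periods
        # ModelLatency is reported in microseconds, so convert from milliseconds.
        "Threshold": threshold_ms * 1000.0,
        "ComparisonOperator": "GreaterThanThreshold",
    }


# Hypothetical endpoint and variant names.
params = model_latency_alarm_params("example-endpoint", "AllTraffic", threshold_ms=200)
print(params["AlarmName"], params["Threshold"])
```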

27
MCQhard

A company runs a real-time inference endpoint with an auto-scaling policy based on average CPU utilization. During a traffic spike, the endpoint scales out but takes several minutes to become healthy, causing increased latency. The endpoint uses a large instance type. Which change would MOST effectively reduce the time to scale out?

A.Switch to a smaller instance type.
B.Use a pre-warmed endpoint with a target tracking scaling policy.
C.Enable SageMaker Inference Recommender to optimize instance type.
D.Implement a canary deployment with a blue/green strategy.
E.Set a lower scaling cooldown period.
AnswerB

Correct. Pre-warmed endpoints keep a minimum number of instances ready, and target tracking proactively scales based on metrics.

Why this answer

Pre-warming the endpoint with a target tracking scaling policy maintains a baseline of ready instances, reducing cold start time.

28
MCQmedium

A financial services company is deploying a model for loan approval. They must ensure that the model's predictions do not show bias against protected groups. They plan to monitor for bias drift after deployment. Which SageMaker feature should they use?

A.SageMaker Model Monitor with data quality monitoring.
B.SageMaker Debugger to capture tensors.
C.SageMaker Ground Truth for fairness labels.
D.SageMaker Clarify with bias drift detection.
AnswerD

Clarify can detect bias in predictions and attributes.

Why this answer

Option D is correct because SageMaker Clarify can detect bias and bias drift in predictions. Option A (Model Monitor with data quality monitoring) focuses on data quality, not bias. Option B (Debugger) is for debugging training. Option C (Ground Truth) is for labeling.

29
MCQhard

A team is deploying a real-time inference endpoint in SageMaker. The model requires access to an S3 bucket containing customer data, which is encrypted with SSE-KMS. The team needs to ensure that the endpoint can decrypt the data. Which IAM role configuration is necessary?

A.Add kms:GenerateDataKey permission to the SageMaker execution role.
B.Attach a policy to the S3 bucket granting s3:GetObject to the KMS key.
C.Add kms:Decrypt permission to the SageMaker execution role for the specific KMS key.
D.Configure the endpoint to assume the S3 bucket's IAM role.
AnswerC

The execution role must be allowed to decrypt using the customer-managed key.

Why this answer

Option C is correct because the SageMaker execution role must have kms:Decrypt permission on the specific KMS key to decrypt the S3 objects. Option A is insufficient because kms:GenerateDataKey is used when encrypting new objects; reading existing objects requires kms:Decrypt. Option B is wrong because the execution role needs the decrypt permission; a bucket policy granting s3:GetObject to the KMS key does not provide it. Option D is incorrect because SageMaker does not assume a role from S3.
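A minimal sketch of the kind of policy statement this answer describes, built as a Python dictionary and serialized to JSON. The key ARN and account ID are placeholders, and a real execution role would also need s3:GetObject on the bucket.

```python
import json

# Placeholder ARN; substitute the customer-managed key that encrypts the bucket.
KEY_ARN = "arn:aws:kms:us-east-1:111122223333:key/EXAMPLE-KEY-ID"

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["kms:Decrypt"],
            # Scope the permission to the one key rather than "*".
            "Resource": KEY_ARN,
        }
    ],
}
policy_json = json.dumps(policy, indent=2)
```

Scoping Resource to a single key ARN keeps the role from decrypting data protected by other keys in the account.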

30
MCQeasy

Refer to the exhibit. A data scientist reviews the CloudWatch Logs from an Amazon SageMaker real-time endpoint. What is the MOST likely root cause of the NaN output?

A.The model weights became corrupted due to a disk write error.
B.The input data contains out-of-range values not seen during training, causing the model to output NaN.
C.The endpoint is overloaded and returning a default NaN response.
D.The model artifact failed to load correctly, resulting in NaN weights.
AnswerB

Unusual input values can lead to numerical instability.

Why this answer

Option B is correct. The unusual input value (-9999.0) suggests data drift or out-of-range input that could cause the model to produce NaN. Option A is wrong because there is no memory error.

Option C is wrong because no latency issue is indicated. Option D is wrong because the log shows the error during inference, not during model loading.

31
MCQhard

A financial services company has deployed a machine learning model using Amazon SageMaker to predict loan default risk. The model is hosted on a real-time endpoint and uses a SageMaker Model Monitor schedule to check for data drift every hour. The monitoring schedule has been running for a month without issues. Starting last week, the data science team noticed that the endpoint's invocation latency has increased by 300% and error rates have spiked to 5% from a baseline of 0.1%. The team suspects the model is receiving out-of-distribution data that is causing longer processing times and occasional timeouts. They have active CloudWatch alarms on latency and error rates but no alarms on data drift. The Model Monitor schedule shows no failures in its status. The team needs to quickly identify whether data drift is the root cause and take corrective action. Which course of action should the team take to diagnose and address the issue?

A.Retrain the model using the latest training data from the last month and deploy a new endpoint to replace the current one.
B.Use the Model Monitor's built-in baseline drift analysis on the captured inference data stored in Amazon S3, and run an Amazon CloudWatch Logs Insights query on the endpoint logs to identify specific input features that have changed distribution.
C.Increase the endpoint's instance count and enable auto-scaling to handle the increased latency and errors.
D.Enable SageMaker Debugger on the endpoint to capture inference tensors and compare them to training tensor distributions.
AnswerB

This directly analyzes for data drift using the already-captured data and logs, enabling precise diagnosis.

Why this answer

Option B is correct because analyzing the captured inference data against the baseline using Model Monitor's built-in drift analysis will directly determine if data drift exists, and the Logs Insights query can pinpoint which features have changed. Option A is wrong because retraining without diagnosing wastes resources if drift is not the cause. Option C is wrong because increasing endpoint capacity addresses symptoms but not the root cause, and may not fix errors due to out-of-distribution data. Option D is wrong because SageMaker Debugger is for training-time debugging, not for inference data drift.

32
Multi-Selectmedium

A machine learning team is building a CI/CD pipeline for model deployment using Amazon SageMaker. They need to ensure that all model artifacts are encrypted at rest and in transit, and that access to the models is controlled via IAM. Which TWO actions should the team take to meet these requirements? (Choose TWO.)

Select 2 answers
A.Set the SageMaker model's 'EnableNetworkIsolation' parameter to true
B.Enable default encryption on the S3 bucket that stores model artifacts
C.Enable AWS CloudTrail to log all API calls to SageMaker
D.Configure the SageMaker notebook instance to use a KMS key for encryption
E.Use HTTPS endpoints for invoking the SageMaker model
AnswersD, E

KMS encrypts data at rest in SageMaker.

Why this answer

Option D is correct because configuring a SageMaker notebook instance to use a KMS key ensures that data at rest on the notebook's storage volume (e.g., EBS) is encrypted. This directly addresses the requirement for encryption at rest for model artifacts during development. Option E is correct because using HTTPS endpoints for invoking the SageMaker model ensures encryption in transit via TLS, protecting data as it moves between clients and the model endpoint.

Exam trap

The trap here is that candidates often confuse network isolation (Option A) with encryption or access control, or they assume S3 default encryption (Option B) alone satisfies all encryption requirements, ignoring the need for encryption in transit and IAM-based access control for the models themselves.

33
Multi-Selecteasy

A data scientist wants to monitor a deployed model for performance degradation. Which TWO metrics from Amazon CloudWatch should they use to detect issues? (Select two.)

Select 2 answers
A.ModelQuality
B.ModelLatency
C.CpuUtilization
D.Invocation5XXErrors
E.InvocationCount
AnswersB, D

Increased model latency can indicate performance degradation due to inefficient code or resource pressure.

Why this answer

Options B and D are correct. B: ModelLatency can indicate if the model is slowing down due to code changes or resource constraints. D: Invocation5XXErrors indicate server-side failures that may signal degradation.

Option A (ModelQuality) is not a standard CloudWatch metric. Option C (CpuUtilization) is about instance health, not model performance. Option E (InvocationCount) is about volume, not quality.

34
MCQhard

Refer to the exhibit. An IAM policy is attached to a user to allow invoking a SageMaker endpoint. A developer tries to call the endpoint from a laptop with IP 203.0.113.5 and receives an access denied error. What is the most likely reason?

A.The resource ARN is incorrect.
B.The condition restricts the IP address to the 10.0.0.0/8 range.
C.The user does not have permission to assume the SageMaker role.
D.The policy does not include access to the API action.
AnswerB

The condition enforces that source IP must be in 10.0.0.0/8, but the laptop IP is not.

Why this answer

The policy includes a condition that restricts the source IP to the 10.0.0.0/8 range. The developer's laptop IP (203.0.113.5) is outside this range, causing access denied.
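The CIDR comparison behind this answer can be checked with Python's standard ipaddress module. This only mimics the range match; it is not how IAM evaluates conditions, and the exhibit's policy is not reproduced here.

```python
import ipaddress

allowed_range = ipaddress.ip_network("10.0.0.0/8")   # range named in the condition
laptop_ip = ipaddress.ip_address("203.0.113.5")      # developer's source IP

# True only when the address falls inside the allowed CIDR block.
is_allowed = laptop_ip in allowed_range
```

Here is_allowed evaluates to False, which matches the access denied result described in the question.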

35
Multi-Selectmedium

An ML engineer is setting up monitoring for a SageMaker endpoint. Which THREE metrics should be monitored to detect performance issues? (Select THREE.)

Select 3 answers
A.Model latency
B.Invocations per second
C.CPUUtilization
D.MemoryUtilization
E.DiskWriteBytes
AnswersA, C, D

High latency indicates performance degradation.

Why this answer

Model latency is a critical metric for detecting performance issues in a SageMaker endpoint because it directly measures the time taken to process inference requests. High latency can indicate resource bottlenecks, model inefficiency, or scaling problems, and it is essential for meeting service-level agreements (SLAs). Monitoring latency helps identify when the endpoint is underprovisioned or when the model itself has degraded in performance.

Exam trap

The trap here is that candidates often confuse throughput metrics (like invocations per second) with performance health indicators, but the question specifically asks for metrics that detect performance issues, not just operational statistics.

36
Multi-Selecteasy

A company is using Amazon SageMaker to host a real-time inference endpoint. They want to restrict access to the endpoint to only a specific VPC and require authentication using AWS IAM. Which TWO configuration steps should they take to achieve this? (Choose TWO.)

Select 2 answers
A.Configure the endpoint to be deployed in a private subnet within the VPC
B.Enable IAM-based authentication for the endpoint
C.Attach a resource-based policy to the endpoint that denies all traffic except from the VPC
D.Place the endpoint behind Amazon CloudFront to act as a proxy
E.Use a public subnet and configure a security group to allow only the company's IP range
AnswersA, B

Private subnet restricts traffic to within the VPC.

Why this answer

Option A is correct because deploying the SageMaker endpoint in a private subnet within the VPC ensures that the endpoint is not publicly accessible and can only be reached from within that VPC. This is achieved by using a VPC interface endpoint (AWS PrivateLink) or by placing the endpoint directly in the VPC, which restricts network traffic to the VPC boundary.

Exam trap

The trap here is that candidates often confuse resource-based policies (like S3 bucket policies) with SageMaker endpoint capabilities, or assume that a security group alone can enforce VPC-only access, when in fact SageMaker requires explicit VPC configuration via PrivateLink or subnet placement.

37
MCQmedium

A company uses Amazon SageMaker to host a real-time inference endpoint for a fraud detection model. The endpoint is deployed with three instances of ml.m5.large. The model processes each request in about 200 ms. Lately, users report occasional timeouts (requests taking >5 seconds). The team suspects model drift or data skew. What is the MOST likely cause and solution?

A.The instances are under-provisioned; switch to ml.m5.xlarge instances.
B.A recent change increased the average input size, causing longer inference time; investigate input preprocessing.
C.The endpoint is experiencing too many concurrent requests; add more instances.
D.Model drift caused the model to become computationally heavier; retrain the model.
AnswerB

Larger inputs can increase inference latency significantly.

Why this answer

Option B is correct because the symptom of occasional timeouts (>5 seconds) on a model that normally processes requests in ~200 ms suggests that a recent change in input data characteristics (e.g., larger payloads or more complex features) is causing sporadic latency spikes. Investigating input preprocessing can identify if data skew or increased input size is overwhelming the model's inference path, which is a common monitoring concern in SageMaker real-time endpoints.

Exam trap

The trap here is that candidates confuse model drift (accuracy degradation) with performance degradation (latency increase), leading them to choose retraining (Option D) instead of investigating input preprocessing changes.

How to eliminate wrong answers

Option A is wrong because switching to ml.m5.xlarge instances would increase compute capacity but does not address the root cause of sporadic timeouts tied to input size changes; under-provisioning would cause consistent high latency, not occasional spikes. Option C is wrong because adding more instances helps with concurrency but not with per-request latency; if the model itself takes longer due to larger inputs, more instances won't reduce the inference time for a single request. Option D is wrong because model drift refers to degradation in prediction accuracy over time, not to an increase in computational heaviness; retraining would not fix latency caused by input preprocessing changes.

38
Multi-Selectmedium

A data science team detects that a deployed model's prediction accuracy is degrading over time due to concept drift. They need to implement a retraining strategy. Which THREE actions are recommended best practices for handling concept drift?

Select 3 answers
A.Automatically roll back to a previous model version upon drift detection.
B.Monitor prediction quality using ground truth labels when available.
C.Retrain the model on a fixed schedule regardless of performance.
D.Incrementally update the model with new data using SageMaker Pipelines.
E.Use SageMaker Model Monitor to detect drift and trigger retraining.
AnswersB, D, E

Correct. Ground truth labels enable direct accuracy monitoring.

Why this answer

Monitoring prediction quality, using drift detection to trigger retraining, and incrementally updating the model are key practices.

39
MCQhard

A company's ML pipeline runs in multiple AWS accounts (dev, test, prod). They want to enforce that only approved models from a central Model Registry can be deployed to the production account. Which combination of services is MOST appropriate to implement this governance?

A.AWS Config, Amazon GuardDuty, and AWS Security Hub.
B.Amazon API Gateway, AWS Step Functions, and Amazon DynamoDB.
C.AWS Service Catalog, AWS KMS, and AWS CloudTrail.
D.AWS Organizations with SCPs, AWS CodePipeline with cross-account actions, and SageMaker Model Registry with approval status.
E.AWS CloudFormation StackSets, Amazon EventBridge, and AWS Lambda.
AnswerD

Correct. SCPs enforce policies, CodePipeline orchestrates deployment, and Model Registry ensures only approved models are deployed.

Why this answer

AWS Organizations SCPs restrict actions, CodePipeline automates cross-account deployment, and Model Registry provides approval gates.

40
MCQeasy

A startup is using SageMaker to train a deep learning model. They use GPU instances for training. The training job takes about 8 hours. The team notices that sometimes the training job fails with an error message indicating that the instance was terminated due to Amazon EBS volume underprovisioned. The team is using the default EBS volume size for the training instance. They want to avoid this error without over-provisioning. What should they do?

A.Mount an Amazon EFS file system to the training instance and store all data there.
B.Switch to compute-optimized (C5) instances to reduce storage usage.
C.Specify a larger EBS volume size in the training job's resource configuration.
D.Configure the training job to use Amazon FSx for Lustre as a scratch file system.
AnswerC

Increasing the volume size ensures sufficient space for data and checkpoints.

Why this answer

Option C is correct because increasing the EBS volume size to accommodate the dataset and intermediate checkpoint files prevents the volume-full error. Option A (Amazon EFS) is a file system that may add latency and requires a separate mount rather than being directly attached to the training instance. Option B (compute-optimized instances) does not fix storage. Option D (FSx for Lustre) is high-performance but complex and overkill; it also requires separate setup.
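A small sketch of sizing the volume without over-provisioning, then placing the result in the ResourceConfig structure that the CreateTrainingJob API accepts. The helper name, the 20% headroom, the dataset and checkpoint sizes, and the instance type are all assumptions for illustration.

```python
import math

def required_volume_gb(dataset_gb, checkpoint_gb, headroom_pct=20):
    """Dataset plus checkpoints, padded by a percentage of headroom."""
    total = dataset_gb + checkpoint_gb
    return math.ceil(total * (100 + headroom_pct) / 100)

volume_gb = required_volume_gb(dataset_gb=120, checkpoint_gb=30)

# Would be passed as ResourceConfig=... to create_training_job.
resource_config = {
    "InstanceType": "ml.p3.2xlarge",  # illustrative GPU instance type
    "InstanceCount": 1,
    "VolumeSizeInGB": volume_gb,
}
```

Deriving the size from measured dataset and checkpoint footprints avoids both the underprovisioning failure and paying for a much larger volume than the job needs.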

41
MCQmedium

Refer to the exhibit. A data engineer investigates why a SageMaker endpoint is returning errors. The endpoint configuration has been updated to point to a new model version. What is the MOST likely cause of the error?

A.The endpoint instance type is insufficient.
B.The container image for the new model is not compatible.
C.The IAM role does not have permission to invoke the endpoint.
D.The endpoint is still using the previous configuration.
E.The new model artifact is not properly uploaded to S3.
AnswerD

Correct. The endpoint configuration likely still points to an older model name that does not exist.

Why this answer

The error indicates the model 'my-model-v2' does not exist in SageMaker, suggesting the endpoint configuration still references an older model that was not updated correctly.

42
Multi-Selectmedium

A team deploys a machine learning model using an Amazon SageMaker endpoint. They need to monitor for data drift and model quality issues. Which AWS services or features should they use? (Choose THREE.)

Select 3 answers
A.AWS Glue DataBrew
B.Amazon SageMaker Clarify
C.Amazon SageMaker Ground Truth
D.Amazon CloudWatch Logs and Metrics
E.Amazon SageMaker Model Monitor
AnswersB, D, E

Provides bias and explainability monitoring.

Why this answer

Options B, D, and E are correct. B: SageMaker Clarify can monitor bias and feature attribution drift. D: Amazon CloudWatch can collect custom metrics and set alarms, and is used with Model Monitor. E: SageMaker Model Monitor can monitor data drift and model quality.

Option A is wrong because AWS Glue DataBrew is for data preparation, not monitoring deployed models. Option C is wrong because SageMaker Ground Truth is for labeling, not monitoring.

43
MCQhard

A company is using a SageMaker notebook instance to develop models. The security team requires that all data in the notebook be encrypted at rest and in transit, and that internet access be restricted. Which configuration meets these requirements?

A.Use a notebook with internet access enabled but attach a security group that blocks all outbound traffic.
B.Use a notebook with a public subnet and a network ACL that denies all inbound traffic.
C.Use a VPC-only notebook with default AWS managed key for EBS encryption.
D.Use a VPC-only notebook instance with a customer-managed KMS key and disable direct internet access.
AnswerD

VPC-only blocks internet, KMS encrypts at rest, HTTPS encrypts in transit.

Why this answer

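Option D is correct because a VPC-only notebook with a customer-managed KMS key and direct internet access disabled encrypts data at rest and blocks internet access. HTTPS covers encryption in transit. Option A leaves internet access enabled and relies only on a security group. Option B places the notebook in a public subnet, and a network ACL on inbound traffic does not restrict outbound internet access. Option C relies on the default AWS managed key and does not state that direct internet access is disabled.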

44
MCQhard

A company deploys a machine learning model as a SageMaker real-time endpoint. They need to implement a mechanism to automatically roll back to the previous model version if performance degrades after a deployment. Which approach should they use?

A.Manually update the endpoint to point to the previous model version
B.Configure the SageMaker endpoint deployment with traffic shifting and set up CloudWatch alarms to trigger automatic rollback
C.Create multiple endpoints and use Amazon Route 53 weighted routing to shift traffic
D.Use AWS CodeDeploy with Amazon EC2 instances behind an Elastic Load Balancer
AnswerB

SageMaker supports canary or linear traffic shifting with automatic rollback based on CloudWatch alarms.

Why this answer

Option B is correct because SageMaker endpoint update with traffic shifting and automatic rollback based on CloudWatch alarms can be configured. Option A requires manual intervention. Option C is complex and not native.

Option D is for EC2, not SageMaker.
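A sketch of what a canary deployment with automatic rollback might look like as the DeploymentConfig argument to the UpdateEndpoint API. The alarm name, canary percentage, and timing values are illustrative assumptions, not recommended settings.

```python
# Would be passed as DeploymentConfig=... to update_endpoint.
deployment_config = {
    "BlueGreenUpdatePolicy": {
        "TrafficRoutingConfiguration": {
            "Type": "CANARY",
            # Shift 10% of capacity to the new fleet first.
            "CanarySize": {"Type": "CAPACITY_PERCENT", "Value": 10},
            # Bake time before shifting the remaining traffic.
            "WaitIntervalInSeconds": 300,
        },
        "TerminationWaitInSeconds": 300,
        "MaximumExecutionTimeoutInSeconds": 1800,
    },
    "AutoRollbackConfiguration": {
        # Hypothetical CloudWatch alarm; if it fires, traffic returns to the old fleet.
        "Alarms": [{"AlarmName": "endpoint-5xx-errors"}],
    },
}
```

The alarm must already exist in CloudWatch; SageMaker watches it during the bake period and reverts to the previous fleet if it enters the ALARM state.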

45
MCQmedium

After deploying a model to a SageMaker endpoint, the operations team notices high inference latency. They suspect it is due to insufficient instance capacity. Which first step should they take to diagnose the issue?

A.Check AWS CloudTrail logs for API errors.
B.Use Amazon SageMaker Debugger to analyze inference performance.
C.Review Amazon CloudWatch metrics for the endpoint, such as CPUUtilization and Invocations.
D.Retrain the model with more training data.
AnswerC

CloudWatch metrics can indicate resource saturation and latency.

Why this answer

Option C is correct because CloudWatch metrics like Invocations, ModelLatency, and CPUUtilization can help identify if the endpoint is overloaded. Option A (CloudTrail) does not provide performance metrics. Option B (SageMaker Debugger) is for training debugging, not inference. Option D (retrain) doesn't address capacity.
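A sketch of the request parameters one might pass to CloudWatch's GetMetricStatistics call for these metrics. The endpoint and variant names are placeholders, and the helper function is hypothetical. Invocation metrics are published under the AWS/SageMaker namespace, while instance metrics such as CPUUtilization are published under /aws/sagemaker/Endpoints.

```python
from datetime import datetime, timedelta, timezone

def metric_query(namespace, metric, statistic, endpoint, variant, hours=1):
    """Build keyword arguments for cloudwatch.get_metric_statistics."""
    end = datetime.now(timezone.utc)
    return {
        "Namespace": namespace,
        "MetricName": metric,
        "Dimensions": [
            {"Name": "EndpointName", "Value": endpoint},
            {"Name": "VariantName", "Value": variant},
        ],
        "StartTime": end - timedelta(hours=hours),
        "EndTime": end,
        "Period": 60,  # one-minute buckets
        "Statistics": [statistic],
    }

# Placeholder endpoint and variant names.
queries = [
    metric_query("AWS/SageMaker", "Invocations", "Sum", "my-endpoint", "AllTraffic"),
    metric_query("AWS/SageMaker", "ModelLatency", "Average", "my-endpoint", "AllTraffic"),
    metric_query("/aws/sagemaker/Endpoints", "CPUUtilization", "Average", "my-endpoint", "AllTraffic"),
]
```

Each dictionary can be unpacked into a boto3 CloudWatch client call; comparing the three series over the same window shows whether latency rises together with invocations and CPU load.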

46
MCQhard

A company's SageMaker real-time endpoint is experiencing high latency under load. The CloudWatch metrics show that the ModelLatency is acceptable, but the OverheadLatency is spiking. What is the most likely cause?

A.The request payload size is too large.
B.The SageMaker endpoint is not in the same VPC as the client.
C.The endpoint is under-provisioned with insufficient instance count.
D.The model inference code is inefficient.
AnswerC

When the endpoint is under-provisioned, SageMaker overhead increases due to queuing and container startup, spiking OverheadLatency.

Why this answer

Option C is correct because OverheadLatency covers the time SageMaker itself adds around the model call, which grows when the endpoint has too few instances for the load. Option A (large payloads) would raise latency on every request rather than only under load. Option B (client in a different VPC) would affect network latency measured at the client, not OverheadLatency. Option D (inefficient inference code) would affect ModelLatency.

47
MCQhard

A company operates an e-commerce platform that uses a machine learning model to recommend products to users. The model is deployed on an Amazon SageMaker endpoint with automatic scaling enabled based on average CPU utilization. The model was trained on historical data and is updated weekly. Recently, the platform experienced a flash sale event that caused a sudden spike in traffic. During the event, the endpoint's latency increased dramatically, and many requests timed out. After the event, the team reviews the CloudWatch metrics and notices that the CPU utilization never exceeded 70%, and the scaling policy was triggered but instances took several minutes to become available. The team wants to prevent similar issues in future flash sales. Which course of action would be MOST effective?

A.Use predictive scaling based on historical traffic patterns.
B.Lower the CPU utilization threshold for the scaling policy to 40%.
C.Switch to larger instance types to handle higher CPU loads.
D.Implement scheduled scaling to add capacity ahead of known flash sales.
AnswerD

Scheduled scaling pre-warms instances, avoiding cold start delays.

Why this answer

Option D is correct because scheduled scaling allows you to proactively add capacity ahead of known traffic events like flash sales, eliminating the cold-start delay that occurs when reactive scaling policies (like those based on CPU utilization) must launch new instances. During the flash sale, the scaling policy was triggered but instances took minutes to become available, causing timeouts; scheduled scaling pre-warms the endpoint by adjusting the desired instance count before the traffic spike hits.

Exam trap

The trap here is that candidates assume reactive scaling (lowering thresholds or using predictive scaling) can handle sudden spikes, but the exam tests your understanding that provisioning latency is the bottleneck, and only proactive scheduled scaling can eliminate that delay for known events.

How to eliminate wrong answers

Option A is wrong because predictive scaling relies on historical traffic patterns to forecast future demand, but a flash sale is an irregular, planned event that may not follow those patterns, and predictive scaling still involves a delay in provisioning instances. Option B is wrong because lowering the CPU threshold to 40% would cause the scaling policy to trigger earlier, but it does not address the fundamental issue that new instances take several minutes to become available (cold-start latency), so requests would still time out during that provisioning window. Option C is wrong because switching to larger instance types increases the per-instance capacity but does not eliminate the cold-start delay when scaling out; during a sudden spike, even larger instances would eventually be overwhelmed if the scaling action itself is too slow.
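A sketch of a scheduled action for Application Auto Scaling's PutScheduledAction API that raises capacity ahead of a known event. The endpoint name, variant name, schedule timestamp, and capacity values are illustrative assumptions.

```python
def variant_resource_id(endpoint_name, variant_name):
    """Resource ID format Application Auto Scaling uses for endpoint variants."""
    return f"endpoint/{endpoint_name}/variant/{variant_name}"

# Would be unpacked into application-autoscaling put_scheduled_action.
scheduled_action = {
    "ServiceNamespace": "sagemaker",
    "ScheduledActionName": "flash-sale-prewarm",
    "ResourceId": variant_resource_id("recs-endpoint", "AllTraffic"),
    "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
    # One-time schedule (UTC), set well before the sale starts.
    "Schedule": "at(2030-11-29T08:00:00)",
    # Raising MinCapacity forces instances to launch before traffic arrives.
    "ScalableTargetAction": {"MinCapacity": 10, "MaxCapacity": 20},
}
```

A second scheduled action after the event can lower MinCapacity again so the endpoint does not stay over-provisioned.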

48
MCQeasy

A company requires that all SageMaker notebook instances be created within a private VPC without internet access. Which configuration step is mandatory?

A.Use a SageMaker Studio notebook instead.
B.Configure VPC settings when creating the notebook instance, choosing a private subnet.
C.Enable SageMaker direct internet access.
D.Assign a public IP to the notebook instance.
AnswerB

Selecting a private subnet ensures the notebook instance is launched in the VPC without a public IP, fulfilling the requirement.

Why this answer

Option B is correct because you must choose a private subnet when creating the notebook instance. Option A does not address the requirement, which concerns notebook instances. Option C would enable internet access. Option D would assign a public IP, which would provide internet access.

49
MCQmedium

A data science team deployed a model on Amazon SageMaker and enabled Model Monitor to detect data drift. After a week, they receive alerts indicating that the distribution of a key feature has shifted significantly. However, the model's accuracy on the recent production data remains high. Which action should the team take next?

A.Disable the data drift alert since accuracy is not affected.
B.Increase the sample size for monitoring to reduce false positives.
C.Retrain the model immediately because data drift always degrades performance.
D.Investigate the root cause of the drift as it may be benign or may lead to future degradation.
AnswerD

Investigating helps understand if the drift is meaningful; it could be benign or a leading indicator of future issues.

Why this answer

Option D is correct because data drift does not always immediately impact accuracy; it's important to investigate. Option A is wrong because drift can become problematic later. Option B is wrong because sample size is not the issue; the drift is real. Option C is wrong because retraining without investigation may be wasteful.

50
MCQeasy

A data science team deploys a regression model using Amazon SageMaker. After one week, the model's prediction accuracy drops significantly. The team needs to detect this degradation automatically and trigger retraining. Which AWS service should they use to monitor the model's performance over time and set up alerts?

A.AWS CloudWatch
B.Amazon SageMaker Model Monitor
C.Amazon Inspector
D.AWS Config
AnswerB

SageMaker Model Monitor tracks model quality metrics and can trigger retraining.

Why this answer

Amazon SageMaker Model Monitor is the correct choice because it is purpose-built to continuously monitor machine learning models deployed on SageMaker endpoints for data drift, feature attribution drift, and prediction quality degradation. It automatically compares live inference data against a baseline, triggers alerts when performance drops, and can be configured to initiate retraining pipelines via AWS Lambda or Step Functions, directly addressing the need to detect accuracy degradation and trigger retraining.

Exam trap

The trap here is that candidates often confuse general-purpose monitoring services like CloudWatch with model-specific monitoring tools, overlooking that SageMaker Model Monitor provides built-in drift detection and retraining triggers tailored for ML models, whereas CloudWatch requires extensive custom scripting to achieve the same functionality.

How to eliminate wrong answers

Option A is wrong because AWS CloudWatch is a general-purpose monitoring service for metrics, logs, and alarms, but it lacks native capabilities to detect model-specific degradation like data drift or prediction accuracy drop without custom code and manual baseline setup. Option C is wrong because Amazon Inspector is a vulnerability management service that scans workloads for software vulnerabilities and unintended network exposure, not for monitoring ML model performance or triggering retraining. Option D is wrong because AWS Config is a service for evaluating, auditing, and recording changes to AWS resource configurations, not for monitoring model prediction accuracy or detecting performance degradation over time.

51
MCQhard

A company wants to restrict access to a SageMaker notebook instance so that only a specific IAM role can open the notebook via JupyterLab. The notebook instance is associated with a lifecycle configuration that installs custom packages. What is the correct way to enforce access control?

A.Set the notebook instance's Direct Internet Access to disabled and use IAM authentication.
B.Grant the specific IAM role permission to call sagemaker:CreatePresignedNotebookInstanceUrl on that notebook instance.
C.Use AWS Systems Manager to proxy SSH access, then use IAM permission.
D.Configure the notebook instance to use a VPC and restrict access via security groups.
AnswerB

This action generates a presigned URL for accessing the notebook, and restricting it to the role enforces access control.

Why this answer

The specific IAM role should be granted permission to call the `sagemaker:CreatePresignedNotebookInstanceUrl` action for that notebook instance ARN. Other options like VPC or Systems Manager do not control who can open the notebook.

52
MCQmedium

A healthcare company is subject to HIPAA and uses SageMaker to train models on patient data. The data is stored in an S3 bucket with server-side encryption using a customer-managed KMS key. The training job uses a custom Docker container that needs to read the data. The security team is concerned about unauthorized access to the data during training. They want to ensure that only the specific training job can access the decryption key. The training runs in a VPC. What should they do?

A.Configure the training job to run in a VPC with an S3 VPC endpoint and attach an IAM role that has kms:Decrypt permission only for the key and only for that role.
B.Place the training job in a private subnet and use a NAT gateway for S3 access.
C.Configure the S3 bucket policy to allow only the SageMaker training role's ARN.
D.Use S3 access points with a policy that restricts access to the training job's IP address.
AnswerA

VPC endpoint keeps traffic in AWS network, and fine-grained IAM ensures only the job can decrypt.

Why this answer

Option A is correct because running the training job in a VPC with an S3 VPC endpoint keeps data traffic within the AWS network, and restricting kms:Decrypt on the key to the training role ensures only that job can decrypt the data. Option B (private subnet with a NAT gateway) provides S3 access but does nothing to restrict use of the key. Option C (bucket policy) alone is not enough. Option D (S3 access points) is not the primary security measure.

53
MCQmedium

A financial services company uses an Amazon SageMaker endpoint for real-time credit scoring. The endpoint is deployed with an ml.c5.2xlarge instance. Recently, the data science team has received complaints from users about slow response times. The team monitors the endpoint using CloudWatch metrics. They observe that the InvocationsPerSecond metric averages 50, the ModelLatency metric averages 200 milliseconds, and the CPUUtilization metric averages 95%. The team has also noticed that the endpoint occasionally returns HTTP 503 (Service Unavailable) errors during peak hours. The team needs to reduce latency and eliminate 503 errors while minimizing cost increase. Which solution should the team implement?

A.Create a SageMaker endpoint with multiple instances behind a load balancer and configure automatic scaling based on CPUUtilization or InvocationsPerSecond
B.Enable SageMaker Data Capture to collect inference data for later analysis to identify slow requests
C.Replace the endpoint instance type with a more powerful compute-optimized instance, such as ml.c5.4xlarge
D.Increase the endpoint invocation timeout from 60 seconds to 120 seconds in the application configuration
AnswerA

Scaling out with multiple instances distributes the load, reducing latency and eliminating 503 errors. Automatic scaling adjusts the number of instances based on demand, optimizing cost.

Why this answer

CPUUtilization at 95% indicates that the instance is overloaded, causing high latency and 503 errors. Scaling out (adding more instances) will distribute the load and reduce latency, and using automatic scaling ensures that the number of instances adjusts to demand, minimizing cost by scaling down when traffic is low. Option B (enable data capture) would not help latency. Option C (larger instance) may not be as cost-effective as scaling out. Option D (increase timeout) does not address the root cause of overloading.
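A sketch of a target tracking policy for Application Auto Scaling's PutScalingPolicy API, using the predefined per-instance invocations metric. The policy name, endpoint name, target value, and cooldowns are illustrative assumptions; the variant must first be registered as a scalable target.

```python
# Would be unpacked into application-autoscaling put_scaling_policy.
scaling_policy = {
    "PolicyName": "credit-scoring-target-tracking",
    "ServiceNamespace": "sagemaker",
    "ResourceId": "endpoint/credit-scoring/variant/AllTraffic",
    "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
    "PolicyType": "TargetTrackingScaling",
    "TargetTrackingScalingPolicyConfiguration": {
        # Invocations per instance per minute to hold steady (assumed value).
        "TargetValue": 1000.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
        "ScaleOutCooldown": 60,   # react quickly to rising load
        "ScaleInCooldown": 300,   # scale in more cautiously
    },
}
```

A shorter scale-out cooldown than scale-in cooldown is a common choice: capacity is added promptly during peaks and removed only after load has clearly dropped.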

54
MCQhard

A financial services company operates a real-time inference endpoint for a fraud detection model on Amazon SageMaker. The model was trained on historical transaction data from 2023. Over the past month, the model's precision has dropped from 92% to 78%, while recall remains high at 95%. The data science team suspects data drift and has already enabled SageMaker Model Monitor with data capture and a baseline from the training data. The latest monitoring report indicates no statistically significant drift in any of the input features. The team also verified that the inference code and model artifact have not changed. Despite the stable feature distributions, the model is misclassifying an increasing number of legitimate transactions as fraudulent (false positives). The business is concerned about the impact on customer experience. What is the best course of action?

A.Replace the model with a more complex algorithm such as a gradient-boosted tree.
B.Retrain the model using the most recent 30 days of transaction data with automated retraining pipelines.
C.Increase the data capture sampling percentage from 10% to 100% for more detailed analysis.
D.Investigate recent ground truth labels to check for label drift or changes in the fraud definition.
AnswerD

When the relationship between features and labels changes (concept drift, which the option describes as label drift), feature distributions can stay stable while predictions degrade. Collecting and analyzing recent labels can confirm whether the fraud criteria have shifted.

Why this answer

The scenario describes a drop in precision without feature drift, which points to a change in the relationship between features and labels (often called concept drift; the option calls it label drift), for example a revised fraud definition. The most effective next step is to collect and analyze recent ground truth labels to confirm this. Retraining on recent data without addressing the root cause may not help if the new labels are also stale or incorrect.

Increasing data capture rate will not diagnose the issue. Changing the algorithm is unlikely to help without understanding the cause.
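
The arithmetic behind the symptom can be checked directly: precision depends on false positives, recall does not. The counts below are illustrative assumptions chosen to reproduce the percentages in the question, not data from the scenario.

```python
def precision(true_positives, false_positives):
    """Share of flagged transactions that were actually fraudulent."""
    return true_positives / (true_positives + false_positives)


def recall(true_positives, false_negatives):
    """Share of fraudulent transactions that were flagged."""
    return true_positives / (true_positives + false_negatives)


# Illustrative counts: fraud caught and missed stay constant.
tp, fn = 950, 50

precision_before = precision(tp, false_positives=83)
precision_after = precision(tp, false_positives=268)
recall_value = recall(tp, fn)

print(round(precision_before, 2), round(precision_after, 2), round(recall_value, 2))
```

With true positives and false negatives fixed, only the growth in false positives moves precision, which matches the pattern of legitimate transactions being flagged.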

55
MCQeasy

A data science team deploys a real-time inference endpoint on Amazon SageMaker. They want to monitor for data drift in the input features over time. Which AWS service should they use to capture and analyze the input data distribution?

A.Amazon Athena
B.AWS CloudTrail
C.Amazon SageMaker Model Monitor
D.Amazon CloudWatch Logs
AnswerC

Model Monitor captures input data and computes statistics to detect drift.

Why this answer

Amazon SageMaker Model Monitor is designed to detect data drift by capturing and analyzing input data distributions. CloudWatch Logs is for logging, CloudTrail for API auditing, and Athena for ad-hoc querying.
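
Model Monitor analyzes data that the endpoint captures, so capture has to be configured on the endpoint configuration first. The sketch below builds a `DataCaptureConfig` structure as accepted by the SageMaker `create_endpoint_config` API; the bucket URI is a hypothetical placeholder.

```python
def build_data_capture_config(destination_s3_uri, sampling_percentage=100):
    """Build a DataCaptureConfig dict for create_endpoint_config."""
    if not 0 <= sampling_percentage <= 100:
        raise ValueError("sampling_percentage must be between 0 and 100")
    return {
        "EnableCapture": True,
        "InitialSamplingPercentage": sampling_percentage,
        "DestinationS3Uri": destination_s3_uri,
        # Capture both the request payload and the model response.
        "CaptureOptions": [{"CaptureMode": "Input"}, {"CaptureMode": "Output"}],
    }


# Hypothetical bucket and prefix.
capture_config = build_data_capture_config("s3://my-bucket/captures/", 50)
```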

56
MCQmedium

A SageMaker training job has been running for several hours but shows no progress. The job is using a custom Docker container. The engineer suspects a bug in the training script. Which tool is BEST to debug the training job without stopping it?

A.Amazon CloudWatch Logs
B.SageMaker Local Mode
C.SageMaker Processing
D.SageMaker Debugger
E.SageMaker Profiler
AnswerD

Correct. Debugger provides real-time debugging, monitoring, and profiling capabilities.

Why this answer

SageMaker Debugger can monitor training in real-time, capture tensors, and detect anomalies without interrupting the job.

57
Multi-Selecthard

A company wants to monitor their machine learning model for bias over time. Which THREE AWS services or features can they use to achieve this? (Choose THREE.)

Select 3 answers
A.Amazon SageMaker Experiments
B.AWS CloudTrail
C.Amazon SageMaker Clarify
D.Amazon SageMaker Model Monitor
E.Amazon SageMaker Pipelines
AnswersC, D, E

Clarify can detect bias and generate bias reports.

Why this answer

Options C, D, and E are correct. SageMaker Clarify can detect bias in training data and predictions. SageMaker Model Monitor can track bias metrics over time if configured.

SageMaker Pipelines can include bias check steps for automated monitoring. Option A is for tracking experiments, not monitoring bias. Option B is for API logging.
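
To make "bias metric over time" concrete, the sketch below computes a simplified difference in positive proportions in predicted labels between two groups, in the spirit of the post-training metrics Clarify reports. The group data and the weekly framing are made-up assumptions for illustration; this is not Clarify's implementation.

```python
def positive_rate(predictions):
    """Fraction of predictions that are positive (1)."""
    return sum(predictions) / len(predictions)


def positive_proportion_gap(group_a_predictions, group_d_predictions):
    """Difference in predicted positive rates between two groups."""
    return positive_rate(group_a_predictions) - positive_rate(group_d_predictions)


# Made-up binary predictions for two groups in two monitoring windows.
gap_week_1 = positive_proportion_gap([1, 1, 0, 1], [1, 0, 1, 1])
gap_week_4 = positive_proportion_gap([1, 1, 1, 1], [1, 0, 0, 0])

print(gap_week_1, gap_week_4)
```

A gap that moves away from zero between windows is the kind of change a scheduled bias monitor would surface.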

58
MCQmedium

A company is using SageMaker endpoints for inference. To reduce costs, they want to use Automatic Scaling. However, they observe that scaling up takes several minutes, causing latency spikes during traffic bursts. What should they do to mitigate this?

A.Optimize the model to reduce inference time.
B.Use larger instance types to handle more requests per instance.
C.Configure the endpoint with a target tracking scaling policy and pre-warm additional instances during expected traffic surges.
D.Set the endpoint to scale down slowly to maintain capacity.
AnswerC

Pre-warming ensures instances are ready, minimizing cold start latency.

Why this answer

Option C is correct because pre-warming the endpoint ensures capacity is already running before the surge, so requests do not wait for new instances to launch. Option B (larger instances) might help but is not cost-effective. Option A (optimizing the model) reduces computational load but not scaling delay.

Option D (scale down slowly) addresses scale-down, not up.
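
One way to pre-warm capacity ahead of a known surge is a scheduled action that raises the minimum instance count shortly before traffic arrives. The sketch below builds parameters for the Application Auto Scaling `put_scheduled_action` call; the endpoint name, variant name, schedule, and capacities are hypothetical.

```python
def build_prewarm_action(endpoint_name, variant_name, cron_fields,
                         min_instances, max_instances):
    """Build put_scheduled_action parameters that raise capacity on a schedule."""
    return {
        "ServiceNamespace": "sagemaker",
        "ScheduledActionName": f"{endpoint_name}-prewarm",
        "ResourceId": f"endpoint/{endpoint_name}/variant/{variant_name}",
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
        # Six cron fields: minutes hours day-of-month month day-of-week year.
        "Schedule": f"cron({cron_fields})",
        "ScalableTargetAction": {
            "MinCapacity": min_instances,
            "MaxCapacity": max_instances,
        },
    }


# Hypothetical: raise the floor to 4 instances at 08:45 UTC on weekdays.
prewarm = build_prewarm_action("my-endpoint", "AllTraffic", "45 8 ? * MON-FRI *", 4, 8)
```

A second scheduled action would normally lower the minimum again after the surge so that the target tracking policy can scale in.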

59
MCQhard

A healthcare company uses Amazon SageMaker to deploy a real-time inference endpoint for a diagnostic model. The endpoint is configured with a single ml.p3.2xlarge instance. The model processes patient data and returns a risk score. Recently, the endpoint has been experiencing intermittent 504 errors along with increased latency. The team uses Amazon CloudWatch to monitor the endpoint's InvocationsPerInstance and ModelLatency metrics. They observe that InvocationsPerInstance is well below the throttling threshold, but ModelLatency shows periodic spikes lasting 5-10 seconds. The endpoint's CPU utilization remains below 60%, but memory utilization occasionally spikes to 90% during those spikes. The team has checked the inference code and found no obvious memory leaks or performance bottlenecks in the custom logic. The model itself is a deep neural network hosted using Apache MXNet. The team suspects that the issue might be related to resource contention or an external dependency. What should the team do FIRST to diagnose and resolve the issue?

A.Implement request batching to increase throughput and reduce the number of inference requests.
B.Increase the instance type to a more memory-intensive instance like ml.p3.8xlarge to handle memory spikes.
C.Set up SageMaker Model Monitor to track data drift and model quality metrics.
D.Enable SageMaker Debugger rules and profiling to monitor memory and CPU utilization at a fine-grained level during inference.
AnswerD

Debugger can provide detailed profiling to pinpoint resource contention or memory issues in the model or framework.

Why this answer

Option D is correct because the symptoms point to possible memory contention, and enabling detailed profiling of memory and CPU can identify the root cause. Option B is wrong because increasing instance size might mask the problem without identifying it. Option A is wrong because request batching can increase memory usage and may worsen the issue.

Option C is wrong because Model Monitor is for data drift, not performance diagnostics.

60
MCQhard

A team deploys a machine learning model using a SageMaker endpoint with an ML.T4 instance. After a week, they notice that the endpoint's CPU utilization is consistently below 10% and latency is low. However, the endpoint is incurring high costs. Which action should the team take to reduce costs while maintaining the ability to serve traffic?

A.Switch to a multi-model endpoint to share instances across models
B.Reduce the number of instances to one
C.Migrate to a SageMaker Serverless Inference endpoint
D.Implement an asynchronous inference endpoint
AnswerC

Serverless endpoints scale to zero when idle, reducing cost.

Why this answer

Option C is correct because a serverless inference endpoint scales to zero when not in use, reducing cost. Option A is wrong because multi-model endpoints still have always-on instances. Option B is wrong because reducing instances may cause throttling.

Option D is wrong because asynchronous inference queues requests and targets large payloads or long-running processing, not low-latency real-time traffic.
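
A serverless endpoint is defined by putting a `ServerlessConfig` on the production variant instead of an instance type and count. The sketch below builds such a variant for `create_endpoint_config`; the model name and the sizing values are hypothetical.

```python
# Memory sizes accepted by ServerlessConfig, in MB.
ALLOWED_MEMORY_MB = (1024, 2048, 3072, 4096, 5120, 6144)


def build_serverless_variant(model_name, memory_mb=2048, max_concurrency=5):
    """Build a ProductionVariants entry for a serverless endpoint."""
    if memory_mb not in ALLOWED_MEMORY_MB:
        raise ValueError(f"memory_mb must be one of {ALLOWED_MEMORY_MB}")
    return {
        "VariantName": "AllTraffic",
        "ModelName": model_name,
        # No InstanceType or InitialInstanceCount: capacity is managed by the service.
        "ServerlessConfig": {
            "MemorySizeInMB": memory_mb,
            "MaxConcurrency": max_concurrency,
        },
    }


# Hypothetical model name.
serverless_variant = build_serverless_variant("my-model", 3072, 10)
```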

61
MCQmedium

A machine learning model is deployed on SageMaker and its predictions are used in a production application. The model's accuracy has degraded over time. What is the most likely cause?

A.The training data was not shuffled properly.
B.The model was not compiled for inference.
C.The model experienced concept drift.
D.The endpoint instance type is too small.
AnswerC

Concept drift is a common cause of accuracy degradation in production.

Why this answer

Concept drift occurs when the statistical properties of the target variable change over time, leading to accuracy degradation. Training data shuffling and instance size affect training performance and latency, not accuracy post-deployment.

62
MCQmedium

A company's SageMaker endpoint is experiencing increased latency during peak hours. The endpoint uses a single ml.m5.large instance. The deployment is critical and must maintain low latency. Which action is MOST effective to reduce latency without sacrificing cost efficiency?

A.Deploy multiple variants with A/B testing
B.Use Elastic Inference to attach an accelerator
C.Switch to a ml.c5.large instance
D.Add an auto-scaling policy based on request count
E.Enable SageMaker Model Monitor
AnswerD

Auto-scaling adjusts instance count to match demand, reducing latency during spikes while minimizing cost.

Why this answer

Adding auto-scaling based on request count allows the endpoint to handle spikes without over-provisioning, balancing cost and latency.

63
MCQeasy

A data scientist trained a model using SageMaker and wants to automate the retraining process when new data becomes available. Which AWS service is best suited to trigger a SageMaker training job based on an S3 event?

A.AWS Step Functions with a scheduled trigger.
B.Amazon Simple Workflow Service (SWF) decider.
C.Amazon EventBridge with a rule matching S3 object creation.
D.Amazon Simple Queue Service (SQS) with a polling script.
AnswerC

EventBridge can invoke a Lambda function that starts the training job.

Why this answer

Option C is correct because Amazon EventBridge can listen to S3 events (e.g., PutObject) and trigger a Lambda function to start a training job. Option D (SQS) is a queue, not a trigger for direct training jobs. Option B (SWF) is a workflow service not typically used for this simple pattern.

Option A (Step Functions) can orchestrate but is not directly triggered by S3 events without EventBridge.
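
The EventBridge side of this pattern can be sketched as an event pattern plus a small helper that a Lambda handler might use to locate the new object. The bucket name and prefix are hypothetical, and the bucket must have EventBridge notifications enabled for these events to be delivered.

```python
# Event pattern for an EventBridge rule matching new objects under a prefix.
EVENT_PATTERN = {
    "source": ["aws.s3"],
    "detail-type": ["Object Created"],
    "detail": {
        "bucket": {"name": ["my-training-bucket"]},   # hypothetical bucket
        "object": {"key": [{"prefix": "incoming/"}]},  # hypothetical prefix
    },
}


def training_input_uri(event):
    """Return the S3 URI of the object described by an 'Object Created' event."""
    detail = event["detail"]
    return f"s3://{detail['bucket']['name']}/{detail['object']['key']}"


# Trimmed sample event, for illustration only.
sample_event = {
    "source": "aws.s3",
    "detail-type": "Object Created",
    "detail": {
        "bucket": {"name": "my-training-bucket"},
        "object": {"key": "incoming/batch-001.csv"},
    },
}
input_uri = training_input_uri(sample_event)
```

The handler would then pass `input_uri` as the training input when it starts the job.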

64
Multi-Selectmedium

A data science team uses SageMaker Studio to collaborate. They need to restrict access to certain SageMaker Studio applications (e.g., only JupyterLab, no RStudio). Which THREE steps should they take? (Choose THREE.)

Select 3 answers
A.Use an S3 bucket policy to prevent access to RStudio application artifacts.
B.Attach an IAM policy to users that denies sagemaker:CreateApp for specific app types.
C.Enable CloudTrail to log user activity and monitor for prohibited app usage.
D.Define a SageMaker Studio domain-level policy that specifies allowed apps.
E.Create a custom lifecycle configuration that disables unauthorized apps.
AnswersB, D, E

IAM policies can deny creation of specific apps like RStudio.

Why this answer

Options B, D, and E are correct. A domain-level policy can restrict apps, a lifecycle configuration can enforce settings at launch, and IAM policies can limit specific applications. Option A (S3 bucket policy) doesn't control Studio apps.

Option C (CloudTrail) is for auditing, not restriction.

65
Multi-Selecteasy

A company uses SageMaker Model Monitor to detect drift. They want to receive notifications when drift is detected. Which TWO services can be used together to send notifications? (Choose TWO.)

Select 2 answers
A.Amazon SNS topics to send email or SMS.
B.Amazon EventBridge to trigger a notification.
C.AWS Lambda to process the drift and send an email via SES.
D.Amazon SQS to queue the notification.
E.Amazon CloudWatch Alarms set on the drift metric.
AnswersA, E

SNS is used for sending notifications.

Why this answer

Options A and E are correct. An Amazon CloudWatch alarm can trigger on a Model Monitor metric and publish to an SNS topic to send email or SMS. Option C (Lambda) could be used but is not a direct notification service.

Option D (SQS) is for queues. Option B (EventBridge) can route events but does not send notifications directly; it usually triggers Lambda.
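
The alarm-to-SNS wiring can be sketched as parameters for the CloudWatch `put_metric_alarm` call. The namespace, metric name pattern, and dimension names below are assumptions about how Model Monitor publishes per-feature drift metrics and should be checked against the metrics that actually appear in the account; the endpoint, schedule, feature, and topic ARN are hypothetical.

```python
def build_drift_alarm(endpoint_name, schedule_name, feature_name,
                      threshold, sns_topic_arn):
    """Build put_metric_alarm parameters that notify an SNS topic on drift."""
    return {
        "AlarmName": f"{endpoint_name}-{feature_name}-drift",
        # Assumed namespace and metric naming for Model Monitor data quality metrics.
        "Namespace": "aws/sagemaker/Endpoints/data-metrics",
        "MetricName": f"feature_baseline_drift_{feature_name}",
        "Dimensions": [
            {"Name": "Endpoint", "Value": endpoint_name},
            {"Name": "MonitoringSchedule", "Value": schedule_name},
        ],
        "Statistic": "Maximum",
        "Period": 3600,
        "EvaluationPeriods": 1,
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
        # The alarm publishes to this topic; subscribers receive email or SMS.
        "AlarmActions": [sns_topic_arn],
    }


# Hypothetical identifiers.
alarm = build_drift_alarm(
    "my-endpoint", "my-schedule", "amount", 0.1,
    "arn:aws:sns:us-east-1:111122223333:drift-alerts",
)
```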

66
MCQmedium

A team uses SageMaker Clarify to monitor bias drift in production. They schedule weekly analysis. After a month, Clarify reports a significant increase in a bias metric. What should the team do first?

A.Disable the bias monitor because the metric may be noisy.
B.Immediately retrain the model with a balanced dataset.
C.Increase the frequency of analysis to daily.
D.Review the analysis report to understand which feature and segment contributed to the drift.
AnswerD

The Clarify report provides details on which features and segments are driving the bias, guiding appropriate action.

Why this answer

Option D is correct because reviewing the report helps understand which feature and segment contributed to the drift. Option B is premature without understanding the cause. Option A is ignoring the issue.

Option C does not address the drift source.

67
MCQmedium

A data science team is using Amazon SageMaker to train and deploy a binary classification model. They want to continuously monitor the model for data drift in production. Which combination of AWS services and SageMaker features should they use to implement automated drift detection with minimal operational overhead?

A.SageMaker Debugger and Amazon SNS
B.SageMaker Pipelines and AWS Lambda
C.SageMaker Clarify and AWS Config
D.SageMaker Model Monitor and Amazon CloudWatch
AnswerD

SageMaker Model Monitor detects drift and sends metrics to CloudWatch for alerting.

Why this answer

SageMaker Model Monitor is the native SageMaker feature designed specifically for continuously monitoring deployed models for data drift, bias drift, and feature attribution drift. It automatically captures inference requests and responses, computes statistics, and publishes metrics to Amazon CloudWatch, which can trigger alarms for drift detection. This combination provides automated drift detection with minimal operational overhead because it requires no custom infrastructure or manual scheduling.

Exam trap

The trap here is that candidates confuse SageMaker Debugger (training debugging) with SageMaker Model Monitor (production drift detection), or they overcomplicate the solution by adding unnecessary services like Lambda or Config when the native integration with CloudWatch already provides automated alerting.

How to eliminate wrong answers

Option A is wrong because SageMaker Debugger is used for debugging training jobs (e.g., monitoring gradients, weights, and loss during training), not for monitoring data drift in production inference. Option B is wrong because SageMaker Pipelines is a CI/CD orchestration tool for building and managing ML workflows, not a continuous monitoring service; while AWS Lambda could be used to process drift alerts, the core drift detection capability is missing. Option C is wrong because SageMaker Clarify is designed for bias detection and explainability (SHAP values) on datasets or during training, not for real-time drift monitoring of production endpoints; AWS Config tracks resource configuration changes, not model performance or data drift.

68
MCQhard

Refer to the exhibit. A data scientist uses a SageMaker notebook instance to read a model file from S3 bucket 'my-bucket'. The bucket uses SSE-KMS encryption with a KMS key. The IAM role attached to the notebook has the above policy. However, reading the file fails. What is the MOST likely reason?

A.The resource ARN for S3 does not include the bucket itself (only objects inside).
B.The policy allows s3:GetObject only if server-side encryption is AES256, but the bucket uses KMS.
C.The condition requires encryption to be AES256, which is SSE-S3, but the bucket uses KMS.
D.The kms key ARN is incorrect.
E.The policy allows kms:Decrypt but does not allow kms:GenerateDataKey.
AnswerC

Correct. The condition checks for 'AES256' header, but SSE-KMS uses 'aws:kms', so the condition fails and access is denied.

Why this answer

The S3 condition requires server-side encryption to be AES256 (SSE-S3), but the bucket uses SSE-KMS, so the S3 request does not satisfy the condition, denying access.

69
Multi-Selecteasy

Which TWO actions are recommended best practices for securing an Amazon SageMaker notebook instance? (Select TWO.)

Select 2 answers
A.Use network ACLs to restrict API calls to the SageMaker API.
B.Enable Multi-AZ deployment for the notebook instance.
C.Use AWS KMS to encrypt the notebook instance's storage volume.
D.Associate the notebook instance with a public subnet that has an internet gateway.
E.Disable direct internet access for the notebook instance.
AnswersC, E

KMS encryption protects data at rest.

Why this answer

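Option C is correct because encrypting the notebook instance's storage volume with AWS KMS ensures data-at-rest protection, which is a fundamental security best practice. SageMaker notebook instances use Amazon EBS volumes for storage, and KMS encryption safeguards sensitive code, datasets, and model artifacts stored on that volume against unauthorized access. Option E is correct because disabling direct internet access forces the notebook's traffic through the VPC it is attached to, where security groups and VPC endpoints control what it can reach.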

Exam trap

The trap here is that candidates often confuse network-level controls (network ACLs) with API-level controls (IAM/VPC endpoints), or they mistakenly think Multi-AZ applies to all AWS services, when in fact it is specific to database and high-availability services.
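
Both recommended settings map to parameters of the SageMaker `create_notebook_instance` API. The sketch below builds such a request; every identifier (name, role ARN, subnet, security group, KMS key) is a hypothetical placeholder.

```python
def build_notebook_request(name, role_arn, subnet_id, security_group_ids, kms_key_id):
    """Build create_notebook_instance parameters with hardened settings."""
    if not subnet_id:
        # Direct internet access can only be disabled for a notebook in a VPC subnet.
        raise ValueError("subnet_id is required when direct internet access is disabled")
    return {
        "NotebookInstanceName": name,
        "InstanceType": "ml.t3.medium",       # illustrative size
        "RoleArn": role_arn,
        "SubnetId": subnet_id,
        "SecurityGroupIds": list(security_group_ids),
        "KmsKeyId": kms_key_id,               # encrypts the attached storage volume
        "DirectInternetAccess": "Disabled",   # traffic goes through the VPC
    }


# Hypothetical identifiers.
notebook_request = build_notebook_request(
    "secure-notebook",
    "arn:aws:iam::111122223333:role/NotebookRole",
    "subnet-0abc",
    ["sg-0abc"],
    "arn:aws:kms:us-east-1:111122223333:key/example-key-id",
)
```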

70
MCQeasy

A company uses Amazon Rekognition to moderate user-generated images. They want to set up a monitoring system that alerts the team if the number of inappropriate images flagged by the model exceeds a threshold. Which combination of AWS services should they use?

A.Amazon CloudWatch Logs to store inference logs and create a metric filter.
B.Amazon CloudWatch to publish custom metrics and create an alarm, and AWS Lambda to process images and publish metrics.
C.AWS Config to track resource changes and trigger an SNS notification.
D.Amazon Simple Notification Service (SNS) to send alerts when threshold is exceeded.
AnswerB

Lambda can publish custom metrics to CloudWatch, which can trigger alarms.

Why this answer

Option B is correct because AWS Lambda can process the images and publish a custom metric (the number of flagged images) to Amazon CloudWatch, and a CloudWatch alarm on that metric alerts the team. Option A is incomplete because storing logs and creating a metric filter does not by itself create an alarm.

Option D is incorrect because SNS alone does not monitor metrics. Option C is incorrect because AWS Config is for configuration tracking.
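
The Lambda step can be sketched as counting flagged images and shaping a `put_metric_data` request. The sample responses mimic the `ModerationLabels` list returned by Rekognition's `DetectModerationLabels`; the namespace, metric name, and confidence cutoff are hypothetical choices.

```python
def count_flagged(responses, min_confidence=80.0):
    """Count images with at least one moderation label above the cutoff."""
    return sum(
        1
        for response in responses
        if any(label["Confidence"] >= min_confidence
               for label in response.get("ModerationLabels", []))
    )


def build_metric_request(flagged_count):
    """Build put_metric_data parameters for the flagged-image count."""
    return {
        "Namespace": "Custom/ImageModeration",  # hypothetical namespace
        "MetricData": [
            {"MetricName": "FlaggedImages", "Value": flagged_count, "Unit": "Count"}
        ],
    }


# Made-up responses for three images.
sample_responses = [
    {"ModerationLabels": [{"Name": "Violence", "Confidence": 93.1}]},
    {"ModerationLabels": []},
    {"ModerationLabels": [{"Name": "Violence", "Confidence": 41.0}]},
]
metric_request = build_metric_request(count_flagged(sample_responses))
```

A CloudWatch alarm on `FlaggedImages` with an SNS action would then deliver the alert.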

71
MCQhard

A financial services company uses a custom container on Amazon SageMaker to serve a fraud detection model. The model's inference latency has recently increased, causing timeouts for some requests. The team reviews the SageMaker logs and finds that the container is consuming more memory than allocated. What should the team do to maintain service quality while ensuring cost-effectiveness?

A.Decrease the model's batch size to reduce memory usage
B.Increase the number of instances in the endpoint to distribute the load
C.Implement an auto-scaling policy based on memory utilization
D.Change the instance type to a memory-optimized instance, such as r5.large
AnswerD

Switching to a memory-optimized instance provides more memory per instance, resolving the issue cost-effectively.

Why this answer

The correct answer is D because the root cause is that the container is consuming more memory than allocated, leading to increased latency and timeouts. Switching to a memory-optimized instance like r5.large directly addresses the memory constraint by providing more memory per vCPU, which resolves the performance issue without over-provisioning compute resources. This approach is cost-effective because it targets the specific bottleneck (memory) rather than scaling out or changing unrelated parameters.

Exam trap

The trap here is that candidates often confuse scaling out (adding instances) with scaling up (choosing a larger instance type), and they may incorrectly assume that auto-scaling based on memory utilization will prevent timeouts, when in fact it only reacts after the problem occurs.

How to eliminate wrong answers

Option A is wrong because decreasing the batch size reduces throughput and may lower memory usage per request, but it does not fix the underlying memory allocation issue; it could also increase latency due to more frequent inference calls. Option B is wrong because increasing the number of instances distributes the load but does not solve the per-instance memory shortage; each container would still run out of memory, leading to continued timeouts and higher costs from additional instances. Option C is wrong because implementing auto-scaling based on memory utilization would only add more instances after the memory is already exhausted, causing intermittent failures and unpredictable costs; it does not prevent the memory exhaustion in the first place.

72
MCQeasy

A team wants to automatically retrain a model when new labeled data arrives. Which SageMaker feature can orchestrate this workflow?

A.SageMaker Pipelines
B.SageMaker Model Monitor
C.SageMaker Debugger
D.SageMaker Autopilot
AnswerA

Pipelines can orchestrate a retraining workflow when triggered.

Why this answer

SageMaker Pipelines is a workflow orchestration service that can automate retraining pipelines. Model Monitor detects drift, Debugger debugs training, and Autopilot automates model building.
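
Once a retraining pipeline exists, something still has to start it when data arrives. The sketch below builds parameters for the SageMaker `start_pipeline_execution` API; the pipeline name and the `InputDataUri` parameter are hypothetical and would have to match what the pipeline actually defines.

```python
def build_pipeline_execution(pipeline_name, input_data_uri):
    """Build start_pipeline_execution parameters for a retraining run."""
    return {
        "PipelineName": pipeline_name,
        "PipelineExecutionDescription": f"Retraining on {input_data_uri}",
        # Parameter names must match those declared in the pipeline definition.
        "PipelineParameters": [
            {"Name": "InputDataUri", "Value": input_data_uri},
        ],
    }


# Hypothetical pipeline and data location.
execution_request = build_pipeline_execution(
    "retraining-pipeline", "s3://my-bucket/labeled/2024-06/"
)
```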

73
MCQhard

Refer to the exhibit. A SageMaker training job using this IAM role fails with an access denied error when trying to read a file from s3://my-bucket/training-data/model_input.csv. However, a different file at s3://my-bucket/training-data/input/data.csv can be read successfully. What is the most likely reason?

A.The file model_input.csv is encrypted with a KMS key that the role does not have access to.
B.The IAM policy restricts access to objects only under the 'training-data/' prefix using an incorrect condition key.
C.The file name contains special characters that are not encoded correctly.
D.The S3 bucket has a bucket policy that denies access to the specific file.
AnswerB

The condition key 's3:prefix' is meant for list operations, not for GetObject. Access to objects under a prefix is normally scoped through the statement's Resource ARN (for example, arn:aws:s3:::my-bucket/training-data/*). This misconfiguration causes the GetObject request to not match the condition, resulting in access denied for some objects.

Why this answer

Option B is correct because the policy conditions object access on 's3:prefix'. That condition key reflects the prefix parameter supplied in a list request (for example, ListObjects); it is not evaluated against the key of the object being fetched, so it does not reliably restrict GetObject.

To limit GetObject to a prefix, scope the statement's Resource ARN instead (for example, arn:aws:s3:::my-bucket/training-data/*). The policy condition is therefore incorrectly applied, which is the most likely reason for the failure.

Option A (KMS) is less likely because a missing KMS permission would most likely affect both files, and the other file can be read. Option D (bucket policy) would require an explicit deny on that specific object, which the scenario does not mention.

Option C (special characters) is unrelated because the file name model_input.csv contains none.

74
MCQeasy

A company has a SageMaker endpoint that uses a trained model to classify images. The endpoint is experiencing high latency and the team suspects it is due to the model size. Which action can the team take to reduce latency without significantly impacting accuracy?

A.Switch to a compute-optimized instance type
B.Use SageMaker Neo to compile the model for the target instance
C.Reduce the batch size of inference requests
D.Convert the model to ONNX format
AnswerB

Neo optimizes model inference for specific hardware, reducing latency.

Why this answer

SageMaker Neo compiles trained models into an optimized binary for the target hardware, applying techniques like operator fusion, memory layout optimization, and quantization. This reduces model size and inference latency while preserving accuracy, making it the correct choice for addressing high latency caused by model size.

Exam trap

AWS often tests the misconception that converting to an open format like ONNX inherently optimizes performance, when in reality it is just a serialization format and requires a separate compilation step (e.g., Neo) to reduce latency.

How to eliminate wrong answers

Option A is wrong because switching to a compute-optimized instance (e.g., c5) may improve CPU-bound processing but does not reduce model size or memory footprint; the latency issue stems from the model itself, not insufficient compute. Option C is wrong because reducing batch size can lower throughput and increase per-request overhead, potentially worsening latency; it does not address the root cause of model size. Option D is wrong because converting to ONNX format alone does not guarantee latency reduction; ONNX is an interchange format that requires a compatible runtime (e.g., ONNX Runtime) and may still need optimization like Neo to achieve performance gains.
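
A Neo compilation is requested per model and target device. The sketch below builds parameters for the SageMaker `create_compilation_job` API; the job name, role ARN, S3 locations, framework, input shape, and target device are hypothetical and would need to match the real model.

```python
import json


def build_compilation_job(job_name, role_arn, model_s3_uri, output_s3_uri,
                          framework, input_shape, target_device):
    """Build create_compilation_job parameters for one model and target."""
    return {
        "CompilationJobName": job_name,
        "RoleArn": role_arn,
        "InputConfig": {
            "S3Uri": model_s3_uri,
            # DataInputConfig is a JSON string mapping input names to shapes.
            "DataInputConfig": json.dumps(input_shape),
            "Framework": framework,
        },
        "OutputConfig": {
            "S3OutputLocation": output_s3_uri,
            "TargetDevice": target_device,
        },
        "StoppingCondition": {"MaxRuntimeInSeconds": 900},
    }


# Hypothetical image classifier compiled for a c5 instance family target.
compilation_request = build_compilation_job(
    "image-classifier-neo",
    "arn:aws:iam::111122223333:role/NeoRole",
    "s3://my-bucket/models/model.tar.gz",
    "s3://my-bucket/models/compiled/",
    "MXNET",
    {"data": [1, 3, 224, 224]},
    "ml_c5",
)
```

The compiled artifact is then deployed to an instance type that matches the target device.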

75
Multi-Selecteasy

A team uses SageMaker Ground Truth to create labeled datasets. They need to ensure labeling jobs are cost-effective. Which TWO measures should they take? (Select TWO.)

Select 2 answers
A.Use a smaller instance type for the labeling job.
B.Use a smaller workforce type.
C.Set up a labeling workflow with 'Incremental training'.
D.Enable the 'Consolidated billing' for labeling costs.
E.Use the 'Automated data labeling' feature.
AnswersC, E

Incremental training leverages existing models to reduce labeling needs.

Why this answer

Automated data labeling reduces manual labeling cost by using model predictions to label data, and incremental training reduces the number of items that need manual labeling by starting from an existing model.
