CCNA Deployment and Orchestration of ML Workflows Questions — Page 2 of 2

MCQ (easy)

A data science team wants to deploy a real-time inference endpoint on Amazon SageMaker for a model that requires low latency (under 100 ms). The model is a small ensemble of three tree-based models, each about 50 MB. The team expects around 1000 requests per minute, with occasional spikes to 5000 requests per minute. Which instance type and deployment strategy would be MOST cost-effective while meeting the latency requirement?

A. Deploy a single model endpoint on an ml.c5.large instance with Auto Scaling configured using a target tracking policy based on invocations per minute

B. Deploy a single model endpoint on an ml.c5.large instance with a Multi-Model endpoint

C. Use SageMaker batch transform with multiple ml.c5.large instances to process all requests offline

D. Deploy a single model endpoint on an ml.c5.xlarge instance with provisioned concurrency

Answer: A

The ml.c5.large provides sufficient compute for the latency requirement, and Auto Scaling scales out during spikes. This is the most cost-effective approach.

Why this answer

Option A is correct because deploying a single model endpoint on an ml.c5.large instance with Auto Scaling based on invocations per minute provides the necessary compute capacity for the expected 1000 requests per minute while scaling up to handle spikes up to 5000 requests per minute. The ml.c5.large instance offers sufficient memory (4 GB) and compute for three 50 MB tree-based models, and the target tracking policy ensures low latency by maintaining a buffer of capacity without over-provisioning, keeping inference under 100 ms.

Exam trap

The trap here is that candidates might assume provisioned concurrency applies to instance-based endpoints (in SageMaker it is an option for Serverless Inference endpoints only), or incorrectly assume Multi-Model endpoints are suitable for ensemble models, leading them to choose B or D without considering the real-time latency constraint.

How to eliminate wrong answers

Option B is wrong because Multi-Model endpoints are designed to host multiple independent models on a single instance, but here the ensemble is a single model composed of three sub-models that must be loaded together for each inference; using a Multi-Model endpoint would require loading each sub-model separately, increasing latency and complexity. Option C is wrong because SageMaker batch transform is an asynchronous, offline processing method that does not support real-time inference with sub-100 ms latency; it is designed for large-scale batch jobs, not low-latency endpoints. Option D is wrong because provisioned concurrency is not a setting for instance-based SageMaker real-time endpoints (it applies to AWS Lambda and to SageMaker Serverless Inference); instance-based endpoints scale through Auto Scaling or manual instance counts, and an ml.c5.xlarge instance would be over-provisioned for the baseline load, increasing cost unnecessarily.
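
The target tracking setup in option A can be sketched as the two request payloads that Application Auto Scaling expects for a SageMaker variant. This is a minimal sketch, not a tuned configuration: the endpoint name, variant name, and the per-instance target of 1000 invocations per minute are assumptions, not values given in the question. The dictionaries are meant to be passed as keyword arguments to the `application-autoscaling` client's `register_scalable_target` and `put_scaling_policy` calls.

```python
import math

# Hypothetical names and numbers; none of them come from the question.
ENDPOINT_NAME = "ensemble-endpoint"
VARIANT_NAME = "AllTraffic"
TARGET_INVOCATIONS_PER_INSTANCE = 1000  # assumed sustainable load per instance, per minute


def instances_needed(requests_per_minute: int, per_instance_target: int) -> int:
    """Instances required to keep each one at or below the target load."""
    return max(1, math.ceil(requests_per_minute / per_instance_target))


resource_id = f"endpoint/{ENDPOINT_NAME}/variant/{VARIANT_NAME}"

# Keyword arguments for register_scalable_target.
scalable_target = {
    "ServiceNamespace": "sagemaker",
    "ResourceId": resource_id,
    "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
    "MinCapacity": instances_needed(1000, TARGET_INVOCATIONS_PER_INSTANCE),
    "MaxCapacity": instances_needed(5000, TARGET_INVOCATIONS_PER_INSTANCE),
}

# Keyword arguments for put_scaling_policy.
scaling_policy = {
    "PolicyName": "invocations-target-tracking",
    "ServiceNamespace": "sagemaker",
    "ResourceId": resource_id,
    "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
    "PolicyType": "TargetTrackingScaling",
    "TargetTrackingScalingPolicyConfiguration": {
        "TargetValue": float(TARGET_INVOCATIONS_PER_INSTANCE),
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
    },
}

print(scalable_target["MinCapacity"], scalable_target["MaxCapacity"])
```

Under the assumed target, the baseline of 1000 requests per minute fits on one instance and the 5000 requests per minute spike needs five, which is why a small instance with scale-out is cheaper than sizing a single large instance for the peak.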

MCQ (hard)

A company deploys a model using SageMaker real-time endpoint with auto scaling. They observe that during a traffic spike, the endpoint quickly scales up to 10 instances, but after the spike, it takes a long time to scale down, leading to high costs. The scaling policy is based on a simple average CPU utilization threshold. Which adjustment would optimize the scaling down behavior?

A. Increase the scale-in cooldown period to prevent premature scale-down.

B. Decrease the scale-in cooldown period to allow the endpoint to scale down faster when utilization drops.

C. Use a step scaling policy with a larger step adjustment for scale-in.

D. Change the scaling policy to use memory utilization instead of CPU.

Answer: B

Reducing the scale-in cooldown lets Application Auto Scaling, which manages SageMaker endpoint variants, remove instances sooner.

Why this answer

The correct answer is B because decreasing the scale-in cooldown period allows the endpoint to respond more quickly to sustained drops in CPU utilization. By default, SageMaker auto scaling uses cooldown periods to prevent rapid fluctuations; a long scale-in cooldown delays the termination of instances after utilization falls, keeping costs high. Reducing this cooldown lets the endpoint scale down faster when the spike subsides, directly addressing the problem.

Exam trap

The trap here is that candidates often confuse cooldown periods with step adjustments, thinking that larger scale-in steps will speed up the process, when in fact the cooldown period controls the timing of when scaling actions can occur.

How to eliminate wrong answers

Option A is wrong because increasing the scale-in cooldown period would make the problem worse, not better—it would cause the endpoint to wait even longer before scaling down, increasing costs. Option C is wrong because step scaling policies control the magnitude of scaling adjustments (e.g., adding or removing multiple instances at once), but they do not affect the timing or delay of scale-in actions; the cooldown period is the key parameter for timing. Option D is wrong because changing the metric to memory utilization does not address the core issue of slow scale-down timing; the problem is with the cooldown period, not the metric choice.
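
As a rough illustration of where the cooldown lives, the sketch below builds a `TargetTrackingScalingPolicyConfiguration` for a CPU-based custom metric and changes only `ScaleInCooldown`. The endpoint name, variant name, target value, and both cooldown values are assumptions chosen for the example; the question does not state them.

```python
def cpu_tracking_config(scale_in_cooldown_s: int, scale_out_cooldown_s: int) -> dict:
    """Target tracking configuration on average CPU for one endpoint variant.

    The endpoint and variant names are hypothetical placeholders.
    """
    return {
        "TargetValue": 60.0,  # assumed target CPU percentage
        "CustomizedMetricSpecification": {
            "MetricName": "CPUUtilization",
            "Namespace": "/aws/sagemaker/Endpoints",
            "Dimensions": [
                {"Name": "EndpointName", "Value": "my-endpoint"},
                {"Name": "VariantName", "Value": "AllTraffic"},
            ],
            "Statistic": "Average",
        },
        # Seconds to wait after a scale-in activity before another may start.
        "ScaleInCooldown": scale_in_cooldown_s,
        # Seconds to wait after a scale-out activity before another may start.
        "ScaleOutCooldown": scale_out_cooldown_s,
    }


# Assumed "before" and "after" values: only the scale-in cooldown is reduced.
before = cpu_tracking_config(scale_in_cooldown_s=900, scale_out_cooldown_s=60)
after = cpu_tracking_config(scale_in_cooldown_s=120, scale_out_cooldown_s=60)

print(before["ScaleInCooldown"], "->", after["ScaleInCooldown"])
```

Leaving `ScaleOutCooldown` unchanged keeps the fast reaction to spikes that the team already observes, while the shorter `ScaleInCooldown` allows consecutive scale-in activities to happen closer together.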

MCQ (medium)

An MLOps engineer is setting up a SageMaker endpoint for a model that performs inference on large images. The model is containerized and expects input in a specific format. The team wants to preprocess the images (resize and normalize) before passing them to the model. What is the most efficient way to implement this?

A. Configure SageMaker to use a preprocessing container as the first step of an inference pipeline, followed by the model container.

B. Use Amazon API Gateway to perform request transformation before forwarding to the endpoint.

C. Package the preprocessing logic into the same Docker container as the model.

D. Use a Lambda function as a proxy to preprocess requests before calling the SageMaker endpoint.

Answer: A

Inference pipeline allows separation of concerns and efficient processing.

Why this answer

Option A is correct because SageMaker Inference Pipelines allow you to chain multiple containers in a serial fashion, where the output of one container becomes the input of the next. By placing a preprocessing container as the first step, you can resize and normalize large images before passing them to the model container, which keeps the model container focused on inference and avoids unnecessary data transfer or custom code. This is the most efficient and natively supported approach within SageMaker for multi-step inference workflows.

Exam trap

The trap here is that candidates often choose Option C (packaging everything into one container) because it seems simpler, but they overlook the fact that SageMaker Inference Pipelines are specifically designed for this exact use case and provide better modularity, maintainability, and efficiency.

How to eliminate wrong answers

Option B is wrong because Amazon API Gateway is designed for request routing and lightweight transformation at the HTTP level, not for heavy image preprocessing such as resizing and normalization; it lacks the compute and libraries needed for such tasks and would add latency without any benefit. Option C is wrong because packaging preprocessing logic into the same container as the model violates the separation of concerns principle and makes the container larger and harder to maintain; it also prevents independent updating of the preprocessing step. Option D is wrong because using a Lambda function as a proxy adds cold-start latency and is subject to Lambda's 6 MB synchronous invocation payload limit, which is problematic for large images, and it does not integrate as seamlessly with SageMaker's inference pipeline features.
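
To make the inference pipeline in option A concrete, here is a minimal sketch of the `CreateModel` request for a two-container serial pipeline. The model name, role ARN, image URIs, and S3 path are all hypothetical placeholders; the dictionary is intended as keyword arguments for the SageMaker client's `create_model` call.

```python
def pipeline_model_request(
    name: str,
    role_arn: str,
    preprocess_image: str,
    model_image: str,
    model_data_url: str,
) -> dict:
    """Build a CreateModel request whose containers run in order.

    The output of the first container becomes the input of the second.
    """
    return {
        "ModelName": name,
        "ExecutionRoleArn": role_arn,
        "Containers": [
            # Step 1: resize and normalize the incoming image.
            {"Image": preprocess_image},
            # Step 2: run the model on the preprocessed tensor.
            {"Image": model_image, "ModelDataUrl": model_data_url},
        ],
        "InferenceExecutionConfig": {"Mode": "Serial"},
    }


# All identifiers below are made up for illustration.
request = pipeline_model_request(
    name="image-pipeline-model",
    role_arn="arn:aws:iam::111122223333:role/ExampleSageMakerRole",
    preprocess_image="111122223333.dkr.ecr.us-east-1.amazonaws.com/preprocess:latest",
    model_image="111122223333.dkr.ecr.us-east-1.amazonaws.com/classifier:latest",
    model_data_url="s3://example-bucket/models/classifier/model.tar.gz",
)

print([c["Image"].rsplit("/", 1)[-1] for c in request["Containers"]])
```

Because both containers belong to one model, the endpoint configuration and the client code stay the same as for a single-container model; only the `Containers` list changes.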

MCQ (medium)

An ML team is using SageMaker Model Registry to manage model versions. After training a new model version, they register it with an 'Approved' status. The CI/CD pipeline automatically deploys the latest approved model to a staging endpoint. However, the pipeline fails with an error: 'Cannot deploy model because the model version is not approved.' The model version is clearly approved in the registry. What is the most likely cause?

A. The pipeline is using the model package ARN instead of the model version ARN.

B. The model version is approved but the pipeline uses a different version that is still pending.

C. The SageMaker endpoint configuration does not have the necessary IAM permissions to read the registry.

D. The approval status was set on the model package group, not on the specific model version.

Answer: D

Approval is per model version; if only the group is approved, individual versions may not inherit.

Why this answer

Option D is correct because in SageMaker Model Registry, approval is a property of a specific model version within a model package group, not of the model package group itself. The error indicates the pipeline is likely referencing the model package group ARN or a version that lacks explicit approval, even though the team believes the model is approved. The CI/CD pipeline must use the exact model version ARN that has the 'Approved' status to deploy successfully.

Exam trap

The trap here is that candidates confuse model package group approval with model version approval, assuming that approving the group automatically approves all versions, whereas AWS requires explicit approval on each version individually.

How to eliminate wrong answers

Option A is wrong because using the model package ARN (which refers to the group) would cause a different error, such as 'ModelPackageNotFound' or 'InvalidARN', not a specific 'not approved' error; the pipeline would still need to specify a version. Option B is wrong because the question states the model version is clearly approved in the registry, so the pipeline using a different pending version would imply a misconfiguration in the pipeline's version selection logic, but the error message directly contradicts the approval status of the intended version. Option C is wrong because IAM permissions for the endpoint configuration to read the registry would cause an 'AccessDenied' or authorization error, not a 'not approved' error; the error is about approval status, not permissions.
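
A deployment pipeline can avoid this class of error by selecting the version explicitly by its approval status. The sketch below filters a list shaped like the `ModelPackageSummaryList` returned by the SageMaker `list_model_packages` call; the group name and ARNs are hypothetical sample data, not values from the question.

```python
from typing import Optional


def latest_approved_arn(summaries: list) -> Optional[str]:
    """Return the ARN of the highest-numbered version whose status is 'Approved'."""
    approved = [s for s in summaries if s.get("ModelApprovalStatus") == "Approved"]
    if not approved:
        return None
    newest = max(approved, key=lambda s: s["ModelPackageVersion"])
    return newest["ModelPackageArn"]


# Hypothetical registry contents: version 3 is the newest but still pending.
GROUP_ARN_PREFIX = "arn:aws:sagemaker:us-east-1:111122223333:model-package/example-group"
summaries = [
    {"ModelPackageArn": f"{GROUP_ARN_PREFIX}/1", "ModelPackageVersion": 1,
     "ModelApprovalStatus": "Approved"},
    {"ModelPackageArn": f"{GROUP_ARN_PREFIX}/2", "ModelPackageVersion": 2,
     "ModelApprovalStatus": "Approved"},
    {"ModelPackageArn": f"{GROUP_ARN_PREFIX}/3", "ModelPackageVersion": 3,
     "ModelApprovalStatus": "PendingManualApproval"},
]

print(latest_approved_arn(summaries))
```

In this sample the function returns version 2 rather than the newest version 3, which shows how "latest" and "latest approved" can diverge when approval is tracked per version.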

MCQ (hard)

A team is deploying a deep learning model on a SageMaker real-time endpoint. The model has high memory requirements, and the team wants to minimize instance cost while ensuring the endpoint can handle up to 10 concurrent requests. They plan to use a single ml.p3.2xlarge instance (8 vCPUs, 61 GB memory). Which SageMaker endpoint configuration will allow the endpoint to handle 10 concurrent requests without errors?

A. Disable ModelServerWorkers to reduce overhead.

B. Set the initial instance count to 1 and configure the container to use multiple ModelServerWorkers.

C. Set the initial variant weight to 10.

D. Set the initial instance count to 10 in the production variant.

Answer: B

Multiple workers allow the instance to handle multiple requests concurrently, up to the CPU/memory limit.

Why this answer

Option B is correct because the model server inside a SageMaker container can run multiple worker processes (the ModelServerWorkers setting in this question), which lets a single instance handle several inference requests concurrently. With 8 vCPUs on ml.p3.2xlarge, configuring multiple workers (e.g., 8 to 10) lets the endpoint accept 10 concurrent requests: each worker handles one request at a time, and any requests beyond the worker count wait briefly in the model server's queue. This minimizes cost by using a single instance while meeting the concurrency requirement.

Exam trap

The trap here is confusing concurrency mechanisms: candidates often think increasing instance count (Option D) is the only way to handle concurrent requests, but SageMaker's ModelServerWorkers allow a single instance to serve multiple requests in parallel, which is more cost-effective.

How to eliminate wrong answers

Option A is wrong because disabling ModelServerWorkers would force the container to use a single worker, limiting concurrency to 1 request at a time, which cannot handle 10 concurrent requests. Option C is wrong because initial variant weight controls traffic distribution across multiple variants, not concurrency or instance count; setting it to 10 does not increase the number of instances or workers. Option D is wrong because setting the initial instance count to 10 would deploy 10 instances, which is unnecessary and costly for handling 10 concurrent requests, and does not address the goal of minimizing cost.
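
The relationship between worker count and concurrency can be sketched with simple arithmetic. The environment variable name below, `SAGEMAKER_MODEL_SERVER_WORKERS`, is the one read by containers built on the SageMaker inference toolkit; whether a given container honors it is an assumption to verify for the image in use, and the "waves" calculation is a simplified model rather than a description of the server's actual scheduler.

```python
import math


def processing_waves(concurrent_requests: int, workers: int) -> int:
    """Number of sequential rounds needed if each worker serves one request at a time."""
    if workers < 1:
        raise ValueError("at least one worker is required")
    return math.ceil(concurrent_requests / workers)


# Container environment for the model definition (values are strings).
container_environment = {"SAGEMAKER_MODEL_SERVER_WORKERS": "8"}

workers = int(container_environment["SAGEMAKER_MODEL_SERVER_WORKERS"])
for n in (1, workers, 10):
    print(f"{n} worker(s): {processing_waves(10, n)} wave(s) for 10 requests")
```

With one worker the ten requests are served one after another; with eight workers two rounds suffice, and with ten workers all requests run in parallel. Each worker process typically holds its own copy of the model, so the worker count is bounded by instance memory as well as by vCPUs.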

MCQ (easy)

A company deployed a machine learning model on an Amazon SageMaker real-time endpoint. Over several weeks, they notice that inference latency has been gradually increasing, especially during peak business hours. The model and instance type have remained unchanged. What is the most likely cause of the increased latency?

A. The inference script is not using batch processing.

B. The SageMaker endpoint auto scaling is not configured to scale out quickly enough under increasing traffic.

C. The model size is too large for the instance type.

D. The endpoint has data capture enabled, causing additional overhead.

Answer: B

If auto scaling policies are too conservative, the endpoint may not add instances fast enough during traffic spikes, leading to increased latency.

Why this answer

Option B is correct because the gradual increase in latency over time, especially during peak hours, suggests that the endpoint is not scaling out adequately to handle growing traffic. Option A is incorrect because batch processing is not used for real-time endpoints. Option C is incorrect because the model and instance type have not changed, so model size cannot explain a gradual increase. Option D is incorrect because data capture does not inherently cause latency to increase over time.

MCQ (easy)

A company is using SageMaker Pipelines to automate a multi-step ML workflow. The pipeline includes data preprocessing, training, and model evaluation. The team wants to ensure that if the evaluation step fails, the pipeline stops and sends an alert to the operations team. Which SageMaker Pipelines feature should they use?

A. Configure an Amazon CloudWatch Events rule to monitor the pipeline execution status and stop it if the evaluation step fails

B. Register the model in the Model Registry only if evaluation passes, and configure the pipeline to stop if registration fails

C. Add a Lambda step after the evaluation step that checks the evaluation metrics and sends an SNS notification if the metrics are below a threshold

D. Use a Condition step to check the evaluation result and route to a Fail step if the result indicates failure

Answer: D

A Condition step allows branching; a Fail step stops the pipeline and marks the execution as failed, and that status change can drive an alert (for example, an EventBridge rule that publishes to SNS).

Why this answer

Option D is correct because SageMaker Pipelines provides a built-in Condition step that evaluates a boolean expression (e.g., checking if evaluation metrics meet a threshold) and then routes execution to different steps. If the condition fails, you can direct the pipeline to a Fail step, which immediately stops the pipeline and marks it as failed. This is the native, event-driven way to halt a pipeline based on step output without relying on external services.

Exam trap

The trap here is that candidates often confuse external monitoring (CloudWatch) or post-step actions (Lambda) with native pipeline control flow, missing that SageMaker Pipelines has a dedicated Condition step for conditional branching and halting execution.

How to eliminate wrong answers

Option A is wrong because CloudWatch Events rules can monitor pipeline state changes but cannot stop a running pipeline; they can only trigger notifications or invoke other actions after the fact. Option B is wrong because registering a model in the Model Registry is an optional downstream step, not a mechanism to stop the pipeline; if registration fails, the pipeline would still continue to subsequent steps unless explicitly handled. Option C is wrong because a Lambda step can send SNS notifications but does not have the ability to halt the pipeline execution; it would only alert after the step completes, not prevent further steps from running.
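
For the alerting half of the requirement, one common arrangement is an EventBridge rule that reacts to the failed execution produced by the Fail step and forwards it to an SNS topic. The sketch below shows such an event pattern together with a tiny local matcher that covers only the exact-value matching used here; the sample events are hand-written and abbreviated, and the matcher is an illustration, not EventBridge's full matching logic.

```python
# Event pattern for failed SageMaker pipeline executions.
failed_pipeline_pattern = {
    "source": ["aws.sagemaker"],
    "detail-type": ["SageMaker Model Building Pipeline Execution Status Change"],
    "detail": {"currentPipelineExecutionStatus": ["Failed"]},
}


def matches(pattern: dict, event: dict) -> bool:
    """Exact-value subset of EventBridge matching: lists mean 'any of', dicts nest."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        if isinstance(expected, dict):
            if not isinstance(event[key], dict) or not matches(expected, event[key]):
                return False
        elif event[key] not in expected:
            return False
    return True


def sample_event(status: str) -> dict:
    """Abbreviated, hand-written status change event for a hypothetical pipeline."""
    return {
        "source": "aws.sagemaker",
        "detail-type": "SageMaker Model Building Pipeline Execution Status Change",
        "detail": {"currentPipelineExecutionStatus": status},
    }


print(matches(failed_pipeline_pattern, sample_event("Failed")))
print(matches(failed_pipeline_pattern, sample_event("Succeeded")))
```

The pattern would be serialized with `json.dumps` and supplied as the `EventPattern` of a rule whose target is the operations team's SNS topic.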

MCQ (easy)

A company wants to deploy a machine learning model that was trained on-premises using TensorFlow. The model is a TensorFlow SavedModel. The company uses AWS and wants to minimize operational overhead. Which deployment option meets these requirements?

A. Deploy the model on Amazon ECS using a custom Docker image.

B. Deploy the model as an AWS Lambda function with the TensorFlow runtime.

C. Deploy the model using Amazon SageMaker Studio.

D. Deploy the model using Amazon SageMaker with a TensorFlow inference container.

Answer: D

SageMaker provides pre-built TensorFlow containers and manages the endpoint, reducing operational overhead.

Why this answer

Amazon SageMaker provides a fully managed TensorFlow inference container that directly supports TensorFlow SavedModel format, enabling deployment without any custom infrastructure management. This minimizes operational overhead compared to self-managed options like ECS or Lambda, as SageMaker handles scaling, load balancing, and model updates automatically.

Exam trap

AWS often tests the distinction between SageMaker Studio (an IDE) and SageMaker hosting (deployment endpoints), leading candidates to mistakenly select Studio as a deployment option when it is only for development and experimentation.

How to eliminate wrong answers

Option A is wrong because deploying on Amazon ECS with a custom Docker image requires you to build, maintain, and scale the container infrastructure yourself, increasing operational overhead. Option B is wrong because AWS Lambda has a maximum deployment package size limit (250 MB unzipped) and a 15-minute timeout, making it unsuitable for large TensorFlow SavedModels or inference requests that require significant compute. Option C is wrong because Amazon SageMaker Studio is an integrated development environment (IDE) for building, training, and debugging models, not a deployment target; the actual deployment would still require creating an endpoint, which is covered by Option D.

Multi-Select (hard)

A company is deploying a machine learning model using Amazon SageMaker. The model is a large deep learning model that requires GPU for inference. The company expects unpredictable traffic patterns with occasional bursts. They want to minimize cost while ensuring low latency during bursts. Which TWO actions should they take? (Select TWO.)

Select 2 answers

A. Use a serverless endpoint configuration to automatically scale.

B. Use a multi-model endpoint with a mix of CPU and GPU instances to handle variable traffic.

C. Use Spot instances for the endpoint to reduce cost.

D. Provision multiple on-demand GPU instances behind a load balancer.

E. Use Amazon SageMaker Elastic Inference to attach GPU acceleration to a CPU instance.

Answers: B, E

Multi-model endpoints allow efficient resource utilization and cost savings.

Why this answer

Option B is correct because a multi-model endpoint with a mix of CPU and GPU instances allows the company to host multiple models on the same endpoint, reducing cost by sharing underlying instances. By including GPU instances, the endpoint can handle the GPU-intensive deep learning inference for the large model, while the CPU instances can serve lighter loads or fallback traffic, ensuring low latency during unpredictable bursts without over-provisioning.

Exam trap

The trap here is that candidates often confuse serverless endpoints with GPU support, not realizing that SageMaker serverless endpoints are CPU-only, and they may overlook that multi-model endpoints can mix instance types to balance cost and performance for bursty GPU workloads.

MCQ (medium)

An e-commerce company uses Amazon SageMaker to deploy a real-time inference endpoint for product recommendations. The endpoint receives bursty traffic, with occasional spikes. The company wants to minimize cost while ensuring that latency remains under 100 ms. Which approach should the company take?

A. Use an elastic inference accelerator to reduce latency instead of scaling.

B. Use a scheduled scaling plan based on historical traffic patterns.

C. Deploy the model on one large instance to handle peak load.

D. Deploy the model on a multi-model endpoint with automatic scaling and configure a warm-up period for new instances.

Answer: D

Multi-model endpoint with scaling and warm-up can handle bursts cost-effectively.

Why this answer

Option D is correct because a multi-model endpoint with automatic scaling allows multiple models to share a single endpoint, reducing cost while handling bursty traffic. Configuring a warm-up period ensures new instances are fully initialized before receiving traffic, preventing cold-start latency spikes and keeping inference under 100 ms.

Exam trap

The trap here is that candidates confuse latency optimization techniques (like elastic inference) with scaling strategies, overlooking that bursty traffic requires dynamic scaling with warm-up to prevent cold-start latency spikes.

How to eliminate wrong answers

Option A is wrong because elastic inference accelerators reduce per-inference latency but do not address the need to scale out during traffic spikes; they add cost without solving the bursty traffic problem. Option B is wrong because scheduled scaling based on historical patterns cannot react to unpredictable spikes, leading to either over-provisioning or latency violations during unexpected bursts. Option C is wrong because deploying on one large instance creates a single point of failure and is cost-inefficient for bursty traffic; it either underutilizes resources during low traffic or fails to handle peak load without latency degradation.

MCQ (medium)

A company has a SageMaker endpoint that was deployed successfully and is in service. However, when the team sends test inferences using the InvokeEndpoint API, they receive a 500 internal server error. The endpoint logs in CloudWatch show a stack trace indicating 'OutOfMemoryError: Java heap space'. The model is a large XGBoost model (2 GB) and the endpoint is using an ml.m5.large instance with 8 GB of memory. What is the MOST likely cause and solution?

A. The endpoint needs to have a smaller batch size configured in the real-time inference request.

B. The instance type has insufficient memory for the model size; use a larger instance type like ml.m5.xlarge (16 GB) or ml.m5.2xlarge.

C. The model is a Transformer model and requires a GPU instance; use ml.g4dn.xlarge instead.

D. The SageMaker container is not compatible with XGBoost; switch to a framework container.

Answer: B

A 2 GB model plus runtime overhead (e.g., Java heap for XGBoost) can exceed 8 GB. Increasing instance memory resolves the out-of-memory error.

Why this answer

The OutOfMemoryError in Java heap space indicates that the model (2 GB) plus the runtime overhead of the XGBoost container and Java-based inference code exceed the available memory on the ml.m5.large instance (8 GB total, but not all is available for the Java heap). The most direct fix is to use a larger instance type, such as ml.m5.xlarge (16 GB) or ml.m5.2xlarge, to provide sufficient heap space for the model and inference operations.

Exam trap

The trap here is that candidates may incorrectly attribute the OutOfMemoryError to batch size or container compatibility, rather than recognizing that the instance's memory is insufficient for the model size and Java heap overhead.

How to eliminate wrong answers

Option A is wrong because batch size configuration is not applicable to real-time InvokeEndpoint requests (which are single inference calls), and reducing batch size would not resolve a Java heap space error caused by model size and overhead. Option C is wrong because the model is explicitly stated as XGBoost, not a Transformer model, and XGBoost runs efficiently on CPU instances; GPU instances are not required. Option D is wrong because SageMaker provides native support for XGBoost via built-in containers, and the error is a memory issue, not a compatibility issue with the container.

MCQ (hard)

A data scientist is trying to create a SageMaker endpoint configuration with 6 instances of ml.c5.large for a production variant. The creation fails with the error shown in the exhibit. Which action should the data scientist take to resolve this issue?

A. Create two separate endpoint configurations, each with 3 instances, and distribute traffic between them.

B. Request a service quota increase for ml.c5.large for real-time endpoints from the AWS Service Quotas console.

C. Use a different instance type, such as ml.m5.large, which has a higher limit.

D. Delete unused endpoints to free up resources.

Answer: B

Increasing the quota allows provisioning the requested number of instances.

Why this answer

The error indicates that the requested number of instances exceeds the service quota for ml.c5.large for real-time endpoints. AWS enforces default limits on instance counts per instance type per region. Requesting a quota increase via the Service Quotas console is the correct action to raise the limit and allow the deployment of 6 instances.

Exam trap

The trap here is that candidates may confuse service quotas with resource availability, thinking that deleting unused endpoints or splitting configurations will free up capacity, when in fact the quota is a hard limit that must be explicitly increased.

How to eliminate wrong answers

Option A is wrong because creating two separate endpoint configurations does not bypass the service quota; the total instance count across all endpoints still counts against the same quota. Option C is wrong because using a different instance type like ml.m5.large does not inherently have a higher limit; each instance type has its own default quota, and the limit for ml.m5.large may also be insufficient or unknown without checking. Option D is wrong because deleting unused endpoints does not increase the quota for ml.c5.large; it only frees up currently used instances, but the quota itself remains unchanged.
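
The same quota increase can also be requested programmatically. The sketch below looks up the relevant entry in a list shaped like the `Quotas` field returned by the Service Quotas `list_service_quotas` call and builds the arguments for `request_service_quota_increase`. The quota name format and the quota code `L-EXAMPLE1` are assumptions for illustration; the real values should be read from the account's own quota listing.

```python
from typing import Optional


def find_endpoint_quota(quotas: list, instance_type: str) -> Optional[dict]:
    """Find the endpoint-usage quota entry for an instance type (assumed name format)."""
    wanted = f"{instance_type} for endpoint usage"
    for quota in quotas:
        if quota.get("QuotaName") == wanted:
            return quota
    return None


# Hand-written sample data; the codes and current values are hypothetical.
quotas = [
    {"QuotaName": "ml.m5.large for endpoint usage", "QuotaCode": "L-EXAMPLE0", "Value": 4.0},
    {"QuotaName": "ml.c5.large for endpoint usage", "QuotaCode": "L-EXAMPLE1", "Value": 4.0},
]

quota = find_endpoint_quota(quotas, "ml.c5.large")
desired_instances = 6

# Keyword arguments for request_service_quota_increase, built only when needed.
increase_request = None
if quota is not None and quota["Value"] < desired_instances:
    increase_request = {
        "ServiceCode": "sagemaker",
        "QuotaCode": quota["QuotaCode"],
        "DesiredValue": float(desired_instances),
    }

print(increase_request)
```

Checking the current value first avoids filing a request when the existing quota already covers the desired instance count.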

MCQ (medium)

A team is using SageMaker Pipelines to automate retraining and deployment. They want to trigger the pipeline automatically when new training data is available in an S3 bucket. Which approach should they use?

A. Create an Amazon EventBridge rule that triggers the pipeline execution on S3 PutObject events

B. Register the pipeline as a model package in SageMaker Model Registry

C. Configure a cron job to run the pipeline every hour

D. Use AWS Step Functions to poll the S3 bucket and start the pipeline when a new object appears

Answer: A

EventBridge can detect S3 events and start pipeline executions.

Why this answer

Option A is correct because Amazon EventBridge can directly capture S3 PutObject events and invoke a SageMaker Pipeline execution as a target. This provides a fully event-driven, serverless integration without polling or manual intervention, aligning with best practices for automating ML workflows when new data arrives.

Exam trap

The trap here is that candidates may overcomplicate the solution by choosing Step Functions (Option D) for orchestration, not realizing that EventBridge provides a simpler, event-driven trigger without the need for polling or additional state machines.

How to eliminate wrong answers

Option B is wrong because registering a pipeline as a model package in SageMaker Model Registry is for versioning and managing trained models, not for triggering pipeline executions based on S3 events. Option C is wrong because a cron job runs on a fixed schedule, which is inefficient and may miss data arrivals or run unnecessarily, whereas the requirement is to trigger only when new data appears. Option D is wrong because using AWS Step Functions to poll S3 introduces latency, cost, and complexity compared to the native event-driven approach with EventBridge, which reacts instantly to S3 events.
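
A minimal sketch of the EventBridge wiring follows, assuming the bucket has EventBridge notifications enabled so that uploads arrive as `Object Created` events. The bucket, prefix, rule name, ARNs, and the pipeline parameter name `InputDataUri` are all hypothetical; the two dictionaries are meant as keyword arguments for the EventBridge client's `put_rule` and `put_targets` calls.

```python
import json

# Hypothetical resources used only for this example.
BUCKET = "example-training-data"
PREFIX = "incoming/"
RULE_NAME = "start-retraining-on-new-data"
PIPELINE_ARN = "arn:aws:sagemaker:us-east-1:111122223333:pipeline/example-retraining"
EVENTS_ROLE_ARN = "arn:aws:iam::111122223333:role/ExampleEventBridgeRole"

# Match new objects under the prefix in the training data bucket.
event_pattern = {
    "source": ["aws.s3"],
    "detail-type": ["Object Created"],
    "detail": {
        "bucket": {"name": [BUCKET]},
        "object": {"key": [{"prefix": PREFIX}]},
    },
}

# Keyword arguments for put_rule.
put_rule_kwargs = {
    "Name": RULE_NAME,
    "EventPattern": json.dumps(event_pattern),
    "State": "ENABLED",
}

# Keyword arguments for put_targets: the pipeline is the rule's target.
put_targets_kwargs = {
    "Rule": RULE_NAME,
    "Targets": [
        {
            "Id": "retraining-pipeline",
            "Arn": PIPELINE_ARN,
            "RoleArn": EVENTS_ROLE_ARN,  # role that is allowed to start the pipeline
            "SageMakerPipelineParameters": {
                "PipelineParameterList": [
                    {"Name": "InputDataUri", "Value": f"s3://{BUCKET}/{PREFIX}"}
                ]
            },
        }
    ],
}

print(put_rule_kwargs["EventPattern"])
```

Restricting the pattern to a key prefix keeps unrelated uploads to the same bucket from starting a retraining run.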

MCQ (easy)

Refer to the exhibit. A user is unable to invoke a SageMaker endpoint. The IAM policy shown is attached to the user. Which permission is missing to allow invocation?

A. sagemaker:InvokeEndpoint

B. sagemaker:DescribeEndpoint

C. sagemaker:CreateEndpoint

D. sagemaker:ListEndpoints

Answer: A

InvokeEndpoint is required to send inference requests.

Why this answer

To invoke a SageMaker endpoint, the user needs the `sagemaker:InvokeEndpoint` permission. The IAM policy shown lacks this action, which is required for making real-time inference requests to the endpoint. Without it, any attempt to call the endpoint via the SDK or CLI will fail with an access denied error.

Exam trap

AWS often tests the distinction between read-only permissions (like `DescribeEndpoint` or `ListEndpoints`) and the specific action required to perform an operation, leading candidates to confuse metadata access with actual invocation capability.

How to eliminate wrong answers

Option B is wrong because `sagemaker:DescribeEndpoint` only allows retrieving metadata about an endpoint, not invoking it for inference. Option C is wrong because `sagemaker:CreateEndpoint` is for creating new endpoints, not for sending inference requests to an existing one. Option D is wrong because `sagemaker:ListEndpoints` only lists endpoints in the account, which does not grant the ability to invoke them.
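
Since the exhibit is not reproduced here, the following is only a sketch of what a policy statement granting the missing action could look like. The account ID, region, and endpoint name in the ARN are hypothetical.

```python
import json


def invoke_endpoint_policy(endpoint_arn: str) -> dict:
    """Identity-based policy that allows invoking a single SageMaker endpoint."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "sagemaker:InvokeEndpoint",
                "Resource": endpoint_arn,
            }
        ],
    }


# Hypothetical endpoint ARN.
policy = invoke_endpoint_policy(
    "arn:aws:sagemaker:us-east-1:111122223333:endpoint/example-endpoint"
)

print(json.dumps(policy, indent=2))
```

Scoping `Resource` to one endpoint ARN rather than `*` keeps the grant limited to the endpoint the user actually needs to call.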

MCQ (easy)

An ML engineer runs the CLI command shown in the exhibit. However, the training job fails immediately with an error: 'Unable to assume role'. What is the most likely cause?

A. The IAM role 'SageMakerExecutionRole' does not have permission to create the training job.

B. The training image in ECR does not exist.

C. The S3 bucket 'my-bucket' does not exist.

D. The IAM role's trust policy does not grant SageMaker permission to assume the role.

Answer: D

Without proper trust policy, SageMaker cannot assume the role, causing immediate failure.

Why this answer

The 'Unable to assume role' error indicates that SageMaker cannot assume the IAM role specified in the CLI command. This is a trust policy issue: the role's trust policy must include SageMaker as a trusted service (i.e., `"Service": "sagemaker.amazonaws.com"`). Without this, SageMaker is not authorized to assume the role, regardless of the role's permissions.

Exam trap

AWS often tests the distinction between IAM role permissions (what the role can do) and trust policies (who can assume the role), leading candidates to mistakenly select a permission-related option when the error is about trust.

How to eliminate wrong answers

Option A is wrong because the error is about assuming the role, not about the role's permissions to create the training job; permission errors would appear as 'AccessDenied' or similar, not 'Unable to assume role'. Option B is wrong because a missing ECR image would cause an error like 'Image not found' or 'RepositoryNotFoundException', not an assume role error. Option C is wrong because a non-existent S3 bucket would result in an error like 'NoSuchBucket' or 'AccessDenied' when SageMaker tries to access it, not an assume role failure.
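
The distinction between the two policy types is easier to see side by side. The sketch below builds a trust policy that lets the SageMaker service assume the role and includes a small check that a given trust document names that service principal; the second, deliberately wrong document (trusting EC2) is a made-up example of the misconfiguration described above.

```python
SAGEMAKER_PRINCIPAL = "sagemaker.amazonaws.com"


def trust_policy(service_principal: str) -> dict:
    """Trust policy (who may assume the role) for a single AWS service principal."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": service_principal},
                "Action": "sts:AssumeRole",
            }
        ],
    }


def trusts_service(policy: dict, service_principal: str) -> bool:
    """True if any Allow statement lets the service principal call sts:AssumeRole."""
    for statement in policy.get("Statement", []):
        services = statement.get("Principal", {}).get("Service", [])
        if isinstance(services, str):
            services = [services]
        if (
            statement.get("Effect") == "Allow"
            and statement.get("Action") == "sts:AssumeRole"
            and service_principal in services
        ):
            return True
    return False


good = trust_policy(SAGEMAKER_PRINCIPAL)
bad = trust_policy("ec2.amazonaws.com")  # hypothetical misconfigured role

print(trusts_service(good, SAGEMAKER_PRINCIPAL), trusts_service(bad, SAGEMAKER_PRINCIPAL))
```

A role with the second document could hold every S3 and SageMaker permission and the training job would still fail, because the failure happens before any of those permissions are evaluated.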

Multi-Select (easy)

A company is using Amazon SageMaker to deploy a model for real-time inference. The model requires access to a private S3 bucket that contains reference data. The company wants to ensure that the endpoint can access the S3 bucket without using a public internet connection. Which TWO actions should they take? (Select TWO.)

Select 2 answers

A. Configure the endpoint's security group to allow outbound traffic to the S3 bucket's IP range.

B. Attach the endpoint to a VPC that has a VPC endpoint for S3.

C. Ensure the SageMaker execution role has an IAM policy that grants s3:GetObject access to the bucket.

D. Attach the endpoint to a VPC with an internet gateway and route the S3 traffic through the internet gateway.

E. Attach the endpoint to a VPC with a NAT gateway to route traffic to S3.

Answers: B, C

VPC endpoints allow private connectivity to S3 without internet.

Why this answer

Option B is correct because attaching the SageMaker endpoint to a VPC with a VPC endpoint for S3 (Gateway type) allows the endpoint to access the S3 bucket using AWS's private network, bypassing the public internet. This ensures traffic stays within the AWS backbone, meeting the requirement for no public internet connection. Option C is also correct because the SageMaker execution role must have an IAM policy with s3:GetObject permissions to authorize the read access to the private S3 bucket, which is a prerequisite for any S3 operation.
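
The two pieces can be sketched as plain Python dictionaries. The bucket name, security group ID, and subnet ID are placeholders, and the helper name is hypothetical; the `VpcConfig` shape follows the SageMaker CreateModel API.

```python
import json

BUCKET = "example-reference-data"  # hypothetical bucket name


def read_only_bucket_policy(bucket: str) -> dict:
    """IAM policy for the execution role: read objects from one bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "s3:GetObject",
                "Resource": f"arn:aws:s3:::{bucket}/*",
            }
        ],
    }


# VpcConfig for the model behind the endpoint; the IDs are placeholders.
# The route table of these subnets must include the S3 gateway endpoint.
vpc_config = {
    "SecurityGroupIds": ["sg-0123456789abcdef0"],
    "Subnets": ["subnet-0123456789abcdef0"],
}

print(json.dumps(read_only_bucket_policy(BUCKET), indent=2))
```

Both are needed: the VPC endpoint provides the private network path, and the IAM policy authorizes the read.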

Exam trap

The trap here is that candidates often confuse VPC endpoints (which keep traffic private) with NAT gateways or internet gateways (which route traffic over the public internet), and they may overlook the mandatory IAM permissions required for S3 access even when using a VPC endpoint.

Multi-Selectmedium

A data science team deploys a TensorFlow model for real-time inference using the Amazon SageMaker model configuration shown. They observe high latency during the first few requests after deployment. Which TWO actions would reduce cold start latency? (Choose two.)

Select 2 answers

A.Enable data capture on the endpoint

B.Set the SAGEMAKER_PROGRAM environment variable to a more optimized entry point

C.Add a secondary container for model ensemble

D.Configure a Production Variant with an initial instance count greater than zero

E.Use Amazon SageMaker Multi-Model Endpoints

AnswersD, E

Setting an initial instance count ensures that instances are always running, preventing cold start.

Why this answer

Options D and E are correct. Setting an initial instance count greater than zero ensures that the endpoint always has at least one instance running, eliminating cold starts. Using Multi-Model Endpoints allows the endpoint to stay warm and reduces the time to load a model on demand.

Option A (data capture) adds overhead without reducing cold start latency. Option B (changing an environment variable) does not affect model loading time. Option C (adding another container) increases cold start latency.

MCQhard

A machine learning engineer is deploying a PyTorch model for real-time inference on SageMaker. The model requires GPU for low-latency predictions. The deployment fails with the error: 'The primary container does not support the requested instance type.' The instance type is ml.p3.2xlarge. Which action should the engineer take to resolve the issue?

A.Use SageMaker Neo to compile the model for the target instance type

B.Request a service quota increase for the ml.p3.2xlarge instance type

C.Verify that the PyTorch framework version specified in the SageMaker estimator matches a version that supports GPU instances

D.Create a custom inference container and use it with the SageMaker model

AnswerC

Older PyTorch versions may not support GPU; using a supported version resolves the error.

Why this answer

Option C is correct because the error 'The primary container does not support the requested instance type' typically occurs when the specified PyTorch framework version in the SageMaker estimator does not include GPU support for the chosen instance type (ml.p3.2xlarge). SageMaker's prebuilt PyTorch containers are version-specific and only certain versions are compiled with CUDA and GPU libraries; using a version that lacks GPU support causes the container to reject GPU instance types. Verifying and selecting a PyTorch version that explicitly supports GPU instances resolves the mismatch.

Exam trap

The trap here is that candidates often assume the error is due to resource limits (quota) or hardware incompatibility (Neo), rather than recognizing it as a framework version and container image mismatch specific to GPU support.

How to eliminate wrong answers

Option A is wrong because SageMaker Neo compiles models for edge devices or optimized inference on specific hardware, but it does not fix a container-instance type compatibility error; the error occurs before model compilation. Option B is wrong because a service quota increase addresses insufficient capacity or account limits for the instance type, not a container-level compatibility error; the error indicates the container rejects the instance type, not that the instance is unavailable. Option D is wrong because creating a custom inference container is unnecessary when the issue is simply a version mismatch in the prebuilt container; the error can be resolved by selecting a supported PyTorch version without custom container overhead.

MCQeasy

A company has deployed a SageMaker real-time endpoint for a model that predicts customer churn. The endpoint uses a single ml.m5.large instance. After deployment, the team notices that during peak hours, the endpoint returns 5xx errors for about 20% of requests. The endpoint has not been configured with any scaling policy. The team needs to resolve this issue with minimal cost increase. Which solution should the team implement?

A.Deploy the model to a multi-model endpoint to reduce resource utilization.

B.Enable Auto Scaling for the endpoint with a target tracking policy based on the average InvocationsPerInstance metric.

C.Increase the instance type to ml.m5.xlarge to handle more concurrent requests.

D.Use SageMaker batch transform instead of real-time inference to process peak traffic asynchronously.

AnswerB

Auto Scaling adds instances only when needed, minimizing cost while handling peak load.

Why this answer

Option B is correct because enabling Auto Scaling with a target tracking policy based on the average InvocationsPerInstance metric dynamically adjusts the number of instances in response to traffic spikes, preventing 5xx errors during peak hours without over-provisioning. This approach minimizes cost by scaling only when needed, unlike manual instance upgrades or batch transforms that either increase baseline cost or introduce latency.
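
A minimal sketch of such a target tracking policy, expressed as the keyword arguments for Application Auto Scaling's PutScalingPolicy call. The endpoint name, variant name, target value, and cooldowns are illustrative assumptions, not values from the question.

```python
def target_tracking_policy(endpoint: str, variant: str, target_per_instance: float) -> dict:
    """Build keyword arguments for a PutScalingPolicy call on an endpoint variant."""
    return {
        "PolicyName": f"{endpoint}-invocations-target",
        "ServiceNamespace": "sagemaker",
        "ResourceId": f"endpoint/{endpoint}/variant/{variant}",
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            # Average invocations per instance per minute to aim for.
            "TargetValue": target_per_instance,
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
            },
            "ScaleInCooldown": 300,   # seconds; assumed value
            "ScaleOutCooldown": 60,   # seconds; assumed value
        },
    }


policy = target_tracking_policy("churn-endpoint", "AllTraffic", 500.0)
print(policy["ResourceId"])
```

With boto3, the dictionary would be passed to the `application-autoscaling` client's `put_scaling_policy(**policy)` after the variant has been registered as a scalable target with minimum and maximum instance counts.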

Exam trap

The trap here is that candidates often confuse 'scaling up' (increasing instance size) with 'scaling out' (adding more instances), and overlook that Auto Scaling with a target tracking policy is the most cost-effective way to handle variable traffic, as it matches capacity to demand in real time.

How to eliminate wrong answers

Option A is wrong because deploying to a multi-model endpoint reduces resource utilization by sharing a single container across multiple models, but it does not address the root cause of insufficient capacity for a single model under peak load; it may even exacerbate contention. Option C is wrong because increasing the instance type to ml.m5.xlarge provides more compute per instance but incurs a fixed higher cost regardless of traffic, failing the 'minimal cost increase' requirement and not dynamically adapting to variable load. Option D is wrong because SageMaker batch transform is designed for asynchronous, offline inference on large datasets, not for real-time requests; it would introduce unacceptable latency and cannot serve interactive predictions, thus not resolving the immediate 5xx errors during peak hours.

Multi-Selectmedium

An MLOps engineer is designing a CI/CD pipeline for deploying machine learning models to a production SageMaker endpoint. The pipeline should include automated testing, approval gates, and rollback capability. Which THREE components should be included in the pipeline? (Select THREE.)

Select 3 answers

A.A step to register the model in SageMaker Model Registry.

B.A CloudFormation template to deploy the endpoint infrastructure, enabling rollback via stack update.

C.A separate staging endpoint to validate the model before production deployment.

D.A manual approval step after staging testing.

E.A step to run SageMaker Debugger to monitor training.

AnswersB, C, D

Infrastructure as code allows precise rollback by redeploying a previous CloudFormation stack.

Why this answer

Option B is correct because using a CloudFormation template to deploy the SageMaker endpoint infrastructure enables rollback via stack update. If a deployment fails, CloudFormation can automatically roll back the stack to the previous known good state, ensuring infrastructure consistency and reducing downtime.

Exam trap

The trap here is that candidates confuse model registry steps (Option A) or training monitoring tools (Option E) with deployment pipeline components, but the question specifically asks for components that enable automated testing, approval gates, and rollback capability in the CI/CD pipeline for deploying to a production SageMaker endpoint.

MCQmedium

An ML team at a financial services company has developed a fraud detection model using Amazon SageMaker. The model is currently deployed to a production endpoint with a single variant using the previous model version. The team wants to deploy a new model version with a canary deployment where 10% of traffic goes to the new version and 90% remains on the old version for 30 minutes before shifting all traffic to the new version if no issues are detected. Which step is essential to achieve this safe rollout?

A.Use the 'Deploy' method on the model object with the 'mode' parameter set to 'canary' within the built-in XGBoost algorithm container.

B.Update the endpoint with a new production variant for the new model version and set the 'InitialVariantWeight' to 10 for the new variant and 90 for the old variant, specifying a 'BlueGreenUpdatePolicy' with a 'TrafficRoutingConfiguration' for canary.

C.Ensure the endpoint is hosted on at least two instances to enable load balancing, then deploy the new model version as a separate variant and manually adjust the endpoint's DNS to split traffic.

D.Deploy the new model as a separate endpoint and use a SageMaker predictor to randomly route 10% of inference requests to the new endpoint.

AnswerB

This configuration uses SageMaker's blue/green deployment with canary traffic shifting, which is the correct approach.

Why this answer

Option B is correct because SageMaker canary deployments are configured by setting the 'BlueGreenUpdatePolicy', with a canary 'TrafficRoutingConfiguration', in an endpoint update. Option A is incorrect because SageMaker does not provide a built-in canary deployment mode via the built-in algorithms. Option D is incorrect because SageMaker does not support A/B testing through the predictor directly.

Option C is incorrect because hosting the endpoint on multiple instances does not by itself enable canary routing, and traffic is not split by manually adjusting the endpoint's DNS.
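
The canary settings from the scenario (10% of capacity, 30-minute wait) could be sketched as the `DeploymentConfig` argument of an UpdateEndpoint call. The alarm name and the termination wait are assumptions added for illustration.

```python
def canary_deployment_config(alarm_name: str) -> dict:
    """DeploymentConfig for UpdateEndpoint: 10% canary, 30-minute bake time."""
    return {
        "BlueGreenUpdatePolicy": {
            "TrafficRoutingConfiguration": {
                "Type": "CANARY",
                # Shift 10% of capacity to the new fleet first.
                "CanarySize": {"Type": "CAPACITY_PERCENT", "Value": 10},
                # Wait 30 minutes before shifting the remaining traffic.
                "WaitIntervalInSeconds": 30 * 60,
            },
            # How long to keep the old fleet after the shift; assumed value.
            "TerminationWaitInSeconds": 300,
        },
        # Roll back automatically if this CloudWatch alarm fires (hypothetical name).
        "AutoRollbackConfiguration": {"Alarms": [{"AlarmName": alarm_name}]},
    }


config = canary_deployment_config("fraud-endpoint-5xx-alarm")
print(config["BlueGreenUpdatePolicy"]["TrafficRoutingConfiguration"])
```

The rollback alarm is what turns "if no issues are detected" into an automated check rather than a manual one.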

MCQeasy

A data science team has trained a model using SageMaker and wants to deploy it to a production endpoint with automatic scaling based on request volume. Which SageMaker feature should they use to configure scaling?

A.SageMaker Endpoint Autoscaling

B.SageMaker Debugger

C.SageMaker Model Registry

D.SageMaker Pipelines

AnswerA

Endpoint Autoscaling automatically adjusts the number of instances based on demand.

Why this answer

SageMaker Endpoint Autoscaling is the correct feature because it automatically adjusts the number of instances behind a SageMaker hosted endpoint based on a target metric (e.g., requests per minute, CPU utilization) using Application Auto Scaling. This allows the endpoint to handle varying request volumes without manual intervention, ensuring cost efficiency and performance.

Exam trap

The trap here is that candidates may confuse SageMaker Debugger (a training debugger) or SageMaker Pipelines (a workflow tool) with scaling features, when only Endpoint Autoscaling directly manages production instance count based on request volume.

How to eliminate wrong answers

Option B (SageMaker Debugger) is wrong because it is a monitoring and debugging tool for training jobs, not for scaling production endpoints. Option C (SageMaker Model Registry) is wrong because it is a catalog for versioning and managing trained models, not a scaling mechanism. Option D (SageMaker Pipelines) is wrong because it is a workflow orchestration service for building and automating ML pipelines, not for configuring endpoint scaling.

MCQhard

An ML team uses SageMaker Pipelines to automate retraining. After a pipeline failure, they need to reprocess only the failed step without rerunning the entire pipeline. What should they do?

A.Create a new pipeline version for each run.

B.Use SageMaker Model Monitor to detect drift and trigger retraining.

C.Use SageMaker Pipelines Cache with step-level caching.

D.Manually rerun the pipeline with updated parameters.

AnswerC

Caching enables the pipeline to skip completed steps and resume from the failed step.

Why this answer

SageMaker Pipelines Cache with step-level caching allows you to reuse outputs from previous successful runs of unchanged steps. When a pipeline fails, only the failed step and any downstream steps that depend on it need to be re-executed, because cached results from prior successful steps are automatically retrieved. This avoids rerunning the entire pipeline, saving time and compute resources.
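
The idea behind step-level caching can be illustrated with a small, self-contained simulation. This is a conceptual sketch, not the SageMaker Pipelines API: the cache, the step names, and the S3 path are all hypothetical.

```python
import hashlib
import json

cache = {}      # cache key -> stored step output
executed = []   # names of steps that actually ran


def run_step(name, args, fn):
    """Run a step unless an identical (name, args) call already succeeded."""
    key = hashlib.sha256(json.dumps([name, args], sort_keys=True).encode()).hexdigest()
    if key in cache:
        return cache[key]      # cache hit: skip execution
    result = fn(args)          # may raise; nothing is cached on failure
    executed.append(name)
    cache[key] = result
    return result


def run_pipeline(train_fails):
    data = run_step("preprocess", {"input": "s3://bucket/raw"}, lambda a: "clean-data")

    def train(a):
        if train_fails:
            raise RuntimeError("training failed")
        return "model"

    return run_step("train", {"data": data}, train)


try:
    run_pipeline(train_fails=True)    # first run: preprocess succeeds, train fails
except RuntimeError:
    pass
run_pipeline(train_fails=False)       # rerun: preprocess is served from the cache
print(executed)
```

Across both runs the preprocessing step executes only once; the rerun goes straight to the step that failed.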

Exam trap

The trap here is that candidates confuse SageMaker Pipelines Cache with Model Monitor's drift detection, assuming that monitoring automatically handles retraining failures, when in fact caching is the correct mechanism for step-level reuse.

How to eliminate wrong answers

Option A is wrong because creating a new pipeline version for each run does not address step-level reuse; it creates an entirely new pipeline execution history, forcing a full rerun. Option B is wrong because SageMaker Model Monitor is designed for detecting data drift and model quality degradation, not for caching or resuming failed pipeline steps. Option D is wrong because manually rerunning the pipeline with updated parameters still executes all steps from scratch, ignoring any previously successful step outputs.

MCQeasy

A startup is building a serverless inference API using AWS Lambda. They have a TensorFlow model that is 400 MB in size. They packaged the model and inference code into a Lambda function using a container image. When they test the function with a small input, it consistently times out after 3 seconds. The Lambda function has 512 MB of memory and a timeout of 30 seconds. The business requirement is that inference must complete in less than 5 seconds under normal conditions. What is the most likely cause of the slow performance, and which change should they make?

A.The function timeout is too low; increase the timeout to 60 seconds.

B.The function is experiencing a cold start; use provisioned concurrency to keep the container warm.

C.The Lambda function memory is insufficient for the model size; increase memory to 1024 MB or higher.

D.Use a Lambda function with a GPU container to accelerate inference.

AnswerC

Lambda allocates CPU in proportion to configured memory, so more memory both speeds up computation and leaves headroom for loading the model.

Why this answer

The most likely cause is that the Lambda function's memory (512 MB) is insufficient to load the 400 MB TensorFlow model alongside the runtime and framework, causing memory pressure or out-of-memory errors that drastically slow inference. Increasing memory to 1024 MB or higher provides more CPU and memory resources, allowing the model to fit and inference to complete within the required 5 seconds.

Exam trap

The trap here is that candidates confuse cold start latency with runtime performance issues, assuming provisioned concurrency (Option B) fixes all slow Lambda functions, when in fact memory/CPU insufficiency is the root cause for large model inference.

How to eliminate wrong answers

Option A is wrong because the function already has a 30-second timeout, and the issue is not timeout-related—the function consistently times out after 3 seconds due to resource constraints, not because the timeout is too low. Option B is wrong because provisioned concurrency addresses cold starts (initialization latency), but the problem here is runtime performance after the function is already warm; the 3-second timeout occurs consistently, not just on first invocation. Option D is wrong because Lambda does not support GPU containers; GPU acceleration is not available in AWS Lambda, and the inference time is dominated by memory/CPU bottlenecks, not lack of GPU.

100

MCQhard

Refer to the exhibit. A company configures a SageMaker Model Monitor Data Quality monitoring schedule as shown. The schedule runs every hour. However, the team notices that the monitoring job fails intermittently with an AccessDenied error when accessing the S3 bucket for output. The IAM role SageMakerMonitorRole has permissions to write to s3://my-bucket/monitor-output. What is the MOST likely cause of the failure?

A.The S3UploadMode is set to Continuous, which is only supported for batch transform jobs.

B.The monitoring job runs in a VPC that does not have an S3 VPC endpoint, and the bucket policy denies requests from outside the VPC.

C.The cron expression is invalid; it should use rate(1 hour) instead.

D.The baseline constraints and statistics files are missing from the S3 bucket.

AnswerB

VPC restrictions can cause AccessDenied even if the IAM role allows.

Why this answer

An AccessDenied error when SageMaker Model Monitor writes to the S3 output bucket, even though the role has write permission, points to a network or resource-policy restriction rather than a gap in the role's IAM permissions. If the monitoring job runs inside a VPC (common for security compliance) that lacks an S3 VPC endpoint, its requests cannot reach S3 through an endpoint. If the bucket policy also denies requests that do not arrive through the expected VPC or VPC endpoint (using a condition on `aws:SourceVpce` or `aws:SourceVpc`), those requests are rejected, because an explicit deny in a bucket policy overrides the allow in the role's IAM policy.
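
A bucket policy of the kind described here might look like the following sketch. The VPC endpoint ID is a placeholder, and `is_denied` is a toy check of this single statement, not a real IAM policy evaluator.

```python
from typing import Optional


def vpce_only_bucket_policy(bucket: str, vpce_id: str) -> dict:
    """Bucket policy that denies any request not arriving via one VPC endpoint."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyOutsideVpcEndpoint",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                "Resource": [f"arn:aws:s3:::{bucket}", f"arn:aws:s3:::{bucket}/*"],
                "Condition": {"StringNotEquals": {"aws:SourceVpce": vpce_id}},
            }
        ],
    }


def is_denied(policy: dict, source_vpce: Optional[str]) -> bool:
    """Toy evaluation: the Deny applies when the request's endpoint ID differs."""
    expected = policy["Statement"][0]["Condition"]["StringNotEquals"]["aws:SourceVpce"]
    return source_vpce != expected


policy = vpce_only_bucket_policy("my-bucket", "vpce-0123456789abcdef0")
print(is_denied(policy, "vpce-0123456789abcdef0"))  # request through the endpoint
print(is_denied(policy, None))                      # request with no VPC endpoint
```

The second request is rejected regardless of what the IAM role allows, which matches the symptom in the question.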

Exam trap

AWS often tests the interaction between VPC networking and S3 bucket policies, where candidates overlook that a VPC without an S3 endpoint will cause AccessDenied errors even if the IAM role has full S3 permissions, because the bucket policy itself blocks non-VPC-endpoint traffic.

How to eliminate wrong answers

Option A is wrong because S3UploadMode only controls when output is uploaded (continuously or at the end of the job); it does not affect authorization, and the error is about access permissions, not upload mode. Option C is wrong because the cron expression `cron(0 * * * ? *)` is valid for hourly execution and is the correct format for SageMaker schedules; `rate(1 hour)` is used for EventBridge rules, not for SageMaker monitoring schedule expressions. Option D is wrong because missing baseline files would cause a different error (e.g., `NoSuchKey` or validation failure), not an intermittent AccessDenied error, and the error specifically points to S3 write access.

101

MCQhard

A team uses SageMaker Neo to compile a model for deployment on a target device. After compilation, they deploy the compiled model to a SageMaker endpoint using the Neo-optimized container. The endpoint fails to start with error "RuntimeError: Unable to load model". What could be the issue?

A.The compiled model was not uploaded to the correct S3 path.

B.The Neo compilation job failed silently.

C.The endpoint instance type does not support Neo.

D.The target device architecture during compilation does not match the endpoint instance architecture.

AnswerD

Neo models are compiled for specific architectures; mismatch causes load failure.

Why this answer

Option D is correct because SageMaker Neo compiles a model for a specific target architecture (e.g., ARM, x86, GPU). When deploying the compiled model to a SageMaker endpoint, the endpoint instance type must have a CPU or accelerator architecture that matches the target device specified during compilation. If they do not match, the Neo-optimized runtime cannot load the compiled binary, resulting in a 'RuntimeError: Unable to load model'.
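
A quick pre-deployment check for this mismatch could be sketched as below. It assumes that Neo target names for cloud instances follow the `ml_<family>` pattern (for example `ml_c5` for `ml.c5.xlarge`); both helper names are hypothetical.

```python
def neo_target_for_instance(instance_type: str) -> str:
    """Map an instance type such as 'ml.c5.xlarge' to a target such as 'ml_c5'."""
    prefix, family, _size = instance_type.split(".")
    return f"{prefix}_{family}"


def targets_match(compiled_target: str, endpoint_instance_type: str) -> bool:
    """True when the compilation target matches the endpoint's instance family."""
    return compiled_target == neo_target_for_instance(endpoint_instance_type)


print(targets_match("ml_c5", "ml.c5.xlarge"))   # True
print(targets_match("ml_c5", "ml.p3.2xlarge"))  # False
```

Running a check like this before creating the endpoint configuration surfaces the mismatch earlier than the runtime load error does.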

Exam trap

AWS often tests the misconception that Neo compilation is a generic optimization that works on any endpoint instance, when in fact the target architecture must exactly match the deployment instance's hardware.

How to eliminate wrong answers

Option A is wrong because if the compiled model were not uploaded to the correct S3 path, the endpoint would fail with an 'Unable to find model artifact' or S3 access error, not a runtime model loading error. Option B is wrong because if the Neo compilation job failed silently, no compiled model artifact would be produced, and the deployment would fail earlier with a missing artifact error, not a runtime load error. Option C is wrong because the error is raised when the runtime tries to load the compiled binary, which points to an architecture mismatch between the compilation target and the endpoint instance rather than to a general lack of Neo support on the instance type.

102

MCQmedium

An ML team is deploying a model using SageMaker. The model requires GPU inference and must be available in multiple AWS regions for low latency. The team has created a multi-model endpoint with GPU instances. After deployment, they notice high latency spikes when a new model is loaded. What is the most likely cause?

A.The team is using a multi-model endpoint, which loads models on demand; loading a model into GPU memory causes latency spikes.

B.The endpoint is configured with a single production variant, causing all traffic to overload one instance.

C.The endpoint is using the wrong instance type that lacks sufficient GPU memory.

D.The model is too large for the specified container memory, causing swap to disk.

AnswerA

Multi-model endpoints load and unload models from memory, causing latency spikes when a new model is accessed.

Why this answer

A multi-model endpoint (MME) loads models on demand from Amazon S3 into the instance's memory. When a new model is requested and not already cached, SageMaker must download the model artifacts and load them into GPU memory, which is a time-consuming operation that causes a latency spike for the first inference request. This cold-start behavior is inherent to MMEs and explains the observed spikes.

Exam trap

The trap here is that candidates may confuse multi-model endpoint cold-start latency with general endpoint misconfiguration (like instance type or variant count), but the key clue is the timing of the spikes—only when a new model is loaded—which directly points to the on-demand loading behavior of MMEs.

How to eliminate wrong answers

Option B is wrong because a single production variant does not inherently cause latency spikes when loading new models; it would cause consistent high latency under load, not spikes tied to model loading. Option C is wrong because the question states the team is using GPU instances, and insufficient GPU memory would cause out-of-memory errors or failures, not latency spikes. Option D is wrong because swap to disk would cause severe performance degradation for all inferences, not just when a new model is loaded, and SageMaker containers typically do not use swap for GPU memory.

103

MCQeasy

A team wants to automate the retraining and deployment of an ML model whenever new labeled data arrives in S3. The workflow includes data preprocessing, training, evaluation, and conditional deployment. Which AWS service is best suited for orchestrating this end-to-end pipeline?

A.AWS Step Functions with Lambda functions for each step.

B.AWS Glue workflows with triggers based on S3 events.

C.AWS CodePipeline with source from S3 and build from CodeBuild.

D.Amazon SageMaker Pipelines triggered by S3 events via EventBridge.

AnswerD

SageMaker Pipelines is designed for ML workflows and supports S3 event triggers.

Why this answer

Amazon SageMaker Pipelines is purpose-built for ML workflows, offering native integration with SageMaker for training, evaluation, and conditional deployment steps. Triggered by S3 events via Amazon EventBridge, it automates the end-to-end pipeline from data preprocessing to conditional model deployment without requiring custom orchestration code.
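
The trigger side of this setup can be sketched as an EventBridge event pattern that matches new objects under a prefix. The bucket name and prefix are placeholders, and the helper name is hypothetical.

```python
import json


def s3_object_created_pattern(bucket: str, prefix: str) -> dict:
    """EventBridge event pattern matching new objects under a key prefix."""
    return {
        "source": ["aws.s3"],
        "detail-type": ["Object Created"],
        "detail": {
            "bucket": {"name": [bucket]},
            "object": {"key": [{"prefix": prefix}]},
        },
    }


pattern = s3_object_created_pattern("example-training-data", "labeled/")
print(json.dumps(pattern, indent=2))
```

For the rule to fire, the bucket must have EventBridge notifications enabled, and the rule's target would be the SageMaker pipeline to start.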

Exam trap

The trap here is that candidates often choose AWS Step Functions (Option A) because it is a general-purpose orchestrator, but they overlook that SageMaker Pipelines provides tighter integration with ML-specific steps and reduces custom code overhead.

How to eliminate wrong answers

Option A is wrong because AWS Step Functions with Lambda functions would require you to manually implement each ML step (e.g., training, evaluation) and manage SageMaker API calls, lacking native ML-specific features like built-in model evaluation and conditional deployment logic. Option B is wrong because AWS Glue workflows are designed for ETL and data preparation, not for orchestrating ML training, evaluation, and deployment steps; they lack native support for SageMaker training jobs or model endpoints. Option C is wrong because AWS CodePipeline is a CI/CD service for application code, not optimized for ML workflows; it does not natively handle model evaluation, conditional deployment, or SageMaker-specific resources like training jobs and endpoints.

104

MCQhard

A large enterprise has multiple SageMaker endpoints serving models for different business units. Each endpoint uses a separate instance type and scaling policy. The enterprise wants to implement a unified monitoring and logging solution to track endpoint health, latency, and errors across all endpoints. They also want to set up alerts when the error rate exceeds 5% over a 5-minute period. The solution must be centralized and use AWS-native services. Which solution should the team implement?

A.Enable SageMaker Model Monitor data capture on each endpoint and stream captured data to Amazon Kinesis for analysis.

B.Use AWS CloudTrail to audit all API calls to SageMaker and set up alarms on error responses.

C.Use Amazon CloudWatch Logs to collect logs from each endpoint, and use a Lambda function to parse logs and calculate error rates, then publish custom metrics.

D.Use Amazon CloudWatch dashboards to aggregate metrics from all endpoints, and create a composite alarm based on the Sum of 5xx error counts across endpoints.

AnswerD

CloudWatch natively aggregates metrics and composite alarms can alert on the combined error rate.

Why this answer

Option D is correct because Amazon CloudWatch can natively ingest SageMaker endpoint metrics (e.g., 5xx error counts, latency, invocation counts) without additional configuration. By creating a CloudWatch dashboard, you aggregate metrics from all endpoints into a single view, and a composite alarm using the Sum statistic across endpoints over a 5-minute period directly triggers when the error rate exceeds 5%. This approach is fully centralized, uses only AWS-native services, and requires no custom code or data streaming.
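
One way to express a 5% error-rate threshold is a per-endpoint metric math alarm, combined across endpoints by a composite alarm rule. The sketch below builds the arguments as plain dictionaries; the endpoint names, alarm names, and the default variant name are assumptions.

```python
def error_rate_alarm(endpoint: str, variant: str = "AllTraffic") -> dict:
    """Keyword arguments for PutMetricAlarm: 5xx rate above 5% over 5 minutes."""
    dims = [
        {"Name": "EndpointName", "Value": endpoint},
        {"Name": "VariantName", "Value": variant},
    ]

    def metric(metric_id: str, name: str) -> dict:
        return {
            "Id": metric_id,
            "MetricStat": {
                "Metric": {"Namespace": "AWS/SageMaker", "MetricName": name, "Dimensions": dims},
                "Period": 300,   # 5-minute window
                "Stat": "Sum",
            },
            "ReturnData": False,
        }

    return {
        "AlarmName": f"{endpoint}-5xx-rate",
        "ComparisonOperator": "GreaterThanThreshold",
        "Threshold": 5.0,
        "EvaluationPeriods": 1,
        "Metrics": [
            metric("errors", "Invocation5XXErrors"),
            metric("total", "Invocations"),
            # Metric math: percentage of invocations that returned a 5xx error.
            {"Id": "rate", "Expression": "100 * errors / total",
             "Label": "5xx error rate (%)", "ReturnData": True},
        ],
    }


def composite_rule(alarm_names: list) -> str:
    """AlarmRule for a composite alarm that fires if any child alarm fires."""
    return " OR ".join(f'ALARM("{name}")' for name in alarm_names)


alarm = error_rate_alarm("endpoint-a")
rule = composite_rule(["endpoint-a-5xx-rate", "endpoint-b-5xx-rate"])
print(rule)
```

Keeping the rate calculation in each child alarm and the aggregation in the composite rule gives one central alert without any custom code running in the request path.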

Exam trap

The trap here is that candidates confuse SageMaker Model Monitor (data quality) with endpoint monitoring (operational health), or assume CloudWatch Logs are required when SageMaker endpoints already emit rich metrics directly to CloudWatch.

How to eliminate wrong answers

Option A is wrong because SageMaker Model Monitor is designed for detecting data drift and quality issues in the input data, not for tracking endpoint health, latency, or error rates; it captures inference data to Amazon S3, not to Kinesis, and does not provide real-time error rate alerts. Option B is wrong because AWS CloudTrail records API calls (e.g., CreateEndpoint, InvokeEndpoint) but does not capture the actual inference request/response payloads or error rates; it cannot measure latency or 5xx errors per invocation. Option C is wrong because SageMaker endpoints already publish invocation, latency, and error metrics directly to CloudWatch, so a Lambda function that parses logs to rebuild error rates as custom metrics adds unnecessary custom code and operational overhead.

105

MCQmedium

A data scientist is using Amazon SageMaker Studio to develop a model. The training job is taking longer than expected. The data scientist suspects that the data is being downloaded from Amazon S3 each time the training starts. What is the BEST way to reduce data loading time?

A.Use SageMaker Pipe Input mode to stream data directly from S3.

B.Enable S3 transfer acceleration and cache the data in S3.

C.Use a larger instance type with more network bandwidth.

D.Use Amazon FSx for Lustre to mount a high-performance file system.

AnswerA

Pipe mode streams data without downloading, reducing start time.

Why this answer

SageMaker Pipe input mode streams data directly from S3 into the training algorithm without first downloading it to the training instance's local storage. This eliminates the bottleneck of copying entire datasets, reducing startup time and disk usage. It is the most direct and efficient way to address the issue of repeated downloads from S3.
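
In a CreateTrainingJob request, the input mode is set per channel. The sketch below builds one `InputDataConfig` entry; the channel name and S3 URI are placeholders, and the helper name is hypothetical.

```python
def training_channel(name: str, s3_uri: str, input_mode: str = "Pipe") -> dict:
    """One InputDataConfig entry for a training job, streaming by default."""
    return {
        "ChannelName": name,
        "DataSource": {
            "S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": s3_uri,
                "S3DataDistributionType": "FullyReplicated",
            }
        },
        # "File" copies the dataset to local storage first; "Pipe" streams it.
        "InputMode": input_mode,
    }


channel = training_channel("train", "s3://example-bucket/train/")
print(channel["InputMode"])
```

The training code must read from the stream rather than from files on disk, so the algorithm or script has to support Pipe mode for this to work.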

Exam trap

The trap here is that candidates often choose a 'bigger instance' (Option C) as a brute-force fix, overlooking that Pipe mode fundamentally changes the data access pattern to eliminate the download bottleneck entirely.

How to eliminate wrong answers

Option B is wrong because S3 Transfer Acceleration speeds up long-distance transfers between clients and S3 through edge locations, which does not help a training instance reading from a bucket in its own Region, and caching in S3 does not change the fact that data must still be transferred to the instance. Option C is wrong because while a larger instance with more network bandwidth may reduce transfer time, it does not eliminate the fundamental overhead of downloading the entire dataset to local storage before training begins. Option D is wrong because Amazon FSx for Lustre provides a high-performance file system that can be mounted to SageMaker, but it still requires data to be loaded from S3 into the file system (for example, by linking the file system to an S3 data repository), adding complexity and not directly solving the repeated download issue as efficiently as Pipe mode.

106

MCQeasy

A machine learning engineer needs to deploy a TensorFlow model to Amazon SageMaker and wants to use the built-in TensorFlow Serving container. What should the engineer provide in the model archive?

A.A frozen graph of the TensorFlow model.

B.A tar.gz file containing the TensorFlow SavedModel.

C.Model artifacts and a Python inference script.

D.A Dockerfile and model artifacts.

AnswerB

SageMaker's TensorFlow serving container expects a SavedModel packaged as tar.gz.

Why this answer

The built-in TensorFlow Serving container in Amazon SageMaker expects a TensorFlow SavedModel packaged in a tar.gz archive. This is because TensorFlow Serving natively loads models from the SavedModel format, which includes the model's computational graph, weights, and assets in a standardized directory structure. Providing a tar.gz of the SavedModel ensures compatibility with the container's default serving stack without requiring custom inference code.
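
Packaging can be sketched with Python's standard `tarfile` module. The example uses empty stand-in files instead of a real export, and it assumes a layout with a numbered version directory at the archive root; the helper name is hypothetical.

```python
import tarfile
import tempfile
from pathlib import Path


def package_saved_model(export_dir: Path, archive: Path) -> list:
    """Pack a SavedModel version directory into a gzip-compressed tar archive."""
    with tarfile.open(archive, "w:gz") as tar:
        # Store the numbered version directory at the archive root.
        tar.add(export_dir, arcname=export_dir.name)
    with tarfile.open(archive, "r:gz") as tar:
        return sorted(tar.getnames())


with tempfile.TemporaryDirectory() as tmp:
    # Stand-in files that mimic a SavedModel export under version "1".
    version_dir = Path(tmp) / "1"
    (version_dir / "variables").mkdir(parents=True)
    (version_dir / "saved_model.pb").write_bytes(b"")
    (version_dir / "variables" / "variables.index").write_bytes(b"")
    names = package_saved_model(version_dir, Path(tmp) / "model.tar.gz")

print(names)
```

The resulting `model.tar.gz` is what would be uploaded to S3 and referenced as the model data location.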

Exam trap

AWS often tests the misconception that a frozen graph (Option A) is sufficient for TensorFlow Serving, but the exam expects candidates to know that TensorFlow Serving specifically requires the SavedModel format with its directory structure, not just a single protobuf file.

How to eliminate wrong answers

Option A is wrong because a frozen graph (typically a .pb file) is a legacy TensorFlow format that lacks the full SavedModel structure (e.g., variables and assets), and TensorFlow Serving does not natively support frozen graphs as a direct input; it requires the SavedModel format. Option C is wrong because a Python inference script is not required when using the built-in TensorFlow Serving container, which serves the SavedModel directly; a script is only needed for custom pre- or post-processing. Option D is wrong because a Dockerfile is not part of the model archive; SageMaker's built-in containers are pre-built, and providing a Dockerfile would indicate a custom container approach, which contradicts the requirement to use the built-in TensorFlow Serving container.

107

Multi-Selecteasy

A company wants to deploy its trained model to edge devices such as cameras and IoT devices. The model must run efficiently with low latency and minimal memory footprint. Which THREE actions should the company take to prepare the model for edge deployment? (Choose THREE.)

Select 3 answers

A.Use SageMaker Edge Manager to package and manage the model on devices.

B.Quantize the model to reduce precision and memory footprint.

C.Increase the model's complexity to improve accuracy on edge devices.

D.Use SageMaker Neo to compile the model for the target edge hardware.

E.Deploy the model directly as a SageMaker endpoint and have the edge devices call it over the internet.

AnswersA, B, D

Edge Manager provides tools for model packaging, deployment, and monitoring on edge.

Why this answer

SageMaker Edge Manager (option A) is purpose-built to package, optimize, and manage machine learning models on edge devices. It provides model packaging, runtime monitoring, and over-the-air updates for resource-constrained hardware like cameras and IoT devices. Quantization (option B) lowers the numeric precision of the weights, which shrinks the memory footprint, and SageMaker Neo (option D) compiles the model for the target edge hardware so that it runs with lower latency.

Exam trap

AWS often tests the misconception that edge deployment can rely on cloud endpoints for inference, but the correct approach is to optimize and run the model locally on the device to achieve low latency and offline operation.
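To make the quantization idea in option B concrete, here is a minimal sketch of symmetric int8 quantization in plain Python. It is a toy illustration rather than what SageMaker Neo or any framework actually does; the weight values are made up.

```python
def quantize_int8(weights: list[float]) -> tuple[list[int], float]:
    """Symmetric linear quantization of floats to the int8 range [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs else 1.0
    return [round(w / scale) for w in weights], scale


def dequantize(quantized: list[int], scale: float) -> list[float]:
    """Map int8 values back to approximate floats."""
    return [q * scale for q in quantized]


weights = [0.5, -1.27, 0.03, 1.0]  # hypothetical float32 weights
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)

# Rounding to the nearest step bounds the per-weight error by half a step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(quantized, scale, max_error)
```

Each weight now needs one byte instead of the four bytes of a float32, at the cost of a rounding error no larger than half the quantization step.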

108

MCQhard

During deployment of a Hugging Face model, the endpoint logs show this error. Which step was likely missed?

A.The inference container does not include the transformers library; the team should use a pre-built Hugging Face container.

B.The IAM role does not have permissions to download additional libraries.

C.The model artifact was not packaged correctly; the inference script is missing.

D.The endpoint configuration specifies the wrong instance type.

AnswerA

Hugging Face containers are pre-built with transformers and other dependencies.

Why this answer

The error indicates that the inference container cannot find the `transformers` library, which is required to load and run the Hugging Face model. By using a pre-built Hugging Face container from AWS, the team ensures that all necessary dependencies (like `transformers`, `tokenizers`, and `torch`) are pre-installed and compatible with the SageMaker inference environment. Option A is correct because the most likely missed step was selecting a generic container instead of the purpose-built Hugging Face container.

Exam trap

The trap here is that candidates confuse runtime dependency issues (missing Python libraries) with infrastructure or configuration problems (IAM permissions, instance types, or packaging), leading them to select a plausible-sounding but incorrect option like B or C.

How to eliminate wrong answers

Option B is wrong because IAM role permissions control access to AWS services (e.g., S3, ECR) and cannot prevent the container from downloading Python libraries at runtime; missing libraries are a container image issue, not an IAM issue. Option C is wrong because the error message specifically mentions a missing Python module (`transformers`), not a missing inference script or packaging error; if the inference script were missing, the error would be about a missing entry point or handler function. Option D is wrong because the instance type affects compute capacity and pricing, not the availability of Python libraries inside the container; an incorrect instance type would cause resource errors (e.g., memory or GPU), not an `ImportError`.
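A quick way to confirm this kind of failure is to check, inside the container image, whether the required modules can be found at all. The sketch below uses only the standard library; the list of module names is an assumption for illustration.

```python
import importlib.util


def missing_modules(required: list[str]) -> list[str]:
    """Return the names in `required` that cannot be found in this environment."""
    return [name for name in required if importlib.util.find_spec(name) is None]


# "transformers" is the library named in the explanation; "json" is stdlib.
print(missing_modules(["json", "transformers"]))
```

An empty list means every module is importable; any name that is returned points to a dependency missing from the image, which is the situation described above.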

109

MCQmedium

A team notices that inference requests to their SageMaker endpoint are failing with '504 Gateway Timeout' for large payloads. What change should be made?

A.Enable data capture on the endpoint

B.Increase the endpoint's invocation timeout

C.Deploy a shadow endpoint for testing

D.Switch to a multi-model endpoint

AnswerB

Increasing the invocation timeout allows more time for large payloads to be processed.

Why this answer

A 504 Gateway Timeout indicates that the SageMaker endpoint's invocation timeout (default 60 seconds) was exceeded while processing a large payload. Increasing the invocation timeout allows the endpoint more time to complete inference for large payloads, resolving the timeout error.

Exam trap

The trap here is that candidates confuse a 504 timeout with a 413 payload too large error, leading them to incorrectly consider multi-model endpoints or data capture instead of adjusting the invocation timeout.

How to eliminate wrong answers

Option A is wrong because enabling data capture logs inference requests and responses but does not affect the endpoint's timeout behavior or ability to handle large payloads. Option C is wrong because a shadow deployment mirrors production traffic to a candidate model for testing; it does not resolve timeout issues on the existing endpoint. Option D is wrong because switching to a multi-model endpoint improves resource utilization for multiple models but does not change the per-invocation timeout limit.

110

MCQmedium

A company deploys a model on Amazon SageMaker for real-time inference. The inference latency is too high. The model is a large deep learning model. The company wants to reduce latency without significantly impacting accuracy. Which approach should the company consider?

A.Increase the batch size for inference.

B.Use a smaller instance type to reduce inference time.

C.Use SageMaker Inference Recommender to test different instance types and optimizations.

D.Enable SageMaker Model Monitor to detect performance issues.

AnswerC

Inference Recommender helps find the optimal configuration for low latency.

Why this answer

SageMaker Inference Recommender is designed specifically to automate load testing and benchmarking across various instance types and model optimizations (e.g., Elastic Inference, GPU acceleration, serialization formats). It provides latency and throughput metrics to identify the optimal configuration for reducing inference latency while maintaining accuracy, making it the correct choice for a large deep learning model with high latency.

Exam trap

AWS often tests the misconception that reducing instance size or increasing batch size directly reduces latency, when in fact these actions typically increase latency or degrade throughput for real-time inference.

How to eliminate wrong answers

Option A is wrong because increasing batch size typically increases throughput but also increases per-request latency, as the model must process more data before returning results, which is counterproductive for real-time inference. Option B is wrong because using a smaller instance type generally reduces computational capacity, leading to longer inference times and higher latency, not lower. Option D is wrong because SageMaker Model Monitor is for detecting data drift, model quality degradation, and bias over time, not for optimizing inference performance or reducing latency.

111

MCQhard

Refer to the exhibit. An AWS IAM policy is attached to a role used by a CI/CD pipeline to deploy SageMaker endpoints. The pipeline attempts to create an endpoint configuration with a VPC subnet that is not subnet-0123456789abcdef0. What will happen when the pipeline tries to create the endpoint configuration?

A.The action will be denied because the Deny statement explicitly blocks CreateEndpointConfig when the subnet does not match.

B.The action will be allowed because the CreateEndpoint statement allows all endpoints.

C.The action will be allowed only if the endpoint configuration uses a VPC with multiple subnets.

D.The action will be allowed because the policy lacks a Deny on the subnet condition for the endpoint resource.

AnswerA

An explicit Deny overrides any Allow, and the condition is not met.

Why this answer

Option A is correct because the IAM policy includes a Deny statement with a condition that explicitly blocks the `CreateEndpointConfig` action when the subnet specified in the request does not match `subnet-0123456789abcdef0`. Since the pipeline is attempting to create an endpoint configuration with a different subnet, the Deny statement overrides any Allow statements, resulting in the action being denied.

Exam trap

The trap here is that candidates may assume an Allow statement for `CreateEndpoint` would permit the action, but they overlook that an explicit Deny on the `CreateEndpointConfig` action with a subnet condition takes precedence, causing the request to fail.

How to eliminate wrong answers

Option B is wrong because the policy contains a Deny statement that explicitly restricts the subnet condition for `CreateEndpointConfig`, so the Allow on `CreateEndpoint` does not override the Deny; IAM Deny statements always take precedence. Option C is wrong because the policy does not grant any special permission for multiple subnets; the Deny condition applies regardless of the number of subnets used. Option D is wrong because the policy does include a Deny on the subnet condition for the `sagemaker:CreateEndpointConfig` action, not for the endpoint resource, so the action is blocked.
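The precedence rule can be illustrated with a toy evaluator. The policy below is hypothetical and heavily simplified (the exhibit itself is not reproduced here, and `RequiredSubnet` is not real IAM syntax); it only demonstrates that a matching explicit Deny wins over any Allow.

```python
def evaluate(statements: list[dict], action: str, subnets: list[str]) -> str:
    """Toy evaluation: explicit Deny beats Allow; no match is an implicit deny."""
    allowed = False
    for stmt in statements:
        if action not in stmt["Action"]:
            continue
        required = stmt.get("RequiredSubnet")  # hypothetical simplified condition
        if stmt["Effect"] == "Deny":
            # The Deny applies when any requested subnet differs from the required one.
            if required is None or any(s != required for s in subnets):
                return "Deny"
        else:
            allowed = True
    return "Allow" if allowed else "Deny"


# Hypothetical stand-in for the policy described in the question.
POLICY = [
    {"Effect": "Allow",
     "Action": ["sagemaker:CreateEndpoint", "sagemaker:CreateEndpointConfig"]},
    {"Effect": "Deny",
     "Action": ["sagemaker:CreateEndpointConfig"],
     "RequiredSubnet": "subnet-0123456789abcdef0"},
]

other = evaluate(POLICY, "sagemaker:CreateEndpointConfig", ["subnet-0aaaaaaaaaaaaaaaa"])
match = evaluate(POLICY, "sagemaker:CreateEndpointConfig", ["subnet-0123456789abcdef0"])
print(other, match)
```

With a non-matching subnet the Deny statement fires even though an Allow also matches the action, which mirrors the reasoning for option A.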

112

Multi-Selectmedium

A company deploys a model on SageMaker that serves predictions to a web application. The model's performance degrades over time due to data drift. The company wants to set up continuous monitoring. Which TWO actions should the company take to monitor and retrain the model effectively? (Choose TWO.)

Select 2 answers

A.Manually review model performance monthly and retrain if necessary.

B.Configure an Amazon EventBridge rule to start a retraining pipeline when the Model Monitor detects violations.

C.Enable SageMaker Model Monitor to capture inference data and run monitoring schedules.

D.Use Amazon CloudWatch Logs Insights to query inference logs for anomalies.

E.Deploy the model on multiple endpoints with A/B testing to compare performance.

AnswersB, C

EventBridge can react to Model Monitor violation events to trigger automatic retraining.

Why this answer

Option B is correct because Amazon EventBridge can be configured to trigger a retraining pipeline automatically when SageMaker Model Monitor detects data drift or other violations, enabling a closed-loop monitoring and retraining system. Option C is correct because SageMaker Model Monitor must first be enabled to capture inference data and run monitoring schedules, which is the prerequisite for detecting drift and triggering automated actions.

Exam trap

The trap here is that candidates may confuse general monitoring tools like CloudWatch Logs Insights with the specialized, model-aware monitoring capabilities of SageMaker Model Monitor, or they may overlook that EventBridge automation requires Model Monitor to be enabled first.

113

MCQmedium

Your company uses SageMaker batch transform to process a large dataset (5 TB) of customer transactions every night. The batch transform job uses a single ml.c5.4xlarge instance and takes about 6 hours to complete. However, the job recently started failing with an error message: 'Timed out waiting for transformation to complete. The maximum job duration is 3600 seconds.' You check the input data and notice that one of the input files is a single large JSON file of 50 GB, while the rest are smaller files. The job is configured with a batch strategy of 'MultiRecord' and a maximum payload size of 6 MB. What is the most likely cause of the timeout and which fix should you apply?

A.Set the batch strategy to 'SingleRecord' so that each record is processed individually.

B.Split the large JSON file into smaller files (e.g., 100 MB each) before feeding to the batch transform job.

C.Increase the job timeout to 7200 seconds.

D.Increase the number of instances to 5 in the batch transform job.

AnswerB

SageMaker batch transform splits input on file boundaries; small files allow parallel processing and stay within time limits.

Why this answer

The batch transform job is timing out because the single 50 GB JSON file cannot be processed within the default 3600-second (1-hour) timeout. With a 'MultiRecord' batch strategy and a 6 MB maximum payload size, SageMaker must split the large file into many small batches, but the job still tries to read the entire file sequentially, causing excessive processing time. Splitting the large file into smaller files (e.g., 100 MB each) allows SageMaker to parallelize and complete the transform within the timeout.

Exam trap

AWS often tests the misconception that increasing instances or timeout alone can solve performance bottlenecks caused by a single large input file, when in fact SageMaker batch transform processes each file on a single instance and requires file-level splitting for parallelism.

How to eliminate wrong answers

Option A is wrong because setting the batch strategy to 'SingleRecord' would process each record individually, which would increase the number of API calls and likely worsen the timeout issue, not resolve it. Option C is wrong because increasing the job timeout to 7200 seconds only masks the underlying problem of the oversized file; the job may still fail due to resource constraints or eventually hit other limits. Option D is wrong because increasing the number of instances does not help when a single massive file cannot be split across instances—SageMaker batch transform assigns each file to a single instance, so the 50 GB file would still be processed by one instance, causing the same timeout.
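The file-splitting fix in option B could look roughly like the sketch below. It assumes the large input is in JSON Lines form (one record per line), which the question does not state; a single monolithic JSON document would need a streaming parser instead. File names and sizes are illustrative.

```python
import tempfile
from pathlib import Path


def split_jsonl(source: Path, out_dir: Path, max_bytes: int) -> list[Path]:
    """Split a JSON Lines file into parts of at most roughly max_bytes each.

    Lines are never cut in half, so a part exceeds max_bytes only when a
    single record is itself larger than the limit.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    parts: list[Path] = []
    current = None
    written = 0
    with source.open("rb") as src:
        for line in src:
            # Start a new part for the first line, or when this line would overflow.
            if current is None or (written and written + len(line) > max_bytes):
                if current is not None:
                    current.close()
                part = out_dir / f"part-{len(parts):05d}.jsonl"
                parts.append(part)
                current = part.open("wb")
                written = 0
            current.write(line)
            written += len(line)
    if current is not None:
        current.close()
    return parts


# Tiny demo: ten 10-byte records split with a 25-byte limit (two records per part).
workdir = Path(tempfile.mkdtemp())
source = workdir / "transactions.jsonl"
source.write_text("".join(f'{{"id": {i}}}\n' for i in range(10)))
parts = split_jsonl(source, workdir / "parts", max_bytes=25)
print(len(parts))
```

Because every part is a separate file, the parts can be distributed across instances rather than being read sequentially by one.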

114

MCQmedium

A company is deploying a large number of small models (each < 100 MB) for different customers. They want to minimize costs and management overhead while serving traffic that varies significantly. Which SageMaker endpoint type should they choose?

A.A batch transform job

B.A multi-model endpoint on a GPU instance

C.A multi-variant endpoint to route traffic to different model versions

D.A serverless endpoint

AnswerB

MME allows hosting many models on one instance, reducing costs.

Why this answer

A multi-model endpoint (MME) on a GPU instance is the best choice because it allows you to host multiple small models (< 100 MB each) on a single endpoint, sharing the underlying GPU instance to reduce costs. SageMaker MME dynamically loads and unloads models based on traffic, which minimizes management overhead and handles variable traffic patterns efficiently without provisioning separate endpoints per model.

Exam trap

The trap here is that candidates confuse 'multi-model endpoint' (hosting many models on one endpoint) with 'multi-variant endpoint' (routing traffic to different versions of the same model), leading them to select option C incorrectly.

How to eliminate wrong answers

Option A is wrong because batch transform jobs are designed for offline, asynchronous inference on large datasets, not for serving real-time traffic that varies significantly. Option C is wrong because a multi-variant endpoint is used to route traffic between different versions (variants) of the same model for A/B testing or gradual rollouts, not to host multiple distinct models per customer. Option D is wrong because serverless endpoints, although they scale to zero, do not support GPU instances or multi-model hosting, so each customer model would still need its own endpoint to create and manage.

115

MCQmedium

A data science team has trained a PyTorch model using Amazon SageMaker and wants to deploy it with a custom inference container that includes a pre-processing step. The team needs to minimize latency and ensure the pre-processing runs only once per request. Which SageMaker real-time inference option should they use?

A.Deploy the model on a multi-model endpoint and include pre-processing in the model code.

B.Use a batch transform job with a pre-processing script.

C.Package pre-processing and inference in a single container with a custom entry point.

D.Create a SageMaker inference pipeline with two containers: one for pre-processing and one for inference.

AnswerD

An inference pipeline chains containers sequentially, allowing pre-processing to run once per request with low latency.

Why this answer

Option D is correct because a SageMaker inference pipeline allows you to chain two containers in a single endpoint, where the first container handles pre-processing and the second runs inference. This ensures that pre-processing runs exactly once per request, minimizing latency by avoiding redundant processing and keeping the request within the same HTTP connection.

Exam trap

AWS often tests the distinction between a single-container approach (Option C) and a multi-container pipeline (Option D), where candidates mistakenly think a single custom container is simpler and sufficient, but the pipeline is required to guarantee that pre-processing runs exactly once per request and to allow independent scaling or updates of the pre-processing logic.

How to eliminate wrong answers

Option A is wrong because a multi-model endpoint hosts multiple models on the same container, but it does not support a separate pre-processing step; any pre-processing would be embedded in the model code and run per model load, not once per request, and it cannot guarantee a separate container for pre-processing. Option B is wrong because a batch transform job is designed for asynchronous, offline processing of large datasets, not for real-time inference with low latency requirements. Option C is wrong because packaging pre-processing and inference in a single container with a custom entry point runs both steps sequentially per request, but it does not leverage SageMaker's built-in pipeline orchestration, and if the pre-processing logic changes, the entire container must be rebuilt, whereas a pipeline allows independent updates.

116

MCQhard

A financial services company has a SageMaker pipeline that trains a fraud detection model daily. The pipeline consists of three steps: preprocessing (using a Spark script), training (XGBoost), and evaluation. The evaluation step calculates the F1 score and compares it to a threshold of 0.95. If the F1 score is below 0.95, the pipeline should fail and notify the team via email. The team implemented this using a Condition step that checks if the F1 score is greater than or equal to 0.95. If true, the pipeline proceeds to register the model; if false, the pipeline fails. However, the team notices that even when the F1 score is 0.94, the pipeline continues to the registration step. The evaluation script outputs the F1 score as a float with two decimal places in a JSON file. The Condition step uses the expression: $.evaluation.metrics.f1_score >= 0.95. What is the most likely cause of the issue?

A.The evaluation step must be split into two steps: one for evaluation and one for condition check

B.The evaluation script outputs the F1 score as a string, and string comparison '0.94' >= '0.95' evaluates to true because it is lexicographically compared

C.The Condition step cannot be used to check metric values; it can only check step status

D.The threshold should be set to 0.95 but the Condition step uses a less than or equal operator

AnswerB

If the script writes the F1 score as a string, the Condition step does not compare it numerically and can behave unexpectedly. A strict lexicographic comparison would still place '0.94' below '0.95', so the type mismatch, rather than the string ordering itself, is the likely culprit. The fix is to output the score as a JSON number.

Why this answer

The most likely cause is that the evaluation script outputs the F1 score as a string (e.g., "0.94") rather than a numeric value. In SageMaker Pipelines, the Condition step reads the value from the evaluation report via a JSONPath expression, and when that value is a string it is not compared numerically against the 0.95 threshold, so the result can differ from what the team expects. Note that a strict character-by-character comparison would still order "0.94" below "0.95" (the strings first differ at '4' versus '5'), so the lexicographic rationale given in option B is imprecise; the type mismatch is the real problem. Writing the score as a JSON number makes the comparison numeric and fixes the gate.

Exam trap

AWS often tests the subtle distinction between numeric and string comparisons in AWS Step Functions and SageMaker Pipelines, where candidates assume that a value that looks like a number will be compared numerically, but the actual behavior depends on the data type in the JSON output.

How to eliminate wrong answers

Option A is wrong because splitting the evaluation step into two steps would not fix the root cause—the issue is a data type mismatch, not a step separation problem. Option C is wrong because the Condition step can absolutely check metric values using JSONPath expressions; it is not limited to checking step status. Option D is wrong because the operator used (>=) is correct for the intended logic (pass if F1 >= 0.95); the issue is that the comparison is lexicographic due to string values, not that the operator is wrong.
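The difference between string and numeric comparison is easy to see in plain Python. This sketch shows Python's own semantics only; it is not a claim about how the SageMaker Condition step evaluates a string operand, and the report structure is a made-up stand-in for the evaluation output.

```python
import json

# Hypothetical evaluation report with the score emitted as a string.
report = json.loads('{"metrics": {"f1_score": "0.94"}}')
raw = report["metrics"]["f1_score"]

string_result = raw >= "0.95"        # lexicographic comparison of two strings
numeric_result = float(raw) >= 0.95  # numeric comparison after conversion

# Lexicographic and numeric ordering disagree once digit counts differ.
lexicographic_mismatch = "10" >= "9"  # compares '1' with '9' first
numeric_reference = 10 >= 9

print(type(raw).__name__, string_result, numeric_result)
print(lexicographic_mismatch, numeric_reference)
```

For "0.94" both comparisons happen to agree, but the `"10"` versus `"9"` case shows why a gate should never depend on string ordering; emitting the metric with `json.dumps({"f1_score": 0.94})` keeps it a number.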

117

Multi-Selecteasy

A data science team is deploying a model on Amazon SageMaker and wants to protect the endpoint from unauthorized access. Which TWO methods can the team use to secure the endpoint? (Choose TWO.)

Select 2 answers

A.Configure the endpoint to be deployed within a VPC and control traffic using security groups and network ACLs.

B.Use a resource-based IAM policy on the endpoint to restrict invocation.

C.Place an Amazon API Gateway in front of the endpoint with AWS WAF.

D.Attach a security group directly to the SageMaker endpoint.

E.Use an IAM policy that requires authentication for the sagemaker:InvokeEndpoint action.

AnswersA, E

Deploying inside a VPC allows network-level access control.

Why this answer

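Option A is correct because deploying a SageMaker endpoint within a VPC allows you to control inbound and outbound traffic using security groups and network ACLs, effectively restricting network-level access to the endpoint. This is a fundamental network security measure that prevents unauthorized network traffic from reaching the endpoint. Option E is correct because endpoint invocations are signed AWS API requests, so an identity-based IAM policy that grants sagemaker:InvokeEndpoint only to specific principals restricts who can call the endpoint.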

Exam trap

The trap here is that candidates often confuse resource-based IAM policies (which are not supported for SageMaker endpoints) with identity-based policies, or they assume that attaching a security group directly to an endpoint is possible without deploying it in a VPC.

118

MCQhard

A company uses SageMaker endpoints with auto-scaling based on CPU utilization. During a flash sale, latency increases despite low CPU. What should be done?

A.Use a custom metric such as memory utilization or request count for auto-scaling

B.Increase the instance size

C.Disable auto-scaling and use a larger instance

D.Switch to GPU instances

AnswerA

Custom metrics can better capture the actual load and scale appropriately.

Why this answer

Option A is correct because CPU utilization is a poor scaling metric for inference workloads that are I/O or memory-bound. During a flash sale, increased request concurrency can cause queuing and latency spikes even when CPU is low. Using a custom metric like request count per instance or memory utilization directly reflects the load on the inference endpoint, enabling the Application Auto Scaling target tracking policy to scale out proactively before latency degrades.

Exam trap

The trap here is that candidates assume CPU utilization is always the best scaling metric for compute-bound workloads, but the MLA-C01 exam specifically tests the understanding that inference endpoints can be I/O-bound, making request count or memory utilization more appropriate for auto-scaling.

How to eliminate wrong answers

Option B is wrong because increasing the instance size does not address the root cause—auto-scaling is not triggering due to an inappropriate metric; it merely shifts the bottleneck to a larger instance without solving the scaling policy issue. Option C is wrong because disabling auto-scaling removes elasticity entirely, which is counterproductive for handling unpredictable traffic spikes like a flash sale; a static larger instance will either be over-provisioned or still suffer latency under extreme load. Option D is wrong because GPU instances are designed for compute-heavy workloads like deep learning inference, not for resolving latency caused by request queuing or I/O bottlenecks; they add cost without fixing the scaling metric problem.
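When scaling on request count, the target value has to come from somewhere. A commonly cited rule of thumb from AWS load-testing guidance is to take the peak requests per second one instance can sustain, apply a safety factor, and convert to a per-minute figure. The sketch below applies that arithmetic; the 20 RPS figure and the 0.5 default safety factor are assumptions for illustration.

```python
def invocations_per_instance_target(max_rps: float, safety_factor: float = 0.5) -> float:
    """Per-minute invocation target for one instance, given its load-tested peak RPS."""
    if not 0 < safety_factor <= 1:
        raise ValueError("safety_factor must be in (0, 1]")
    # The metric is reported per minute, hence the factor of 60.
    return max_rps * safety_factor * 60


# Hypothetical load test result: one instance sustains 20 requests per second.
target = invocations_per_instance_target(max_rps=20)
print(target)
```

A lower safety factor leaves more headroom, so new instances are added earlier in a spike at the price of running more capacity on average.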

119

MCQmedium

A data science team has trained a model using SageMaker and wants to deploy it for real-time inference with automatic scaling based on request latency. The deployment must handle unpredictable traffic spikes without manual intervention. Which combination of SageMaker features should the team use?

A.Create a SageMaker endpoint with an Application Auto Scaling target tracking policy based on the SageMakerVariantInvocationsPerInstance metric

B.Deploy the model on a multi-model endpoint and manually adjust the number of instances via the AWS Management Console

C.Deploy the model on an Elastic Inference accelerator and use AWS Auto Scaling with a scheduled policy

D.Create a batch transform job with a scheduled Lambda function to trigger scaling

AnswerA

SageMaker endpoints support Application Auto Scaling with target tracking on invocations per instance, handling spikes.

Why this answer

Option A is correct because it uses a SageMaker endpoint with an Application Auto Scaling target tracking policy based on the SageMakerVariantInvocationsPerInstance metric. This metric measures the average number of invocations per minute handled by each instance, so it tracks load per instance rather than latency directly; keeping it near a load-tested target is what keeps latency in check. The target tracking policy adds or removes instances to hold the metric at the target value, handling unpredictable traffic spikes without manual intervention.

Exam trap

The trap here is that candidates may confuse automatic scaling with manual adjustments or batch processing, or mistakenly think that Elastic Inference or scheduled policies can handle real-time, unpredictable traffic spikes, when only a target tracking policy on a SageMaker endpoint provides the required dynamic, latency-aware scaling.

How to eliminate wrong answers

Option B is wrong because manually adjusting instances via the AWS Management Console does not provide automatic scaling, which is required to handle unpredictable traffic spikes without manual intervention. Option C is wrong because Elastic Inference accelerators are used to reduce the cost of deep learning inference by attaching a fraction of GPU power to an instance, not for scaling based on latency; AWS Auto Scaling with a scheduled policy is not suitable for unpredictable spikes as it relies on predefined schedules. Option D is wrong because a batch transform job is designed for offline, asynchronous inference on large datasets, not for real-time inference, and a scheduled Lambda function cannot dynamically scale based on real-time latency metrics.
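A target tracking setup like the one in option A might be configured as sketched below. The function takes the client as a parameter so the example runs without AWS credentials; in practice it would be `boto3.client("application-autoscaling")`. The endpoint name, variant name, capacity limits, and target value are all hypothetical.

```python
def configure_invocation_scaling(client, endpoint_name, variant_name,
                                 target_value, min_capacity=1, max_capacity=4):
    """Register the variant as a scalable target and attach a target tracking policy.

    `client` is expected to behave like boto3.client("application-autoscaling").
    """
    resource_id = f"endpoint/{endpoint_name}/variant/{variant_name}"
    dimension = "sagemaker:variant:DesiredInstanceCount"
    client.register_scalable_target(
        ServiceNamespace="sagemaker",
        ResourceId=resource_id,
        ScalableDimension=dimension,
        MinCapacity=min_capacity,
        MaxCapacity=max_capacity,
    )
    client.put_scaling_policy(
        PolicyName=f"{endpoint_name}-invocations-target",
        ServiceNamespace="sagemaker",
        ResourceId=resource_id,
        ScalableDimension=dimension,
        PolicyType="TargetTrackingScaling",
        TargetTrackingScalingPolicyConfiguration={
            "TargetValue": target_value,
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
            },
        },
    )
    return resource_id


class RecordingClient:
    """Stand-in that records calls instead of contacting AWS."""

    def __init__(self):
        self.calls = []

    def register_scalable_target(self, **kwargs):
        self.calls.append(("register_scalable_target", kwargs))

    def put_scaling_policy(self, **kwargs):
        self.calls.append(("put_scaling_policy", kwargs))


client = RecordingClient()
resource_id = configure_invocation_scaling(client, "demo-endpoint", "AllTraffic", 600.0)
print(resource_id, [name for name, _ in client.calls])
```

The scalable target must be registered before the policy is attached, which is why the two calls appear in that order.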

120

MCQmedium

A company uses SageMaker for training and inference. They have a model that retrains weekly. After each retraining, the model is evaluated on a held-out test set. If the evaluation metrics meet a threshold, the model is registered as 'Approved' in the SageMaker Model Registry. The team manually deploys the approved model to a production endpoint. They want to automate this deployment process to reduce manual errors. However, the deployment should only proceed if the new model passes a canary test in a staging environment. Which combination of AWS services should the team use to achieve this?

A.AWS CodeDeploy with a blue/green deployment strategy.

B.SageMaker Pipelines with a conditional deployment step that includes a canary test.

C.AWS Lambda to deploy to staging, then automatically promote to production if staging tests pass.

D.Amazon EKS with a custom inference container and use ArgoCD for automated deployments.

AnswerB

Pipelines natively support conditional steps, so promotion to production can be gated on the result of a canary test in staging.

Why this answer

SageMaker Pipelines natively supports conditional execution steps, allowing you to add a canary test step that evaluates the new model in a staging environment before automatically promoting it to production. This directly addresses the requirement for automated deployment gated by a canary test, without needing external orchestration services.

Exam trap

The trap here is that candidates may overthink the solution and choose a generic CI/CD tool like CodeDeploy or Lambda, missing that SageMaker Pipelines already provides a fully managed, ML-specific orchestration with conditional deployment and canary testing capabilities.

How to eliminate wrong answers

Option A is wrong because AWS CodeDeploy with blue/green deployment is a general-purpose deployment service for EC2, Lambda, or ECS, not integrated with SageMaker Model Registry or SageMaker endpoints, and lacks native canary testing for ML models. Option C is wrong because using AWS Lambda to deploy to staging and then promote to production would require custom code to manage the canary test logic, state tracking, and rollback, which is less reliable and maintainable than SageMaker Pipelines' built-in conditional steps. Option D is wrong because Amazon EKS with ArgoCD is designed for Kubernetes container orchestration, not for managing SageMaker endpoints or Model Registry, and introduces unnecessary complexity for a SageMaker-native workflow.

121

MCQhard

A company has a SageMaker endpoint running a model that provides real-time recommendations. Recently, the model's accuracy has degraded due to data drift. The team wants to automatically retrain the model when a drift metric exceeds a threshold and deploy the new model without downtime. Which architecture should the team implement?

A.Use SageMaker Model Monitor to collect drift metrics, and have a data scientist manually analyze the metrics and trigger retraining via the SageMaker console

B.Use SageMaker Model Monitor to trigger an Amazon EventBridge event that starts a SageMaker Pipeline, which retrains the model, registers it in the Model Registry, and then updates the existing endpoint with a new production variant

C.Schedule a daily SageMaker Pipeline that retrains the model and deploys it using a new endpoint, then updates the application to point to the new endpoint

D.Use SageMaker Model Monitor to publish drift metrics to Amazon CloudWatch, and create a CloudWatch alarm that triggers an AWS Lambda function to retrain and deploy the model

AnswerB

EventBridge triggers pipeline on drift; pipeline retrains, registers, and uses production variant to shift traffic gradually with no downtime.

Why this answer

Option B is correct because it uses SageMaker Model Monitor to detect data drift and emit an EventBridge event, which triggers a SageMaker Pipeline to retrain the model, register it in the Model Registry, and then update the existing endpoint with a new production variant. This architecture enables automatic retraining and zero-downtime deployment by leveraging the endpoint's production variants for a blue/green deployment.

Exam trap

AWS often tests the distinction between automatic drift-triggered retraining with zero-downtime deployment (Option B) versus scheduled retraining or manual intervention, and candidates may overlook the need to update the existing endpoint rather than creating a new one.

How to eliminate wrong answers

Option A is wrong because it relies on manual analysis and triggering, which does not meet the requirement for automatic retraining. Option C is wrong because scheduling a daily pipeline ignores the data drift trigger and deploys a new endpoint instead of updating the existing one, causing downtime or requiring application changes to point to the new endpoint. Option D is wrong because while it uses CloudWatch alarms and Lambda for automation, it lacks the integration with SageMaker Model Registry and the ability to update the existing endpoint with a new production variant, potentially causing downtime or manual intervention.


122

Multi-Selecteasy

A company wants to deploy a trained model to a SageMaker endpoint with automatic scaling based on traffic. Which TWO configurations are required? (Choose two.)

Select 2 answers

A.Use a multi-model endpoint

B.Enable data capture

C.Set up an Application Auto Scaling policy

D.Configure a lifecycle configuration

E.Create a CloudWatch alarm

AnswersC, E

An Application Auto Scaling policy defines how the endpoint scales, and a CloudWatch alarm on the endpoint's metrics triggers it.

Why this answer

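Option C is correct because Application Auto Scaling is the AWS service that automatically adjusts the number of instances for a SageMaker endpoint based on demand. You define a scaling policy (e.g., target tracking, step scaling) that tells Application Auto Scaling when to add or remove instances, which is essential for handling variable traffic without manual intervention. Option E is correct because the scaling policy acts on CloudWatch alarms over the endpoint's metrics: with step scaling you create the alarm yourself, while with target tracking Application Auto Scaling creates and manages the alarms for you.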
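As a rough sketch of what this configuration looks like with boto3, the two requests below register the endpoint variant as a scalable target and attach a target tracking policy. The endpoint name, variant name, capacity limits, target value, cooldowns, and the helper names `scaling_requests` and `apply_scaling` are illustrative assumptions; `register_scalable_target` and `put_scaling_policy` are the real Application Auto Scaling calls.

```python
from typing import Dict, Tuple


def scaling_requests(endpoint_name: str, variant_name: str,
                     min_instances: int, max_instances: int,
                     target_invocations_per_instance: float) -> Tuple[Dict, Dict]:
    """Build the kwargs for register_scalable_target and put_scaling_policy."""
    common = {
        "ServiceNamespace": "sagemaker",
        "ResourceId": f"endpoint/{endpoint_name}/variant/{variant_name}",
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
    }
    target = {**common, "MinCapacity": min_instances, "MaxCapacity": max_instances}
    policy = {
        **common,
        "PolicyName": f"{endpoint_name}-invocations-target",  # hypothetical name
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            "TargetValue": target_invocations_per_instance,
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance",
            },
            "ScaleOutCooldown": 60,   # seconds, illustrative
            "ScaleInCooldown": 300,   # seconds, illustrative
        },
    }
    return target, policy


def apply_scaling(target: Dict, policy: Dict) -> None:
    """Send both requests. Needs boto3 and AWS credentials,
    so it is not called in this sketch."""
    import boto3

    autoscaling = boto3.client("application-autoscaling")
    autoscaling.register_scalable_target(**target)
    autoscaling.put_scaling_policy(**policy)
```

With a target tracking policy like this one, no alarm is created by hand; the alarms appear in CloudWatch once the policy is attached.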

Exam trap

The trap here is that candidates often confuse 'required configurations for scaling' with 'optional features that improve monitoring or cost efficiency,' leading them to select data capture or multi-model endpoints instead of recognizing that a CloudWatch alarm is the trigger mechanism for the scaling policy.


123

MCQmedium

A team is deploying a machine learning model using Amazon SageMaker. They need to serve predictions with sub-100ms latency for a real-time application. The model is a large ensemble that requires 4 GB of memory. The team expects traffic of 100 requests per second initially, but it may double during peak hours. Which instance type and deployment configuration should the team choose to minimize cost while meeting the latency requirement?

A.Deploy on one ml.c5.large instance with an Application Auto Scaling target tracking policy based on memory utilization

B.Deploy on one ml.t2.medium instance with an Application Auto Scaling target tracking policy based on CPU utilization

C.Deploy on one ml.p3.2xlarge instance with provisioned concurrency

D.Deploy on two ml.m5.large instances behind a load balancer with manual scaling

AnswerA

The ml.c5.large has 4 GB of memory, matching the model's stated requirement; one instance handles the initial 100 requests per second, and auto scaling adds capacity at peak.

Why this answer

Option A is correct because the ml.c5.large instance provides 4 GB of memory, which meets the model's requirement, and its compute-optimized nature supports low-latency inference. Using Application Auto Scaling with a target tracking policy based on memory utilization allows the endpoint to scale out during traffic spikes (up to 200 requests per second) while minimizing cost by running a single instance during normal load. Note that memory utilization is not a predefined scaling metric for SageMaker endpoints, so such a policy would need a customized metric specification.
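The sizing reasoning can be made concrete with a little arithmetic. The sketch below is illustrative only: the per-instance throughput of 250 requests per second is a hypothetical load-test result, not a figure from the question, and the 0.5 safety factor is an assumed headroom choice. The function names are also made up for this example.

```python
import math


def instances_needed(peak_rps: float, max_rps_per_instance: float,
                     safety_factor: float = 0.5) -> int:
    """Instances required so each one stays below its safe throughput."""
    usable_rps = max_rps_per_instance * safety_factor
    return math.ceil(peak_rps / usable_rps)


def invocations_per_instance_target(max_rps_per_instance: float,
                                    safety_factor: float = 0.5) -> float:
    """Per-minute target value for an invocations-per-instance tracking policy."""
    return max_rps_per_instance * safety_factor * 60


# Hypothetical load-test result: one instance sustains 250 requests per second
# within the latency budget.
initial = instances_needed(peak_rps=100, max_rps_per_instance=250)
peak = instances_needed(peak_rps=200, max_rps_per_instance=250)
target = invocations_per_instance_target(max_rps_per_instance=250)
```

Under these assumptions one instance covers the initial 100 requests per second and two cover the 200 requests per second peak, which is the pattern the answer relies on: a single instance at normal load, with scale-out only when traffic doubles.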

Exam trap

The trap here is that candidates often choose GPU instances (like p3) for any 'large' model, but the question specifies memory and latency requirements, not GPU compute needs, and they overlook that burstable instances (t2) cannot sustain low latency under continuous load due to CPU credit exhaustion.

How to eliminate wrong answers

Option B is wrong because the ml.t2.medium instance has 4 GB of memory but uses burstable CPU (t2 series), which cannot sustain sub-100ms latency under continuous load due to CPU credit exhaustion, especially at 100-200 requests per second. Option C is wrong because the ml.p3.2xlarge instance is a GPU-accelerated instance whose cost is not justified here: the question states memory and latency requirements, not a need for GPU compute, so it is over-provisioned for this memory-bound ensemble model. In addition, provisioned concurrency applies to AWS Lambda and SageMaker Serverless Inference, not to instance-based real-time endpoints. Option D is wrong because deploying two ml.m5.large instances (each with 8 GB memory) behind a load balancer with manual scaling is over-provisioned for the initial 100 requests per second, increasing cost unnecessarily, and manual scaling cannot handle peak traffic without someone changing the instance count by hand.


124

MCQhard

A company is deploying a real-time inference endpoint for a natural language processing model using Amazon SageMaker. The model requires GPU acceleration and must handle variable traffic patterns, including sudden spikes. The team wants to minimize costs while maintaining low latency during spikes. Which endpoint configuration strategy should they use?

A.Use a single large GPU instance with provisioned concurrency.

B.Use a serverless endpoint with GPU support.

C.Use a single GPU instance in multiple Availability Zones with an Application Load Balancer.

D.Use a multi-model endpoint on a GPU instance with Auto Scaling based on invocation count.

AnswerD

Multi-model endpoints share instances across models, and Auto Scaling adjusts capacity for spikes.

Why this answer

Option D is correct because a multi-model endpoint on a GPU instance with Auto Scaling based on invocation count allows multiple models to share a single GPU, maximizing utilization and reducing cost. Auto Scaling based on invocation count dynamically adjusts the number of instances to handle traffic spikes while maintaining low latency, as it scales out quickly when the invocation count exceeds a threshold.
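For illustration, the request below shows how a caller selects one of the models hosted on a multi-model endpoint. It is a sketch under assumptions: the endpoint name, artifact name, JSON payload shape, and the helper names `build_invoke_request` and `invoke` are hypothetical. `invoke_endpoint` and its `TargetModel` parameter are the real SageMaker runtime API.

```python
import json
from typing import Any, Dict


def build_invoke_request(endpoint_name: str, target_model: str,
                         payload: Dict[str, Any]) -> Dict[str, Any]:
    """Build kwargs for invoke_endpoint against a multi-model endpoint.
    TargetModel names the model artifact to load and run."""
    return {
        "EndpointName": endpoint_name,
        "TargetModel": target_model,
        "ContentType": "application/json",
        "Body": json.dumps(payload).encode("utf-8"),
    }


def invoke(request: Dict[str, Any]) -> bytes:
    """Send the request. Needs boto3 and AWS credentials,
    so it is not called in this sketch."""
    import boto3

    runtime = boto3.client("sagemaker-runtime")
    response = runtime.invoke_endpoint(**request)
    return response["Body"].read()
```

The first request for a given model may be slower while the artifact is loaded onto the instance, which is worth keeping in mind when latency during spikes matters.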

Exam trap

The trap here is that candidates assume serverless endpoints support GPU acceleration, but SageMaker serverless endpoints are CPU-only, making Option B invalid despite its cost-saving appeal.

How to eliminate wrong answers

Option A is wrong because a single large GPU instance does not scale to handle sudden spikes; provisioned concurrency is a Serverless Inference setting that keeps serverless capacity warm, so it does not apply to an instance-based GPU endpoint, and without Auto Scaling no instances are added during a spike, leading to latency increases or throttling. Option B is wrong because serverless endpoints with GPU support are not available in SageMaker; serverless endpoints run on CPU only, so they cannot meet the GPU acceleration requirement. Option C is wrong because using a single GPU instance in multiple Availability Zones with an Application Load Balancer does not provide horizontal scaling; it only adds redundancy across zones, and a single instance cannot handle spikes in traffic without Auto Scaling to add more instances.
