MLA-C01 Deployment and Orchestration of ML Workflows — 20 Questions

Question 1

A data science team has trained a PyTorch model using Amazon SageMaker and wants to deploy it with a custom inference container that includes a pre-processing step. The team needs to minimize latency and ensure the pre-processing runs only once per request. Which SageMaker real-time inference option should they use?

Accepted Answer

Create a SageMaker inference pipeline with two containers: one for pre-processing and one for inference. Option D is correct because a SageMaker inference pipeline chains two containers behind a single endpoint: the first container handles pre-processing and the second runs inference. Pre-processing therefore runs exactly once per request, and latency stays low because both containers run on the same instance, so the intermediate result never leaves the host.

Answer

Deploy the model on a multi-model endpoint and include pre-processing in the model code.

Answer

Use a batch transform job with a pre-processing script.

Answer

Package pre-processing and inference in a single container with a custom entry point.
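The accepted answer to Question 1 could be sketched as a CreateModel request that lists two containers to be run in sequence. This is only an illustration under stated assumptions: the helper name `build_pipeline_model_request`, the image URIs, the S3 path, and the role ARN are hypothetical placeholders, and the dictionary keys follow the SageMaker CreateModel API as exposed by boto3's `create_model`. No AWS call is made.

```python
def build_pipeline_model_request(model_name, preprocess_image, inference_image,
                                 model_data_url, role_arn):
    """Build keyword arguments for boto3's sagemaker create_model (no AWS call here)."""
    return {
        "ModelName": model_name,
        "ExecutionRoleArn": role_arn,
        # With Mode=Serial, containers are invoked in list order for each request.
        "Containers": [
            {"Image": preprocess_image},                                  # step 1: pre-processing
            {"Image": inference_image, "ModelDataUrl": model_data_url},   # step 2: inference
        ],
        "InferenceExecutionConfig": {"Mode": "Serial"},
    }


# Hypothetical placeholder values, for illustration only.
request = build_pipeline_model_request(
    model_name="example-pipeline-model",
    preprocess_image="111122223333.dkr.ecr.us-east-1.amazonaws.com/example-preprocess:latest",
    inference_image="111122223333.dkr.ecr.us-east-1.amazonaws.com/example-pytorch:latest",
    model_data_url="s3://example-bucket/model/model.tar.gz",
    role_arn="arn:aws:iam::111122223333:role/ExampleSageMakerRole",
)
```

The resulting dictionary could then be passed as `sagemaker_client.create_model(**request)` before creating the endpoint configuration and endpoint.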

Question 2

A company is deploying a real-time inference endpoint for a natural language processing model using Amazon SageMaker. The model requires GPU acceleration and must handle variable traffic patterns, including sudden spikes. The team wants to minimize costs while maintaining low latency during spikes. Which endpoint configuration strategy should they use?

Accepted Answer

Use a multi-model endpoint on a GPU instance with Auto Scaling based on invocation count. Option D is correct because a multi-model endpoint on a GPU instance lets multiple models share a single GPU, which raises utilization and lowers cost. Auto Scaling based on invocation count adds instances when the invocation count exceeds the configured threshold, so the endpoint absorbs traffic spikes while keeping latency low.

Answer

Use a single large GPU instance with provisioned concurrency.

Answer

Use a serverless endpoint with GPU support.

Answer

Use a single GPU instance in multiple Availability Zones with an Application Load Balancer.

Question 3

A machine learning engineer is deploying a model using AWS Lambda for inference. The model is a small scikit-learn classifier with a size of 50 MB. The Lambda function is invoked by an API Gateway REST API. The engineer notices that cold starts are causing high latency. Which action would most effectively reduce cold start latency without increasing costs significantly?

Accepted Answer

Configure provisioned concurrency for the Lambda function. Option C is correct because provisioned concurrency pre-initializes Lambda execution environments and keeps them warm, so requests served by those environments skip the cold start. This reduces latency directly, without the ongoing cost of a larger memory allocation or the complexity of managing EFS or container images.

Answer

Store the model in Amazon EFS and load it at runtime.

Answer

Increase the Lambda function memory to the maximum of 10,240 MB.

Answer

Package the model in a container image and deploy using Lambda container support.
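As a rough illustration of the accepted answer to Question 3, the request below mirrors the parameters of boto3's Lambda `put_provisioned_concurrency_config`. The helper name, function name, alias, and concurrency value are hypothetical. Provisioned concurrency is configured on a published version or alias, not on `$LATEST`, which the sketch checks before building the request.

```python
def provisioned_concurrency_request(function_name, qualifier, executions):
    """Build keyword arguments for lambda put_provisioned_concurrency_config (no AWS call here)."""
    if qualifier == "$LATEST":
        # Provisioned concurrency applies to a published version or an alias.
        raise ValueError("qualifier must be a published version or alias, not $LATEST")
    if executions < 1:
        raise ValueError("executions must be at least 1")
    return {
        "FunctionName": function_name,
        "Qualifier": qualifier,
        "ProvisionedConcurrentExecutions": executions,
    }


# Hypothetical function name and alias.
pc_request = provisioned_concurrency_request("example-sklearn-inference", "prod", 2)
```

The value chosen for `executions` is a cost and latency trade-off: each pre-initialized environment is billed while it is configured, so it is usually sized to the expected steady concurrency rather than the peak.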

Question 4

A company uses Amazon SageMaker to train and deploy machine learning models. The security team requires that all data in transit between the training job and S3 be encrypted, and that no data traverses the public internet. Which configuration should the company use?

Accepted Answer

Create a VPC with S3 VPC endpoints, attach a VPC-only policy to the SageMaker execution role, and enable KMS encryption for training jobs. Option A is correct because it keeps data moving between SageMaker and S3 on the AWS network and encrypted. With S3 VPC endpoints, traffic is routed privately and never traverses the public internet. Attaching a VPC-only policy to the SageMaker execution role restricts the training job to the VPC endpoints. Data is encrypted in transit by TLS, and enabling KMS encryption for the training job encrypts it at rest.

Answer

Use an S3 bucket with SSE-S3 encryption and restrict bucket access to a VPC.

Answer

Enable default encryption on the S3 bucket and use HTTPS for all SageMaker endpoints.

Answer

Create a VPC with a NAT gateway, and configure SageMaker to use the VPC and enforce HTTPS.

Question 5

A team is deploying a deep learning model on a SageMaker real-time endpoint. The model has high memory requirements, and the team wants to minimize instance cost while ensuring the endpoint can handle up to 10 concurrent requests. They plan to use a single ml.p3.2xlarge instance (8 vCPUs, 61 GB memory). Which SageMaker endpoint configuration will allow the endpoint to handle 10 concurrent requests without errors?

Accepted Answer

Set the initial instance count to 1 and configure the container to use multiple ModelServerWorkers. Option B is correct because ModelServerWorkers let a single container handle multiple inference requests concurrently by running multiple worker processes. On the 8 vCPUs of an ml.p3.2xlarge, configuring multiple workers (for example, 8) lets one instance accept up to 10 concurrent requests without errors: each worker handles one request at a time, and any request beyond the worker count waits for the next free worker. This minimizes cost by using a single instance while meeting the concurrency requirement.

Answer

Disable ModelServerWorkers to reduce overhead.

Answer

Set the initial variant weight to 10.

Answer

Set the initial instance count to 10 in the production variant.
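The worker reasoning in the accepted answer to Question 5 can be made concrete with a small sketch. It assumes a container built on the SageMaker inference toolkit, which reads the `SAGEMAKER_MODEL_SERVER_WORKERS` environment variable, and it assumes (as the explanation does) that each worker serves one request at a time. The helper names, image URI, and S3 path are hypothetical.

```python
import math


def container_with_workers(image, model_data_url, workers):
    """Container definition that requests a given number of model server workers."""
    return {
        "Image": image,
        "ModelDataUrl": model_data_url,
        # Environment values must be strings.
        "Environment": {"SAGEMAKER_MODEL_SERVER_WORKERS": str(workers)},
    }


def rounds_to_drain(concurrent_requests, workers):
    """Number of sequential rounds needed if each worker serves one request at a time."""
    return math.ceil(concurrent_requests / workers)


# Hypothetical placeholder values.
container = container_with_workers(
    "111122223333.dkr.ecr.us-east-1.amazonaws.com/example-dl-model:latest",
    "s3://example-bucket/model/model.tar.gz",
    workers=8,
)
rounds_with_8 = rounds_to_drain(10, 8)    # two requests wait for a free worker
rounds_with_10 = rounds_to_drain(10, 10)  # every request gets a worker immediately
```

Whether more workers actually fit depends on the model's memory footprint per worker, which the question does not state, so the worker count here is illustrative only.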

Question 6

A company wants to deploy a machine learning model that was trained on-premises using TensorFlow. The model is a TensorFlow SavedModel. The company uses AWS and wants to minimize operational overhead. Which deployment option meets these requirements?

Accepted Answer

Deploy the model using Amazon SageMaker with a TensorFlow inference container. Amazon SageMaker provides a fully managed TensorFlow inference container that directly supports the TensorFlow SavedModel format, enabling deployment without any custom infrastructure management. This minimizes operational overhead compared to self-managed options like ECS or Lambda, as SageMaker handles scaling, load balancing, and model updates automatically.

Answer

Deploy the model on Amazon ECS using a custom Docker image.

Answer

Deploy the model as an AWS Lambda function with the TensorFlow runtime.

Answer

Deploy the model using Amazon SageMaker Studio.

Question 7

A team is using AWS Step Functions to orchestrate a machine learning workflow that includes data preprocessing, training, and model evaluation. The team wants to run the workflow whenever new data arrives in an S3 bucket. Which approach should they use to trigger the Step Functions workflow?

Accepted Answer

Configure the S3 bucket to send events to Amazon EventBridge, and create an EventBridge rule that targets the Step Functions state machine. Option D is correct because Amazon S3 can send event notifications directly to Amazon EventBridge, and EventBridge rules can use an AWS Step Functions state machine as a target. This provides a fully managed, serverless integration that starts the Step Functions workflow automatically whenever new data arrives in the S3 bucket, without intermediate polling or custom code.

Answer

Configure the S3 bucket to send an event notification directly to the Step Functions state machine.

Answer

Use S3 event notifications to send a message to an Amazon SQS queue, and have a Lambda function poll the queue to start the execution.

Answer

Use a CloudWatch Logs metric filter to trigger the Step Functions execution.
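For Question 7, the rule in the accepted answer could look roughly like the sketch below. The event pattern uses the `aws.s3` source and the `Object Created` detail type that S3 emits to EventBridge; the bucket name, ARNs, target ID, and helper names are hypothetical. The bucket also needs EventBridge notifications turned on for these events to be delivered.

```python
import json


def s3_object_created_pattern(bucket_name):
    """EventBridge event pattern matching new objects in one bucket."""
    return {
        "source": ["aws.s3"],
        "detail-type": ["Object Created"],
        "detail": {"bucket": {"name": [bucket_name]}},
    }


def step_functions_target(state_machine_arn, role_arn, target_id="start-ml-workflow"):
    """Target entry for events put_targets; RoleArn lets EventBridge start executions."""
    return {"Id": target_id, "Arn": state_machine_arn, "RoleArn": role_arn}


# Hypothetical placeholder values.
pattern_json = json.dumps(s3_object_created_pattern("example-training-data"))
target = step_functions_target(
    "arn:aws:states:us-east-1:111122223333:stateMachine:ExampleMlWorkflow",
    "arn:aws:iam::111122223333:role/ExampleEventBridgeInvokeRole",
)
```

With boto3, `pattern_json` would be passed as `EventPattern` to `events_client.put_rule(...)` and `target` inside the `Targets` list of `events_client.put_targets(...)`.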

Question 8

A company is deploying a machine learning model using Amazon SageMaker. The model is a large deep learning model that requires GPU for inference. The company expects unpredictable traffic patterns with occasional bursts. They want to minimize cost while ensuring low latency during bursts. Which TWO actions should they take? (Select TWO.)

Accepted Answer

Use a multi-model endpoint with a mix of CPU and GPU instances to handle variable traffic. Option B is correct because a multi-model endpoint with a mix of CPU and GPU instances allows the company to host multiple models on the same endpoint, reducing cost by sharing underlying instances. By including GPU instances, the endpoint can handle the GPU-intensive deep learning inference for the large model, while the CPU instances can serve lighter loads or fallback traffic, ensuring low latency during unpredictable bursts without over-provisioning.

Answer

Use a serverless endpoint configuration to automatically scale.

Answer

Use Spot instances for the endpoint to reduce cost.

Answer

Provision multiple on-demand GPU instances behind a load balancer.

Question 9

An MLOps engineer is designing a CI/CD pipeline for deploying machine learning models to a production SageMaker endpoint. The pipeline should include automated testing, approval gates, and rollback capability. Which THREE components should be included in the pipeline? (Select THREE.)

Accepted Answer

A CloudFormation template to deploy the endpoint infrastructure, enabling rollback via stack update. Option B is correct because using a CloudFormation template to deploy the SageMaker endpoint infrastructure enables rollback via stack update. If a deployment fails, CloudFormation can automatically roll back the stack to the previous known good state, ensuring infrastructure consistency and reducing downtime.

Answer

A step to register the model in SageMaker Model Registry.

Answer

A step to run SageMaker Debugger to monitor training.

Question 10

A company is using Amazon SageMaker to deploy a model for real-time inference. The model requires access to a private S3 bucket that contains reference data. The company wants to ensure that the endpoint can access the S3 bucket without using a public internet connection. Which TWO actions should they take? (Select TWO.)

Accepted Answer

Attach the endpoint to a VPC that has a VPC endpoint for S3. Option B is correct because attaching the SageMaker endpoint to a VPC with a VPC endpoint for S3 (Gateway type) lets the endpoint reach the S3 bucket over the AWS private network, bypassing the public internet. Traffic stays on the AWS backbone, which meets the requirement for no public internet connection. Option C is also correct because the SageMaker execution role must have an IAM policy with s3:GetObject permission to authorize read access to the private S3 bucket, which is a prerequisite for reading objects from it.

Answer

Configure the endpoint's security group to allow outbound traffic to the S3 bucket's IP range.

Answer

Attach the endpoint to a VPC with an internet gateway and route the S3 traffic through the internet gateway.

Answer

Attach the endpoint to a VPC with a NAT gateway to route traffic to S3.
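The two correct actions described for Question 10 can be sketched as two small configuration fragments: an IAM policy document granting `s3:GetObject`, and the `VpcConfig` block that boto3's `create_model` accepts. The bucket name, subnet IDs, security group ID, and helper names are hypothetical, and the S3 gateway endpoint itself is assumed to already exist in the VPC.

```python
import json


def read_only_bucket_policy(bucket_name):
    """IAM policy document allowing object reads from a single bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": f"arn:aws:s3:::{bucket_name}/*",
            }
        ],
    }


def vpc_config(subnet_ids, security_group_ids):
    """VpcConfig block for create_model, placing the model containers in the VPC."""
    return {"Subnets": list(subnet_ids), "SecurityGroupIds": list(security_group_ids)}


# Hypothetical placeholder values.
policy_json = json.dumps(read_only_bucket_policy("example-reference-data"), indent=2)
model_vpc_config = vpc_config(["subnet-0example1", "subnet-0example2"], ["sg-0example"])
```

The policy would be attached to the SageMaker execution role, while `model_vpc_config` would be passed as `VpcConfig` when the model is created.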

Question 11

A data scientist is trying to create a SageMaker endpoint configuration with 6 instances of ml.c5.large for a production variant. The creation fails with the error shown in the exhibit. Which action should the data scientist take to resolve this issue?

Accepted Answer

Request a service quota increase for ml.c5.large for real-time endpoints from the AWS Service Quotas console. The error indicates that the requested number of instances exceeds the service quota for ml.c5.large for real-time endpoints. AWS enforces default limits on instance counts per instance type per region. Requesting a quota increase via the Service Quotas console is the correct action to raise the limit and allow the deployment of 6 instances.

Answer

Create two separate endpoint configurations, each with 3 instances, and distribute traffic between them.

Answer

Use a different instance type, such as ml.m5.large, which has a higher limit.

Answer

Delete unused endpoints to free up resources.

Question 12

A machine learning engineer has configured a SageMaker Model Monitor schedule for data quality monitoring as shown in the exhibit. The schedule is set to run hourly. However, the engineer notices that the monitoring jobs are not producing output in the specified S3 bucket. What is the most likely cause?

Accepted Answer

The DataAnalysisStartTime and DataAnalysisEndTime are set to a past date, so no data is analyzed. Option B is correct because the DataAnalysisStartTime and DataAnalysisEndTime parameters define the time window that SageMaker Model Monitor analyzes. When both are set to a past date, the monitoring job finds no new data within that window, so nothing is written to the S3 bucket. The schedule runs hourly, but the analysis window is fixed to a historical period, so each execution produces no results.

Answer

The output_path is incorrectly placed; it should be under the MonitoringOutputConfig.

Answer

The MonitoringType should be 'ModelQuality' to enable data quality monitoring.

Answer

The cron expression is incorrectly formatted for an hourly schedule.

Question 13

A data science team has trained a model using SageMaker and wants to deploy it for real-time inference with automatic scaling based on request latency. The deployment must handle unpredictable traffic spikes without manual intervention. Which combination of SageMaker features should the team use?

Accepted Answer

Create a SageMaker endpoint with an Application Auto Scaling target tracking policy based on the SageMakerVariantInvocationsPerInstance metric. Option A is correct because it uses a SageMaker endpoint with an Application Auto Scaling target tracking policy based on the SageMakerVariantInvocationsPerInstance metric. The metric reflects the load on each instance, which serves as a proxy for the pressure that drives request latency, so the endpoint adds or removes instances automatically as load changes. The target tracking policy adjusts capacity to hold the metric at a target value, handling unpredictable traffic spikes without manual intervention.

Answer

Deploy the model on a multi-model endpoint and manually adjust the number of instances via the AWS Management Console

Answer

Deploy the model on an Elastic Inference accelerator and use AWS Auto Scaling with a scheduled policy

Answer

Create a batch transform job with a scheduled Lambda function to trigger scaling
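The target tracking setup named in the accepted answer to Question 13 might be expressed as the two Application Auto Scaling requests below. The keys mirror boto3's `register_scalable_target` and `put_scaling_policy`; the endpoint name, variant name, capacity limits, target value, cooldowns, and helper name are hypothetical choices for illustration.

```python
def build_scaling_requests(endpoint_name, variant_name, min_instances, max_instances,
                           invocations_per_instance_target):
    """Build requests for application-autoscaling (no AWS call is made here)."""
    common = {
        "ServiceNamespace": "sagemaker",
        "ResourceId": f"endpoint/{endpoint_name}/variant/{variant_name}",
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
    }
    register_request = {**common, "MinCapacity": min_instances, "MaxCapacity": max_instances}
    policy_request = {
        **common,
        "PolicyName": f"{variant_name}-invocations-target-tracking",
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            # Target is invocations per instance per minute.
            "TargetValue": float(invocations_per_instance_target),
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
            },
            "ScaleOutCooldown": 60,   # seconds; illustrative value
            "ScaleInCooldown": 300,   # seconds; illustrative value
        },
    }
    return register_request, policy_request


# Hypothetical placeholder values.
register_request, policy_request = build_scaling_requests(
    "example-endpoint", "AllTraffic", min_instances=1, max_instances=4,
    invocations_per_instance_target=1000,
)
```

The first dictionary would go to `autoscaling_client.register_scalable_target(**register_request)` and the second to `autoscaling_client.put_scaling_policy(**policy_request)`.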

Question 14

A machine learning engineer is deploying a PyTorch model for real-time inference on SageMaker. The model requires GPU for low-latency predictions. The deployment fails with the error: 'The primary container does not support the requested instance type.' The instance type is ml.p3.2xlarge. Which action should the engineer take to resolve the issue?

Accepted Answer

Verify that the PyTorch framework version specified in the SageMaker estimator matches a version that supports GPU instances. Option C is correct because the error 'The primary container does not support the requested instance type' typically occurs when the specified PyTorch framework version in the SageMaker estimator does not include GPU support for the chosen instance type (ml.p3.2xlarge). SageMaker's prebuilt PyTorch containers are version-specific and only certain versions are compiled with CUDA and GPU libraries; using a version that lacks GPU support causes the container to reject GPU instance types. Verifying and selecting a PyTorch version that explicitly supports GPU instances resolves the mismatch.

Answer

Use SageMaker Neo to compile the model for the target instance type

Answer

Request a service quota increase for the ml.p3.2xlarge instance type

Answer

Create a custom inference container and use it with the SageMaker model

Question 15

A company is using SageMaker Pipelines to automate a multi-step ML workflow. The pipeline includes data preprocessing, training, and model evaluation. The team wants to ensure that if the evaluation step fails, the pipeline stops and sends an alert to the operations team. Which SageMaker Pipelines feature should they use?

Accepted Answer

Use a Condition step to check the evaluation result and route to a Fail step if the result indicates failure. Option D is correct because SageMaker Pipelines provides a built-in Condition step that evaluates a boolean expression (e.g., checking if evaluation metrics meet a threshold) and then routes execution to different steps. If the condition fails, you can direct the pipeline to a Fail step, which immediately stops the pipeline and marks it as failed. This is the native, event-driven way to halt a pipeline based on step output without relying on external services.

Answer

Configure an Amazon CloudWatch Events rule to monitor the pipeline execution status and stop it if the evaluation step fails

Answer

Register the model in the Model Registry only if evaluation passes, and configure the pipeline to stop if registration fails

Answer

Add a Lambda step after the evaluation step that checks the evaluation metrics and sends an SNS notification if the metrics are below a threshold

Question 16

A team is deploying a machine learning model using Amazon SageMaker. They need to serve predictions with sub-100ms latency for a real-time application. The model is a large ensemble that requires 4 GB of memory. The team expects traffic of 100 requests per second initially, but it may double during peak hours. Which instance type and deployment configuration should the team choose to minimize cost while meeting the latency requirement?

Accepted Answer

Deploy on one ml.c5.large instance with an Application Auto Scaling target tracking policy based on memory utilization. Option A is correct because the ml.c5.large instance provides 4 GB of memory, which meets the model's requirement, and its compute-optimized nature ensures low-latency inference. Using Application Auto Scaling with a target tracking policy based on memory utilization allows the endpoint to scale out during traffic spikes (up to 200 requests per second) while minimizing cost by running a single instance during normal load.

Answer

Deploy on one ml.t2.medium instance with an Application Auto Scaling target tracking policy based on CPU utilization

Answer

Deploy on one ml.p3.2xlarge instance with provisioned concurrency

Answer

Deploy on two ml.m5.large instances behind a load balancer with manual scaling
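The traffic figures in Question 16 (100 requests per second, doubling at peak) lend themselves to a short capacity calculation. The sketch uses the sizing rule from the SageMaker auto scaling guidance, target invocations per instance per minute = max RPS per instance x safety factor x 60. The per-instance throughput of 150 requests per second and the safety factor of 0.5 are assumptions for illustration; the question gives no load-test result.

```python
import math


def invocations_target_per_minute(max_rps_per_instance, safety_factor=0.5):
    """Target invocations per instance per minute: max RPS x safety factor x 60."""
    return max_rps_per_instance * safety_factor * 60


def instances_needed(traffic_rps, target_per_minute):
    """Instances required so that per-instance load stays at or below the target."""
    return math.ceil(traffic_rps * 60 / target_per_minute)


# Assumed load-test result (not from the question): one instance sustains 150 RPS.
target = invocations_target_per_minute(150)
normal_instances = instances_needed(100, target)  # baseline traffic from the question
peak_instances = instances_needed(200, target)    # doubled traffic at peak
```

Under these assumed numbers the endpoint would need more than one instance even at baseline, which shows how sensitive the answer is to the measured per-instance throughput.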

Question 17

A company has a SageMaker endpoint running a model that provides real-time recommendations. Recently, the model's accuracy has degraded due to data drift. The team wants to automatically retrain the model when a drift metric exceeds a threshold and deploy the new model without downtime. Which architecture should the team implement?

Accepted Answer

Use SageMaker Model Monitor to trigger an Amazon EventBridge event that starts a SageMaker Pipeline, which retrains the model, registers it in the Model Registry, and then updates the existing endpoint with a new production variant. Option B is correct because it uses SageMaker Model Monitor to detect data drift and emit an EventBridge event, which triggers a SageMaker Pipeline to retrain the model, register it in the Model Registry, and then update the existing endpoint with a new production variant. This architecture enables automatic retraining and zero-downtime deployment by leveraging the endpoint's production variants for a blue/green deployment.

Answer

Use SageMaker Model Monitor to collect drift metrics, and have a data scientist manually analyze the metrics and trigger retraining via the SageMaker console

Answer

Schedule a daily SageMaker Pipeline that retrains the model and deploys it using a new endpoint, then updates the application to point to the new endpoint

Answer

Use SageMaker Model Monitor to publish drift metrics to Amazon CloudWatch, and create a CloudWatch alarm that triggers an AWS Lambda function to retrain and deploy the model

Question 18

A machine learning team needs to deploy a model that was built using scikit-learn. They want to use SageMaker for hosting. Which approach should they take?

Accepted Answer

Package the model artifacts and use the SageMaker built-in scikit-learn container for inference. Option D is correct because SageMaker provides a pre-built, optimized Docker container for scikit-learn that supports inference. By packaging the model artifacts (e.g., a joblib or pickle file) and deploying them using the built-in container, the team avoids the overhead of custom container creation while ensuring compatibility with SageMaker's hosting infrastructure, including automatic scaling and load balancing.

Answer

Create a Jupyter notebook that loads the model and runs predictions on the SageMaker notebook instance

Answer

Create a custom Docker container with scikit-learn and deploy it on SageMaker

Answer

Launch a SageMaker training job with the model and use the training instance as an endpoint

Question 19

Which TWO of the following are best practices for deploying machine learning models on SageMaker? (Select TWO.)

Accepted Answer

Use separate production and staging endpoints to test new models before full rollout. Option B and Option D are correct. Option A is wrong because model artifacts should be stored in Amazon S3, not on EBS volumes. Option C is wrong because the SageMaker Model Registry is available and should be used for versioning. Option E is wrong because CloudWatch Logs are enabled by default rather than disabled.

Answer

Store model artifacts in Amazon EBS volumes attached to the endpoint instances

Answer

Manually track model versions using tags because SageMaker Model Registry is not available for deployment

Answer

Disable CloudWatch Logs to reduce costs during inference

Question 20

A media company uses SageMaker to host a real-time video recommendation model. The model is deployed on a single ml.c5.xlarge endpoint. During a major live event, traffic surges to 10 times the normal load, and the endpoint becomes unresponsive, causing high latency and errors. The team had set up an Application Auto Scaling target tracking policy based on CPU utilization with a target of 70%. However, scaling did not trigger quickly enough. After the event, the team reviews CloudWatch metrics and notices that CPU utilization never exceeded 70% during the surge, but memory utilization peaked at 95%. The model is memory-bound. The team wants to ensure the endpoint scales automatically before performance degrades during future events. What should the team do?

Accepted Answer

Change the target tracking metric to memory utilization and set a target of 70%. Option A is correct because the model is memory-bound, and the current CPU-based target tracking policy failed to trigger scaling since CPU utilization never exceeded 70% during the surge. By switching to a memory utilization metric with a target of 70%, scaling will activate based on the actual resource constraint (memory), preventing performance degradation before the endpoint becomes unresponsive.

Answer

Increase the target CPU utilization to 90% so that scaling triggers at higher load

Answer

Change the endpoint instance type to ml.c5.4xlarge to provide more memory per instance

Answer

Create a scheduled scaling policy to add instances during the known event time
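Memory utilization is not one of the predefined SageMaker scaling metrics, so the accepted answer to Question 20 would rely on a customized metric specification. The sketch below assumes the `MemoryUtilization` metric that SageMaker publishes to the `/aws/sagemaker/Endpoints` CloudWatch namespace; the endpoint name, variant name, and helper name are hypothetical, and the 70% target comes from the answer text.

```python
def memory_target_tracking_config(endpoint_name, variant_name, target_percent=70.0):
    """TargetTrackingScalingPolicyConfiguration that tracks endpoint memory utilization."""
    return {
        "TargetValue": float(target_percent),
        "CustomizedMetricSpecification": {
            "MetricName": "MemoryUtilization",
            "Namespace": "/aws/sagemaker/Endpoints",
            "Dimensions": [
                {"Name": "EndpointName", "Value": endpoint_name},
                {"Name": "VariantName", "Value": variant_name},
            ],
            "Statistic": "Average",
            "Unit": "Percent",
        },
    }


# Hypothetical placeholder values.
memory_policy_config = memory_target_tracking_config("example-recsys-endpoint", "AllTraffic")
```

This dictionary would be supplied as `TargetTrackingScalingPolicyConfiguration` in a `put_scaling_policy` call with `PolicyType="TargetTrackingScaling"`.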