MLA-C01 Deployment and Orchestration of ML Workflows Practice Test 2 — 15 Questions

Question 1

A team is deploying a machine learning model using Amazon SageMaker. They need to serve predictions with sub-100ms latency for a real-time application. The model is a large ensemble that requires 4 GB of memory. The team expects traffic of 100 requests per second initially, but it may double during peak hours. Which instance type and deployment configuration should the team choose to minimize cost while meeting the latency requirement?

Accepted Answer

Deploy on one ml.c5.large instance with an Application Auto Scaling target tracking policy based on memory utilization. Option A is correct because the ml.c5.large instance provides 4 GiB of memory, which matches the model's requirement, and its compute-optimized design supports low-latency inference. An Application Auto Scaling target tracking policy based on memory utilization lets the endpoint scale out during traffic spikes (up to 200 requests per second) while minimizing cost by running a single instance under normal load. Because memory utilization is not a predefined target tracking metric for SageMaker endpoints, the policy has to reference the endpoint's MemoryUtilization CloudWatch metric through a customized metric specification.

Answer

Deploy on one ml.t2.medium instance with an Application Auto Scaling target tracking policy based on CPU utilization

Answer

Deploy on one ml.p3.2xlarge instance with provisioned concurrency

Answer

Deploy on two ml.m5.large instances behind a load balancer with manual scaling

Question 2

A company has a SageMaker endpoint running a model that provides real-time recommendations. Recently, the model's accuracy has degraded due to data drift. The team wants to automatically retrain the model when a drift metric exceeds a threshold and deploy the new model without downtime. Which architecture should the team implement?

Accepted Answer

Use SageMaker Model Monitor to trigger an Amazon EventBridge event that starts a SageMaker Pipeline, which retrains the model, registers it in the Model Registry, and then updates the existing endpoint with a new production variant. Option B is correct because Model Monitor detects the drift and the resulting EventBridge event starts the pipeline without manual intervention. The pipeline retrains the model, registers it in the Model Registry, and updates the existing endpoint in place. Updating the existing endpoint gives zero-downtime deployment because SageMaker provisions the new capacity and shifts traffic to it in a blue/green fashion before retiring the old capacity.
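The blue/green behavior can be made explicit with a deployment configuration passed to the UpdateEndpoint call. The following is a minimal sketch that only builds the request structure and does not call AWS; the alarm name, canary size, and timing values are illustrative assumptions, not values from the question.

```python
import json


def build_deployment_config(alarm_name: str, canary_percent: int = 10) -> dict:
    """Build a DeploymentConfig structure for a SageMaker UpdateEndpoint request."""
    return {
        "BlueGreenUpdatePolicy": {
            "TrafficRoutingConfiguration": {
                # Shift a small share of capacity first, then the rest.
                "Type": "CANARY",
                "CanarySize": {"Type": "CAPACITY_PERCENT", "Value": canary_percent},
                "WaitIntervalInSeconds": 300,
            },
            # Keep the old fleet around briefly in case a rollback is needed.
            "TerminationWaitInSeconds": 120,
            "MaximumExecutionTimeoutInSeconds": 1800,
        },
        # Roll back automatically if this (hypothetical) alarm fires.
        "AutoRollbackConfiguration": {"Alarms": [{"AlarmName": alarm_name}]},
    }


config = build_deployment_config("recsys-endpoint-5xx-alarm")
print(json.dumps(config, indent=2))
```

The resulting dictionary would be passed as the DeploymentConfig argument of the boto3 SageMaker client's update_endpoint call, alongside the endpoint name and the new endpoint configuration name.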

Answer

Use SageMaker Model Monitor to collect drift metrics, and have a data scientist manually analyze the metrics and trigger retraining via the SageMaker console

Answer

Schedule a daily SageMaker Pipeline that retrains the model and deploys it using a new endpoint, then updates the application to point to the new endpoint

Answer

Use SageMaker Model Monitor to publish drift metrics to Amazon CloudWatch, and create a CloudWatch alarm that triggers an AWS Lambda function to retrain and deploy the model

Question 3

A machine learning team needs to deploy a model that was built using scikit-learn. They want to use SageMaker for hosting. Which approach should they take?

Accepted Answer

Package the model artifacts and use the SageMaker built-in scikit-learn container for inference. Option D is correct because SageMaker provides a pre-built, optimized Docker container for scikit-learn that supports inference. By packaging the model artifacts (e.g., a joblib or pickle file) and deploying them using the built-in container, the team avoids the overhead of custom container creation while ensuring compatibility with SageMaker's hosting infrastructure, including automatic scaling and load balancing.
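SageMaker hosting expects model artifacts as a gzip-compressed tar archive (model.tar.gz) in Amazon S3. The packaging step can be sketched as follows, using only the standard library; the file name model.pkl and the placeholder object standing in for a fitted estimator are assumptions for illustration.

```python
import os
import pickle
import tarfile
import tempfile


def package_model(model_file: str, archive_path: str) -> str:
    """Write model_file into a gzip tar archive at the archive root."""
    with tarfile.open(archive_path, "w:gz") as tar:
        # arcname drops the directory part so the file sits at the root.
        tar.add(model_file, arcname=os.path.basename(model_file))
    return archive_path


workdir = tempfile.mkdtemp()
model_file = os.path.join(workdir, "model.pkl")
with open(model_file, "wb") as f:
    # Placeholder object; a real workflow would serialize the fitted estimator.
    pickle.dump({"coef": [0.1, 0.2]}, f)

archive = package_model(model_file, os.path.join(workdir, "model.tar.gz"))
with tarfile.open(archive, "r:gz") as tar:
    members = tar.getnames()
print(members)
```

The archive would then be uploaded to S3 and referenced as the model data location when the model is created with the scikit-learn container.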

Answer

Create a Jupyter notebook that loads the model and runs predictions on the SageMaker notebook instance

Answer

Create a custom Docker container with scikit-learn and deploy it on SageMaker

Answer

Launch a SageMaker training job with the model and use the training instance as an endpoint

Question 4

Which TWO of the following are best practices for deploying machine learning models on SageMaker? (Select TWO.)

Accepted Answer

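Use separate production and staging endpoints to test new models before full rollout. Options B and D are correct. Option A (storing model artifacts on EBS volumes) is wrong because SageMaker hosting loads model artifacts from Amazon S3. Option C (tracking versions manually with tags) is wrong because the SageMaker Model Registry is available for versioning and deployment. Option E (disabling CloudWatch Logs) is wrong because endpoint container logs go to CloudWatch Logs by default and are needed to troubleshoot inference problems.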

Answer

Store model artifacts in Amazon EBS volumes attached to the endpoint instances

Answer

Manually track model versions using tags because SageMaker Model Registry is not available for deployment

Answer

Disable CloudWatch Logs to reduce costs during inference

Question 5

A media company uses SageMaker to host a real-time video recommendation model. The model is deployed on a single ml.c5.xlarge endpoint. During a major live event, traffic surges to 10 times the normal load, and the endpoint becomes unresponsive, causing high latency and errors. The team had set up an Application Auto Scaling target tracking policy based on CPU utilization with a target of 70%. However, scaling did not trigger quickly enough. After the event, the team reviews CloudWatch metrics and notices that CPU utilization never exceeded 70% during the surge, but memory utilization peaked at 95%. The model is memory-bound. The team wants to ensure the endpoint scales automatically before performance degrades during future events. What should the team do?

Accepted Answer

Change the target tracking metric to memory utilization and set a target of 70%. Option A is correct because the model is memory-bound: the CPU-based policy never triggered scaling, since CPU utilization never exceeded its 70% target during the surge, while memory utilization peaked at 95%. Tracking memory utilization with a 70% target makes scaling respond to the actual resource constraint, so instances are added before the endpoint becomes unresponsive. Memory utilization is not a predefined target tracking metric for SageMaker endpoints, so the policy needs a customized metric specification for the endpoint's MemoryUtilization CloudWatch metric.
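A memory-based target tracking policy can be sketched as the request structure for Application Auto Scaling's PutScalingPolicy operation. This is a minimal sketch that builds the arguments only; the endpoint name, variant name, and cooldown values are hypothetical.

```python
def memory_target_tracking_policy(
    endpoint_name: str, variant_name: str, target_percent: float = 70.0
) -> dict:
    """Build PutScalingPolicy arguments that track endpoint memory utilization."""
    return {
        "PolicyName": f"{endpoint_name}-memory-target-tracking",
        "ServiceNamespace": "sagemaker",
        "ResourceId": f"endpoint/{endpoint_name}/variant/{variant_name}",
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            "TargetValue": target_percent,
            # Memory is not a predefined metric, so a customized one is used.
            "CustomizedMetricSpecification": {
                "MetricName": "MemoryUtilization",
                "Namespace": "/aws/sagemaker/Endpoints",
                "Dimensions": [
                    {"Name": "EndpointName", "Value": endpoint_name},
                    {"Name": "VariantName", "Value": variant_name},
                ],
                "Statistic": "Average",
                "Unit": "Percent",
            },
            "ScaleOutCooldown": 60,   # illustrative values, in seconds
            "ScaleInCooldown": 300,
        },
    }


policy = memory_target_tracking_policy("video-recs", "AllTraffic")
print(policy["ResourceId"])
```

The dictionary would be unpacked into the boto3 application-autoscaling client's put_scaling_policy call after the variant has been registered as a scalable target.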

Answer

Increase the target CPU utilization to 90% so that scaling triggers at higher load

Answer

Change the endpoint instance type to ml.c5.4xlarge to provide more memory per instance

Answer

Create a scheduled scaling policy to add instances during the known event time

Question 6

A financial services company has a SageMaker pipeline that trains a fraud detection model daily. The pipeline consists of three steps: preprocessing (using a Spark script), training (XGBoost), and evaluation. The evaluation step calculates the F1 score and compares it to a threshold of 0.95. If the F1 score is below 0.95, the pipeline should fail and notify the team via email. The team implemented this using a Condition step that checks if the F1 score is greater than or equal to 0.95. If true, the pipeline proceeds to register the model; if false, the pipeline fails. However, the team notices that even when the F1 score is 0.94, the pipeline continues to the registration step. The evaluation script outputs the F1 score as a float with two decimal places in a JSON file. The Condition step uses the expression: $.evaluation.metrics.f1_score >= 0.95. What is the most likely cause of the issue?

Accepted Answer

The evaluation script outputs the F1 score as a string, so the Condition step does not perform the numeric comparison the team intended. The most likely cause is that the evaluation script writes the F1 score as a string (e.g., "0.94") rather than as a JSON number. The Condition step reads the value from the evaluation report through a JSONPath expression and compares it with the numeric threshold 0.95; when the left operand is a string, the comparison is no longer a float-to-float comparison and its result cannot be relied on. A purely lexicographic comparison would still place "0.94" before "0.95", since the two strings first differ at '4' versus '5', so the type mismatch, not character ordering, is the plausible culprit. Writing the score as a number (for example 0.94, not "0.94") makes the condition evaluate as expected.
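A short Python check shows how string and numeric comparisons of these two values behave, and how an evaluation script can guard against emitting a string. This is only an illustration in Python under the assumption of a report shaped like {"metrics": {"f1_score": ...}}; it does not reproduce how the Condition step itself evaluates its operands.

```python
import json

# Two hypothetical evaluation reports: one with a string score, one numeric.
string_report = json.loads('{"metrics": {"f1_score": "0.94"}}')
number_report = json.loads('{"metrics": {"f1_score": 0.94}}')


def passes_threshold(report: dict, threshold: float = 0.95) -> bool:
    """Compare the F1 score numerically, rejecting non-numeric values."""
    value = report["metrics"]["f1_score"]
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        raise TypeError(f"f1_score must be a JSON number, got {type(value).__name__}")
    return value >= threshold


# Character-by-character comparison: the strings first differ at '4' vs '5'.
lexicographic_result = "0.94" >= "0.95"
numeric_result = passes_threshold(number_report)

print(lexicographic_result)  # False
print(numeric_result)        # False
```

Calling passes_threshold(string_report) raises TypeError, which surfaces the type problem in the evaluation step instead of letting it reach the condition.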

Answer

The evaluation step must be split into two steps: one for evaluation and one for condition check

Answer

The Condition step cannot be used to check metric values; it can only check step status

Answer

The threshold should be set to 0.95 but the Condition step uses a less than or equal operator

Question 7

A data science team needs to deploy a PyTorch model for real-time inference with low latency. The model requires GPU acceleration. Which SageMaker endpoint configuration should they use?

Accepted Answer

Create a real-time endpoint using an ml.p3.2xlarge instance. Option D is correct because real-time SageMaker endpoints with GPU instances like ml.p3.2xlarge are specifically designed for low-latency, synchronous inference with GPU acceleration. PyTorch models requiring GPU must use instance types that support NVIDIA CUDA, and the ml.p3 family provides the necessary GPU compute for real-time predictions.

Answer

Create a multi-model endpoint using ml.m5.large instances

Answer

Create a serverless endpoint with memory set to 6144 MB

Answer

Create a batch transform job using an ml.c5.xlarge instance

Question 8

A team is using SageMaker Pipelines to automate retraining and deployment. They want to trigger the pipeline automatically when new training data is available in an S3 bucket. Which approach should they use?

Accepted Answer

Create an Amazon EventBridge rule that triggers the pipeline execution on S3 PutObject events. Option A is correct because Amazon EventBridge can match S3 object-creation events (including those produced by PutObject) and start a SageMaker Pipeline execution as a rule target. For S3 to deliver these events, Amazon EventBridge notifications are typically enabled on the bucket. This provides a fully event-driven, serverless integration without polling or manual intervention, aligning with best practices for automating ML workflows when new data arrives.
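The rule and its target can be sketched as plain data structures. The bucket name, key prefix, ARNs, and the pipeline parameter name InputDataUri below are hypothetical; the sketch builds the event pattern and target only and does not call AWS.

```python
import json


def s3_object_created_pattern(bucket: str, prefix: str) -> dict:
    """Event pattern matching new objects under a prefix in one bucket."""
    return {
        "source": ["aws.s3"],
        "detail-type": ["Object Created"],
        "detail": {
            "bucket": {"name": [bucket]},
            "object": {"key": [{"prefix": prefix}]},
        },
    }


def pipeline_target(pipeline_arn: str, role_arn: str, input_uri: str) -> dict:
    """Rule target that starts a SageMaker Pipeline execution."""
    return {
        "Id": "start-retraining-pipeline",
        "Arn": pipeline_arn,
        "RoleArn": role_arn,  # role EventBridge assumes to start the pipeline
        "SageMakerPipelineParameters": {
            "PipelineParameterList": [{"Name": "InputDataUri", "Value": input_uri}]
        },
    }


pattern = s3_object_created_pattern("example-training-data", "incoming/")
target = pipeline_target(
    "arn:aws:sagemaker:us-east-1:123456789012:pipeline/retrain",
    "arn:aws:iam::123456789012:role/eventbridge-start-pipeline",
    "s3://example-training-data/incoming/",
)
print(json.dumps(pattern))
```

With boto3, the pattern would be serialized with json.dumps for the events client's put_rule call, and the target would be passed in the Targets list of put_targets.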

Answer

Register the pipeline as a model package in SageMaker Model Registry

Answer

Configure a cron job to run the pipeline every hour

Answer

Use AWS Step Functions to poll the S3 bucket and start the pipeline when a new object appears

Question 9

A financial services company deploys a fraud detection model on a SageMaker real-time endpoint. The inference logic includes a pre-processing step that requires access to a DynamoDB table for user metadata. The model container is a custom Docker image. How should the team grant the endpoint access to DynamoDB?

Accepted Answer

Create an IAM role with DynamoDB read access and assign it to the SageMaker endpoint as the execution role. Option C is correct because SageMaker hosting runs the container under an IAM execution role, which is specified when the SageMaker model behind the endpoint is created. This role defines the permissions the container has when it makes AWS API calls, such as reading from DynamoDB. With a DynamoDB read policy attached to the execution role, the container obtains temporary credentials through AWS STS, so no long-term credentials have to be hardcoded or managed.
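The execution role needs two policy documents: a trust policy that lets SageMaker assume the role, and a permissions policy scoped to the table. A minimal sketch follows; the account ID, region, and table name UserMetadata are hypothetical, and the listed read actions are one reasonable least-privilege choice rather than a required set.

```python
import json


def sagemaker_trust_policy() -> dict:
    """Trust policy allowing the SageMaker service to assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "sagemaker.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }
        ],
    }


def dynamodb_read_policy(table_arn: str) -> dict:
    """Read-only permissions limited to a single DynamoDB table."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["dynamodb:GetItem", "dynamodb:BatchGetItem", "dynamodb:Query"],
                "Resource": table_arn,
            }
        ],
    }


table_arn = "arn:aws:dynamodb:us-east-1:123456789012:table/UserMetadata"
trust = sagemaker_trust_policy()
permissions = dynamodb_read_policy(table_arn)
print(json.dumps(permissions, indent=2))
```

The role's ARN is then supplied as the execution role when the model is created, and the container's AWS SDK picks up the temporary credentials automatically.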

Answer

Store IAM credentials in the container image as environment variables

Answer

Attach an IAM instance profile to the underlying EC2 instance

Answer

Retrieve temporary credentials from AWS Secrets Manager within the container code

Question 10

An ML team wants to deploy a model that was trained using XGBoost in SageMaker. They want to use the built-in XGBoost algorithm container for inference. Which inference option requires the least custom code?

Accepted Answer

Deploy to a real-time endpoint using the built-in XGBoost container. Option B is correct because the built-in XGBoost container in SageMaker is pre-configured with the XGBoost serving stack, including the necessary inference code and dependencies. Deploying a model trained with XGBoost to a real-time endpoint using this container requires no custom inference script or Docker image, only the model artifact and endpoint configuration. This minimizes custom code to just the SageMaker SDK calls for creating the model and endpoint.

Answer

Create a custom Docker container with XGBoost and deploy to an endpoint

Answer

Attach Elastic Inference to a generic container

Answer

Use SageMaker Python SDK to download the model and run local inference

Question 11

A company is deploying a large number of small models (each < 100 MB) for different customers. They want to minimize costs and management overhead while serving traffic that varies significantly. Which SageMaker endpoint type should they choose?

Accepted Answer

A multi-model endpoint on a GPU instance. A multi-model endpoint (MME) on a GPU instance is the best choice because it allows you to host multiple small models (< 100 MB each) on a single endpoint, sharing the underlying GPU instance to reduce costs. SageMaker MME dynamically loads and unloads models based on traffic, which minimizes management overhead and handles variable traffic patterns efficiently without provisioning separate endpoints per model.
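With a multi-model endpoint, the caller selects the model per request through the TargetModel parameter of InvokeEndpoint, which names an artifact under the endpoint's S3 model prefix. The sketch below builds the request arguments only; the endpoint name and the one-archive-per-customer naming scheme are assumptions for illustration.

```python
import json


def build_invoke_args(endpoint_name: str, customer_id: str, features: list) -> dict:
    """Build InvokeEndpoint arguments that route to one customer's model."""
    return {
        "EndpointName": endpoint_name,
        # Hypothetical convention: one archive per customer under the prefix.
        "TargetModel": f"customer-{customer_id}.tar.gz",
        "ContentType": "application/json",
        "Body": json.dumps({"instances": [features]}),
    }


args = build_invoke_args("shared-mme", "42", [0.3, 1.7, 5.0])
print(args["TargetModel"])
```

These arguments would be passed to the boto3 sagemaker-runtime client's invoke_endpoint call; the endpoint loads the named model on first use and keeps it in memory while it is being requested.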

Answer

A batch transform job

Answer

A multi-variant endpoint to route traffic to different model versions

Answer

A serverless endpoint

Question 12

During a blue/green deployment of a SageMaker endpoint, the team notices that traffic is not being fully shifted to the new variant after the update. The endpoint has two variants with equal initial weights (50% each). The team wants to shift 100% traffic to the new variant. What is the most likely cause?

Accepted Answer

The new variant's model container is failing health checks, so traffic is not routed to it. Option B is correct because SageMaker sends traffic only to instances whose containers pass health checks. If the new variant's model container fails health checks (e.g., due to a misconfigured inference script or incompatible dependencies), the variant never becomes healthy and receives no traffic, regardless of its weight setting. This explains why the shift to 100% on the new variant does not complete and requests continue to be served by the old variant.
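Variant weights are relative: each healthy variant receives its weight divided by the sum of all weights, so there is no per-variant maximum of 50. The arithmetic, and the structure used to request new weights, can be sketched as follows; the variant names blue and green are hypothetical.

```python
def traffic_shares(weights: dict) -> dict:
    """Return each variant's share of traffic as weight / sum of weights."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one variant needs a positive weight")
    return {name: weight / total for name, weight in weights.items()}


def desired_weights(weights: dict) -> list:
    """Build the DesiredWeightsAndCapacities list for a weight update."""
    return [
        {"VariantName": name, "DesiredWeight": float(weight)}
        for name, weight in weights.items()
    ]


before = traffic_shares({"blue": 1, "green": 1})
after = traffic_shares({"blue": 0, "green": 1})
print(before)  # {'blue': 0.5, 'green': 0.5}
print(after)   # {'blue': 0.0, 'green': 1.0}
```

The list from desired_weights would be passed to the boto3 SageMaker client's update_endpoint_weights_and_capacities call; the shift only takes effect for a variant that is healthy.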

Answer

The new variant is using a different instance type that is not supported in the same endpoint

Answer

The new variant's weight was set to 100 but the maximum weight per variant is 50

Answer

The endpoint's load balancer is misconfigured and not forwarding traffic to the new variant

Question 13

An ML engineer needs to deploy a model as an AWS Lambda function for serverless inference. The model is a scikit-learn pipeline serialized as a pickle file. What is the best way to include the model in the Lambda deployment?

Accepted Answer

Create a Lambda layer with the model file and use it in the function. Option A is correct because Lambda layers let you package dependencies and static files, such as a serialized scikit-learn pipeline, separately from the function code. Layer contents are extracted into the /opt directory of the execution environment, so the function can load the model from local disk without downloading it at run time. The function and all of its layers must fit within Lambda's 250 MB unzipped size limit, so this approach suits models that are small enough to fit.
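Loading the model once per execution environment, rather than on every invocation, is the usual pattern with a layer. A minimal sketch follows; the path /opt/model/model.pkl, the MODEL_PATH environment variable, and the event shape {"features": [...]} are assumptions, and the demo at the bottom uses a temporary file and a placeholder object instead of a real layer.

```python
import functools
import os
import pickle
import tempfile

# Hypothetical location of the pickle inside the layer.
MODEL_PATH = os.environ.get("MODEL_PATH", "/opt/model/model.pkl")


@functools.lru_cache(maxsize=1)
def load_model(path: str):
    """Deserialize the model once; later calls reuse the cached object."""
    with open(path, "rb") as f:
        # Only unpickle files you built yourself; pickle can execute code.
        return pickle.load(f)


def handler(event, context):
    """Lambda entry point; assumes the model exposes a predict method."""
    model = load_model(MODEL_PATH)
    prediction = model.predict([event["features"]])
    return {"prediction": prediction.tolist()}


# Local demonstration of the caching behavior with a placeholder object.
demo_path = os.path.join(tempfile.mkdtemp(), "model.pkl")
with open(demo_path, "wb") as f:
    pickle.dump({"placeholder": True}, f)

first = load_model(demo_path)
second = load_model(demo_path)
print(first is second)  # True
```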

Answer

Use API Gateway to proxy requests to the model stored in S3

Answer

Store the model in S3 and download it on every invocation

Answer

Mount an EFS file system containing the model

Question 14

A company uses SageMaker Pipelines to train and register models. They want to automate the deployment of approved models from the model registry to a staging endpoint. Which service should they use to orchestrate the deployment workflow?

Accepted Answer

AWS Step Functions. AWS Step Functions is the correct choice because it is a serverless orchestration service designed to coordinate multiple AWS services into flexible, event-driven workflows. For models registered by SageMaker Pipelines, a Step Functions state machine can deploy an approved model package to a staging endpoint by chaining steps such as invoking a Lambda function for approval checks, calling SageMaker's CreateEndpoint API, and handling rollback logic on failure.
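The chained steps can be sketched as an Amazon States Language definition built in Python. This is a sketch under stated assumptions: the Lambda function ARN and endpoint name are hypothetical, the Lambda function is assumed to return an EndpointConfigName for the approved model, and rollback is reduced to a single Fail state.

```python
import json


def staging_deploy_definition(approval_lambda_arn: str, endpoint_name: str) -> dict:
    """State machine: check approval, create the endpoint, fail on error."""
    return {
        "Comment": "Deploy an approved model package to a staging endpoint",
        "StartAt": "CheckApproval",
        "States": {
            "CheckApproval": {
                "Type": "Task",
                "Resource": "arn:aws:states:::lambda:invoke",
                "Parameters": {"FunctionName": approval_lambda_arn, "Payload.$": "$"},
                # Keep only the field the next state needs.
                "ResultSelector": {
                    "EndpointConfigName.$": "$.Payload.EndpointConfigName"
                },
                "Next": "CreateStagingEndpoint",
            },
            "CreateStagingEndpoint": {
                "Type": "Task",
                "Resource": "arn:aws:states:::sagemaker:createEndpoint",
                "Parameters": {
                    "EndpointName": endpoint_name,
                    "EndpointConfigName.$": "$.EndpointConfigName",
                },
                "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "DeploymentFailed"}],
                "End": True,
            },
            "DeploymentFailed": {
                "Type": "Fail",
                "Error": "StagingDeploymentFailed",
                "Cause": "CreateEndpoint returned an error",
            },
        },
    }


definition = staging_deploy_definition(
    "arn:aws:lambda:us-east-1:123456789012:function:check-approval",
    "fraud-model-staging",
)
print(json.dumps(definition, indent=2))
```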

Answer

AWS CloudFormation

Answer

Amazon EventBridge

Answer

AWS CodePipeline

Question 15

A team is deploying a TensorFlow model on a SageMaker real-time endpoint with automatic scaling. They set the scaling policy to target an average CPU utilization of 50%. However, during traffic spikes, the endpoint experiences high latency and 503 errors. The instance type is ml.c5.large. What should the team do to resolve this while minimizing cost?

Accepted Answer

Add a scaling policy based on the number of concurrent requests per instance. Option D is correct because scaling on CPU utilization alone is often insufficient for inference workloads where latency is the primary concern. A scaling policy based on the number of concurrent requests per instance lets the endpoint scale out before CPU saturation occurs, which reduces latency and the 503 errors caused by overloaded instances. An endpoint variant can have more than one target tracking policy, and concurrent requests per instance reflects the actual demand on the model serving container more closely than CPU utilization does.

Answer

Pre-warm the endpoint by keeping a fixed number of additional instances

Answer

Increase the scale-in cooldown period to avoid frequent downsizing

Answer

Change the instance type to a larger one like ml.c5.xlarge to handle the spikes