MLA-C01 Deployment and Orchestration of ML Workflows Practice Test 7 — 15 Questions

Question 1

A data science team needs to deploy a trained PyTorch model for real-time inference with sub-100ms latency. The model fits on a single GPU. Which SageMaker inference option is MOST cost-effective while meeting the latency requirement?

Accepted Answer

SageMaker real-time endpoint on ml.g4dn.xlarge. SageMaker real-time endpoints provide dedicated, persistent instances that can handle synchronous inference with sub-100ms latency. The ml.g4dn.xlarge instance includes a single NVIDIA T4 GPU, which is sufficient for the model size and offers the lowest cost among GPU instances that meet the latency requirement. This option balances performance and cost for real-time, low-latency inference.

Answer

SageMaker Batch Transform

Answer

SageMaker Async Inference

Answer

SageMaker Serverless Inference

Question 2

A company has 200 small PyTorch models that are each used infrequently but need to be available for real-time inference. To minimize costs, they want to host all models on a single endpoint. Which SageMaker feature should they use?

Accepted Answer

Multi-model endpoint (MME). Multi-model endpoints allow hosting hundreds of models on a single endpoint, automatically loading/unloading models based on traffic. Multi-container endpoints are for different containers, not multiple models. Batch and asynchronous are not real-time.

Answer

Multi-container endpoint

Answer

Batch Transform job

Answer

Asynchronous inference endpoint
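
At the API level, the multi-model endpoint in the accepted answer differs from a single-model endpoint in two places: the container definition is created in multi-model mode against an S3 prefix, and each invocation names the artifact to use. The sketch below only builds the request dictionaries, following boto3 request shapes; the bucket, endpoint name, artifact name, and image URI placeholder are hypothetical.

```python
import json

# Hypothetical names, used only for illustration.
MODEL_PREFIX = "s3://example-bucket/models/"  # prefix holding every model.tar.gz
ENDPOINT_NAME = "example-mme-endpoint"


def mme_container(image_uri, model_prefix):
    """Container definition for CreateModel; Mode="MultiModel" enables MME."""
    return {"Image": image_uri, "Mode": "MultiModel", "ModelDataUrl": model_prefix}


def invoke_request(model_artifact, payload):
    """Keyword arguments for the sagemaker-runtime InvokeEndpoint call.

    TargetModel is the artifact path relative to the S3 prefix above.
    """
    return {
        "EndpointName": ENDPOINT_NAME,
        "TargetModel": model_artifact,
        "ContentType": "application/json",
        "Body": json.dumps(payload),
    }


container = mme_container("<pytorch-inference-image-uri>", MODEL_PREFIX)
request = invoke_request("customer-017.tar.gz", {"inputs": [1.0, 2.0]})
# With credentials configured, the call would be:
#   boto3.client("sagemaker-runtime").invoke_endpoint(**request)
```

Because the model is chosen per request, adding a model to the endpoint amounts to uploading another artifact under the prefix.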

Question 3

A machine learning engineer deploys a new model version to a SageMaker endpoint with production variants. They want to gradually shift traffic from the old model to the new model, monitoring for errors, and automatically roll back if the error rate exceeds 5%. Which deployment pattern should they use?

Accepted Answer

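Canary deployment with CloudWatch alarms. A canary deployment shifts a small share of traffic to the new model first, waits through a monitoring period, and then shifts the rest; SageMaker rolls back automatically if a specified CloudWatch alarm (here, error rate above 5%) goes into alarm. An all-at-once blue/green deployment switches all traffic in a single step. A/B testing compares variants rather than managing a rollout. Shadow testing mirrors traffic to the new model but does not return its responses to users.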

Answer

A/B testing with traffic splitting

Answer

Blue/green deployment

Answer

Shadow testing
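
The canary rollout with automatic rollback can be expressed as the DeploymentConfig argument of an UpdateEndpoint call. This is a minimal sketch that only builds the request; the endpoint, endpoint config, and alarm names are hypothetical, and the alarm itself (error rate above 5%) is assumed to exist already in CloudWatch.

```python
def canary_deployment_config(alarm_name, canary_percent=10):
    """DeploymentConfig for UpdateEndpoint: canary shift plus alarm-based rollback."""
    return {
        "BlueGreenUpdatePolicy": {
            "TrafficRoutingConfiguration": {
                "Type": "CANARY",
                "CanarySize": {"Type": "CAPACITY_PERCENT", "Value": canary_percent},
                # Monitoring period before the remaining traffic is shifted.
                "WaitIntervalInSeconds": 600,
            },
            "TerminationWaitInSeconds": 300,
            "MaximumExecutionTimeoutInSeconds": 3600,
        },
        # If any listed alarm fires during the rollout, traffic returns to the old fleet.
        "AutoRollbackConfiguration": {"Alarms": [{"AlarmName": alarm_name}]},
    }


update_request = {
    "EndpointName": "example-endpoint",             # hypothetical
    "EndpointConfigName": "example-config-v2",      # hypothetical
    "DeploymentConfig": canary_deployment_config("example-error-rate-above-5pct"),
}
# With credentials configured:
#   boto3.client("sagemaker").update_endpoint(**update_request)
```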

Question 4

A team uses SageMaker Pipelines to automate retraining. They want to skip the training step if the data has not changed since the last run. Which feature should they enable?

Accepted Answer

Step caching. Step caching in SageMaker Pipelines allows you to reuse the output from a previous execution of a step if its input data and configuration parameters have not changed. By enabling caching on the training step, the pipeline automatically skips re-executing that step when the data is identical, saving time and cost. This directly addresses the requirement to skip retraining when data has not changed.

Answer

Parameterized executions

Answer

Lineage tracking

Answer

Condition step with a custom check
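
The idea behind step caching can be illustrated with a toy cache keyed on a step's name and arguments. This is a conceptual sketch only, not how SageMaker implements caching; in the SageMaker Python SDK the feature is enabled by passing a CacheConfig to the step. All function names and S3 paths below are made up for the example.

```python
import hashlib
import json

_cache = {}  # cache key -> output recorded by an earlier execution


def cache_key(step_name, arguments):
    """Hash the step name and its arguments into a stable key."""
    blob = json.dumps({"step": step_name, "args": arguments}, sort_keys=True)
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()


def run_step(step_name, arguments, execute):
    """Return (output, cache_hit); call execute only on a cache miss."""
    key = cache_key(step_name, arguments)
    if key in _cache:
        return _cache[key], True
    output = execute(arguments)
    _cache[key] = output
    return output, False


calls = []


def fake_training(arguments):
    """Stand-in for a training job; records each real execution."""
    calls.append(arguments["input_uri"])
    return "s3://example-bucket/models/run-%d" % len(calls)


args_v1 = {"input_uri": "s3://example-bucket/data/v1/", "epochs": 10}
args_v2 = {"input_uri": "s3://example-bucket/data/v2/", "epochs": 10}

first = run_step("Train", args_v1, fake_training)         # miss: trains
second = run_step("Train", dict(args_v1), fake_training)  # hit: skipped
third = run_step("Train", args_v2, fake_training)         # new input: trains again
```

Note that a key built this way changes only when the arguments change, so in this toy model new data has to arrive under a new input location to trigger retraining.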

Question 5

A company needs to deploy a large language model (LLM) on SageMaker with the Triton Inference Server to maximize GPU utilization and reduce latency. They have an NVIDIA A100 GPU. Which SageMaker inference option supports Triton?

Accepted Answer

SageMaker real-time endpoint using a Triton Inference Server container. SageMaker real-time endpoints support the Triton Inference Server through a pre-built container that integrates with NVIDIA A100 GPUs, enabling dynamic batching and concurrent model execution to maximize GPU utilization and reduce latency. Triton is designed for high-throughput inference on GPU hardware, making it the correct choice for this scenario.

Answer

SageMaker Batch Transform with Triton

Answer

SageMaker Serverless Inference with a custom container

Answer

SageMaker Neo compiled model on a CPU endpoint

Question 6

A data scientist wants to version and manage trained models, require approval before deployment, and enable cross-account deployment. Which SageMaker feature provides these capabilities?

Accepted Answer

SageMaker Model Registry. SageMaker Model Registry is the correct choice because it provides a centralized catalog for versioning trained models, supports approval workflows (e.g., pending, approved, rejected) to gate deployment, and enables cross-account deployment by sharing model package ARNs across AWS accounts via AWS Resource Access Manager (RAM) or cross-account IAM roles. This directly satisfies all three requirements: versioning, approval before deployment, and cross-account deployment.

Answer

SageMaker Neo

Answer

SageMaker Pipelines

Answer

Amazon Elastic Inference
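
The approval gate in Model Registry comes down to changing the approval status of a model package version. A minimal sketch of that request follows, assuming the boto3 UpdateModelPackage request shape; the ARN is a made-up example and the helper name is hypothetical.

```python
VALID_STATUSES = {"PendingManualApproval", "Approved", "Rejected"}


def approval_request(model_package_arn, status, note=""):
    """Keyword arguments for the SageMaker UpdateModelPackage call."""
    if status not in VALID_STATUSES:
        raise ValueError("unknown approval status: %s" % status)
    request = {"ModelPackageArn": model_package_arn, "ModelApprovalStatus": status}
    if note:
        request["ApprovalDescription"] = note
    return request


# Made-up ARN for illustration only.
ARN = "arn:aws:sagemaker:us-east-1:111122223333:model-package/example-group/3"
approve = approval_request(ARN, "Approved", "Passed offline evaluation")
# With credentials configured:
#   boto3.client("sagemaker").update_model_package(**approve)
```

A deployment workflow can then react to the status change instead of being triggered by hand.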

Question 7

A company runs a batch inference job on 10 TB of image data stored in S3. Each image needs to be processed by a GPU-accelerated model. The job is not time-sensitive and cost is the primary concern. Which SageMaker option is MOST appropriate?

Accepted Answer

SageMaker Batch Transform with GPU instance and spot instances. Batch Transform on GPU instances is the most cost-effective choice for a non-time-sensitive, large-scale batch inference job on 10 TB of data: instances run only for the duration of the job, so there is no persistent endpoint to pay for. Spot capacity offers up to 90% cost savings over on-demand, and Batch Transform natively handles splitting the dataset, distributing work across instances, and writing results to S3.

Answer

SageMaker Serverless Inference

Answer

SageMaker Async Inference with GPU

Answer

SageMaker real-time endpoint on GPU instances
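
A Batch Transform job for the image workload can be described with a single CreateTransformJob request. The sketch below builds that request only; the job name, model name, S3 paths, instance type, and instance count are illustrative assumptions rather than values from the question.

```python
def transform_job_request(job_name, model_name, input_prefix, output_prefix,
                          instance_type="ml.g4dn.xlarge", instance_count=2):
    """Keyword arguments for the SageMaker CreateTransformJob call."""
    return {
        "TransformJobName": job_name,
        "ModelName": model_name,
        "TransformInput": {
            "DataSource": {
                "S3DataSource": {"S3DataType": "S3Prefix", "S3Uri": input_prefix}
            },
            "ContentType": "application/x-image",
            "SplitType": "None",  # each S3 object (one image) is one request
        },
        "TransformOutput": {"S3OutputPath": output_prefix},
        "TransformResources": {
            "InstanceType": instance_type,
            "InstanceCount": instance_count,
        },
    }


job = transform_job_request(
    "example-image-batch",                 # hypothetical job name
    "example-image-model",                 # hypothetical model name
    "s3://example-bucket/images/",
    "s3://example-bucket/predictions/",
)
# With credentials configured:
#   boto3.client("sagemaker").create_transform_job(**job)
```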

Question 8

A team uses SageMaker Pipelines to train and register a model. They want to conditionally run a hyperparameter tuning step only if the data quality check passes. Which pipeline step type should they use to branch the execution?

Accepted Answer

ConditionStep. The ConditionStep allows comparing values and branching to different steps. If data quality passes, the tuning step runs; otherwise, the pipeline stops or runs an alternative step. Other steps do not provide conditional branching.

Answer

TuningStep

Answer

TrainingStep

Answer

TransformStep
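
The branching behaviour of a ConditionStep can be mimicked in a few lines. This is a toy evaluator for illustration, not the SageMaker SDK: in the SDK, ConditionStep takes a list of conditions plus if_steps and else_steps, and the if-branch runs only when every condition holds. The metric name, threshold, and step names are hypothetical.

```python
import operator

# Comparison types supported by this toy evaluator.
COMPARATORS = {
    "GreaterThanOrEqualTo": operator.ge,
    "LessThanOrEqualTo": operator.le,
    "Equals": operator.eq,
}


def next_steps(conditions, if_steps, else_steps, values):
    """Return if_steps when all conditions hold for values, else else_steps."""
    passed = all(
        COMPARATORS[c["type"]](values[c["left"]], c["right"]) for c in conditions
    )
    return if_steps if passed else else_steps


# Hypothetical data quality gate: run tuning only if completeness >= 0.98.
quality_gate = [{"type": "GreaterThanOrEqualTo", "left": "completeness", "right": 0.98}]

good = next_steps(quality_gate, ["TuneModel"], ["FailPipeline"], {"completeness": 0.995})
bad = next_steps(quality_gate, ["TuneModel"], ["FailPipeline"], {"completeness": 0.90})
```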

Question 9

An ML engineer needs to orchestrate a multi-step workflow that includes data preprocessing on Spark, model training on SageMaker, and deployment to a production endpoint. They require tight integration with other AWS services and the ability to add custom logic. Which AWS service should they use alongside SageMaker?

Accepted Answer

AWS Step Functions. AWS Step Functions is the correct choice because it provides a serverless workflow orchestration service that can coordinate multi-step ML pipelines involving Spark on AWS Glue or EMR, SageMaker training jobs, and endpoint deployments. It offers tight integration with over 200 AWS services via direct SDK integrations, supports custom logic through Lambda functions, and includes built-in error handling, retries, and parallel execution — making it ideal for complex, heterogeneous ML workflows that extend beyond SageMaker's native capabilities.

Answer

AWS CloudFormation

Answer

SageMaker Pipelines

Answer

Amazon EventBridge
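
The Step Functions workflow from the accepted answer can be sketched as an Amazon States Language definition with three task states. The Glue job name and the training and endpoint parameters below are hypothetical placeholders, and a real deployment would also create a model and an endpoint configuration before the final state; those states are omitted to keep the sketch short.

```python
import json


def ml_workflow_definition(glue_job, training_params, endpoint_params):
    """Build a state machine: Spark preprocessing, training, then deployment."""
    return {
        "Comment": "Preprocess with Glue, train on SageMaker, deploy an endpoint",
        "StartAt": "Preprocess",
        "States": {
            "Preprocess": {
                "Type": "Task",
                # .sync makes the state wait until the Glue job finishes.
                "Resource": "arn:aws:states:::glue:startJobRun.sync",
                "Parameters": {"JobName": glue_job},
                "Retry": [{
                    "ErrorEquals": ["States.TaskFailed"],
                    "IntervalSeconds": 30,
                    "MaxAttempts": 2,
                    "BackoffRate": 2.0,
                }],
                "Next": "Train",
            },
            "Train": {
                "Type": "Task",
                "Resource": "arn:aws:states:::sagemaker:createTrainingJob.sync",
                "Parameters": training_params,
                "Next": "Deploy",
            },
            "Deploy": {
                "Type": "Task",
                "Resource": "arn:aws:states:::sagemaker:createEndpoint",
                "Parameters": endpoint_params,
                "End": True,
            },
        },
    }


definition = ml_workflow_definition(
    "example-spark-preprocess",                       # hypothetical Glue job
    {"TrainingJobName.$": "$.training_job_name"},     # remaining fields omitted
    {"EndpointName": "example-endpoint", "EndpointConfigName": "example-config"},
)
definition_json = json.dumps(definition, indent=2)
```

Custom logic fits in as additional Task states that invoke Lambda functions between these steps.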

Question 10

A company wants to trigger a model retraining pipeline whenever new training data arrives in an S3 bucket. They also need to send a notification to a Slack channel when the retraining completes. Which TWO AWS services should they use to implement this event-driven workflow? (Select TWO.)

Accepted Answer

AWS Lambda and Amazon EventBridge. AWS Lambda is correct because it can be triggered directly by S3 event notifications (e.g., s3:ObjectCreated:*) to start the model retraining pipeline. Amazon EventBridge is correct because it can capture completion events from the retraining pipeline (e.g., SageMaker training job state changes) and route them to a target such as a Lambda function or SNS topic that posts to the Slack webhook.

Answer

Amazon SQS

Answer

AWS CloudTrail

Answer

SageMaker Model Registry
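
The two halves of the event-driven workflow can be sketched as a Lambda handler that starts the pipeline from an S3 event, and an EventBridge event pattern that matches pipeline completion. The pipeline name and the InputDataUri parameter are hypothetical, and the event pattern is an assumption about the fields SageMaker emits for pipeline status changes; check it against the events in your account.

```python
from urllib.parse import unquote_plus

PIPELINE_NAME = "example-retraining-pipeline"  # hypothetical


def start_execution_request(s3_event):
    """Turn an S3 notification into StartPipelineExecution keyword arguments."""
    record = s3_event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    key = unquote_plus(record["object"]["key"])  # keys arrive URL-encoded
    return {
        "PipelineName": PIPELINE_NAME,
        "PipelineParameters": [
            {"Name": "InputDataUri", "Value": "s3://%s/%s" % (bucket, key)}
        ],
    }


def handler(event, context):
    """Lambda entry point; boto3 is imported lazily so the module loads anywhere."""
    import boto3
    response = boto3.client("sagemaker").start_pipeline_execution(
        **start_execution_request(event)
    )
    return response["PipelineExecutionArn"]


# Assumed EventBridge pattern for routing completion events to a notifier.
COMPLETION_PATTERN = {
    "source": ["aws.sagemaker"],
    "detail-type": ["SageMaker Model Building Pipeline Execution Status Change"],
    "detail": {"currentPipelineExecutionStatus": ["Succeeded", "Failed"]},
}

sample_event = {"Records": [{"s3": {
    "bucket": {"name": "example-bucket"},
    "object": {"key": "incoming/new+data.csv"},
}}]}
sample_request = start_execution_request(sample_event)
```

The rule's target would be a second Lambda function or an SNS topic that posts the message to Slack.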

Question 11

A machine learning team needs to deploy a PyTorch model that has been compiled with SageMaker Neo to improve inference performance on edge devices. Which TWO statements about SageMaker Neo are correct? (Select TWO.)

Accepted Answer

Neo reduces model inference latency through optimization techniques. SageMaker Neo compiles a trained model for a specific hardware target (e.g., ARM, Intel, NVIDIA), which reduces inference latency on that target. The model does not have to be trained on SageMaker; Neo compiles already-trained models. Neo does not scale endpoints automatically, and it is not limited to built-in algorithms.

Answer

Neo requires the model to be trained on SageMaker

Answer

Neo can only compile models trained with SageMaker built-in algorithms

Answer

Neo automatically scales SageMaker endpoints based on demand

Question 12

A company is using SageMaker to serve a model for real-time predictions. They want to test a new model version by routing a small percentage of live traffic to it while the rest goes to the current model. They also need to compare performance metrics. Which TWO actions should they take? (Select TWO.)

Accepted Answer

Monitor the performance of both variants using SageMaker CloudWatch metrics. Monitoring is correct because Amazon CloudWatch provides built-in metrics for SageMaker endpoints, including latency, invocation counts, and error rates, broken down per production variant, so the company can compare the new model version against the current model in real time. The second correct action is to deploy the new model as an additional production variant on the same endpoint: SageMaker endpoints support multiple production variants, and an initial traffic weight (e.g., 5%) routes a small percentage of live traffic to the new model while the rest goes to the existing variant.

Answer

Deploy the new model to a separate endpoint and use Route 53 to split traffic

Answer

Compile the new model with SageMaker Neo before deployment

Answer

Use SageMaker Batch Transform to evaluate the new model
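
Traffic splitting between production variants is driven by relative weights: each variant receives its weight divided by the sum of all weights. A minimal sketch follows, assuming the ProductionVariants shape of CreateEndpointConfig; the variant names, model names, and instance settings are hypothetical.

```python
def production_variants(current_model, new_model, new_percent=5,
                        instance_type="ml.m5.large"):
    """Two ProductionVariants entries for a CreateEndpointConfig request."""
    def variant(name, model, weight):
        return {
            "VariantName": name,
            "ModelName": model,
            "InstanceType": instance_type,
            "InitialInstanceCount": 1,
            "InitialVariantWeight": float(weight),
        }
    return [
        variant("current", current_model, 100 - new_percent),
        variant("candidate", new_model, new_percent),
    ]


def traffic_shares(variants):
    """Fraction of requests each variant receives: weight / sum of weights."""
    total = sum(v["InitialVariantWeight"] for v in variants)
    return {v["VariantName"]: v["InitialVariantWeight"] / total for v in variants}


variants = production_variants("example-model-v1", "example-model-v2")
shares = traffic_shares(variants)
```

Per-variant CloudWatch metrics are then compared using the endpoint name and variant name as dimensions.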

Question 13

An ML engineer is designing a SageMaker Pipeline for model training and registration. They need to ensure that the pipeline can be re-run with different datasets without manual intervention, and that the steps are only re-executed if inputs have changed. Which THREE features should they configure? (Select THREE.)

Accepted Answer

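Enable step caching to reuse outputs when inputs are unchanged. The three correct features are step caching, pipeline parameters, and parameterized executions. Pipeline parameters let each run receive different inputs, such as a new dataset location, without editing the pipeline definition; a parameterized execution is simply a run started with those parameter values. Step caching reuses a step's outputs when its inputs are identical. Lineage tracking records provenance and does not skip steps. Condition steps are for branching, not caching. Model Registry is for versioning.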

Answer

Add a Condition step to manually check for data changes

Answer

Configure lineage tracking to record the origin of models

Question 14

A company wants to deploy a containerized inference application that includes a custom pre-processing script and a TensorFlow model on SageMaker. They need the ability to independently scale the pre-processing and model serving components. Which TWO SageMaker features support this? (Select TWO.)

Accepted Answer

Multiple real-time endpoints with an Application Load Balancer. Multi-container endpoints run multiple containers (e.g., pre-processing and model) on the same endpoint, but scaling happens per endpoint instance, so the containers scale together. Multi-model endpoints host multiple models, not multiple containers. To scale pre-processing and model serving independently, deploy them as separate endpoints behind a load balancer, each with its own scaling policy; AWS App Mesh is another option, but it is not a SageMaker feature. The two answers are therefore multi-container endpoints for running the containers and multiple endpoints for independent scaling.

Answer

SageMaker inference pipeline

Answer

SageMaker Batch Transform

Answer

Multi-model endpoint
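
Independent scaling with separate endpoints means each endpoint variant gets its own Application Auto Scaling target and policy. The sketch below builds the two requests per endpoint, following the boto3 application-autoscaling request shapes; endpoint names, capacities, and target values are hypothetical.

```python
def scaling_requests(endpoint_name, variant_name, min_instances, max_instances,
                     invocations_per_instance):
    """Return (register_scalable_target kwargs, put_scaling_policy kwargs)."""
    common = {
        "ServiceNamespace": "sagemaker",
        "ResourceId": "endpoint/%s/variant/%s" % (endpoint_name, variant_name),
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
    }
    target = dict(common, MinCapacity=min_instances, MaxCapacity=max_instances)
    policy = dict(
        common,
        PolicyName="%s-invocations-target" % endpoint_name,
        PolicyType="TargetTrackingScaling",
        TargetTrackingScalingPolicyConfiguration={
            "TargetValue": float(invocations_per_instance),
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
            },
        },
    )
    return target, policy


# Each component scales on its own limits and target (values are made up).
prep_target, prep_policy = scaling_requests("example-preprocess", "AllTraffic", 1, 4, 500)
model_target, model_policy = scaling_requests("example-tf-model", "AllTraffic", 2, 10, 100)
# With credentials configured:
#   client = boto3.client("application-autoscaling")
#   client.register_scalable_target(**prep_target)
#   client.put_scaling_policy(**prep_policy)
```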

Question 15

A team is deploying a model using SageMaker real-time endpoint with an ml.m5.large instance. They notice high latency under peak load. They want to reduce latency without increasing instance size. Which THREE actions could help? (Select THREE.)

Accepted Answer

Quantize the model to reduce its size. The three actions are compiling the model with SageMaker Neo, attaching Elastic Inference, and quantizing the model. SageMaker Neo compiles the model for the target hardware, reducing latency. Elastic Inference attaches GPU acceleration to a CPU instance. Quantization reduces model size and speeds up inference. Increasing the instance count raises throughput but does not reduce per-request latency. Changing to a GPU instance such as ml.g4dn.xlarge increases instance size, which the requirement rules out.

Answer

Increase the number of instances in the endpoint

Answer

Change the instance type to ml.g4dn.xlarge