A data science team has trained a PyTorch model using Amazon SageMaker and wants to deploy it with a custom inference container that includes a pre-processing step. The team needs to minimize latency and ensure the pre-processing runs only once per request. Which SageMaker real-time inference option should they use?
Trap 1: Deploy the model on a multi-model endpoint and include…
Multi-model endpoints are built to host many models behind one shared container on the same instance. They do not chain a separate pre-processing container, and loading models on demand can add cold-start latency.
Trap 2: Use a batch transform job with a pre-processing script.
Batch transform runs offline jobs over a whole dataset stored in Amazon S3; it does not provide a persistent endpoint for real-time inference.
Trap 3: Package pre-processing and inference in a single container with a…
While possible, this forgoes SageMaker's built-in container chaining and couples pre-processing to the model image, so the two cannot be updated or reused independently.
- A
Deploy the model on a multi-model endpoint and include pre-processing in the model code.
Why wrong: Multi-model endpoints are built to host many models behind one shared container on the same instance. They do not chain a separate pre-processing container, and loading models on demand can add cold-start latency.
- B
Use a batch transform job with a pre-processing script.
Why wrong: Batch transform runs offline jobs over a whole dataset stored in Amazon S3; it does not provide a persistent endpoint for real-time inference.
- C
Package pre-processing and inference in a single container with a custom entry point.
Why wrong: While possible, this forgoes SageMaker's built-in container chaining and couples pre-processing to the model image, so the two cannot be updated or reused independently.
- D
Create a SageMaker inference pipeline with two containers: one for pre-processing and one for inference.
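Why correct: An inference pipeline chains 2 to 15 containers sequentially on the same instance behind a single endpoint. Each request passes through the pre-processing container exactly once, and the hand-off to the inference container stays local to the instance, which keeps latency low.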
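As a rough illustration of option D, the sketch below assembles the request body for the SageMaker `CreateModel` API with two containers in serial order. The function name `build_pipeline_model_request`, the model name, role ARN, image URIs, and S3 path are all hypothetical placeholders, not values from the question. The actual API call is left as a comment because it needs AWS credentials.

```python
import json


def build_pipeline_model_request(model_name, role_arn, preprocess_image,
                                 inference_image, model_data_url):
    """Build a CreateModel request whose two containers run in sequence."""
    return {
        "ModelName": model_name,
        "ExecutionRoleArn": role_arn,
        # Order matters: each request flows through this list top to bottom.
        "Containers": [
            {
                # Step 1: pre-processing only, so no model artifact is attached.
                "ContainerHostname": "preprocess",
                "Image": preprocess_image,
            },
            {
                # Step 2: the trained PyTorch model served by the custom image.
                "ContainerHostname": "inference",
                "Image": inference_image,
                "ModelDataUrl": model_data_url,
            },
        ],
        # "Serial" asks SageMaker to run the containers as a pipeline.
        "InferenceExecutionConfig": {"Mode": "Serial"},
    }


# Hypothetical placeholder values, for illustration only.
request = build_pipeline_model_request(
    model_name="pytorch-pipeline-demo",
    role_arn="arn:aws:iam::123456789012:role/ExampleSageMakerRole",
    preprocess_image="123456789012.dkr.ecr.us-east-1.amazonaws.com/preprocess:latest",
    inference_image="123456789012.dkr.ecr.us-east-1.amazonaws.com/pytorch-inference:latest",
    model_data_url="s3://example-bucket/model/model.tar.gz",
)

print(json.dumps(request, indent=2))

# With AWS credentials configured, the request could be submitted as:
#   import boto3
#   boto3.client("sagemaker").create_model(**request)
```

Once an endpoint is created from such a model, a single `InvokeEndpoint` call sends the payload through the pre-processing container first and passes its output to the inference container, so the client never calls the two steps separately.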