A team is deploying a model that requires GPU acceleration for inference. They are using an Amazon SageMaker real-time endpoint. The model is a large language model (LLM) that does not fit on a single GPU. Which configuration should they use to minimize latency while fitting the model?
Hardware and software support for large model inference.
Why this answer
Option C is correct because SageMaker supports model parallelism, allowing a model to be sharded across multiple GPUs in the same instance. Option A is wrong because SageMaker does not support multi-endpoint model parallelism. Option B is wrong because data parallelism is for training, not inference.
Option D is wrong because SageMaker Neo is for optimization, not model parallelism across GPUs.