Courseiva — IT Certification Practice Questions

Machine Learning Implementation and Operations

Difficulty: medium

A company uses Amazon SageMaker to host a real-time inference endpoint for a natural language processing model. The model is a PyTorch model with a transformer architecture, and the endpoint is configured with an ml.m5.large instance. After deployment, the company observes that inference latency is higher than expected and that CPU utilization on the endpoint is near 100% during peak hours. The company wants to reduce latency without significantly increasing cost. Which approach should the company take?