MLS-C01

1

Modeling

medium

A machine learning engineer is deploying a sentiment analysis model on Amazon SageMaker. The model is a BERT-based transformer that accepts inputs of up to 512 tokens. The engineer notices that inference latency is high (over 500 ms per request) on a single ml.c5.xlarge instance, while the application requires latency under 100 ms. The model has already been optimized with half-precision (FP16) weights. Which action should the engineer take to reduce latency?
