A data scientist is deploying a fine-tuned Mistral model on Amazon Bedrock. After deployment, inference latency is too high for real-time applications. Which configuration change can reduce latency without significantly impacting output quality?
Generating fewer tokens speeds up inference, and for many use cases 256 tokens is sufficient.
Why this answer
Reducing the max tokens limit decreases the number of generated tokens, directly reducing latency. Lowering temperature or using a larger model may not help or may degrade quality.