Your company uses Azure OpenAI Service to generate product descriptions. You need to reduce costs while maintaining response quality for common requests. What should you implement?
Implement semantic caching. It reduces costs without sacrificing quality for common requests.
Why this answer
Semantic caching stores responses keyed by the meaning of the prompt, so a common request that is semantically similar to an earlier one can be answered from the cache instead of triggering a new model call and its per-token inference cost. Quality is preserved because the cached response is one the same model already generated for an equivalent request. In Azure, the cache typically sits in front of Azure OpenAI Service rather than inside it, for example as the semantic caching policies in Azure API Management. Unlike the other options, it neither downgrades the model nor truncates output.
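The lookup described above can be sketched in a few lines of Python. This is a minimal illustration, not how Azure implements it: `embed` is a toy bag-of-words stand-in for a real embedding model, `fake_model` stands in for the Azure OpenAI call, and the `SemanticCache` class, its `threshold` value, and the linear scan over entries are all assumptions made for the example.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class SemanticCache:
    """Hypothetical cache that reuses responses for similar prompts."""

    def __init__(self, generate, threshold=0.8):
        self.generate = generate    # function that calls the model
        self.threshold = threshold  # minimum similarity for a cache hit
        self.entries = []           # list of (embedding, response) pairs
        self.model_calls = 0        # how many billable calls were made

    def complete(self, prompt):
        vec = embed(prompt)
        best_score, best_response = 0.0, None
        for cached_vec, response in self.entries:
            score = cosine(vec, cached_vec)
            if score > best_score:
                best_score, best_response = score, response
        if best_response is not None and best_score >= self.threshold:
            return best_response    # cache hit: no model call, no token cost
        self.model_calls += 1       # cache miss: pay for one generation
        response = self.generate(prompt)
        self.entries.append((vec, response))
        return response


def fake_model(prompt):
    """Placeholder for the real model call (assumption for this sketch)."""
    return f"Generated description for: {prompt}"


cache = SemanticCache(fake_model, threshold=0.8)
first = cache.complete("Write a product description for a red cotton t-shirt")
second = cache.complete("Write a product description for a red cotton t-shirt please")
third = cache.complete("Summarize the return policy for electronics")
```

Here the second prompt differs in wording from the first but is similar enough to be served from the cache, while the unrelated third prompt triggers a new model call. A production setup would use real embeddings and a vector-capable store, and the similarity threshold would need tuning so that near matches are reused without returning responses for genuinely different requests.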
Exam trap
The trap is confusing semantic caching with simple exact-match request caching, or assuming that batching lowers the cost of repeated requests. Semantic caching reuses responses for prompts that are similar in meaning, not just identical in text, and it does so without degrading quality. The other options either reduce quality or leave the cost of repeated prompts untouched.
How to eliminate wrong answers
Option A is wrong because switching to a smaller model variant (e.g., GPT-3.5 instead of GPT-4) lowers response quality and capability, which contradicts the requirement to maintain response quality. Option B is wrong because batching does not avoid regenerating the same content: each prompt in a batch is still processed and billed per token, so repeated common requests cost as much as before. (Azure OpenAI does offer an asynchronous Batch API at a discount, but it is aimed at offline bulk jobs rather than at serving common requests on demand.) Option D is wrong because lowering max_tokens for all requests can truncate responses, degrading quality for common requests that need longer outputs, and it does nothing to eliminate the cost of repeated or similar prompts.