A team is optimizing a RAG pipeline for OCI Generative AI. They observe that the model's responses are verbose and often include irrelevant details from the retrieved chunks, reducing user satisfaction. They have already tuned the prompt template. What is the most effective next step?
Re-ranking scores each chunk for relevance to the query, filtering out noise.
Why this answer
Option B is correct because a re-ranking step with a cross-encoder model targets the actual cause of the verbose, off-topic responses: noisy context. A cross-encoder evaluates each query-chunk pair jointly and produces a fine-grained relevance score, which lets the pipeline drop noisy or off-topic chunks before they reach the generation model. The LLM then receives cleaner context, which reduces verbosity and irrelevance without retraining the model or changing the retrieval threshold.
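The re-ranking step described above can be sketched as follows. This is a minimal illustration, not an OCI API: the names `rerank`, `toy_relevance`, `top_n`, and `min_score` are hypothetical, and the term-overlap scorer is only a stand-in for a real cross-encoder so that the example runs without any model download.

```python
from typing import Callable, List, Tuple


def toy_relevance(query: str, chunk: str) -> float:
    """Stand-in scorer: fraction of query terms that appear in the chunk.

    Assumption: a real pipeline would call a cross-encoder here, which reads
    the query and the chunk together and returns a learned relevance score.
    """
    query_terms = set(query.lower().split())
    chunk_terms = set(chunk.lower().split())
    if not query_terms:
        return 0.0
    return len(query_terms & chunk_terms) / len(query_terms)


def rerank(
    query: str,
    chunks: List[str],
    score_fn: Callable[[str, str], float] = toy_relevance,
    top_n: int = 2,
    min_score: float = 0.5,
) -> List[Tuple[str, float]]:
    """Score every (query, chunk) pair, drop weak chunks, keep the best top_n."""
    scored = [(chunk, score_fn(query, chunk)) for chunk in chunks]
    kept = [pair for pair in scored if pair[1] >= min_score]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:top_n]


# Hypothetical chunks returned by the retriever for one query.
retrieved = [
    "billing invoices are issued monthly",
    "reset your password from the login page",
    "password rules require twelve characters",
    "the office is closed on public holidays",
]

# Only the chunks that pass the relevance filter go into the prompt context.
context = rerank("reset password", retrieved)
```

Swapping `score_fn` for a function that wraps a cross-encoder model keeps the rest of the pipeline unchanged, which is why re-ranking is a comparatively lightweight addition between retrieval and generation.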
Exam trap
Oracle often tests the misconception that adjusting retrieval parameters (threshold or count) is sufficient to fix relevance issues, when in fact a dedicated re-ranking step is needed to refine the quality of the context passed to the generation model.
How to eliminate wrong answers
Option A is wrong because instruction tuning is a resource-intensive process that modifies the generation model itself, requiring a curated dataset and significant compute; it is not a lightweight next step and does not directly address the retrieval quality issue.
Option C is wrong because simply reducing the number of retrieved chunks from 5 to 3 may discard relevant information while still allowing irrelevant chunks to pass through; it does not improve the relevance ranking of the chunks that are kept.
Option D is wrong because increasing the similarity threshold from 0.7 to 0.85 may cause the retrieval step to miss relevant chunks that have lower cosine similarity scores, reducing recall while still not filtering out irrelevant chunks that happen to score above the threshold.
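A small worked example may make the reasoning for Options C and D concrete. The similarity scores and relevance labels below are made up purely for illustration, and the helper names `top_k` and `above_threshold` are hypothetical. The point is that both options act on the retriever's own score, so an off-topic chunk that the retriever scored highly survives either change.

```python
from typing import List, NamedTuple


class Hit(NamedTuple):
    text: str
    similarity: float  # retriever's cosine similarity (made-up value)
    relevant: bool     # ground-truth label, known only in this illustration


hits = [
    Hit("chunk A", 0.91, False),  # close in embedding space but off-topic
    Hit("chunk B", 0.88, True),
    Hit("chunk C", 0.80, True),
    Hit("chunk D", 0.74, True),
    Hit("chunk E", 0.71, False),
]


def top_k(candidates: List[Hit], k: int) -> List[Hit]:
    """Keep the k chunks with the highest retriever similarity."""
    return sorted(candidates, key=lambda h: h.similarity, reverse=True)[:k]


def above_threshold(candidates: List[Hit], threshold: float) -> List[Hit]:
    """Keep every chunk whose retriever similarity meets the threshold."""
    return [h for h in candidates if h.similarity >= threshold]


# Option C: cut from 5 chunks to 3. Keeps A, B, C and drops relevant chunk D.
option_c = top_k(hits, 3)

# Option D: raise the threshold to 0.85. Keeps A, B and drops relevant C and D.
option_d = above_threshold(hits, 0.85)

# In both cases the off-topic chunk A is still passed to the model.
still_noisy = [h.text for h in option_c + option_d if not h.relevant]
```

In this made-up data, each option loses relevant chunks and keeps the irrelevant one, because neither re-evaluates how well a chunk actually answers the query.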