This chapter covers Retrieval Augmented Generation (RAG), a technique that combines information retrieval with generative AI to produce accurate, up-to-date, and grounded responses. For the AI-900 exam, RAG appears under objective 5.3 (Generative AI) and typically accounts for 5-10% of questions. Understanding RAG is essential because it addresses a key limitation of LLMs—their reliance on static training data—by enabling them to access external knowledge bases at inference time.
Imagine a legal researcher (the LLM) who has memorized general legal principles from law school (pre-training data). When asked a specific question about a current case, the researcher does not rely solely on memory, which may be outdated or incomplete. Instead, the researcher goes to a law library (the external knowledge base) that contains up-to-date statutes, case law, and legal briefs. The researcher first retrieves relevant documents from the library (retrieval step) and then reads them to formulate a precise answer (generation step). Critically, the researcher never modifies the library's documents—they remain unchanged. The researcher's final answer is a synthesis of the retrieved information and the researcher's own reasoning. If the library is poorly organized or contains irrelevant documents, the answer will be flawed. This mirrors RAG: a retrieval system fetches relevant chunks from a vector database, and the LLM uses those chunks as context to generate a grounded answer, without altering the stored data.
What is Retrieval Augmented Generation (RAG)?
Retrieval Augmented Generation (RAG) is an architectural pattern that enhances large language models (LLMs) by providing them with relevant external information retrieved from a knowledge base at the time of a query. Instead of generating responses solely from the model's internal parameters (which are frozen after training), RAG retrieves pertinent documents or text chunks and inserts them into the prompt as context. This allows the model to reference current, specific, or proprietary information without retraining.
RAG was introduced by Lewis et al. in 2020 (arXiv:2005.11401) and has become a standard approach for grounding LLMs. The core idea is to separate knowledge storage from reasoning: the LLM handles language understanding and generation, while a separate retrieval system handles access to factual knowledge.
Why RAG Exists
LLMs have several inherent limitations that RAG addresses:
Static knowledge: LLMs are trained on a fixed corpus and have a knowledge cutoff date. They cannot know about events after that date unless retrained.
Hallucination: When asked about obscure or specific topics, LLMs may fabricate plausible-sounding but incorrect information.
Lack of attribution: LLMs typically do not provide sources for their claims, making it hard to verify accuracy.
Inability to access private data: An organization's internal documents, databases, or APIs cannot be included in a public model's training data.
RAG solves these by allowing the model to retrieve relevant, up-to-date, and verifiable information from a dedicated knowledge base before generating an answer.
How RAG Works Internally
The RAG pipeline consists of two main phases: indexing (preparation) and retrieval-generation (inference).
#### Indexing Phase
Document Ingestion: Source documents (PDFs, web pages, databases, etc.) are collected. They are typically split into smaller chunks (e.g., 256-512 tokens) to improve retrieval granularity.
Embedding Creation: Each chunk is passed through an embedding model (e.g., text-embedding-ada-002 from OpenAI) to produce a dense vector representation. These vectors capture semantic meaning.
Vector Store Population: The embeddings are stored in a vector database (e.g., Azure AI Search, formerly Azure Cognitive Search; Pinecone; Weaviate) along with the original text chunk and metadata. The vector index supports efficient similarity search.
#### Retrieval and Generation Phase
Query Embedding: When a user submits a query, the same embedding model converts the query into a vector.
Similarity Search: The vector database performs a nearest neighbor search, scoring chunks with a metric such as cosine similarity or Euclidean distance, to find the top-k most similar chunks (k is typically 3-10).
Context Assembly: The retrieved chunks are concatenated into a prompt template that includes the original query and the retrieved context. A typical prompt structure:
Context:
{retrieved_chunk_1}
{retrieved_chunk_2}
...
Question: {user_query}
Answer based on the context above. If the context does not contain enough information, say "I don't know."
Generation: The LLM processes the augmented prompt and generates a response. Because the context is provided in the prompt, the model can ground its answer in the retrieved information.
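The context assembly step can be sketched in a few lines of Python. This is a minimal illustration, not a definitive implementation: the function name `build_prompt` and the sample chunks are hypothetical, and the template simply mirrors the prompt structure shown above.

```python
def build_prompt(chunks: list[str], query: str) -> str:
    """Fill the context/question template with the retrieved chunks."""
    context = "\n".join(chunks)
    return (
        "Context:\n"
        f"{context}\n"
        f"Question: {query}\n"
        "Answer based on the context above. If the context does not contain "
        'enough information, say "I don\'t know."'
    )


# Hypothetical retrieved chunks, used for illustration only.
prompt = build_prompt(
    ["Paris is the capital of France.", "France is in Western Europe."],
    "What is the capital of France?",
)
print(prompt)
```

The resulting string is what gets sent to the LLM; the model itself is not changed in any way.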
Key Components, Values, and Defaults
Embedding Model: Converts text to vectors. Common choices: OpenAI text-embedding-ada-002 (1536 dimensions), Sentence-BERT (384-768 dimensions). In Azure, use Azure OpenAI embedding models.
Vector Database: Stores embeddings and performs similarity search. Azure Cognitive Search (with vector support), Cosmos DB (with vector index), or third-party like Pinecone.
Chunk Size: Typically 256-1024 tokens. Smaller chunks improve precision; larger chunks provide more context. A common default is 512 tokens.
Chunk Overlap: Typically 10-20% of the chunk size, to avoid losing context at boundaries. For example, 512-token chunks with a 50-token overlap (about 10%).
Top-k: Number of retrieved chunks. A common default is 3-5. A higher k supplies more context but may introduce noise.
Similarity Metric: Cosine similarity is most common. Others: Euclidean distance, dot product.
Prompt Template: Must instruct the model to use context and avoid hallucination. A typical system message: "You are a helpful assistant. Use the provided context to answer the question. If you don't know, say you don't know."
Configuration and Verification
In Azure, RAG can be implemented using Azure OpenAI Service with Azure Cognitive Search as a vector store. The configuration involves:
Deploy an embedding model (e.g., text-embedding-ada-002) in Azure OpenAI.
Create an Azure Cognitive Search index with a vector field for embeddings.
Use the Azure OpenAI SDK or REST API to perform retrieval and generation.
Example Python code snippet (using LangChain):
# Import paths below match older LangChain releases; newer releases move
# these classes into the langchain_community and langchain_openai packages.
from langchain.vectorstores import AzureSearch
from langchain.embeddings import OpenAIEmbeddings
from langchain.chains import RetrievalQA
from langchain.llms import AzureOpenAI

# Initialize embeddings and vector store
embeddings = OpenAIEmbeddings(model="text-embedding-ada-002")
vector_store = AzureSearch(
    azure_search_endpoint="...",
    azure_search_key="...",
    index_name="rag-index",
    embedding_function=embeddings.embed_query,
)
# Set up retriever
retriever = vector_store.as_retriever(search_kwargs={"k": 5})
# Create LLM
llm = AzureOpenAI(deployment_name="gpt-35-turbo", model_name="gpt-35-turbo")
# Build RAG chain
qa_chain = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff", retriever=retriever)
# Query
response = qa_chain.run("What is the capital of France?")
To verify, check that the retrieved chunks are relevant to the query and that the generated answer is supported by them, for example by instructing the model to cite the sources it used.
Interaction with Related Technologies
RAG often works alongside:
Azure Cognitive Search: Provides hybrid search (keyword + vector) for better retrieval.
Azure OpenAI Service: Hosts the LLM and embedding models.
LangChain / Semantic Kernel: Orchestration frameworks that simplify RAG pipeline construction.
Azure Cosmos DB: Can store vector embeddings for real-time applications.
Azure Logic Apps / Power Automate: For automated document ingestion.
RAG is complementary to fine-tuning. Fine-tuning adapts the model's behavior, while RAG provides dynamic knowledge. They can be combined: a fine-tuned model can be used as the generator in a RAG pipeline.
Performance Considerations
Latency: Retrieval typically adds on the order of 100-500 ms, depending on the vector database and network. Embedding the query also adds time.
Cost: Embedding and LLM calls incur token costs. Caching frequent queries can reduce expense.
Accuracy: Depends on chunking strategy, embedding model quality, and top-k selection. Poor retrieval leads to poor answers.
Security: Ensure the knowledge base only contains authorized content. Access control is critical.
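The caching idea mentioned under Cost can be sketched with Python's standard `functools.lru_cache`. This is a minimal sketch under stated assumptions: `run_rag_pipeline` is a hypothetical stand-in for the full embed, retrieve, and generate sequence, and the query normalization (lowercasing and collapsing whitespace) is just one simple choice.

```python
from functools import lru_cache

# Counts how often the (expensive) pipeline actually runs.
calls = {"count": 0}


def run_rag_pipeline(query: str) -> str:
    """Hypothetical stand-in for embedding, retrieval, and LLM generation."""
    calls["count"] += 1
    return f"answer to: {query}"


@lru_cache(maxsize=1024)
def cached_answer(normalized_query: str) -> str:
    return run_rag_pipeline(normalized_query)


def answer(query: str) -> str:
    # Normalize so trivially different spellings share one cache entry.
    return cached_answer(" ".join(query.lower().split()))


answer("What is RAG?")
answer("  what is   rag? ")
print(calls["count"])  # the pipeline ran once; the second call hit the cache
```

A real deployment would also need a way to invalidate cached answers when the index is refreshed, which this sketch leaves out.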
Advanced RAG Patterns
Hybrid Search: Combines keyword (BM25) and vector search for better relevance.
Re-ranking: After initial retrieval, use a cross-encoder model to re-rank chunks and select the most relevant.
Query Transformation: Rewrite the user query to improve retrieval (e.g., decomposition, expansion).
Multi-hop RAG: For complex questions, retrieve iteratively, each time using the previous answer to inform the next retrieval.
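To make hybrid search more concrete, the sketch below merges a keyword ranking and a vector ranking with reciprocal rank fusion (RRF). RRF is one common fusion method, chosen here as an assumption; the text above does not say how the two result lists are combined. The document IDs are hypothetical, and `k=60` is a conventional smoothing constant.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists; each hit contributes 1 / (k + rank)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Hypothetical results from a BM25 keyword search and a vector search.
keyword_ranking = ["doc_a", "doc_b", "doc_c"]
vector_ranking = ["doc_c", "doc_a", "doc_d"]

fused = reciprocal_rank_fusion([keyword_ranking, vector_ranking])
print(fused)  # ['doc_a', 'doc_c', 'doc_b', 'doc_d']
```

Documents that appear in both lists (`doc_a`, `doc_c`) rise to the top, which is the intended effect of combining exact-term and semantic matching.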
Exam-Relevant Details
The AI-900 exam expects you to know that RAG retrieves information from an external source and uses it as context for the LLM.
You should be able to identify RAG as a solution to hallucination and out-of-date knowledge.
Understand that RAG does NOT modify the model's weights; it only modifies the input prompt.
Know that vector embeddings are used to represent text semantically.
Be aware that Azure Cognitive Search can serve as a vector store.
Common Pitfalls
Confusing RAG with fine-tuning: RAG is prompt-level augmentation; fine-tuning is weight-level.
Thinking RAG updates the model: No, the model remains unchanged.
Assuming retrieval is perfect: Poor chunking or embedding can cause irrelevant retrieval.
Ignoring prompt engineering: The prompt must instruct the model to use context; otherwise, the model may ignore it.
Conclusion
RAG is a powerful technique to ground LLMs with external knowledge. For the AI-900 exam, focus on the core concept, components (embedding, vector store, LLM), and benefits (reduced hallucination, up-to-date info, access to private data).
Ingest and chunk documents
Source documents are collected and split into smaller chunks (e.g., 512 tokens each) to enable precise retrieval. Overlap (e.g., 10-20%) is applied between chunks to preserve context at boundaries. Each chunk is assigned a unique ID and metadata (source, date, etc.). This step is crucial because chunk size directly affects retrieval quality: too large chunks may contain irrelevant information, too small may lack context.
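A sliding-window chunker with overlap might look like the following. It is a sketch only: whitespace-separated words stand in for model tokens (a real pipeline would use the embedding model's tokenizer), and the function name `chunk_tokens` is hypothetical.

```python
def chunk_tokens(tokens: list[str], chunk_size: int = 512, overlap: int = 50) -> list[list[str]]:
    """Split tokens into windows of chunk_size that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the document
    return chunks


# Assumption: words approximate tokens for the purpose of this example.
words = [f"w{i}" for i in range(1200)]
chunks = chunk_tokens(words)
print([len(c) for c in chunks])  # [512, 512, 276]
```

The last 50 tokens of each chunk reappear at the start of the next one, so a sentence that straddles a boundary is still fully present in at least one chunk.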
Generate embeddings for each chunk
Each text chunk is passed through an embedding model (e.g., OpenAI text-embedding-ada-002) to produce a dense vector of floating-point numbers (e.g., 1536 dimensions). The vector captures semantic meaning: similar chunks have similar vectors. For a given model version, the same input yields the same (or nearly identical) vector. The vectors are stored in a vector database along with the chunk text and metadata.
Store embeddings in vector database
The embedding vectors and associated text are inserted into a vector database like Azure Cognitive Search or Pinecone. The database builds an index (e.g., HNSW, IVF) that enables fast approximate nearest neighbor search. The index is optimized for cosine similarity or other distance metrics. The database also supports filtering by metadata (e.g., date range, source).
Embed user query
When a user submits a question, the same embedding model used during indexing converts the query into a vector. This ensures that the query and document chunks are in the same embedding space. The query embedding is computed in real time, so its latency (often milliseconds for a local model, longer when a hosted API is called) adds to every request.
Perform similarity search
The query vector is sent to the vector database, which performs a nearest neighbor search to find the top-k most similar document chunks (e.g., k=5). Cosine similarity is the most common metric, and many vector databases use it by default. The database returns the chunk texts, their similarity scores, and metadata. The retrieved chunks are ordered by relevance.
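The search itself can be illustrated with a brute-force version in plain Python. This is a sketch under stated assumptions: the three-dimensional vectors are made up for readability (real embeddings have hundreds or thousands of dimensions), and production vector databases use approximate indexes rather than scanning every chunk.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], index: list[tuple[str, list[float]]], k: int = 5):
    """Return the k (score, text) pairs most similar to the query, best first."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Hypothetical chunks with hand-written 3-dimensional "embeddings".
index = [
    ("Restart the modem by unplugging it for 30 seconds.", [0.9, 0.1, 0.0]),
    ("Billing cycles start on the first of each month.", [0.0, 0.2, 0.9]),
    ("A blinking red light on the modem means no signal.", [0.8, 0.3, 0.1]),
]
query_vec = [1.0, 0.2, 0.0]  # stands in for the embedded user question

results = top_k(query_vec, index, k=2)
for score, text in results:
    print(f"{score:.3f}  {text}")
```

With these vectors the two modem-related chunks are returned ahead of the billing chunk, because their vectors point in nearly the same direction as the query vector.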
Construct augmented prompt
The retrieved chunks are inserted into a prompt template that includes the user query and instructions to use the context. A typical system message: "You are a helpful assistant. Use the following pieces of context to answer the question. If you don't know, say you don't know." The prompt size is limited by the LLM's context window (e.g., 4,096 tokens for the original GPT-3.5 Turbo), so the instructions, retrieved chunks, and question must all fit within it, with room left for the answer.
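One simple way to respect the context window is to add chunks in relevance order until a token budget is used up. The sketch below assumes whitespace-separated words as a rough proxy for tokens; the names `approx_token_count` and `select_chunks` and the numbers in the example are hypothetical.

```python
def approx_token_count(text: str) -> int:
    """Rough proxy: one token per whitespace-separated word (an assumption)."""
    return len(text.split())


def select_chunks(chunks: list[str], context_window: int, reserved: int) -> list[str]:
    """Keep the most relevant chunks that fit in the remaining token budget.

    `reserved` covers the instructions, the question, and the answer.
    `chunks` is assumed to be ordered from most to least relevant.
    """
    budget = context_window - reserved
    selected = []
    used = 0
    for chunk in chunks:
        cost = approx_token_count(chunk)
        if used + cost > budget:
            break
        selected.append(chunk)
        used += cost
    return selected


ranked_chunks = ["a b c d", "e f g", "h i j k l"]
kept = select_chunks(ranked_chunks, context_window=12, reserved=4)
print(kept)  # ['a b c d', 'e f g']
```

Stopping at the first chunk that does not fit keeps the highest-ranked material; other policies, such as skipping an oversized chunk and trying the next one, are equally reasonable.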
Generate response with LLM
The augmented prompt is sent to the LLM (e.g., GPT-3.5, GPT-4). The model generates a response that synthesizes the retrieved context with its own language capabilities. Because the context is provided, the model is less likely to hallucinate. The response can include citations to the retrieved sources if instructed.
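Citations are easier to request when each retrieved chunk is numbered and labeled with its source before it goes into the prompt. The following is a minimal sketch; the dictionary keys `source` and `text`, the file names, and the instruction wording are all hypothetical.

```python
def format_context_with_sources(chunks: list[dict]) -> str:
    """Number each chunk so the model can cite it as [1], [2], and so on."""
    lines = []
    for number, chunk in enumerate(chunks, start=1):
        lines.append(f"[{number}] ({chunk['source']}) {chunk['text']}")
    return "\n".join(lines)


# Hypothetical retrieved chunks with source metadata.
retrieved = [
    {"source": "modem-guide.pdf", "text": "Hold the reset button for 10 seconds."},
    {"source": "faq.html", "text": "A solid green light means the modem is online."},
]

context_block = format_context_with_sources(retrieved)
instruction = "Cite the bracketed number of every source you use."
print(context_block)
print(instruction)
```

Because the numbers map back to stored metadata, an application can turn a citation such as [1] in the answer into a link to the original document.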
Enterprise Scenario 1: Customer Support Chatbot
A large telecommunications company deploys a customer support chatbot using RAG. The knowledge base contains thousands of support articles, troubleshooting guides, and product manuals. When a customer asks about a specific modem issue, the chatbot retrieves the most relevant articles and generates a step-by-step solution. The system is configured with Azure Cognitive Search as the vector store, chunk size of 512 tokens, and top-k=5. Latency is under 2 seconds. A common problem is that outdated articles are retrieved if the index is not refreshed regularly. To mitigate, the team sets up a weekly pipeline to re-ingest new articles and remove deprecated ones. Performance scales to 1000 queries per second with multiple replicas.
Enterprise Scenario 2: Legal Document Analysis
A law firm uses RAG to help attorneys review case law and statutes. The knowledge base contains millions of legal documents. The firm uses hybrid search (keyword + vector) to ensure precise retrieval of specific legal citations. Chunking is done at the paragraph level (300 tokens) to maintain legal context. The system is deployed on Azure with Cosmos DB as the vector store for low-latency access. A key challenge is handling confidential data; the vector store is encrypted and access is restricted via Microsoft Entra ID (formerly Azure Active Directory). Retrieval quality can also suffer if a general-purpose embedding model represents legal terminology poorly, so the team fine-tunes the embedding model on legal text to improve accuracy.
Enterprise Scenario 3: Medical Diagnosis Support
A healthcare organization builds a RAG-based assistant for doctors. The knowledge base includes medical journals, drug databases, and clinical guidelines. The system retrieves relevant studies and drug interactions when a doctor queries a patient's symptoms. Chunk size is set to 256 tokens to capture precise medical facts. Top-k is 10 to ensure comprehensive coverage. The system uses Azure OpenAI with GPT-4 and a custom prompt that instructs the model to cite sources. A critical consideration is that the model must not provide medical advice directly; the prompt includes a disclaimer. If the retrieval fails to find relevant information, the model is instructed to say "I don't know" rather than guess. This prevents harmful hallucinations.
AI-900 Exam Focus on RAG
Objective Code: 5.3 – Describe capabilities of generative AI models, including RAG.
The AI-900 exam tests your understanding of RAG at a conceptual level. You are not expected to implement RAG, but you must know:
What RAG is: A technique that retrieves relevant information from an external knowledge base and uses it as context for an LLM to generate a grounded response.
Why RAG is used: To reduce hallucination, provide up-to-date information, and incorporate private/proprietary data without retraining the model.
Key components: Embedding model, vector database, LLM.
How it differs from fine-tuning: RAG does not change model weights; fine-tuning does.
Benefits: Improved accuracy, reduced hallucination, access to real-time data.
Common Wrong Answers and Why Candidates Choose Them
"RAG retrains the model on new data" – Candidates confuse RAG with fine-tuning. Remember: RAG only modifies the prompt, not the model.
"RAG stores the entire knowledge base in the model's parameters" – Misunderstanding that RAG embeds knowledge into weights. Actually, knowledge remains external.
"RAG is only useful for chatbots" – While common, RAG is used in many applications (search, Q&A, content generation).
"RAG eliminates the need for prompt engineering" – Prompt templates are still critical to instruct the model to use context.
Specific Numbers and Terms That Appear on the Exam
Top-k: Typically 3-5 (the number of retrieved chunks).
Chunk size: Often 256-512 tokens.
Embedding dimensions: 1536 for text-embedding-ada-002.
Vector store: Azure Cognitive Search is the primary example on Azure.
Similarity metric: Cosine similarity is most common.
RAG vs. fine-tuning: Know the difference.
Edge Cases and Exceptions
If the retrieved context is irrelevant, the LLM may still hallucinate. RAG does not guarantee correctness; the answer can only be as good as the context that retrieval supplies.
RAG can be combined with fine-tuning (e.g., fine-tune the LLM on domain-specific language, then use RAG for specific facts).
The retrieval step can be replaced with a database query (SQL) or API call – the core idea is external knowledge injection.
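The last point, that retrieval does not have to be a vector search, can be shown with an in-memory SQLite table from Python's standard library. This is a sketch only: the table name, its rows, and the simple `LIKE` keyword match are hypothetical choices for illustration.

```python
import sqlite3

# In-memory database standing in for an existing knowledge source.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (id INTEGER PRIMARY KEY, body TEXT)")
conn.executemany(
    "INSERT INTO articles (body) VALUES (?)",
    [
        ("The return window is 30 days from delivery.",),
        ("Standard shipping takes 3 to 5 business days.",),
    ],
)


def retrieve(keyword: str, limit: int = 3) -> list[str]:
    """Fetch rows whose text contains the keyword (plain SQL, no embeddings)."""
    rows = conn.execute(
        "SELECT body FROM articles WHERE body LIKE ? LIMIT ?",
        (f"%{keyword}%", limit),
    ).fetchall()
    return [row[0] for row in rows]


context = retrieve("return")
prompt = "Context:\n" + "\n".join(context) + "\nQuestion: How long is the return window?"
print(prompt)
```

The pattern is still RAG: external text is fetched at query time and placed in the prompt, and the model's weights are untouched.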
How to Eliminate Wrong Answers
If an answer says RAG modifies the model's weights, it is wrong.
If an answer says RAG requires retraining, it is wrong.
If an answer says RAG only works with Azure, it is wrong – RAG is platform-agnostic.
If an answer says RAG stores data in the prompt permanently, it is wrong – the context is only for that specific query.
Focus on the core concept: RAG retrieves external information and adds it to the prompt to ground the LLM's response.
RAG retrieves external information and adds it to the LLM's prompt to ground responses.
RAG reduces hallucination and provides up-to-date, verifiable answers.
Key components: embedding model, vector database, LLM.
RAG does not modify model weights; fine-tuning does.
Top-k typically 3-5 retrieved chunks; chunk size 256-512 tokens.
Azure Cognitive Search is a common vector store in Azure RAG solutions.
The embedding model converts text to dense vectors for semantic similarity search.
These come up on the exam all the time. Here's how to tell them apart.
Retrieval Augmented Generation (RAG)
Modifies only the input prompt, not model weights.
Provides access to dynamic, up-to-date information.
Requires a separate retrieval system (vector store).
Lower upfront cost (no training), although each query pays for retrieval and a longer prompt.
Easier to update knowledge (just refresh the index).
Fine-tuning
Updates model weights through additional training.
Knowledge is static after training (unless retrained).
No external retrieval needed; knowledge is internalized.
Higher upfront cost (training).
Updating knowledge requires another round of training.
Mistake
RAG retrains the model on new data.
Correct
RAG does not modify the model's weights. It only adds retrieved information to the prompt at inference time. The model remains unchanged.
Mistake
RAG stores all knowledge in the model's parameters.
Correct
Knowledge is stored externally in a vector database. The model only sees the retrieved chunks in the prompt, not the entire database.
Mistake
RAG and fine-tuning are the same thing.
Correct
Fine-tuning updates model weights through additional training. RAG leaves weights unchanged and augments the input with external data. They are complementary, not identical.
Mistake
RAG eliminates the need for prompt engineering.
Correct
Prompt engineering is still critical. The prompt must instruct the model to use the context; otherwise, the model may ignore it or hallucinate.
Mistake
RAG guarantees 100% accurate answers.
Correct
RAG reduces hallucination but does not eliminate it. If retrieval returns irrelevant or incorrect documents, the model may produce a wrong answer.
RAG adds external information to the prompt at inference time; fine-tuning updates the model's weights through training. RAG is best for incorporating dynamic or private data, while fine-tuning adapts model behavior or style. They can be used together.
RAG typically relies on a vector database, which stores embeddings of document chunks and performs similarity search. Azure Cognitive Search, Pinecone, and Weaviate are common examples. Any system that can retrieve relevant text (e.g., a SQL database with full-text search) can be used instead, but vector search is most effective for semantic similarity.
RAG reduces hallucination by providing relevant context in the prompt, which gives the LLM factual information to base its answer on. The model is instructed to use the context and not to make up information. However, if the retrieved context is wrong, the answer may still be incorrect.
RAG is model-agnostic: any LLM that accepts a prompt can be used. The key is to structure the prompt to include the retrieved context and instruct the model to use it.
Embeddings convert text into numerical vectors that capture semantic meaning. They enable efficient similarity search: the query and documents are compared in vector space to find the most relevant chunks.
How often the index should be updated depends on the application. For fast-changing data (e.g., news), updates may be hourly; for slow-changing data (e.g., company policies), weekly or monthly. In general, the index should be refreshed whenever the source documents change.
Hybrid search combines keyword-based (e.g., BM25) and vector-based retrieval. It improves relevance by matching exact terms as well as semantic meaning. Azure Cognitive Search supports hybrid search.