This chapter covers Retrieval-Augmented Generation (RAG) and grounding for generative AI responses, a critical technique for ensuring factual accuracy and reducing hallucinations in LLM outputs. For the GCDL exam, this topic appears in approximately 10-15% of questions under Domain 3 (Data Analytics and AI), Objective 3.3. You will be tested on the architecture, components, and benefits of RAG, as well as how grounding with enterprise data sources like Vertex AI Search and BigQuery enables trustworthy AI applications. Understanding the distinction between grounding and fine-tuning is essential.
Imagine you are a brilliant but forgetful scholar (the LLM) who has memorized many textbooks up to a certain date. A visitor asks a question about a new regulation passed last week. If you rely only on memory, you might guess incorrectly or say 'I don't know.' Instead, you have a librarian (the RAG system) who, upon hearing the question, immediately runs to a bookshelf (the vector database) containing only up-to-date documents. The librarian finds the most relevant pages (retrieval), brings them to you, and you read the specific passage before answering (grounding). The visitor sees only your final answer, not the librarian or the open books. Crucially, the librarian does not rewrite the books—they just fetch the right pages. If the books contain outdated or conflicting information, your answer will reflect that. This is exactly how RAG works: the LLM's internal knowledge is augmented by external, retrievable data at inference time, without retraining.
What is RAG and Why Does It Exist?
Retrieval-Augmented Generation (RAG) is a technique that combines a retrieval system with a generative language model to produce responses grounded in external knowledge sources. The core problem RAG solves is that large language models (LLMs) have a fixed parametric knowledge cutoff—they only know what they were trained on, which may be months or years old. Additionally, LLMs can hallucinate, fabricating plausible-sounding but incorrect information. RAG addresses this by fetching relevant documents from a trusted, up-to-date knowledge base at inference time and conditioning the LLM's response on that retrieved context.
RAG was formally introduced by Lewis et al. in 2020 (arXiv:2005.11401). Since then, it has become the standard architecture for enterprise AI applications that require factual accuracy, such as customer support, legal document analysis, and medical Q&A. The GCDL exam expects you to understand that RAG does not modify the LLM's weights—it only augments the input prompt with retrieved information.
How RAG Works Internally: The Mechanism
The RAG pipeline consists of four main stages: indexing, retrieval, augmentation, and generation.
Indexing Stage (Offline):
- Documents from enterprise sources (e.g., PDFs, internal wikis, BigQuery tables) are preprocessed.
- Each document is split into chunks of typically 256-512 tokens (configurable). Overlapping chunks (e.g., 10% overlap) ensure no context is lost at boundaries.
- Each chunk is converted into a vector embedding using a text embedding model (e.g., Google's textembedding-gecko or Vertex AI's embedding API). The embedding dimension is commonly 768 or 1024.
- The embeddings are stored in a vector database such as Cloud Firestore, AlloyDB with pgvector, or Vertex AI Vector Search. An index is built for fast approximate nearest neighbor (ANN) search, often using algorithms like HNSW (Hierarchical Navigable Small World).
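The chunking step of the indexing stage can be sketched roughly as follows. This is a minimal illustration, not how any managed service chunks documents: the function name chunk_tokens is hypothetical, and the "tokens" here are simply list items standing in for the output of a real tokenizer.

```python
def chunk_tokens(tokens, chunk_size=256, overlap=0.1):
    """Split a token list into fixed-size chunks that share a fraction of their length."""
    if chunk_size <= 0 or not 0 <= overlap < 1:
        raise ValueError("chunk_size must be positive and overlap must be in [0, 1)")
    # The window advances by the non-overlapping part of a chunk.
    step = max(1, int(chunk_size * (1 - overlap)))
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks


# Placeholder "tokens"; a real pipeline would tokenize the document text.
words = [f"w{i}" for i in range(600)]
chunks = chunk_tokens(words, chunk_size=256, overlap=0.1)
print([len(c) for c in chunks])
```

With a 10% overlap on 256-token chunks, each chunk repeats the last 26 tokens of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk.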
Retrieval Stage (Online):
- When a user query arrives, the same embedding model converts the query into a vector.
- The vector database performs a similarity search (cosine similarity or dot product) to find the top-k most relevant document chunks. Common k values are 3-10.
- The retrieved chunks are ranked by similarity score. A relevance threshold (e.g., 0.7) may be applied to discard low-confidence results.
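To make the retrieval stage concrete, here is a small sketch of top-k search by cosine similarity. It scans every vector exhaustively, which is an exact search rather than the approximate (ANN) search a production vector database would use; the function names, the three-dimensional vectors, and the chunk IDs are made up for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, index, k=5):
    """index is a list of (chunk_id, vector); returns [(chunk_id, score)], best first."""
    scored = [(chunk_id, cosine(query_vec, vec)) for chunk_id, vec in index]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy index with hypothetical chunk IDs and tiny vectors.
index = [
    ("refunds", [0.9, 0.1, 0.0]),
    ("shipping", [0.1, 0.9, 0.0]),
    ("careers", [0.0, 0.1, 0.9]),
]
results = top_k([1.0, 0.2, 0.0], index, k=2)
print(results)
```

The query vector points mostly along the first dimension, so the "refunds" chunk ranks first and "shipping" second.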
Augmentation Stage:
- The original user query is combined with the retrieved chunks to form a new prompt. A typical template:

```
Context:
{retrieved_chunk_1}
{retrieved_chunk_2}
...
Question: {user_query}
Answer based on the context above. If the context does not contain the answer, say "I don't know."
```

- The prompt may also include system instructions to enforce grounded behavior, such as "Only use information from the provided context."
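One way to fill such a template in code is sketched below. The name build_prompt and the choice to number each chunk (so the model can cite sources as [1], [2]) are illustrative assumptions rather than a prescribed format.

```python
PROMPT_TEMPLATE = """Only use information from the provided context.

Context:
{context}

Question: {question}
Answer based on the context above. If the context does not contain the answer, say "I don't know."
"""


def build_prompt(question, chunks):
    """Combine the user question with retrieved chunks into one augmented prompt."""
    # Number the chunks so an answer can point back to a specific source.
    context = "\n\n".join(
        f"[{i}] {text.strip()}" for i, text in enumerate(chunks, start=1)
    )
    return PROMPT_TEMPLATE.format(context=context, question=question.strip())


prompt = build_prompt(
    "What is the refund policy?",
    ["Refunds are issued within 30 days.", "Shipping takes 5 days."],
)
print(prompt)
```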
Generation Stage:
- The augmented prompt is sent to the LLM (e.g., Gemini, PaLM 2, or any model deployed on Vertex AI).
- The LLM generates a response conditioned on the retrieved context. Because the context contains the factual information, the LLM is less likely to hallucinate.
- The response can be further post-processed to extract citations or highlight the source documents.
Key Components, Values, and Defaults
Embedding Model: Google's textembedding-gecko@001 produces 768-dimensional embeddings. The textembedding-gecko@003 model supports 768 or 384 dimensions. Default batch size for embeddings is 5.
Chunk Size: Typically 256 tokens for dense text, 512 for longer documents. Overlap is often 10-20%.
Top-K Retrieval: Default k=5 in Vertex AI Search. For custom RAG, k=3-10 is common.
Similarity Metric: Cosine similarity is the default for text embeddings. Distance thresholds vary by embedding model.
Vector Database: Vertex AI Vector Search uses ScaNN (Scalable Nearest Neighbors) for ANN search. It supports real-time updates with a latency of <100ms for indexing.
Context Window: The LLM's context window limits the total number of tokens in the augmented prompt. For example, Gemini 1.5 Pro has a 1 million token context window, allowing large amounts of retrieved context.
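For reference, the cosine similarity metric listed above can be written out as follows. The symbols q (query embedding), d (chunk embedding), and n (embedding dimension, which would be 768 for textembedding-gecko@001) are notation introduced here for illustration.

```latex
\[
\operatorname{sim}(q, d)
  = \frac{q \cdot d}{\lVert q \rVert \, \lVert d \rVert}
  = \frac{\sum_{i=1}^{n} q_i d_i}
         {\sqrt{\sum_{i=1}^{n} q_i^{2}} \; \sqrt{\sum_{i=1}^{n} d_i^{2}}}
\]
```

If both vectors are normalized to unit length, the denominator equals 1 and cosine similarity reduces to the dot product, which is why the two metrics rank results identically for normalized embeddings.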
Configuration and Verification
In Vertex AI, you can set up a RAG system using the Vertex AI Agent Builder or programmatically via the SDK. Key steps:
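Create a data store:

```
gcloud ai datastores create my-datastore \
  --location=us-central1 \
  --display-name="My RAG Data Store"
```

Import documents:

```
gcloud ai datastores documents import my-datastore \
  --location=us-central1 \
  --gcs-path=gs://my-bucket/documents/ \
  --content-type=text/html
```

Data stores for Vertex AI Search are managed through the Agent Builder console or the Discovery Engine API, so treat the gcloud commands above as illustrative and confirm the exact command group and flags in the current gcloud reference.

Query with grounding:

```python
import vertexai
from vertexai.preview.generative_models import GenerativeModel, Tool, grounding

# PROJECT_ID and MY_DATA_STORE are placeholders for your own resources.
vertexai.init(project="PROJECT_ID", location="us-central1")
datastore = "projects/PROJECT_ID/locations/global/collections/default_collection/dataStores/MY_DATA_STORE"
# Build a retrieval tool that grounds responses in the Vertex AI Search data store.
grounding_tool = Tool.from_retrieval(grounding.Retrieval(grounding.VertexAISearch(datastore=datastore)))

model = GenerativeModel("gemini-1.5-pro-001")
response = model.generate_content("What is the refund policy?", tools=[grounding_tool])
```

To verify retrieval quality, you can examine the retrieved chunks through the grounding metadata attached to the response (for example, response.candidates[0].grounding_metadata) or by inspecting logs in Cloud Logging.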
Interaction with Related Technologies
Vertex AI Search: Provides managed RAG with out-of-the-box document indexing and retrieval. It supports structured and unstructured data, including PDFs, HTML, and BigQuery tables. It automatically handles chunking and embedding.
BigQuery: Can serve as a source for grounding by querying tables directly. Vertex AI Agent Builder can connect to BigQuery to retrieve rows as context. This is especially useful for real-time data like inventory levels or transaction records.
Vertex AI Agent Builder: Enables building conversational agents that use RAG. It integrates with Dialogflow CX and allows you to define fallback intents that trigger retrieval.
Model Garden: You can deploy open-source models (e.g., Llama 2, Falcon) and use them with RAG. However, the GCDL exam focuses on Google's managed services.
RAG vs. Fine-Tuning
| Aspect | RAG | Fine-Tuning |
|--------|-----|-------------|
| Data Freshness | Real-time updates to knowledge base | Requires retraining |
| Model Weights | Unchanged | Modified |
| Hallucination | Reduced by grounding | May still hallucinate on unseen data |
| Cost | Lower (no training) | Higher (training compute) |
| Use Case | Frequently changing data | Stable, domain-specific behavior |
The exam often tests this distinction. Remember: RAG is for factual accuracy and freshness; fine-tuning is for style, tone, or domain adaptation.
Grounding with Enterprise Data
Grounding is the broader concept of anchoring an LLM's response to verifiable sources. RAG is a specific implementation of grounding. In Google Cloud, grounding can be achieved through:
- Vertex AI Search: Grounds responses using enterprise search results.
- BigQuery Grounding: Queries BigQuery tables in real-time and injects results into the prompt.
- Custom Grounding: Using Vertex AI Vector Search with your own embeddings.
Key exam point: Grounding does NOT require retraining the model. It is a prompt-engineering technique applied at inference time.
Limitations and Considerations
Retrieval Quality: If the retrieval step fails to find relevant documents, the LLM may fall back to its parametric knowledge, potentially hallucinating. Setting a relevance threshold helps but may cause unanswered queries.
Latency: Retrieval adds 100-500ms to response time. Caching can reduce this.
Context Window: The combined prompt (query + retrieved chunks) must fit within the model's context window. For large contexts, chunking strategy is critical.
Cost: Each query incurs an embedding cost (for the query) and an LLM generation cost. Vertex AI bills both according to the volume of input and output processed.
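The caching idea mentioned under Latency can be sketched with Python's built-in functools.lru_cache. Here embed_remote is a hypothetical stand-in for a network call to an embedding model (it returns a fake vector), and the normalization rule is just one possible choice.

```python
from functools import lru_cache

calls = {"embed": 0}  # counts how often the "remote" model is actually invoked


def embed_remote(text):
    """Hypothetical placeholder for a network call to an embedding model."""
    calls["embed"] += 1
    return tuple(float(ord(c) % 7) for c in text[:8])  # fake vector


@lru_cache(maxsize=1024)
def embed_cached(normalized_query):
    return embed_remote(normalized_query)


def embed_query(query):
    # Normalize case and whitespace so trivially different inputs share a cache entry.
    return embed_cached(" ".join(query.lower().split()))


embed_query("What is the refund policy?")
embed_query("  what is the REFUND policy? ")
print(calls["embed"])  # the second lookup is served from the cache
```

The same pattern could wrap the retrieval call as well, at the price of serving slightly stale results until the cache entry expires or is evicted.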
Exam-Tested Numbers and Terms
Top-K default: 5
Embedding dimension: 768 (textembedding-gecko@001)
Supported document types: HTML, PDF, TXT, JSON (Vertex AI Search)
Grounding source: Vertex AI Search, BigQuery, Cloud Storage
Latency for Vector Search: ~10ms for query embedding, ~50ms for retrieval
Step-by-Step: End-to-End RAG Query Flow
User submits query – The application receives a natural language question.
Query embedding – The query is sent to the embedding model (e.g., textembedding-gecko) which returns a vector.
Vector search – The query vector is used to search the vector index, returning the top-k nearest document chunks with similarity scores.
Threshold filtering – Retrieved chunks with similarity below a threshold (e.g., 0.7) are discarded. If no chunks remain, the system may return a fallback response.
Prompt construction – The remaining chunks are inserted into a prompt template along with the user query and system instructions.
LLM generation – The augmented prompt is sent to the LLM, which generates a response grounded in the provided context.
Citation extraction – The response may include citations referencing the source documents (e.g., document ID, page number).
Response delivery – The final answer is returned to the user, optionally with source links.
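The flow above can be tied together in a short orchestration sketch. The embedding model, vector search, and LLM are passed in as plain callables, and the stubs at the bottom are fabricated stand-ins so the example runs without any cloud service; answer_query and RagAnswer are hypothetical names, not part of any SDK.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass
class RagAnswer:
    text: str
    sources: List[str]


def answer_query(
    query: str,
    embed: Callable[[str], Sequence[float]],
    search: Callable[[Sequence[float], int], List[Tuple[str, str, float]]],
    generate: Callable[[str], str],
    k: int = 5,
    threshold: float = 0.7,
    fallback: str = "I don't know.",
) -> RagAnswer:
    query_vec = embed(query)                        # query embedding
    hits = search(query_vec, k)                     # vector search: (doc_id, text, score)
    kept = [h for h in hits if h[2] >= threshold]   # threshold filtering
    if not kept:
        return RagAnswer(fallback, [])              # nothing relevant: fallback response
    context = "\n".join(text for _, text, _ in kept)
    prompt = (                                      # prompt construction
        f"Context:\n{context}\n\nQuestion: {query}\n"
        "Answer based on the context above."
    )
    # LLM generation, with source IDs returned alongside the answer
    return RagAnswer(generate(prompt), [doc_id for doc_id, _, _ in kept])


# Stubs standing in for real services.
def fake_embed(text):
    return [float(len(text))]


def fake_search(vec, k):
    hits = [
        ("policy.pdf", "Refunds are issued within 30 days.", 0.91),
        ("faq.html", "Shipping takes 5 days.", 0.42),
    ]
    return hits[:k]


def fake_generate(prompt):
    return "Refunds are issued within 30 days."


result = answer_query("What is the refund policy?", fake_embed, fake_search, fake_generate)
print(result)
```

Passing the three services in as functions keeps the control flow testable on its own; swapping a stub for a real client does not change the pipeline logic.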
Commands for Verification
To test retrieval quality:
```python
from vertexai.language_models import TextEmbeddingModel
model = TextEmbeddingModel.from_pretrained("textembedding-gecko@001")
embeddings = model.get_embeddings(["What is the return policy?"])
print(embeddings[0].values[:5])  # First 5 dimensions
```

To query a grounded model:

```bash
curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  https://us-central1-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/us-central1/publishers/google/models/gemini-1.5-pro:generateContent \
  -d '{
    "contents": [{"parts": [{"text": "What is the warranty period?"}]}],
    "tools": [{"retrieval": {"vertexAiSearch": {"datastore": "projects/PROJECT_ID/locations/global/collections/default_collection/dataStores/MY_DATA_STORE"}}}]
  }'
```

User submits a query
The user types a natural language question into the application interface (e.g., a chatbot). The query is captured as a string. This step triggers the RAG pipeline. The application must ensure the query is within the embedding model's input token limit, which varies by model and version. The query is sent to the backend service, which logs the interaction for monitoring.
Query embedding generation
The query string is passed to a text embedding model (e.g., textembedding-gecko@001). The model converts the query into a 768-dimensional vector. This is a synchronous API call with a typical latency of 10-50ms. The embedding captures semantic meaning, enabling similarity search. The embedding is returned as a list of floats. The same model must be used for both indexing and querying to ensure consistent vector space.
Vector similarity search
The query vector is sent to the vector database (e.g., Vertex AI Vector Search). The database performs an approximate nearest neighbor (ANN) search using the ScaNN algorithm. It returns the top-k document chunks (default k=5) with the highest cosine similarity scores. The search typically completes in under 50ms for indices with up to 1 billion vectors. Each result includes the chunk text, metadata (e.g., document ID, page number), and similarity score between 0 and 1.
Relevance threshold filtering
The retrieved chunks are filtered by a similarity threshold (commonly 0.7). Chunks with scores below the threshold are discarded because they may be irrelevant and could cause the LLM to hallucinate. If no chunks pass the threshold, the system may return a default response like 'I don't know' or fall back to the LLM's parametric knowledge (depending on configuration). This step is optional but strongly recommended for production systems.
Prompt construction and augmentation
The system constructs a new prompt by combining the user query with the filtered chunks. A typical template uses a system instruction to restrict the LLM to the provided context. The prompt must fit within the LLM's context window (e.g., up to 1 million tokens for Gemini 1.5 Pro). If the combined tokens exceed the limit, the system may drop the least relevant chunks or use a sliding window approach. The augmented prompt is then sent to the LLM for generation.
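Dropping the least relevant chunks to stay inside a token budget might look like the sketch below. The name fit_to_budget is hypothetical, and the default token counter is a crude whitespace split used only so the example runs; a real system would count tokens with the target model's tokenizer.

```python
def fit_to_budget(scored_chunks, max_context_tokens, count_tokens=lambda s: len(s.split())):
    """Keep the highest-scoring chunks whose combined size fits the token budget.

    scored_chunks is a list of (text, score) pairs.
    """
    kept, used = [], 0
    for text, score in sorted(scored_chunks, key=lambda c: c[1], reverse=True):
        cost = count_tokens(text)
        if used + cost > max_context_tokens:
            continue  # too large for the remaining budget; a smaller chunk may still fit
        kept.append((text, score))
        used += cost
    return kept, used


chunks = [("a b c d e", 0.9), ("f g h i j k l", 0.8), ("m n", 0.75)]
kept, used = fit_to_budget(chunks, max_context_tokens=8)
print(kept, used)
```

In this toy run the second-ranked chunk is skipped because it would exceed the budget, while the smaller third-ranked chunk still fits. Stopping at the first chunk that does not fit is an equally reasonable policy when strict rank order matters more than filling the window.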
LLM generates grounded response
The LLM (e.g., Gemini 1.5 Pro) processes the augmented prompt and generates a response token by token. Because the context contains the factual information, the LLM is less likely to hallucinate. The generation uses a temperature setting (often 0.2 for factual tasks) to reduce randomness. The response may include citations (e.g., [1], [2]) referencing the source chunks. The generation time depends on response length, typically 1-5 seconds.
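If the model is prompted to cite numbered chunks, the markers can be mapped back to source documents with a small post-processing step. This is a sketch under that assumption; extract_citations and the source identifiers are invented for illustration.

```python
import re


def extract_citations(answer, sources):
    """Map [n] markers in the answer to entries of `sources` (1-based), in first-seen order."""
    cited = []
    for match in re.finditer(r"\[(\d+)\]", answer):
        n = int(match.group(1))
        # Ignore markers that point outside the list of chunks that were supplied.
        if 1 <= n <= len(sources) and sources[n - 1] not in cited:
            cited.append(sources[n - 1])
    return cited


sources = ["policy.pdf#page=3", "faq.html", "terms.pdf#page=12"]
answer = "Refunds are issued within 30 days [1]. Opened items are excluded [3][1]. See also [7]."
print(extract_citations(answer, sources))
# → ['policy.pdf#page=3', 'terms.pdf#page=12']
```

A marker such as [7] that does not correspond to any supplied chunk is dropped here; in practice it is also a useful signal that the model cited something it was never given.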
Enterprise Scenario 1: Customer Support for a Telecom Company
A large telecom provider deploys a RAG-powered chatbot on their website to answer customer questions about plans, billing, and technical issues. The knowledge base consists of thousands of PDFs (plan descriptions, troubleshooting guides) and a BigQuery table of real-time service outages. The chatbot uses Vertex AI Search to index the PDFs and BigQuery grounding to fetch current outage data. When a customer asks 'Why is my internet down?', the system retrieves relevant troubleshooting steps from PDFs and checks the BigQuery table for any known outages in the customer's area. The response includes the outage status and steps to reset the modem. Misconfiguration: If the BigQuery query is too slow (>500ms), the chatbot may respond with outdated information. The solution uses a cached snapshot of the outage table updated every 30 seconds.
Enterprise Scenario 2: Legal Document Analysis for a Law Firm
A law firm uses RAG to help lawyers quickly find relevant precedents in a corpus of 500,000 legal documents. Each document is chunked into 512-token segments with 10% overlap to preserve context. The embeddings are stored in Vertex AI Vector Search, with its ScaNN-based ANN index configured for 100 approximate neighbors. Lawyers ask questions like 'What was the ruling in cases similar to Smith v. Jones regarding negligence?' The RAG system retrieves the top-5 most similar case excerpts and the LLM summarizes the key points with citations. Performance consideration: The vector index is rebuilt nightly to incorporate new cases. During peak hours, the system handles 100 queries/second with an average retrieval latency of 30ms. Common pitfall: If the chunk overlap is too small (e.g., 0%), key sentences at chunk boundaries are split across chunks, causing the LLM to miss critical context.
Enterprise Scenario 3: Internal Knowledge Base for a Global Retailer
A retail company with 50,000 employees uses a RAG system to answer HR policy questions. The knowledge base includes employee handbooks, benefit documents, and a FAQ page. The system is built using Vertex AI Agent Builder with Dialogflow CX. When an employee asks 'How many vacation days do I have?', the system retrieves the relevant policy document and also queries the employee's personal data from a BigQuery table (anonymized via a view). The response is grounded in both the policy and the employee's specific accrual. Security consideration: The BigQuery table uses row-level security to ensure employees only see their own data. Misconfiguration: If the grounding tool is not properly scoped, the LLM might leak sensitive information across employees. The solution uses a separate grounding source per user session.
Objective 3.3: Explain how to ground generative AI responses using enterprise data. The GCDL exam tests your understanding of RAG as the primary method for grounding. Key concepts:
RAG vs. Fine-Tuning: The exam loves to test the difference. Remember: RAG does NOT change model weights. Fine-tuning does. RAG is for factual accuracy; fine-tuning is for style/behavior. A common wrong answer says 'RAG requires retraining the model.' This is false.
Components of RAG: You must know the four stages: indexing, retrieval, augmentation, generation. Be able to identify which stage is responsible for what. For example, 'embedding generation' occurs during indexing and retrieval.
Vertex AI Search: This is the managed service for grounding. Know that it supports HTML, PDF, TXT, and JSON. It automatically chunks and embeds documents. Default top-k is 5.
BigQuery Grounding: The exam may ask how to ground with structured data. BigQuery grounding queries tables in real-time. It is ideal for dynamic data like inventory or pricing.
Hallucination Reduction: Grounding reduces but does not eliminate hallucinations. If the retrieved context is irrelevant or incorrect, the LLM may still hallucinate. The exam may ask about mitigating this: use relevance thresholds and high-quality data.
Trap Answers:
- 'RAG improves model accuracy by fine-tuning on new data.' (Wrong: RAG does not fine-tune.)
- 'Grounding requires a vector database only.' (Wrong: BigQuery grounding does not use vectors.)
- 'Vertex AI Search can only index unstructured text.' (Wrong: It also supports structured data via BigQuery.)
- 'The embedding model must be the same as the generative model.' (Wrong: They can be different; e.g., use textembedding-gecko with Gemini.)

Edge Cases:
- What happens when no relevant documents are retrieved? The LLM may fall back to its parametric knowledge. The exam tests that you can configure a fallback response.
- Context window overflow: If the retrieved chunks exceed the LLM's context window, the system must truncate. The exam may ask about strategies like chunking or summarization.

Numbers to Memorize:
- Default top-k: 5
- Embedding dimension: 768 (textembedding-gecko@001)
- Cosine similarity default metric
- Vertex AI Vector Search latency: <100ms for retrieval

How to Eliminate Wrong Answers:
- If an answer mentions 'retraining' or 'fine-tuning' in the context of grounding, it is likely wrong.
- If an answer says 'grounding eliminates hallucinations,' it is too absolute; grounding reduces but does not eliminate.
- If an answer suggests using a relational database without any vector capability for similarity search, it is wrong unless it specifies BigQuery grounding for structured data.
Exam Tip: When you see a question about 'reducing hallucinations without retraining,' the correct answer is almost always RAG/grounding. Look for keywords like 'retrieve,' 'search,' 'knowledge base,' 'enterprise data.'
RAG combines retrieval from a knowledge base with an LLM to produce grounded responses without retraining.
The four stages of RAG are indexing, retrieval, augmentation, and generation.
Default top-k retrieval in Vertex AI Search is 5 documents.
textembedding-gecko@001 generates 768-dimensional embeddings.
Grounding can use Vertex AI Search (unstructured) or BigQuery (structured data).
RAG reduces but does not eliminate hallucinations; relevance thresholds help mitigate risk.
Grounding is applied at inference time, not during training.
These come up on the exam all the time. Here's how to tell them apart.
RAG (Retrieval-Augmented Generation)
Does not modify model weights.
Updates knowledge by changing the external database.
Lower cost (no training compute).
Ideal for frequently changing data.
Reduces hallucinations by providing relevant context at inference.
Fine-Tuning
Updates model weights through additional training.
Knowledge is static until retrained.
Higher cost (requires GPU/TPU training time).
Ideal for adapting model behavior, style, or domain expertise.
May still hallucinate on data not seen during training.
Mistake
RAG requires retraining the language model.
Correct
RAG does not modify the LLM's weights. It only augments the input prompt with retrieved documents at inference time. The model remains unchanged.
Mistake
Grounding completely eliminates hallucinations.
Correct
Grounding significantly reduces hallucinations but does not eliminate them. If the retrieved context is irrelevant, contradictory, or low-quality, the LLM may still generate incorrect information.
Mistake
Vertex AI Search can only index unstructured text files.
Correct
Vertex AI Search supports multiple data types including HTML, PDF, TXT, JSON, and can connect to BigQuery for structured data. It also supports website crawling.
Mistake
The embedding model must be the same as the generative model.
Correct
They are independent. For example, you can use Google's textembedding-gecko for embeddings and any LLM (Gemini, open-source) for generation.
Mistake
RAG is only useful for question-answering chatbots.
Correct
RAG can be used for many tasks including summarization, content generation with citations, data extraction, and code generation grounded in internal documentation.
Grounding is the broader concept of anchoring an LLM's response to verifiable sources. RAG is a specific technique that implements grounding by retrieving relevant documents from a knowledge base and using them as context for generation. In Google Cloud, grounding can be achieved via Vertex AI Search, BigQuery, or custom vector databases. The exam often uses the terms interchangeably, but technically RAG is a subset of grounding methods.
Vertex AI Search automatically indexes documents (HTML, PDF, TXT, JSON) by chunking them into segments (default 256-512 tokens) and generating embeddings using a built-in model. When a query comes in, it embeds the query and performs a vector similarity search to retrieve the top-k chunks. These chunks are then provided as context to the LLM (e.g., Gemini) via a grounding tool. You can configure the data store, top-k, and relevance thresholds.
Yes. Vertex AI Agent Builder supports BigQuery grounding where you can configure a tool that executes a SQL query against a BigQuery table and injects the results into the prompt. This is ideal for data that changes frequently, such as inventory levels, pricing, or user-specific information. The query is executed in real-time at inference, ensuring freshness.
The behavior depends on configuration. You can set a relevance threshold; if no chunks exceed it, the system can either return a default response like 'I don't know' or fall back to the LLM's parametric knowledge. The exam expects you to know that you should configure a fallback to avoid hallucinations. Some implementations also use a 'no answer' detection model.
Yes, RAG is model-agnostic. You can use it with any generative model that accepts a text prompt, including open-source models like Llama 2 or Falcon, as well as proprietary models like Gemini or GPT-4. The key requirement is that the model can condition its response on the provided context. Google Cloud supports RAG with models deployed on Vertex AI.
Common metrics include retrieval precision (fraction of retrieved documents that are relevant), answer accuracy (human evaluation or automated metrics like BLEU/ROUGE), and hallucination rate. In production, you can log the retrieved chunks and the final answer for manual review. Vertex AI provides monitoring tools to track grounding metadata and response quality.
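Retrieval precision, as defined above, is simple to compute once you have relevance labels. The sketch below uses a tiny hand-labeled evaluation set with made-up document IDs; precision_at_k is an illustrative helper, not a Vertex AI function.

```python
def precision_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the top-k retrieved documents that are labeled relevant."""
    top = retrieved_ids[:k]
    if not top:
        return 0.0
    return sum(1 for doc_id in top if doc_id in relevant_ids) / len(top)


# Each entry: (ranked retrieval result, set of documents a reviewer marked relevant).
eval_set = [
    (["d1", "d7", "d3", "d9", "d4"], {"d1", "d3"}),
    (["d2", "d5", "d8", "d6", "d0"], {"d5"}),
]
scores = [precision_at_k(retrieved, relevant, k=5) for retrieved, relevant in eval_set]
mean_precision = sum(scores) / len(scores)
print(scores, round(mean_precision, 2))
```

Here the two queries score 0.4 and 0.2, for a mean precision of 0.3 at k=5. Tracking this number over time shows whether changes to chunking or embeddings actually improve retrieval.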
Costs include: (1) Embedding generation for indexing (one-time per document) and for each query. (2) Storage for vector embeddings and document chunks. (3) LLM generation cost per token. (4) Vector database query cost per request. Vertex AI charges per character for embeddings and per token for generation. Compared to fine-tuning, RAG is typically cheaper because it avoids training compute.
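The per-query part of this breakdown is straightforward arithmetic, sketched below. Every rate in the example is a made-up placeholder chosen only to show the calculation, not an actual Vertex AI price, and the Rates fields are hypothetical names. One-time indexing and storage costs are left out.

```python
from dataclasses import dataclass


@dataclass
class Rates:
    # All values are hypothetical placeholders, not published prices.
    embed_per_1k_chars: float
    input_per_1k_tokens: float
    output_per_1k_tokens: float
    vector_query: float


def cost_per_query(rates, query_chars, prompt_tokens, output_tokens):
    """Sum the query embedding, vector search, and LLM generation costs for one request."""
    return (
        query_chars / 1000 * rates.embed_per_1k_chars
        + prompt_tokens / 1000 * rates.input_per_1k_tokens
        + output_tokens / 1000 * rates.output_per_1k_tokens
        + rates.vector_query
    )


rates = Rates(
    embed_per_1k_chars=0.0001,
    input_per_1k_tokens=0.001,
    output_per_1k_tokens=0.002,
    vector_query=0.00005,
)
per_query = cost_per_query(rates, query_chars=60, prompt_tokens=2500, output_tokens=200)
print(f"{per_query:.6f}")
```

With these placeholder rates the augmented prompt dominates the total, which illustrates why the number and size of retrieved chunks is usually the main cost lever in a RAG system.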