GCDL | Chapter 88 of 101 | Objective 3.3

LLM Concepts: Tokens, Context, Fine-Tuning

This chapter covers foundational concepts of Large Language Models (LLMs) essential for the Google Cloud Digital Leader exam: tokens, context windows, and fine-tuning. These concepts underpin how LLMs process and generate text, and understanding them is critical for evaluating AI solutions on Google Cloud. Approximately 10-15% of exam questions in the Data Analytics and AI domain touch on LLM fundamentals, often testing your ability to distinguish between tokenization, context limits, and fine-tuning approaches. Mastery of these topics will help you recommend appropriate AI services like Vertex AI and understand their operational constraints.

Updated May 31, 2026

The Library Analogy for LLMs

Imagine a vast library where every book is a token — a word or subword. The library has a special reading room that can only hold 8,192 books at a time (the context window). A librarian (the model) reads the books in order, but can only refer to books currently in the room. When a new patron arrives with a question, the librarian selects the most relevant books from the shelves and arranges them on the table. The patron then asks a question, and the librarian scans the books on the table (the context) to generate an answer, one book at a time. If the patron asks a follow-up, the librarian may need to bring in new books and discard old ones to stay within the room's capacity. Fine-tuning is like giving the librarian a specialized set of books on a specific topic to study beforehand, so they become an expert in that area. The librarian doesn't rewrite the books; they learn to reference and combine them more effectively. The library's total collection is the model's training data, and the arrangement on the table is the prompt. The librarian's ability to recall and synthesize is the model's inference capability.

How It Actually Works

What Are Tokens?

Tokens are the fundamental units of text that LLMs process. They are not always whole words; a token can be a word, a part of a word (subword), or even a single character, depending on the tokenizer used. For example, the sentence "I love Google Cloud" might be tokenized as ["I", " love", " Google", " Cloud"] or ["I", " lov", "e", " Go", "ogle", " Cloud"]. The specific tokenization scheme is determined by the model's tokenizer, which is a separate component trained on the training data. Most modern LLMs, including Google's PaLM and Gemini, use subword tokenization (e.g., Byte-Pair Encoding or a unigram model, both available in the SentencePiece library). This allows the model to handle out-of-vocabulary words by breaking them into known subword units.
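To make subword splitting concrete, here is a minimal sketch that splits the example sentence by greedy longest-match against a tiny hand-written vocabulary. The vocabulary and the toy_tokenize function are hypothetical and exist only for illustration; real tokenizers learn their vocabularies from data and use different algorithms.

```python
# Hypothetical toy vocabulary; real vocabularies are learned and far larger.
TOY_VOCAB = {"I", " love", " lov", "e", " Go", "ogle", " Cloud"}


def toy_tokenize(text, vocab):
    """Greedy longest-match split; unknown characters become their own token."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate substring first.
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in vocab:
                tokens.append(piece)
                i = j
                break
        else:
            # No vocabulary entry matched: fall back to a single character.
            tokens.append(text[i])
            i += 1
    return tokens


tokens = toy_tokenize("I love Google Cloud", TOY_VOCAB)
print(tokens)
```

Because " Google" is not in this toy vocabulary, the word is split into two pieces, which is how a subword tokenizer copes with words it has never stored whole.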

Why Tokens Matter for the Exam

The GCDL exam expects you to understand that token count directly affects cost and performance. Google Cloud's Vertex AI pricing for generative AI models is often per token (input and output), although some models have been billed per character, so check the current pricing page. For instance, the text-bison@002 model is cited at $0.0005 per 1,000 input tokens and $0.0005 per 1,000 output tokens (as of 2024). At a flat per-token rate, a prompt with 10,000 tokens costs 10 times as much as one with 1,000 tokens. Additionally, the model's context window is measured in tokens, not characters or words. A model with an 8,192-token context window can process roughly 6,000 words of English, but the exact word count depends on the language and tokenization.
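The arithmetic above can be sketched in a few lines. The prices are the illustrative figures quoted in the paragraph, not current pricing, and the 0.75 words-per-token ratio is only a rough rule of thumb for English; the function names are made up for this example.

```python
# Illustrative rates from the paragraph above; always check current pricing.
PRICE_PER_1K_INPUT = 0.0005
PRICE_PER_1K_OUTPUT = 0.0005
WORDS_PER_TOKEN = 0.75  # rough rule of thumb for English text


def estimate_cost(input_tokens, output_tokens):
    """Cost in dollars at a flat per-token rate."""
    return ((input_tokens / 1000) * PRICE_PER_1K_INPUT
            + (output_tokens / 1000) * PRICE_PER_1K_OUTPUT)


def approx_words(tokens):
    """Approximate English word count for a given token count."""
    return round(tokens * WORDS_PER_TOKEN)


small = estimate_cost(1_000, 0)
large = estimate_cost(10_000, 0)
print(f"1K-token prompt: ${small:.4f}, 10K-token prompt: ${large:.4f}")
print("Approximate words in 8,192 tokens:", approx_words(8192))
```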

How Tokenization Works Internally

The tokenizer first normalizes the input text (e.g., Unicode normalization; some tokenizers also lowercase). Then it applies a trained algorithm to split the text into tokens. SentencePiece, for example, treats the input as a sequence of Unicode characters and, in its unigram mode, uses a unigram language model to find the most likely segmentation. The tokenizer has a fixed vocabulary of, say, 32,000 tokens. Each token is mapped to a unique integer ID. The model then looks up an embedding vector for each ID and uses those vectors as its input. When the model generates text, it outputs token IDs, which the tokenizer decodes back into text. Because segmentation depends on the learned vocabulary, you cannot simply count words to estimate token usage.
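The round trip between tokens and integer IDs can be illustrated with a lookup table. This is a sketch with a hypothetical four-entry vocabulary; the IDs are arbitrary and do not correspond to any real model.

```python
# Hypothetical vocabulary; real ones hold tens of thousands of entries.
TOKEN_TO_ID = {"I": 0, " love": 1, " Google": 2, " Cloud": 3}
ID_TO_TOKEN = {token_id: token for token, token_id in TOKEN_TO_ID.items()}


def encode(tokens):
    """Map each token string to its integer ID."""
    return [TOKEN_TO_ID[token] for token in tokens]


def decode(ids):
    """Map IDs back to token strings and join them into text."""
    return "".join(ID_TO_TOKEN[token_id] for token_id in ids)


ids = encode(["I", " love", " Google", " Cloud"])
text = decode(ids)
print(ids, repr(text))
```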

Context Window: Definition and Constraints

The context window is the maximum number of tokens the model can consider at once when generating a response. It includes both the input prompt and the generated output. For example, Gemini 1.5 Pro has a context window of up to 1 million tokens (as of 2024). In contrast, earlier models like text-bison@002 have 8,192 tokens. The context window is a hard limit: if the input exceeds it, the model cannot process the entire input; it must be truncated or the request fails. The exam tests that you know the relative sizes: Gemini models have much larger context windows than PaLM 2 models.
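As a rough illustration of the hard limit, the sketch below checks which models could hold a given prompt plus its expected output. The window sizes are the ones quoted above; the dictionary keys are plain labels rather than API model identifiers, and models_that_fit is a hypothetical helper.

```python
# Context window sizes quoted in the text above (as of 2024).
CONTEXT_WINDOWS = {
    "Gemini 1.5 Pro": 1_000_000,
    "text-bison@002": 8_192,
}


def models_that_fit(prompt_tokens, output_tokens):
    """Return the models whose window holds the prompt plus the output."""
    needed = prompt_tokens + output_tokens
    return [name for name, window in CONTEXT_WINDOWS.items() if needed <= window]


print(models_that_fit(6_000, 1_000))
print(models_that_fit(8_000, 500))
```

The second call needs 8,500 tokens in total, so only the larger window qualifies.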

Fine-Tuning: Purpose and Process

Fine-tuning is the process of taking a pre-trained LLM and training it further on a smaller, task-specific dataset. The pre-trained model already has broad knowledge from training on trillions of tokens; fine-tuning adapts it to a specific domain or style. For example, a customer support chatbot might be fine-tuned on historical support tickets to respond with appropriate tone and accuracy. Fine-tuning updates the model's weights, creating a new model version. It is distinct from prompt engineering (which does not change weights) and from in-context learning (which uses examples within the prompt).

Fine-Tuning Methods

Full fine-tuning: All model weights are updated. Requires significant compute and memory. Works well for large datasets.

Parameter-Efficient Fine-Tuning (PEFT): Only a small number of parameters is trained, for example small adapter layers or low-rank matrices (as in LoRA) added alongside the frozen base weights. This is faster and cheaper. Google Cloud's Vertex AI supports both methods.

Instruction Tuning: A type of fine-tuning where the model is trained on (instruction, response) pairs to follow instructions better. Many modern models are instruction-tuned.
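The saving from PEFT comes down to simple counting. The sketch below assumes LoRA applied to a single weight matrix of shape d x k with rank r, which trains r * (d + k) values instead of d * k. The dimensions are hypothetical and not taken from any specific model.

```python
def full_params(d, k):
    """Trainable values when the whole d x k weight matrix is updated."""
    return d * k


def lora_params(d, k, r):
    """Trainable values for LoRA: two low-rank factors, d x r and r x k."""
    return r * (d + k)


d, k, r = 4096, 4096, 8  # hypothetical layer size and rank
full = full_params(d, k)
lora = lora_params(d, k, r)
print(f"full: {full:,}  lora: {lora:,}  fraction: {lora / full:.4%}")
```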

When to Fine-Tune vs. Prompt Engineering

Fine-tuning is appropriate when:

The task requires learning a new pattern not present in the base model's training (e.g., a proprietary codebase).

You have a large, high-quality labeled dataset (thousands of examples).

The model's behavior needs to be consistent and reliable for a specific use case.

Prompt engineering is better when:

You need quick iteration without training cost.

The task is simple and can be described with a few examples (few-shot learning).

You want to use the base model as-is for general tasks.

Interaction with Related Technologies

On Google Cloud, fine-tuned models can be deployed on Vertex AI Prediction or Vertex AI Endpoints. The fine-tuning process itself uses Vertex AI Training or custom training on Compute Engine. You can also use Model Garden to access and fine-tune foundation models. The exam may ask about the trade-offs between using a base model via API (no fine-tuning) versus deploying a fine-tuned model. The latter gives higher accuracy on domain tasks but incurs additional training and hosting costs.

Exam-Relevant Details

Token limits are often tested: e.g., Gemini 1.5 Pro = 1M tokens, Gemini 1.0 Pro = 32K, PaLM 2 = 8K.

Fine-tuning does not change the context window size; it only updates weights.

The cost of fine-tuning includes training compute and storage of the model artifact.

Vertex AI supports automatic hyperparameter tuning for fine-tuning jobs.

Fine-tuning requires a dataset in JSONL format with prompt-response pairs.

Common Pitfalls

Confusing context window with training data size: the context window is the input limit at inference, not the size of the training corpus.

Assuming fine-tuning increases context window: it does not; the architecture's maximum remains.

Thinking tokens equal words: the conversion is always approximate, never exact. A good rule of thumb is 1 token ≈ 0.75 words for English.

Summary of Key Mechanisms

1. Tokenization splits text into tokens via a trained subword tokenizer.

2. The tokenized input is fed into the model, which has a maximum context window.

3. If the prompt plus response exceed the context window, the model truncates or errors.

4. Fine-tuning adjusts model weights on a specific dataset to improve performance.

5. Fine-tuning does not alter the tokenizer or context window.

Walk-Through

1. Tokenization of Input Text

The input text is first normalized and then split into tokens by the tokenizer. For SentencePiece, the text is segmented into subword units using a unigram language model trained on the corpus. Each token is assigned a unique integer ID from a fixed vocabulary (e.g., 32,000 tokens). This step is deterministic for a given tokenizer. The output is a sequence of token IDs. For example, 'Hello world' might become [15496, 2159]. The tokenizer is part of the model and cannot be changed without retraining.

2. Input Embedding and Position Encoding

Each token ID is mapped to a dense vector (embedding) of fixed dimension (e.g., 768 for smaller models). These embeddings are learned during pre-training. Positional encodings are added to indicate the token's position in the sequence, because self-attention on its own has no notion of token order. The resulting vectors are the input to the first transformer layer.
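A minimal sketch of this step is shown below. It assumes the sinusoidal positional encoding from the original Transformer design, which is only one option (many models use learned or rotary position schemes instead), and a randomly initialized embedding table with tiny, hypothetical sizes standing in for learned embeddings.

```python
import math
import random


def sinusoidal_position(pos, dim):
    """Sinusoidal positional encoding for one position (an assumed scheme)."""
    vec = []
    for i in range(dim):
        angle = pos / (10000 ** ((2 * (i // 2)) / dim))
        vec.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return vec


def embed(token_ids, table):
    """Look up each token's embedding and add its positional encoding."""
    dim = len(table[0])
    out = []
    for pos, token_id in enumerate(token_ids):
        pe = sinusoidal_position(pos, dim)
        out.append([e + p for e, p in zip(table[token_id], pe)])
    return out


random.seed(0)
VOCAB_SIZE, DIM = 10, 8  # tiny, hypothetical sizes
table = [[random.uniform(-1, 1) for _ in range(DIM)] for _ in range(VOCAB_SIZE)]
vectors = embed([3, 3, 7], table)
print(len(vectors), len(vectors[0]))
```

The first two inputs are the same token ID, yet their vectors differ because each position contributes a different encoding.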

3. Context Window Check

Before processing, the serving layer checks whether the total number of tokens (input prompt + maximum allowed output) exceeds the context window. If it does, the input must be truncated or the request fails. For example, if the context window is 8,192 tokens and the input prompt is 8,000 tokens, the model can generate at most 192 tokens of output. Depending on the API and its configuration, an oversized input is either rejected or truncated from the beginning or the end.
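The check and the two truncation directions can be sketched as follows. The fit_prompt function and its keep parameter are hypothetical; a real API may reject the request instead of truncating.

```python
def fit_prompt(prompt_ids, context_window, max_output_tokens, keep="end"):
    """Trim a prompt so that prompt + reserved output fits the window."""
    budget = context_window - max_output_tokens
    if budget <= 0:
        raise ValueError("max_output_tokens leaves no room for the prompt")
    if len(prompt_ids) <= budget:
        return list(prompt_ids)
    if keep == "end":
        # Drop the oldest tokens and keep the most recent ones.
        return list(prompt_ids[-budget:])
    if keep == "start":
        # Keep the beginning and drop the tail.
        return list(prompt_ids[:budget])
    raise ValueError("keep must be 'start' or 'end'")


prompt = list(range(8000))                 # stand-in for 8,000 token IDs
unchanged = fit_prompt(prompt, 8192, 192)  # 8,000 + 192 fits exactly
tail = fit_prompt(prompt, 8192, 1192, keep="end")
head = fit_prompt(prompt, 8192, 1192, keep="start")
print(len(unchanged), len(tail), len(head))
```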

4. Forward Pass Through Transformer Layers

The embedded sequence passes through multiple transformer decoder layers (e.g., 32 layers). Each layer performs self-attention (allowing each token to attend to all previous tokens) and feed-forward neural network operations. The self-attention mechanism has O(n^2) complexity in the sequence length n, which is why long contexts are computationally expensive. The output of the final layer is projected onto the vocabulary and passed through a softmax, giving a probability distribution over the next token.
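To show where the quadratic cost comes from, here is a simplified single-head causal self-attention in plain Python. It sets queries, keys, and values all equal to the input (no learned projections), which is a deliberate simplification and not how a production model is built.

```python
import math


def causal_self_attention(x):
    """Single-head attention with Q = K = V = x; returns outputs and score count."""
    n, d = len(x), len(x[0])
    outputs = []
    score_count = 0
    for i in range(n):
        # Token i may attend only to positions 0..i (causal masking).
        scores = [
            sum(a * b for a, b in zip(x[i], x[j])) / math.sqrt(d)
            for j in range(i + 1)
        ]
        score_count += len(scores)
        # Numerically stable softmax over the scores.
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        outputs.append(
            [sum(w * x[j][c] for j, w in enumerate(weights)) for c in range(d)]
        )
    return outputs, score_count


x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, count = causal_self_attention(x)
print(count)
```

For n tokens the loop computes n(n+1)/2 scores, so doubling the sequence length roughly quadruples the work.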

5. Autoregressive Token Generation

The model generates one token at a time. It selects the next token based on the probability distribution (using sampling or greedy decoding). The generated token is appended to the input sequence, and the process repeats. Each step re-encodes the entire sequence (though optimizations like KV caching reuse previous computations). Generation stops when an end-of-sequence token is produced or a maximum length is reached.
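The generation loop can be sketched with a toy stand-in for the model. Here the "model" is a hypothetical lookup table that conditions only on the last token, whereas a real LLM conditions on the whole sequence; the loop structure (pick a token, append, repeat, stop on end-of-sequence or a length limit) is the part being illustrated.

```python
EOS = "<eos>"

# Hypothetical next-token probabilities keyed by the last token only.
TOY_MODEL = {
    "<bos>": {"Vertex": 0.6, "Cloud": 0.4},
    "Vertex": {"AI": 0.9, EOS: 0.1},
    "AI": {EOS: 0.7, "AI": 0.3},
    "Cloud": {EOS: 1.0},
}


def generate(prompt, max_new_tokens):
    """Greedy decoding: always take the most probable next token."""
    sequence = list(prompt)
    for _ in range(max_new_tokens):
        probs = TOY_MODEL[sequence[-1]]
        next_token = max(probs, key=probs.get)
        if next_token == EOS:
            break  # end-of-sequence token produced
        sequence.append(next_token)
    return sequence


print(generate(["<bos>"], 10))
print(generate(["<bos>"], 1))
```

Replacing the max call with random sampling from probs would give the sampling variant mentioned above.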

6. Fine-Tuning: Dataset Preparation

For fine-tuning, you prepare a dataset of (prompt, response) pairs in JSONL format, where each line is a JSON object with 'input_text' and 'output_text' fields. The dataset must be representative of the target task. Google Cloud recommends at least 500-1000 examples. The data is then split into training and evaluation sets. The fine-tuning job uses this data to update the model weights.
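A dataset in the shape described above could be prepared roughly like this. The example rows are placeholders, and split_examples and to_jsonl are hypothetical helpers; the field names follow the 'input_text' and 'output_text' convention named in the text.

```python
import json
import random

# Placeholder rows; a real dataset would hold genuine prompt-response pairs.
examples = [
    {"input_text": f"Summarize ticket {i}", "output_text": f"Summary {i}"}
    for i in range(10)
]


def split_examples(rows, eval_fraction=0.2, seed=42):
    """Shuffle deterministically, then split into training and evaluation sets."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_eval = max(1, int(len(rows) * eval_fraction))
    return rows[n_eval:], rows[:n_eval]


def to_jsonl(rows):
    """Serialize rows as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(row) for row in rows)


train, evaluation = split_examples(examples)
train_jsonl = to_jsonl(train)
print(train_jsonl.splitlines()[0])
```

Shuffling before the split avoids an evaluation set made up only of the most recent or most similar examples.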

7. Fine-Tuning: Training Loop

The pre-trained model is loaded, and a training loop runs for several epochs (e.g., 2-5). During each epoch, the model processes batches of examples, computes the loss (e.g., cross-entropy between predicted and actual tokens), and updates weights via backpropagation. Hyperparameters like learning rate, batch size, and number of epochs are configured. Vertex AI supports automatic hyperparameter tuning. After training, the model is evaluated on the held-out set to check for overfitting.
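The loss mentioned above can be computed by hand for a small case. This sketch assumes the model has already produced a probability distribution for each target position; the distributions and the three-token vocabulary are made up for illustration.

```python
import math


def cross_entropy(predicted_probs, target_ids):
    """Mean negative log-probability assigned to the correct tokens."""
    losses = [-math.log(probs[t]) for probs, t in zip(predicted_probs, target_ids)]
    return sum(losses) / len(losses)


# Hypothetical distributions over a three-token vocabulary.
targets = [0, 1]
confident = [[0.9, 0.05, 0.05], [0.1, 0.8, 0.1]]
uncertain = [[0.4, 0.3, 0.3], [0.3, 0.4, 0.3]]

loss_confident = cross_entropy(confident, targets)
loss_uncertain = cross_entropy(uncertain, targets)
print(f"{loss_confident:.3f} {loss_uncertain:.3f}")
```

Training lowers this value by shifting probability toward the correct tokens, which is what the weight updates in each batch aim to do.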

8. Deployment of Fine-Tuned Model

The fine-tuned model is saved as a Vertex AI Model resource. It can be deployed to a Vertex AI Endpoint for online predictions or used for batch predictions. The endpoint auto-scales based on traffic. The cost includes the compute resources for the endpoint (e.g., number of replicas, machine type). The fine-tuned model has the same context window as the base model but may have different behavior.

What This Looks Like on the Job

Enterprise Scenario 1: Legal Document Summarization

A law firm needs to summarize thousands of legal contracts daily. They use a base LLM via Vertex AI, but the generic model produces summaries that miss key legal clauses. They fine-tune the model on a dataset of 5,000 contract-summary pairs annotated by their legal team. The fine-tuned model now correctly identifies indemnification clauses and termination terms. The context window of 8,192 tokens is sufficient for most contracts (average 4,000 tokens). However, they must truncate very long contracts, which occasionally omits important details. They mitigate this by pre-processing and splitting contracts into sections. The fine-tuning cost was $500 for training on a single GPU, and the endpoint costs $0.10 per hour for a single replica. Misconfiguration: initially they did not shuffle the training data, causing the model to overfit to the order. After shuffling, accuracy improved by 15%.
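The splitting step in this scenario might look like the sketch below, which cuts a long token sequence into fixed-size chunks with an optional overlap. The function name, the chunk size, and the overlap are assumptions for illustration; the scenario itself splits on contract sections, which would need document-aware logic.

```python
def split_into_chunks(token_ids, max_tokens, overlap=0):
    """Split a token sequence into chunks of at most max_tokens tokens."""
    if max_tokens <= 0 or not 0 <= overlap < max_tokens:
        raise ValueError("need max_tokens > 0 and 0 <= overlap < max_tokens")
    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(token_ids), step):
        chunks.append(token_ids[start:start + max_tokens])
        if start + max_tokens >= len(token_ids):
            break  # the last chunk already reaches the end
    return chunks


contract = list(range(10_000))  # stand-in for a 10,000-token contract
plain = split_into_chunks(contract, 4_000)
overlapping = split_into_chunks(contract, 4_000, overlap=500)
print([len(c) for c in plain], [len(c) for c in overlapping])
```

A small overlap keeps a clause that straddles a boundary visible in both neighbouring chunks, at the cost of processing some tokens twice.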

Scenario 2: Customer Support Chatbot

A large e-commerce company deploys a chatbot to handle returns and refunds. They use a pre-trained model with prompt engineering (few-shot examples) but find it inconsistent. They fine-tune on 10,000 support ticket responses. The fine-tuned model now uses the company's tone and correctly follows refund policies. The context window of 32K tokens (Gemini 1.0 Pro) allows including the entire customer conversation history. However, they notice that fine-tuning caused the model to become overly specific and lose general knowledge (catastrophic forgetting). They mitigate by mixing 10% of general data during fine-tuning. The endpoint handles 1000 requests per minute with 4 replicas. Common issue: if the fine-tuning dataset contains biased responses, the model amplifies those biases.

Scenario 3: Code Generation for Internal Tools

A software company fine-tunes a code generation model on their internal codebase to help developers write functions using proprietary libraries. They use PEFT (LoRA) to reduce training cost. The fine-tuned model can generate code with the correct API calls 90% of the time, compared to 40% with the base model. The context window of 8K tokens is limiting for large code files; they must split functions. They deploy the model on a Vertex AI endpoint with a single GPU. Misconfiguration: they used a learning rate that was too high, causing the model to produce gibberish. After reducing the learning rate by 10x, training stabilized.

How GCDL Actually Tests This

The GCDL exam tests your understanding of LLM concepts primarily under Objective 3.3 (Data Analytics and AI) and related sub-objectives. Expect 2-4 questions that directly ask about tokens, context windows, or fine-tuning. The questions are often scenario-based: e.g., 'A company wants to analyze long legal documents. Which model capability is most important?' Answer: Large context window. Wrong answers often focus on fine-tuning or tokenization speed.

Common Wrong Answers and Traps:

1. 'Fine-tuning increases the context window.' This is false; the context window is a fixed architectural parameter. Candidates confuse fine-tuning with model upgrades. Exam trick: they might say 'fine-tuning allows the model to process longer inputs.' The correct answer is 'no, fine-tuning adjusts weights but does not change the maximum token limit.'

2. 'Tokens are the same as words.' Candidates often choose options that equate tokens to words, leading to miscalculations. The exam may give a word count and ask for token count estimation; the correct approach is to know that 1 token ≈ 0.75 words for English, but the exact ratio varies. They might ask 'Which is closest to 100 tokens?' and the answer is 'about 75 words.'

3. 'Prompt engineering is a form of fine-tuning.' This is a common misunderstanding. Prompt engineering does not change model weights; it only modifies the input. Exam questions may ask 'Which technique requires updating model parameters?' The correct answer is fine-tuning, not prompt engineering.

4. 'The context window includes only the input prompt.' Actually, it includes both input and generated output. A question might say 'A model with 8K context window receives a 7K token prompt. How many tokens can it generate?' The answer is at most 1K tokens (8K - 7K). Candidates often answer 8K, forgetting the output counts.

Specific Numbers and Terms to Memorize:

Gemini 1.5 Pro: 1 million token context window.

Gemini 1.0 Pro: 32K tokens.

PaLM 2 (text-bison@002): 8,192 tokens.

Fine-tuning dataset format: JSONL with 'input_text' and 'output_text'.

Vertex AI fine-tuning supports PEFT (LoRA) and full fine-tuning.

Tokenization algorithm: SentencePiece for many Google models.

Edge Cases:

If the prompt plus max output exceeds the context window, the API truncates the prompt (usually from the beginning). Candidates may think the model automatically extends the window.

Fine-tuning on a small dataset (<100 examples) often leads to overfitting; the exam may ask about best practices (e.g., use at least 500 examples).

The cost of fine-tuning includes both training and hosting; a question may ask 'Which cost is not associated with fine-tuning?' Answer: Data transfer (if data is in the same region).

How to Eliminate Wrong Answers: Use the mechanism: if the question asks about changing model behavior without retraining, eliminate fine-tuning. If the question asks about handling long documents, eliminate options that suggest tokenization speed (like 'faster tokenizer') — the key is context window size. For cost questions, remember that token count drives API cost, not word count.

Key Takeaways

Tokens are subword units; 1 token ≈ 0.75 words for English, but exact ratio varies.

Context window is the maximum total tokens (input + output) the model can process; Gemini 1.5 Pro has 1M tokens, Gemini 1.0 Pro has 32K, PaLM 2 has 8K.

Fine-tuning updates model weights on a task-specific dataset; it does not change the context window or tokenizer.

Fine-tuning requires a dataset in JSONL format with 'input_text' and 'output_text' fields; Vertex AI recommends at least 500 examples.

Prompt engineering does not change model weights; it uses in-context learning via examples in the prompt.

Full fine-tuning updates all parameters; PEFT (e.g., LoRA) trains only a small set, reducing cost and the risk of overfitting.

The cost of using an LLM via API is based on token count (input + output), not word count.

When the input exceeds the context window, the prompt is truncated (usually from the beginning) or the request fails.

Fine-tuning can cause catastrophic forgetting; mixing general data during training mitigates this.

Vertex AI supports both full fine-tuning and PEFT for foundation models like PaLM and Gemini.

Easy to Mix Up

These come up on the exam all the time. Here's how to tell them apart.

Full Fine-Tuning

Updates all model weights (billions of parameters)

Requires significant compute (multiple GPUs/TPUs)

Higher risk of overfitting on small datasets

Produces a completely new model artifact (several GB)

Best for large datasets (10k+ examples) and when maximum accuracy is needed

Parameter-Efficient Fine-Tuning (PEFT)

Updates only a small fraction of parameters (e.g., adapter layers)

Requires much less compute (can run on a single GPU)

Lower risk of overfitting; preserves base model knowledge

Produces a small adapter file (few MB) that is combined with base model

Ideal for small to medium datasets (500-10k examples) and when training cost is a concern

Watch Out for These

Mistake

Fine-tuning increases the model's context window.

Correct

Fine-tuning does not change the architecture; the context window remains the same as the base model. The context window is a fixed maximum number of tokens the model can process at once, determined by the model's design. To handle longer sequences, you must use a model with a larger native context window.

Mistake

Tokens are exactly equivalent to words.

Correct

Tokens are subword units; one word can be split into multiple tokens (e.g., 'unbelievable' might become ['un', 'believ', 'able']). The token-to-word ratio varies by language and model. For English, a rough estimate is 1 token ≈ 0.75 words, but this is not exact.

Mistake

Prompt engineering and fine-tuning are the same thing.

Correct

Prompt engineering modifies the input text to elicit desired behavior without changing the model. Fine-tuning updates the model's weights through additional training. Prompt engineering is faster and cheaper but less powerful for specialized tasks. Fine-tuning is more permanent and requires a dataset.

Mistake

The context window only counts the input prompt, not the output.

Correct

The context window includes both the input tokens and the generated output tokens. If the input is 7,000 tokens and the context window is 8,000, the model can only generate up to 1,000 tokens of output before hitting the limit.

Mistake

Fine-tuning is always better than using a base model with good prompts.

Correct

Fine-tuning is resource-intensive and may cause overfitting or catastrophic forgetting. For many tasks, prompt engineering with few-shot examples achieves comparable results at lower cost. Fine-tuning is only recommended when you have a large, high-quality dataset and the base model's behavior is insufficient.


Frequently Asked Questions

What is the difference between tokens and words in LLMs?

Tokens are the smallest units that an LLM processes, often subwords. For example, the word 'unhappiness' might be tokenized as ['un', 'happiness'] or ['un', 'happi', 'ness']. The tokenizer is trained on a large corpus to find an optimal segmentation. A rough rule of thumb is that 1 token ≈ 0.75 words for English text, but this varies. On the GCDL exam, remember that token count is what matters for cost and context limits, not word count.

Can I increase the context window of a pre-trained model?

No, the context window is fixed by the model architecture. You cannot increase it after training. To handle longer sequences, you must use a model with a larger native context window, such as Gemini 1.5 Pro (1M tokens). Fine-tuning does not change the context window. Some techniques like sliding window attention exist but are not standard in off-the-shelf models.

How many examples do I need for fine-tuning on Vertex AI?

Google Cloud recommends at least 500 examples for fine-tuning, but the optimal number depends on the task complexity. For simple tasks, 500-1000 examples may suffice. For complex tasks, you may need thousands. Using fewer than 100 examples often leads to poor results. Vertex AI also supports PEFT, which can work with smaller datasets (e.g., 100-500 examples) but still benefits from more data.

What is the cost of fine-tuning a model on Vertex AI?

The cost includes training compute (based on machine type and training time) and storage for the model artifact. For example, fine-tuning a PaLM 2 model on a small dataset (500 examples) might cost around $50-100, while a large dataset (10k examples) could cost $500-1000. Additionally, deploying the fine-tuned model incurs ongoing endpoint costs (e.g., $0.10-1.00 per hour per replica).

Does fine-tuning improve the model's ability to follow instructions?

It can, but only if the fine-tuning dataset includes instruction-response pairs. This is called instruction tuning. Many base models are already instruction-tuned (e.g., text-bison@002). Fine-tuning on domain-specific instructions can further improve compliance, but it may also cause the model to overfit to the style of the training data.

What happens if my prompt exceeds the context window?

The model cannot process the entire input. Most APIs will truncate the prompt, typically from the beginning (i.e., the oldest tokens are dropped). Some APIs allow configuration of truncation direction. If the truncated prompt loses critical information, the response quality degrades. The best practice is to stay within the context window or use a model with a larger window.

Can I fine-tune a model on Google Cloud without using Vertex AI?

Yes, you can run custom training on Compute Engine (or on the legacy AI Platform, which Vertex AI has superseded), but Vertex AI provides managed services that simplify the process. Vertex AI supports fine-tuning of foundation models through Model Garden. For custom models, you can use Vertex AI Training with your own training code. The exam focuses on Vertex AI as the recommended service.
