AI-900Chapter 76 of 100Objective 5.1

Foundation Models and Fine-Tuning

This chapter covers foundation models and fine-tuning, two pivotal concepts in generative AI that the AI-900 exam tests under objective 5.1 (Generative AI). Foundation models are large pre-trained models that serve as versatile starting points, while fine-tuning adapts them to specific tasks. These topics appear in roughly 15-20% of AI-900 exam questions, often asking you to distinguish between foundation models, fine-tuning, and other techniques like RAG or prompt engineering. Mastery of these concepts is essential for understanding how Azure OpenAI Service and other generative AI services deliver customized solutions.

25 min read
Intermediate
Updated May 31, 2026

Foundation Models as Master Chefs and Fine-Tuning as Signature Dishes

Imagine a master chef who has trained for years on every cuisine—Italian, Japanese, French, Indian—learning all the fundamental techniques: knife skills, sauce making, fermentation, pastry. This chef can produce a passable version of almost any dish. This is a foundation model: a massive generalist trained on diverse data across many domains. Now, you want the chef to specialize in your restaurant’s signature dish, say a unique spicy ramen. Instead of retraining the chef from scratch, you take the chef and give them focused practice: you have them cook that ramen 100 times, adjusting the broth, noodle texture, and spice blend based on customer feedback. The chef retains all their general skills but now excels at that specific dish. This is fine-tuning: taking a pre-trained foundation model and further training it on a smaller, task-specific dataset to adapt its behavior. The chef doesn’t forget how to make sushi or pasta; they just become exceptionally good at ramen. Similarly, a fine-tuned model retains its general language understanding but optimizes for a particular task like sentiment analysis or legal document summarization. The key mechanistic parallel: fine-tuning updates the model’s weights (the chef’s neural pathways) using backpropagation on the new dataset, but with a much smaller learning rate to prevent catastrophic forgetting of the original training. The chef analogy breaks down if you think the chef forgets everything else—they don’t, just as fine-tuning preserves the foundation model’s broad capabilities.

How It Actually Works

What Are Foundation Models?

Foundation models are large-scale machine learning models trained on vast amounts of diverse data using self-supervised or semi-supervised learning. The term was popularized by the Stanford Institute for Human-Centered AI (HAI) in 2021. These models are typically based on transformer architectures and contain billions of parameters. Examples include GPT-4 (OpenAI), BERT (Google), Llama (Meta), and DALL-E (OpenAI for images). The key property: they are generalists. A single foundation model can perform multiple tasks—translation, summarization, question answering, code generation—without task-specific training, because the training data covers many domains.

Why Foundation Models Exist

Before foundation models, AI development required training a separate model from scratch for each task. This was expensive, time-consuming, and required large labeled datasets. Foundation models solve this by providing a single pre-trained model that can be adapted to many downstream tasks. This transfer learning approach drastically reduces the data and compute needed for new tasks. For example, training GPT-3 from scratch cost an estimated $4.6 million; fine-tuning it for a specific task costs a fraction of that.

How Foundation Models Work Internally

Foundation models are trained using a transformer architecture. The training process involves: - Pre-training: The model is trained on a large corpus of text (or images) using objectives like masked language modeling (BERT) or autoregressive next-token prediction (GPT). The model learns statistical patterns, grammar, facts, reasoning abilities, and even some world knowledge. - Parameters: During training, the model adjusts its weights (parameters) to minimize prediction error. For example, GPT-3 has 175 billion parameters. These parameters encode the learned patterns. - Context window: The model can attend to a fixed number of previous tokens (e.g., GPT-4 Turbo has a 128K token context window). This determines how much context the model can consider when generating output. - Inference: When you input a prompt, the model processes it through its layers, generating a probability distribution over the vocabulary for the next token, then samples or selects the most likely token. This repeats autoregressively.

What Is Fine-Tuning?

Fine-tuning is the process of taking a pre-trained foundation model and further training it on a smaller, task-specific dataset. This adapts the model's behavior to perform better on that particular task. For example, you might fine-tune GPT-4 on a dataset of customer support conversations to create a chatbot that responds in your company's tone and follows specific policies.

How Fine-Tuning Works Internally

Fine-tuning typically follows these steps: 1. Prepare a dataset: Collect a labeled dataset of input-output pairs for the target task. For text generation, this might be prompt-response pairs. The dataset should be representative and large enough (hundreds to thousands of examples). 2. Initialize with pre-trained weights: Start with the foundation model's weights. This is crucial—you are not training from scratch. 3. Set a low learning rate: The learning rate is typically 1e-5 to 5e-5, much lower than the pre-training learning rate (which might be 1e-4 or higher). This prevents the model from forgetting its general knowledge (catastrophic forgetting). 4. Train on the new dataset: Run standard supervised learning: for each batch, compute the loss (e.g., cross-entropy between predicted tokens and target tokens), then backpropagate to update weights. The model learns to map inputs to desired outputs. 5. Evaluate and iterate: Monitor performance on a validation set. Fine-tuning may require only a few epochs (1-5) because the model already has strong priors.

Key Components and Defaults

Learning rate: Typical fine-tuning learning rate: 2e-5 for GPT models. Too high causes forgetting; too low makes no progress.

Batch size: Common values: 8, 16, 32. Larger batch sizes stabilize training but require more memory.

Epochs: Usually 1-5. More epochs risk overfitting if dataset is small.

Weight decay: Often 0.1 to prevent overfitting.

Warmup steps: Linear warmup over the first 10-20% of training steps to avoid destabilizing the pre-trained weights.

Optimizer: AdamW is standard.

Configuration in Azure OpenAI Service

Azure OpenAI Service supports fine-tuning for models like GPT-3.5-Turbo, GPT-4 (limited), and others. The process is: 1. Prepare training data: JSONL format with each line containing a conversation (messages array) with role (system, user, assistant) and content. 2. Upload data: Use Azure OpenAI Studio or the REST API. 3. Create a fine-tuning job: Specify the base model, training file, validation file (optional), and hyperparameters (learning_rate_multiplier, n_epochs, batch_size). 4. Monitor: Track training loss and validation loss via the portal. 5. Deploy: Once fine-tuned, deploy the model as a custom endpoint.

Example CLI command using Azure CLI:

az cognitiveservices account list --query "[?contains(name, 'myopenai')].{Name:name, ResourceGroup:resourceGroup}" -o table

But for fine-tuning, you typically use the REST API or Python SDK:

from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://your-resource.openai.azure.com",
    api_key="your-api-key",
    api_version="2024-02-01"
)

client.fine_tuning.jobs.create(
    training_file="file-123abc",
    model="gpt-35-turbo",  # base model
    hyperparameters={
        "n_epochs": 3,
        "batch_size": 16,
        "learning_rate_multiplier": 0.1
    }
)

Interaction with Related Technologies

Fine-tuning is often compared to: - Prompt Engineering: No training; you craft prompts to guide the model. Cheaper but less powerful. - RAG (Retrieval Augmented Generation): Combines retrieval from a knowledge base with generation. No model retraining; adapts at inference time. - Transfer Learning: Fine-tuning is a form of transfer learning where the pre-trained model is the source. - Catastrophic Forgetting: If fine-tuning uses too high a learning rate or too many epochs, the model may overfit to the new task and lose general capabilities.

Trap Patterns on the Exam

Confusing fine-tuning with prompt engineering: Prompt engineering does not update model weights; fine-tuning does.

Thinking fine-tuning requires millions of examples: In reality, hundreds or thousands often suffice due to transfer learning.

Assuming fine-tuning is the only way to customize: RAG and prompt engineering are alternatives.

Believing foundation models are task-specific: They are general; fine-tuning makes them specialized.

Walk-Through

1

Select Base Foundation Model

Choose a pre-trained model that aligns with your task. For text generation, options include GPT-3.5-Turbo, GPT-4, or Llama. Consider factors like model size (parameters), context window, cost, and performance on benchmarks. In Azure OpenAI, you select from available base models in the region. The choice affects fine-tuning results: larger models may capture more nuance but cost more to fine-tune and deploy.

2

Prepare and Format Training Data

Collect a dataset of input-output pairs. For chat models, format as conversations with system, user, and assistant messages. Each example must be a valid JSON object. The dataset should be diverse and representative. For classification, use prompt-completion pairs. Ensure data quality: remove duplicates, correct errors, and balance classes. Typical dataset size: 500-10,000 examples. Azure OpenAI requires JSONL format with UTF-8 encoding.

3

Upload Dataset to Azure OpenAI

Upload the training file (and optional validation file) to Azure OpenAI using the API or Azure OpenAI Studio. The service stores the file and returns a file ID. Use the `files.create` endpoint. The file must be under 1 GB. Example: `curl -X POST https://your-resource.openai.azure.com/openai/files -H 'api-key: YOUR_API_KEY' -F 'purpose=fine-tune' -F 'file=@training_data.jsonl'`

4

Configure Hyperparameters and Submit Job

Set hyperparameters: n_epochs (1-5), batch_size (1-32), learning_rate_multiplier (0.1-1.0). Lower learning rate multiplier reduces the risk of catastrophic forgetting. Submit the fine-tuning job via the API or CLI. Azure OpenAI schedules the job on managed compute. Monitor job status: pending, running, succeeded, failed. Use `client.fine_tuning.jobs.create()` with the training file ID and base model name.

5

Evaluate and Deploy Fine-Tuned Model

After training completes, evaluate the model on a held-out validation set. Check loss curves for overfitting. If performance is satisfactory, deploy the model as a custom endpoint via Azure OpenAI Studio or API. The endpoint URL includes the fine-tuned model name. You can then call the endpoint with standard chat completions API. Monitor usage and cost; fine-tuned models incur higher per-token costs than base models.

What This Looks Like on the Job

Enterprise Scenario 1: Customer Support Chatbot

A large e-commerce company wants to deploy a chatbot that handles returns, order status, and product recommendations in a friendly, brand-consistent tone. They have thousands of historical support transcripts. They fine-tune GPT-3.5-Turbo on 5,000 curated conversations. The fine-tuned model learns specific policies (e.g., 'Return window is 30 days') and tone (empathetic, concise). In production, the chatbot reduces human escalation by 40%. Configuration: 3 epochs, batch size 8, learning rate multiplier 0.2. Common pitfall: using too many epochs (10) caused the model to memorize specific responses and hallucinate incorrect policies. The solution was to reduce epochs and add a validation set.

Enterprise Scenario 2: Legal Document Summarization

A law firm needs to summarize lengthy contracts into bullet points. They fine-tune a base model (e.g., Llama 2) on 2,000 contract-summary pairs. Fine-tuning improves ROUGE scores by 15% compared to prompt engineering. They deploy on Azure ML with managed endpoints. Performance consideration: fine-tuned model has higher latency (2-3 seconds vs 1 second for base model) due to larger parameter updates. They mitigate by using a smaller base model (7B parameter Llama) and quantization. Misconfiguration: using a learning rate multiplier of 1.0 caused the model to forget legal terminology and generate generic summaries. Adjusted to 0.1.

Scenario 3: Code Generation for Internal Tools

A software company fine-tunes Codex (a code generation model) on their internal API documentation and codebase. The fine-tuned model generates code snippets that correctly use internal function names and follow company style guides. They use Azure OpenAI Service with fine-tuning. Scale: 10,000 examples of code-comment pairs. They monitor for drift: after 6 months, they retrain with new examples. Common issue: the model overfits to deprecated APIs if the dataset is not updated. They implement a data freshness pipeline to include only recent code.

How AI-900 Actually Tests This

AI-900 Objective 5.1: Generative AI

The exam tests your ability to:

Identify the characteristics of foundation models (trained on broad data, general-purpose, can be adapted).

Understand fine-tuning as a technique to adapt a pre-trained model to a specific task using a smaller dataset.

Differentiate fine-tuning from prompt engineering and RAG.

Recognize scenarios where fine-tuning is appropriate (e.g., custom tone, domain-specific knowledge, consistent output format).

Know that fine-tuning updates model weights, while prompt engineering does not.

Common Wrong Answers and Why Candidates Choose Them

1.

'Fine-tuning requires training from scratch' – Candidates confuse fine-tuning with pre-training. Reality: fine-tuning starts from a pre-trained model. The exam may present a scenario where a company uses a pre-trained model and then trains it further; that is fine-tuning, not training from scratch.

2.

'Fine-tuning is the same as prompt engineering' – Both customize output, but prompt engineering does not change the model. Candidates pick this because both involve providing examples (few-shot prompt). The key difference: fine-tuning changes weights; prompt engineering does not.

3.

'Foundation models are task-specific' – The name 'foundation' implies general, but candidates may think each model is built for one task. The exam will ask: 'Which is a characteristic of a foundation model?' Answer: trained on diverse data.

4.

'Fine-tuning requires millions of examples' – Candidates overestimate data needs. The exam may state a scenario with 1,000 examples and ask if fine-tuning is feasible; answer is yes.

Specific Numbers and Terms on the Exam

Parameters: GPT-3 has 175 billion parameters; GPT-4 has more (exact not public). BERT has 340 million. The exam may ask: 'Which model has the most parameters?'

Context window: GPT-4 Turbo: 128K tokens; GPT-3.5: 4K or 16K.

Learning rate multiplier: Common default 0.1-0.2.

Epochs: Typically 2-4.

Dataset size: Hundreds to thousands of examples.

Edge Cases and Exceptions

Catastrophic forgetting: If fine-tuning uses too high a learning rate or too many epochs, the model may lose general capabilities. The exam might ask: 'What is a risk of fine-tuning?'

Fine-tuning is not always the best choice: For tasks requiring up-to-date information, RAG may be better. The exam may present a scenario where a company needs to answer questions about current events and ask which approach to use (RAG).

Fine-tuning can be done on image models: DALL-E can be fine-tuned for specific art styles.

How to Eliminate Wrong Answers

If the question mentions 'no training data' or 'no model retraining', eliminate fine-tuning.

If the question describes 'providing examples in the prompt', it is prompt engineering, not fine-tuning.

If the question says 'model learns from new data and updates its weights', it is fine-tuning.

If the question mentions 'combining retrieval with generation', it is RAG.

Key Takeaways

Foundation models are large, pre-trained models trained on diverse data for general-purpose use.

Fine-tuning adapts a foundation model to a specific task by further training on a smaller dataset.

Fine-tuning updates model weights; prompt engineering does not.

Fine-tuning typically uses a low learning rate (e.g., 2e-5) to avoid catastrophic forgetting.

Common fine-tuning hyperparameters: n_epochs (1-5), batch_size (8-32), learning_rate_multiplier (0.1-0.2).

Azure OpenAI Service supports fine-tuning for models like GPT-3.5-Turbo and GPT-4.

Fine-tuning is not always the best choice; consider RAG for dynamic knowledge or prompt engineering for simple tasks.

The AI-900 exam may ask to identify fine-tuning scenarios: custom tone, domain-specific language, consistent output format.

Easy to Mix Up

These come up on the exam all the time. Here's how to tell them apart.

Fine-Tuning

Updates model weights via additional training on task-specific data.

Requires a dataset of at least hundreds of examples.

Higher cost due to training compute and model hosting.

Produces a custom model that consistently follows the training data pattern.

Risk of catastrophic forgetting if hyperparameters are not tuned.

Prompt Engineering

No model updates; only modifies the input prompt.

Can work with zero or few examples (few-shot prompting).

Lower cost; only inference compute needed.

May be inconsistent; model can deviate from desired behavior.

No risk of forgetting; base model remains unchanged.

Watch Out for These

Mistake

Fine-tuning trains a model from scratch.

Correct

Fine-tuning starts with a pre-trained model and continues training on a smaller dataset. The initial weights come from the foundation model, not random initialization.

Mistake

You need millions of examples to fine-tune effectively.

Correct

Due to transfer learning, fine-tuning often works well with hundreds to a few thousand examples. The foundation model already understands language; fine-tuning adapts it to the specific task.

Mistake

Fine-tuning and prompt engineering are the same thing.

Correct

Prompt engineering does not change the model weights; it crafts the input to guide the model. Fine-tuning updates the model's parameters through additional training on task-specific data.

Mistake

Foundation models are designed for a single task.

Correct

Foundation models are trained on diverse data and can perform many tasks. They are general-purpose; fine-tuning or prompting makes them task-specific.

Mistake

Fine-tuning always improves performance over prompt engineering.

Correct

Fine-tuning can be more effective for tasks requiring consistent output or domain knowledge, but it is more expensive and may overfit. Prompt engineering is cheaper and can be sufficient for simple tasks.

Do You Actually Know This?

Reveal each answer, then mark whether you got it right. Score 60%+ to unlock the next chapter.

Frequently Asked Questions

What is the difference between a foundation model and a fine-tuned model?

A foundation model is a large, general-purpose model trained on broad data. A fine-tuned model is a foundation model that has been further trained on a specific dataset to specialize in a particular task. For example, GPT-4 is a foundation model; if you train it on legal documents, it becomes a fine-tuned legal model. The fine-tuned model retains general knowledge but performs better on the target task.

How many examples do I need to fine-tune a model?

Typically, hundreds to a few thousand examples are sufficient. For simple tasks, 500 examples may work; for complex tasks, 5,000 or more. The exact number depends on the base model's capabilities and the task's novelty. The exam expects you to know that fine-tuning does not require millions of examples.

Can I fine-tune a model without coding?

Azure OpenAI Studio provides a no-code interface for fine-tuning. You can upload data and configure hyperparameters through the portal. However, using the API or SDK gives more control. The exam does not require coding, but you should know the steps involved.

What is catastrophic forgetting in fine-tuning?

Catastrophic forgetting occurs when fine-tuning overwrites the model's general knowledge by using too high a learning rate or too many epochs. The model becomes overly specialized and loses its ability to handle other tasks. To prevent it, use a low learning rate, few epochs, and include a validation set.

How does fine-tuning differ from RAG?

Fine-tuning changes the model's weights to incorporate new knowledge. RAG (Retrieval Augmented Generation) retrieves relevant documents from a knowledge base at inference time and includes them in the prompt; the model weights remain unchanged. RAG is better for dynamic data; fine-tuning is better for static, consistent behavior.

What models can be fine-tuned on Azure OpenAI?

As of 2024, Azure OpenAI supports fine-tuning for GPT-3.5-Turbo, GPT-4 (limited preview), and certain other models. Always check the latest documentation. The exam may mention that GPT-4 fine-tuning is in preview.

Is fine-tuning available for image generation models?

Yes, DALL-E 3 can be fine-tuned for specific styles or subjects. Azure OpenAI Service may offer this. The exam focuses on text models, but you should know that fine-tuning applies to other modalities.

Terms Worth Knowing

Ready to put this to the test?

You've just covered Foundation Models and Fine-Tuning — now see how well it sticks with free AI-900 practice questions. Full explanations included, no account needed.

Done with this chapter?