AI-900Chapter 21 of 100Objective 5.1

Large Language Models (LLMs)

This chapter covers Large Language Models (LLMs), a cornerstone of generative AI and a key topic for the AI-900 exam. You'll learn what LLMs are, how they work internally, and how they are used in Azure AI services. Approximately 15-20% of exam questions relate to generative AI, with LLMs being a major focus. Understanding LLMs is essential for answering questions about Azure OpenAI Service, prompt engineering, and responsible AI.

25 min read
Intermediate
Updated May 31, 2026

LLMs as a Vast Team of Expert Writers

Imagine a company with thousands of expert writers, each specializing in different topics, sentence structures, and vocabulary. When you give them a prompt, they don't just grab a pre-written answer; instead, they collaborate in real time. The CEO (the model's architecture) breaks your prompt into tiny subtasks and assigns them to writers based on their expertise. Each writer suggests a few words, and a voting system (the attention mechanism) weighs each suggestion based on how well it fits the context. The CEO then picks the most likely next word, and the process repeats. However, the writers have no understanding of truth or logic—they only know which words tend to follow others based on their training on billions of documents. If you ask a question they've never seen, they'll still produce plausible-sounding text, but it might be entirely false. This is why LLMs can 'hallucinate'—they are like writers who confidently invent answers when they lack relevant knowledge. The entire process is deterministic given the same input and random seed, but the randomness in sampling can produce different outputs each time.

How It Actually Works

What Are Large Language Models?

Large Language Models (LLMs) are a class of AI models trained on vast amounts of text data to understand and generate human-like language. They are called 'large' because they contain billions of parameters—the numeric weights that capture patterns in language. For example, GPT-3 has 175 billion parameters, while GPT-4 is estimated to have over a trillion. These models are a type of neural network, specifically a transformer architecture, introduced in the paper 'Attention Is All You Need' (Vaswani et al., 2017). LLMs are pre-trained on a diverse corpus of text from the internet, books, articles, and other sources, learning to predict the next word in a sequence. This pre-training phase is unsupervised, meaning the model learns from raw text without explicit labels.

How LLMs Work: The Transformer Architecture

The transformer architecture is the foundation of modern LLMs. It consists of two main components: an encoder and a decoder. However, many LLMs like GPT are decoder-only, focusing on generating text. The key innovation is the self-attention mechanism, which allows the model to weigh the importance of different words in a sequence when processing each word. This enables the model to capture long-range dependencies—for example, understanding that 'it' in 'The cat sat on the mat because it was comfortable' refers to 'mat'.

#### Tokenization

Before processing, text is broken into tokens. Tokens can be words, subwords, or characters. GPT-3 uses byte-pair encoding (BPE) to tokenize text into around 50,000 tokens. For instance, the word 'unbelievable' might be split into 'un', 'believe', 'able'. Each token is mapped to a unique ID in the model's vocabulary.

#### Embedding

Each token ID is converted into a high-dimensional vector (embedding) that captures semantic meaning. These embeddings are learned during training. Similar words have similar vectors. For example, 'king' and 'queen' have vectors that are close in the embedding space.

#### Self-Attention

Self-attention computes a weighted sum of all token embeddings in the input sequence for each token. The weights are determined by a compatibility score between the token and every other token. This is done using three matrices: Query (Q), Key (K), and Value (V). The attention score is calculated as:

Attention(Q, K, V) = softmax(Q * K^T / sqrt(d_k)) * V

Where d_k is the dimension of the key vectors. The scaling factor sqrt(d_k) prevents large values that would push the softmax into regions with extremely small gradients.

#### Multi-Head Attention

Instead of a single attention function, transformers use multiple attention heads in parallel. Each head learns different aspects of relationships. For example, one head might focus on syntactic dependencies, while another captures semantic similarity. The outputs of all heads are concatenated and linearly transformed.

#### Feed-Forward Networks

After attention, each token's representation passes through a feed-forward neural network (FFN) with two linear transformations and a ReLU activation. This adds non-linearity and allows the model to learn complex patterns.

#### Layer Normalization and Residual Connections

Each sub-layer (attention and FFN) is followed by layer normalization and a residual connection. Residual connections help gradients flow through deep networks, preventing vanishing gradients. Layer normalization stabilizes training by normalizing activations.

Training Process

LLMs are trained using unsupervised learning on a large corpus. The objective is to predict the next token given the previous tokens (causal language modeling). The model is trained to minimize cross-entropy loss. Training requires massive computational resources—GPT-3 was trained on thousands of GPUs for weeks, costing millions of dollars.

#### Fine-Tuning

After pre-training, LLMs can be fine-tuned on specific tasks with smaller datasets. For example, a model can be fine-tuned for sentiment analysis using labeled movie reviews. This process adjusts the model's weights to specialize in the task.

#### Prompt Engineering

Instead of fine-tuning, users can craft prompts to guide the LLM's output. Techniques like few-shot learning (providing a few examples in the prompt) and chain-of-thought prompting (asking the model to reason step-by-step) improve performance.

Key Components and Values

Context Window: The maximum number of tokens the model can consider at once. For GPT-3, it's 2048 tokens; for GPT-4, it's up to 32,768 tokens. This limits how much text you can input.

Temperature: Controls randomness in output. Lower values (e.g., 0.2) make output more deterministic and focused; higher values (e.g., 0.8) increase creativity and variety.

Top-p (Nucleus Sampling): Instead of sampling from the entire vocabulary, the model considers only the tokens with cumulative probability exceeding p (e.g., 0.9). This reduces the chance of unlikely tokens.

Max Tokens: The maximum length of the generated response. This prevents infinite generation.

Stop Sequences: Specific tokens or phrases that signal the model to stop generating.

Azure OpenAI Service

Azure OpenAI Service provides access to LLMs like GPT-4, GPT-3.5, and the DALL-E model for image generation. It is a cloud-based service that integrates with Azure's security and compliance features. Key capabilities include:

Content Filtering: Azure applies content filters to block harmful outputs.

Responsible AI: Microsoft has guidelines for using LLMs ethically.

Custom Models: You can fine-tune models using your own data.

How LLMs Interact with Related Technologies

Retrieval-Augmented Generation (RAG): Combines LLMs with external knowledge bases. The LLM retrieves relevant documents from a vector database and uses them to generate answers, reducing hallucinations.

Semantic Kernel: An open-source SDK that integrates LLMs with traditional programming languages like C# and Python, allowing for orchestration of AI components.

Copilot: Microsoft's implementation of LLMs in products like GitHub Copilot and Microsoft 365 Copilot, which assist users in coding and document creation.

Common Exam Values

GPT-3 has 175 billion parameters.

GPT-4 context window is 32,768 tokens (approximately 50 pages of text).

Default temperature in Azure OpenAI is 0.7.

The 'stop' parameter can be used to end generation.

LLMs are trained using unsupervised learning.

Summary

LLMs are powerful generative models based on the transformer architecture. They learn from vast text corpora and can perform a wide range of language tasks. In Azure, they are accessible via Azure OpenAI Service, with features like content filtering and fine-tuning. Understanding their capabilities and limitations is crucial for the AI-900 exam.

Walk-Through

1

Tokenize Input Text

The first step is to convert the input text into tokens using a tokenizer. For example, GPT-3 uses Byte-Pair Encoding (BPE) with a vocabulary of 50,257 tokens. Each word or subword is mapped to a unique integer ID. The tokenizer also adds special tokens like [CLS] or [SEP] depending on the model. The output is a sequence of token IDs that the model can process.

2

Generate Embeddings

Each token ID is converted into a dense vector (embedding) of fixed dimension, typically 768 for smaller models or 12,288 for larger ones. These embeddings are learned during training and capture semantic meaning. The embeddings are then added to positional encodings to retain the order of tokens, since the transformer processes all tokens in parallel.

3

Apply Self-Attention

The embeddings pass through multiple transformer layers. In each layer, self-attention computes attention scores between every pair of tokens. The scores determine how much each token should influence the representation of another token. For example, in the sentence 'The bank of the river', the word 'bank' will attend strongly to 'river' to understand its meaning. The output is a weighted sum of all token embeddings.

4

Pass Through Feed-Forward Network

After self-attention, each token's representation is fed into a feed-forward neural network (FFN) consisting of two linear layers with a ReLU activation in between. This layer adds non-linearity and allows the model to learn complex transformations. The FFN typically has a hidden dimension four times larger than the embedding dimension.

5

Generate Next Token Probability

After the final transformer layer, the model uses a linear layer (language model head) to project the last token's hidden state to a vector of vocabulary size, producing logits. Softmax is applied to convert logits to probabilities. The model then samples the next token from this probability distribution, optionally using temperature or top-p sampling. The process repeats to generate the full output.

What This Looks Like on the Job

Enterprise Scenario 1: Customer Support Chatbot

A large e-commerce company deploys an LLM-powered chatbot using Azure OpenAI Service to handle customer inquiries. The problem: traditional rule-based chatbots fail to understand diverse phrasing and context. The solution: the chatbot uses GPT-3.5 with a system prompt that defines its role as a helpful customer agent. It is integrated with a RAG pipeline that retrieves product information from a vector database (Azure Cognitive Search) to ground responses in accurate data. In production, the company sets temperature to 0.3 to ensure consistent, safe responses. They also implement content filters to block inappropriate language. Common misconfiguration: setting max tokens too low (e.g., 100) truncates answers; setting it too high (e.g., 2000) can lead to rambling. Performance: the system handles 10,000 queries per day with an average latency of 2 seconds. When the RAG pipeline fails (e.g., database unavailable), the LLM may hallucinate product details, causing customer confusion.

Enterprise Scenario 2: Code Generation Assistant

A software development firm uses GitHub Copilot, powered by OpenAI Codex, to assist developers. The problem: developers spend significant time writing boilerplate code. The solution: Copilot suggests code completions in real-time based on comments and context. It uses a fine-tuned LLM trained on public code repositories. In production, the model is configured with a low temperature (0.1) to generate predictable code. Developers can accept, reject, or modify suggestions. Common issues: the model may suggest insecure code (e.g., SQL injection vulnerabilities) or use deprecated libraries. The firm mitigates this by adding a static analysis tool that scans Copilot's suggestions. Scale: over 1 million developers use Copilot, generating billions of suggestions per month.

Enterprise Scenario 3: Document Summarization

A legal firm deploys an LLM to summarize lengthy contracts. The problem: manually summarizing 100-page documents is time-consuming. The solution: using GPT-4 with a 32K context window, the firm can input entire documents. The prompt instructs the model to extract key clauses and summarize in bullet points. The temperature is set to 0.2 for accuracy. A critical consideration: the model may miss nuanced legal language or misinterpret terms. Therefore, the firm always has a human lawyer review the output. Misconfiguration: not setting a stop sequence can cause the model to continue generating irrelevant text, wasting tokens and increasing costs.

How AI-900 Actually Tests This

The AI-900 exam tests your understanding of LLMs primarily under objective 5.1 (Identify features of generative AI models) and related objectives in domain 5 (Generative AI). Expect 3-5 questions on LLMs. Key points:

1.

What LLMs are: They are AI models trained on large text datasets to generate human-like text. They are not rule-based or programmed with explicit grammar rules.

2.

How they are trained: Unsupervised learning on massive text corpora. They learn statistical patterns in language. Fine-tuning uses supervised learning on labeled data.

3.

Azure OpenAI models: GPT-3.5 and GPT-4 are available. GPT-3 has 175 billion parameters. Context window for GPT-4 is 32,768 tokens.

4.

Parameters: Temperature (0-1, default 0.7), max tokens, top-p, stop sequences.

5.

Responsible AI: Content filtering, transparency, fairness. The exam emphasizes using LLMs ethically.

Common wrong answers:

Choosing 'supervised learning' for pre-training (it's unsupervised).

Thinking LLMs understand truth (they don't; they generate plausible text).

Confusing fine-tuning with prompt engineering (fine-tuning changes model weights; prompt engineering does not).

Assuming LLMs have a fixed context window (they do, but can vary by model).

Exam traps:

The exam may ask about 'hallucination'—the tendency of LLMs to generate false information. The correct answer is that it's due to the model's lack of true understanding.

Questions about 'temperature': low temperature makes output more deterministic; high temperature makes it more random.

'Few-shot learning' means providing examples in the prompt, not retraining the model.

Elimination strategy: If an answer says LLMs are trained with supervised learning on all data, eliminate it. If it says LLMs can reason logically, eliminate it. Focus on the statistical nature of LLMs.

Key Takeaways

LLMs are trained using unsupervised learning on large text corpora to predict the next token.

The transformer architecture uses self-attention to capture relationships between tokens.

GPT-3 has 175 billion parameters; GPT-4 has a 32,768-token context window.

Temperature controls randomness: lower values (e.g., 0.2) produce deterministic output; higher values (e.g., 0.8) increase creativity.

LLMs can hallucinate; they do not understand truth or logic.

Azure OpenAI Service provides access to GPT models with content filtering and responsible AI features.

Fine-tuning updates model weights; prompt engineering does not.

RAG (Retrieval-Augmented Generation) combines LLMs with external knowledge to reduce hallucinations.

Easy to Mix Up

These come up on the exam all the time. Here's how to tell them apart.

GPT-3

175 billion parameters

Context window of 2048 tokens

Trained on data up to 2021

Available in Azure OpenAI Service

Lower accuracy on complex reasoning

GPT-4

Estimated over 1 trillion parameters

Context window up to 32,768 tokens

Trained on more recent data

Multimodal (can process images)

Better at nuanced tasks and fewer hallucinations

Watch Out for These

Mistake

LLMs understand language like humans do.

Correct

LLMs do not understand language; they model statistical patterns in text. They have no comprehension, consciousness, or true reasoning ability. They generate text based on probabilities learned from training data.

Mistake

LLMs are trained using supervised learning.

Correct

Pre-training is unsupervised (the model predicts next words without labels). Fine-tuning may use supervised learning, but not the initial training.

Mistake

LLMs always produce correct and factual information.

Correct

LLMs can hallucinate—generate plausible but false information. They have no inherent truth-checking mechanism and can be confidently wrong.

Mistake

The context window is unlimited.

Correct

Each LLM has a fixed maximum context window (e.g., 2048 tokens for GPT-3, 32,768 for GPT-4). Inputs longer than this are truncated or cause errors.

Mistake

Prompt engineering and fine-tuning are the same thing.

Correct

Prompt engineering involves crafting input prompts to guide output without changing model weights. Fine-tuning updates the model's weights using additional training data.

Do You Actually Know This?

Reveal each answer, then mark whether you got it right. Score 60%+ to unlock the next chapter.

Frequently Asked Questions

What is the difference between GPT-3 and GPT-4?

GPT-4 is more advanced than GPT-3. It has a larger context window (32,768 tokens vs. 2048), better reasoning abilities, and is multimodal (can process images). GPT-4 also produces fewer hallucinations and is more accurate on complex tasks. Both are available in Azure OpenAI Service, but GPT-4 is more expensive.

What does the temperature parameter do in LLMs?

Temperature controls the randomness of the model's output. A low temperature (e.g., 0.2) makes the model more deterministic and conservative, choosing the most likely tokens. A high temperature (e.g., 0.8) increases randomness, allowing for more creative and diverse outputs. The default in Azure OpenAI is 0.7.

How can I reduce hallucinations in LLMs?

Use techniques like Retrieval-Augmented Generation (RAG) to ground the model in external data, set a low temperature to reduce randomness, provide clear prompts with context, and use fine-tuning on domain-specific data. Also, implement content filtering to catch obvious errors.

What is the context window of an LLM?

The context window is the maximum number of tokens (words or subwords) the model can consider at once when generating a response. For GPT-3, it's 2048 tokens; for GPT-4, it's up to 32,768 tokens. Inputs longer than this are truncated or cause errors.

What is the difference between fine-tuning and prompt engineering?

Fine-tuning updates the model's weights by training it on additional data, making it specialize in a task. Prompt engineering involves crafting input prompts to guide the model's output without changing its weights. Fine-tuning requires computational resources and labeled data; prompt engineering does not.

What is the role of self-attention in transformers?

Self-attention allows the model to weigh the importance of different tokens in a sequence when processing each token. It computes attention scores that determine how much each token should influence the representation of others, enabling the model to capture long-range dependencies and context.

How does Azure OpenAI Service ensure responsible AI?

Azure OpenAI Service includes content filtering to block harmful outputs, requires users to adhere to responsible AI guidelines, and provides transparency about model capabilities and limitations. Microsoft also offers tools to monitor and mitigate biases.

Terms Worth Knowing

Ready to put this to the test?

You've just covered Large Language Models (LLMs) — now see how well it sticks with free AI-900 practice questions. Full explanations included, no account needed.

Done with this chapter?