AI-900 · Chapter 77 of 100 · Objective 5.2

GPT Models: GPT-3.5, GPT-4, o-Series

This chapter covers the GPT models that form the foundation of Azure OpenAI Service (GPT-3.5, GPT-4, and the o-series), which fall under the AI-900 exam's Generative AI objective. Approximately 10-15% of exam questions touch on generative AI models, including their capabilities, limitations, and appropriate use cases. You will learn how these models work, how they differ, and how to choose the right one for a given task.

25 min read

Intermediate

Updated Jul 20, 2026

Reviewed by Johnson Ajibi · Senior Network & Security Engineer · MSc IT Security

GPT Models as Master Chefs

Picture a master chef who has honed their skills on millions of recipes from around the world. The chef doesn't memorize every recipe; instead, they learn the underlying principles of cooking: how flavors combine, what temperatures do to ingredients, and how textures work. When you ask for a dish, the chef doesn't just repeat a stored recipe. They generate a new one by predicting the next ingredient and step based on the context of your request. For GPT-3.5, think of a chef who trained on a huge library but sometimes gets the seasoning wrong or confuses similar cuisines. GPT-4 is like a chef who trained on an even larger, more diverse library, with better feedback from expert tasters, so they produce more refined and accurate dishes. The o-series models are like a chef who uses a special technique: they 'think' step-by-step, writing down their reasoning before cooking, which helps them solve complex recipes that require planning, like a multi-course meal with precise timing. Just as a chef's training, experience, and working method determine the quality of the dish, GPT models' capabilities depend on their architecture, training data, and inference techniques. And just as a chef cannot invent a completely new cuisine they've never seen, GPT models cannot supply information that is absent from their training data; they can only recombine and predict from what they've learned.

How It Actually Works

1. What Are GPT Models and Why Do They Exist?

GPT stands for Generative Pre-trained Transformer. These are large language models (LLMs) developed by OpenAI that generate human-like text by predicting the next token (a word or word fragment) in a sequence. They are 'pre-trained' on vast amounts of text data from the internet, books, and other sources, learning grammar, facts, reasoning abilities, and even some biases. The 'Transformer' architecture, introduced in the paper 'Attention Is All You Need' (Vaswani et al., 2017), allows the model to weigh the importance of different words in a sentence, enabling it to understand context and long-range dependencies.

GPT models exist because traditional rule-based or statistical language models struggled with fluency, context, and generating coherent long-form text. GPT models, especially larger ones, exhibit emergent abilities like translation, summarization, question answering, and code generation without being explicitly trained for each task. This makes them powerful tools for natural language processing (NLP) tasks, powering applications like chatbots, content generation, and code assistants.

2. How GPT Models Work Internally

At a high level, a GPT model takes a sequence of tokens (words or subwords) as input and outputs a probability distribution over the next token. It does this through multiple layers of transformer blocks, each containing self-attention and feed-forward neural networks.

Tokenization: Input text is split into tokens using a subword tokenizer (e.g., Byte Pair Encoding). For example, 'unbelievable' might become ['un', 'believ', 'able']. Each token is mapped to an embedding vector.

Positional Encoding: Since transformers process all tokens simultaneously, positional encodings are added to embeddings to give the model information about token order.

Self-Attention: For each token, the model computes attention scores with every other token in the input. This allows it to focus on relevant parts of the input. For instance, in 'The cat sat on the mat because it was tired', self-attention helps the model link 'it' to 'cat'.

Feed-Forward Layers: After attention, each token's representation passes through a feed-forward neural network, adding non-linearity and transforming the representation.

Stacking Layers: GPT models have many transformer layers (e.g., GPT-3 has 96 layers). Each layer refines the representation, building increasingly abstract features.

Output Layer: The final layer produces a probability distribution over the vocabulary for the next token. The model then samples from this distribution (using techniques like temperature or top-k sampling) to generate the next token.
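The self-attention step described above can be sketched in a few lines of plain Python. This is a minimal, single-head illustration rather than how any production model is implemented: the toy vectors are made up, the same vectors stand in for queries, keys, and values (a real model derives each through learned projections), and the `causal` flag reflects the fact that GPT-style decoders only let a token attend to itself and earlier tokens.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(queries, keys, values, causal=True):
    """Single-head scaled dot-product attention over a short sequence."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for i, query in enumerate(queries):
        # With causal masking, position i may only look at positions 0..i.
        limit = i + 1 if causal else len(keys)
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k)
            for key in keys[:limit]
        ]
        weights = softmax(scores)
        # Each output is a weighted average of the visible value vectors.
        mixed = [
            sum(w * value[j] for w, value in zip(weights, values[:limit]))
            for j in range(len(values[0]))
        ]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

# Made-up 2-dimensional vectors for a three-token input.
toy_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
_, attention_weights = self_attention(toy_vectors, toy_vectors, toy_vectors)
for position, row in enumerate(attention_weights):
    print(position, [round(w, 3) for w in row])
```

Each printed row shows how much one position attends to the positions it can see; the weights in a row always sum to 1.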

3. Key Components: Model Size, Training Data, and Context Window

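Model Size (Parameters): GPT-3 has 175 billion parameters, and that figure is often quoted for GPT-3.5 as well, although OpenAI has not published a parameter count for gpt-35-turbo. GPT-4 is widely assumed to be larger, but its size is also undisclosed. More parameters generally mean more capacity to learn patterns, but also require more compute.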

Training Data: GPT-3.5 builds on GPT-3, which was trained on a mix of Common Crawl, WebText2, Books, Wikipedia, and other sources, totaling hundreds of billions of tokens. OpenAI has not published the details of GPT-4's training data; it is generally understood to be larger and more diverse, including more code and multilingual data.

Context Window: The maximum number of tokens the model can consider when generating a response, counting both the prompt and the generated output. GPT-3.5 has a context window of 4,096 tokens (about 3,000 words). GPT-4 offers 8,192 tokens (standard) and 32,768 tokens (extended). Later variants such as gpt-35-turbo-16k and GPT-4 Turbo extend these limits. The o-series models (like o1) have a context window of up to 200,000 tokens, enabling processing of entire books or long codebases.

Temperature and Top-p: These are inference parameters that control randomness. Temperature (0-2) divides the logits before softmax; lower values make output more deterministic. Top-p (nucleus sampling) restricts sampling to the smallest set of most-likely tokens whose cumulative probability reaches p; lower p reduces diversity.
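To make top-p concrete, here is a small sketch of nucleus filtering over a made-up next-token distribution. The function name `top_p_filter` and the probabilities are illustrative assumptions, not part of any Azure API.

```python
def top_p_filter(probs, p):
    """Keep the smallest set of most-likely tokens whose cumulative
    probability reaches p, then renormalize so the kept set sums to 1."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    return {token: prob / total for token, prob in kept}

# Hypothetical distribution for the token after "The cat sat on the".
next_token_probs = {"mat": 0.5, "sofa": 0.3, "roof": 0.15, "moon": 0.05}

print(top_p_filter(next_token_probs, 0.9))  # drops only the least likely token
print(top_p_filter(next_token_probs, 0.5))  # keeps only the most likely token
```

The model would then sample from the filtered dictionary, which is why a lower p narrows the range of possible outputs.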

4. Configuration and Usage on Azure

On Azure, GPT models are accessed via Azure OpenAI Service. To use them:

1. Create an Azure OpenAI resource in a supported region (e.g., East US, West Europe).

2. Deploy a model (e.g., gpt-35-turbo, gpt-4, o1-preview).

3. Use the REST API or SDK to send prompts and receive completions.

Example API call using curl:

# Replace the <...> placeholders with your own values.
# The URL is quoted so the shell does not interpret "?" or the angle brackets.
curl "https://<resource-name>.openai.azure.com/openai/deployments/<deployment-id>/chat/completions?api-version=2024-02-15-preview" \
  -H "Content-Type: application/json" \
  -H "api-key: <api-key>" \
  -d '{
    "messages": [{"role": "user", "content": "What is AI?"}],
    "max_tokens": 100,
    "temperature": 0.7
  }'
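For readers who prefer Python, the same request can be assembled with the standard library. This is a sketch that mirrors the curl call above: the resource name, deployment name, and key passed in at the bottom are placeholders, and the request is only built here, not sent.

```python
import json
import urllib.request

def build_chat_request(resource_name, deployment_id, api_key, prompt,
                       api_version="2024-02-15-preview"):
    """Build (but do not send) a chat completions request."""
    url = (
        f"https://{resource_name}.openai.azure.com/openai/deployments/"
        f"{deployment_id}/chat/completions?api-version={api_version}"
    )
    body = {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 100,
        "temperature": 0.7,
    }
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json", "api-key": api_key},
        method="POST",
    )

# Placeholder values; substitute your own resource, deployment, and key.
request = build_chat_request("my-resource", "my-deployment", "placeholder-key", "What is AI?")
print(request.full_url)

# To send it against a real deployment:
# with urllib.request.urlopen(request) as response:
#     reply = json.load(response)
#     print(reply["choices"][0]["message"]["content"])
```

In practice most projects use the official SDK instead; the raw request is shown because it maps one-to-one onto the curl example.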

5. Interaction with Related Technologies

GPT models are often used in conjunction with:

- Azure AI Search (formerly Azure Cognitive Search): For Retrieval-Augmented Generation (RAG), grounding the model with enterprise data.

- Azure AI Content Safety: To filter harmful outputs.

- Azure Machine Learning: For fine-tuning models on custom datasets.

- Azure Functions and Logic Apps: To integrate GPT into automated workflows.

6. Differences Between GPT-3.5, GPT-4, and o-Series

GPT-3.5 (gpt-35-turbo): Optimized for conversational tasks, cost-effective, good for general chat, summarization, and simple Q&A. It is faster and cheaper than GPT-4.

GPT-4: More accurate, creative, and better at complex reasoning. It excels in tasks requiring deep understanding, like legal analysis, advanced coding, and nuanced content generation. It is more expensive and slower.

o-Series (o1, o1-mini): Designed for complex reasoning tasks. They use a technique called 'chain-of-thought' reasoning internally, spending more time 'thinking' before responding. They are best for math, science, coding problems, and multi-step planning. They are more expensive and have higher latency but produce more accurate results for hard problems.

7. Performance and Cost Considerations

Latency: GPT-3.5 typically responds in under 1 second for short prompts. GPT-4 can take 2-5 seconds. o-series models may take 10-30 seconds for complex reasoning.

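Cost: Pricing is per 1,000 tokens, with separate rates for input and output. As of 2025, GPT-3.5 costs ~$0.0015 per 1K input tokens and ~$0.002 per 1K output tokens. GPT-4 costs ~$0.03 per 1K input and ~$0.06 per 1K output. o1 is listed at ~$0.015 per 1K input and ~$0.06 per 1K output. Its per-token rates are comparable to GPT-4's, but its hidden reasoning tokens are billed as output tokens, so a typical o-series request costs more.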
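A quick way to compare models is to plug the approximate prices quoted above into a small calculator. The price table below simply restates those figures and will drift over time; the dictionary and function names are illustrative.

```python
# Approximate (input, output) prices in USD per 1,000 tokens, as quoted in this chapter.
PRICES_PER_1K = {
    "gpt-35-turbo": (0.0015, 0.002),
    "gpt-4": (0.03, 0.06),
    "o1": (0.015, 0.06),
}

def estimate_cost(model, input_tokens, output_tokens):
    """Estimate the cost in USD of one request."""
    input_price, output_price = PRICES_PER_1K[model]
    return input_tokens / 1000 * input_price + output_tokens / 1000 * output_price

# Same workload on each model: a 1,000-token prompt and a 500-token reply.
for model in PRICES_PER_1K:
    print(f"{model}: ${estimate_cost(model, 1000, 500):.4f}")
```

Multiplying the per-request figure by expected daily volume gives a rough budget before committing to a model.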

Rate Limits: Azure enforces tokens-per-minute (TPM) quotas. Defaults vary by region and subscription; representative figures are 240K TPM for GPT-3.5, 40K TPM for GPT-4, and 10K TPM for o1. Limits can be increased via quota requests.

8. Common Use Cases

GPT-3.5: Customer support chatbots, content generation for blogs, email drafting, simple code generation.

GPT-4: Advanced research assistance, contract analysis, complex code debugging, creative writing.

o-Series: Solving advanced math problems, scientific research, multi-step logic puzzles, code optimization.

9. Limitations and Considerations

Hallucination: Models may generate plausible but incorrect information. Always verify critical outputs.

Bias: Training data may contain biases; outputs can reflect them. Use content filters and review.

Context Window: Information beyond the context window is forgotten. For long documents, use summarization or RAG.

Determinism: With temperature=0, output is close to deterministic, though identical results across runs are not guaranteed. For creative tasks, a higher temperature usually works better.

Walk-Through

Tokenize Input Text

The input prompt is split into tokens using a subword tokenizer like Byte Pair Encoding (BPE). For example, 'Hello, world!' might become ['Hello', ',', ' world', '!']. Each token is assigned a unique ID from the model's vocabulary. The tokenizer ensures that common words are single tokens while rare words are broken into multiple tokens. This step is invisible to the user but critical for efficiency. The maximum number of tokens for a request is limited by the model's context window (e.g., 4,096 for GPT-3.5). If the input exceeds this, it must be truncated or split.
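The idea of splitting rare words into known pieces can be illustrated with a toy tokenizer. Note that this is a greedy longest-match split over a hand-written vocabulary, used here as a simplified stand-in; real BPE learns its merge rules from data and operates on bytes.

```python
def greedy_subword_split(word, vocab):
    """Split a word into the longest vocabulary pieces, left to right."""
    pieces, start = [], 0
    while start < len(word):
        # Try the longest candidate first, then shrink.
        for end in range(len(word), start, -1):
            if word[start:end] in vocab:
                pieces.append(word[start:end])
                start = end
                break
        else:
            # No known piece starts here: emit one character so the loop advances.
            pieces.append(word[start])
            start += 1
    return pieces

# Hypothetical vocabulary: a common word is one token, a rare word is several.
TOY_VOCAB = {"hello", "world", "un", "believ", "able"}

print(greedy_subword_split("hello", TOY_VOCAB))
print(greedy_subword_split("unbelievable", TOY_VOCAB))
```

Because token counts, not character counts, are what the context window and pricing measure, a text full of rare words costs more tokens than its length suggests.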

Embed Tokens and Add Positional Encoding

Each token ID is mapped to a high-dimensional vector (embedding) that captures its semantic meaning. These embeddings are learned during training. Then, positional encodings are added to give the model information about the order of tokens. Without positional encoding, the model would treat 'cat sat' and 'sat cat' identically. The original Transformer used fixed sine and cosine functions of different frequencies for this; GPT models instead learn a position embedding for each position during training.
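The effect of adding position information can be shown with the sinusoidal scheme from the original Transformer paper, used here as a stand-in because it needs no training. The 4-dimensional token embeddings are invented for the example.

```python
import math

def positional_encoding(position, dim):
    """Sinusoidal encoding: sine on even indices, cosine on odd indices."""
    encoding = []
    for i in range(dim):
        angle = position / (10000 ** (2 * (i // 2) / dim))
        encoding.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return encoding

def embed_sequence(tokens, embeddings):
    """Add the position vector to each token's embedding."""
    dim = len(next(iter(embeddings.values())))
    return [
        [e + p for e, p in zip(embeddings[token], positional_encoding(position, dim))]
        for position, token in enumerate(tokens)
    ]

# Made-up embeddings; a real model learns these during training.
TOY_EMBEDDINGS = {"cat": [0.2, 0.7, 0.1, 0.5], "sat": [0.9, 0.1, 0.4, 0.3]}

cat_first = embed_sequence(["cat", "sat"], TOY_EMBEDDINGS)
sat_first = embed_sequence(["sat", "cat"], TOY_EMBEDDINGS)

# "cat" gets a different vector depending on where it appears.
print(cat_first[0])
print(sat_first[1])
```

The two printed vectors differ even though both represent 'cat', which is exactly the signal the model needs to tell the two word orders apart.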

Process Through Transformer Layers

The embedded sequence passes through multiple transformer layers (e.g., 96 for GPT-3). Each layer consists of two sublayers: multi-head self-attention and a feed-forward neural network. In self-attention, each token computes attention scores with all other tokens, allowing the model to weigh the importance of different words. For example, in 'The dog that chased the cat ran fast', the model learns that 'ran' is more related to 'dog' than to 'cat'. The feed-forward network then transforms each token's representation independently. Layer normalization and residual connections stabilize training. After 96 layers, the representation captures deep contextual meaning.
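The role of layer normalization and residual connections can be sketched as follows. This is a simplified illustration under stated assumptions: it normalizes before the sublayer (the arrangement used from GPT-2 onward), omits the learned scale and shift that real layer normalization includes, and uses a hypothetical `toy_feed_forward` in place of a trained network.

```python
import math

def layer_norm(vector, eps=1e-5):
    """Rescale a vector to zero mean and (approximately) unit variance."""
    mean = sum(vector) / len(vector)
    variance = sum((x - mean) ** 2 for x in vector) / len(vector)
    return [(x - mean) / math.sqrt(variance + eps) for x in vector]

def residual_block(vector, sublayer):
    """Compute input + sublayer(layer_norm(input))."""
    transformed = sublayer(layer_norm(vector))
    return [x + t for x, t in zip(vector, transformed)]

def toy_feed_forward(vector):
    # Hypothetical stand-in for a learned feed-forward network: ReLU, then scale.
    return [0.5 * max(0.0, x) for x in vector]

hidden = [1.0, 2.0, 3.0, 4.0]
for layer in range(3):  # a real model stacks many more layers
    hidden = residual_block(hidden, toy_feed_forward)
    print(layer, [round(x, 3) for x in hidden])
```

Because each block adds to its input instead of replacing it, information from early layers survives all the way up the stack.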

Generate Next Token Probability Distribution

The output from the final transformer layer is passed through a linear layer and a softmax function to produce a probability distribution over the entire vocabulary (e.g., 50,257 tokens for GPT-3). Each token gets a probability score. The model then selects the next token based on the sampling strategy: greedy (highest probability), temperature-scaled (dividing logits by temperature before softmax), or top-k/top-p sampling. For example, temperature=0.7 sharpens the distribution relative to the default of 1.0, making likely tokens even more likely; values above 1 flatten it and increase diversity. The selected token is appended to the input sequence, and the process repeats until a stop condition (e.g., max tokens, end-of-sequence token) is met.
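Temperature scaling is easy to check numerically. The sketch below divides a set of made-up logits by the temperature before applying softmax; the three logits stand in for a vocabulary of tens of thousands of entries.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Divide logits by the temperature, then apply softmax."""
    # temperature must be > 0 here; a setting of 0 is treated as greedy
    # selection (always pick the highest logit) rather than a division.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

toy_logits = [2.0, 1.0, 0.1]  # hypothetical scores for three candidate tokens

for temperature in (0.7, 1.0, 1.5):
    probs = softmax_with_temperature(toy_logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

Running it shows the top token's probability rising as the temperature drops below 1 and falling as it rises above 1.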

Repeat Until Completion

The model continues generating tokens one by one, feeding the newly generated token back into the input. This autoregressive process continues until the model outputs an end-of-sequence token or the maximum token limit (set by max_tokens parameter) is reached. During generation, the model maintains a cache of key-value pairs from previous attention computations to avoid recomputation, speeding up inference. The final output is then detokenized back into human-readable text. The entire process is stateless; each request is independent unless you include conversation history in the prompt.
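The autoregressive loop itself is short. In this sketch a lookup table plays the role of the model, so the names `TOY_NEXT_TOKEN` and `toy_model` are purely illustrative; the loop structure (predict, append, repeat until end-of-sequence or `max_tokens`) is the part that carries over.

```python
# Hypothetical "model": the next token depends only on the previous one.
TOY_NEXT_TOKEN = {"<start>": "the", "the": "cat", "cat": "sat", "sat": "<eos>"}

def toy_model(tokens):
    """Return the next token for the sequence so far."""
    return TOY_NEXT_TOKEN.get(tokens[-1], "<eos>")

def generate(prompt_tokens, next_token_fn, max_tokens, eos="<eos>"):
    """Generate tokens one at a time, feeding each back into the input."""
    tokens = list(prompt_tokens)
    generated = []
    for _ in range(max_tokens):
        token = next_token_fn(tokens)
        if token == eos:  # stop condition 1: end-of-sequence token
            break
        generated.append(token)
        tokens.append(token)  # the new token becomes part of the context
    return generated  # stop condition 2: max_tokens reached

print(generate(["<start>"], toy_model, max_tokens=10))  # ['the', 'cat', 'sat']
print(generate(["<start>"], toy_model, max_tokens=2))   # ['the', 'cat']
```

The second call shows why a low `max_tokens` can cut a response off mid-thought: the loop stops before the model has emitted its end-of-sequence token.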

What This Looks Like on the Job

Scenario 1: Enterprise Customer Support Chatbot

A large e-commerce company deploys a GPT-3.5-based chatbot on Azure to handle customer inquiries about orders, returns, and product information. The chatbot is integrated with Azure Cognitive Search to retrieve order details from a database. The company chose GPT-3.5 because it is cost-effective and fast enough for real-time chat. They set temperature to 0.3 to ensure consistent, factual responses. The system handles up to 10,000 conversations per hour, with an average response time of 800ms. A common misconfiguration is not setting appropriate content filters; without them, the model might generate inappropriate responses. The company uses Azure AI Content Safety to block harmful content. They also monitor token usage to avoid unexpected costs. When misconfigured (e.g., too high temperature), the chatbot may give creative but incorrect answers, leading to customer frustration.

Scenario 2: Legal Document Analysis with GPT-4

A law firm uses GPT-4 to analyze contracts and summarize key clauses. They need high accuracy and deep reasoning, so they chose GPT-4 despite higher cost. The firm uses the 32K context window to process entire contracts in one request. They set temperature to 0 for deterministic output. The system is deployed in a private Azure OpenAI instance with no internet access to ensure data privacy. They fine-tune the model on legal terminology to improve performance. A common issue is hallucination: the model might invent clauses not present in the document. To mitigate, they implement a RAG pipeline where the model only answers based on retrieved chunks. They also use human review for critical documents. Cost management is crucial; they limit max_tokens to avoid generating unnecessary text.

Scenario 3: Advanced Math Problem Solving with o-Series

A research institute uses o1 to solve complex mathematical proofs and generate code for simulations. The o-series model's chain-of-thought reasoning is essential for multi-step problems. For example, they ask the model to prove a theorem, and the model outputs step-by-step reasoning. They use the 200K token context window to include entire research papers. Latency is high (20-30 seconds per response), but accuracy is paramount. Because the o1 models do not accept custom temperature or top_p values (at the time of writing), the team controls quality through validation rather than sampling settings. A challenge is cost: o1 is expensive, so they only use it for hard problems, routing simpler queries to GPT-3.5. They also monitor for reasoning errors; if the model makes a mistake early, it propagates. They implement a validation step where the model checks its own reasoning.

How AI-900 Actually Tests This

AI-900 Objective 5.2: Identify capabilities of generative AI models

The AI-900 exam focuses on high-level understanding rather than deep technical details. You should know:

The difference between GPT-3.5, GPT-4, and o-series in terms of capability, cost, and use cases.

That GPT models are used for text generation, summarization, translation, and code generation.

That larger models (GPT-4) are more capable but more expensive and slower.

That o-series models are designed for complex reasoning tasks.

Common Wrong Answers

'GPT-4 is always better than GPT-3.5 for every task' – Candidates choose this because GPT-4 is newer and more powerful. Reality: For simple tasks like translation or summarization, GPT-3.5 is often sufficient and more cost-effective.

'o-series models are faster than GPT-3.5' – Candidates assume newer means faster. Reality: o-series models have higher latency due to internal reasoning steps.

'GPT models can access the internet in real-time' – Candidates think GPT is like a search engine. Reality: GPT models are static; they only know what they were trained on (up to a cutoff date). They do not browse the web unless integrated with a tool like Bing.

'All GPT models have the same context window' – Candidates generalize. Reality: GPT-3.5 has 4K, GPT-4 has 8K or 32K, o-series up to 200K.

Specific Numbers and Terms

GPT-3.5: commonly cited as 175 billion parameters (the published GPT-3 figure; not officially confirmed for GPT-3.5), 4,096 token context window.

GPT-4: larger parameters (not disclosed), up to 32,768 tokens.

o-series: designed for reasoning, up to 200,000 tokens.

Temperature range: 0-2 (default 1). Top-p range: 0-1.

Azure OpenAI Service: the platform to deploy these models.

Edge Cases and Exceptions

The exam may ask about 'zero-shot' vs 'few-shot' learning. GPT models can perform tasks without examples (zero-shot) or with a few examples in the prompt (few-shot).

Fine-tuning is available for GPT-3.5 and GPT-4 but not for o-series (as of 2025).

Content filtering is applied by default on Azure OpenAI; you can configure severity levels.

How to Eliminate Wrong Answers

If a question asks for the 'best model for a simple chatbot', eliminate GPT-4 and o-series because they are overkill and expensive.

If a question mentions 'step-by-step reasoning' or 'complex math', choose o-series.

If a question mentions 'real-time data', the answer is not GPT alone; it needs RAG or a plugin.

Remember: larger context window does not mean better reasoning; it means more text can be processed at once.
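The elimination rules above amount to a small decision procedure, which some candidates find easier to remember as code. The function below is a study aid only: its name and flags are invented for this sketch and do not correspond to any Azure feature.

```python
def choose_model(needs_multistep_reasoning, task_is_complex, needs_live_data):
    """Apply the exam heuristics: pick a model family, then decide on grounding."""
    if needs_multistep_reasoning:
        model = "o-series"  # step-by-step reasoning, complex math
    elif task_is_complex:
        model = "GPT-4"     # deep analysis, nuanced generation
    else:
        model = "GPT-3.5"   # simple chat: cheapest and fastest
    # No GPT model has real-time data on its own.
    grounding = "RAG or a plugin" if needs_live_data else "none"
    return model, grounding

print(choose_model(False, False, False))  # simple chatbot
print(choose_model(True, True, False))    # multi-step math problem
print(choose_model(False, False, True))   # chatbot that needs current data
```

Note that context window size does not appear as an input: it limits how much text fits in a request, not which model reasons best.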

Key Takeaways

GPT models are autoregressive language models that predict the next token based on input context.

GPT-3.5 is cost-effective and fast, suitable for general conversational AI.

GPT-4 is more capable for complex tasks but more expensive and slower.

o-series models are designed for advanced reasoning with high latency and cost.

Context window sizes: GPT-3.5 (4K), GPT-4 (8K/32K), o-series (up to 200K).

Temperature controls randomness; lower values produce more deterministic output.

Azure OpenAI Service is the platform to deploy these models with built-in content safety.

GPT models cannot access real-time data; use RAG to ground them with current information.

Fine-tuning is available for GPT-3.5 and GPT-4 but not for o-series.

Choose the model based on task complexity, latency requirements, and budget.

Easy to Mix Up

These come up on the exam all the time. Here's how to tell them apart.

GPT-3.5

About 175 billion parameters (the published GPT-3 figure)

4,096 token context window

Lower cost (~$0.002 per 1K output tokens)

Faster response times (<1s typical)

Best for simple tasks: chat, summarization, translation

GPT-4

Larger parameter count (exact undisclosed)

Up to 32,768 token context window

Higher cost (~$0.06 per 1K output tokens)

Slower response times (2-5s typical)

Best for complex tasks: reasoning, analysis, creative writing

GPT-4

General-purpose language model

Standard transformer architecture

Moderate latency (2-5s)

Good at a wide range of tasks

Supports fine-tuning

o-Series (o1)

Reasoning-optimized model

Uses internal chain-of-thought reasoning

Higher latency (10-30s)

Excels at math, science, coding, multi-step logic

Does not support fine-tuning (as of 2025)

Watch Out for These

Mistake

GPT models understand language like humans do.

Correct

GPT models do not 'understand' in a human sense; they perform statistical pattern matching to predict the next token. They have no consciousness, emotions, or true comprehension. They generate text that appears meaningful because they have learned correlations from vast data.

Mistake

GPT-4 is 10 times better than GPT-3.5.

Correct

Improvements are task-dependent. GPT-4 shows significant gains in complex reasoning and accuracy, but for many routine tasks, the difference is marginal. The exam expects you to know that GPT-4 is 'more capable' but not a specific multiplier.

Mistake

o-series models are just GPT-4 with a different name.

Correct

o-series models are trained to work through an internal chain of thought before answering, which is a difference in training and inference behavior; OpenAI has not described them as a different architecture. They are optimized for complex problem-solving, not general chat. They have higher latency and cost.

Mistake

You can train a GPT model on your own data without fine-tuning.

Correct

Pre-trained models are static. To incorporate custom data, you must either fine-tune the model (update weights) or use RAG (retrieve relevant data at inference time). The exam emphasizes RAG as a common pattern.

Mistake

All GPT models are available in all Azure regions.

Correct

Model availability varies by region. GPT-4 may not be available in all regions due to capacity constraints. The exam may test that you need to check regional availability.

Frequently Asked Questions

What is the difference between GPT-3.5 and GPT-4 on Azure OpenAI?

GPT-3.5 (gpt-35-turbo) is a faster, cheaper model optimized for conversational tasks with a 4,096-token context window. GPT-4 is more capable, with better reasoning and creativity, offering up to 32,768 tokens, but at higher cost and latency. For the AI-900 exam, remember that GPT-4 is 'more capable' and GPT-3.5 is 'cost-effective'. Choose based on task complexity and budget.

What are o-series models and when should I use them?

o-series models (e.g., o1, o1-mini) are designed for complex reasoning tasks. They use internal chain-of-thought processing, spending more time 'thinking' before responding. Use them for math, science, coding, and multi-step logic problems. They have higher latency and cost than GPT-4. On the exam, they are associated with 'advanced reasoning'.

Can GPT models access the internet or my private data?

No, GPT models are static and only know what they were trained on (up to a cutoff date). They cannot access the internet or your private data unless you explicitly provide it in the prompt or use a RAG pattern with Azure Cognitive Search. The exam tests this limitation – do not assume GPT has real-time access.

What is the context window and why does it matter?

The context window is the maximum number of tokens the model can consider when generating a response. It includes both input and output tokens. For example, GPT-3.5 has a 4,096 token limit. If your prompt plus expected response exceeds this, you must truncate or split the input. The exam may ask about context window sizes for different models.
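The budget check described here is simple arithmetic: prompt tokens plus the reply allowance must fit inside the window. The sketch below also shows one common way to make a long chat fit, dropping the oldest messages first; the function names and token counts are illustrative.

```python
def fits_context(prompt_tokens, max_tokens, context_window):
    """True if the prompt plus the reserved reply space fits in the window."""
    return prompt_tokens + max_tokens <= context_window

def trim_history(message_token_counts, max_tokens, context_window):
    """Drop the oldest messages until the rest plus the reply budget fit."""
    kept = list(message_token_counts)
    while kept and sum(kept) + max_tokens > context_window:
        kept.pop(0)  # index 0 is the oldest message
    return kept

GPT_35_WINDOW = 4096

print(fits_context(4000, 100, GPT_35_WINDOW))  # False: 4,100 tokens needed
print(fits_context(3900, 100, GPT_35_WINDOW))  # True: 4,000 tokens needed

# Three earlier messages of 2,000, 1,500, and 900 tokens, with 500 reserved for the reply.
print(trim_history([2000, 1500, 900], 500, GPT_35_WINDOW))  # [1500, 900]
```

Summarizing older turns or retrieving only relevant chunks (RAG) are gentler alternatives to dropping messages outright.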

How do I control the creativity of GPT model outputs?

Use the temperature parameter (0-2). Lower temperature (e.g., 0.1) makes output more deterministic and focused. Higher temperature (e.g., 1.5) increases randomness and creativity. Top-p (nucleus sampling) also controls diversity. For factual tasks, use low temperature. For creative writing, use higher temperature. Default is 1.0.

What is the difference between zero-shot and few-shot learning in GPT models?

Zero-shot means the model performs a task without any examples. For example, asking 'Translate to French: Hello' without showing any translation examples. Few-shot means providing a few examples in the prompt to guide the model. GPT models are capable of both. The exam may test that few-shot improves performance for specific tasks.
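In the chat format, the only difference between the two is whether example exchanges appear in the message list before the real question. A minimal sketch, with a hypothetical helper named `build_messages`:

```python
def build_messages(task, query, examples=()):
    """Build a chat message list; examples are (input, expected output) pairs."""
    messages = [{"role": "system", "content": task}]
    for source, target in examples:
        # Each example is shown as a prior user turn and the ideal assistant reply.
        messages.append({"role": "user", "content": source})
        messages.append({"role": "assistant", "content": target})
    messages.append({"role": "user", "content": query})
    return messages

# Zero-shot: instruction and question only.
zero_shot = build_messages("Translate to French.", "Hello")

# Few-shot: the same request, preceded by two worked examples.
few_shot = build_messages(
    "Translate to French.",
    "Hello",
    examples=[("Good morning", "Bonjour"), ("Thank you", "Merci")],
)

print(len(zero_shot), len(few_shot))  # 2 6
```

Either list could be sent as the `messages` field of a chat completions request. The examples consume context window tokens, so few-shot prompts trade a little cost for better task adherence.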

Can I fine-tune GPT models on my own data?

Yes, Azure OpenAI supports fine-tuning for GPT-3.5 and GPT-4 (base models) but not for o-series as of 2025. Fine-tuning updates the model weights with your data, improving performance for domain-specific tasks. Alternatively, you can use RAG without fine-tuning. The exam expects you to know fine-tuning is available for some models.

Terms Worth Knowing

Artificial intelligence, Computer vision, Generative AI, Machine learning, Natural language processing, Responsible AI
