This chapter covers prompt engineering fundamentals, a critical skill for optimizing generative AI models like GPT-4 and DALL-E. On the AI-900 exam, this topic appears in objective 5.3 (Generative AI) and accounts for approximately 5-10% of questions. You will learn how to craft effective prompts, understand parameters like temperature and top-p, and apply techniques such as few-shot prompting and chain-of-thought reasoning. Mastery of these concepts is essential for passing the exam and for real-world AI solution development.
Imagine you are a master chef in a kitchen with an incredibly skilled but literal-minded assistant. The assistant can execute any recipe perfectly, but only if you provide unambiguous, step-by-step instructions. If you say 'make a sauce,' the assistant might boil tomatoes or whip cream—it has no context. Prompt engineering is like crafting that recipe. You specify ingredients (data), techniques (reasoning steps), and desired outcome (format) with precision. For example, instead of 'cook the chicken,' you write 'preheat oven to 375°F, season chicken breast with salt and pepper, bake for 25 minutes until internal temperature reaches 165°F.' The assistant follows exactly, no guessing. If you forget to specify 'no bones,' you get bone-in. If you say 'make it spicy' without defining 'spicy,' you might get ghost peppers. The chef must anticipate ambiguity and preempt it. This mirrors how prompts guide LLMs: the model has vast knowledge but needs explicit structure to produce reliable, controlled outputs. A well-engineered prompt reduces randomness, enforces format (JSON, list), and constrains the model to relevant knowledge domains. Just as a recipe yields consistent dishes across different chefs, a good prompt yields consistent responses across different model runs.
What is Prompt Engineering?
Prompt engineering is the practice of designing input text (prompts) to elicit desired outputs from large language models (LLMs) and other generative AI models. It is both an art and a science, requiring understanding of model internals, tokenization, and inference parameters. The goal is to reduce ambiguity, control output format, and improve accuracy without retraining the model.
Why Prompt Engineering Exists
Generative models are trained on vast corpora and can produce diverse responses. Without careful prompting, outputs can be irrelevant, incorrect, or unsafe. Prompt engineering emerged as a way to steer model behavior using only the input text, leveraging the model's in-context learning ability. It is a cost-effective alternative to fine-tuning, especially when labeled data is scarce.
How Prompt Engineering Works Internally
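When you send a prompt to an LLM, it is tokenized into subword units. The model processes the sequence through its transformer layers, attending to each token relative to the others. The final hidden state is projected to a score (logit) for every token in the vocabulary, and a softmax turns those scores into a probability distribution from which the next token is chosen. Prompt engineering does not change the model itself; it changes the context the model conditions on, by: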
Providing clear instructions (e.g., 'Summarize the following text in one sentence:')
Including examples (few-shot) to teach the model the desired pattern
Using role-playing (e.g., 'You are a helpful assistant')
Specifying output format (e.g., 'Respond in JSON: {"answer": ...}')
Adding constraints (e.g., 'Do not include explanations')
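The five elements above can be combined into one chat-style request. The following is a minimal sketch, not a prescribed layout: the function name `build_messages`, the "Output format:" and "Constraint:" labels, and the example summary are all hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Optional, Tuple


def build_messages(
    instruction: str,
    user_input: str,
    role: str = "You are a helpful assistant.",
    examples: Optional[List[Tuple[str, str]]] = None,
    output_format: Optional[str] = None,
    constraints: Optional[List[str]] = None,
) -> List[Dict[str, str]]:
    """Assemble a chat message list from role, instruction, examples, format, constraints."""
    # Role, output format, and constraints go into the system message.
    system_parts = [role]
    if output_format:
        system_parts.append(f"Output format: {output_format}")
    for rule in constraints or []:
        system_parts.append(f"Constraint: {rule}")
    messages = [{"role": "system", "content": "\n".join(system_parts)}]

    # Few-shot examples are replayed as earlier user/assistant turns.
    for example_input, example_output in examples or []:
        messages.append({"role": "user", "content": f"{instruction}\n{example_input}"})
        messages.append({"role": "assistant", "content": example_output})

    # The real input comes last, phrased exactly like the examples.
    messages.append({"role": "user", "content": f"{instruction}\n{user_input}"})
    return messages


messages = build_messages(
    instruction="Summarize the following text in one sentence:",
    user_input="Prompt engineering steers a model using only its input text.",
    examples=[
        (
            "The sky is blue because air scatters blue light.",
            '{"answer": "Air scatters blue light, so the sky looks blue."}',
        )
    ],
    output_format='JSON: {"answer": ...}',
    constraints=["Do not include explanations"],
)
for message in messages:
    print(message["role"], "->", message["content"])
```

Placing the format and constraints in the system message is one common choice; they could equally be repeated in the user turn if the model tends to ignore them.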
Key Components and Parameters
Temperature: Controls randomness. Range 0-2, default 1. Lower values (e.g., 0.2) make output more deterministic; higher (e.g., 0.8) increase creativity. On AI-900, know that temperature=0 yields nearly identical outputs for the same prompt.
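Top-p (nucleus sampling): Sets a cumulative probability threshold. The model samples from the smallest set of tokens whose cumulative probability exceeds p. Default is 1 (all tokens). A value of 0.9 means only the most likely tokens that together account for 90% of the probability mass are considered; the unlikely tail is dropped.
Max tokens: Limits output length. The upper bound varies by model (e.g., the original GPT-3.5 Turbo has a 4096-token context window shared by prompt and output). If the limit is reached, generation stops and the output may be cut off mid-sentence.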
Stop sequences: Specify strings that halt generation when encountered (e.g., 'END').
Frequency penalty: Reduces repetition by penalizing tokens that have already appeared. Range -2 to 2, default 0.
Presence penalty: Encourages the model to talk about new topics. Also range -2 to 2, default 0.
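To make the first two parameters concrete, here is a small sketch of temperature scaling and top-p filtering over a toy next-token distribution. The logit values and token names are invented for illustration, and the sketch rejects temperature=0, which a hosted service instead treats as "always pick the most likely token".

```python
import math
from typing import Dict


def softmax_with_temperature(logits: Dict[str, float], temperature: float) -> Dict[str, float]:
    """Divide each logit by the temperature, then normalize with softmax."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0 in this sketch")
    scaled = {tok: value / temperature for tok, value in logits.items()}
    peak = max(scaled.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(exps.values())
    return {tok: value / total for tok, value in exps.items()}


def top_p_filter(probs: Dict[str, float], p: float) -> Dict[str, float]:
    """Keep the smallest set of most likely tokens whose mass reaches p, renormalized."""
    kept: Dict[str, float] = {}
    cumulative = 0.0
    for tok, prob in sorted(probs.items(), key=lambda item: item[1], reverse=True):
        kept[tok] = prob
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(kept.values())
    return {tok: prob / total for tok, prob in kept.items()}


# Hypothetical logits for the token after "The capital of France is".
logits = {"Paris": 4.0, "Lyon": 2.0, "Nice": 1.0, "pizza": -1.0}

cold = softmax_with_temperature(logits, 0.2)
warm = softmax_with_temperature(logits, 1.0)
hot = softmax_with_temperature(logits, 2.0)
nucleus = top_p_filter(warm, 0.9)

for label, dist in [("T=0.2", cold), ("T=1.0", warm), ("T=2.0", hot), ("top_p=0.9", nucleus)]:
    print(label, {tok: round(prob, 3) for tok, prob in dist.items()})
```

Lower temperature concentrates probability on "Paris", higher temperature spreads it out, and top-p removes the unlikely tail entirely instead of reshaping the whole distribution.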
Prompt Engineering Techniques
- Zero-shot prompting: Ask directly without examples. Works well for simple tasks.
- Few-shot prompting: Provide a few input-output examples in the prompt. The model learns the pattern in-context. Example: 'Translate English to French: English: Hello French: Bonjour English: Goodbye French:'
- Chain-of-thought (CoT) prompting: Ask the model to reason step-by-step before giving the answer. Improves accuracy on arithmetic and logic tasks. Example: 'If John has 5 apples and eats 2, how many remain? Let's think step by step.'
- Self-consistency: Generate multiple outputs with high temperature and select the most common answer.
- Tree-of-thought: Explore multiple reasoning paths and evaluate them.
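Chain-of-thought and self-consistency combine naturally: append the step-by-step cue, sample several completions, and take a majority vote over the final answers. The sketch below assumes, purely for illustration, that each completion ends with "Answer: <value>", and it uses a canned stand-in (`fake_sample`) instead of a real model call.

```python
from collections import Counter
from typing import Callable


def with_chain_of_thought(question: str) -> str:
    """Append the step-by-step cue used in chain-of-thought prompting."""
    return f"{question} Let's think step by step."


def extract_final_answer(completion: str) -> str:
    """Take whatever follows the last 'Answer:' marker (an assumed output convention)."""
    return completion.rsplit("Answer:", 1)[-1].strip()


def self_consistent_answer(sample: Callable[[str], str], prompt: str, runs: int = 5) -> str:
    """Sample several completions and return the most common final answer."""
    answers = [extract_final_answer(sample(prompt)) for _ in range(runs)]
    return Counter(answers).most_common(1)[0][0]


# Canned completions standing in for five high-temperature samples.
canned = iter(
    [
        "5 - 2 = 3. Answer: 3",
        "He eats 2, so 5 - 2 = 3. Answer: 3",
        "5 + 2 = 7. Answer: 7",
        "Answer: 3",
        "Two are gone, three remain. Answer: 3",
    ]
)


def fake_sample(prompt: str) -> str:
    return next(canned)


prompt = with_chain_of_thought("If John has 5 apples and eats 2, how many remain?")
best = self_consistent_answer(fake_sample, prompt)
print(prompt)
print("Majority answer:", best)
```

One of the five sampled reasoning paths goes wrong, but the vote still recovers the answer that most paths agree on.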
Configuration and Verification
In Azure OpenAI, you configure prompt parameters in the API call:
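import os
from openai import AzureOpenAI  # openai Python package, version 1.x

# The environment variable names and API version below are placeholders; use your own values.
client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-02-01",
)

response = client.chat.completions.create(
    model="gpt-35-turbo",  # in Azure this is the name of your deployment
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"},
    ],
    temperature=0.7,       # randomness
    max_tokens=150,        # upper bound on output length
    top_p=0.95,            # nucleus sampling threshold
    frequency_penalty=0,
    presence_penalty=0,
    stop=["\n"],           # stop at the first newline
)
print(response.choices[0].message.content)

To verify prompt effectiveness, run multiple tests with different temperatures and compare outputs. Use metrics like BLEU for translation or exact match for factual questions.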
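As an illustration of the exact-match idea, the sketch below scores repeated runs against expected answers. The `generate` callable is a stand-in for a model call, and the normalization rules (lowercasing, dropping a trailing period) are assumptions that a real evaluation would need to tune.

```python
from typing import Callable, List, Tuple


def normalize(text: str) -> str:
    """Lowercase, trim, drop a trailing period, and collapse whitespace."""
    return " ".join(text.lower().strip().rstrip(".").split())


def exact_match_rate(
    generate: Callable[[str], str],
    cases: List[Tuple[str, str]],
    runs: int = 3,
) -> float:
    """Fraction of (prompt, run) pairs whose output matches the expected answer."""
    hits = 0
    total = 0
    for prompt, expected in cases:
        for _ in range(runs):
            total += 1
            if normalize(generate(prompt)) == normalize(expected):
                hits += 1
    return hits / total


# Stand-in for a model: one correct answer, one wrong answer.
canned_outputs = {
    "What is the capital of France?": "Paris.",
    "What is the capital of Japan?": "Kyoto",
}


def fake_generate(prompt: str) -> str:
    return canned_outputs[prompt]


cases = [
    ("What is the capital of France?", "Paris"),
    ("What is the capital of Japan?", "Tokyo"),
]
rate = exact_match_rate(fake_generate, cases)
print(f"Exact match: {rate:.0%}")
```

Running the same cases at several temperature settings and comparing the resulting rates gives a simple, repeatable way to see which setting is more reliable for factual questions.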
Interaction with Related Technologies
Prompt engineering works alongside retrieval-augmented generation (RAG), where external data is injected into the prompt to ground the model. It also integrates with content filtering to block harmful outputs. In Azure, you can use Prompt Flow to orchestrate complex prompt chains.
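The RAG pattern can be sketched at the prompt level as follows. Retrieval itself is out of scope here: the passages are hard-coded, and the function name and prompt wording are hypothetical.

```python
from typing import List


def build_grounded_prompt(question: str, passages: List[str]) -> str:
    """Inject retrieved passages into the prompt so the model answers from them."""
    sources = "\n".join(f"[{i}] {text}" for i, text in enumerate(passages, start=1))
    return (
        "Answer the question using only the sources below. "
        "If the sources do not contain the answer, reply \"I don't know.\"\n\n"
        f"Sources:\n{sources}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Hard-coded passages standing in for the output of a search step.
passages = [
    "Invoices are issued on the first day of each month.",
    "Late payments incur a fee after 14 days.",
]
grounded_prompt = build_grounded_prompt("When are invoices issued?", passages)
print(grounded_prompt)
```

Numbering the sources makes it easy to extend the instruction later, for example by asking the model to cite the passage it used.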
Define Task Objective
Clearly state what you want the model to do. For example, 'Summarize the following article in 50 words.' Avoid vague verbs like 'explain' or 'describe' unless you specify constraints. The model uses the instruction to set the context for generation. A well-defined objective reduces the search space of possible outputs.
Choose Prompt Format
Decide on zero-shot, few-shot, or chain-of-thought. For factual queries, zero-shot may suffice. For tasks that need a specific format or style of reasoning, use few-shot with examples that mirror the expected output. For multi-step problems, chain-of-thought forces the model to show intermediate steps, improving accuracy on math tasks by as much as 30% in some reported cases; the actual gain depends heavily on the model and the task.
Set Model Parameters
Configure temperature, top-p, max tokens, and penalties. For deterministic tasks (e.g., code generation), set temperature=0. For creative writing, set temperature=0.8. Top-p=0.9 is a safe default. Max tokens should be slightly larger than expected output length to avoid truncation.
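One way to keep these settings consistent is a small preset table. The sketch below uses the values suggested above; the preset names, the `request_params` helper, and the 20% headroom on max tokens are assumptions made for illustration.

```python
from typing import Dict, Union

Number = Union[int, float]

# Values taken from the guidance above; the preset names are hypothetical.
PRESETS: Dict[str, Dict[str, Number]] = {
    "deterministic": {"temperature": 0.0, "top_p": 0.9},
    "creative": {"temperature": 0.8, "top_p": 0.9},
}


def request_params(
    task_type: str,
    expected_output_tokens: int,
    headroom_percent: int = 20,
) -> Dict[str, Number]:
    """Return sampling parameters with max_tokens slightly above the expected length."""
    if task_type not in PRESETS:
        raise ValueError(f"unknown task type: {task_type!r}")
    params: Dict[str, Number] = dict(PRESETS[task_type])
    # Integer ceiling of the headroom, to avoid floating-point surprises.
    extra = (expected_output_tokens * headroom_percent + 99) // 100
    params["max_tokens"] = expected_output_tokens + extra
    params["frequency_penalty"] = 0
    params["presence_penalty"] = 0
    return params


code_params = request_params("deterministic", expected_output_tokens=100)
story_params = request_params("creative", expected_output_tokens=400)
print(code_params)
print(story_params)
```

The resulting dictionaries can be unpacked into an API call with `**code_params`, which keeps parameter choices in one reviewable place.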
Craft Prompt Text
Write the prompt using clear language. Use delimiters (e.g., ###, """) to separate instructions from input. Include role assignment (e.g., 'You are a math tutor'). Specify output format (e.g., 'List each step as a bullet point'). Avoid leading questions that bias the answer.
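The delimiter advice can be sketched as a small template function. The layout and the function name are hypothetical; the one design point worth noting is the check that the input does not itself contain the delimiter, since that would blur the line between instructions and data.

```python
def build_delimited_prompt(
    role: str,
    instruction: str,
    output_format: str,
    user_text: str,
    delimiter: str = "###",
) -> str:
    """Wrap the input in delimiters so it cannot be mistaken for instructions."""
    if delimiter in user_text:
        raise ValueError("input text contains the delimiter; pick another one")
    return (
        f"{role}\n"
        f"{instruction} The text is enclosed in {delimiter} markers.\n"
        f"{output_format}\n\n"
        f"{delimiter}\n{user_text}\n{delimiter}"
    )


tutor_prompt = build_delimited_prompt(
    role="You are a math tutor.",
    instruction="Solve the problem in the text below.",
    output_format="List each step as a bullet point.",
    user_text="A train travels 120 km in 2 hours. What is its average speed?",
)
print(tutor_prompt)
```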
Test and Iterate
Run the prompt with different inputs. Evaluate output for accuracy, relevance, and format compliance. Adjust parameters or add examples based on failures. For example, if output is too verbose, add 'Be concise' or an explicit word limit; if it repeats itself, increase the frequency penalty. Use a holdout set to measure performance.
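A lightweight way to organize this loop is to classify each failure so that the fix is obvious. The sketch below assumes the prompt asked for JSON with an "answer" key and a word limit; the category names and sample outputs are invented for illustration.

```python
import json
from collections import Counter
from typing import List


def check_output(output: str, max_words: int) -> List[str]:
    """Return a list of problems found in one model output (empty if it passes)."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return ["not valid JSON"]
    if not isinstance(data, dict) or "answer" not in data:
        return ['missing "answer" key']
    problems = []
    if len(str(data["answer"]).split()) > max_words:
        problems.append("too verbose")
    return problems


def failure_report(outputs: List[str], max_words: int = 50) -> Counter:
    """Count how often each kind of problem occurs across a batch of outputs."""
    report: Counter = Counter()
    for output in outputs:
        report.update(check_output(output, max_words))
    return report


# Invented outputs standing in for responses to a holdout set.
sample_outputs = [
    '{"answer": "Paris"}',
    "The answer is Paris.",
    '{"result": "Paris"}',
    '{"answer": "' + "word " * 60 + '"}',
]
report = failure_report(sample_outputs)
print(dict(report))
```

Each category points at a different remedy: invalid JSON or a missing key suggests a clearer format instruction or a few-shot example, while "too verbose" suggests a stricter length constraint.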
Enterprise Scenario 1: Customer Support Chatbot
A large telecom company uses GPT-4 to handle customer inquiries. They deploy prompt engineering to ensure responses are accurate and follow company policy. The prompt includes: 'You are a customer support agent for XYZ Telecom. Answer only based on the provided knowledge base. If unsure, say "I don't know." Do not share personal information.' They also use few-shot examples of common issues (e.g., billing errors). With temperature=0.2, responses are consistent. Misconfiguration (e.g., temperature=1) led to hallucinations where the bot invented policies, causing customer confusion.
Enterprise Scenario 2: Code Generation for Developers
A software company uses GitHub Copilot, which relies on prompt engineering. Developers write comments as prompts (e.g., '// function to calculate factorial'). The model generates code. To improve accuracy, they use few-shot prompts with correct code examples. They set max_tokens to 500 to avoid incomplete functions. A common issue: if the prompt is too vague (e.g., '// do something'), the model generates irrelevant code. They learned to be specific: '// Python function to compute Fibonacci numbers using recursion.'
Enterprise Scenario 3: Content Moderation
A social media platform uses Azure OpenAI to moderate posts. The prompt is: 'Classify the following text as safe or unsafe. If unsafe, specify the category: hate speech, harassment, violence. Do not explain.' They use temperature=0 for deterministic classification. They also include few-shot examples of borderline cases. An earlier, looser sampling configuration produced inconsistent labels; after tightening it (temperature=0, top_p=0.9), labels became more consistent. Note that top_p=1 is the default, not an error in itself: at temperature=0 the most likely token is chosen almost every time, so temperature is the main control here. They also faced token limits: long posts were truncated, so they pre-processed text to fit within 2000 tokens.
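The pre-processing step in this scenario might look like the sketch below. It is only an approximation: it counts whitespace-separated words as a rough stand-in for tokens, whereas a real tokenizer splits text into subword units and will usually report a higher count. The function names are hypothetical, and a real tokenizer's counting function can be passed in through `count_tokens`.

```python
from typing import Callable


def rough_token_count(text: str) -> int:
    """Crude stand-in for a tokenizer: one token per whitespace-separated word."""
    return len(text.split())


def truncate_to_budget(
    text: str,
    max_tokens: int = 2000,
    count_tokens: Callable[[str], int] = rough_token_count,
) -> str:
    """Keep the longest word-aligned prefix of text that fits the token budget."""
    words = text.split()
    low, high = 0, len(words)
    # Binary search for the largest prefix within budget
    # (assumes the token count never decreases as words are added).
    while low < high:
        mid = (low + high + 1) // 2
        if count_tokens(" ".join(words[:mid])) <= max_tokens:
            low = mid
        else:
            high = mid - 1
    return " ".join(words[:low])


long_post = "word " * 2500
shortened = truncate_to_budget(long_post)
print(rough_token_count(long_post), "->", rough_token_count(shortened))
```

Cutting the tail is the simplest policy; for moderation it may hide content at the end of a post, so splitting long posts into chunks and classifying each one is a reasonable alternative.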
AI-900 Objective 5.3: Prompt Engineering
The exam tests understanding of prompt engineering as a technique to optimize generative AI outputs without retraining.
You must know the effect of temperature: lower values (near 0) produce more deterministic outputs; higher values increase randomness.
Know that top-p (nucleus sampling) controls diversity by limiting the set of tokens considered.
Understand few-shot prompting: providing examples in the prompt to guide the model.
Recognize chain-of-thought prompting as a method to improve reasoning by asking the model to show its work.
Common Wrong Answers
'Prompt engineering requires retraining the model.' Incorrect — it is a zero-shot or few-shot technique that does not modify model weights.
'Temperature controls the number of tokens generated.' Wrong — temperature affects randomness; max_tokens controls length.
'Top-p is the same as temperature.' No — top-p is cumulative probability; temperature scales logits.
'Few-shot prompting means the model is fine-tuned.' Incorrect — few-shot uses in-context learning without weight updates.
Exam-Specific Values
Default temperature: 1.0
Default top-p: 1.0
Temperature range: 0 to 2
Top-p range: 0 to 1
Penalty range: -2 to 2
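Token limits vary by model (e.g., the original GPT-3.5 Turbo: 4096-token context window; the original GPT-4: 8192). The context window is shared by the prompt and the generated output.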
Edge Cases
Temperature=0: Still not completely deterministic due to floating point, but nearly so.
Top-p=0: Not allowed; must be >0.
Stop sequences: If not set, model may continue generating beyond desired end.
Few-shot example count: Typically 2-5 examples; too many can confuse the model.
How to Eliminate Wrong Answers
If answer mentions 'retraining' or 'fine-tuning', it is wrong for prompt engineering.
If answer confuses temperature with max tokens, eliminate.
If answer says top-p is a penalty, it is wrong.
If answer claims chain-of-thought reduces output length, it is false (it usually increases).
Prompt engineering optimizes LLM outputs without retraining.
Temperature controls randomness: 0 = nearly deterministic, 1 = default, 2 = very random.
Top-p (nucleus sampling) limits token selection to a cumulative probability threshold.
Few-shot prompting provides examples to guide model behavior.
Chain-of-thought prompting improves reasoning by asking for step-by-step thinking.
Max tokens limits output length; when the limit is reached, the output is truncated.
Stop sequences halt generation when encountered.
Frequency and presence penalties reduce repetition and encourage novelty.
Prompt engineering is a key skill for Azure OpenAI solutions.
AI-900 tests understanding of these parameters and techniques.
These come up on the exam all the time. Here's how to tell them apart.
Zero-Shot Prompting
No examples provided in prompt.
Relies entirely on model's pre-trained knowledge.
Best for simple, well-defined tasks.
May fail on complex or ambiguous tasks.
Lower token usage (no examples).
Few-Shot Prompting
Includes 2-5 input-output examples.
Teaches the model desired pattern via in-context learning.
Better for tasks requiring specific format or reasoning.
Can improve accuracy on complex tasks; gains of 10-30% are sometimes cited, but they vary by model and task.
Higher token usage; risk of exceeding context length.
Mistake
Higher temperature always produces better results.
Correct
Higher temperature increases randomness; it is better for creative tasks but worse for factual accuracy. For deterministic tasks, temperature should be low (0-0.2).
Mistake
Prompt engineering is only about writing clear instructions.
Correct
It also involves setting parameters (temperature, top-p), using examples (few-shot), and structuring prompts with delimiters and roles. Clear instructions are just one part.
Mistake
Few-shot prompting requires hundreds of examples.
Correct
Few-shot typically uses 2-5 examples. The model learns from context, not from many examples. More examples can exceed token limits.
Mistake
Chain-of-thought prompting always improves accuracy.
Correct
CoT helps for reasoning tasks but may not help for simple factual queries. It also increases token usage and latency.
Mistake
Top-p and temperature are independent and can be set arbitrarily.
Correct
They interact: using both may over-constrain output. It is recommended to adjust one at a time. Typically, temperature is adjusted first.
Temperature scales the logits before softmax, controlling the randomness of token selection. Higher temperature makes the distribution more uniform, increasing randomness. Top-p (nucleus sampling) selects the smallest set of tokens whose cumulative probability exceeds a threshold p, then samples from that set. Both control output diversity but through different mechanisms. On AI-900, know that temperature is more commonly adjusted; top-p is often left at 1. Use temperature=0 for deterministic outputs.
Typically 2-5 examples. Using too few may not establish a clear pattern; using too many may exceed the model's context window or confuse the model. The examples should be diverse and representative of the task. For the AI-900 exam, know that few-shot uses a small number of examples (not hundreds).
Yes, prompt engineering applies to text-to-image models. You specify subject, style, lighting, composition, and other attributes. For example, 'a photorealistic cat wearing a hat, digital art, soft lighting.' Parameters like temperature may not apply, but the concept of crafting precise instructions is similar. On AI-900, this is covered under generative AI for images.
Chain-of-thought (CoT) prompting asks the model to reason step-by-step before giving the final answer. It improves performance on arithmetic, logic, and multi-step reasoning tasks. Use it when the task requires intermediate steps. On AI-900, CoT is a key technique for improving accuracy in reasoning scenarios.
Yes, by adding safety instructions in the system message, such as 'Do not generate harmful content.' However, it is not foolproof. Azure also provides content filtering and responsible AI tools. On the exam, know that prompt engineering is a layer of defense, but not a complete solution.
The system message sets the behavior and persona of the assistant. It is a form of prompt engineering. For example, 'You are a helpful assistant that answers questions concisely.' It influences the entire conversation. On AI-900, understand that system messages are part of prompt design.
Max tokens sets the maximum number of tokens the model can generate. If the model reaches this limit, it stops generating, which may truncate the response. Set it slightly above expected output length. On the exam, know that reaching the max tokens limit causes truncation, not an error.
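Because truncation is silent, it helps to check for it explicitly. In the OpenAI-style chat completions response, each choice carries a `finish_reason`, and the value "length" indicates that generation stopped at the token limit. The sketch below works on plain dictionaries shaped like that response; the sample data is invented.

```python
from typing import Any, Dict


def was_truncated(response: Dict[str, Any]) -> bool:
    """True if any choice stopped because it hit the max tokens limit."""
    return any(
        choice.get("finish_reason") == "length"
        for choice in response.get("choices", [])
    )


# Invented responses shaped like a chat completions result.
complete = {"choices": [{"finish_reason": "stop", "message": {"content": "Paris."}}]}
cut_off = {"choices": [{"finish_reason": "length", "message": {"content": "The capital of"}}]}

print(was_truncated(complete))
print(was_truncated(cut_off))
```

When the check returns true, the usual responses are to raise max tokens, ask for a shorter answer in the prompt, or send a follow-up request that continues from the cut-off point.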