This chapter covers the Azure OpenAI Service, a pivotal offering for generative AI workloads on Azure. For the AI-900 exam, understanding Azure OpenAI is critical because it sits within the generative AI objectives, a domain that typically accounts for 15–20% of exam questions. We will explore its architecture, key models, use cases, and how to deploy and interact with the service responsibly. By the end, you will be able to distinguish Azure OpenAI from other AI services and answer exam questions confidently.
Imagine a master chef's kitchen where every ingredient, recipe, and technique is stored in a vast library. The chef (Azure OpenAI Service) doesn't cook from scratch every time; instead, she uses pre-trained 'base recipes' (foundation models like GPT-4) that have been perfected over years. When a customer (your application) orders a specific dish (a completion or chat response), the chef selects the base recipe, then adjusts it based on the customer's preferences (the prompt). She can also add custom spices (fine-tuning) to make the dish unique for repeat customers. The kitchen has strict quality controls (content filters) to ensure no spoiled ingredients (harmful content) go into the dish. The chef can handle multiple orders simultaneously (scalability) and can even create new recipes on the fly (code generation). If a customer wants to use their own secret ingredient (your own data), the chef can incorporate it via a special process (Retrieval Augmented Generation or RAG) to make the dish more personalized. The kitchen also provides a menu (Azure AI Studio) to help customers design their orders without talking directly to the chef. This entire kitchen is not in one location but is distributed across multiple cloud kitchens (Azure regions) for reliability and speed. The key is that the chef does the heavy lifting of understanding language, generating text, and ensuring safety, while the customer simply describes what they want.
What is Azure OpenAI Service?
Azure OpenAI Service is Microsoft's cloud offering that provides REST API access to OpenAI's powerful language models, including GPT-4, GPT-3.5, and the DALL-E image generation model. It is designed to integrate seamlessly with Azure's ecosystem, offering enterprise-grade security, compliance, and scalability. Unlike directly using OpenAI's API, Azure OpenAI runs within Microsoft's Azure infrastructure, ensuring data residency, private networking (via Azure Virtual Network), and managed identity authentication. This makes it suitable for organizations that require strict data governance.
Why Azure OpenAI Exists
Before Azure OpenAI, developers had to either build their own large language models (LLMs), an extremely resource-intensive task, or use third-party APIs that might not meet enterprise compliance requirements. Azure OpenAI bridges this gap by providing pre-trained, state-of-the-art models as a managed service. It abstracts away the complexity of model hosting, scaling, and updating. For the AI-900 exam, you must understand that Azure OpenAI is a 'Platform as a Service' (PaaS) offering for generative AI, not a 'Software as a Service' (SaaS) offering: you still write code to call the API.
How Azure OpenAI Works Internally
When you send a request to Azure OpenAI, it goes through several layers:
- Authentication: Every request must include an API key or an Azure Active Directory (Azure AD, now Microsoft Entra ID) token. The service validates this against your Azure subscription.
- Content Filtering: Before the request reaches the model, it passes through Azure AI Content Safety filters. These check for hate, self-harm, sexual, and violence content. If the prompt violates policies, it is rejected with a 400 error.
- Model Inference: The request is routed to the specific model deployment you created. The model processes the input (prompt) using its transformer architecture, generating tokens sequentially. Key parameters such as temperature (controls randomness; 0.7 is a common starting value, while the REST API falls back to 1 if the parameter is omitted) and max_tokens (limits output length; the default varies by model) affect generation.
- Response Filtering: The generated output is also filtered for harmful content before being returned to your application.
- Logging and Monitoring: All requests can be logged to Azure Monitor and Azure Log Analytics for auditing and debugging.
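To make the role of temperature in the model inference step concrete, here is a minimal sketch of temperature-scaled softmax over a handful of made-up scores. The function name and the logit values are illustrative assumptions; the service does not expose its sampling internals, and this sketch rejects a temperature of 0 even though the API accepts it.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw token scores into probabilities, scaled by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive in this sketch")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for three candidate next tokens.
logits = [2.0, 1.0, 0.1]
low = softmax_with_temperature(logits, 0.3)   # sharper: top token dominates
high = softmax_with_temperature(logits, 1.2)  # flatter: more variety
```

A lower temperature concentrates probability on the highest-scoring token, which is why low values give more deterministic answers and high values give more varied ones.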
Key Components and Their Defaults
Models: Azure OpenAI offers multiple models. The most common for text are GPT-3.5-Turbo and GPT-4. GPT-4 is more capable but also more expensive. For code, Codex models (based on GPT-3) are available, and DALL-E 2 and DALL-E 3 handle image generation. Each model has a context window, the maximum number of tokens the prompt and completion can occupy together: GPT-4 supports up to 32,768 tokens and GPT-3.5-Turbo up to 16,384 tokens, depending on the model version. A token is roughly 0.75 words, so 1,000 tokens is about 750 words.
Deployments: You must create a deployment for each model you want to use. Deployments define the model, version, and scaling options (throughput). You can have multiple deployments of the same model with different configurations.
Pricing: Azure OpenAI is pay-as-you-go based on tokens consumed. Rates vary by model. For example, GPT-4 costs $0.03 per 1K prompt tokens and $0.06 per 1K completion tokens (as of 2025). Reservations and committed use discounts are available.
Rate Limits: Default limits are set per deployment, typically 1,000 requests per minute (RPM) and up to 100,000 tokens per minute (TPM). These can be increased via support ticket.
Content Filters: Default filters are configured at the resource level. You can adjust severity thresholds (low, medium, high) for each category. The service uses Microsoft's Responsible AI standards.
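The figures above (0.75 words per token, the example GPT-4 rates, and the typical default limits) can be combined into some quick budgeting arithmetic. The following is a rough sketch, not an official calculator: the function names are hypothetical, the word-based token estimate is only a rule of thumb, and real rates and limits should be taken from your own deployment and the pricing page.

```python
import math

WORDS_PER_TOKEN = 0.75  # rule of thumb quoted in the text

def estimate_tokens(word_count):
    """Approximate token count from a word count."""
    return math.ceil(word_count / WORDS_PER_TOKEN)

def fits_context(prompt_tokens, max_tokens, context_window):
    """True if the prompt plus the requested completion fits the window."""
    return prompt_tokens + max_tokens <= context_window

def estimate_cost(prompt_tokens, completion_tokens,
                  prompt_rate_per_1k=0.03, completion_rate_per_1k=0.06):
    """Cost in dollars, using the example GPT-4 rates as defaults."""
    return (prompt_tokens / 1000 * prompt_rate_per_1k
            + completion_tokens / 1000 * completion_rate_per_1k)

def max_requests_per_minute(tokens_per_request,
                            rpm_limit=1000, tpm_limit=100_000):
    """Sustained request rate allowed by both the RPM and TPM limits."""
    return min(rpm_limit, tpm_limit // tokens_per_request)

prompt_tokens = estimate_tokens(750)                 # about 1,000 tokens
cost = estimate_cost(prompt_tokens, 500)             # prompt plus 500 output tokens
ok = fits_context(prompt_tokens, 150, 16_384)        # GPT-3.5-Turbo window
rate = max_requests_per_minute(500)                  # TPM is the binding limit
```

In this example the token limit, not the request limit, caps throughput: at 500 tokens per request, 100,000 TPM allows only 200 requests per minute.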
Configuration and Verification Commands
You interact with Azure OpenAI primarily via REST API calls. Here's an example using curl to generate a chat completion:
curl "https://<your-resource>.openai.azure.com/openai/deployments/<deployment-id>/chat/completions?api-version=2024-02-15-preview" \
  -H "Content-Type: application/json" \
  -H "api-key: <your-api-key>" \
  -d '{
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "What is the capital of France?"}
    ],
    "temperature": 0.7,
    "max_tokens": 150
  }'
To verify your deployment, you can use Azure CLI:
az cognitiveservices account list --resource-group <rg-name> --output table
Or use Azure AI Studio (https://ai.azure.com) to test prompts interactively.
How Azure OpenAI Interacts with Related Technologies
Azure Cognitive Search (now Azure AI Search): Used for Retrieval Augmented Generation (RAG). You can index your own data and have relevant chunks retrieved and passed to the model to ground its responses.
Azure Functions and Logic Apps: Can trigger OpenAI calls based on events.
Azure Bot Service: Integrate OpenAI to power conversational bots.
Azure AI Content Safety: Pre-built content filters are applied automatically; you can also call Content Safety API separately for more granular control.
Azure Monitor: Track token usage, latency, and errors.
Azure Virtual Network: Restrict access to the OpenAI endpoint from within your VNet.
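The RAG pattern mentioned for the search integration above can be illustrated with a toy example. This sketch replaces the search index with a naive word-overlap ranking over an in-memory list, so it is not how Azure's search service scores documents; the function names and sample documents are invented for illustration. It only shows the shape of the idea: retrieve relevant chunks, then place them in the prompt.

```python
def score(query, chunk):
    """Toy relevance: number of lowercase words shared by query and chunk."""
    return len(set(query.lower().split()) & set(chunk.lower().split()))

def retrieve(query, chunks, top_k=2):
    """Return the top_k chunks with the highest overlap score."""
    ranked = sorted(chunks, key=lambda ch: score(query, ch), reverse=True)
    return ranked[:top_k]

def build_grounded_messages(query, chunks, top_k=2):
    """Build a chat 'messages' array whose system prompt carries the sources."""
    context = "\n".join(f"- {c}" for c in retrieve(query, chunks, top_k))
    system = "Answer using only the sources below.\n" + context
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": query},
    ]

# Invented sample documents standing in for an indexed knowledge base.
docs = [
    "Returns are accepted within 30 days of delivery.",
    "Standard shipping takes 3 to 5 business days.",
    "Gift cards never expire.",
]
messages = build_grounded_messages("How long does shipping take?", docs, top_k=1)
```

The resulting messages array can be sent as the body of a chat completion request; the model never needs to be fine-tuned on the documents.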
Exam-Relevant Details
Azure OpenAI is available in limited regions: East US, South Central US, West Europe, France Central, etc. Not all regions support all models.
The service supports both API key and Azure AD authentication. For production, Azure AD is recommended.
Fine-tuning is available for GPT-3.5-Turbo and GPT-4 (limited preview). It requires a separate training job and incurs additional costs.
The default content filter severity is 'medium'. You can turn off filters but only with explicit approval.
Azure OpenAI does NOT store your data unless you enable logging. Microsoft does not use your data to retrain models.
Step-by-Step: Making a Simple Call
Create an Azure OpenAI resource in Azure Portal. Choose a region and pricing tier.
Deploy a model via Azure AI Studio or Azure CLI. Select model and version.
Generate an API key from the resource's 'Keys and Endpoint' blade.
Construct a request with the endpoint, deployment ID, and API key.
Send the request using your programming language of choice.
Parse the response which contains the generated text and token usage.
Common Pitfalls
Using wrong API version: Always use a supported API version (e.g., 2024-02-15-preview).
Exceeding token limits: The prompt + max_tokens must not exceed the model's context window.
Not handling rate limits: Implement retry logic with exponential backoff.
Ignoring content filters: Prompts or responses may be blocked; handle 400 errors gracefully.
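The retry advice above can be sketched as follows. This is a minimal illustration under stated assumptions: `send` is a hypothetical zero-argument callable that returns a `(status, body)` pair, only 429 is treated as retryable, and the delays double from one second. Production code would usually add random jitter and a cap on the delay.

```python
import time

RETRYABLE = {429}  # assumption: only rate-limit responses are retried here

def call_with_backoff(send, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call send() until it returns a non-retryable status or attempts run out."""
    for attempt in range(max_attempts):
        status, body = send()
        if status not in RETRYABLE:
            return status, body
        if attempt < max_attempts - 1:
            sleep(base_delay * (2 ** attempt))  # wait 1s, 2s, 4s, ...
    return status, body  # still rate limited after the final attempt

# Simulated service: two rate-limit responses, then success.
responses = iter([(429, None), (429, None), (200, {"ok": True})])
delays = []  # record the waits instead of actually sleeping
status, body = call_with_backoff(lambda: next(responses), sleep=delays.append)
```

A 400 response is returned immediately rather than retried, since resending a prompt that was blocked by the content filter will fail again.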
Create Azure OpenAI Resource
Navigate to the Azure Portal and select 'Create a resource'. Search for 'Azure OpenAI' and click 'Create'. Fill in the subscription, resource group, region, and name. Choose the pricing tier (Standard S0 is typical). Review and create. This step provisions the base infrastructure, including an endpoint URL and initial API keys. The resource is a Cognitive Services account, so you can manage it alongside other AI services. Note: Some regions require explicit approval for access; you may need to request access via a form.
Deploy a Model
Once the resource is created, go to Azure AI Studio (https://ai.azure.com). Under 'Deployments', click 'Create new deployment'. Select a model like GPT-4 or GPT-3.5-Turbo. Choose a model version (e.g., 0613). Set the deployment name (e.g., 'gpt-4-deployment'). Configure throughput: you can set a rate limit (e.g., 1K TPM). This deployment creates an endpoint path: /openai/deployments/{deployment-name}. You can have multiple deployments per resource. The deployment is where all API calls are directed.
Get API Key and Endpoint
In the Azure Portal, go to your OpenAI resource. Under 'Keys and Endpoint', you'll see two keys (Key 1 and Key 2). Use either for authentication. Also note the endpoint URL, e.g., https://myopenai.openai.azure.com/. This URL is used in all API calls. For security, rotate keys regularly and use Azure Key Vault to store them. In production, prefer Azure AD authentication using managed identities instead of keys.
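The two authentication options differ only in the header they send. The sketch below shows both, assuming the key is kept out of source code in an environment variable; `AZURE_OPENAI_API_KEY` and `build_auth_headers` are names chosen for this example, and obtaining the Azure AD token itself (for instance through a managed identity) is outside its scope.

```python
import os

def build_auth_headers(api_key=None, bearer_token=None):
    """Return request headers for either key-based or token-based auth."""
    headers = {"Content-Type": "application/json"}
    if bearer_token:
        # Azure AD tokens are sent as a standard bearer token.
        headers["Authorization"] = f"Bearer {bearer_token}"
    elif api_key:
        # Key 1 or Key 2 from the 'Keys and Endpoint' blade.
        headers["api-key"] = api_key
    else:
        raise ValueError("provide an API key or a bearer token")
    return headers

# Assumed variable name; fall back to a placeholder if it is not set.
key = os.environ.get("AZURE_OPENAI_API_KEY", "<your-api-key>")
headers = build_auth_headers(api_key=key)
```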
Construct and Send API Request
Use the endpoint, deployment ID, and API key to build a REST call. For chat models, the request body includes a 'messages' array with system, user, and assistant roles. Set parameters like temperature (0.0-2.0), max_tokens, top_p, and frequency_penalty. Example using Python: requests.post(url, headers={'api-key': key}, json=payload). The response contains 'choices' array with the assistant's reply and 'usage' object with prompt_tokens, completion_tokens, total_tokens. Handle HTTP errors: 401 for bad key, 429 for rate limit, 400 for content filter.
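As a rough sketch of how these pieces fit together, the helper below assembles the URL, headers, and body for a chat completion call without sending it. The function names are hypothetical, and the default API version is simply the one used in this chapter's examples. Sending the request would need an HTTP client such as the third-party `requests` library.

```python
def build_chat_request(endpoint, deployment, api_key, user_text,
                       api_version="2024-02-15-preview",
                       temperature=0.7, max_tokens=150):
    """Assemble (url, headers, payload) for a chat completion call."""
    url = (f"{endpoint.rstrip('/')}/openai/deployments/{deployment}"
           f"/chat/completions?api-version={api_version}")
    headers = {"Content-Type": "application/json", "api-key": api_key}
    payload = {
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": user_text},
        ],
        "temperature": temperature,
        "max_tokens": max_tokens,
    }
    return url, headers, payload

def describe_status(status_code):
    """Map the HTTP errors named in the text to a short explanation."""
    return {
        400: "bad request, possibly a content filter block",
        401: "bad or missing key",
        429: "rate limit hit",
    }.get(status_code, "other")

url, headers, payload = build_chat_request(
    "https://myopenai.openai.azure.com/", "gpt-4-deployment",
    "<your-api-key>", "What is the capital of France?")
# To send: requests.post(url, headers=headers, json=payload)
```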
Handle Response and Monitor Usage
Parse the JSON response to extract the assistant's message. The 'choices' array typically has one element unless n>1. Monitor token usage to control costs. Use Azure Monitor to set up alerts on token consumption or error rates. For high-traffic applications, implement caching of common responses to reduce costs. Also, consider using streaming (stream=True) to get partial results for better user experience. The API returns tokens incrementally when streaming is enabled.
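Parsing the response might look like the sketch below. The `sample` dictionary is a hand-written illustration of the fields described above (the 'choices' array and 'usage' object), not captured service output, and `parse_chat_response` is a hypothetical helper name.

```python
def parse_chat_response(data):
    """Extract the first reply and token usage from a chat completion body."""
    choice = data["choices"][0]  # one element unless n > 1 was requested
    usage = data.get("usage", {})
    return {
        "text": choice["message"]["content"],
        "finish_reason": choice.get("finish_reason"),
        "prompt_tokens": usage.get("prompt_tokens", 0),
        "completion_tokens": usage.get("completion_tokens", 0),
        "total_tokens": usage.get("total_tokens", 0),
    }

# Hand-written example body with the fields named in the text.
sample = {
    "choices": [
        {"message": {"role": "assistant", "content": "Paris."},
         "finish_reason": "stop"}
    ],
    "usage": {"prompt_tokens": 24, "completion_tokens": 2, "total_tokens": 26},
}
result = parse_chat_response(sample)
```

Logging `total_tokens` per call gives an application-side view of consumption that can be compared against the metrics in Azure Monitor.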
Enterprise Scenario 1: Customer Support Chatbot
A large e-commerce company deploys Azure OpenAI to power a customer support chatbot. They use GPT-3.5-Turbo for its balance of cost and quality. The chatbot is integrated with Azure Bot Service and Azure Cognitive Search for retrieving product information. The system handles thousands of concurrent conversations. Key considerations: They use Azure AD authentication for security, set rate limits to 10,000 TPM, and implement a fallback to human agents when confidence is low. They also fine-tune the model on past support tickets to improve accuracy. Misconfiguration: Initially, they set temperature too high (1.2), causing the bot to generate creative but irrelevant answers. They lowered it to 0.3 for more deterministic responses.
Enterprise Scenario 2: Legal Document Summarization
A law firm uses Azure OpenAI to summarize lengthy contracts. They deploy GPT-4 for its superior reasoning and long context window. The system processes documents via Azure Functions, sends them to the API, and stores summaries in Azure Blob Storage. Compliance is critical: they enable logging to Azure Monitor for audit trails and ensure no data is stored by Microsoft. They also use Azure Private Endpoint to keep traffic within their VNet. Problem: They initially used the default content filter, which blocked some legal terms. They adjusted severity to 'low' for the 'hate' category after consulting Microsoft.
Enterprise Scenario 3: Code Generation for Developers
A software company integrates Azure OpenAI into their IDE plugin to assist developers. They use the Codex model (based on GPT-3) for code completion. The plugin sends code context as prompts and receives suggestions. They handle rate limiting by batching requests and using exponential backoff. Performance: Average response time is under 2 seconds for short completions. They also use fine-tuning on their internal codebase to improve suggestions. Common misconfiguration: Developers sometimes exceed the max_tokens limit, causing truncated outputs. They now set max_tokens to 256 and use multiple calls for longer completions.
What AI-900 Tests on Azure OpenAI
AI-900 objective 5.2 focuses on identifying capabilities of Azure OpenAI Service. You need to know:
What models are available (GPT-4, GPT-3.5, DALL-E, Codex).
Use cases: content generation, summarization, code generation, image generation, conversational AI.
How it differs from other Azure AI services (e.g., Language Service for pre-built NLP, not generative).
Responsible AI features: content filtering, transparency notes.
Deployment options: Azure AI Studio, REST API, SDKs.
Common Wrong Answers and Why
'Azure OpenAI is a tool for building custom machine learning models.' Wrong: Azure OpenAI provides pre-trained models, not a platform to train your own from scratch. The exam wants you to know it's a pre-built model service.
'You can only use Azure OpenAI with Microsoft products.' Wrong: It can be called from any application via REST API, including non-Microsoft platforms.
'Azure OpenAI stores your data and uses it to improve models.' Wrong: Microsoft does not use your data for retraining unless you opt in. Data is not stored by default.
'Content filters cannot be modified.' Wrong: You can adjust severity levels or disable filters with approval.
Specific Numbers and Terms
Token limits: GPT-4 up to 32,768 tokens; GPT-3.5-Turbo up to 16,384.
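Default temperature: 0.7 (the value used in this chapter's examples; the REST API falls back to 1 if the parameter is omitted).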
API version: Use latest preview (e.g., 2024-02-15-preview).
Regions: East US, South Central US, West Europe, France Central.
Authentication: API key or Azure AD.
Pricing: Pay-per-token.
Edge Cases and Exceptions
Fine-tuning is not available for all models; currently limited to GPT-3.5-Turbo and GPT-4 (preview).
DALL-E integration is separate from text models; you need a different deployment.
Some regions require explicit access approval; not all subscriptions are auto-enabled.
The service cannot be used offline or on-premises, including through Azure Arc; it is cloud-only.
How to Eliminate Wrong Answers
If an answer says 'build custom models' — eliminate, because Azure OpenAI uses pre-trained models.
If an answer says 'on-premises deployment' — eliminate, it's cloud-only.
If an answer says 'data is used to train models' — eliminate, unless it mentions opt-in.
If an answer says 'only supports text' — eliminate, it supports images via DALL-E.
Focus on the distinction between generative AI (Azure OpenAI) and traditional ML (Azure Machine Learning). The exam loves to test this.
Azure OpenAI provides REST API access to OpenAI models like GPT-4, GPT-3.5, and DALL-E.
It is a PaaS offering with enterprise security, compliance, and scalability.
Models are deployed per resource; you must create a deployment before calling the API.
Authentication can be via API key or Azure AD (recommended for production).
Content filters are applied by default; severity can be adjusted.
Token limits: GPT-4 up to 32,768 tokens; GPT-3.5-Turbo up to 16,384 tokens.
Azure OpenAI does not store your data or use it for retraining without consent.
Fine-tuning is available for select models (GPT-3.5-Turbo, GPT-4 preview).
Use cases include text generation, summarization, code generation, image generation, and conversational AI.
The service is available in limited regions; some require access approval.
These come up on the exam all the time. Here's how to tell them apart.
Azure OpenAI Service
Generative AI: creates new content (text, code, images).
Based on large transformer models (GPT-4, DALL-E).
Customized through prompt engineering and, optionally, fine-tuning.
Use cases: chatbots, content creation, code generation.
Pricing per token; higher cost for larger models.
Azure Cognitive Services (Language Service)
Pre-built NLP: extracts insights (sentiment, key phrases).
Based on smaller, task-specific models.
No prompt engineering; uses predefined APIs.
Use cases: text analytics, translation, language detection.
Pricing per transaction; lower cost per call.
Mistake
Azure OpenAI is the same as the public OpenAI API.
Correct
Azure OpenAI is a separate service hosted on Microsoft Azure, offering additional enterprise features like Azure AD integration, private networking, and compliance certifications. It uses the same models but with different endpoints and SLAs.
Mistake
You can train your own model from scratch using Azure OpenAI.
Correct
Azure OpenAI provides pre-trained models. You can fine-tune them but not train from scratch. For custom model training, use Azure Machine Learning.
Mistake
Azure OpenAI automatically stores all prompts and responses for retraining.
Correct
By default, Azure OpenAI does not store your data. You can enable logging for monitoring, but Microsoft does not use your data to improve models unless you explicitly opt in.
Mistake
Content filters cannot be adjusted and always block the same content.
Correct
Content filter severity levels can be configured per category (low, medium, high). You can also disable filters with approval. The default is medium.
Mistake
Azure OpenAI only works with C# or .NET applications.
Correct
Azure OpenAI provides a REST API that can be called from any programming language. SDKs are available for Python, JavaScript, C#, and others.
First, ensure you have an Azure subscription. Then, go to the Azure Portal and create an Azure OpenAI resource. Some regions require explicit access approval via a request form. Once approved, you can deploy models and get API keys. Alternatively, you can use Azure AI Studio to get started quickly.
GPT-4 is more capable, with better reasoning, creativity, and longer context (up to 32,768 tokens vs 16,384 for GPT-3.5-Turbo). However, GPT-4 is more expensive and slower. For many tasks, GPT-3.5-Turbo provides good performance at lower cost.
Yes, via Retrieval Augmented Generation (RAG). You can index your data in Azure Cognitive Search and use the model to retrieve and generate responses based on that data. This keeps your data within Azure and does not require fine-tuning.
Implement retry logic with exponential backoff. Monitor the 'Retry-After' header in 429 responses. You can also request higher rate limits via a support ticket. Use multiple deployments to distribute load.
The default content filter severity is 'medium' for all categories (hate, self-harm, sexual, violence). This means moderate content may be blocked. You can adjust severity to 'low' or 'high' per category, or disable filters with Microsoft approval.
Yes, Azure OpenAI can be used in HIPAA-compliant environments when configured with appropriate security measures, such as using Azure AD authentication, enabling logging, and signing a Business Associate Agreement (BAA).
Use Azure Cost Management to view token consumption and costs. You can also set budgets and alerts. Additionally, Azure Monitor provides metrics on token usage per deployment. The pricing page gives per-model rates.