This chapter covers Azure OpenAI Service deployments and API access, a core topic in the Generative AI domain of the AI-900 exam (Objective 5.2). You will learn how to deploy models, configure endpoints, manage authentication, handle rate limits, and use best practices for production. Expect 5-10% of exam questions to touch this area, primarily focusing on deployment types, authentication methods, and quota management.
Imagine Azure OpenAI deployments like a hotel with multiple concierge desks. Each desk (deployment) has a specific model (e.g., GPT-4) and a set capacity (tokens per minute). When you call the hotel's main number (the endpoint), you are connected to a concierge. The concierge has a notepad (context window) that can hold only a limited amount of text (tokens) for each conversation. If you ask a question, the concierge writes it down and then consults a reference book (the model) to compose an answer. The time it takes depends on how long your request is and how many other guests are waiting. The hotel can have multiple concierge desks, each with its own book and notepad size. You choose which desk to approach by specifying the deployment name in your request. The hotel also has a global limit on how many total requests can be handled per minute across all desks, and each desk has its own rate limit. If you exceed a rate limit, the concierge tells you to wait (HTTP 429). To manage high traffic, you can create multiple desks with the same book (model) but different capacities, and use a load balancer (Azure API Management) to distribute requests. The hotel also offers reserved desks (provisioned throughput) for VIP guests who need guaranteed response times.
What is Azure OpenAI Service?
Azure OpenAI Service provides REST API access to OpenAI's language models, including GPT-4, GPT-4 Turbo with Vision, GPT-3.5-Turbo, and Embeddings models. Unlike using OpenAI directly, Azure offers enterprise-grade security, compliance, and integration with Azure services. The resource is created within your Azure subscription in a region you choose, so it is governed by your organization's policies. With a Regional deployment, requests are processed in that region; with a Global deployment, requests may be processed in other regions.
Deployment Types: Global vs. Regional
When you create a deployment in Azure OpenAI Studio, you choose between two deployment types:
Global Standard: The model is deployed across multiple Azure regions, providing high availability and automatic failover. Requests are routed to the nearest region with capacity. This is the default and recommended for most workloads.
Regional Standard: The model is deployed in a single region you select. This offers lower latency for users in that region but may have reduced availability during regional outages.
Model Deployment Process
To deploy a model, you must first have an Azure OpenAI resource created in your subscription. Then, in Azure OpenAI Studio, you navigate to the Deployments tab and click "Create new deployment." You select a model (e.g., GPT-4), a version (e.g., 0613), and a deployment name (e.g., my-gpt4). The deployment name becomes part of the API endpoint URL.
API Endpoint and Authentication
Each deployment has a unique endpoint URL in the format:
https://<resource-name>.openai.azure.com/openai/deployments/<deployment-name>/chat/completions?api-version=2024-02-15-preview
Authentication is done via API keys or Azure Active Directory (AAD, since renamed Microsoft Entra ID).
API Key: Two keys are provided per resource: KEY1 and KEY2. You pass the key in the api-key header. Keys can be regenerated without downtime by alternating between them.
Azure Active Directory: Recommended for production. You assign a managed identity or service principal to your application and grant it the "Cognitive Services OpenAI User" role. Then you obtain an AAD token and pass it in the Authorization header as Bearer <token>.
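The two header styles described above can be sketched as a small helper. This is a minimal illustration, not an official client: the function name build_auth_headers and its parameters are hypothetical, and it assumes you already hold either a key or an AAD token.

```python
from typing import Dict, Optional


def build_auth_headers(api_key: Optional[str] = None,
                       bearer_token: Optional[str] = None) -> Dict[str, str]:
    """Return request headers for exactly one authentication method.

    Hypothetical helper: pass either an API key (sent in the api-key header)
    or an AAD token (sent as Authorization: Bearer <token>), never both.
    """
    if (api_key is None) == (bearer_token is None):
        raise ValueError("Provide exactly one of api_key or bearer_token")

    headers = {"Content-Type": "application/json"}
    if api_key is not None:
        headers["api-key"] = api_key
    else:
        headers["Authorization"] = f"Bearer {bearer_token}"
    return headers


# Placeholder credentials, for illustration only.
print(build_auth_headers(api_key="YOUR_API_KEY"))
print(build_auth_headers(bearer_token="YOUR_AAD_TOKEN"))
```

Rejecting the "both" and "neither" cases up front keeps a misconfigured caller from silently sending the wrong credential.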
Rate Limits and Quotas
Rate limits are enforced at two levels:
Per deployment: Tokens per minute (TPM) and requests per minute (RPM). Default TPM for GPT-4 is 40,000 for pay-as-you-go. You can request increases.
Per resource: Global requests per minute limit (e.g., 1000 RPM).
If you exceed these limits, the API returns HTTP 429 (Too Many Requests) with a Retry-After header indicating seconds to wait.
Provisioned Throughput
For predictable performance, you can purchase provisioned throughput units (PTUs). Each PTU guarantees a certain number of tokens per minute (e.g., 1 PTU = 1000 TPM for GPT-4). This is ideal for production workloads with strict latency requirements.
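As a rough sizing sketch, the arithmetic implied above is a ceiling division. The function name ptus_needed is hypothetical, and the default of 1000 TPM per PTU is simply this chapter's illustrative figure; actual per-PTU throughput depends on the model and should be taken from current Azure documentation.

```python
import math


def ptus_needed(target_tpm: int, tpm_per_ptu: int = 1000) -> int:
    """Return the smallest whole number of PTUs covering target_tpm.

    tpm_per_ptu defaults to the chapter's example ratio (an assumption,
    not an official figure).
    """
    if target_tpm < 0 or tpm_per_ptu <= 0:
        raise ValueError("target_tpm must be >= 0 and tpm_per_ptu must be > 0")
    return math.ceil(target_tpm / tpm_per_ptu)


print(ptus_needed(40_000))  # 40
print(ptus_needed(4_500))   # 5 (partial units round up)
```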
API Versions
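The API version is specified in the URL query parameter api-version. Azure OpenAI supports multiple versions. Versions ending in -preview, such as the 2024-02-15-preview used in this chapter's examples, are preview releases; stable (generally available) versions omit the suffix, for example 2024-02-01. Pinning a specific version ensures your application works consistently even as the API evolves.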
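Because the api-version value is a date string with an optional -preview suffix, it can be validated before a request is sent. The sketch below assumes that YYYY-MM-DD pattern; parse_api_version is a hypothetical helper, not part of any SDK.

```python
import re
from datetime import date
from typing import Tuple

# Assumed pattern: YYYY-MM-DD, optionally followed by "-preview".
_VERSION_RE = re.compile(r"^(\d{4})-(\d{2})-(\d{2})(-preview)?$")


def parse_api_version(version: str) -> Tuple[date, bool]:
    """Split an api-version string into (release date, is_preview)."""
    match = _VERSION_RE.match(version)
    if match is None:
        raise ValueError(f"Unrecognized api-version: {version!r}")
    year, month, day, preview = match.groups()
    return date(int(year), int(month), int(day)), preview is not None


released, is_preview = parse_api_version("2024-02-15-preview")
print(released, is_preview)  # 2024-02-15 True
```

A check like this could be used in a deployment pipeline to flag preview versions before they reach production.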
Content Filtering
Azure OpenAI includes built-in content filters that detect and block harmful content (hate, violence, self-harm, sexual). These filters are applied to both input prompts and model outputs. You can configure filter severity levels (low, medium, high) via the "Content Filters" tab in Azure OpenAI Studio. For some use cases, you can request to disable filters (subject to approval).
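On the client side, a filtered completion can be detected in the response body. The sketch below assumes the chat completions response marks a blocked output with finish_reason set to "content_filter"; a blocked prompt is typically reported as an HTTP 400 error instead and is not handled here. The function name and the sample payload are hypothetical.

```python
from typing import Any, Dict, List


def filtered_choice_indexes(response_body: Dict[str, Any]) -> List[int]:
    """Return the indexes of choices whose output was cut by the content filter.

    Assumption: a filtered completion carries finish_reason == "content_filter".
    """
    choices = response_body.get("choices", [])
    return [
        choice.get("index", position)
        for position, choice in enumerate(choices)
        if choice.get("finish_reason") == "content_filter"
    ]


# Hand-written sample payload for illustration, not captured from the service.
sample = {
    "choices": [
        {"index": 0, "finish_reason": "stop", "message": {"content": "Hello"}},
        {"index": 1, "finish_reason": "content_filter", "message": {"content": ""}},
    ]
}
print(filtered_choice_indexes(sample))  # [1]
```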
Monitoring and Logging
You can enable diagnostic settings to send logs to Azure Monitor, Log Analytics, or storage accounts. Metrics include:
Total calls
Total tokens
Response time
Rate limit hits
Content filter hits
Best Practices
Use AAD authentication over API keys for better security.
Implement retry logic with exponential backoff for 429 errors.
Use a single deployment for development and multiple deployments for production to isolate workloads.
Monitor usage and set alerts for approaching quotas.
Use Azure API Management to throttle, cache, and transform requests.
Code Example: Calling the Chat Completions API
import requests

# Full URL: resource endpoint + deployment name + api-version query parameter.
endpoint = "https://myresource.openai.azure.com/openai/deployments/gpt-4/chat/completions?api-version=2024-02-15-preview"
headers = {
    "Content-Type": "application/json",
    "api-key": "YOUR_API_KEY",  # KEY1 or KEY2 from the resource
}
data = {
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is Azure OpenAI?"},
    ],
    "max_tokens": 100,
    "temperature": 0.7,
}
response = requests.post(endpoint, headers=headers, json=data, timeout=30)
response.raise_for_status()  # surface 4xx/5xx errors instead of failing on a missing "choices" key
print(response.json()["choices"][0]["message"]["content"])
Error Handling
Common HTTP status codes:
200: Success
400: Bad request (e.g., invalid parameters)
401: Unauthorized (invalid or missing API key)
429: Rate limit exceeded
500: Internal server error
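One way to act on these status codes is to sort them into "done", "retry", and "fail" buckets. The helper below is a sketch; classify_status is a hypothetical name, and treating all 5xx responses as retryable is a common convention rather than something this chapter prescribes.

```python
def classify_status(status_code: int) -> str:
    """Map an HTTP status code to a coarse client action.

    Returns "ok" for 200, "retry" for 429 and 5xx (assumed transient),
    and "fail" for everything else (e.g., 400 or 401, which a retry
    will not fix).
    """
    if status_code == 200:
        return "ok"
    if status_code == 429 or 500 <= status_code < 600:
        return "retry"
    return "fail"


for code in (200, 400, 401, 429, 500):
    print(code, classify_status(code))
```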
Scaling and Performance
To scale, you can:
Increase TPM quota by submitting a support request.
Use multiple deployments behind a load balancer.
Use provisioned throughput for guaranteed performance.
Cache common responses using Azure Cache for Redis.
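The "multiple deployments behind a load balancer" option can be illustrated with a client-side round-robin picker. This is a simplified sketch; in practice a gateway such as Azure API Management would usually do the distribution. The class name and the deployment names are hypothetical.

```python
from itertools import cycle
from typing import Iterable


class RoundRobinDeployments:
    """Cycle through deployment names so requests are spread evenly."""

    def __init__(self, deployment_names: Iterable[str]) -> None:
        names = list(deployment_names)
        if not names:
            raise ValueError("At least one deployment name is required")
        self._names = cycle(names)

    def next_deployment(self) -> str:
        """Return the deployment name to use for the next request."""
        return next(self._names)


picker = RoundRobinDeployments(["gpt-4-a", "gpt-4-b"])  # hypothetical names
print([picker.next_deployment() for _ in range(4)])
# ['gpt-4-a', 'gpt-4-b', 'gpt-4-a', 'gpt-4-b']
```

Round-robin ignores each deployment's remaining quota; a production setup might instead skip a deployment that recently returned HTTP 429.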
Interaction with Azure Services
Azure OpenAI integrates with:
Azure Cognitive Search: For Retrieval Augmented Generation (RAG). You can search your data and feed results into the prompt.
Azure Functions: To orchestrate workflows.
Azure Logic Apps: For no-code integration.
Azure API Management: To manage, secure, and monitor APIs.
Cost Management
Costs are based on tokens consumed (input + output). Prices vary by model. For example, GPT-4 costs $0.03 per 1K input tokens and $0.06 per 1K output tokens. Use the Azure Pricing Calculator to estimate costs. Set budgets and alerts to avoid unexpected charges.
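The pricing formula above can be worked through in a few lines. The sketch uses the example GPT-4 prices quoted in this section as defaults; estimate_cost is a hypothetical helper, and real prices should come from the Azure Pricing Calculator.

```python
def estimate_cost(input_tokens: int,
                  output_tokens: int,
                  input_price_per_1k: float = 0.03,
                  output_price_per_1k: float = 0.06) -> float:
    """Estimate the cost in USD of one request from its token counts.

    Default prices are the chapter's example GPT-4 rates (an assumption;
    check current pricing before relying on them).
    """
    input_cost = input_tokens / 1000 * input_price_per_1k
    output_cost = output_tokens / 1000 * output_price_per_1k
    return input_cost + output_cost


# 1,500 input tokens and 500 output tokens:
# 1.5 * $0.03 + 0.5 * $0.06 = $0.045 + $0.03
print(f"${estimate_cost(1500, 500):.4f}")  # $0.0750
```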
Create Azure OpenAI Resource
In the Azure portal, search for 'Azure OpenAI' and click 'Create'. Fill in subscription, resource group, region (e.g., East US), and name. Choose pricing tier (Standard S0). Click 'Review + Create' then 'Create'. After deployment, go to the resource and note the endpoint and keys.
Deploy a Model
In Azure OpenAI Studio, go to 'Deployments' and click 'Create new deployment'. Select model (e.g., GPT-4) and version (e.g., 0613). Give a deployment name (e.g., gpt-4-deployment). Set Tokens per Minute (TPM) limit (default 40K). Choose deployment type (Global or Regional). Click 'Create'. The deployment appears in the list.
Obtain Endpoint and Keys
In the Azure OpenAI resource overview, copy the 'Endpoint' URL (e.g., https://myresource.openai.azure.com/). Under 'Keys and Endpoint', copy KEY1 or KEY2. For AAD, assign the 'Cognitive Services OpenAI User' role to your identity and obtain a token using Azure CLI: `az account get-access-token --resource https://cognitiveservices.azure.com`.
Make an API Call
Construct the full URL: endpoint + '/openai/deployments/' + deployment-name + '/chat/completions?api-version=2024-02-15-preview'. Set headers: 'api-key' or 'Authorization: Bearer <token>'. Send a POST request with a JSON body containing 'messages' array. Handle the response: parse JSON, extract 'choices[0].message.content'.
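The URL construction and response parsing in this step can be sketched as two small functions. The names build_chat_url and extract_reply are hypothetical, and the sample response is hand-written for illustration. Note that the endpoint copied from the portal ends with a slash, so the sketch strips it to avoid a double slash in the path.

```python
from typing import Any, Dict


def build_chat_url(endpoint: str,
                   deployment_name: str,
                   api_version: str = "2024-02-15-preview") -> str:
    """Join the resource endpoint, deployment name, and api-version."""
    base = endpoint.rstrip("/")  # portal endpoints end with "/"
    return (f"{base}/openai/deployments/{deployment_name}"
            f"/chat/completions?api-version={api_version}")


def extract_reply(response_body: Dict[str, Any]) -> str:
    """Return the text of the first choice: choices[0].message.content."""
    return response_body["choices"][0]["message"]["content"]


url = build_chat_url("https://myresource.openai.azure.com/", "gpt-4-deployment")
print(url)

# Hand-written sample body, not captured from the service.
sample_body = {"choices": [{"message": {"role": "assistant", "content": "Hi!"}}]}
print(extract_reply(sample_body))  # Hi!
```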
Handle Rate Limits
If you receive HTTP 429, read the 'Retry-After' header (value in seconds). Implement exponential backoff: retry after 1s, then 2s, 4s, etc. up to a maximum. Monitor your TPM usage via Azure Monitor. To increase quota, submit a support request with justification.
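The retry policy described here can be sketched as follows. The function call_with_retry and its parameters are hypothetical; send is any zero-argument callable returning an object with status_code and headers attributes (a requests.Response fits), and the sleep function is injectable so the logic can be exercised without waiting.

```python
import time
from typing import Any, Callable


def call_with_retry(send: Callable[[], Any],
                    max_retries: int = 5,
                    base_delay: float = 1.0,
                    max_delay: float = 30.0,
                    sleep: Callable[[float], None] = time.sleep) -> Any:
    """Call send() and retry on HTTP 429.

    Waits for the Retry-After value (in seconds) when the header is present
    and numeric, otherwise backs off exponentially: 1s, 2s, 4s, ...
    Each wait is capped at max_delay. Returns the last response received.
    """
    if max_retries < 0:
        raise ValueError("max_retries must be >= 0")

    for attempt in range(max_retries + 1):
        response = send()
        if response.status_code != 429 or attempt == max_retries:
            return response

        retry_after = response.headers.get("Retry-After")
        try:
            delay = float(retry_after)
        except (TypeError, ValueError):
            delay = base_delay * (2 ** attempt)
        sleep(min(delay, max_delay))
```

With requests, usage might look like call_with_retry(lambda: requests.post(url, headers=headers, json=data, timeout=30)). If every attempt returns 429, the caller receives that final 429 response and can decide how to fail.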
Enterprise Scenario 1: Customer Support Chatbot
A large e-commerce company deploys a GPT-4 based chatbot to handle customer inquiries. They use a Global Standard deployment to ensure low latency worldwide. The chatbot integrates with Azure Cognitive Search to retrieve product information from a vector database (RAG). The company uses API keys for simplicity but plans to migrate to AAD for better security. They set the TPM limit to 100K and monitor usage with alerts. They also implement a fallback to a simpler model (GPT-3.5-Turbo) when the primary model is overloaded. A common misconfiguration is not setting a proper retry policy, leading to dropped requests during traffic spikes.
Enterprise Scenario 2: Internal Code Assistant
A software company uses Azure OpenAI to power an internal code review assistant. They deploy GPT-4 in a Regional Standard deployment in their home region to keep data within the country. They use provisioned throughput (5 PTUs) to guarantee response times under 2 seconds. Authentication is via AAD with managed identity for the app service. They log all requests and responses to Azure Monitor for auditing. The biggest challenge is managing cost: they set a monthly budget of $5,000 and use token-based throttling to prevent runaway usage.
Enterprise Scenario 3: Content Moderation Pipeline
A social media platform uses Azure OpenAI's content filters to moderate user-generated content. They deploy a GPT-3.5-Turbo model with strict content filter settings (high severity for hate and violence). The pipeline processes millions of posts daily, so they use multiple deployments with a round-robin load balancer. They monitor filter hit rates and adjust filter severity based on false positive rates. A common issue is hitting the global RPM limit; they resolved this by spreading traffic across multiple Azure OpenAI resources in different regions.
AI-900 Objective 5.2: Describe Azure OpenAI Service
The exam tests your understanding of:
Deployment types (Global vs Regional)
Authentication methods (API keys vs AAD)
Rate limits and quotas (TPM, RPM)
Content filtering capabilities
API versioning
Common Wrong Answers
"You must use Azure Active Directory for authentication." – Actually, both API keys and AAD are supported. API keys are simpler for development; AAD is recommended for production.
"Global deployment means the model is deployed in every Azure region." – Global means it's available across multiple regions for high availability, but not necessarily all regions. Regional is locked to one.
"Rate limits are per resource only." – Rate limits are per deployment AND per resource. You can have different TPM limits for each deployment.
"Content filters cannot be disabled." – They can be disabled by request and approval, but are enabled by default.
Specific Numbers and Terms
Default GPT-4 TPM: 40,000
API version format: YYYY-MM-DD with an optional -preview suffix (e.g., 2024-02-15-preview)
HTTP 429: Rate limit exceeded
Provisioned Throughput Unit (PTU): 1 PTU = 1000 TPM for GPT-4
Role for AAD: Cognitive Services OpenAI User
Edge Cases
If you exceed both deployment and resource limits, which error do you get? 429 with Retry-After.
Can you use the same API key for multiple resources? No, each resource has its own keys.
What happens if you use an old API version? The API may still work but features may be missing; best to use latest.
Eliminating Wrong Answers
When you see a question about authentication, remember: both keys and AAD are valid. If the question says "only" or "must", it's likely wrong. For deployment types, think about availability vs. latency. For rate limits, remember the two levels.
Azure OpenAI provides REST API access to OpenAI models with enterprise-grade security.
Two deployment types: Global Standard (high availability) and Regional Standard (low latency in one region).
Authentication can be via API keys or Azure AD; AAD is preferred for production.
Rate limits are per deployment (TPM, RPM) and per resource (global RPM).
HTTP 429 indicates rate limit exceeded; implement retry with exponential backoff.
Content filters are enabled by default; severity levels can be configured.
Provisioned throughput units (PTUs) guarantee performance for production workloads.
API version is specified in the URL; always use the latest stable version.
Monitor usage with Azure Monitor and set budgets to control costs.
Integration with Azure Cognitive Search enables Retrieval Augmented Generation (RAG).
These come up on the exam all the time. Here's how to tell them apart.
API Key Authentication
Simpler to implement: just pass the key in header.
Less secure: keys can be leaked, no role-based access.
Two keys allow rotation without downtime.
Suitable for development and testing.
Cannot use managed identities or service principals.
Azure AD Authentication
More secure: uses tokens, supports RBAC.
Requires additional setup: assign roles, obtain token.
Works with managed identities for automatic credential management.
Recommended for production environments.
Supports conditional access policies.
Mistake
Azure OpenAI is exactly the same as OpenAI's API.
Correct
Azure OpenAI offers the same models but with enterprise features: data residency, AAD integration, compliance certifications, and content filtering. The API is similar but endpoints differ and Azure has its own rate limits.
Mistake
You can only use API keys for authentication.
Correct
You can use either API keys or Azure Active Directory. AAD is recommended for production for better security and role-based access control.
Mistake
Global deployment means the model is deployed in every Azure region.
Correct
Global deployment means the model is available across multiple regions for high availability, but not necessarily all. Regional deployment is restricted to one region.
Mistake
Content filters cannot be modified or disabled.
Correct
You can configure filter severity levels and request to disable filters for approved use cases. By default, filters are enabled at medium severity.
Mistake
Rate limits are only per resource.
Correct
Rate limits are enforced per deployment (TPM and RPM) and per resource (global RPM). Both can be hit independently.
Global Standard deployment distributes the model across multiple Azure regions for high availability and automatic failover. Regional Standard deploys the model in a single region you choose, offering lower latency for users in that region but less resilience. Global is recommended for most workloads; Regional is for data residency or latency optimization.
You can authenticate using either an API key (passed in the 'api-key' header) or an Azure Active Directory token (passed in the 'Authorization' header as 'Bearer <token>'). API keys are simpler but less secure; AAD is recommended for production as it supports role-based access control and managed identities.
HTTP 429 means you have exceeded the rate limit (tokens per minute or requests per minute). The response includes a 'Retry-After' header with the number of seconds to wait. Implement retry logic with exponential backoff: wait 1 second, then 2, 4, 8, etc., up to a maximum. Also consider increasing your quota or distributing load across multiple deployments.
No, each Azure OpenAI resource has its own set of API keys (KEY1 and KEY2). You must use the key corresponding to the resource you are targeting. If you have multiple resources, you need to manage keys separately.
You can request a quota increase by submitting a support request in the Azure portal. Provide justification for the increase, such as expected usage and business need. Alternatively, you can purchase Provisioned Throughput Units (PTUs) for guaranteed capacity.
Content filters automatically detect and block harmful content (hate, violence, self-harm, sexual) in both prompts and completions. You can configure severity levels (low, medium, high) for each category in Azure OpenAI Studio under 'Content Filters'. For approved use cases, you can request to disable filters.
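The version used in this chapter's examples, '2024-02-15-preview', is a preview release, as its suffix indicates; stable (generally available) versions carry no suffix, for example '2024-02-01'. Always check the official documentation for the most current version. Pinning a specific version ensures your application works consistently even as the API evolves.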