AI-900 | Chapter 79 of 100 | Objective 5.2

Azure OpenAI Deployments and API Access

This chapter covers Azure OpenAI Service deployments and API access, a core topic in the Generative AI domain of the AI-900 exam (Objective 5.2). You will learn how to deploy models, configure endpoints, manage authentication, handle rate limits, and use best practices for production. Expect 5-10% of exam questions to touch this area, primarily focusing on deployment types, authentication methods, and quota management.

25 min read
Intermediate
Updated May 31, 2026

Azure OpenAI as a Hotel Concierge

Imagine Azure OpenAI deployments as a hotel with multiple concierge desks. Each desk (deployment) has a specific reference book (the model, e.g., GPT-4) and a set capacity (tokens per minute). When you call the hotel's main number (the endpoint), you name the desk you want by specifying the deployment name in your request. The concierge has a notepad (context window) that holds only so much text at once, measured in tokens. When you ask a question, the concierge writes it down and consults the reference book to compose an answer. How long that takes depends on the length of your request and how many other guests are waiting. Each desk has its own rate limit, and the hotel also has a global limit on total requests per minute across all desks. If you exceed a limit, the concierge tells you to wait (HTTP 429). To manage high traffic, you can create multiple desks with the same book (model) but different capacities, and use a load balancer (Azure API Management) to distribute requests. The hotel also offers reserved desks (provisioned throughput) for VIP guests who need guaranteed response times.

How It Actually Works

What is Azure OpenAI Service?

Azure OpenAI Service provides REST API access to OpenAI's language models, including GPT-4, GPT-4 Turbo with Vision, GPT-3.5-Turbo, and Embeddings models. Unlike using OpenAI directly, Azure offers enterprise-grade security, compliance, and integration with Azure services. The resource is created within your Azure subscription in a region you choose, so it is governed by your organization's policies. Whether requests are processed only in that region depends on the deployment type: Global deployments may process a request in other regions.

Deployment Types: Global vs. Regional

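When you create a deployment in Azure OpenAI Studio, you choose a deployment type. The two standard (pay-as-you-go) options are:

Global Standard: Requests are routed across Azure's global infrastructure to a region with available capacity, which provides high availability. Because a request may be processed outside the resource's own region, this type trades strict processing locality for availability. This is the default and recommended for most workloads.

Regional Standard (labeled simply "Standard" in Microsoft's documentation): The model is deployed in the single region you select, and requests are processed there. This suits data residency requirements and can offer lower latency for users near that region, but availability may be reduced during regional outages.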

Model Deployment Process

To deploy a model, you must first have an Azure OpenAI resource created in your subscription. Then, in Azure OpenAI Studio, you navigate to the Deployments tab and click "Create new deployment." You select a model (e.g., GPT-4), a version (e.g., 0613), and a deployment name (e.g., my-gpt4). The deployment name becomes part of the API endpoint URL.

API Endpoint and Authentication

Each deployment has a unique endpoint URL in the format:

https://<resource-name>.openai.azure.com/openai/deployments/<deployment-name>/chat/completions?api-version=2024-02-15-preview
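As a minimal sketch of how the pieces of that URL fit together, the helper below assembles it from a resource name, a deployment name, and an API version. The function name `build_chat_url` and the sample values are illustrative, not part of any Azure SDK.

```python
from urllib.parse import urlencode


def build_chat_url(resource_name: str, deployment_name: str, api_version: str) -> str:
    """Return the chat completions URL for one deployment (hypothetical helper)."""
    base = f"https://{resource_name}.openai.azure.com"
    path = f"/openai/deployments/{deployment_name}/chat/completions"
    query = urlencode({"api-version": api_version})
    return f"{base}{path}?{query}"


# Sample values taken from this chapter's examples.
print(build_chat_url("myresource", "my-gpt4", "2024-02-15-preview"))
```

Note that the deployment name, not the model name, appears in the path, which is why two deployments of the same model have different URLs.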

Authentication is done via API keys or Azure Active Directory (AAD), which Microsoft has since renamed Microsoft Entra ID.

API Key: Two keys are provided per resource: KEY1 and KEY2. You pass the key in the api-key header. Keys can be regenerated without downtime by alternating between them.

Azure Active Directory: Recommended for production. You assign a managed identity or service principal to your application and grant it the "Cognitive Services OpenAI User" role. Then you obtain an AAD token and pass it in the Authorization header as Bearer <token>.
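The two authentication options differ only in which HTTP header carries the credential. A small sketch, assuming you already hold either a key or a token (the helper name `build_auth_headers` is hypothetical):

```python
from typing import Dict, Optional


def build_auth_headers(api_key: Optional[str] = None,
                       bearer_token: Optional[str] = None) -> Dict[str, str]:
    """Build request headers for exactly one of the two auth methods."""
    if bool(api_key) == bool(bearer_token):
        raise ValueError("Provide exactly one of api_key or bearer_token")
    headers = {"Content-Type": "application/json"}
    if api_key:
        # Key-based auth: the key goes in the api-key header.
        headers["api-key"] = api_key
    else:
        # Token-based auth: the AAD token goes in the Authorization header.
        headers["Authorization"] = f"Bearer {bearer_token}"
    return headers


print(build_auth_headers(api_key="YOUR_API_KEY"))
```

Rejecting the case where both credentials are supplied is a design choice of this sketch; it keeps a leftover key from silently being sent alongside a token.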

Rate Limits and Quotas

Rate limits are enforced at two levels:

Per deployment: Tokens per minute (TPM) and requests per minute (RPM). This chapter uses 40,000 TPM as the default for a pay-as-you-go GPT-4 deployment; actual defaults depend on the model, region, and subscription. You can request increases.

Per resource: Global requests per minute limit (e.g., 1000 RPM).

If you exceed these limits, the API returns HTTP 429 (Too Many Requests) with a Retry-After header indicating seconds to wait.
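One way to avoid hitting a TPM limit in the first place is to track spending on the client side. The sketch below keeps a sliding 60-second window of token counts. It is only an approximation under stated assumptions: the class name `TokenBudget` is hypothetical, and the service's own accounting may count tokens differently than your client does.

```python
import time
from collections import deque


class TokenBudget:
    """Client-side estimate of tokens spent in the last window (hypothetical helper)."""

    def __init__(self, tpm_limit, window_seconds=60.0, clock=time.monotonic):
        self.tpm_limit = tpm_limit
        self.window_seconds = window_seconds
        self.clock = clock          # injectable so the class can be tested
        self._events = deque()      # (timestamp, tokens) pairs, oldest first
        self._used = 0

    def _expire(self, now):
        # Drop entries that have aged out of the window.
        while self._events and now - self._events[0][0] >= self.window_seconds:
            _, tokens = self._events.popleft()
            self._used -= tokens

    def try_spend(self, tokens):
        """Record the spend and return True if it fits, else return False."""
        now = self.clock()
        self._expire(now)
        if self._used + tokens > self.tpm_limit:
            return False
        self._events.append((now, tokens))
        self._used += tokens
        return True


budget = TokenBudget(tpm_limit=40_000)
print(budget.try_spend(1_200))
```

A caller that gets `False` back can delay or queue the request instead of sending it and receiving a 429.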

Provisioned Throughput

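For predictable performance, you can purchase provisioned throughput units (PTUs). Each PTU reserves a fixed amount of model processing capacity. The tokens per minute that one PTU delivers depends on the model, its version, and the shape of your requests, so the figure used in this chapter (1 PTU = 1000 TPM for GPT-4) should be read as a simplification. Provisioned throughput is ideal for production workloads with strict latency requirements.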

API Versions

The API version is specified in the URL query parameter api-version. Versions are dated (YYYY-MM-DD); those ending in -preview are preview releases, while generally available (stable) releases carry no suffix, such as 2024-02-01. The examples in this chapter use the preview version 2024-02-15-preview. Pinning a specific version keeps your application's behavior consistent as the API evolves.

Content Filtering

Azure OpenAI includes built-in content filters that detect and block harmful content (hate, violence, self-harm, sexual). These filters are applied to both input prompts and model outputs. You can configure filter severity levels (low, medium, high) via the "Content Filters" tab in Azure OpenAI Studio. For some use cases, you can request to disable filters (subject to approval).
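Because filters apply to both prompts and outputs, client code usually needs to recognize a filtered result. The sketch below assumes, and this should be checked against the current API reference, that a blocked prompt comes back as HTTP 400 with an error code of `content_filter`, and that a blocked completion is marked with a `finish_reason` of `content_filter`. The function name `was_filtered` is hypothetical.

```python
def was_filtered(status_code: int, body: dict) -> bool:
    """Return True if the response looks like a content-filter block.

    Assumed shapes (verify against the API reference):
      - prompt blocked:     HTTP 400, body["error"]["code"] == "content_filter"
      - completion blocked: a choice with finish_reason == "content_filter"
    """
    if status_code == 400:
        return body.get("error", {}).get("code") == "content_filter"
    choices = body.get("choices", [])
    return any(c.get("finish_reason") == "content_filter" for c in choices)


blocked_prompt = {"error": {"code": "content_filter", "message": "..."}}
print(was_filtered(400, blocked_prompt))
```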

Monitoring and Logging

You can enable diagnostic settings to send logs to Azure Monitor, Log Analytics, or storage accounts. Metrics include:

Total calls

Total tokens

Response time

Rate limit hits

Content filter hits

Best Practices

Use AAD authentication over API keys for better security.

Implement retry logic with exponential backoff for 429 errors.

Use separate deployments for development and production, and split production workloads across multiple deployments, so that one workload cannot exhaust another's quota.

Monitor usage and set alerts for approaching quotas.

Use Azure API Management to throttle, cache, and transform requests.

Code Example: Calling the Chat Completions API

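import os
import requests

# "gpt-4" in the path is the deployment name, not the model name.
endpoint = "https://myresource.openai.azure.com/openai/deployments/gpt-4/chat/completions?api-version=2024-02-15-preview"
headers = {
    "Content-Type": "application/json",
    # Read the key from an environment variable (the variable name is an example)
    # instead of hard-coding it in source.
    "api-key": os.environ["AZURE_OPENAI_API_KEY"],
}
data = {
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is Azure OpenAI?"}
    ],
    "max_tokens": 100,   # upper bound on the length of the reply
    "temperature": 0.7
}
# Time out rather than hang, and raise on 4xx/5xx before parsing the body.
response = requests.post(endpoint, headers=headers, json=data, timeout=30)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])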

Error Handling

Common HTTP status codes:

200: Success

400: Bad request (e.g., invalid parameters)

401: Unauthorized (invalid or missing API key)

429: Rate limit exceeded

500: Internal server error
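In client code, these status codes usually collapse into three outcomes: use the result, retry later, or give up and fix the request. A minimal sketch follows; treating every 5xx as retryable is a common convention assumed here, not something the service mandates, and `classify_status` is a hypothetical name.

```python
def classify_status(status_code: int) -> str:
    """Map an HTTP status code to 'ok', 'retry', or 'fail'."""
    if status_code == 200:
        return "ok"
    if status_code == 429 or status_code >= 500:
        # Rate limiting and server-side errors are usually transient.
        return "retry"
    # 400, 401 and other client errors will not succeed on a plain retry.
    return "fail"


for code in (200, 400, 401, 429, 500):
    print(code, classify_status(code))
```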

Scaling and Performance

To scale, you can:

Increase TPM quota by submitting a support request.

Use multiple deployments behind a load balancer.

Use provisioned throughput for guaranteed performance.

Cache common responses using Azure Cache for Redis.
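To illustrate the multiple-deployments option, the sketch below rotates through deployment names in round-robin order. In production this job would normally belong to a gateway such as Azure API Management; the class name and the deployment names here are hypothetical.

```python
import itertools


class RoundRobin:
    """Hand out deployment names in a repeating cycle (hypothetical helper)."""

    def __init__(self, deployment_names):
        names = list(deployment_names)
        if not names:
            raise ValueError("at least one deployment name is required")
        self._cycle = itertools.cycle(names)

    def next_deployment(self):
        return next(self._cycle)


picker = RoundRobin(["gpt4-a", "gpt4-b", "gpt4-c"])
print([picker.next_deployment() for _ in range(4)])
```

Round-robin spreads requests evenly but ignores how many tokens each request uses, so a deployment can still hit its TPM limit if request sizes are uneven.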

Interaction with Azure Services

Azure OpenAI integrates with:

Azure Cognitive Search (now named Azure AI Search): For Retrieval Augmented Generation (RAG): you search your own data and feed the results into the prompt.

Azure Functions: To orchestrate workflows.

Azure Logic Apps: For no-code integration.

Azure API Management: To manage, secure, and monitor APIs.
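The RAG pattern mentioned above comes down to placing retrieved text into the prompt before the user's question. A minimal sketch, assuming the passages have already been fetched from a search index; the function name and the instruction wording are illustrative only.

```python
from typing import Dict, List


def build_rag_messages(question: str, passages: List[str]) -> List[Dict[str, str]]:
    """Build a chat messages array that grounds the answer in retrieved passages."""
    # Number each passage so the model can refer to its sources.
    sources = "\n\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    system = (
        "Answer using only the sources below. "
        "If the answer is not in the sources, say so.\n\nSources:\n" + sources
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": question},
    ]


messages = build_rag_messages(
    "What is the return window?",
    ["Items can be returned within 30 days.", "Refunds take 5 business days."],
)
print(messages[0]["content"])
```

The resulting list can be sent as the `messages` field of a chat completions request.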

Cost Management

Costs are based on tokens consumed (input + output). Prices vary by model and region and change over time. As an example figure, GPT-4 costs $0.03 per 1K input tokens and $0.06 per 1K output tokens; confirm current rates on the Azure pricing page. Use the Azure Pricing Calculator to estimate costs. Set budgets and alerts to avoid unexpected charges.
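The arithmetic is simple enough to sketch. The example below uses the per-1K prices quoted above purely as sample inputs; the function name `estimate_cost` is hypothetical.

```python
def estimate_cost(input_tokens: int, output_tokens: int,
                  input_price_per_1k: float, output_price_per_1k: float) -> float:
    """Estimate the cost of one request in dollars from token counts."""
    input_cost = input_tokens / 1000 * input_price_per_1k
    output_cost = output_tokens / 1000 * output_price_per_1k
    return input_cost + output_cost


# 1,500 input tokens and 500 output tokens at $0.03 / $0.06 per 1K tokens:
# 1.5 * 0.03 + 0.5 * 0.06 = 0.045 + 0.030
cost = estimate_cost(1500, 500, 0.03, 0.06)
print(f"${cost:.3f}")  # prints $0.075
```

Output tokens cost twice as much as input tokens at these rates, so capping `max_tokens` has a larger effect on cost than trimming the prompt by the same amount.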

Walk-Through

1

Create Azure OpenAI Resource

In the Azure portal, search for 'Azure OpenAI' and click 'Create'. Fill in subscription, resource group, region (e.g., East US), and name. Choose pricing tier (Standard S0). Click 'Review + Create' then 'Create'. After deployment, go to the resource and note the endpoint and keys.

2

Deploy a Model

In Azure OpenAI Studio, go to 'Deployments' and click 'Create new deployment'. Select model (e.g., GPT-4) and version (e.g., 0613). Give a deployment name (e.g., gpt-4-deployment). Set Tokens per Minute (TPM) limit (default 40K). Choose deployment type (Global or Regional). Click 'Create'. The deployment appears in the list.

3

Obtain Endpoint and Keys

In the Azure OpenAI resource overview, copy the 'Endpoint' URL (e.g., https://myresource.openai.azure.com/). Under 'Keys and Endpoint', copy KEY1 or KEY2. For AAD, assign the 'Cognitive Services OpenAI User' role to your identity and obtain a token using Azure CLI: `az account get-access-token --resource https://cognitiveservices.azure.com`.

4

Make an API Call

Construct the full URL: endpoint (with any trailing slash removed, since the portal value ends in '/') + '/openai/deployments/' + deployment-name + '/chat/completions?api-version=2024-02-15-preview'. Set one authentication header: 'api-key' or 'Authorization: Bearer <token>'. Send a POST request with a JSON body containing a 'messages' array. Handle the response: parse the JSON and extract 'choices[0].message.content'.
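Two details in this step are easy to get wrong: the endpoint copied from the portal ends with a slash, and a response may not contain the field you expect. A sketch of both, using hypothetical helper names:

```python
def join_endpoint(endpoint: str, deployment_name: str, api_version: str) -> str:
    """Join the portal endpoint with the chat completions path, avoiding '//'."""
    base = endpoint.rstrip("/")
    return (f"{base}/openai/deployments/{deployment_name}"
            f"/chat/completions?api-version={api_version}")


def extract_reply(payload: dict) -> str:
    """Return choices[0].message.content, failing clearly if it is absent."""
    choices = payload.get("choices") or []
    if not choices:
        raise ValueError("response contained no choices")
    return choices[0]["message"]["content"]


url = join_endpoint("https://myresource.openai.azure.com/",
                    "gpt-4-deployment", "2024-02-15-preview")
print(url)

sample = {"choices": [{"message": {"role": "assistant", "content": "Hello!"}}]}
print(extract_reply(sample))  # prints Hello!
```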

5

Handle Rate Limits

If you receive HTTP 429, read the 'Retry-After' header (value in seconds). Implement exponential backoff: retry after 1s, then 2s, 4s, etc. up to a maximum. Monitor your TPM usage via Azure Monitor. To increase quota, submit a support request with justification.
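The retry step can be sketched as a loop that prefers the server's 'Retry-After' value and falls back to exponential backoff when the header is missing. To keep the example runnable without network access, the request is passed in as a `send` callable returning `(status_code, headers, body)`; that signature, the function names, and the 30-second cap are assumptions of this sketch. It also assumes 'Retry-After' holds a number of seconds, as described above.

```python
import time


def backoff_delays(max_retries, base=1.0, cap=30.0):
    """Return the fallback delays: 1s, 2s, 4s, ... capped at `cap` seconds."""
    return [min(cap, base * 2 ** i) for i in range(max_retries)]


def call_with_retry(send, max_retries=5, sleep=time.sleep):
    """Call send() until it stops returning 429 or retries run out.

    send() must return a (status_code, headers, body) tuple.
    """
    delays = backoff_delays(max_retries)
    for attempt in range(max_retries + 1):
        status, headers, body = send()
        if status != 429:
            return status, body
        if attempt == max_retries:
            break
        retry_after = headers.get("Retry-After")
        # Prefer the server's hint; otherwise use exponential backoff.
        delay = float(retry_after) if retry_after is not None else delays[attempt]
        sleep(delay)
    raise RuntimeError("still rate limited after all retries")


print(backoff_delays(4))  # prints [1.0, 2.0, 4.0, 8.0]
```

In a real client, `send` would wrap the HTTP POST and return the response's status code, headers, and parsed JSON. Adding random jitter to each delay is a common refinement when many clients retry at once.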

What This Looks Like on the Job

Enterprise Scenario 1: Customer Support Chatbot

A large e-commerce company deploys a GPT-4 based chatbot to handle customer inquiries. They use a Global Standard deployment to ensure low latency worldwide. The chatbot integrates with Azure Cognitive Search to retrieve product information from a vector database (RAG). The company uses API keys for simplicity but plans to migrate to AAD for better security. They set the TPM limit to 100K and monitor usage with alerts. They also implement a fallback to a simpler model (GPT-3.5-Turbo) when the primary model is overloaded. A common misconfiguration is not setting a proper retry policy, leading to dropped requests during traffic spikes.

Enterprise Scenario 2: Internal Code Assistant

A software company uses Azure OpenAI to power an internal code review assistant. They deploy GPT-4 in a Regional Standard deployment in their home region to keep data within the country. They use provisioned throughput (5 PTUs) to guarantee response times under 2 seconds. Authentication is via AAD with managed identity for the app service. They log all requests and responses to Azure Monitor for auditing. The biggest challenge is managing cost: they set a monthly budget of $5,000 and use token-based throttling to prevent runaway usage.

Enterprise Scenario 3: Content Moderation Pipeline

A social media platform uses Azure OpenAI's content filters to moderate user-generated content. They deploy a GPT-3.5-Turbo model with strict content filter settings (high severity for hate and violence). The pipeline processes millions of posts daily, so they use multiple deployments with a round-robin load balancer. They monitor filter hit rates and adjust filter severity based on false positive rates. A common issue is hitting the global RPM limit; they resolved this by spreading traffic across multiple Azure OpenAI resources in different regions.

How AI-900 Actually Tests This

AI-900 Objective 5.2: Describe Azure OpenAI Service

The exam tests your understanding of:

Deployment types (Global vs Regional)

Authentication methods (API keys vs AAD)

Rate limits and quotas (TPM, RPM)

Content filtering capabilities

API versioning

Common Wrong Answers

1.

"You must use Azure Active Directory for authentication." – Actually, both API keys and AAD are supported. API keys are simpler for development; AAD is recommended for production.

2.

"Global deployment means the model is deployed in every Azure region." – Global means it's available across multiple regions for high availability, but not necessarily all regions. Regional is locked to one.

3.

"Rate limits are per resource only." – Rate limits are per deployment AND per resource. You can have different TPM limits for each deployment.

4.

"Content filters cannot be disabled." – They can be disabled by request and approval, but are enabled by default.

Specific Numbers and Terms

Default GPT-4 TPM (figure used in this chapter): 40,000; actual defaults vary by model, region, and subscription

API version format: YYYY-MM-DD with an optional -preview suffix (e.g., 2024-02-15-preview, a preview version from 2024)

HTTP 429: Rate limit exceeded

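Provisioned Throughput Unit (PTU): a unit of reserved capacity. This chapter uses 1 PTU = 1000 TPM for GPT-4 as a simplified figure; real throughput per PTU varies by model and version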

Role for AAD: Cognitive Services OpenAI User

Edge Cases

If you exceed both deployment and resource limits, which error do you get? 429 with Retry-After.

Can you use the same API key for multiple resources? No, each resource has its own keys.

What happens if you use an old API version? The API may still work but features may be missing; best to use latest.

Eliminating Wrong Answers

When you see a question about authentication, remember: both keys and AAD are valid. If the question says "only" or "must", it's likely wrong. For deployment types, think about availability vs. latency. For rate limits, remember the two levels.

Key Takeaways

Azure OpenAI provides REST API access to OpenAI models with enterprise-grade security.

Two deployment types: Global Standard (high availability) and Regional Standard (low latency in one region).

Authentication can be via API keys or Azure AD; AAD is preferred for production.

Rate limits are per deployment (TPM, RPM) and per resource (global RPM).

HTTP 429 indicates rate limit exceeded; implement retry with exponential backoff.

Content filters are enabled by default; severity levels can be configured.

Provisioned throughput units (PTUs) guarantee performance for production workloads.

API version is specified in the URL; always use the latest stable version.

Monitor usage with Azure Monitor and set budgets to control costs.

Integration with Azure Cognitive Search enables Retrieval Augmented Generation (RAG).

Easy to Mix Up

These come up on the exam all the time. Here's how to tell them apart.

API Key Authentication

Simpler to implement: just pass the key in header.

Less secure: keys can be leaked, no role-based access.

Two keys allow rotation without downtime.

Suitable for development and testing.

Cannot use managed identities or service principals.

Azure AD Authentication

More secure: uses tokens, supports RBAC.

Requires additional setup: assign roles, obtain token.

Works with managed identities for automatic credential management.

Recommended for production environments.

Supports conditional access policies.

Watch Out for These

Mistake

Azure OpenAI is exactly the same as OpenAI's API.

Correct

Azure OpenAI offers the same models but with enterprise features: data residency, AAD integration, compliance certifications, and content filtering. The API is similar but endpoints differ and Azure has its own rate limits.

Mistake

You can only use API keys for authentication.

Correct

You can use either API keys or Azure Active Directory. AAD is recommended for production for better security and role-based access control.

Mistake

Global deployment means the model is deployed in every Azure region.

Correct

Global deployment means the model is available across multiple regions for high availability, but not necessarily all. Regional deployment is restricted to one region.

Mistake

Content filters cannot be modified or disabled.

Correct

You can configure filter severity levels and request to disable filters for approved use cases. By default, filters are enabled at medium severity.

Mistake

Rate limits are only per resource.

Correct

Rate limits are enforced per deployment (TPM and RPM) and per resource (global RPM). Both can be hit independently.

Frequently Asked Questions

What is the difference between Global and Regional deployment in Azure OpenAI?

Global Standard deployment distributes the model across multiple Azure regions for high availability and automatic failover. Regional Standard deploys the model in a single region you choose, offering lower latency for users in that region but less resilience. Global is recommended for most workloads; Regional is for data residency or latency optimization.

How do I authenticate to Azure OpenAI API?

You can authenticate using either an API key (passed in the 'api-key' header) or an Azure Active Directory token (passed in the 'Authorization' header as 'Bearer <token>'). API keys are simpler but less secure; AAD is recommended for production as it supports role-based access control and managed identities.

What does HTTP 429 mean and how should I handle it?

HTTP 429 means you have exceeded the rate limit (tokens per minute or requests per minute). The response includes a 'Retry-After' header with the number of seconds to wait. Implement retry logic with exponential backoff: wait 1 second, then 2, 4, 8, etc., up to a maximum. Also consider increasing your quota or distributing load across multiple deployments.

Can I use the same API key for multiple Azure OpenAI resources?

No, each Azure OpenAI resource has its own set of API keys (KEY1 and KEY2). You must use the key corresponding to the resource you are targeting. If you have multiple resources, you need to manage keys separately.

How do I increase my tokens per minute (TPM) quota?

You can request a quota increase by submitting a support request in the Azure portal. Provide justification for the increase, such as expected usage and business need. Alternatively, you can purchase Provisioned Throughput Units (PTUs) for guaranteed capacity.

What are content filters and how do I configure them?

Content filters automatically detect and block harmful content (hate, violence, self-harm, sexual) in both prompts and completions. You can configure severity levels (low, medium, high) for each category in Azure OpenAI Studio under 'Content Filters'. For approved use cases, you can request to disable filters.

What is the latest API version for Azure OpenAI?

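The version used throughout this chapter, '2024-02-15-preview', is a preview release from 2024. Stable (generally available) versions carry no '-preview' suffix, for example '2024-02-01'. Always check the official documentation for the most current version. Pinning a specific version keeps your application's behavior consistent as the API evolves.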
