AI-900 | Chapter 98 of 100 | Objective 4.2

Language Detection

This chapter covers language detection in Azure AI Language, a key capability for natural language processing (NLP) workloads. Language detection is a foundational step in many AI solutions, enabling downstream services like translation, sentiment analysis, and key phrase extraction to process text correctly. On the AI-900 exam, this topic appears in approximately 5-8% of questions, typically as part of objective 4.2 (Build NLP Solutions). You will be tested on the API's capabilities, the JSON response structure, supported languages, and how to handle ambiguous or mixed-language text.

25 min read
Intermediate
Updated May 31, 2026

Language Detection as a Post Office Sorter

Imagine a central post office that receives mail from all over the world. Each envelope has an address, but the sender might write it in any language. The sorter's job is to identify the language of the address before routing it to the correct translator. The sorter does not need to understand the content; they look for patterns: common words like 'the' for English, 'el' for Spanish, 'der' for German. They also check the script: English uses Latin letters, Russian uses Cyrillic, Chinese uses Hanzi. The sorter has a reference book of language profiles, each containing common words, letter frequencies, and script ranges. When an envelope arrives, the sorter scans the address, extracts these features, and compares them to each profile using a scoring system. The profile with the highest score wins. If scores are too close (e.g., Portuguese vs. Spanish), the sorter may flag the envelope for manual review or use a tiebreaker like the country of origin (if known). Azure's language detection plays the same role: the service receives text, analyzes it with a pre-trained model, and returns a language code and a confidence score. The sorter's 'manual review' corresponds to what your application should do when the confidence score is low; the API itself returns only its single best guess (or an unknown result).

How It Actually Works

What is Language Detection?

Language detection is the process of automatically identifying the natural language of a given text. In Azure AI Language, this is a pre-built, cloud-based API that returns the detected language, a confidence score, and an ISO 639-1 language code. It does not require training custom models; you simply send text via a REST endpoint or SDK, and the service analyzes the text using statistical models trained on large corpora.

Why It Exists

Many NLP pipelines require knowing the language of input text before applying language-specific operations such as translation, sentiment analysis, or entity recognition. For example, a customer support chatbot might receive messages in multiple languages and need to route them to the correct language model. Language detection automates this, eliminating the need for manual language identification.

How It Works Internally

Azure's language detection uses a pre-trained machine learning classifier that covers 100+ languages. Microsoft does not publish the model's internals, but conceptually a language identifier of this kind works at the character and word level:

Tokenization: The text is split into tokens (words and punctuation).

Feature extraction: The system extracts character n-grams (e.g., sequences of 2-5 characters) and word-level features. These features capture language-specific patterns like common letter pairs, diacritics, and script usage.

Classification: A multi-class classifier (e.g., logistic regression or neural network) assigns a probability distribution over all supported languages. The language with the highest probability is returned, along with its confidence score (0 to 1).

Post-processing: When the top candidate languages score very closely, the confidence score of the returned language is lower. The API still returns a single detected language per document. When it cannot identify the language at all, it returns "(Unknown)" as the name and code.
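The steps above can be sketched with a toy character n-gram classifier. This is only an illustration of the general technique, not Azure's actual model: the tiny `SAMPLES` corpus, the trigram size, the cosine-similarity scoring, and all function names are assumptions made for the example.

```python
from collections import Counter
import math

# Tiny illustrative "training corpora" (assumption: real systems use far more data).
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and the cat is on the mat with this and that",
    "fr": "le renard brun saute par dessus le chien paresseux et le chat est sur le tapis avec tout le monde",
    "de": "der schnelle braune fuchs springt über den faulen hund und die katze ist auf der matte mit dem kind",
}

def char_ngrams(text, n=3):
    """Feature extraction: count overlapping character n-grams."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))

def cosine(a, b):
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# One n-gram profile per language, built from the sample text.
PROFILES = {lang: char_ngrams(sample) for lang, sample in SAMPLES.items()}

def detect(text):
    """Classification: score every profile and return (best_language, normalized_score)."""
    features = char_ngrams(text)
    similarities = {lang: cosine(features, profile) for lang, profile in PROFILES.items()}
    total = sum(similarities.values())
    if total == 0:
        # No overlap with any profile, e.g. digits only.
        return "(Unknown)", 0.0
    best = max(similarities, key=similarities.get)
    return best, similarities[best] / total
```

Calling `detect("bonjour tout le monde")` picks `"fr"` because most of its trigrams appear only in the French profile, while `detect("12345")` falls through to the unknown result. The normalized score here is a rough stand-in for a confidence value and is not calibrated the way a production model's score would be.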

Key Components

Endpoint: REST API at https://<your-resource-name>.cognitiveservices.azure.com/language/:analyze-text?api-version=2023-04-01 (or a regional endpoint).

Request Format: JSON with kind set to "LanguageDetection" and analysisInput containing documents.

Response Format: JSON with detectedLanguage per document, including name, iso6391Name, and confidenceScore.

Supported Languages: Over 100 languages, including major ones like English (en), Spanish (es), French (fr), German (de), Chinese (zh), Arabic (ar), and many regional variants. The full list is documented in the official Microsoft documentation and is updated periodically.

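Confidence Score: A value between 0 and 1 indicating the model's confidence. A score above 0.9 is considered high. Scores below 0.5 indicate low confidence and should be treated as unreliable. When the service cannot identify a language at all, it returns "(Unknown)" as the name and code.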

Default Behavior: The service can detect the dominant language in a document. For mixed-language text, it returns the language with the highest overall score.

Latency: The service is stateless and processes each request independently, so there are no timers or sessions to configure. Latency is typically under a second for short texts but can increase with document size.

Configuration and Verification

To use language detection, you need an Azure AI Language resource (or a Cognitive Services multi-service resource). You must have the endpoint URL and one of the keys (Key1 or Key2). You can test the API using curl, Postman, or the Azure portal's Language Studio.

Example curl request:

curl -X POST "https://<your-resource-name>.cognitiveservices.azure.com/language/:analyze-text?api-version=2023-04-01" \
-H "Ocp-Apim-Subscription-Key: <your-key>" \
-H "Content-Type: application/json" \
-d '{
  "kind": "LanguageDetection",
  "parameters": {
    "modelVersion": "latest"
  },
  "analysisInput": {
    "documents": [
      {
        "id": "1",
        "text": "Bonjour tout le monde"
      }
    ]
  }
}'

Example response:

{
  "kind": "LanguageDetectionResults",
  "results": {
    "documents": [
      {
        "id": "1",
        "detectedLanguage": {
          "name": "French",
          "iso6391Name": "fr",
          "confidenceScore": 0.99
        },
        "warnings": []
      }
    ],
    "errors": []
  }
}
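A minimal sketch of reading that response with Python's standard library follows. The helper name `parse_detection_response` is hypothetical, and the shape assumed for entries in `errors` (an `id` plus an `error` object) should be checked against the API reference.

```python
import json

# The example response from above, as a JSON string.
SAMPLE_RESPONSE = """
{
  "kind": "LanguageDetectionResults",
  "results": {
    "documents": [
      {
        "id": "1",
        "detectedLanguage": {"name": "French", "iso6391Name": "fr", "confidenceScore": 0.99},
        "warnings": []
      }
    ],
    "errors": []
  }
}
"""

def parse_detection_response(payload):
    """Return detected languages and failed documents, both keyed by document ID."""
    results = json.loads(payload).get("results", {})
    detected = {}
    for doc in results.get("documents", []):
        lang = doc["detectedLanguage"]
        detected[doc["id"]] = (lang["name"], lang["iso6391Name"], lang["confidenceScore"])
    # Assumption: each failed document is reported as {"id": ..., "error": {...}}.
    failed = {err["id"]: err.get("error", {}) for err in results.get("errors", [])}
    return {"detected": detected, "failed": failed}

parsed = parse_detection_response(SAMPLE_RESPONSE)
```

Keying both dictionaries by document ID makes it straightforward to match results back to the original inputs when a batch contains many documents.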

You can also use the Azure SDKs (Python, C#, Java, etc.) for programmatic access. For example, in Python:

from azure.ai.textanalytics import TextAnalyticsClient
from azure.core.credentials import AzureKeyCredential

endpoint = "https://<your-resource-name>.cognitiveservices.azure.com/"
key = "<your-key>"
client = TextAnalyticsClient(endpoint=endpoint, credential=AzureKeyCredential(key))

documents = ["Bonjour tout le monde"]
response = client.detect_language(documents)
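for doc in response:
    # Per-document failures are returned as error objects, not raised as exceptions
    if doc.is_error:
        print(f"Document {doc.id} failed: {doc.error.message}")
        continue
    print(f"Detected language: {doc.primary_language.name}, confidence: {doc.primary_language.confidence_score}")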

Interaction with Related Technologies

Language detection is often used as a preprocessing step for other Azure AI services:

Translator: To translate text, you need to know the source language. Language detection can provide it automatically.

Sentiment Analysis: Sentiment models are language-specific. Language detection ensures the correct model is used.

Key Phrase Extraction: Similarly, key phrase extraction models are language-dependent.

Conversational Language Understanding (CLU): Language detection can route utterances to the appropriate language model.

Custom Text Classification: When building custom models, language detection can help filter or preprocess training data.
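As a rough sketch of that hand-off, the detected code can be attached to each document as a `language` hint before the text goes to a downstream feature such as sentiment analysis. The helper name `with_language_hints` and the English fallback are assumptions for illustration.

```python
def with_language_hints(texts, detected_codes, fallback="en"):
    """Pair each text with its detected ISO code so downstream calls use the right model.

    detected_codes holds one code per text; None means detection was not usable.
    """
    documents = []
    for index, (text, code) in enumerate(zip(texts, detected_codes), start=1):
        documents.append({
            "id": str(index),
            "language": code or fallback,  # assumption: default to English when unknown
            "text": text,
        })
    return documents

docs = with_language_hints(["Bonjour tout le monde", "Hello world"], ["fr", None])
```

The resulting list uses the `id`, `language`, and `text` fields that the document inputs of the Language service accept, so it could be passed on to a sentiment or key phrase request.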

Performance and Scale

Each request can contain up to 1,000 documents in a single batch, and each document can be up to 5,120 characters long. The service is designed for high throughput, but the request rate you can sustain depends on the pricing tier. For production, use the Standard (S) tier, which has higher limits than the Free tier.
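A small sketch of respecting those two limits on the client side is shown below. The helper names are hypothetical, and note one assumption: Python's `len()` counts code points, which may not match exactly how the service measures document length.

```python
MAX_DOCS_PER_REQUEST = 1000   # documents per request, from the limits above
MAX_CHARS_PER_DOC = 5120      # characters per document, from the limits above

def split_text(text, limit=MAX_CHARS_PER_DOC):
    """Cut an over-long text into consecutive chunks of at most `limit` characters."""
    return [text[i:i + limit] for i in range(0, len(text), limit)]

def make_batches(texts, batch_size=MAX_DOCS_PER_REQUEST):
    """Turn raw texts into batches of documents with unique IDs."""
    documents = []
    for text in texts:
        for chunk in split_text(text):
            documents.append({"id": str(len(documents) + 1), "text": chunk})
    return [documents[i:i + batch_size] for i in range(0, len(documents), batch_size)]

batches = make_batches(["a" * 6000] + ["hi"] * 1500)
```

Splitting at a fixed character offset can cut a word in half; for language detection that is usually harmless, but a real pipeline might prefer to split on sentence or paragraph boundaries.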

Limitations

Mixed-language text: If a document contains multiple languages, the service returns the dominant language. It does not provide per-sentence language identification.

Short text: Very short text (e.g., one word) may have low confidence because there are fewer features to analyze.

Ambiguous text: Some languages share many features (e.g., Serbian and Croatian). The service still returns a single language, but the confidence score may be lower.

Script variations: The service supports multiple scripts for some languages (e.g., Chinese simplified vs. traditional). It may not distinguish between them.
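One possible workaround for the mixed-language limitation is to split the text yourself and submit each sentence as its own document. The sketch below uses a naive punctuation-based splitter; the function name and the splitting rule are assumptions, and very short sentences will still suffer from the short-text limitation.

```python
import re

def sentence_documents(text):
    """Split text after '.', '!' or '?' and wrap each sentence as a separate document."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    return [{"id": str(i), "text": s} for i, s in enumerate(sentences, start=1)]

docs = sentence_documents("Hello there. Bonjour tout le monde! Wie geht es dir?")
```

Each entry in `docs` can then go into the `documents` array of a single detection request, giving one detected language per sentence.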

Versioning

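The API supports model versioning. You can specify "modelVersion": "latest" or pin a specific dated model version listed in the documentation. The model version is separate from the api-version query parameter (2023-04-01 in the examples in this chapter). The 'latest' version is updated periodically, so test changes in a staging environment before production.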

Error Handling

Common errors include:

Invalid subscription key (401 Unauthorized)

Exceeded quota (429 Too Many Requests)

Invalid request format (400 Bad Request)

Region mismatch (if the key is for a different region)

Always check the error field in the response for details.
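A minimal sketch of handling these status codes follows: retry only the transient ones with exponential backoff and fail fast on the rest. The `send` callable is hypothetical (it stands for whatever function performs the HTTP call and returns a status code and body), and the set of retryable codes and the delays are assumptions.

```python
import time

# Assumption: throttling and transient server errors are worth retrying;
# 400 and 401 will not succeed on a retry.
RETRYABLE = {429, 500, 503}

def call_with_backoff(send, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call send() until it returns HTTP 200, doubling the wait after each retryable failure."""
    for attempt in range(max_attempts):
        status, body = send()
        if status == 200:
            return body
        if status not in RETRYABLE or attempt == max_attempts - 1:
            raise RuntimeError(f"Request failed with HTTP {status}: {body}")
        sleep(base_delay * 2 ** attempt)  # waits 1s, 2s, 4s, ...
```

Passing `sleep` as a parameter keeps the function easy to test without real waiting; in production the default `time.sleep` is used.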

Walk-Through

1. Create an Azure AI Language resource

In the Azure portal, search for 'Language service' and click 'Create'. Choose a resource group, region, and pricing tier (Free F0 or Standard S). The Free tier allows 5,000 text records per month and 20 requests per minute. The Standard tier has higher limits and is required for production. After creation, note the endpoint URL and one of the keys from the 'Keys and Endpoint' blade. This resource will be used to authenticate API calls.

2. Prepare your input text

Decide on the text you want to analyze. For best results, use text that is at least a few words long (e.g., a sentence). The text can be in any supported language. If you have multiple documents, assign each a unique ID. Ensure each document does not exceed 5,120 characters. For batch processing, you can include up to 1,000 documents in a single request.

3. Send a language detection request

Construct an HTTP POST request to the endpoint URL with the path `/language/:analyze-text?api-version=2023-04-01`. Include the header `Ocp-Apim-Subscription-Key` with your key and `Content-Type: application/json`. In the body, set `kind` to `"LanguageDetection"`, and include an `analysisInput` object with your documents. Optionally, specify a `modelVersion`. Send the request using curl, Postman, or an SDK.
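The request described in this step could be assembled with Python's standard library roughly as follows. The function name `build_detection_request` and its parameters are hypothetical; the URL path, headers, and body fields are the ones named above.

```python
import json
import urllib.request

def build_detection_request(resource_name, key, texts, api_version="2023-04-01"):
    """Build (but do not send) the POST request for language detection."""
    url = (
        f"https://{resource_name}.cognitiveservices.azure.com"
        f"/language/:analyze-text?api-version={api_version}"
    )
    body = {
        "kind": "LanguageDetection",
        "parameters": {"modelVersion": "latest"},
        "analysisInput": {
            "documents": [{"id": str(i), "text": t} for i, t in enumerate(texts, start=1)]
        },
    }
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Ocp-Apim-Subscription-Key": key,
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = build_detection_request("<your-resource-name>", "<your-key>", ["Bonjour tout le monde"])
```

Sending it is then a matter of calling `urllib.request.urlopen(req)` with a real resource name and key, which performs the network call and returns the JSON response.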

4. Parse the JSON response

The response will contain a `results` object with a `documents` array. For each document, you get a `detectedLanguage` object with `name`, `iso6391Name`, and `confidenceScore`. Also check each document's `warnings` array and the `errors` array in `results`, which lists documents that could not be processed (e.g., empty text or an invalid document ID). The `confidenceScore` is a number between 0 and 1. A score above 0.9 indicates high confidence; a score below 0.5 may indicate ambiguous or unrecognized text.

5. Handle the result in your application

Use the detected language to route the text to the appropriate downstream service. For example, if the language is 'fr', you might send it to a French sentiment analysis model. If the confidence is low, you could ask the user to clarify or default to a primary language. In a chatbot, you might store the language preference per session. Always handle errors gracefully—if the service returns an error, log it and potentially retry with exponential backoff.
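A minimal routing sketch along these lines is shown below. The handler names, the 0.8 threshold, and the `manual_review` fallback are all assumptions chosen for illustration; pick values that suit your own traffic.

```python
def route(iso_code, confidence, handlers, threshold=0.8, fallback="manual_review"):
    """Pick a downstream handler, falling back when confidence is low or the language is unsupported."""
    if confidence < threshold or iso_code not in handlers:
        return fallback
    return handlers[iso_code]

# Hypothetical mapping from ISO 639-1 codes to downstream queues or models.
HANDLERS = {"en": "english_sentiment", "fr": "french_sentiment", "de": "german_sentiment"}

confident = route("fr", 0.99, HANDLERS)
uncertain = route("fr", 0.42, HANDLERS)
unsupported = route("ja", 0.97, HANDLERS)
```

Treating an unsupported language the same way as a low-confidence result keeps the fallback path in one place, which makes it easier to monitor how often it is taken.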

What This Looks Like on the Job

Enterprise Scenario 1: Multilingual Customer Support Ticket Routing

A global e-commerce company receives support tickets in English, Spanish, French, and German. They use Azure AI Language detection as the first step in their pipeline. Each ticket text is sent to the API, and the detected language (e.g., 'de' for German) is used to route the ticket to a queue handled by German-speaking agents. The confidence score is used to flag low-confidence tickets for manual review. In production, they configured a batch endpoint that processes up to 500 tickets per minute. They encountered an issue where short texts like 'Hilfe' (German for 'help') were misclassified as English due to insufficient features. They mitigated this by requiring a minimum text length of 10 characters before calling the API. They also set a confidence threshold of 0.8; below that, the ticket is sent to a default queue for manual language identification.

Enterprise Scenario 2: Content Moderation for a Social Media Platform

A social media platform uses language detection to filter content before applying language-specific moderation rules. User posts are sent to the language detection API, and the detected language determines which sentiment and toxicity models are used. They noticed that mixed-language posts (e.g., Spanglish) often received a dominant language label but lost nuances. To handle this, they used the modelVersion parameter to pin to a specific version during A/B testing. They also set up alerts for when the average confidence score drops below 0.7, indicating a potential model drift. Performance-wise, they needed to handle 10,000 requests per second, so they scaled their Azure AI Language resource to the Standard tier and used multiple endpoints across regions for load balancing.
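The confidence alert in this scenario could be approximated with a rolling average over recent scores. The sketch below is one possible shape for it; the class name, the window size, and the way the alert is surfaced (a boolean return value) are assumptions, while the 0.7 threshold comes from the scenario.

```python
from collections import deque

class ConfidenceMonitor:
    """Track the average confidence of the most recent detections."""

    def __init__(self, window=100, alert_below=0.7):
        self.scores = deque(maxlen=window)  # old scores drop off automatically
        self.alert_below = alert_below

    def record(self, score):
        """Add a score and return True when the rolling average falls below the threshold."""
        self.scores.append(score)
        average = sum(self.scores) / len(self.scores)
        return average < self.alert_below

monitor = ConfidenceMonitor(window=3)
alerts = [monitor.record(s) for s in (0.9, 0.9, 0.2)]
```

A small window reacts quickly but is noisy; a larger one smooths out individual low-confidence posts at the cost of slower alerts.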

Common Configuration Pitfalls

Using the wrong endpoint: The language detection endpoint is different from the Translator endpoint. Ensure you use the Language service endpoint.

Exceeding rate limits: The Free tier has a limit of 20 requests per minute. Exceeding it returns a 429 error. Always use the Standard tier for production and implement retry logic.

Ignoring confidence scores: Relying solely on the detected language without checking confidence can lead to incorrect routing. Always implement a threshold.

Not handling errors: Network issues or invalid keys can cause failures. Implement exponential backoff and fallback mechanisms.

How AI-900 Actually Tests This

What AI-900 Tests

Objective 4.2 (Build NLP Solutions) includes language detection as a key capability. The exam expects you to:

Understand the purpose of language detection in an NLP pipeline.

Identify the correct JSON request/response structure.

Know the meaning of the confidence score and how to interpret it.

Differentiate language detection from related capabilities such as Translator's language detection, sentiment analysis, and key phrase extraction.

Recognize common use cases: routing, preprocessing, and content moderation.

Top 3 Wrong Answers on the Exam

1. 'Language detection returns the language of each sentence individually.' This is false. The service returns one language per document (the dominant language). It does not perform per-sentence detection. Candidates sometimes assume finer granularity because other features, such as sentiment analysis, do return sentence-level results.

2. 'The confidence score is the probability that the text is in the detected language.' While close, the confidence score is not a strict probability but a model's confidence level. Candidates often think 0.5 means a 50% chance, but it does not. It is a relative score based on the model's internal calibration.

3. 'You need to train a custom model for each language.' This is incorrect. Language detection is a pre-built capability; no training is required. Candidates may confuse it with custom text classification.

Specific Numbers and Terms

ISO 639-1 code: Two-letter code (e.g., 'en', 'es'). The exam will ask you to identify the correct code.

Confidence score range: 0 to 1. Values above 0.9 are considered high.

Maximum documents per request: 1,000.

Maximum characters per document: 5,120.

Supported languages: Over 100. The exam may list a few and ask which are supported (e.g., 'Klingon' is not).

Edge Cases and Exceptions

Empty text: The document is not analyzed. It is reported in the errors array instead of receiving a detected language.

Text with only numbers or symbols: The service may still attempt detection but likely returns a low confidence score.

Mixed-language text: The dominant language is returned; the other languages are lost. The exam might ask about handling mixed-language scenarios.

Ambiguous language pairs: For closely related languages such as Serbian and Croatian, the service still returns a single language, typically with a lower confidence score.

How to Eliminate Wrong Answers

If an answer mentions 'training a model', it is likely wrong for pre-built language detection.

If an answer claims 'per-sentence detection', it is wrong.

If an answer omits the confidence score or suggests it is not important, it is incomplete.

Look for answers that include the correct ISO code format (two letters, lowercase).

Remember that language detection is a separate API from Translator; do not confuse their endpoints.

Key Takeaways

Language detection identifies the dominant language of a text using a pre-built ML model.

The API returns an ISO 639-1 two-letter language code (e.g., 'en', 'es').

The confidence score ranges from 0 to 1; values above 0.9 indicate high confidence.

Each request can include up to 1,000 documents, each up to 5,120 characters.

No custom training is required; the service works out of the box.

Short or ambiguous text may result in low confidence scores.

Language detection is a preprocessing step for other NLP services like sentiment analysis or translation.

Easy to Mix Up

These come up on the exam all the time. Here's how to tell them apart.

Azure AI Language Detection

Returns ISO 639-1 code, language name, and confidence score.

Can process up to 1,000 documents per request.

Designed for general NLP preprocessing.

Does not translate text; only detects language.

Part of the Azure AI Language service (Text Analytics).

Azure Translator Text Detection

Returns a language code and a confidence score for the detected language.

Part of the Translator service; primarily for translation.

Automatically detects language when translating if source language is not specified.

Can be used standalone via Detect endpoint.

Its list of detectable languages differs from Language Detection's; check each service's documentation for current coverage.

Watch Out for These

Mistake

Language detection can identify the language of each sentence in a document.

Correct

No, the Azure AI Language detection API returns a single dominant language per document. It does not provide per-sentence results. For per-sentence detection, you would need to split the text manually and call the API separately for each sentence.

Mistake

The confidence score is a probability that the detected language is correct.

Correct

The confidence score is a model-derived score between 0 and 1, but it is not a strict probability. It reflects the model's confidence based on feature similarity. A score of 0.9 does not mean 90% probability; it means high confidence relative to the model's training.

Mistake

You must train a custom language detection model for your specific domain.

Correct

Language detection is a pre-built capability. You do not need to train a model. Azure AI Language provides a general-purpose model that works out of the box for over 100 languages.

Mistake

Language detection works best with very short text like a single word.

Correct

Short text (e.g., one word) often results in low confidence because there are fewer features to analyze. For best accuracy, provide at least a sentence or more.

Mistake

The language detection API can detect over 200 languages.

Correct

As of the current version, the API supports over 100 languages. The exact number is documented and may increase, but it is not over 200. Always refer to the official documentation.


Frequently Asked Questions

What is the difference between language detection and translation?

Language detection identifies the language of the text, while translation converts text from one language to another. In Azure, language detection is part of the AI Language service, whereas translation is the Translator service. You can use language detection before translation to automatically determine the source language.

How accurate is Azure's language detection?

Accuracy is high for common languages with sufficient text (e.g., a sentence). For short text or closely related languages (e.g., Portuguese vs. Spanish), accuracy may drop. The confidence score helps gauge how reliable each result is.

Can I use language detection for mixed-language text?

The API returns the dominant language for the entire document. It does not identify multiple languages within one document. For mixed-language text, you may need to split the text manually and call the API on each segment.

What happens if the confidence score is very low?

A low confidence score (e.g., below 0.5) indicates the model is uncertain. The API still returns the best guess, but you should treat it as unreliable. In your application, you might ask the user to confirm or default to a fallback language.

How many languages does Azure AI Language detection support?

As of the latest update, it supports over 100 languages, including major world languages and many regional ones. The full list is available in the official documentation and is periodically expanded.

Do I need to train a custom model for language detection?

No. Language detection is a pre-built capability. You can use it immediately after creating an Azure AI Language resource. Custom training is only needed for tasks like custom text classification or custom entity recognition.

What is the maximum text length for a single document?

Each document can be up to 5,120 characters. If your text is longer, you must split it into multiple documents. The API can process up to 1,000 documents per request.
