AZ-204 | Chapter 80 of 102 | Objective 5.1

Azure Language and Speech Services

This chapter covers Azure Language and Speech Services, part of Azure Cognitive Services, which enable developers to integrate natural language processing and speech capabilities into applications. These services are critical for building intelligent apps that understand, analyze, and respond to human language. On the AZ-204 exam, this topic area typically accounts for about 5-10% of questions, focusing on provisioning, authentication, SDK usage, and integration patterns. You will need to know how to use pre-built models, customize models, and handle real-time vs. batch processing.

25 min read
Intermediate
Updated May 31, 2026

Language Service as Multilingual Interpreter

Imagine a large conference with attendees speaking many languages. Azure Language and Speech Services together are like a team of expert interpreters stationed in a central booth. Each attendee speaks into a microphone; the audio is sent to the booth, where one group of interpreters (the Speech Service's pre-built models) transcribes the speech into text (Speech-to-Text). If an attendee wants a translation, the text is passed to a translator (Speech Translation or the Translator Text API) who converts it to the target language. For custom needs, the organization can train its own interpreters (Custom Speech, Custom Translator) by providing specialized vocabulary or parallel documents. A second group (the Language Service) reads the transcript and picks out its sentiment (Sentiment Analysis) and key phrases (Key Phrase Extraction); these features work on text, not on tone of voice. The entire system is orchestrated by a coordinator (Azure Cognitive Services) that manages authentication (keys/tokens), scaling (handling multiple concurrent requests), and regional availability. Just as a real interpreter must be accurate, fast, and secure, the services aim for high accuracy, low latency, and enterprise-grade security. The analogy breaks down at scale: the services handle far more concurrent requests than any human team could, within the rate limits of your pricing tier.

How It Actually Works

What Are Azure Language and Speech Services?

Azure Language and Speech Services are cloud-based APIs and SDKs that provide pre-built machine learning models for natural language processing (NLP) and speech processing. They are part of Azure Cognitive Services, which are grouped into decision, vision, speech, and language categories. The Language Service (formerly Text Analytics) includes features like sentiment analysis, key phrase extraction, language detection, named entity recognition (NER), and custom text classification. The Speech Service includes speech-to-text, text-to-speech, speech translation, and speaker recognition. These services are designed to be used without machine learning expertise — developers simply call REST APIs or use client libraries.

How They Work Internally

Both services are built on deep neural networks trained on massive datasets. When you send a request (text or audio), the service processes it through a pipeline:

- Pre-processing: Text is cleaned and tokenized; audio is converted to a suitable format (e.g., PCM, Opus) and split into frames.
- Inference: The pre-processed data is fed into a trained model that outputs predictions (e.g., sentiment score, transcribed text).
- Post-processing: Results are formatted into JSON responses. For speech, the service may apply language models, acoustic models, and pronunciation scoring.

Key Components, Defaults, and Timers

- Language Service: Key features include:
  - Sentiment Analysis: Returns a sentiment label (positive, negative, neutral, mixed) and confidence scores (0 to 1). Default endpoint: https://<region>.api.cognitive.microsoft.com/text/analytics/v3.1/sentiment.
  - Key Phrase Extraction: Returns a list of key phrases. Endpoint: /keyPhrases.
  - Language Detection: Returns language name, ISO 639-1 code, and confidence score. Endpoint: /languages.
  - Named Entity Recognition (NER): Recognizes entities like persons, locations, organizations. Endpoint: /entities/recognition/general.
  - Custom Text Classification: Requires a trained model; endpoint includes model version.
  - PII Detection: Identifies personally identifiable information (e.g., phone numbers, emails). Endpoint: /entities/recognition/pii.
  - Defaults: Synchronous text APIs accept up to 5,120 characters per document; the asynchronous analyze operation accepts up to 125,000. The maximum number of documents per request depends on the feature (for example, 1,000 for language detection but 10 for sentiment analysis and key phrase extraction). For billing, each 1,000 characters counts as one text record. Rate limits vary by tier (Free: 5,000 transactions/month, S0: 1,000 calls per minute).
- Speech Service: Key features include:
  - Speech-to-Text (STT): Real-time or batch transcription. Supports multiple audio formats (WAV, MP3, OGG). Default recognition language is en-US. Use SDK or REST API.
  - Text-to-Speech (TTS): Converts text to natural-sounding speech. Supports neural voices, custom voice fonts. Default output format is audio-16khz-32kbitrate-mono-mp3.
  - Speech Translation: Real-time translation of audio to text in multiple languages.
  - Speaker Recognition: Identifies or verifies speakers based on voice characteristics.
  - Custom Speech: Allows training custom acoustic, language, and pronunciation models. Requires a dataset of audio + transcriptions.
  - Pronunciation Assessment: Evaluates pronunciation accuracy, fluency, and completeness.
  - Timers: For real-time STT, recognition stops after a stretch of silence. The default silence timeout is 20 seconds, and the initial-silence and end-silence timeouts can be adjusted through Speech SDK properties such as `SpeechServiceConnection_InitialSilenceTimeoutMs` and `SpeechServiceConnection_EndSilenceTimeoutMs`.
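To make the sentiment endpoint listed above more concrete, here is a minimal sketch of how a request could be assembled. It only builds the URL and JSON body and sends nothing. The helper name `build_sentiment_request` is hypothetical, and the body shape (`documents` with `id`, `language`, `text`) is assumed to follow the v3.1 REST format.

```python
import json

def build_sentiment_request(region, texts, language="en"):
    """Return (url, json_body) for a v3.1 sentiment call; nothing is sent."""
    url = f"https://{region}.api.cognitive.microsoft.com/text/analytics/v3.1/sentiment"
    # Each document needs an id that is unique within the request.
    documents = [
        {"id": str(index), "language": language, "text": text}
        for index, text in enumerate(texts, start=1)
    ]
    return url, json.dumps({"documents": documents})

url, body = build_sentiment_request("westus", ["I had a wonderful day!"])
print(url)
print(body)
```

In a real call the body would be sent with an HTTP POST together with an authentication header.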

Configuration and Verification Commands

To use these services, you need to create a Cognitive Services resource in Azure. You can do this via Azure CLI:

az cognitiveservices account create --name MyLanguageService --resource-group MyResourceGroup --kind TextAnalytics --sku F0 --location westus

For Speech:

az cognitiveservices account create --name MySpeechService --resource-group MyResourceGroup --kind SpeechServices --sku F0 --location westus

To get the endpoint and keys:

az cognitiveservices account keys list --name MyLanguageService --resource-group MyResourceGroup
az cognitiveservices account show --name MyLanguageService --resource-group MyResourceGroup --query "properties.endpoint"

For SDK usage, install the NuGet packages:

dotnet add package Azure.AI.TextAnalytics
dotnet add package Microsoft.CognitiveServices.Speech

How They Interact with Related Technologies

Azure Bot Service: Language Service can be used to analyze user messages for sentiment and entities, improving bot responses.

Azure AI Search (formerly Azure Search): Use Key Phrase Extraction to generate searchable keywords from documents.

Power Automate: Connect to Language Service to trigger workflows based on sentiment analysis.

Azure Functions: Serverless execution of language processing tasks.

Logic Apps: Visual workflows that call Language Service APIs.

Azure Data Lake: Process large volumes of text data with batch operations.

Azure Kubernetes Service (AKS): Deploy the Language Service's Docker containers (available for some features) for on-premises or edge scenarios.

Security and Authentication

All requests must include an authentication header. You can use either a subscription key (passed in the Ocp-Apim-Subscription-Key header) or an Azure AD (now Microsoft Entra ID) token passed as Authorization: Bearer <token>, typically obtained through a managed identity. For speech, you can also use a temporary authorization token (valid for 10 minutes) obtained from the token endpoint.
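The two header styles and the 10-minute token lifetime can be sketched as follows. This is an illustration under assumptions rather than SDK code: `TokenCache`, `fetch_token`, and the 60-second refresh margin are hypothetical, and in real code `fetch_token` would call the token endpoint with the subscription key.

```python
import time

TOKEN_LIFETIME_SECONDS = 10 * 60  # speech authorization tokens last 10 minutes

def key_headers(key):
    """Header for subscription-key authentication."""
    return {"Ocp-Apim-Subscription-Key": key}

def bearer_headers(token):
    """Header for token authentication."""
    return {"Authorization": f"Bearer {token}"}

class TokenCache:
    """Reuse a token until shortly before it expires, then fetch a new one."""

    def __init__(self, fetch_token, clock=time.monotonic, refresh_margin=60):
        self._fetch_token = fetch_token        # hypothetical callable returning a token string
        self._clock = clock                    # injectable so the logic can be tested
        self._refresh_margin = refresh_margin  # seconds before expiry at which to refresh
        self._token = None
        self._expires_at = 0.0

    def get(self):
        now = self._clock()
        if self._token is None or now >= self._expires_at - self._refresh_margin:
            self._token = self._fetch_token()
            self._expires_at = now + TOKEN_LIFETIME_SECONDS
        return self._token
```

Refreshing a little early avoids sending a token that expires while the request is in flight.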

Pricing Tiers

Free (F0): 5,000 transactions per month for Language; 5 audio hours per month for Speech.

Standard (S0): Pay-as-you-go; higher throughput and features like custom models. For Speech, you pay per audio hour.

Custom model training: Additional charges for training hours.
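As a rough sketch of how usage maps onto the free-tier allowance, the arithmetic below assumes the rule stated later in this chapter that 1,000 characters count as one transaction. The helper names and the sample workload are hypothetical, and how the service bills empty or failed documents is not modeled.

```python
import math

FREE_TIER_TRANSACTIONS_PER_MONTH = 5_000  # F0 allowance quoted above

def text_records(text, record_size=1_000):
    """Billable records for one document, rounding partial records up."""
    return math.ceil(len(text) / record_size)

def monthly_usage(documents):
    """Total records for the documents and whether the F0 allowance covers them."""
    total = sum(text_records(doc) for doc in documents)
    return total, total <= FREE_TIER_TRANSACTIONS_PER_MONTH

# Hypothetical workload: 1,000 reviews of 2,500 characters each.
reviews = ["x" * 2_500] * 1_000
print(monthly_usage(reviews))
```

Each 2,500-character review rounds up to three records, so document length matters as much as document count when estimating cost.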

Region Availability

Both services are available in multiple Azure regions. Some features (like custom neural voice) are restricted to certain regions (e.g., West US 2, West Europe). Always check the documentation for region-specific availability.

Monitoring and Logging

You can enable diagnostic settings to send logs to Azure Monitor Logs, Azure Storage, or Event Hubs. Metrics include number of calls, latency, and errors. Use Application Insights for SDK-side telemetry.

Best Practices

Batch multiple documents into each request when processing large volumes of text; the per-request document limit depends on the feature.

For real-time speech, use the SDK with WebSocket protocol for lower latency.

Implement retry logic with exponential backoff for transient failures.

Secure your keys using Azure Key Vault or managed identities.

For custom speech, ensure training data is representative of the target acoustic environment.

Use the TextAnalyticsClient class for language operations, and SpeechRecognizer for speech.
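The retry advice above can be sketched as a small wrapper. This is a generic illustration, not the Azure SDK's built-in retry policy: `TransientError` is a hypothetical marker for retryable failures (for example HTTP 429 or 503), and the delay values are arbitrary defaults.

```python
import random
import time

class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""

def call_with_backoff(operation, max_attempts=5, base_delay=1.0, max_delay=30.0,
                      sleep=time.sleep, rng=random.random):
    """Call operation(), retrying transient failures with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last failure
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Jitter: wait between 50% and 100% of the computed delay.
            sleep(delay * (0.5 + rng() / 2))
```

The `sleep` and `rng` parameters exist only so the timing can be replaced in tests; production code would leave them at their defaults.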

Walk-Through

1. Create Cognitive Services Resource

Go to Azure Portal, click 'Create a resource', search for 'Language service' or 'Speech service'. Choose the appropriate service (e.g., Text Analytics, Speech). Select a pricing tier (F0 for free tier, S0 for production). Choose a region (e.g., West US). Provide a resource group and name. Review and create. After deployment, note the endpoint and keys. This resource will be used for all subsequent API calls.

2. Get Endpoint and Keys

In the Azure Portal, navigate to your resource. Under 'Resource Management', click 'Keys and Endpoint'. Copy one of the keys and the endpoint URL. For security, store keys in Azure Key Vault or environment variables. The endpoint is typically in the format `https://<region>.api.cognitive.microsoft.com/` or, for resources with a custom subdomain, `https://<resource-name>.cognitiveservices.azure.com/`. For speech, the endpoint includes the region, e.g., `https://westus.stt.speech.microsoft.com/speech/recognition/conversation/cognitiveservices/v1`.

3. Install SDK and Authenticate

In your .NET project, install the Azure.AI.TextAnalytics NuGet package (for language) or Microsoft.CognitiveServices.Speech (for speech). Create a client object using the endpoint and key. For language: `var client = new TextAnalyticsClient(new Uri(endpoint), new AzureKeyCredential(key));`. For speech: `var config = SpeechConfig.FromSubscription(key, region);`. Authentication must happen before any API call.

4. Perform Sentiment Analysis

Call the `AnalyzeSentiment` method on a document. The method returns a `DocumentSentiment` object with a sentiment label (positive, negative, neutral, mixed) and confidence scores. Example: `DocumentSentiment documentSentiment = client.AnalyzeSentiment("I had a wonderful day!", "en");`. The service uses a pre-trained model that processes the text and returns scores for each sentiment class. The scores sum to 1.
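To show what the returned labels and scores look like in practice, here is a small sketch that works on a response dictionary instead of calling the service. The sample data is invented, its shape is assumed to mirror the v3.1 sentiment JSON, and `summarize` and `needs_follow_up` are hypothetical helper names.

```python
import math

# Hypothetical response, shaped like the v3.1 sentiment JSON.
sample_response = {
    "documents": [
        {"id": "1", "sentiment": "positive",
         "confidenceScores": {"positive": 0.98, "neutral": 0.01, "negative": 0.01}},
        {"id": "2", "sentiment": "mixed",
         "confidenceScores": {"positive": 0.46, "neutral": 0.06, "negative": 0.48}},
    ],
    "errors": [],
}

def summarize(response):
    """Map each document id to (label, whether its scores sum to about 1)."""
    summary = {}
    for doc in response["documents"]:
        total = sum(doc["confidenceScores"].values())
        summary[doc["id"]] = (doc["sentiment"], math.isclose(total, 1.0, abs_tol=0.01))
    return summary

def needs_follow_up(label):
    # "mixed" includes negative statements, so it is treated like "negative" here.
    return label in ("negative", "mixed")

for doc_id, (label, consistent) in summarize(sample_response).items():
    print(doc_id, label, consistent, needs_follow_up(label))
```

Treating "mixed" as needing follow-up is a design choice for this sketch; an application could equally route it to a separate queue.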

5. Perform Speech-to-Text

Create a `SpeechRecognizer` object with the speech config and an audio config (e.g., from microphone or file). Call `RecognizeOnceAsync()` for a single utterance or `StartContinuousRecognitionAsync()` for continuous recognition. The result contains the recognized text, a `Reason` indicating whether speech was recognized, and the duration; confidence scores are included only when detailed output is requested. Example: `var result = await recognizer.RecognizeOnceAsync();`. The audio is streamed to the service, which processes it in chunks; with continuous recognition, interim results arrive through the `Recognizing` event and final results through the `Recognized` event.

What This Looks Like on the Job

Enterprise Scenario 1: Customer Feedback Analysis

A large e-commerce company wants to analyze customer reviews to detect negative sentiment in real time. They use Azure Language Service's Sentiment Analysis API. The application sends each review (as a document) to the API and receives sentiment scores. They configure a Logic App to trigger an alert when sentiment is negative, sending an email to customer service. They process ~10,000 reviews per day, staying within the S0 tier limits (1,000 calls per minute). They group several reviews into each request, up to the per-request document limit for sentiment analysis, which cuts the number of round trips; billing is still per text record, so batching does not change the cost. A common misconfiguration is not handling the 'mixed' sentiment label, which occurs when a review contains both positive and negative statements. They also use Key Phrase Extraction to generate tags for product improvement.
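The batching step in this scenario can be sketched as a simple generator. The batch size is left as a parameter because the per-request document limit depends on the feature; the value 10 used below is an assumption for illustration, and `batches` is a hypothetical helper name.

```python
def batches(items, batch_size):
    """Yield consecutive slices of items, each holding at most batch_size entries."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# Hypothetical day of traffic: 10,000 reviews sent 10 per request.
reviews = [f"review {n}" for n in range(10_000)]
requests_needed = sum(1 for _ in batches(reviews, 10))
print(requests_needed)
```

Spread over a day, that request count sits far below the 1,000 calls per minute quoted for the S0 tier.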

Enterprise Scenario 2: Real-Time Meeting Transcription

A multinational corporation uses Azure Speech Service to transcribe meetings in multiple languages. They use the Speech SDK with a custom language model trained on corporate jargon. The audio is captured from Microsoft Teams via a bot. The transcription is displayed in real-time and stored in Azure Blob Storage for compliance. They use Speech Translation to translate English to Spanish and French simultaneously. They encountered issues with background noise affecting accuracy, so they implemented noise suppression and used a custom acoustic model. They also use Speaker Identification to label who is speaking. The system handles up to 50 concurrent meetings, each with a 2-hour duration. They monitor usage with Azure Monitor and set up alerts for high latency.

Enterprise Scenario 3: Document Processing Pipeline

An insurance company processes thousands of claim forms daily. They use Language Service's NER to extract entities like policy numbers, dates, and names. They also use Custom Text Classification to categorize claims (e.g., auto, health, property). The pipeline uses Azure Functions to trigger when a new document is uploaded to Blob Storage. The function reads the text (using OCR if needed), sends it to Language Service, and stores the extracted data in Cosmos DB. They use batch operations for efficiency. A common pitfall is exceeding the per-document character limit (5,120 characters for synchronous requests), so they split long documents into chunks. They also use PII detection to redact sensitive information before storing.
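Splitting long documents, as this pipeline does, might look like the sketch below. It is one possible approach rather than a prescribed one: `split_document` is a hypothetical helper, the limit is passed in as a parameter, and length is measured in Python characters, which may not match exactly how the service counts.

```python
def split_document(text, max_chars):
    """Split text into chunks of at most max_chars, breaking at spaces when possible."""
    if max_chars < 1:
        raise ValueError("max_chars must be positive")
    chunks = []
    remaining = text.strip()
    while len(remaining) > max_chars:
        # Look for the last space that keeps the chunk within the limit.
        cut = remaining.rfind(" ", 0, max_chars + 1)
        if cut <= 0:
            cut = max_chars  # no space available: fall back to a hard cut
        chunks.append(remaining[:cut].rstrip())
        remaining = remaining[cut:].lstrip()
    if remaining:
        chunks.append(remaining)
    return chunks

print(split_document("aaa bbb ccc", 7))
```

Breaking at spaces keeps words intact, but an entity that spans a chunk boundary can still be missed, so splitting at sentence or paragraph boundaries may give better results.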

How AZ-204 Actually Tests This

What AZ-204 Tests on This Topic

AZ-204 objective 5.1 focuses on integrating Cognitive Services into applications. You will be tested on:

Provisioning a Cognitive Services resource (CLI, portal, ARM templates).

Authentication using keys or Azure AD (managed identities).

Using SDKs to call Language and Speech APIs.

Handling real-time vs. batch processing.

Configuring custom models (Custom Speech, Custom Text Classification).

Monitoring and logging.

Common Wrong Answers and Why Candidates Choose Them

1. Wrong: Use the same key for all Cognitive Services. Reality: Each service resource has its own keys. Some candidates confuse multi-service resource (one key for multiple services) with individual resources. The exam tests understanding that a multi-service resource (kind: CognitiveServices) provides a single key for multiple services, but you must still create separate resources for custom features.

2. Wrong: Speech-to-Text only supports real-time streaming. Reality: The Speech Service also supports batch transcription via a REST API. Candidates may think only the SDK supports STT, but the batch API is available for pre-recorded audio.

3. Wrong: Language Service can process documents of any length. Reality: Synchronous requests are limited to 5,120 characters per document (1,000 characters is the billing unit, one text record, not the size limit). Candidates often assume no limit. For longer text, you must split documents or use the asynchronous 'analyze' operation, which supports up to 125,000 characters per document.

4. Wrong: Custom Speech requires no training data. Reality: You must provide audio + transcriptions to train a custom model. Candidates may think it's just configuration.

Specific Numbers and Terms That Appear on the Exam

Free tier limits: 5,000 transactions/month for Language; 5 audio hours/month for Speech.

Default STT language: en-US.

Silence timeout: 20 seconds default.

Supported audio formats: WAV, MP3, OGG, etc.

Authentication headers: Ocp-Apim-Subscription-Key for keys; Authorization: Bearer for tokens.

SDK namespaces: Azure.AI.TextAnalytics, Microsoft.CognitiveServices.Speech.

Edge Cases and Exceptions

Multi-service resource: Can be used for Language, Speech, and other services, but you cannot use custom features (Custom Speech, Custom Text) with a multi-service resource; you need a dedicated resource.

Regional restrictions: Custom Neural Voice is only available in certain regions (e.g., West US 2).

Token expiration: Authorization tokens for speech expire after 10 minutes; you must refresh them.

Batch transcription: Maximum audio file size is 1 GB per file; maximum batch size is 100 files.

How to Eliminate Wrong Answers

Understand the underlying mechanism: if a question asks about real-time transcription, the answer should involve WebSocket and SDK. If it asks about processing a large batch of text files, the answer should involve batch API or splitting documents. If it asks about custom models, look for options that mention training data. If it asks about authentication, look for key or token options. Eliminate answers that use incorrect limits (e.g., 10,000 characters) or incorrect features (e.g., using Text Analytics for speech).

Key Takeaways

Azure Language Service provides pre-built NLP models: sentiment analysis, key phrase extraction, language detection, NER, PII detection, and custom text classification.

Azure Speech Service provides speech-to-text, text-to-speech, speech translation, speaker recognition, and custom speech models.

Both services require a Cognitive Services resource with an endpoint and key (or Azure AD token).

The synchronous Language API accepts up to 5,120 characters per document (billed as one text record per 1,000 characters); the number of documents allowed per request depends on the feature.

Speech-to-Text supports real-time (SDK via WebSocket) and batch (REST API) modes.

Custom models (Custom Speech, Custom Text) require training data and a dedicated resource (not multi-service).

Free tier limits: 5,000 transactions/month for Language; 5 audio hours/month for Speech.

Authentication uses Ocp-Apim-Subscription-Key header or Authorization: Bearer with a token (valid 10 minutes for speech).

Use Azure Monitor and Application Insights for logging and telemetry.

Always handle transient errors with retry logic and exponential backoff.

Easy to Mix Up

These come up on the exam all the time. Here's how to tell them apart.

Azure Language Service (Text Analytics)

Input is text (strings), not audio.

Features: sentiment analysis, key phrases, NER, language detection, PII detection, custom classification.

Max 5,120 characters per document for synchronous requests.

SDK namespace: Azure.AI.TextAnalytics.

Pricing based on number of transactions (1,000 characters = 1 transaction).

Azure Speech Service

Input is audio (WAV, MP3, etc.) or text for TTS.

Features: speech-to-text, text-to-speech, speech translation, speaker recognition, custom speech, pronunciation assessment.

Supports real-time streaming and batch transcription (audio up to 1 GB).

SDK namespace: Microsoft.CognitiveServices.Speech.

Pricing based on audio hours (for STT/TTS) or transactions (for translation).

Watch Out for These

Mistake

Azure Language Service can process an unlimited number of characters per document.

Correct

Synchronous requests are limited to 5,120 characters per document. For longer text, use the asynchronous 'analyze' operation, which supports up to 125,000 characters, or split the text into multiple documents.

Mistake

You can use a single Cognitive Services key for all Language and Speech features.

Correct

A multi-service resource (kind: CognitiveServices) provides a single key for multiple services, but custom features like Custom Speech and Custom Text require dedicated resources. Also, you must create separate resources for each service if you need different tiers.

Mistake

Speech-to-Text only works with real-time audio streaming.

Correct

The Speech Service also provides batch transcription via a REST API for pre-recorded audio files. You can submit audio files (up to 1 GB each) and get transcriptions asynchronously.

Mistake

Custom Speech models can be trained without providing transcriptions.

Correct

You must provide a dataset of audio files with matching transcriptions (text) to train a custom acoustic or language model. Without transcriptions, you can only use pre-built models.

Mistake

Language Service's sentiment analysis always returns a binary positive/negative result.

Correct

Sentiment analysis returns a label (positive, negative, neutral, mixed) and confidence scores. 'Mixed' is a valid label when both positive and negative sentiments are present. The scores are three values (positive, negative, neutral) that sum to 1.


Frequently Asked Questions

What is the difference between a multi-service Cognitive Services resource and a single-service resource?

A multi-service resource (kind: CognitiveServices) provides a single endpoint and key to access multiple Cognitive Services (e.g., Language, Speech, Vision) from one resource. However, it does not support custom features like Custom Speech or Custom Text. For those, you need a dedicated single-service resource (e.g., SpeechServices, TextAnalytics). Additionally, billing is aggregated under one resource, which may be simpler for cost management.

How do I handle long text (over the per-document limit) with Azure Language Service?

For synchronous requests, split the text into multiple documents, each within the 5,120-character limit, and send them in batch requests. Alternatively, use the asynchronous 'analyze' operation (e.g., /analyze), which supports up to 125,000 characters per document. Usage is still counted in 1,000-character text records. Check the latest documentation for current limits.

Can I use Azure Speech Service to transcribe a pre-recorded audio file?

Yes. You can use the batch transcription API (REST) to transcribe audio files stored in Azure Blob Storage. Submit a POST request with the audio file URI, and the service will process it asynchronously. You can then poll for results. The maximum file size is 1 GB, and you can include up to 100 files per batch.
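The POST request described above could be assembled roughly as follows. This sketch only builds the URL and body and sends nothing. The helper name is hypothetical, and the `v3.1` path and the `contentUrls`, `locale`, and `displayName` fields are assumptions about the batch transcription REST format that should be checked against the current API version.

```python
import json

def build_batch_transcription(region, audio_urls, locale="en-US", display_name="Batch job"):
    """Return (url, json_body) for creating a batch transcription; nothing is sent."""
    if not audio_urls:
        raise ValueError("at least one audio URL is required")
    url = f"https://{region}.api.cognitive.microsoft.com/speechtotext/v3.1/transcriptions"
    body = {
        "contentUrls": list(audio_urls),  # URIs of audio files, e.g. in Blob Storage
        "locale": locale,
        "displayName": display_name,
    }
    return url, json.dumps(body)

# Placeholder blob URL for illustration only.
url, body = build_batch_transcription(
    "westus", ["https://example.blob.core.windows.net/audio/meeting.wav"]
)
print(url)
print(body)
```

The service responds with a transcription resource that the caller polls until its status shows the job has finished.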

What audio formats does Azure Speech Service support?

The Speech Service supports several audio formats including WAV (PCM), MP3, OGG (Opus), FLAC, and others. For real-time streaming, the SDK works best with a specific format (e.g., 16 kHz, 16-bit, mono PCM for optimal recognition). You can specify the audio format when you create the `AudioConfig`, for example by passing an `AudioStreamFormat` for a stream input.
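Before sending audio, it can help to check whether a WAV file already matches the 16 kHz, 16-bit, mono PCM format mentioned above. The sketch below uses only Python's standard `wave` module and builds a short silent clip in memory so that it runs without an audio file; `describe_wav` and `is_16k_16bit_mono` are hypothetical helper names.

```python
import io
import wave

def describe_wav(source):
    """Read channel count, sample rate, and bit depth from a WAV file or file object."""
    with wave.open(source, "rb") as wav:
        return {
            "channels": wav.getnchannels(),
            "sample_rate": wav.getframerate(),
            "bits_per_sample": wav.getsampwidth() * 8,
        }

def is_16k_16bit_mono(info):
    return info == {"channels": 1, "sample_rate": 16000, "bits_per_sample": 16}

# One second of silence, written to memory in the target format.
buffer = io.BytesIO()
with wave.open(buffer, "wb") as wav:
    wav.setnchannels(1)
    wav.setsampwidth(2)       # 2 bytes = 16 bits per sample
    wav.setframerate(16000)
    wav.writeframes(b"\x00\x00" * 16000)
buffer.seek(0)

info = describe_wav(buffer)
print(info, is_16k_16bit_mono(info))
```

Files that fail the check can be resampled before upload, or sent as compressed input where the SDK supports it.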

How do I train a custom speech model?

First, create a dedicated Speech resource (not multi-service). Then, upload a dataset of audio files with matching transcriptions (text) to the Speech Studio or via API. You can train acoustic, language, or pronunciation models. After training, you deploy the model and use its endpoint ID in your application. Training costs additional fees based on compute hours.

What is the difference between sentiment analysis and opinion mining?

Sentiment analysis returns a global sentiment label (positive, negative, neutral, mixed) for a document. Opinion mining (also called aspect-based sentiment analysis) goes deeper by extracting aspects (e.g., 'battery life') and their associated sentiment (e.g., 'positive'). Opinion mining is available as a separate feature in Language Service (v3.1) and requires the `opinionMining` parameter set to true.

How do I authenticate with Azure AD instead of keys?

You can use managed identity or a service principal. For managed identity, enable it on your compute resource (e.g., Azure Function) and assign the 'Cognitive Services User' role on the Cognitive Services resource. Then, use `DefaultAzureCredential` in your SDK code. For example: `var client = new TextAnalyticsClient(new Uri(endpoint), new DefaultAzureCredential());`. The resource must use a custom subdomain endpoint for token-based authentication. This eliminates the need to store keys.
