This chapter covers Azure Language and Speech Services, part of Azure Cognitive Services, which enable developers to integrate natural language processing and speech capabilities into applications. These services are critical for building intelligent apps that understand, analyze, and respond to human language. On the AZ-204 exam, this topic area typically accounts for about 5-10% of questions, focusing on provisioning, authentication, SDK usage, and integration patterns. You will need to know how to use pre-built models, customize models, and handle real-time vs. batch processing.
Imagine a large conference with attendees speaking many languages. Azure Language and Speech Services are like a team of expert interpreters stationed in a central booth. Each attendee speaks into a microphone; the audio is sent to the booth, where the interpreters (pre-built models) transcribe the speech into text (Speech-to-Text). If an attendee wants a translation, the text is passed to a translator (Translator Text API), who converts it to the target language. For custom needs, the organization can train its own interpreters (Custom Speech, Custom Translator) by providing specialized vocabulary or parallel documents. The interpreters also judge the sentiment of what was said (Sentiment Analysis) and pick out key phrases (Key Phrase Extraction), working from the transcribed text rather than the tone of voice. The entire system is orchestrated by a coordinator (Azure Cognitive Services) that manages authentication (keys/tokens), scaling (handling multiple concurrent requests), and regional availability. Just as a real interpreter must be accurate, fast, and secure, the services aim for high accuracy, low latency, and enterprise-grade security. The analogy breaks down at scale: the service handles far more concurrent requests than any human team could.
What Are Azure Language and Speech Services?
Azure Language and Speech Services are cloud-based APIs and SDKs that provide pre-built machine learning models for natural language processing (NLP) and speech processing. They are part of Azure Cognitive Services, which are grouped into decision, vision, speech, and language categories. The Language Service (formerly Text Analytics) includes features like sentiment analysis, key phrase extraction, language detection, named entity recognition (NER), and custom text classification. The Speech Service includes speech-to-text, text-to-speech, speech translation, and speaker recognition. These services are designed to be used without machine learning expertise — developers simply call REST APIs or use client libraries.
How They Work Internally
Both services are built on deep neural networks trained on massive datasets. When you send a request (text or audio), the service processes it through a pipeline:
- Pre-processing: Text is cleaned and tokenized; audio is converted to a suitable format (e.g., PCM, Opus) and split into frames.
- Inference: The pre-processed data is fed into a trained model that outputs predictions (e.g., sentiment score, transcribed text).
- Post-processing: Results are formatted into JSON responses. For speech, the service may apply language models, acoustic models, and pronunciation scoring.
Key Components, Defaults, and Timers
- Language Service: Key features include:
- Sentiment Analysis: Returns a sentiment label (positive, negative, neutral, mixed) and confidence scores (0 to 1). Default endpoint: https://<region>.api.cognitive.microsoft.com/text/analytics/v3.1/sentiment.
- Key Phrase Extraction: Returns a list of key phrases. Endpoint: /keyPhrases.
- Language Detection: Returns language name, ISO 639-1 code, and confidence score. Endpoint: /languages.
- Named Entity Recognition (NER): Recognizes entities like persons, locations, organizations. Endpoint: /entities/recognition/general.
- Custom Text Classification: Requires a trained model; endpoint includes model version.
- PII Detection: Identifies personally identifiable information (e.g., phone numbers, emails). Endpoint: /entities/recognition/pii.
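- Defaults: Synchronous text APIs accept up to 5,120 characters per document; the asynchronous analyze operation accepts up to 125,000. The maximum number of documents per request depends on the feature (for example, 10 for sentiment analysis, 1,000 for language detection). The 1,000-character figure is the billing unit: each 1,000 characters of a document counts as one text record. Rate limits vary by tier (Free: 5,000 text records/month, S0: 1,000 calls per minute). Limits change between API versions, so confirm them against the current service-limits documentation.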
- Speech Service: Key features include:
- Speech-to-Text (STT): Real-time or batch transcription. Supports multiple audio formats (WAV, MP3, OGG). Default recognition language is en-US. Use SDK or REST API.
- Text-to-Speech (TTS): Converts text to natural-sounding speech. Supports neural voices, custom voice fonts. Default output format is audio-16khz-32kbitrate-mono-mp3.
- Speech Translation: Real-time translation of audio to text in multiple languages.
- Speaker Recognition: Identifies or verifies speakers based on voice characteristics.
- Custom Speech: Allows training custom acoustic, language, and pronunciation models. Requires a dataset of audio + transcriptions.
- Pronunciation Assessment: Evaluates pronunciation accuracy, fluency, and completeness.
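- Timers: Single-shot recognition (RecognizeOnceAsync) ends at the first detected pause or after roughly 30 seconds of audio, whichever comes first; continuous recognition runs until you stop it. Initial and end-of-speech silence timeouts are configurable on SpeechConfig, so treat any fixed default (20 seconds is often quoted) as a value to verify for your SDK version.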
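To make the endpoint and header details above concrete, here is a minimal sketch that assembles a v3.1 sentiment request without sending it. The function name `build_sentiment_request` and its parameters are hypothetical; the URL pattern and the `Ocp-Apim-Subscription-Key` header are the ones described in this chapter, and the `documents` body shape is assumed to follow the v3.1 REST contract.

```python
import json


def build_sentiment_request(region, key, texts, language="en"):
    """Return (url, headers, body) for a v3.1 sentiment call. Nothing is sent."""
    url = (
        f"https://{region}.api.cognitive.microsoft.com"
        "/text/analytics/v3.1/sentiment"
    )
    headers = {
        "Ocp-Apim-Subscription-Key": key,
        "Content-Type": "application/json",
    }
    # Each document needs a unique id so results can be matched back to inputs.
    documents = [
        {"id": str(index), "language": language, "text": text}
        for index, text in enumerate(texts, start=1)
    ]
    return url, headers, json.dumps({"documents": documents})
```

Keeping request construction separate from the HTTP call makes it easy to inspect the payload, or to swap in a different API version, before anything reaches the service.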
Configuration and Verification Commands
To use these services, you need to create a Cognitive Services resource in Azure. You can do this via Azure CLI:
az cognitiveservices account create --name MyLanguageService --resource-group MyResourceGroup --kind TextAnalytics --sku F0 --location westus
For Speech:
az cognitiveservices account create --name MySpeechService --resource-group MyResourceGroup --kind SpeechServices --sku F0 --location westus
To get the endpoint and keys:
az cognitiveservices account keys list --name MyLanguageService --resource-group MyResourceGroup
az cognitiveservices account show --name MyLanguageService --resource-group MyResourceGroup --query "properties.endpoint"
For SDK usage, install the NuGet package:
dotnet add package Azure.AI.TextAnalytics
dotnet add package Microsoft.CognitiveServices.Speech
How They Interact with Related Technologies
Azure Bot Service: Language Service can be used to analyze user messages for sentiment and entities, improving bot responses.
Azure Search: Use Key Phrase Extraction to generate searchable keywords from documents.
Power Automate: Connect to Language Service to trigger workflows based on sentiment analysis.
Azure Functions: Serverless execution of language processing tasks.
Logic Apps: Visual workflows that call Language Service APIs.
Azure Data Lake: Process large volumes of text data with batch operations.
Azure Kubernetes Service (AKS): Deploy custom containers for Language Service (available for some features) for on-premises or edge scenarios.
Security and Authentication
All requests must include an authentication header. You can use either a subscription key (passed in the Ocp-Apim-Subscription-Key header) or an Azure AD token (for managed identity scenarios). For speech, you can also use a temporary authorization token (valid for 10 minutes) obtained from the token endpoint.
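The two header styles and the 10-minute token lifetime described above could be handled roughly as follows. This is a sketch under assumptions: `SpeechTokenCache`, `fetch_token`, and `refresh_margin` are hypothetical names, and `fetch_token` stands in for whatever code actually POSTs to the token endpoint (commonly `https://<region>.api.cognitive.microsoft.com/sts/v1.0/issueToken`) with the subscription key.

```python
import time

TOKEN_LIFETIME_SECONDS = 600  # the 10-minute validity described above


def key_headers(key):
    """Header for subscription-key authentication."""
    return {"Ocp-Apim-Subscription-Key": key}


def bearer_headers(token):
    """Header for token authentication."""
    return {"Authorization": f"Bearer {token}"}


class SpeechTokenCache:
    """Reuse a token until shortly before it expires, then fetch a new one."""

    def __init__(self, fetch_token, clock=time.monotonic, refresh_margin=60):
        self._fetch_token = fetch_token        # hypothetical callable returning a token string
        self._clock = clock                    # injectable so the cache can be tested
        self._refresh_margin = refresh_margin  # refresh this many seconds early
        self._token = None
        self._expires_at = 0.0

    def get(self):
        now = self._clock()
        if self._token is None or now >= self._expires_at - self._refresh_margin:
            self._token = self._fetch_token()
            self._expires_at = now + TOKEN_LIFETIME_SECONDS
        return self._token
```

Refreshing a little early (60 seconds here, an arbitrary choice) avoids sending a token that expires while a long request is still in flight.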
Pricing Tiers
Free (F0): 5,000 transactions per month for Language; 5 audio hours per month for Speech.
Standard (S0): Pay-as-you-go; higher throughput and features like custom models. For Speech, you pay per audio hour.
Custom model training: Additional charges for training hours.
Region Availability
Both services are available in multiple Azure regions. Some features (like custom neural voice) are restricted to certain regions (e.g., West US 2, West Europe). Always check the documentation for region-specific availability.
Monitoring and Logging
You can enable diagnostic settings to send logs to Azure Monitor, Storage, or Event Hub. Metrics include number of calls, latency, and errors. Use Application Insights for SDK-side telemetry.
Best Practices
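Batch documents into a single request for large volumes of text, staying within the per-feature document limit (for example, 10 per request for sentiment analysis, 1,000 for language detection).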
For real-time speech, use the SDK with WebSocket protocol for lower latency.
Implement retry logic with exponential backoff for transient failures.
Secure your keys using Azure Key Vault or managed identities.
For custom speech, ensure training data is representative of the target acoustic environment.
Use the TextAnalyticsClient class for language operations, and SpeechRecognizer for speech.
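The retry advice in the list above can be sketched as a small helper. `TransientError` and `call_with_backoff` are hypothetical names, and which failures count as transient (for example HTTP 429 or 503) is an assumption you would map from your own HTTP client or SDK exceptions.

```python
import random
import time


class TransientError(Exception):
    """Stand-in for a retryable failure such as throttling or a brief outage."""


def call_with_backoff(operation, max_attempts=5, base_delay=1.0, max_delay=30.0,
                      sleep=time.sleep, jitter=random.random):
    """Call operation(), retrying transient failures with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Delay doubles each attempt (1s, 2s, 4s, ...) up to max_delay.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Add up to 10% random jitter so many clients do not retry in lockstep.
            sleep(delay + jitter() * delay * 0.1)
```

The Azure SDK clients ship their own retry policies, so a helper like this is mainly relevant when calling the REST endpoints directly.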
Create Cognitive Services Resource
Go to Azure Portal, click 'Create a resource', search for 'Language service' or 'Speech service'. Choose the appropriate service (e.g., Text Analytics, Speech). Select a pricing tier (F0 for free tier, S0 for production). Choose a region (e.g., West US). Provide a resource group and name. Review and create. After deployment, note the endpoint and keys. This resource will be used for all subsequent API calls.
Get Endpoint and Keys
In the Azure Portal, navigate to your resource. Under 'Resource Management', click 'Keys and Endpoint'. Copy one of the keys and the endpoint URL. For security, store keys in Azure Key Vault or environment variables. The endpoint is typically in the format `https://<region>.api.cognitive.microsoft.com/`. For speech, the endpoint includes the region, e.g., `https://westus.stt.speech.microsoft.com/speech/recognition/conversation/cognitiveservices/v1`.
Install SDK and Authenticate
In your .NET project, install the Azure.AI.TextAnalytics NuGet package (for language) or Microsoft.CognitiveServices.Speech (for speech). Create a client object using the endpoint and key. For language: `var client = new TextAnalyticsClient(new Uri(endpoint), new AzureKeyCredential(key));`. For speech: `var config = SpeechConfig.FromSubscription(key, region);`. Authentication must happen before any API call.
Perform Sentiment Analysis
Call the `AnalyzeSentiment` method on a document. The method returns a `DocumentSentiment` object with a sentiment label (positive, negative, neutral, mixed) and confidence scores. Example: `DocumentSentiment documentSentiment = client.AnalyzeSentiment("I had a wonderful day!", "en");`. The service uses a pre-trained model that processes the text and returns scores for each sentiment class. The scores sum to 1.
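For readers calling the REST endpoint instead of the .NET SDK, the label and scores described above arrive as JSON. The sketch below reads a hand-written sample that is assumed to follow the v3.1 response shape; `SAMPLE_RESPONSE`, `summarize_sentiment`, and the `needs_review` flag are illustrative, not service output.

```python
# Hand-written sample assumed to mirror the v3.1 response shape; not real output.
SAMPLE_RESPONSE = {
    "documents": [
        {
            "id": "1",
            "sentiment": "positive",
            "confidenceScores": {"positive": 0.98, "neutral": 0.01, "negative": 0.01},
        },
        {
            "id": "2",
            "sentiment": "mixed",
            "confidenceScores": {"positive": 0.45, "neutral": 0.05, "negative": 0.50},
        },
    ],
    "errors": [],
}


def summarize_sentiment(response):
    """Map document id to its label, highest-scoring class, and a review flag."""
    summaries = {}
    for doc in response.get("documents", []):
        scores = doc["confidenceScores"]
        summaries[doc["id"]] = {
            "label": doc["sentiment"],
            # The label can be "mixed", which is not one of the three score classes.
            "top_class": max(scores, key=scores.get),
            "needs_review": doc["sentiment"] in ("negative", "mixed"),
        }
    return summaries
```

Note that the label and the highest-scoring class can differ: a `mixed` document still has only positive, neutral, and negative scores.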
Perform Speech-to-Text
Create a `SpeechRecognizer` object with the speech config and an audio config (e.g., from microphone or file). Call `RecognizeOnceAsync()` for a single utterance or `StartContinuousRecognitionAsync()` for continuous recognition. The result contains the recognized text, a `Reason` (such as `ResultReason.RecognizedSpeech` or `ResultReason.NoMatch`), and the duration; confidence scores are only included when detailed output is requested. Example: `var result = await recognizer.RecognizeOnceAsync();`. The audio is streamed to the service, which processes it in chunks; in continuous mode, interim results arrive through the `Recognizing` event and final results through `Recognized`.
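Outside the SDK, short recordings can also be sent to the REST endpoint shown earlier in this chapter. The sketch below only builds the request and reads a response; the function names are hypothetical, and the `language` query parameter, the WAV/PCM content type, and the `RecognitionStatus`/`DisplayText` fields are assumed to match the short-audio REST API.

```python
from urllib.parse import urlencode


def build_stt_request(region, key, language="en-US"):
    """Return (url, headers) for a short-audio recognition call. Nothing is sent."""
    base = (
        f"https://{region}.stt.speech.microsoft.com"
        "/speech/recognition/conversation/cognitiveservices/v1"
    )
    url = f"{base}?{urlencode({'language': language})}"
    headers = {
        "Ocp-Apim-Subscription-Key": key,
        # Assumes 16 kHz mono PCM in a WAV container.
        "Content-Type": "audio/wav; codecs=audio/pcm; samplerate=16000",
        "Accept": "application/json",
    }
    return url, headers


def extract_text(result):
    """Return the recognized text, or None when recognition did not succeed."""
    if result.get("RecognitionStatus") != "Success":
        return None
    return result.get("DisplayText")
```

The audio bytes would go in the POST body; checking the status before reading the text mirrors checking `Reason` on an SDK result.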
Enterprise Scenario 1: Customer Feedback Analysis
A large e-commerce company wants to analyze customer reviews to detect negative sentiment in real time. They use Azure Language Service's Sentiment Analysis API. The application sends each review (as a document) to the API and receives sentiment scores. They configure a Logic App to trigger an alert when sentiment is negative, sending an email to customer service. They process ~10,000 reviews per day, staying within the S0 tier limits (1,000 calls per minute). They group reviews into each request, up to the per-feature document limit (10 for sentiment analysis), which reduces the number of calls; billing is still per text record. A common misconfiguration is not handling the 'mixed' sentiment label, which occurs when a review contains both positive and negative statements. They also use Key Phrase Extraction to generate tags for product improvement.
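A rough sketch of the batching and label handling in this scenario follows. The batch size is a parameter because the per-request document limit depends on the feature and API version; `make_batches`, `route_review`, and the route names are hypothetical.

```python
def make_batches(items, batch_size):
    """Yield consecutive slices of items, each at most batch_size long."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def route_review(label):
    """Decide what to do with a review based on its sentiment label."""
    if label == "negative":
        return "alert"    # e.g. notify customer service
    if label == "mixed":
        return "review"   # contains both praise and complaints; do not ignore it
    return "archive"      # positive or neutral
```

Handling `mixed` explicitly addresses the misconfiguration called out above, where reviews with both positive and negative statements are silently dropped.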
Enterprise Scenario 2: Real-Time Meeting Transcription
A multinational corporation uses Azure Speech Service to transcribe meetings in multiple languages. They use the Speech SDK with a custom language model trained on corporate jargon. The audio is captured from Microsoft Teams via a bot. The transcription is displayed in real-time and stored in Azure Blob Storage for compliance. They use Speech Translation to translate English to Spanish and French simultaneously. They encountered issues with background noise affecting accuracy, so they implemented noise suppression and used a custom acoustic model. They also use Speaker Identification to label who is speaking. The system handles up to 50 concurrent meetings, each with a 2-hour duration. They monitor usage with Azure Monitor and set up alerts for high latency.
Enterprise Scenario 3: Document Processing Pipeline
An insurance company processes thousands of claim forms daily. They use Language Service's NER to extract entities like policy numbers, dates, and names. They also use Custom Text Classification to categorize claims (e.g., auto, health, property). The pipeline uses Azure Functions to trigger when a new document is uploaded to Blob Storage. The function reads the text (using OCR if needed), sends it to Language Service, and stores the extracted data in Cosmos DB. They use batch operations for efficiency. A common pitfall is exceeding the per-document character limit (5,120 characters for synchronous calls), so they split long documents into chunks. They also use PII detection to redact sensitive information before storing.
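The chunking step in this pipeline might look like the sketch below, which splits on whitespace so words stay intact. `split_into_chunks` is a hypothetical helper, the limit is passed in rather than hard-coded, and Python's `len` counts code points, which is only an approximation of how the service measures document length.

```python
def split_into_chunks(text, max_chars):
    """Split text into whitespace-delimited chunks no longer than max_chars."""
    if max_chars < 1:
        raise ValueError("max_chars must be at least 1")
    chunks = []
    current = ""
    for word in text.split():
        # A single word longer than the limit has to be cut mid-word.
        while len(word) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(word[:max_chars])
            word = word[max_chars:]
        candidate = word if not current else f"{current} {word}"
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks
```

Splitting on whitespace can separate an entity from its context (a name at the end of one chunk, a date at the start of the next), so sentence-aware splitting may give better NER results in practice.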
What AZ-204 Tests on This Topic
AZ-204 objective 5.1 focuses on integrating Cognitive Services into applications. You will be tested on:
Provisioning a Cognitive Services resource (CLI, portal, ARM templates).
Authentication using keys or Azure AD (managed identities).
Using SDKs to call Language and Speech APIs.
Handling real-time vs. batch processing.
Configuring custom models (Custom Speech, Custom Text Classification).
Monitoring and logging.
Common Wrong Answers and Why Candidates Choose Them
Wrong: Use the same key for all Cognitive Services. Reality: Each service resource has its own keys. Some candidates confuse multi-service resource (one key for multiple services) with individual resources. The exam tests understanding that a multi-service resource (kind: CognitiveServices) provides a single key for multiple services, but you must still create separate resources for custom features.
Wrong: Speech-to-Text only supports real-time streaming. Reality: The Speech Service also supports batch transcription via a REST API. Candidates may think only the SDK supports STT, but the batch API is available for pre-recorded audio.
Wrong: Language Service can process documents of any length. Reality: Synchronous calls are limited to 5,120 characters per document; the 1,000-character figure is the billing unit (one text record), not the size limit. Candidates often assume no limit. For longer text, you must split documents or use the asynchronous analyze operation, which supports up to 125,000 characters.
Wrong: Custom Speech requires no training data. Reality: You must provide audio + transcriptions to train a custom model. Candidates may think it's just configuration.
Specific Numbers and Terms That Appear on the Exam
Free tier limits: 5,000 transactions/month for Language; 5 audio hours/month for Speech.
Default STT language: en-US.
Silence timeout: 20 seconds default.
Supported audio formats: WAV, MP3, OGG, etc.
Authentication headers: Ocp-Apim-Subscription-Key for keys; Authorization: Bearer for tokens.
SDK namespaces: Azure.AI.TextAnalytics, Microsoft.CognitiveServices.Speech.
Edge Cases and Exceptions
Multi-service resource: Can be used for Language, Speech, and other services, but you cannot use custom features (Custom Speech, Custom Text) with a multi-service resource; you need a dedicated resource.
Regional restrictions: Custom Neural Voice is only available in certain regions (e.g., West US 2).
Token expiration: Authorization tokens for speech expire after 10 minutes; you must refresh them.
Batch transcription: Maximum audio file size is 1 GB per file; maximum batch size is 100 files.
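Batch transcription is asynchronous, so a client submits a job and then polls for completion. The sketch below shows both halves under assumptions: the body fields (`contentUrls`, `locale`, `displayName`) and the status values are taken to match the v3.x batch transcription REST API, while `get_status` is a hypothetical callable that would wrap the actual GET request.

```python
import json
import time


def build_batch_transcription_body(content_urls, locale="en-US",
                                   display_name="batch-job"):
    """Return the JSON body for creating a batch transcription job."""
    return json.dumps({
        "contentUrls": list(content_urls),  # URIs of audio files, e.g. blob SAS URLs
        "locale": locale,
        "displayName": display_name,
    })


def wait_for_transcription(get_status, poll_interval=30.0, max_polls=120,
                           sleep=time.sleep):
    """Poll get_status() until the job reaches a terminal state."""
    for _ in range(max_polls):
        status = get_status()
        if status in ("Succeeded", "Failed"):
            return status
        sleep(poll_interval)  # job is still NotStarted or Running
    raise TimeoutError("transcription did not finish in time")
```

Capping the number of polls keeps a stuck job from blocking the caller indefinitely; the interval and cap used here are arbitrary defaults.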
How to Eliminate Wrong Answers
Understand the underlying mechanism: if a question asks about real-time transcription, the answer should involve WebSocket and SDK. If it asks about processing a large batch of text files, the answer should involve batch API or splitting documents. If it asks about custom models, look for options that mention training data. If it asks about authentication, look for key or token options. Eliminate answers that use incorrect limits (e.g., 10,000 characters) or incorrect features (e.g., using Text Analytics for speech).
Azure Language Service provides pre-built NLP models: sentiment analysis, key phrase extraction, language detection, NER, PII detection, and custom text classification.
Azure Speech Service provides speech-to-text, text-to-speech, speech translation, speaker recognition, and custom speech models.
Both services require a Cognitive Services resource with an endpoint and key (or Azure AD token).
Synchronous Language API calls accept up to 5,120 characters per document (billed per 1,000-character text record); the number of documents per request depends on the feature.
Speech-to-Text supports real-time (SDK via WebSocket) and batch (REST API) modes.
Custom models (Custom Speech, Custom Text) require training data and a dedicated resource (not multi-service).
Free tier limits: 5,000 transactions/month for Language; 5 audio hours/month for Speech.
Authentication uses Ocp-Apim-Subscription-Key header or Authorization: Bearer with a token (valid 10 minutes for speech).
Use Azure Monitor and Application Insights for logging and telemetry.
Always handle transient errors with retry logic and exponential backoff.
These come up on the exam all the time. Here's how to tell them apart.
Azure Language Service (Text Analytics)
Input is text (strings), not audio.
Features: sentiment analysis, key phrases, NER, language detection, PII detection, custom classification.
Max 5,120 characters per document for synchronous calls.
SDK namespace: Azure.AI.TextAnalytics.
Pricing based on number of transactions (1,000 characters = 1 transaction).
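The billing rule above (1,000 characters = 1 transaction) lends itself to a quick estimate. This sketch assumes each document is rounded up to whole 1,000-character units and uses the 5,000-per-month free tier figure quoted in this chapter; the function names are hypothetical, and the service's exact counting rules may differ.

```python
import math


def text_records(document, record_size=1000):
    """Estimate billable units for one document, rounding up."""
    return math.ceil(len(document) / record_size)


def fits_free_tier(documents, monthly_limit=5000):
    """Check whether a month's documents stay within the free tier estimate."""
    return sum(text_records(doc) for doc in documents) <= monthly_limit
```

For example, a 2,500-character review would count as 3 units under this estimate, so 2,000 such reviews would exceed the free tier.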
Azure Speech Service
Input is audio (WAV, MP3, etc.) or text for TTS.
Features: speech-to-text, text-to-speech, speech translation, speaker recognition, custom speech, pronunciation assessment.
Supports real-time streaming and batch transcription (audio up to 1 GB).
SDK namespace: Microsoft.CognitiveServices.Speech.
Pricing based on audio hours (for STT and speech translation) or characters (for TTS).
Mistake
Azure Language Service can process an unlimited number of characters per document.
Correct
Synchronous calls are limited to 5,120 characters per document. For longer text, use the asynchronous analyze operation, which supports up to 125,000 characters, or split the text into multiple documents.
Mistake
You can use a single Cognitive Services key for all Language and Speech features.
Correct
A multi-service resource (kind: CognitiveServices) provides a single key for multiple services, but custom features like Custom Speech and Custom Text require dedicated resources. Also, you must create separate resources for each service if you need different tiers.
Mistake
Speech-to-Text only works with real-time audio streaming.
Correct
The Speech Service also provides batch transcription via a REST API for pre-recorded audio files. You can submit audio files (up to 1 GB each) and get transcriptions asynchronously.
Mistake
Custom Speech models can be trained without providing transcriptions.
Correct
You must provide a dataset of audio files with matching transcriptions (text) to train a custom acoustic or language model. Without transcriptions, you can only use pre-built models.
Mistake
Language Service's sentiment analysis always returns a binary positive/negative result.
Correct
Sentiment analysis returns a label (positive, negative, neutral, mixed) and confidence scores. 'Mixed' is a valid label when both positive and negative sentiments are present. The scores are three values (positive, negative, neutral) that sum to 1.
A multi-service resource (kind: CognitiveServices) provides a single endpoint and key to access multiple Cognitive Services (e.g., Language, Speech, Vision) from one resource. However, it does not support custom features like Custom Speech or Custom Text. For those, you need a dedicated single-service resource (e.g., SpeechServices, TextAnalytics). Additionally, billing is aggregated under one resource, which may be simpler for cost management.
For synchronous calls, you must split the text into multiple documents, each within the 5,120-character limit, and send them in batches. Alternatively, you can use the asynchronous analyze operation (e.g., /analyze), which supports up to 125,000 characters per document. Limits and pricing can change between API versions, so check the latest documentation.
Yes. You can use the batch transcription API (REST) to transcribe audio files stored in Azure Blob Storage. Submit a POST request with the audio file URI, and the service will process it asynchronously. You can then poll for results. The maximum file size is 1 GB, and you can include up to 100 files per batch.
The Speech Service supports several audio formats including WAV (PCM), MP3, OGG (Opus), FLAC, and others. For real-time streaming, the SDK uses a specific format (e.g., 16 kHz, 16-bit, mono PCM for optimal recognition). You can specify the audio format in the request using the `audio` configuration.
First, create a dedicated Speech resource (not multi-service). Then, upload a dataset of audio files with matching transcriptions (text) to the Speech Studio or via API. You can train acoustic, language, or pronunciation models. After training, you deploy the model and use its endpoint ID in your application. Training costs additional fees based on compute hours.
Sentiment analysis returns a global sentiment label (positive, negative, neutral, mixed) for a document. Opinion mining (also called aspect-based sentiment analysis) goes deeper by extracting aspects (e.g., 'battery life') and their associated sentiment (e.g., 'positive'). Opinion mining is available as a separate feature in Language Service (v3.1) and requires the `opinionMining` parameter set to true.
You can use managed identity or a service principal. For managed identity, enable it on your compute resource (e.g., Azure Function) and assign the 'Cognitive Services User' role on the Cognitive Services resource. Then, use `DefaultAzureCredential` in your SDK code. For example: `var client = new TextAnalyticsClient(endpoint, new DefaultAzureCredential());`. This eliminates the need to store keys.