AI-900 · Chapter 17 of 100 · Objective 4.4

Azure AI Speech Service

Azure AI Speech Service is a core component of Microsoft's AI platform that enables speech-to-text, text-to-speech, and speech translation. For the AI-900 exam, this topic falls under the natural language processing (NLP) domain, objective 4.4, which tests your understanding of when to use each speech capability and how to configure it. Approximately 5–10% of exam questions touch on speech services, often asking you to identify the correct service for a business scenario. By the end of this chapter, you will be able to differentiate between Speech-to-Text, Text-to-Speech, Speech Translation, and Custom Voice, and you will know the key configuration parameters such as locale, voice name, and recognition mode.

25 min read

Intermediate

Updated Jul 20, 2026

Reviewed by Johnson Ajibi· Senior Network & Security Engineer · MSc IT Security


Azure Speech Service as a Multilingual Interpreter

Twelve attendees, each speaking a different language, occupy a conference room. You hire a team of interpreters (the Azure Speech Service). Each interpreter specializes in a specific task: one transcribes spoken English into text (speech-to-text), another reads written text aloud in French (text-to-speech), and a third translates the English text into Spanish in real-time (speech translation). The interpreters work in a soundproof booth (the cloud), connected to the room via microphones and speakers (your application's audio input/output). The lead interpreter (the Speech SDK) manages the flow: when someone speaks, the audio is sent to the booth, the appropriate interpreter processes it, and the result is sent back. The interpreters have been trained on thousands of hours of meetings (custom models), so they understand accents and industry jargon. If you need a new language, you hire a new interpreter (add a language model). The entire system is scalable: if the conference becomes a global event with thousands of attendees, you can add more interpreters (scale up the service) without changing the booth's wiring. The key insight: the interpreters do not understand the content; they only transform the signal (audio to text or vice versa) based on statistical patterns, just as Azure Speech Service uses deep neural networks to map acoustic signals to linguistic representations.

How It Actually Works

What is Azure AI Speech Service?

Azure AI Speech Service is a cloud-based API that provides speech capabilities powered by deep neural networks. It is part of Azure AI services (formerly Azure Cognitive Services); within that family it is a resource type of its own, separate from the Azure AI Language service, although AI-900 groups both under natural language processing workloads. The service offers three primary functionalities:

Speech-to-Text (STT): Converts audio streams into text in real-time or batch.

Text-to-Speech (TTS): Converts text into lifelike speech, including neural voices that sound natural.

Speech Translation: Translates speech from one language to another in real-time, with optional text output.

The service is designed to be used in applications like virtual assistants, transcription services, voice-controlled systems, and accessibility tools. For the AI-900 exam, you must know the specific use cases and how to choose the right service.

How It Works Internally

Azure Speech Service uses a pipeline of machine learning models:

1. Acoustic Model: Converts raw audio waveforms into phonemes (basic sound units). This model is trained on thousands of hours of speech data and handles variations in accent, noise, and microphone quality.

2. Language Model: Predicts the most likely sequence of words given the phonemes. It uses statistical n-grams and neural language models to understand context and grammar.

3. Pronunciation Model: Maps phonemes to words, handling homophones and proper nouns.

4. Neural Text-to-Speech: For TTS, a neural network generates speech waveforms from text, using a sequence-to-sequence architecture with attention. The model learns prosody, intonation, and pauses.

When you send an audio stream to the Speech SDK, the following happens:

The SDK chunks the audio into frames (typically 100ms long) and sends them to the endpoint.

The service processes each frame through the acoustic and language models to produce partial recognition results.

As more audio arrives, the service refines the hypothesis and eventually returns a final result.

For translation, the recognized text is passed to a translation model (similar to Translator Text) and then optionally synthesized to speech via TTS.
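The framing step can be pictured with a few lines of plain Python. This is only a sketch of the idea, not the SDK's actual implementation: the function name chunk_pcm is hypothetical, and it assumes raw 16 kHz, 16-bit mono PCM with the 100 ms frame length mentioned above.

```python
from typing import Iterator


def chunk_pcm(pcm: bytes, sample_rate: int = 16_000, sample_width: int = 2,
              frame_ms: int = 100) -> Iterator[bytes]:
    """Yield consecutive frames of raw mono PCM audio, each frame_ms long.

    The last frame may be shorter if the audio does not divide evenly.
    """
    bytes_per_frame = sample_rate * sample_width * frame_ms // 1000
    for start in range(0, len(pcm), bytes_per_frame):
        yield pcm[start:start + bytes_per_frame]


# One second of 16 kHz, 16-bit mono silence is 32,000 bytes.
one_second = bytes(32_000)
frames = list(chunk_pcm(one_second))
print(len(frames), len(frames[0]))  # 10 frames of 3,200 bytes each
```

In a real application the Speech SDK does this framing for you; the sketch only shows why partial results can arrive before the speaker has finished.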

Key Components, Values, and Defaults

Speech SDK: The primary way to interact with the service. Available in multiple languages (C#, Python, JavaScript, Java, etc.).

Endpoint: The REST API or WebSocket endpoint for real-time streaming. Default endpoint pattern: https://{region}.stt.speech.microsoft.com/ for STT.

Subscription Key: A key from your Azure Cognitive Services resource. Must be passed in headers or via SDK configuration.

Region: Required. Example: westus, eastasia. The service is not available in all regions; check documentation for supported regions.

Recognition Mode: For STT, you can choose Interactive (real-time, low latency, for commands), Conversation (for multi-turn dialogue), or Dictation (for longer, continuous speech with punctuation).

Locale: Language and dialect. Example: en-US for American English, en-GB for British English. The default is en-US.

Voice Name: For TTS, you specify a voice. Microsoft provides over 200 neural voices across 70+ locales. Example: en-US-JennyNeural for a female voice. You can also use en-US-ChristopherNeural for a male voice.

Profanity Filter: Options: None, Masked (replaces profanity with asterisks), Removed, Tags. Default is Masked.

Endpoint ID: For custom models, you must provide the endpoint ID of your custom speech model.

Timeouts: The service has a 10-second silence timeout for interactive mode; if no speech is detected, the session ends.
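To see how region, recognition mode, locale, and profanity setting fit together, the sketch below assembles a speech-to-text REST URL from them. The SttSettings class and both helper functions are hypothetical names for illustration. The path layout, the language and profanity query parameters, and the Ocp-Apim-Subscription-Key header are modeled on the speech-to-text REST API for short audio, so check the current documentation for the accepted values before relying on them.

```python
from dataclasses import dataclass
from urllib.parse import urlencode

VALID_MODES = {"interactive", "conversation", "dictation"}


@dataclass(frozen=True)
class SttSettings:
    """Hypothetical container for the settings described in this section."""
    region: str = "westus"
    mode: str = "conversation"
    locale: str = "en-US"
    profanity: str = "masked"


def stt_rest_url(settings: SttSettings) -> str:
    """Build a speech-to-text URL from the region, mode, and locale."""
    if settings.mode not in VALID_MODES:
        raise ValueError(f"unknown recognition mode: {settings.mode!r}")
    query = urlencode({"language": settings.locale,
                       "profanity": settings.profanity})
    return (f"https://{settings.region}.stt.speech.microsoft.com"
            f"/speech/recognition/{settings.mode}/cognitiveservices/v1?{query}")


def auth_headers(subscription_key: str) -> dict:
    """Return the header that carries the subscription key."""
    return {"Ocp-Apim-Subscription-Key": subscription_key}


print(stt_rest_url(SttSettings()))
```

No request is sent here; the point is only that every setting in the list above ends up in either the URL or the headers.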

Configuration and Verification

To configure speech-to-text using the SDK in Python:

import azure.cognitiveservices.speech as speechsdk

# Authenticate with the key and region of your Speech resource.
speech_config = speechsdk.SpeechConfig(subscription="YourSubscriptionKey", region="westus")
speech_config.speech_recognition_language = "en-US"
# Capture audio from the default microphone.
audio_config = speechsdk.audio.AudioConfig(use_default_microphone=True)
recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config, audio_config=audio_config)
# Recognize a single utterance; recognition stops at the first pause.
result = recognizer.recognize_once()
print(result.text)

For text-to-speech:

import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(subscription="YourKey", region="westus")
speech_config.speech_synthesis_voice_name = "en-US-JennyNeural"
# With no audio_config, the synthesizer plays through the default speaker.
synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config)
result = synthesizer.speak_text_async("Hello world").get()

Verification: You can test the service in Speech Studio, a web portal (separate from the Azure portal) that provides a no-code interface to try STT, TTS, and translation with your Speech resource. This is also where you can create custom models.

How It Interacts with Related Technologies

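Language Understanding (LUIS): Often combined with STT to build a voice-enabled chatbot. STT converts speech to text, then LUIS extracts intent and entities. Microsoft has since named conversational language understanding (CLU) in Azure AI Language as the successor to LUIS, so newer material may use that name for the same role.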

Translator Text: Speech Translation uses the same backend as Translator Text for language translation.

Custom Speech: Allows you to train custom acoustic and language models with your own data (e.g., industry-specific jargon). Requires a Subscription Key and a Speech resource in a region that supports Custom Speech.

Custom Voice: You can create a unique synthetic voice using your own audio recordings. This is a separate service with additional training time and cost.

Azure Bot Service: Integrates with Speech SDK to enable voice interactions in bots.

Performance and Scale

Latency: Interactive mode typically returns results in under 500ms for short phrases. Dictation mode may take longer.

Concurrent Requests: The free tier (F0) allows 5 concurrent requests; standard tier (S0) allows up to 100 (adjustable via support).

Audio Formats: Supported formats include WAV (PCM), MP3, OGG, and FLAC. Sample rate must be 16 kHz or 8 kHz (for telephony).

Batch Transcription: For large volumes, you can use the Batch Transcription API, which processes audio files asynchronously and supports up to 1000 requests per day.
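Because the sample-rate requirement above is easy to violate with recordings from arbitrary sources, a quick local check before uploading can save a failed request. The sketch below uses only Python's standard wave module; the function names are hypothetical, and it covers WAV (PCM) files only, not MP3, OGG, or FLAC.

```python
import io
import wave

SUPPORTED_RATES = {8_000, 16_000}  # sample rates named in this chapter


def wav_sample_rate_ok(wav_bytes: bytes) -> bool:
    """Return True if the WAV data uses a sample rate listed above."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        return wav.getframerate() in SUPPORTED_RATES


def make_silent_wav(rate: int, seconds: float = 0.1) -> bytes:
    """Create a short 16-bit mono WAV of silence, used here as test data."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(bytes(int(rate * seconds) * 2))
    return buffer.getvalue()


print(wav_sample_rate_ok(make_silent_wav(16_000)))  # True
print(wav_sample_rate_ok(make_silent_wav(44_100)))  # False
```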

Edge Cases

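Multiple Languages: Use automatic language identification by passing a list of candidate locales (e.g., ["en-US", "es-ES"]) in an AutoDetectSourceLanguageConfig when you create the recognizer; speech_recognition_language itself accepts only a single locale. The service then identifies which of the candidate languages is being spoken.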

Custom Models: If you use a custom model, you must provide the endpoint ID. The model must be deployed to an endpoint before use.

Streaming vs. Batch: Streaming uses WebSocket for real-time; batch uses REST for pre-recorded audio.

Profanity Filter: Be aware that the default filter may mask words you want to keep in certain applications (e.g., medical terminology).
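For the multiple-languages case, a minimal sketch in Python might look like the following. The helper names and the simplified locale pattern are assumptions for illustration (some real locale codes do not fit the pattern). The SDK import sits inside the function so that the validation helper can be tried without the SDK installed, and running build_auto_detect_recognizer requires a real key, region, and microphone.

```python
import re

# Simplified pattern for codes such as "en-US"; an assumption, not a full rule.
LOCALE_PATTERN = re.compile(r"^[a-z]{2,3}-[A-Z]{2}$")


def candidate_locales(locales):
    """Validate and deduplicate locale codes, preserving their order."""
    seen = []
    for locale in locales:
        if not LOCALE_PATTERN.match(locale):
            raise ValueError(f"not a locale code: {locale!r}")
        if locale not in seen:
            seen.append(locale)
    return seen


def build_auto_detect_recognizer(key, region, locales):
    """Create a recognizer that chooses among the candidate languages."""
    # Imported lazily so candidate_locales works without the SDK.
    import azure.cognitiveservices.speech as speechsdk

    speech_config = speechsdk.SpeechConfig(subscription=key, region=region)
    auto_detect = speechsdk.languageconfig.AutoDetectSourceLanguageConfig(
        languages=candidate_locales(locales))
    audio_config = speechsdk.audio.AudioConfig(use_default_microphone=True)
    return speechsdk.SpeechRecognizer(
        speech_config=speech_config,
        auto_detect_source_language_config=auto_detect,
        audio_config=audio_config)


print(candidate_locales(["en-US", "es-ES", "en-US"]))  # ['en-US', 'es-ES']
```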

Walk-Through

Create a Speech Resource

In the Azure portal, navigate to 'Create a resource' and search for 'Speech'. Choose the Cognitive Services Speech service. Select a pricing tier: Free (F0) for development with limited calls, Standard (S0) for production. Choose a region; note that not all regions support all features (e.g., Custom Neural Voice is limited to certain regions). After creation, note the subscription key and region. The key is used in all SDK calls. The resource can be scaled later, but the region cannot be changed.

Configure Speech SDK in Application

Install the Speech SDK via your language's package manager (e.g., pip install azure-cognitiveservices-speech). Create a SpeechConfig object with your subscription key and region. For STT, set the recognition language. For TTS, set the voice name. Optionally, configure audio input/output (e.g., default microphone or a file). The SDK handles connection pooling and retries. For real-time use, use the WebSocket-based recognizer; for batch, use the REST API.

Perform Speech-to-Text Recognition

Create a SpeechRecognizer object. Call recognize_once() for a single utterance (stops after silence) or start_continuous_recognition() for ongoing transcription. The recognizer sends audio frames to the service. Partial results are available via the recognizing event. The final result is delivered via the recognized event or, for recognize_once(), the returned result object. The service uses a 10-second silence timeout; if no speech is detected, it returns a NoMatch result. For dictation, set speech_recognition_language and call enable_dictation() on the SpeechConfig before creating the recognizer.
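The difference between partial and final results is easier to see with a small simulation. The TranscriptBuffer class below is a hypothetical helper, not part of the SDK: it keeps finalized phrases and shows the latest partial hypothesis after them. The closing comments sketch how it could be wired to a real recognizer's recognizing and recognized events.

```python
class TranscriptBuffer:
    """Collect final phrases and show the current partial hypothesis."""

    def __init__(self):
        self.final_segments = []
        self.partial = ""

    def on_recognizing(self, text):
        # A partial result replaces the previous hypothesis for this phrase.
        self.partial = text

    def on_recognized(self, text):
        # A final result is kept, and the partial hypothesis is cleared.
        if text:
            self.final_segments.append(text)
        self.partial = ""

    def display(self):
        pending = [self.partial + " ..."] if self.partial else []
        return " ".join(self.final_segments + pending)


buffer = TranscriptBuffer()
simulated_events = [
    ("recognizing", "scan"),
    ("recognizing", "scan item"),
    ("recognized", "Scan item 12345."),
    ("recognizing", "move to"),
]
for kind, text in simulated_events:
    if kind == "recognizing":
        buffer.on_recognizing(text)
    else:
        buffer.on_recognized(text)

print(buffer.display())  # Scan item 12345. move to ...

# With the Speech SDK, the same handlers could be attached like this:
#   recognizer.recognizing.connect(lambda evt: buffer.on_recognizing(evt.result.text))
#   recognizer.recognized.connect(lambda evt: buffer.on_recognized(evt.result.text))
#   recognizer.start_continuous_recognition()
```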

Perform Text-to-Speech Synthesis

Create a SpeechSynthesizer object. Call speak_text_async() to synthesize a string. The result contains audio data in a stream. You can save it to a file or play it directly. For SSML (Speech Synthesis Markup Language), use speak_ssml_async() to control prosody, pauses, and pronunciation. The service supports multiple audio formats (e.g., Riff16Khz16BitMonoPcm). The default voice is en-US-JennyNeural if not specified. Neural voices provide natural intonation.
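As an illustration of what gets passed to speak_ssml_async(), the sketch below builds a small SSML document with a voice, a speaking rate, and an optional pause. The build_ssml helper is hypothetical; the element names (speak, voice, prosody, break) are standard SSML, and the default voice name is the one used in this chapter.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr


def build_ssml(text, voice="en-US-JennyNeural", locale="en-US",
               rate="medium", pause_ms=0):
    """Return an SSML string that speaks text, then optionally pauses."""
    pause = f'<break time="{int(pause_ms)}ms"/>' if pause_ms else ""
    return (
        '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" '
        f'xml:lang={quoteattr(locale)}>'
        f'<voice name={quoteattr(voice)}>'
        f'<prosody rate={quoteattr(rate)}>{escape(text)}</prosody>{pause}'
        '</voice></speak>'
    )


ssml = build_ssml("Tom & Jerry are ready.", rate="slow", pause_ms=500)
ET.fromstring(ssml)  # raises ParseError if the markup is not well-formed
print(ssml)
```

Escaping the text matters: an unescaped ampersand or angle bracket in user-supplied text would make the SSML invalid.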

Implement Speech Translation

Use the TranslationRecognizer. Create a SpeechTranslationConfig with subscription key and region. Set the target language(s) using add_target_language() in Python (AddTargetLanguage() in C#, addTargetLanguage() in Java and JavaScript). Optionally set the source language (if not using auto-detect). The recognizer returns translated text and optionally synthesized speech. For real-time translation, use the same streaming approach as STT. The translation model supports 9 languages for speech-to-speech and more for speech-to-text. The service can output both the original transcription and the translation.
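Put together, the translation steps might be sketched in Python as follows. The function names are hypothetical, and translate_once needs a real key, region, and microphone to run. The SDK import sits inside that function so that the formatting helper, which shows the shape of the result (a mapping from target language to translated text), can be tried on its own.

```python
def format_translations(source_text, translations):
    """Render the recognized text and each translation on its own line."""
    lines = [f"source: {source_text}"]
    for language in sorted(translations):
        lines.append(f"{language}: {translations[language]}")
    return "\n".join(lines)


def translate_once(key, region, source_locale, targets):
    """Translate a single utterance from the default microphone."""
    # Imported lazily so format_translations works without the SDK.
    import azure.cognitiveservices.speech as speechsdk

    config = speechsdk.translation.SpeechTranslationConfig(
        subscription=key, region=region)
    config.speech_recognition_language = source_locale
    for language in targets:
        config.add_target_language(language)
    audio_config = speechsdk.audio.AudioConfig(use_default_microphone=True)
    recognizer = speechsdk.translation.TranslationRecognizer(
        translation_config=config, audio_config=audio_config)
    result = recognizer.recognize_once()
    if result.reason == speechsdk.ResultReason.TranslatedSpeech:
        return format_translations(result.text, result.translations)
    return None  # no speech matched, or the request was canceled


print(format_translations("Hello", {"fr": "Bonjour", "es": "Hola"}))
```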

What This Looks Like on the Job

Enterprise Scenario 1: Real-Time Meeting Transcription

A multinational corporation uses Azure Speech Service to transcribe board meetings in real-time. The audio from multiple microphones is streamed to the service, which outputs a live transcript displayed on a screen. The system uses the Conversation transcription mode, which supports speaker diarization (identifying who spoke). The transcript is then analyzed for action items using LUIS. The company configured a custom language model with industry-specific terms (e.g., 'ROI', 'synergy') to improve accuracy. They use the S0 tier with 50 concurrent streams. A common misconfiguration is forgetting to enable diarization, resulting in a single speaker label. The service handles background noise by using a noise suppression feature in the SDK. Latency is under 1 second for each phrase.

Scenario 2: Voice-Controlled Warehouse System

A logistics company uses Azure Speech Service for voice commands in a warehouse. Workers wear headsets and say commands like 'Scan item 12345' or 'Move to aisle 3'. The STT service recognizes the command, which is then processed by a custom NLP model to trigger the warehouse management system. The system uses Interactive recognition mode for low latency (under 300ms). The audio is from a close-talking microphone, so accuracy is high. They use a custom acoustic model trained on warehouse noise (forklifts, beeps). A pitfall: the default profanity filter masked the word 'pick' because it matched a profanity pattern; they set the filter to 'None'. They also use a custom endpoint to ensure low latency.

Scenario 3: Multilingual Customer Support Bot

A global e-commerce company deploys a voice bot for customer support in 5 languages. The bot uses Speech Translation to convert the customer's speech (e.g., Spanish) into English text, then processes the intent, and responds in the customer's language using TTS. The system uses the Speech SDK with auto-language detection. They use multiple target languages to support agents. A challenge: the translation model sometimes misses cultural nuances; they added a post-processing step to refine translations. The bot handles 10,000 calls per day with a 99.9% uptime SLA. They monitor performance using Azure Monitor and set up alerts for high latency. Misconfiguration: not setting the correct locale for TTS resulted in a Spanish voice speaking English with a wrong accent.

How AI-900 Actually Tests This

What AI-900 Tests on This Topic

Objective 4.4: 'Describe capabilities of the Speech service'. The exam expects you to:

Identify the correct service for a given scenario: Speech-to-Text, Text-to-Speech, Speech Translation, or Custom Voice.

Understand the difference between prebuilt and custom models.

Know that Custom Voice requires special training and is not available in all regions.

Recognize that Speech Translation can output both text and synthesized speech.

Know that the service supports multiple languages and that you can set the locale.

Common Wrong Answers and Why

Choosing 'Translator Text' instead of 'Speech Translation': Candidates see 'translation' and pick the simpler service. However, Translator Text only handles text input, not speech. The scenario will mention 'audio' or 'speech', so Speech Translation is correct.

Selecting 'Custom Speech' when the scenario does not mention custom data: The exam often includes a scenario with industry jargon but no mention of training data. If the scenario says 'use a prebuilt model', the answer is the standard Speech service, not Custom Speech.

Confusing 'Speech-to-Text' with 'Speaker Recognition': Speaker Recognition is a separate service (not in AI-900 scope). The exam will not test it, but candidates might think STT includes speaker identification. It does not; STT only transcribes speech to text without identifying who spoke.

Thinking Text-to-Speech can translate: TTS only converts text to speech in the same language. For translation, you need Speech Translation.

Specific Numbers and Terms

Pricing Tiers: F0 (free, 5 concurrent), S0 (standard, up to 100 concurrent).

Audio Sample Rates: 16 kHz or 8 kHz.

Silence Timeout: 10 seconds for interactive mode.

Languages: Over 70 locales for STT, over 200 neural voices for TTS.

Recognition Modes: Interactive, Conversation, Dictation.

Custom Model Types: Custom Speech (acoustic and language), Custom Voice (neural voice).

Edge Cases the Exam Loves

Multiple languages in one audio stream: The exam may ask how to handle a conversation that switches between English and Spanish. Answer: Use multilingual recognition by setting multiple locales.

Real-time vs. batch: If the scenario says 'transcribe a recorded meeting', the answer is Batch Transcription API, not the real-time SDK.

Noise reduction: The SDK includes built-in noise suppression; no need for additional services.

How to Eliminate Wrong Answers

If the scenario involves audio input, eliminate any service that only works with text (e.g., Translator Text, LUIS).

If the scenario requires outputting speech (audio), eliminate services that only output text (e.g., STT, Translator Text).

If the scenario mentions 'custom' or 'unique voice', look for Custom Voice; if it mentions 'acoustic model', look for Custom Speech.

Always check the language: if the source and target languages are the same, it's STT or TTS; if different, it's Speech Translation.
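The elimination rules above amount to a small decision procedure, sketched here as a Python function. The function name and its return labels are made up for study purposes; it encodes only the rules stated in this chapter, not an official Microsoft decision tree.

```python
def pick_capability(input_kind, output_kind, same_language=True,
                    unique_voice=False):
    """Map a scenario to the capability this chapter says to choose.

    input_kind and output_kind are "audio" or "text". Returns None when
    no single capability covered in this chapter fits the scenario.
    """
    kinds = {"audio", "text"}
    if input_kind not in kinds or output_kind not in kinds:
        raise ValueError("input_kind and output_kind must be 'audio' or 'text'")
    if input_kind == "audio":
        if not same_language:
            return "Speech Translation"  # output may be text or speech
        return "Speech-to-Text" if output_kind == "text" else None
    # Text input from here on.
    if same_language and output_kind == "audio":
        return "Custom Voice" if unique_voice else "Text-to-Speech"
    if not same_language and output_kind == "text":
        return "Translator Text"
    return None


print(pick_capability("audio", "text"))                       # Speech-to-Text
print(pick_capability("audio", "audio", same_language=False))  # Speech Translation
print(pick_capability("text", "audio", unique_voice=True))     # Custom Voice
```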

Key Takeaways

Azure Speech Service includes Speech-to-Text, Text-to-Speech, and Speech Translation.

STT converts audio to text; TTS converts text to audio; Speech Translation handles cross-language speech.

The service requires a subscription key and region; region must support the desired features.

Recognition modes: Interactive (low latency), Conversation (diarization), Dictation (continuous with punctuation).

Audio must be 16 kHz or 8 kHz sample rate; supported formats include WAV, MP3, OGG.

Custom Speech improves accuracy for domain-specific vocabulary; Custom Voice creates unique synthetic voices.

The free tier (F0) allows 5 concurrent requests; standard tier (S0) supports up to 100.

Silence timeout is 10 seconds for interactive mode; use continuous recognition for longer sessions.

Speech Translation can output both text and synthesized speech in the target language.

The service integrates with LUIS and Bot Service for building conversational AI.

Easy to Mix Up

These come up on the exam all the time. Here's how to tell them apart.

Speech-to-Text (STT)

Input is audio (speech).

Output is text.

Used for transcription, voice commands, and dictation.

Supports multiple recognition modes: interactive, conversation, dictation.

Can use custom acoustic and language models.

Text-to-Speech (TTS)

Input is text (plain text or SSML).

Output is audio (speech).

Used for voice assistants, audiobooks, and accessibility.

Supports over 200 neural voices with natural prosody.

Can use custom voice models (Custom Neural Voice).

Watch Out for These

Mistake

Speech-to-Text can identify who is speaking.

Correct

STT does not include speaker recognition. It only transcribes speech to text. Speaker diarization is a separate feature available in Conversation transcription, but it only labels speakers generically as Speaker1, Speaker2, etc.; it does not determine who each speaker actually is.

Mistake

Text-to-Speech can translate text to another language.

Correct

TTS only converts text to speech in the same language. For translation, you need Speech Translation, which first translates the text and then optionally synthesizes it in the target language.

Mistake

Custom Speech and Custom Voice are the same thing.

Correct

Custom Speech improves recognition accuracy for specific vocabulary or accents by training acoustic and language models. Custom Voice creates a unique synthetic voice from recorded samples. They are separate services with different training processes and costs.

Mistake

Azure Speech Service requires an internet connection only for initial setup.

Correct

The service is cloud-based and requires continuous internet connectivity for real-time processing. There is no offline mode. All audio is sent to Azure for processing.

Mistake

The free tier (F0) is sufficient for production workloads.

Correct

F0 is limited to 5 concurrent requests and 5 hours of audio per month. For production, you need the S0 tier, which supports higher concurrency and unlimited audio (with pay-as-you-go pricing).


Frequently Asked Questions

What is the difference between Speech-to-Text and Text-to-Speech?

Speech-to-Text (STT) converts spoken language into written text. It is used for transcription, voice commands, and dictation. Text-to-Speech (TTS) converts written text into spoken audio. It is used for reading aloud, voice assistants, and accessibility. STT takes audio in and outputs text; TTS takes text in and outputs audio. For the AI-900 exam, you must choose based on the direction of conversion needed.

Can Azure Speech Service translate speech in real-time?

Yes, Azure Speech Translation can translate speech from one language to another in real-time. It uses the same streaming pipeline as STT but adds a translation step. The output can be text or synthesized speech. It supports multiple target languages simultaneously. This is different from the Translator Text service, which only handles text.

Do I need to train a custom model for every application?

No. The prebuilt models work well for general-purpose speech recognition and synthesis. Custom models are only needed when you have domain-specific vocabulary (e.g., medical terms), unique accents, or require a custom voice. For most applications, the prebuilt models are sufficient. The exam tests when to use custom vs. prebuilt.

What audio formats does Azure Speech Service support?

The service supports WAV (PCM), MP3, OGG, and FLAC. The sample rate must be 16 kHz or 8 kHz. For best accuracy, use 16 kHz. The SDK can also read from the default microphone. For batch transcription, you can upload audio files in these formats.

How do I handle multiple speakers in a conversation?

Use the Conversation transcription mode, which enables speaker diarization. The service labels each utterance with a speaker ID (e.g., Speaker1, Speaker2). You must enable this feature when creating the recognizer. The exam may ask about this scenario.

What is the silence timeout for speech recognition?

The interactive recognition mode has a 10-second silence timeout. If no speech is detected for 10 seconds, the session ends and returns a NoMatch result. For continuous recognition, you start and stop the session yourself. You can adjust the timeout in the SDK by setting a property on the SpeechConfig (`set_property` in Python, `SetProperty` in C#), for example `PropertyId.SpeechServiceConnection_InitialSilenceTimeoutMs`.

Can I use Azure Speech Service offline?

No. The service is cloud-based and requires an internet connection. All audio processing happens in Azure. There is no offline mode. For edge scenarios, consider Azure IoT Edge or Azure Stack, but these are beyond AI-900 scope.
