AI-900 | Chapter 70 of 100 | Objective 4.5

Speech Translation

This chapter covers Azure Cognitive Services' Speech Translation capability, a key component of the Natural Language Processing (NLP) workload on the AI-900 exam. Speech Translation enables real-time, multilingual speech-to-speech and speech-to-text conversion, integrating speech recognition, machine translation, and text-to-speech into a single API. Approximately 5-10% of AI-900 questions touch on this topic, primarily testing your ability to identify the correct service for a given scenario and understand its core features like language support, customization, and real-time vs. batch processing.

Updated May 31, 2026

Speech Translation as Simultaneous Interpreter

Imagine a United Nations simultaneous interpreter sitting in a soundproof booth. The interpreter hears a speaker in French through headphones (speech recognition). As the French words arrive, the interpreter does not wait for the end of the sentence; they begin translating immediately into English (translation), speaking into a microphone. The interpreter's brain must handle two tasks: comprehending the incoming French stream and generating fluent English output. A delay of only a few seconds occurs—called the 'décalage'—which the interpreter manages by buffering a few words ahead. If the speaker uses an idiom like 'tomber dans les pommes' (to faint), the interpreter must quickly decide whether to translate literally or find an equivalent English idiom. Similarly, if the speaker talks about a technical term like 'intelligence artificielle', the interpreter must use the correct English term 'artificial intelligence'. The interpreter also adjusts tone and formality: a casual French 'salut' becomes 'hello' in English, not a literal 'hi' if the context is formal. The output is spoken through the interpreter's own voice, not the original speaker's voice (text-to-speech). This entire process happens in near real-time, with the interpreter constantly balancing accuracy, latency, and fluency—just like Azure's speech translation service.

How It Actually Works

What is Speech Translation and Why Does It Exist?

Azure Speech Translation is a cloud-based API that converts spoken audio from one language into text or synthesized speech in another language in real time. It is part of the Azure Cognitive Services Speech service and is designed for applications requiring live multilingual communication, such as conference interpreting, customer support for global audiences, and accessibility tools for language barriers.

Before cloud-based AI, building a real-time speech translation system required significant expertise in automatic speech recognition (ASR), machine translation (MT), and text-to-speech (TTS). Each component had to be trained separately, integrated with low latency, and deployed on expensive infrastructure. Azure Speech Translation abstracts this complexity, providing a single REST API or SDK that handles the entire pipeline.

How It Works Internally: The Pipeline

Speech Translation operates as a three-stage pipeline:

1. Speech Recognition (ASR): The incoming audio stream is processed by a deep neural network that converts speech into text in the source language. Azure uses a Universal Language Model (ULM) that supports over 100 languages and dialects. The ASR engine handles noise reduction and punctuation inference; speaker diarization is not part of the translation pipeline and belongs to the separate conversation transcription capability. It outputs a JSON structure with the recognized text, confidence scores, and timing information.

2. Machine Translation (MT): The recognized text is passed to Azure Translator, a neural machine translation (NMT) system. Unlike older statistical MT, NMT considers the entire sentence context to produce more fluent translations. The translation model is trained on billions of parallel sentences and supports custom translation models via Custom Translator. The output is a JSON string containing the translated text.

3. Text-to-Speech (TTS) (optional): If speech output is requested, the translated text is sent to Azure TTS, which uses neural voices to synthesize natural-sounding speech. Neural TTS uses deep neural networks to model prosody, intonation, and emphasis, producing voices that are nearly indistinguishable from human speech. You can choose from over 300 prebuilt voices or create a custom voice.
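
The three stages above can be pictured as a chain of functions in which the last stage is optional. The sketch below is only a mental model, not how the service is implemented: fake_asr, fake_mt, and fake_tts are hypothetical stand-ins so that the example runs without any cloud resource.

```python
from typing import Callable, Optional


def translate_speech(
    audio: bytes,
    recognize: Callable[[bytes], str],
    translate: Callable[[str], str],
    synthesize: Optional[Callable[[str], bytes]] = None,
) -> dict:
    """Run the stages in order; the TTS stage is skipped when not requested."""
    source_text = recognize(audio)            # stage 1: speech recognition (ASR)
    translated_text = translate(source_text)  # stage 2: machine translation (MT)
    result = {"source_text": source_text, "translated_text": translated_text}
    if synthesize is not None:                # stage 3: optional text-to-speech
        result["audio"] = synthesize(translated_text)
    return result


# Hypothetical stand-ins for the real engines.
def fake_asr(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_mt(text: str) -> str:
    return {"bonjour": "hello"}.get(text, text)


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


text_only = translate_speech(b"bonjour", fake_asr, fake_mt)
with_speech = translate_speech(b"bonjour", fake_asr, fake_mt, fake_tts)
print(text_only["translated_text"])
print("audio" in text_only, "audio" in with_speech)
```

Leaving out the synthesize argument corresponds to speech-to-text translation; supplying it corresponds to speech-to-speech translation.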

Key Components, Values, and Defaults

- Source Language: Can be specified explicitly or auto-detected. Auto-detection adds a small latency overhead (typically < 1 second).
- Target Language(s): Up to 10 target languages per request for speech-to-text translation. For speech-to-speech, only one target language is supported per request.
- Speech Recognition Modes:
  - Continuous recognition: Processes an ongoing audio stream, returning intermediate results.
  - Single utterance: Detects the end of a phrase (by silence) and returns the final result.
- Translation Output:
  - Text: Returns translated text in JSON.
  - Speech: Returns synthesized audio in a chosen format (e.g., Riff16Khz16BitMonoPcm, Audio16Khz32KBitRateMonoMp3).
- Latency: Typical end-to-end latency for speech-to-text translation is 2-5 seconds; for speech-to-speech, 3-7 seconds.
- Supported Languages: Over 100 languages for speech recognition, and over 60 for speech translation. See the full list at [Azure Cognitive Services Language Support](https://docs.microsoft.com/en-us/azure/cognitive-services/speech-service/language-support).
- Pricing: Billed per audio hour for speech recognition and per character for translation. TTS is billed per audio hour.
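
To make the target-language limits above easier to remember, here is a small sketch that checks a request against them before any call is made. TranslationRequest and its fields are hypothetical names invented for this illustration, and the two limits are simply the figures quoted in this chapter, not values read from the service.

```python
from dataclasses import dataclass
from typing import List, Optional

MAX_TEXT_TARGETS = 10    # text output limit as stated in this chapter
MAX_SPEECH_TARGETS = 1   # speech output limit as stated in this chapter


@dataclass
class TranslationRequest:
    target_languages: List[str]
    source_language: Optional[str] = None   # None stands for auto-detection
    voice_name: Optional[str] = None        # set only when speech output is wanted

    def problems(self) -> List[str]:
        """Return a list of rule violations; an empty list means the request looks fine."""
        issues = []
        count = len(self.target_languages)
        if count == 0:
            issues.append("add at least one target language")
        if count > MAX_TEXT_TARGETS:
            issues.append(f"{count} target languages exceeds the limit of {MAX_TEXT_TARGETS}")
        if self.voice_name is not None and count > MAX_SPEECH_TARGETS:
            issues.append("speech output supports only one target language per request")
        return issues


text_request = TranslationRequest(["fr", "de", "es"], source_language="en-US")
speech_request = TranslationRequest(
    ["fr", "de"], source_language="en-US", voice_name="fr-FR-DeniseNeural"
)
print(text_request.problems())
print(speech_request.problems())
```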

Configuration and Verification Commands

To use Speech Translation, you need an Azure subscription and a Speech service resource. Below is an example using the Speech SDK (Python) for speech-to-text translation:

import azure.cognitiveservices.speech as speechsdk

# Configure the service: credentials, source language, and target languages
translation_config = speechsdk.translation.SpeechTranslationConfig(
    subscription="YourSubscriptionKey", region="YourServiceRegion")
translation_config.speech_recognition_language = "en-US"
translation_config.add_target_language("fr")
translation_config.add_target_language("de")

# Capture audio from the default microphone
audio_config = speechsdk.audio.AudioConfig(use_default_microphone=True)

# The recognizer ties the translation settings to the audio source
recognizer = speechsdk.translation.TranslationRecognizer(
    translation_config=translation_config, audio_config=audio_config)

# Recognize a single utterance (returns once the phrase has ended)
print("Speak now...")
result = recognizer.recognize_once()

if result.reason == speechsdk.ResultReason.TranslatedSpeech:
    print("Recognized: {}".format(result.text))
    # result.translations maps each target language code to its translated text
    for language, text in result.translations.items():
        print("Translated [{}]: {}".format(language, text))
elif result.reason == speechsdk.ResultReason.NoMatch:
    print("No speech could be recognized")
elif result.reason == speechsdk.ResultReason.Canceled:
    details = result.cancellation_details
    print("Recognition canceled: {}".format(details.reason))
    if details.reason == speechsdk.CancellationReason.Error:
        print("Error details: {}".format(details.error_details))

For speech-to-speech translation, you add a voice name:

translation_config.voice_name = "fr-FR-DeniseNeural"

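With a voice name set, the recognizer raises a synthesizing event that carries the synthesized audio for each translated phrase. Alternatively, pass the translated text to a SpeechSynthesizer to play the output.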
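
As a sketch of how the speech output could be collected, the callback below stores the audio chunks delivered by the recognizer's synthesizing event. It assumes the event argument exposes the audio as evt.result.audio and that an empty chunk signals the end of synthesis, which is how the SDK samples behave to the best of my knowledge. The events here are simulated with SimpleNamespace so that the example runs without the SDK or a microphone.

```python
from types import SimpleNamespace


def make_audio_collector(chunks: list):
    """Return a callback that stores each non-empty synthesized audio chunk."""
    def on_synthesizing(evt):
        audio = evt.result.audio   # bytes of synthesized speech (assumed attribute)
        if audio:                  # an empty chunk is treated as "synthesis finished"
            chunks.append(audio)
    return on_synthesizing


# With the real SDK this would be wired up roughly as (assumption):
#   recognizer.synthesizing.connect(make_audio_collector(chunks))

# Simulated events standing in for what the service would deliver.
chunks = []
handler = make_audio_collector(chunks)
for payload in (b"RIFF-part-1", b"part-2", b""):
    handler(SimpleNamespace(result=SimpleNamespace(audio=payload)))

print(len(chunks), len(b"".join(chunks)))
```

The collected bytes can then be written to a file or handed to an audio player, depending on the output format chosen in the config.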

Interaction with Related Technologies

Azure Translator Text: The translation component is the same service used for text translation. Custom Translator models can be integrated to improve domain-specific terminology.

Custom Speech: You can train custom speech models to improve recognition of domain-specific vocabulary, accents, or noise conditions. These models are then used by Speech Translation.

Custom Voice: For speech output, you can create a custom neural voice using your own audio data, which can be used in Speech Translation.

Language Understanding (LUIS): While not directly integrated, LUIS can be chained after translation to extract intent from the translated text.

Bot Framework: Speech Translation can be integrated into bots to provide multilingual conversational experiences.

Real-Time vs. Batch Translation

Real-Time (Streaming): Uses the Speech SDK with a continuous recognition loop. Ideal for live conversations, virtual meetings, and interpreting scenarios. Latency is critical.

Batch Translation: Not natively supported for speech. For batch processing, you would use Azure Speech-to-Text for transcription, then Azure Translator Text for translation. This is suitable for recorded audio files (e.g., meeting recordings, call center logs).

Performance Considerations

Latency: To minimize latency, use dedicated Speech service instances in a region close to your users. Avoid auto-detection if the source language is known.

Accuracy: Use custom speech models for noisy environments or specialized vocabulary. For translation, use Custom Translator with domain-specific parallel data.

Concurrency: The service scales automatically, but you can request quota increases for high-volume scenarios.

Audio Formats: Use the recommended sample rate (16 kHz) and mono channel for optimal recognition. Higher sample rates may increase latency without improving accuracy.
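
Because the wrong audio format is a frequent cause of poor recognition, it can help to check files before sending them. The sketch below uses Python's standard wave module to compare a WAV file against the recommended 16 kHz, 16-bit, mono layout. The helper names check_wav_format and write_silence are made up for this example, and the silent test files exist only to demonstrate the check.

```python
import os
import tempfile
import wave

EXPECTED = {"sample rate (Hz)": 16000, "sample width (bytes)": 2, "channels": 1}


def check_wav_format(path: str) -> list:
    """Return a list of mismatches against 16 kHz, 16-bit, mono PCM."""
    with wave.open(path, "rb") as wav:
        found = {
            "sample rate (Hz)": wav.getframerate(),
            "sample width (bytes)": wav.getsampwidth(),
            "channels": wav.getnchannels(),
        }
    return [
        f"{name}: expected {EXPECTED[name]}, found {value}"
        for name, value in found.items()
        if value != EXPECTED[name]
    ]


def write_silence(path: str, rate: int, channels: int) -> None:
    """Write one second of 16-bit silence, used here only as test input."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * rate * channels)


with tempfile.TemporaryDirectory() as tmp:
    good_path = os.path.join(tmp, "good.wav")
    bad_path = os.path.join(tmp, "bad.wav")
    write_silence(good_path, 16000, 1)
    write_silence(bad_path, 44100, 2)
    good_report = check_wav_format(good_path)
    bad_report = check_wav_format(bad_path)

print(good_report)
print(bad_report)
```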

Walk-Through

1. Initialize Speech Translation Config

Create a SpeechTranslationConfig object with your subscription key and region. Specify the source language (or enable auto-detection) and add one or more target languages. For speech-to-speech, also set the voice name for the target language. This config is used by all subsequent recognizers.

2. Configure Audio Input

Create an AudioConfig object to specify the audio source. You can use the default microphone, a specific audio file, or an audio stream. For real-time scenarios, the microphone is common. For batch-like processing from a file in Python, pass the path with AudioConfig(filename=...). The audio should be in a supported format (e.g., 16 kHz, 16-bit, mono PCM).

3. Create TranslationRecognizer

Instantiate a TranslationRecognizer by passing the translation config and audio config. This object manages the recognition session. Optionally, you can attach event handlers for intermediate results, final results, and session events. The recognizer will start listening when you call recognize_once() or start_continuous_recognition().

4. Start Recognition

Call recognize_once() for a single utterance or start_continuous_recognition() for streaming. For single utterance, the recognizer listens until silence is detected (typically 500ms of silence). For continuous, it returns results as they become available. The service sends intermediate results as partial hypotheses, which update as more audio is processed.
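
For the continuous case, a minimal sketch of the event wiring might look like the following. The event names recognizing, recognized, session_stopped, and canceled, and the start/stop methods, follow the Speech SDK for Python as I understand it. FakeSignal and FakeRecognizer are hypothetical stand-ins that replay a scripted session, so the example runs without the SDK; with a real TranslationRecognizer you would pass that object instead.

```python
import threading
from types import SimpleNamespace


def run_continuous(recognizer, on_partial, on_final, timeout_seconds=30.0):
    """Stream until the session stops, is canceled, or the timeout expires."""
    done = threading.Event()

    # Partial hypotheses arrive while the speaker is still talking.
    recognizer.recognizing.connect(lambda evt: on_partial(evt.result.translations))
    # A final result arrives once the service decides a phrase has ended.
    recognizer.recognized.connect(lambda evt: on_final(evt.result.translations))
    recognizer.session_stopped.connect(lambda evt: done.set())
    recognizer.canceled.connect(lambda evt: done.set())

    recognizer.start_continuous_recognition()
    finished = done.wait(timeout_seconds)   # False if the timeout expired first
    recognizer.stop_continuous_recognition()
    return finished


class FakeSignal:
    """Hypothetical stand-in for an SDK event signal."""
    def __init__(self):
        self._callbacks = []

    def connect(self, callback):
        self._callbacks.append(callback)

    def fire(self, evt):
        for callback in self._callbacks:
            callback(evt)


class FakeRecognizer:
    """Hypothetical recognizer that replays a short scripted session."""
    def __init__(self):
        self.recognizing = FakeSignal()
        self.recognized = FakeSignal()
        self.session_stopped = FakeSignal()
        self.canceled = FakeSignal()
        self.stopped = False

    def start_continuous_recognition(self):
        def event(translations):
            return SimpleNamespace(result=SimpleNamespace(translations=translations))
        self.recognizing.fire(event({"fr": "bonjour"}))
        self.recognizing.fire(event({"fr": "bonjour tout"}))
        self.recognized.fire(event({"fr": "bonjour tout le monde"}))
        self.session_stopped.fire(SimpleNamespace())

    def stop_continuous_recognition(self):
        self.stopped = True


partials, finals = [], []
fake = FakeRecognizer()
finished = run_continuous(fake, partials.append, finals.append, timeout_seconds=1.0)
print(finished, len(partials), finals)
```

Waiting on a threading.Event rather than sleeping in a loop keeps the caller responsive to both a normal session end and a cancellation.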

5. Process Results

When a result is returned, check its reason property. If it is ResultReason.TranslatedSpeech, read the translations dictionary to get the translated text for each target language. For speech-to-speech, the synthesized audio is delivered through the recognizer's synthesizing event. Handle outcomes such as NoMatch (no speech was recognized) and Canceled (the service canceled the request, for example because of a timeout or an authentication failure).

6. Clean Up Resources

After recognition is complete, call stop_continuous_recognition() if using continuous mode, and dispose of the recognizer and audio config objects to free resources. In production, ensure proper error handling and reconnection logic for long-lived sessions.

What This Looks Like on the Job

Enterprise Scenario 1: Multilingual Customer Support

A global e-commerce company wants to provide real-time voice support in multiple languages without hiring native speakers for every language. They deploy a chatbot that uses Speech Translation to convert customer speech in Spanish to English for the support agent, and the agent's English response is translated back to Spanish speech. The system uses Azure Speech Translation with custom speech models trained on product names and common customer queries. In production, they handle 10,000 concurrent sessions using a load-balanced pool of Speech service instances in multiple regions. Key considerations: latency must be under 3 seconds to avoid awkward pauses; they use the 'single utterance' mode to avoid overlapping speech. Misconfiguration often occurs when the audio format is not 16 kHz mono, leading to poor recognition accuracy. They also use Custom Translator with a parallel corpus of support tickets to improve translation of domain-specific terms like 'refund policy' or 'shipping status.'

Enterprise Scenario 2: Live Conference Interpreting

A tech conference offers real-time translation of keynote speeches into 5 languages. The event uses Azure Speech Translation with a dedicated Speech resource. The speaker's audio is captured via a professional microphone and streamed to the service. Attendees receive translated audio on their mobile app via a custom client. The system uses continuous recognition with intermediate results to minimize latency. The biggest challenge is handling speaker accents and technical jargon; they use Custom Speech models trained on previous conference recordings. They also have a fallback: if confidence drops below 0.8, the translation is flagged for human review. The system processes 1 hour of audio per session with 99.9% uptime. Common misconfiguration: forgetting to set the voice name for speech output, resulting in default English voice for all languages.

Enterprise Scenario 3: Accessibility for Hearing-Impaired

A university provides real-time captioning for lectures in multiple languages. They use Speech Translation to transcribe the professor's speech into text and translate it into several languages for international students. The captions are displayed on a screen and also streamed to students' devices. The system uses the 'text' output mode, relies on the service's automatic punctuation and inverse text normalization to keep captions readable, and enables the profanity filter option. They handle up to 50 simultaneous lectures using a single Speech resource with auto-scaling. Key performance metric: word error rate (WER) must be below 10%. They use Custom Speech models trained on academic vocabulary. Misconfiguration often occurs when the source language is set incorrectly for lectures with mixed languages; they rely on auto-detection but accept a slight latency increase.
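
Word error rate, the metric in this scenario, is the number of word substitutions, deletions, and insertions needed to turn the recognized text into the reference transcript, divided by the number of words in the reference. The function below is a plain edit-distance sketch of that definition. It lowercases and splits on whitespace only, which is a simplifying assumption; real evaluations usually normalize punctuation as well. The sample sentences are made up.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must not be empty")

    # previous[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1       # reference word missing from hypothesis
            insertion = current[j - 1] + 1   # extra word in hypothesis
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


reference = "the mitochondria is the powerhouse of the cell"
hypothesis = "the mitochondria is a powerhouse of the cell"
wer = word_error_rate(reference, hypothesis)
print(wer, wer < 0.10)
```

One wrong word out of eight gives a WER of 0.125, which would miss the 10% target in this scenario.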

How AI-900 Actually Tests This

What AI-900 Tests on Speech Translation

The AI-900 exam objectives under NLP workload include: 'Describe features of speech translation' (Objective 4.5). Specifically, you must be able to:

Identify the correct Azure service for real-time speech translation scenarios (e.g., conference interpreting, live captions).

Distinguish between speech-to-text, text-to-speech, and speech translation.

Understand that speech translation can output either text or speech.

Know that Custom Speech and Custom Translator can be used to improve accuracy.

Recognize that speech translation supports multiple target languages in one request.

Common Wrong Answers and Why Candidates Choose Them

1. Choosing Translator Text instead of Speech Translation: Candidates see 'translation' and select Azure Translator Text, forgetting that the input is speech. The exam will describe a scenario with spoken audio and expect Speech Translation.

2. Selecting Speech-to-Text + Translator Text separately: While this is possible, the exam asks for the 'best' or 'simplest' service. Speech Translation is a single integrated service, so it is the correct answer.

3. Thinking Speech Translation only outputs text: The exam may present a scenario requiring spoken output (e.g., a multilingual voice assistant). Candidates must know that Speech Translation can output synthesized speech.

4. Confusing with Language Understanding (LUIS): LUIS extracts intent; it does not translate between languages. If the scenario requires translation, LUIS is wrong.

Specific Numbers and Terms on the Exam

The term 'real-time' or 'streaming' is often used to describe Speech Translation.

Supported languages: over 100 for speech recognition, over 60 for translation. Exact numbers may appear in a 'choose the correct statement' question.

The service is part of Azure Cognitive Services, specifically the Speech service.

Custom models: Custom Speech and Custom Translator are explicitly mentioned.

Edge Cases and Exceptions

Single vs. Multiple Target Languages: Speech Translation supports up to 10 target languages for text output, but only one for speech output. The exam may test this distinction.

Batch Translation: If the scenario involves pre-recorded audio files, the best approach is to use Batch Speech-to-Text transcription followed by Translator Text, not Speech Translation (which is real-time).

Language Auto-Detection: Available but adds latency. The exam may ask about trade-offs.
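
For reference, source-language auto-detection in the Speech SDK for Python is normally configured with an AutoDetectSourceLanguageConfig that lists candidate languages. The sketch below shows one way this might be wired into a TranslationRecognizer. Treat the auto_detect_source_language_config parameter and the need for any extra endpoint settings as assumptions to verify against your SDK version; the candidate list and the function name are hypothetical. The SDK import sits inside the function so the module still loads where the SDK is not installed.

```python
CANDIDATE_LANGUAGES = ["en-US", "fr-FR", "de-DE"]  # hypothetical candidate list


def build_auto_detect_recognizer(key: str, region: str, targets=("es",)):
    """Build a TranslationRecognizer that picks the source language from the candidates."""
    # Imported lazily so this module loads even without the SDK installed.
    import azure.cognitiveservices.speech as speechsdk

    translation_config = speechsdk.translation.SpeechTranslationConfig(
        subscription=key, region=region
    )
    for language in targets:
        translation_config.add_target_language(language)

    # The service chooses among these candidates instead of one fixed language.
    auto_detect = speechsdk.languageconfig.AutoDetectSourceLanguageConfig(
        languages=CANDIDATE_LANGUAGES
    )
    audio_config = speechsdk.audio.AudioConfig(use_default_microphone=True)

    return speechsdk.translation.TranslationRecognizer(
        translation_config=translation_config,
        audio_config=audio_config,
        auto_detect_source_language_config=auto_detect,
    )
```

Keeping the candidate list short is the usual way to limit the extra latency that detection adds.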

How to Eliminate Wrong Answers

1. Identify if the input is speech or text. If speech, eliminate Translator Text and LUIS.

2. Identify if output needs to be speech or text. If speech, eliminate Speech-to-Text only services.

3. If the scenario mentions 'real-time' or 'live', Speech Translation is likely correct.

4. If the scenario mentions 'custom models' for accuracy, look for options that include Custom Speech or Custom Translator.

5. If the scenario mentions 'multiple languages simultaneously', verify that the service supports multiple target languages.
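
The elimination steps above boil down to a few yes/no questions. As a study aid, here is a toy decision function that encodes the rules of thumb from this chapter. The function name, its parameters, and the returned labels are invented for this illustration, and it covers only the services discussed here.

```python
def pick_service(input_kind: str, needs_translation: bool, live: bool = True) -> str:
    """Map a scenario to the service this chapter recommends.

    input_kind is "speech" or "text"; live is False for pre-recorded audio.
    """
    if input_kind not in ("speech", "text"):
        raise ValueError("input_kind must be 'speech' or 'text'")

    if input_kind == "text":
        # Text input rules out every speech-input service.
        return "Translator Text" if needs_translation else "outside this chapter's scope"

    if not needs_translation:
        return "Speech-to-Text"
    if not live:
        # Pre-recorded audio: transcribe in batch, then translate the text.
        return "Batch Speech-to-Text + Translator Text"
    return "Speech Translation"


print(pick_service("speech", needs_translation=True))
print(pick_service("text", needs_translation=True))
print(pick_service("speech", needs_translation=True, live=False))
```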

Key Takeaways

Azure Speech Translation converts spoken audio into translated text or speech in real time.

It integrates speech recognition (ASR), machine translation (MT), and text-to-speech (TTS) into one service.

Supports over 100 languages for speech recognition and over 60 for translation.

Can output to multiple target languages (up to 10) for text, but only one for speech.

Custom Speech and Custom Translator can improve accuracy for domain-specific content.

For batch translation of pre-recorded audio, use Batch Speech-to-Text + Translator Text instead.

Latency is typically 2-5 seconds for speech-to-text, 3-7 seconds for speech-to-speech.

Easy to Mix Up

These come up on the exam all the time. Here's how to tell them apart.

Speech Translation

Single API call for speech recognition and translation.

Optimized for real-time/low-latency scenarios.

Supports direct speech output (speech-to-speech).

Automatically handles punctuation and inverse text normalization.

Integrated with Custom Speech and Custom Translator.

Speech-to-Text + Translator Text

Requires two separate API calls and manual orchestration.

Higher latency due to sequential processing.

Speech output requires a separate TTS call.

Each service billed separately.

More flexible for batch processing of pre-recorded audio.

Watch Out for These

Mistake

Speech Translation can only output text.

Correct

Azure Speech Translation can output either text (speech-to-text translation) or synthesized speech (speech-to-speech translation). You configure this by setting the voice name in the SpeechTranslationConfig.

Mistake

Speech Translation requires the source language to be specified manually.

Correct

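You can enable auto-detection of the source language by giving the recognizer an AutoDetectSourceLanguageConfig with a short list of candidate languages instead of a single fixed speech_recognition_language. Simply leaving speech_recognition_language unset does not turn on auto-detection. Auto-detection adds a small latency overhead.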

Mistake

You can translate speech into multiple languages with speech output for each.

Correct

For speech output, only one target language is supported per request. For text output, up to 10 target languages are supported.

Mistake

Speech Translation is the same as using Speech-to-Text plus Translator Text separately.

Correct

While functionally possible, Speech Translation is a single integrated service that handles the entire pipeline with lower latency and simpler API. The exam expects you to choose the integrated service for real-time scenarios.

Mistake

Custom Speech and Custom Translator cannot be used with Speech Translation.

Correct

Both Custom Speech and Custom Translator can be integrated with Speech Translation to improve recognition and translation accuracy for domain-specific vocabulary.


Frequently Asked Questions

What is the difference between Speech Translation and Translator Text?

Speech Translation accepts spoken audio as input and can output translated text or speech. Translator Text only accepts text input and returns text output. If your scenario involves live speech, use Speech Translation. If you already have text, use Translator Text.

Can Speech Translation handle multiple speakers at once?

Speech Translation does not natively support speaker diarization (identifying who said what). For that, you need to use the Speech-to-Text API with conversation transcription, which is a separate capability. Speech Translation treats all audio as a single stream.

How do I improve translation accuracy for technical terms?

Use Custom Translator to create a custom model trained on your domain-specific parallel data (e.g., a glossary of technical terms). Then, when configuring Speech Translation, reference that custom model endpoint. Similarly, use Custom Speech to improve recognition of technical terms in the source language.

Is Speech Translation available in all Azure regions?

Speech Translation is available in most Azure regions where Cognitive Services are offered. However, for optimal latency, choose a region close to your users. Some features like Custom Neural Voice may have limited regional availability. Check the Azure documentation for the latest list.

Can I use Speech Translation for offline scenarios?

No, Speech Translation requires an internet connection to access the cloud APIs. For offline use, consider using on-device models (e.g., with Azure Cognitive Services containers) but note that translation quality may be lower. The exam focuses on cloud-based services.

What audio formats are supported for input?

The Speech service supports several audio formats, but the recommended format for best accuracy is 16 kHz, 16-bit, mono PCM (WAV). Other formats like MP3 are supported but may require transcoding, which adds latency.

How does Speech Translation handle profanity?

You can enable a profanity filter option that masks or removes profane words in the recognition output. The filter can be set to 'masked', 'removed', or 'raw'. This is configured in the SpeechTranslationConfig.
