This chapter covers Azure Speech-to-Text and Custom Speech, two core capabilities of the Speech service in Microsoft Azure. On the AI-900 exam, NLP workloads account for roughly 15–20% of questions according to the published skills outline, and Speech-to-Text and Custom Speech are frequently tested subtopics. You must understand the difference between prebuilt speech recognition and custom models, the training process for Custom Speech, and the scenarios where each is appropriate. This chapter covers the specific values, configuration steps, and common pitfalls you need to answer those questions.
Imagine a courtroom where a stenographer (the default Speech-to-Text) types everything said verbatim using a standard shorthand system. This works well for most speakers with clear enunciation and standard vocabulary. However, if a witness uses heavy technical jargon (e.g., medical terms), has a thick regional accent, or speaks rapidly, the stenographer may miss words or type incorrect ones. Now consider a custom transcriptionist hired specifically for a trial about nuclear engineering. This person studies the field's terminology beforehand, learns the accent of the key witnesses, and adjusts their shorthand to capture domain-specific phrases like 'pressurized water reactor' accurately. The custom transcriptionist is analogous to Custom Speech: you train a base model with additional audio and text data relevant to your scenario, improving recognition for specialized vocabulary, accents, and noise conditions. The default model is your standard Azure Speech-to-Text, which works out-of-the-box but may fail where domain adaptation is needed.
What is Azure Speech-to-Text?
Azure Speech-to-Text (STT) is a cloud-based API that converts spoken audio into text. It is part of the Azure Speech service, which also includes Text-to-Speech, Speech Translation, and Speaker Recognition. The STT API uses deep neural networks trained on massive datasets of diverse speech to recognize words in real-time or batch. It supports:
Real-time transcription: Streaming audio with low latency (typically <500ms per utterance).
Batch transcription: Asynchronous processing of pre-recorded audio files.
Customization: Using Custom Speech to adapt the base model to specific domains, accents, or noise environments.
How Speech-to-Text Works Internally
The recognition process involves several stages:
Audio preprocessing: The audio signal is digitized (if analog) and converted to a standard format (e.g., 16 kHz, 16-bit, mono PCM). Azure STT works best with 16 kHz audio; it also accepts 8 kHz telephone audio, with reduced accuracy.
Feature extraction: The system extracts acoustic features (e.g., Mel-frequency cepstral coefficients – MFCCs) that represent the audio in a way that separates phonetic content from background noise.
Acoustic model: A deep neural network (typically a CNN or Transformer) maps the acoustic features to phonemes – the basic units of sound. This model is trained on thousands of hours of labeled speech.
Language model: A statistical model (often an n-gram or neural language model) predicts the most likely sequence of words given the phonemes. It uses probabilities derived from large text corpora to resolve ambiguities (e.g., 'write' vs. 'right').
Decoding: The system combines acoustic and language model scores using a beam search decoder to produce the final transcript with confidence scores per word.
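To make the interplay of the acoustic and language models concrete, here is a toy sketch in Python. The probabilities are invented for illustration, and `best_hypothesis` and `lm_weight` are hypothetical names, not part of any Azure API. It shows only the score-combination idea behind decoding, not a real beam search.

```python
import math

# Invented log-probabilities for one ambiguous utterance.
# The acoustic model slightly prefers "right"; the language model
# strongly prefers "write" in this context.
acoustic_logp = {
    "write a letter": math.log(0.48),
    "right a letter": math.log(0.52),
}
language_logp = {
    "write a letter": math.log(0.30),
    "right a letter": math.log(0.001),
}


def best_hypothesis(acoustic, language, lm_weight=1.0):
    """Return the hypothesis with the highest combined log score."""
    def combined(hypothesis):
        return acoustic[hypothesis] + lm_weight * language[hypothesis]

    return max(acoustic, key=combined)


# With the language model included, the sensible sentence wins.
print(best_hypothesis(acoustic_logp, language_logp))
# With the language model ignored, the acoustic preference wins.
print(best_hypothesis(acoustic_logp, language_logp, lm_weight=0.0))
```

This is also why related text helps Custom Speech: it shifts the language-model scores toward the word sequences that occur in your domain.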
Default Model vs. Custom Speech
Azure provides a prebuilt base model that works well for general-purpose transcription (e.g., meeting transcription, voice commands). However, it may struggle with:
Specialized vocabulary (medical, legal, technical).
Strong accents or dialects.
Noisy environments (factory floor, open office).
Low-bandwidth audio (telephone).
Custom Speech allows you to train a custom model by providing additional data:
Audio + human-labeled transcripts: Pairs of audio files and their exact word-for-word transcriptions. This improves the acoustic model for specific speakers or noise conditions.
Related text: Sentences or phrases relevant to your domain (e.g., medical reports). This improves the language model for domain-specific vocabulary.
Pronunciation data: For words that have unusual pronunciations (e.g., 'Xbox' as 'eks-box').
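The sketch below illustrates what the first and third data types might look like on disk. It assumes a tab-separated layout (file name, tab, transcript; written form, tab, spoken form), which is how these files are commonly described. Check the current Custom Speech dataset documentation before uploading. The file names and helper functions are hypothetical.

```python
def transcript_line(audio_file, text):
    """One transcript entry: audio file name, a tab, then the spoken words."""
    return f"{audio_file}\t{text}"


def pronunciation_line(written_form, spoken_form):
    """One pronunciation entry: written form, a tab, then how it is spoken."""
    return f"{written_form}\t{spoken_form}"


# Hypothetical examples for a medical project.
transcript_file = "\n".join([
    transcript_line("consult_001.wav", "the patient presented with a pneumothorax"),
    transcript_line("consult_002.wav", "schedule an echocardiogram for tuesday"),
])
pronunciation_file = pronunciation_line("Xbox", "x box")

print(transcript_file)
print(pronunciation_file)
```

Related text needs no special structure in this sketch: it is simply one domain sentence per line.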
Custom Speech Training Process
The training process follows these steps:
Create a Speech resource in Azure (Standard tier for customization; Free tier does not support custom models).
Upload data to a blob storage or directly to the Speech Studio.
Choose a base model: Select the most recent base model (e.g., '20230701') for best results.
Train the model: Azure combines your data with the base model to create a custom model. Training time depends on data size (typically minutes to hours).
Test the model: Evaluate accuracy using a test set of audio with known transcripts.
Deploy the custom model to an endpoint for production use.
Key Components and Defaults
Base model versions: Microsoft releases updated base models periodically. Always use the latest for best accuracy.
Data requirements:
Audio + transcripts: minimum 1 hour of audio, ideally 10+ hours for significant improvement.
Related text: minimum 100 sentences, ideally 1,000+.
Pronunciation: list of words with custom pronunciations using the IPA or a simplified notation.
Pricing: Custom Speech training is free; usage is charged per audio hour (Standard tier).
Regions: Custom Speech is available in most Azure regions (e.g., East US, West Europe).
Configuration and Verification Commands
Using the Azure CLI and the Speech-to-Text REST API, you can manage Custom Speech models. In the examples below, the first command uses the Azure CLI and the others call the REST API with curl:
# Create a Speech resource
az cognitiveservices account create --name mySpeech --resource-group myRG --kind SpeechServices --sku S0 --location eastus
# List base models (via REST API)
curl -X GET "https://<region>.api.cognitive.microsoft.com/speechtotext/v3.1/models/base" -H "Ocp-Apim-Subscription-Key: <key>"
# Start a training job (displayName and locale are required; the values shown are placeholders)
curl -X POST "https://<region>.api.cognitive.microsoft.com/speechtotext/v3.1/models" -H "Ocp-Apim-Subscription-Key: <key>" -H "Content-Type: application/json" -d '{
  "displayName": "my-custom-model",
  "locale": "en-US",
  "baseModel": {"self": "https://<region>.api.cognitive.microsoft.com/speechtotext/v3.1/models/base/<base-model-id>"},
  "datasets": [{"self": "https://<region>.api.cognitive.microsoft.com/speechtotext/v3.1/datasets/<dataset-id>"}],
  "properties": {"purpose": "LanguageModel"}
}'
Interacting with Related Technologies
Conversational Language Understanding (CLU, the successor to LUIS): STT output can be fed into CLU to extract intents and entities for conversational AI.
Azure Bot Service: Combine STT + CLU + Bot Framework for voice-enabled bots.
Speech Translation: Real-time translation of transcribed speech.
Azure AI Search (formerly Azure Cognitive Search): Index transcribed content for full-text search.
Real-time vs. Batch Transcription
Real-time: Uses the WebSocket protocol. Audio is streamed in chunks, and partial results are returned every 300–500ms. Latency is low (under 500ms).
Batch: Uses REST API. Submit a container (e.g., SAS URI) with audio files. Results are returned asynchronously. Suitable for large volumes (e.g., call center recordings).
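As a rough sketch of a batch submission, the Python below builds the JSON body for a transcription request without sending it. The field names (`contentUrls`, `locale`, `displayName`, `model`) follow the v3.1 batch transcription API as commonly documented, so verify them against the current reference. The function name and example URL are hypothetical.

```python
import json


def batch_transcription_body(content_urls, locale="en-US",
                             display_name="Call center batch",
                             custom_model_url=None):
    """Build the JSON body for a batch transcription request (not sent here)."""
    body = {
        "contentUrls": list(content_urls),  # one URL (e.g., SAS URI) per audio file
        "locale": locale,
        "displayName": display_name,
    }
    if custom_model_url:
        # Point the job at a Custom Speech model instead of the base model.
        body["model"] = {"self": custom_model_url}
    return body


body = batch_transcription_body(["https://example.com/audio/call-0001.wav"])
print(json.dumps(body, indent=2))
```

The body would be sent with a POST to the transcriptions endpoint, and the job is then polled until it completes. That is the asynchronous pattern that separates batch from real-time streaming.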
Supported Audio Formats
Raw PCM: 16 kHz, 16-bit, mono.
WAV: Must be PCM-encoded.
MP3: Supported but may reduce accuracy.
OGG/Opus: Supported for streaming.
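Before uploading audio, it can help to confirm that a WAV file really is 16 kHz, 16-bit, mono PCM. The following sketch uses only Python's standard `wave` module. `check_wav` and `make_silence` are hypothetical helper names, and the check covers only the three properties listed above.

```python
import io
import wave


def check_wav(source):
    """Return a list of problems; an empty list means 16 kHz, 16-bit, mono PCM."""
    problems = []
    with wave.open(source, "rb") as wav:
        if wav.getframerate() != 16000:
            problems.append(f"sample rate is {wav.getframerate()} Hz, expected 16000")
        if wav.getsampwidth() != 2:
            problems.append(f"sample width is {wav.getsampwidth() * 8} bits, expected 16")
        if wav.getnchannels() != 1:
            problems.append(f"{wav.getnchannels()} channels, expected mono")
    return problems


def make_silence(rate, channels=1, seconds=0.1):
    """Create an in-memory 16-bit PCM WAV of silence, for demonstration only."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * int(rate * seconds) * channels)
    buffer.seek(0)
    return buffer


print(check_wav(make_silence(16000)))  # no problems expected
print(check_wav(make_silence(8000)))   # telephone-rate audio is flagged
```

`check_wav` also accepts a file path, so the same function works on files on disk.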
Error Handling and Diagnostics
Common errors:
401 Unauthorized: Invalid or expired subscription key.
400 Bad Request: Audio format not supported, or the audio is too long for the endpoint (the REST API for short audio accepts at most 60 seconds per request; streaming through the Speech SDK does not have this cap).
429 Too Many Requests: Rate limit exceeded; retry with exponential backoff.
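Exponential backoff for 429 responses can be sketched as follows. This is a generic retry pattern, not code from the Speech SDK. `call_with_backoff` and its parameters are hypothetical, and `send` stands in for whatever function performs the HTTP request and returns a `(status, body)` pair.

```python
import random
import time


def call_with_backoff(send, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call send() until it returns a status other than 429, doubling the wait each time."""
    for attempt in range(max_attempts):
        status, body = send()
        if status != 429:
            return status, body
        if attempt < max_attempts - 1:
            # 1s, 2s, 4s, ... plus a little jitter so clients do not retry in lockstep.
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.1 * base_delay)
            sleep(delay)
    return status, body


# Simulated service: throttles twice, then succeeds. Waits are recorded, not slept.
responses = iter([(429, ""), (429, ""), (200, "transcript text")])
waits = []
result = call_with_backoff(lambda: next(responses), sleep=waits.append)
print(result, len(waits))
```

Passing `sleep` as a parameter keeps the example fast to run. In production, the default `time.sleep` would apply the real delays.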
Evaluation Metrics
Word Error Rate (WER): The standard metric. Lower is better. Formula: (Substitutions + Insertions + Deletions) / Reference Word Count.
Custom Speech typically reduces WER by 20–50% compared to the base model for domain-specific scenarios.
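The WER formula can be worked through in a few lines of Python. This is a minimal sketch using word-level edit distance. It lowercases and splits on whitespace, which is simpler than the text normalization a production evaluation would apply, and the example sentences are invented.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + insertions + deletions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Word-level edit distance, keeping only the previous row of the table.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


reference = "the patient has a pneumothorax"
hypothesis = "the patient has a new motor ax"
# 1 substitution + 2 insertions over 5 reference words
print(word_error_rate(reference, hypothesis))  # 0.6
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference. Note also that reductions are usually quoted in relative terms: a drop from 25% to 8% WER is a relative reduction of (25 − 8) / 25 = 68%.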
Security and Compliance
Data at rest is encrypted using AES-256.
Data in transit uses TLS 1.2/1.3.
Custom models are stored in your Speech resource and can be deleted.
For sensitive audio, use private endpoints and managed identities.
Create Azure Speech Resource
Go to the Azure portal and create a Speech service resource. Choose the 'Speech' kind (not 'Speech Services (Custom)') and the Standard (S0) pricing tier. The Free tier (F0) does not support custom models. Note the region and subscription key – you will need them for API calls. This resource is the container for all your STT and Custom Speech operations.
Upload Training Data
In Speech Studio, navigate to 'Custom Speech' and create a new project. Upload audio files with matching transcriptions (text files). For best results, use at least 1 hour of audio per speaker or environment. Also upload related text (e.g., domain-specific documents) to improve the language model. Optionally, upload pronunciation data for unusual words. Data must be in a supported format (16 kHz, 16-bit, mono WAV or raw PCM).
Select Base Model
Choose a base model from the list provided by Azure. The base model defines the acoustic and language model that your custom model will adapt. Always pick the latest version (e.g., '20230701') for the best out-of-the-box accuracy. Older base models may have been optimized for specific scenarios but are generally less accurate.
Train Custom Model
Click 'Train' and select the datasets you uploaded. Azure will combine your data with the base model. Training time depends on data size – typically 10-60 minutes for a few hours of audio. You can monitor progress in Speech Studio. Once complete, you will see a WER against a built-in test set. If WER is not satisfactory, add more data and retrain.
Test and Deploy Model
Before production, test your custom model with a separate set of audio files (not used in training). Compare its transcription against the base model. If accuracy meets requirements, deploy the model to an endpoint. In Speech Studio, click 'Deploy' and select a region. The endpoint URL will be something like `https://<region>.stt.speech.microsoft.com/speech/recognition/conversation/cognitiveservices/v1?cid=<endpoint-id>`, where the `cid` value is the endpoint ID shown for the deployment; without it, requests go to the base model. Use this endpoint (or pass the endpoint ID to the Speech SDK) in your application for inference.
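The sketch below builds, but does not send, a short-audio REST request against a deployed model using only Python's standard library. It assumes the custom endpoint is selected with a `cid` query parameter carrying the endpoint ID and that the audio is 16 kHz PCM WAV. The function name, region, key, and endpoint ID are placeholders.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_recognition_request(region, key, audio_bytes,
                              language="en-US", endpoint_id=None):
    """Build a POST request for short-audio recognition (nothing is sent here)."""
    query = {"language": language}
    if endpoint_id:
        # Assumption: 'cid' routes the request to the deployed custom model.
        query["cid"] = endpoint_id
    url = (f"https://{region}.stt.speech.microsoft.com"
           "/speech/recognition/conversation/cognitiveservices/v1?"
           + urlencode(query))
    headers = {
        "Ocp-Apim-Subscription-Key": key,
        "Content-Type": "audio/wav; codecs=audio/pcm; samplerate=16000",
        "Accept": "application/json",
    }
    return Request(url, data=audio_bytes, headers=headers, method="POST")


request = build_recognition_request("eastus", "<key>", b"", endpoint_id="my-endpoint-id")
print(request.get_method(), request.full_url)
```

Passing the returned object to `urllib.request.urlopen` would perform the call. For streaming or longer audio, the Speech SDK is the usual route.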
Enterprise Scenario 1: Medical Transcription
A hospital wants to transcribe doctor-patient conversations in real time to update electronic health records (EHRs). The default STT model often fails on medical terminology like 'pneumothorax' or 'echocardiogram'. The hospital uses Custom Speech by uploading 20 hours of de-identified audio from previous consultations with transcripts. They also upload a text corpus of medical textbooks and a pronunciation file for drug names. After training, the WER drops from 25% to 8%. The custom model is deployed to an endpoint integrated with the EHR system. Common misconfiguration: using audio from a different department (e.g., radiology vs. cardiology) can hurt accuracy, so they create separate models per specialty.
Scenario 2: Call Center Analytics
A large insurance company wants to analyze customer calls to detect sentiment and compliance issues. They use batch transcription to process thousands of hours of recorded calls. The default model misinterprets industry terms like 'deductible' and 'co-pay' and struggles with various regional accents. They create a Custom Speech model using 50 hours of call audio with transcripts and a text corpus of insurance policy documents. The model is trained once and used for all batch jobs. A key consideration: audio quality varies (cell phones, speakerphones); they use noise adaptation data to improve robustness. When misconfigured (e.g., using only clean studio audio), the model fails on real noisy calls.
Scenario 3: Voice-Enabled Industrial Equipment
A manufacturing company adds voice commands to a robotic arm for hands-free operation in a noisy factory. The default STT model has high WER due to background noise (90 dB machinery) and specialized commands like 'grip 45 degrees'. They record 10 hours of commands in the actual factory environment with the same microphone that will be used in production. They also upload a text file of all possible commands. The custom model achieves 95% accuracy. A common mistake: training on clean audio and expecting it to work in noise – the model must be trained on representative noise conditions.
Exactly What AI-900 Tests
Objective 4.4: 'Describe capabilities of Speech-to-Text and Custom Speech.' The exam focuses on:
Differentiating between prebuilt and custom models: When to use each.
Understanding the training process: What data is needed (audio+transcripts, related text, pronunciation).
Knowing the benefits: Custom Speech improves accuracy for domain-specific vocabulary, accents, and noise.
Recognizing limitations: Custom Speech does not add new languages; it adapts existing base models.
Common Wrong Answers and Why Candidates Choose Them
'Text data alone is enough to adapt Custom Speech to accents and noise.' – Wrong. A custom model can be trained with related text only, but that adapts just the language model (vocabulary and phrasing). Adapting the acoustic model to specific speakers, accents, or noise conditions requires audio+transcripts. Candidates often confuse the two kinds of adaptation.
'Custom Speech supports all languages.' – Wrong. Custom Speech only supports languages that have a base model. Currently, it supports ~20 languages (e.g., English, Spanish, Mandarin). Unsupported languages cannot be customized.
'You need to train a model from scratch.' – Wrong. Custom Speech always starts from a base model; you cannot train from zero. The base model provides the foundational acoustic and language knowledge.
'Custom Speech is available in the Free tier.' – Wrong. The Free tier (F0) does not allow custom model training or deployment. You need the Standard (S0) tier.
Specific Numbers and Values That Appear on the Exam
Minimum audio for acoustic adaptation: 1 hour (though 10+ hours recommended).
Minimum text for language adaptation: 100 sentences.
Supported audio sample rate: 16 kHz (8 kHz for telephone).
Pricing: Custom training is free; usage charged per audio hour.
Base model versions: e.g., 20230701 – always use latest.
Edge Cases and Exceptions
Multiple speakers: Custom Speech can improve accuracy for a specific speaker if trained on that speaker's voice. For general multi-speaker scenarios, use the base model.
Real-time vs. batch: Custom Speech works with both, but batch allows longer audio (up to 10 hours per file).
Pronunciation data: Use only for words with non-standard pronunciation; overuse can hurt accuracy.
How to Eliminate Wrong Answers
If the question asks about improving accuracy for medical terms, the answer is Custom Speech with related text and audio+transcripts.
If the question mentions 'no additional training' or 'prebuilt', it refers to the default STT.
If the answer includes 'train from scratch' or 'no base model', it is wrong.
If the answer says 'Free tier', it is wrong for customization.
Azure Speech-to-Text converts spoken audio to text using deep neural networks; Custom Speech adapts the base model for domain-specific scenarios.
Custom Speech requires a base model; you cannot train from scratch.
Minimum data for acoustic adaptation: 1 hour of audio with human-labeled transcripts.
Minimum data for language adaptation: 100 sentences of related text.
Custom Speech is available only in the Standard (S0) pricing tier.
Training Custom Speech is free; you only pay for transcription usage.
The recommended audio sample rate is 16 kHz for optimal accuracy.
Custom Speech supports both real-time and batch transcription.
Word Error Rate (WER) is the key metric; lower is better.
Always use the latest base model version for best results.
These come up on the exam all the time. Here's how to tell them apart.
Prebuilt Speech-to-Text
No additional training required; works out-of-the-box.
Suitable for general-purpose transcription (meetings, dictation).
Lower accuracy for domain-specific vocabulary and accents.
Available in Free and Standard tiers.
Cannot be tailored to specific noise environments.
Custom Speech
Requires training with audio+transcripts and/or text data.
Ideal for specialized domains (medical, legal, industrial).
Significantly improves accuracy (WER reduction 20-50%).
Requires Standard (S0) tier; Free tier not supported.
Can adapt to specific speakers, accents, and noise conditions.
Mistake
Custom Speech can recognize any language.
Correct
Custom Speech only supports languages for which a base model exists. Microsoft provides base models for ~20 languages. Unsupported languages cannot be customized.
Mistake
You need to provide at least 10 hours of audio to train a custom model.
Correct
The minimum is 1 hour of audio with transcripts. However, more data (10+ hours) yields better accuracy. The exam often tests the minimum threshold of 1 hour.
Mistake
Custom Speech trains a completely new model from scratch.
Correct
Custom Speech always adapts an existing base model. You cannot train a model without a base model. The base model provides the foundational acoustic and language knowledge.
Mistake
The Free tier (F0) supports Custom Speech.
Correct
The Free tier does not allow custom model training or deployment. You must use the Standard (S0) tier for customization.
Mistake
Custom Speech only works with batch transcription.
Correct
Custom Speech works with both real-time and batch transcription. The same custom model can be used for streaming and asynchronous processing.
Speech-to-Text is the prebuilt API that works out-of-the-box for general transcription. Custom Speech allows you to train a custom model using your own audio and text data to improve accuracy for specific domains, accents, or noise conditions. The exam expects you to know that Custom Speech is an extension of the base STT model.
For acoustic model adaptation, you need at least 1 hour of audio with matching transcripts. For language model adaptation, at least 100 sentences of related text. More data (10+ hours, 1000+ sentences) yields better accuracy. The exam tests the minimum values: 1 hour and 100 sentences.
No. The Free tier (F0) does not support custom model training or deployment. You must create a Speech resource with the Standard (S0) pricing tier. This is a common exam trap.
No. Custom Speech only supports languages for which Microsoft provides a base model. Currently, about 20 languages are supported, including English, Spanish, French, German, Mandarin, and others. Unsupported languages cannot be customized.
WER is a metric that measures the accuracy of a transcription system. It is calculated as (Substitutions + Insertions + Deletions) / Reference Word Count. Lower WER means higher accuracy. Custom Speech typically reduces WER by 20-50% compared to the base model for domain-specific scenarios.
Yes. Custom Speech models can be deployed to endpoints that support both real-time (streaming) and batch (asynchronous) transcription. The same custom model works for both modes.
Supported formats include raw PCM (16 kHz, 16-bit, mono), WAV (PCM-encoded), MP3, and OGG/Opus. For best accuracy, use 16 kHz, 16-bit, mono PCM. Telephone-quality audio (8 kHz) is supported but yields lower accuracy.