AI-900 · Chapter 68 of 100 · Objective 4.4

Speech to Text and Custom Speech

Speech-to-Text and Custom Speech are two core capabilities of the Azure Speech service, and both fall within the AI-900 exam's NLP domain. Approximately 10–15% of exam questions touch on NLP workloads, and Speech-to-Text and Custom Speech are frequently tested subtopics. You must understand the difference between prebuilt speech recognition and custom models, the training process for Custom Speech, and the scenarios where each is appropriate. This chapter provides the depth needed to answer exam questions confidently, including specific values, configuration steps, and common pitfalls.

25 min read

Intermediate

Updated Jul 20, 2026

Reviewed by Johnson Ajibi· Senior Network & Security Engineer · MSc IT Security


Court Stenographer vs. Custom Transcriptionist

A court stenographer (the default Speech-to-Text) records every spoken word verbatim using a standard shorthand system. This works well for most speakers with clear enunciation and standard vocabulary. However, if a witness uses heavy technical jargon (e.g., medical terms), has a thick regional accent, or speaks rapidly, the stenographer may miss words or type incorrect ones. Now consider a custom transcriptionist hired specifically for a trial about nuclear engineering. This person studies the field's terminology beforehand, learns the accent of the key witnesses, and adjusts their shorthand to capture domain-specific phrases like 'pressurized water reactor' accurately. The custom transcriptionist is analogous to Custom Speech: you train a base model with additional audio and text data relevant to your scenario, improving recognition for specialized vocabulary, accents, and noise conditions. The default model is your standard Azure Speech-to-Text, which works out-of-the-box but may fail where domain adaptation is needed.

How It Actually Works

What is Azure Speech-to-Text?

Azure Speech-to-Text (STT) is a cloud-based API that converts spoken audio into text. It is part of the Azure Speech service, which also includes Text-to-Speech, Speech Translation, and Speaker Recognition. The STT API uses deep neural networks trained on massive datasets of diverse speech to recognize words in real-time or batch. It supports:

Real-time transcription: Streaming audio with low latency (typically <500ms per utterance).

Batch transcription: Asynchronous processing of pre-recorded audio files.

Customization: Using Custom Speech to adapt the base model to specific domains, accents, or noise environments.

How Speech-to-Text Works Internally

The recognition process involves several stages:

Audio preprocessing: The audio signal is digitized (if analog) and converted to a standard format (e.g., 16 kHz, 16-bit, mono PCM). Azure STT performs best with 16 kHz audio; it can also handle 8 kHz telephone audio, with reduced accuracy.

Feature extraction: The system extracts acoustic features (e.g., Mel-frequency cepstral coefficients – MFCCs) that represent the audio in a way that separates phonetic content from background noise.

Acoustic model: A deep neural network (typically a CNN or Transformer) maps the acoustic features to phonemes – the basic units of sound. This model is trained on thousands of hours of labeled speech.

Language model: A statistical model (often an n-gram or neural language model) predicts the most likely sequence of words given the phonemes. It uses probabilities derived from large text corpora to resolve ambiguities (e.g., 'write' vs. 'right').

Decoding: The system combines acoustic and language model scores using a beam search decoder to produce the final transcript with confidence scores per word.

Default Model vs. Custom Speech

Azure provides a prebuilt base model that works well for general-purpose transcription (e.g., meeting transcription, voice commands). However, it may struggle with:

Specialized vocabulary (medical, legal, technical).

Strong accents or dialects.

Noisy environments (factory floor, open office).

Low-bandwidth audio (telephone).

Custom Speech allows you to train a custom model by providing additional data:

Audio + human-labeled transcripts: Pairs of audio files and their exact word-for-word transcriptions. This improves the acoustic model for specific speakers or noise conditions.

Related text: Sentences or phrases relevant to your domain (e.g., medical reports). This improves the language model for domain-specific vocabulary.

Pronunciation data: For words that have unusual pronunciations (e.g., 'Xbox' as 'eks-box').

Custom Speech Training Process

The training process follows these steps:

Create a Speech resource in Azure (Standard tier for customization; Free tier does not support custom models).

Upload data to a blob storage or directly to the Speech Studio.

Choose a base model: Select the most recent base model (e.g., '20230701') for best results.

Train the model: Azure combines your data with the base model to create a custom model. Training time depends on data size (typically minutes to hours).

Test the model: Evaluate accuracy using a test set of audio with known transcripts.

Deploy the custom model to an endpoint for production use.

Key Components and Defaults

Base model versions: Microsoft releases updated base models periodically. Always use the latest for best accuracy.

Data requirements:

Audio + transcripts: minimum 1 hour of audio, ideally 10+ hours for significant improvement.

Related text: minimum 100 sentences, ideally 1,000+.

Pronunciation: list of words with custom pronunciations using the IPA or a simplified notation.

Pricing: Custom Speech training is free; usage is charged per audio hour (Standard tier).

Regions: Custom Speech is available in most Azure regions (e.g., East US, West Europe).
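The related-text minimum listed above can be checked before uploading. The following is a minimal sketch, assuming one sentence per line in a plain-text file; the function name `check_related_text` and the default threshold of 100 (the figure quoted in this chapter) are illustrative, not part of any Azure tool.

```shell
# Count non-empty lines (one sentence per line is assumed) in a related-text
# file and compare the count against a minimum (default: 100).
check_related_text() (
  file="$1"; minimum="${2:-100}"
  # grep -c exits non-zero when nothing matches, so guard it with "|| true".
  count="$(grep -c '[^[:space:]]' "$file" || true)"
  if [ "$count" -ge "$minimum" ]; then
    echo "ok: $count sentences"
  else
    echo "too few: $count of $minimum sentences"
    return 1
  fi
)
```

For example, `check_related_text medical_reports.txt` reports whether the file (a hypothetical name) reaches the minimum, and a second argument raises the bar toward the 1,000+ sentences suggested for better results.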

Configuration and Verification Commands

Using the Azure CLI or REST API, you can manage Custom Speech models. Example CLI commands:

# Create a Speech resource (Standard S0 tier)
az cognitiveservices account create --name mySpeech --resource-group myRG --kind SpeechServices --sku S0 --location eastus

# List base models (via REST API; the regional host is <region>.api.cognitive.microsoft.com)
curl -X GET "https://<region>.api.cognitive.microsoft.com/speechtotext/v3.1/models/base" -H "Ocp-Apim-Subscription-Key: <key>"

# Start a training job (displayName and locale are required; the values here are placeholders)
curl -X POST "https://<region>.api.cognitive.microsoft.com/speechtotext/v3.1/models" -H "Ocp-Apim-Subscription-Key: <key>" -H "Content-Type: application/json" -d '{
  "displayName": "my-custom-model",
  "locale": "en-US",
  "baseModel": {"self": "https://<region>.api.cognitive.microsoft.com/speechtotext/v3.1/models/base/<base-model-id>"},
  "datasets": [{"self": "https://<region>.api.cognitive.microsoft.com/speechtotext/v3.1/datasets/<dataset-id>"}]
}'
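Because training can take minutes to hours, a script usually polls the model until it reaches a terminal state. The sketch below assumes that `curl` and `jq` are installed, that the model resource exposes a `status` field with values such as `Running`, `Succeeded`, and `Failed`, and that `SPEECH_REGION` and `SPEECH_KEY` are environment variables you set yourself. The helper names are hypothetical.

```shell
# Hypothetical helper: print the "status" field of a custom model.
# Assumes SPEECH_REGION and SPEECH_KEY are set in the environment.
get_model_status() {
  curl -s "https://${SPEECH_REGION}.api.cognitive.microsoft.com/speechtotext/v3.1/models/$1" \
    -H "Ocp-Apim-Subscription-Key: ${SPEECH_KEY}" | jq -r '.status'
}

# Poll until training succeeds, fails, or the attempts run out.
# Usage: wait_for_model <model-id> [max_attempts] [delay_seconds]
wait_for_model() (
  model_id="$1"; max_attempts="${2:-60}"; delay="${3:-30}"
  attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    status="$(get_model_status "$model_id")"
    case "$status" in
      Succeeded) echo "Succeeded"; return 0 ;;
      Failed)    echo "Failed";    return 1 ;;
    esac
    sleep "$delay"
    attempt=$((attempt + 1))
  done
  echo "TimedOut"
  return 2
)
```

Calling `wait_for_model <model-id>` with the defaults checks every 30 seconds for up to 30 minutes; the exit code distinguishes success (0), failure (1), and timeout (2).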

Interacting with Related Technologies

Language Understanding (LUIS, since succeeded by Conversational Language Understanding): STT output can be fed into a language understanding model to extract intents and entities for conversational AI.

Azure Bot Service: Combine STT + language understanding + Bot Framework for voice-enabled bots.

Speech Translation: Real-time translation of transcribed speech.

Azure Cognitive Search (now Azure AI Search): Index transcribed content for full-text search.

Real-time vs. Batch Transcription

Real-time: Uses the WebSocket protocol. Audio is streamed in chunks, and partial results are returned every 300–500ms. Latency is low (under 500ms).

Batch: Uses the REST API. You submit audio file URLs or a storage container URL (for example, a blob container SAS URI), and results are returned asynchronously. Suitable for large volumes (e.g., call center recordings).
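A batch job is created by posting a small JSON document that points at the audio. The following is a sketch under stated assumptions: the `displayName`, `locale`, and `contentUrls` fields follow the v3.1 transcriptions API as I understand it, `SPEECH_REGION` and `SPEECH_KEY` are environment variables you set yourself, and the function names and example URL are hypothetical.

```shell
# Build the JSON body for a batch transcription request.
# Arguments: display name, locale, one audio URL (e.g. a blob SAS URL).
# Note: values are inserted as-is, so they must not contain double quotes.
batch_payload() {
  printf '{"displayName": "%s", "locale": "%s", "contentUrls": ["%s"]}\n' "$1" "$2" "$3"
}

# Submit the job; results are fetched later because processing is asynchronous.
submit_batch() {
  curl -s -X POST "https://${SPEECH_REGION}.api.cognitive.microsoft.com/speechtotext/v3.1/transcriptions" \
    -H "Ocp-Apim-Subscription-Key: ${SPEECH_KEY}" \
    -H "Content-Type: application/json" \
    -d "$(batch_payload "$@")"
}

# Print the payload without sending anything.
batch_payload "call-center-batch" "en-US" "https://example.blob.core.windows.net/audio/call1.wav"
```

Separating payload construction from submission makes it easy to inspect the request before any audio is processed.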

Supported Audio Formats

Raw PCM: 16 kHz, 16-bit, mono.

WAV: Must be PCM-encoded.

MP3: Supported but may reduce accuracy.

OGG/Opus: Supported for streaming.
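Audio in other formats can be converted to the preferred 16 kHz, 16-bit, mono PCM WAV before upload. This sketch assumes `ffmpeg` is installed; the wrapper name `to_speech_wav` and its dry-run mode are illustrative additions, not part of the Speech service.

```shell
# Convert any ffmpeg-readable file to 16 kHz, 16-bit, mono PCM WAV.
# Pass "dry-run" as the third argument to print the command instead of running it.
to_speech_wav() (
  input="$1"; output="$2"; mode="${3:-run}"
  if [ "$mode" = "dry-run" ]; then
    printf 'ffmpeg -y -i %s -ar 16000 -ac 1 -c:a pcm_s16le %s\n' "$input" "$output"
  else
    # -ar: sample rate, -ac: channel count, pcm_s16le: 16-bit little-endian PCM
    ffmpeg -y -i "$input" -ar 16000 -ac 1 -c:a pcm_s16le "$output"
  fi
)

to_speech_wav call.mp3 call.wav dry-run
```

Converting from a lossy source such as MP3 does not restore lost detail; it only puts the file in the expected container and sample format.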

Error Handling and Diagnostics

Common errors:

401 Unauthorized: Invalid or expired subscription key.

400 Bad Request: Audio format not supported, or the audio is too long (the REST API for short audio accepts at most 60 seconds per request).

429 Too Many Requests: Rate limit exceeded; implement exponential backoff.
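Exponential backoff for rate-limit errors can be sketched as a small retry wrapper. This is an illustrative helper, not an Azure tool: the function name is hypothetical, and it retries on any non-zero exit code, so in practice you would restrict retries to transient errors such as 429.

```shell
# Retry a command, doubling the wait after each failure.
# Usage: retry_with_backoff <max_attempts> <base_delay_seconds> <command...>
retry_with_backoff() (
  max_attempts="$1"; delay="$2"; shift 2
  attempt=1
  while true; do
    "$@" && return 0
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after $attempt attempts" >&2
      return 1
    fi
    sleep "$delay"
    delay=$((delay * 2))
    attempt=$((attempt + 1))
  done
)

# Example (not executed here): curl --fail makes HTTP errors return non-zero.
# retry_with_backoff 5 1 curl -sS --fail "<request-url>" -H "Ocp-Apim-Subscription-Key: <key>"
```

With a base delay of 1 second and 5 attempts, the waits between attempts are 1, 2, 4, and 8 seconds.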

Evaluation Metrics

Word Error Rate (WER): The standard metric. Lower is better. Formula: (Substitutions + Insertions + Deletions) / Reference Word Count.

Custom Speech typically reduces WER by 20–50% compared to the base model for domain-specific scenarios.
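The WER formula can be made concrete with a word-level edit distance, which finds the smallest number of substitutions, insertions, and deletions that turn the reference into the recognized text. The sketch below uses `awk` and is only an illustration of the metric; it does no text normalization (casing, punctuation), which real evaluation tools typically apply first. The function name `wer` is hypothetical.

```shell
# Print the word error rate (as a percentage) between a reference transcript
# and a recognized (hypothesis) transcript.
wer() {
  awk -v ref="$1" -v hyp="$2" 'BEGIN {
    n = split(ref, r, " ")
    m = split(hyp, h, " ")
    if (n == 0) { print "reference is empty"; exit 1 }
    for (i = 0; i <= n; i++) d[i, 0] = i   # delete every reference word
    for (j = 0; j <= m; j++) d[0, j] = j   # insert every hypothesis word
    for (i = 1; i <= n; i++) {
      for (j = 1; j <= m; j++) {
        cost = (r[i] == h[j]) ? 0 : 1
        best = d[i-1, j-1] + cost                        # match or substitution
        if (d[i-1, j] + 1 < best) best = d[i-1, j] + 1   # deletion
        if (d[i, j-1] + 1 < best) best = d[i, j-1] + 1   # insertion
        d[i, j] = best
      }
    }
    printf "%.2f\n", 100 * d[n, m] / n
  }'
}

# Two substitutions against a five-word reference: prints 40.00
wer "the patient has a pneumothorax" "the patient has new thorax"
```

Because insertions count as errors, WER can exceed 100% when the recognized text is much longer than the reference.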

Security and Compliance

Data at rest is encrypted using AES-256.

Data in transit uses TLS 1.2/1.3.

Custom models are stored in your Speech resource and can be deleted.

For sensitive audio, use private endpoints and managed identities.

Walk-Through

Create Azure Speech Resource

Go to the Azure portal and create a Speech service resource. Choose the 'Speech' kind (not 'Speech Services (Custom)') and the Standard (S0) pricing tier. The Free tier (F0) does not support custom models. Note the region and subscription key – you will need them for API calls. This resource is the container for all your STT and Custom Speech operations.

Upload Training Data

In Speech Studio, navigate to 'Custom Speech' and create a new project. Upload audio files with matching transcriptions (text files). For best results, use at least 1 hour of audio per speaker or environment. Also upload related text (e.g., domain-specific documents) to improve the language model. Optionally, upload pronunciation data for unusual words. Data must be in a supported format (16 kHz, 16-bit, mono WAV or raw PCM).

Select Base Model

Choose a base model from the list provided by Azure. The base model defines the acoustic and language model that your custom model will adapt. Always pick the latest version (e.g., '20230701') for the best out-of-the-box accuracy. Older base models may have been optimized for specific scenarios but are generally less accurate.

Train Custom Model

Click 'Train' and select the datasets you uploaded. Azure will combine your data with the base model. Training time depends on data size – typically 10-60 minutes for a few hours of audio. You can monitor progress in Speech Studio. Once training completes, run a test against a dataset of audio with human-labeled transcripts to obtain the model's WER. If the WER is not satisfactory, add more data and retrain.

Test and Deploy Model

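Before production, test your custom model with a separate set of audio files (not used in training). Compare its transcription against the base model. If accuracy meets requirements, deploy the model to an endpoint. In Speech Studio, click 'Deploy' and select a region. The deployment receives an endpoint ID that identifies your custom model. The recognition URL will be something like `https://<region>.stt.speech.microsoft.com/speech/recognition/conversation/cognitiveservices/v1?language=en-US`; requests that should use the custom model also supply the endpoint ID (for example, as the `cid` query parameter of the REST API for short audio, or as the endpoint ID setting in the Speech SDK). Use this endpoint in your application for inference.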
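A request against the deployed model can be sketched with `curl`. The example assumes the REST API for short audio, where a custom deployment is selected by passing its endpoint ID as the `cid` query parameter. `SPEECH_KEY` is an environment variable you set yourself, and the function names are hypothetical.

```shell
# Build the recognition URL. Pass an endpoint ID as the third argument to
# target a deployed custom model; omit it to use the base model.
stt_url() (
  region="$1"; language="$2"; endpoint_id="${3:-}"
  url="https://${region}.stt.speech.microsoft.com/speech/recognition/conversation/cognitiveservices/v1?language=${language}"
  if [ -n "$endpoint_id" ]; then
    url="${url}&cid=${endpoint_id}"
  fi
  echo "$url"
)

# Send one WAV file (16 kHz, 16-bit, mono PCM) for recognition.
# Usage: recognize_wav <region> <language> <wav-file> [endpoint-id]
recognize_wav() {
  curl -s -X POST "$(stt_url "$1" "$2" "${4:-}")" \
    -H "Ocp-Apim-Subscription-Key: ${SPEECH_KEY}" \
    -H "Content-Type: audio/wav; codecs=audio/pcm; samplerate=16000" \
    --data-binary "@$3"
}

# Print the base-model URL without sending a request.
stt_url eastus en-US
```

Running the same audio through `recognize_wav` with and without the endpoint ID is a quick way to compare the custom model against the base model.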

What This Looks Like on the Job

Enterprise Scenario 1: Medical Transcription

A hospital wants to transcribe doctor-patient conversations in real time to update electronic health records (EHRs). The default STT model often fails on medical terminology like 'pneumothorax' or 'echocardiogram'. The hospital uses Custom Speech by uploading 20 hours of de-identified audio from previous consultations with transcripts. They also upload a text corpus of medical textbooks and a pronunciation file for drug names. After training, the WER drops from 25% to 8%. The custom model is deployed to an endpoint integrated with the EHR system. Common misconfiguration: using audio from a different department (e.g., radiology vs. cardiology) can hurt accuracy, so they create separate models per specialty.

Scenario 2: Call Center Analytics

A large insurance company wants to analyze customer calls to detect sentiment and compliance issues. They use batch transcription to process thousands of hours of recorded calls. The default model misinterprets industry terms like 'deductible' and 'co-pay' and struggles with various regional accents. They create a Custom Speech model using 50 hours of call audio with transcripts and a text corpus of insurance policy documents. The model is trained once and used for all batch jobs. A key consideration: audio quality varies (cell phones, speakerphones); they use noise adaptation data to improve robustness. When misconfigured (e.g., using only clean studio audio), the model fails on real noisy calls.

Scenario 3: Voice-Enabled Industrial Equipment

A manufacturing company adds voice commands to a robotic arm for hands-free operation in a noisy factory. The default STT model has high WER due to background noise (90 dB machinery) and specialized commands like 'grip 45 degrees'. They record 10 hours of commands in the actual factory environment with the same microphone that will be used in production. They also upload a text file of all possible commands. The custom model achieves 95% accuracy. A common mistake: training on clean audio and expecting it to work in noise – the model must be trained on representative noise conditions.

How AI-900 Actually Tests This

Exactly What AI-900 Tests

Objective 4.4: 'Describe capabilities of Speech-to-Text and Custom Speech.' The exam focuses on:

Differentiating between prebuilt and custom models: When to use each.

Understanding the training process: What data is needed (audio+transcripts, related text, pronunciation).

Knowing the benefits: Custom Speech improves accuracy for domain-specific vocabulary, accents, and noise.

Recognizing limitations: Custom Speech does not add new languages; it adapts existing base models.

Common Wrong Answers and Why Candidates Choose Them

'Custom Speech can be trained with only text data.' – Misleading. Text-only training adapts just the language model, which helps with domain vocabulary; audio+transcripts are required to adapt the acoustic model for accents and noise. Candidates often confuse Custom Speech with custom language models in other services.

'Custom Speech supports all languages.' – Wrong. Custom Speech only supports languages that have a base model. Currently, it supports ~20 languages (e.g., English, Spanish, Mandarin). Unsupported languages cannot be customized.

'You need to train a model from scratch.' – Wrong. Custom Speech always starts from a base model; you cannot train from zero. The base model provides the foundational acoustic and language knowledge.

'Custom Speech is available in the Free tier.' – Wrong. The Free tier (F0) does not allow custom model training or deployment. You need the Standard (S0) tier.

Specific Numbers and Values That Appear on the Exam

Minimum audio for acoustic adaptation: 1 hour (though 10+ hours recommended).

Minimum text for language adaptation: 100 sentences.

Supported audio sample rate: 16 kHz (8 kHz for telephone).

Pricing: Custom training is free; usage charged per audio hour.

Base model versions: e.g., 20230701 – always use latest.

Edge Cases and Exceptions

Multiple speakers: Custom Speech can improve accuracy for a specific speaker if trained on that speaker's voice. For general multi-speaker scenarios, use the base model.

Real-time vs. batch: Custom Speech works with both, but batch allows longer audio (up to 10 hours per file).

Pronunciation data: Use only for words with non-standard pronunciation; overuse can hurt accuracy.

How to Eliminate Wrong Answers

If the question asks about improving accuracy for medical terms, the answer is Custom Speech with related text and audio+transcripts.

If the question mentions 'no additional training' or 'prebuilt', it refers to the default STT.

If the answer includes 'train from scratch' or 'no base model', it is wrong.

If the answer says 'Free tier', it is wrong for customization.

Key Takeaways

Azure Speech-to-Text converts spoken audio to text using deep neural networks; Custom Speech adapts the base model for domain-specific scenarios.

Custom Speech requires a base model; you cannot train from scratch.

Minimum data for acoustic adaptation: 1 hour of audio with human-labeled transcripts.

Minimum data for language adaptation: 100 sentences of related text.

Custom Speech is available only in the Standard (S0) pricing tier.

Training Custom Speech is free; you only pay for transcription usage.

The recommended audio sample rate is 16 kHz for optimal accuracy.

Custom Speech supports both real-time and batch transcription.

Word Error Rate (WER) is the key metric; lower is better.

Always use the latest base model version for best results.

Easy to Mix Up

These come up on the exam all the time. Here's how to tell them apart.

Prebuilt Speech-to-Text

No additional training required; works out-of-the-box.

Suitable for general-purpose transcription (meetings, dictation).

Lower accuracy for domain-specific vocabulary and accents.

Available in Free and Standard tiers.

Cannot be tailored to specific noise environments.

Custom Speech

Requires training with audio+transcripts and/or text data.

Ideal for specialized domains (medical, legal, industrial).

Significantly improves accuracy (WER reduction 20-50%).

Requires Standard (S0) tier; Free tier not supported.

Can adapt to specific speakers, accents, and noise conditions.

Watch Out for These

Mistake

Custom Speech can recognize any language.

Correct

Custom Speech only supports languages for which a base model exists. Microsoft provides base models for ~20 languages. Unsupported languages cannot be customized.

Mistake

You need to provide at least 10 hours of audio to train a custom model.

Correct

The minimum is 1 hour of audio with transcripts. However, more data (10+ hours) yields better accuracy. The exam often tests the minimum threshold of 1 hour.

Mistake

Custom Speech trains a completely new model from scratch.

Correct

Custom Speech always adapts an existing base model. You cannot train a model without a base model. The base model provides the foundational acoustic and language knowledge.

Mistake

The Free tier (F0) supports Custom Speech.

Correct

The Free tier does not allow custom model training or deployment. You must use the Standard (S0) tier for customization.

Mistake

Custom Speech only works with batch transcription.

Correct

Custom Speech works with both real-time and batch transcription. The same custom model can be used for streaming and asynchronous processing.


Frequently Asked Questions

What is the difference between Speech-to-Text and Custom Speech?

Speech-to-Text is the prebuilt API that works out-of-the-box for general transcription. Custom Speech allows you to train a custom model using your own audio and text data to improve accuracy for specific domains, accents, or noise conditions. The exam expects you to know that Custom Speech is an extension of the base STT model.

How much data do I need to train a Custom Speech model?

For acoustic model adaptation, you need at least 1 hour of audio with matching transcripts. For language model adaptation, at least 100 sentences of related text. More data (10+ hours, 1000+ sentences) yields better accuracy. The exam tests the minimum values: 1 hour and 100 sentences.

Can I use Custom Speech with the Free tier?

No. The Free tier (F0) does not support custom model training or deployment. You must create a Speech resource with the Standard (S0) pricing tier. This is a common exam trap.

Does Custom Speech support all languages?

No. Custom Speech only supports languages for which Microsoft provides a base model. Currently, about 20 languages are supported, including English, Spanish, French, German, Mandarin, and others. Unsupported languages cannot be customized.

What is Word Error Rate (WER) and how is it used?

WER is a metric that measures the accuracy of a transcription system. It is calculated as (Substitutions + Insertions + Deletions) / Reference Word Count. Lower WER means higher accuracy. Custom Speech typically reduces WER by 20-50% compared to the base model for domain-specific scenarios.

Can I use Custom Speech for real-time transcription?

Yes. Custom Speech models can be deployed to endpoints that support both real-time (streaming) and batch (asynchronous) transcription. The same custom model works for both modes.

What audio formats are supported by Speech-to-Text?

Supported formats include raw PCM (16 kHz, 16-bit, mono), WAV (PCM-encoded), MP3, and OGG/Opus. For best accuracy, use 16 kHz, 16-bit, mono PCM. Telephone-quality audio (8 kHz) is supported but yields lower accuracy.
