GCDL | Chapter 94 of 101 | Objective 3.2

Google Translation and Speech APIs

This chapter covers Google Cloud's Translation API and Speech-to-Text/Text-to-Speech APIs, which are core components of the AI and Machine Learning services. For the GCDL exam, understanding these APIs is crucial because they enable multilingual and voice-enabled applications without requiring deep ML expertise. Approximately 8-12% of exam questions touch on these services, typically in scenarios involving customer support, content localization, or accessibility. You will be expected to know the use cases, key features, and how to integrate them with other Google Cloud services.

25 min read
Intermediate
Updated May 31, 2026

The Multilingual Call Center Analogy

Imagine a global call center that receives calls in dozens of languages. When a call comes in, the system first identifies the language (e.g., Mandarin) and then routes it to a translator who converts the speech to English text. That text is then processed by a virtual assistant that understands only English. The assistant's response is sent back to the same translator, who converts the English text back into Mandarin speech and plays it to the caller. This is exactly how Google Cloud's Speech-to-Text and Translation APIs work together: Speech-to-Text converts audio to text, Translation API translates the text, and Text-to-Speech converts the translated text back to audio. The key is that each step is a separate, modular API call, allowing you to mix and match languages and even insert custom logic between steps. In production, you might use a Cloud Run service to orchestrate these calls, handling errors like unsupported languages or long audio files by splitting them into chunks. The analogy also highlights latency: just as a human translator takes time to listen, think, and speak, each API call adds latency, so you must plan for it, especially in real-time applications.

How It Actually Works

What Are Translation and Speech APIs?

Google Cloud offers two primary translation services: the Cloud Translation API and AutoML Translation. Cloud Translation comes in Basic (v2) and Advanced (v3) editions. The Basic edition uses Google's pre-trained Neural Machine Translation (NMT) models to translate text between over 100 languages. The Advanced edition adds features like custom glossaries, batch translation, and document translation (e.g., PDF, Word). AutoML Translation allows you to train custom translation models on your own parallel corpora for domain-specific terminology.

For speech, Google Cloud provides Cloud Speech-to-Text (STT) and Cloud Text-to-Speech (TTS). STT converts audio to text using deep learning models, supporting over 125 languages and variants. It offers features like automatic punctuation, word-level confidence scores, and speaker diarization. TTS converts text to natural-sounding speech using WaveNet or standard voices, supporting over 220 voices across 40+ languages.

How They Work Internally

Translation API (v3):
- Accepts a request with input text, source language (optional), target language, and optional glossary.
- The service tokenizes the input, applies the NMT model (which uses transformer architecture), and generates translated text.
- For batch translation, you submit a job that reads from Cloud Storage and writes results back to Cloud Storage.
- Key parameters: mimeType (text/plain or text/html), model (the default general/nmt or the resource name of a custom model), glossaryConfig.
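The request described above can be sketched as a plain JSON payload. This is a minimal illustration, assuming the field names of the public v3 `translateText` REST method; the helper name `build_translate_request` and the project, location, and glossary IDs are hypothetical, and nothing here calls the service.

```python
import json

def build_translate_request(text, target, source=None, glossary=None,
                            mime_type="text/plain"):
    """Return the JSON body for a v3 translateText call as a dict."""
    body = {
        "contents": [text],
        "mimeType": mime_type,
        "targetLanguageCode": target,
    }
    if source:  # omit to let the service detect the source language
        body["sourceLanguageCode"] = source
    if glossary:  # full glossary resource name
        body["glossaryConfig"] = {"glossary": glossary}
    return body

# Hypothetical project, location, and glossary IDs, for illustration only.
parent = "projects/my-project/locations/us-central1"
body = build_translate_request(
    "Hello world", target="es", source="en",
    glossary=f"{parent}/glossaries/my-glossary",
)
print(json.dumps(body, indent=2))
```

The body would be POSTed to the `translateText` method under the same `parent` path. Leaving out `source` relies on language detection, which is convenient but adds one more thing that can go wrong on short inputs.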

Cloud Speech-to-Text:
- Accepts audio content (inline or via Cloud Storage URI) with a RecognitionConfig that specifies encoding (e.g., LINEAR16, FLAC, MULAW), sample rate (e.g., 8000 Hz for telephony, 16000 Hz for wideband), language code, and optional features like enableAutomaticPunctuation or enableSpeakerDiarization.
- The audio is processed by an acoustic model that maps audio features to phonemes, then a language model that converts phonemes to words.
- Returns a RecognizeResponse with a list of SpeechRecognitionResult objects, each containing alternatives (transcripts with confidence scores).
- For long audio (over 1 minute), you must use asynchronous recognition via LongRunningRecognize.
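To make the request shape concrete, here is a small sketch that builds the JSON body for a synchronous recognition call. It assumes the v1 REST field names (`config`, `audio`, `sampleRateHertz`, and so on); the helper name and the bucket path are hypothetical, and the code does not contact the API.

```python
import base64

def build_recognize_request(language_code, encoding, sample_rate_hz,
                            gcs_uri=None, audio_bytes=None, punctuation=True):
    """Return the JSON body for a Speech-to-Text v1 recognize call."""
    if (gcs_uri is None) == (audio_bytes is None):
        raise ValueError("provide exactly one of gcs_uri or audio_bytes")
    config = {
        "encoding": encoding,
        "sampleRateHertz": sample_rate_hz,
        "languageCode": language_code,
        "enableAutomaticPunctuation": punctuation,
    }
    if gcs_uri is not None:
        audio = {"uri": gcs_uri}
    else:
        # Inline audio travels as base64 text inside the JSON body.
        audio = {"content": base64.b64encode(audio_bytes).decode("ascii")}
    return {"config": config, "audio": audio}

# Telephony audio: MULAW at 8000 Hz (hypothetical bucket and object name).
telephony = build_recognize_request(
    "en-US", "MULAW", 8000, gcs_uri="gs://my-bucket/call.wav")
print(telephony["config"])
```

Rejecting requests that supply both or neither audio source up front avoids a round trip that the service would reject anyway.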

Cloud Text-to-Speech:
- Accepts a SynthesisInput (text or SSML) and a VoiceSelectionParams (language code, name, gender).
- Uses a neural network (WaveNet or standard) to generate audio waveforms.
- Returns SynthesizeSpeechResponse with audio content encoded in the requested format (e.g., MP3, OGG, LINEAR16).
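The synthesis request has three parts: input, voice, and audio settings. The sketch below assembles them as a dict, assuming the v1 `text:synthesize` REST field names; `build_synthesize_request` is a hypothetical helper and no request is sent.

```python
def build_synthesize_request(text, language_code, voice_name=None,
                             audio_encoding="MP3", is_ssml=False):
    """Return the JSON body for a Text-to-Speech v1 text:synthesize call."""
    body = {
        # The input carries either plain text or SSML, never both.
        "input": {"ssml" if is_ssml else "text": text},
        "voice": {"languageCode": language_code},
        "audioConfig": {"audioEncoding": audio_encoding},
    }
    if voice_name:  # e.g. a WaveNet voice; omit to let the service choose
        body["voice"]["name"] = voice_name
    return body

request = build_synthesize_request(
    "Hola mundo", "es-ES", audio_encoding="LINEAR16")
print(request)
```

Passing an explicit `voice_name` is how you would opt into a specific WaveNet or standard voice rather than whatever the service picks for the language.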

Key Components and Defaults

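Translation API v3: Default model is general/nmt. Free tier: the first 500,000 characters per month are free according to the pricing page at the time of writing, then pay-as-you-go. Batch translation caps the number and total size of input files per job (the API reference lists 100 matched files and 100 million Unicode code points at the time of writing), so check current limits before planning large jobs.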

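Speech-to-Text: Recommended sample rate: 16000 Hz. Encoding should be specified explicitly (e.g., LINEAR16); it can be omitted only for formats such as FLAC and WAV, whose headers already carry it. For telephony audio, use MULAW at 8000 Hz. Maximum audio duration for synchronous recognition: 60 seconds. Asynchronous: up to 480 minutes.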
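The duration limits above translate directly into a routing decision. A minimal sketch, using only the 60-second and 480-minute figures from this chapter; the function name and the `"split"` return value are hypothetical conventions, not API terms.

```python
SYNC_LIMIT_S = 60            # synchronous recognition limit
ASYNC_LIMIT_S = 480 * 60     # asynchronous recognition limit

def choose_recognition_method(duration_s):
    """Pick a Speech-to-Text method name from the audio duration in seconds."""
    if duration_s <= 0:
        raise ValueError("duration must be positive")
    if duration_s <= SYNC_LIMIT_S:
        return "recognize"
    if duration_s <= ASYNC_LIMIT_S:
        return "longrunningrecognize"
    return "split"  # too long even for async: cut the audio into parts first

for seconds in (45, 7200, 30000):
    print(seconds, choose_recognition_method(seconds))
```

A two-hour recording (7200 seconds) lands on the asynchronous path, which matches the exam framing.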

Text-to-Speech: Default voice is standard (non-WaveNet). WaveNet voices are higher quality but cost more. SSML supports tags like <speak>, <break>, <prosody>.
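As an illustration of the SSML tags mentioned above, the following sketch wraps sentences in `<speak>` and `<prosody>` and separates them with `<break>` pauses. The helper name and its default values are hypothetical; the point is that user text must be XML-escaped before it goes into SSML.

```python
from xml.sax.saxutils import escape, quoteattr

def to_ssml(sentences, pause_ms=300, rate="medium"):
    """Join sentences into one SSML document with a pause between them."""
    parts = [escape(sentence) for sentence in sentences]  # escape &, <, >
    pause = f'<break time="{int(pause_ms)}ms"/>'
    body = pause.join(parts)
    return f"<speak><prosody rate={quoteattr(rate)}>{body}</prosody></speak>"

ssml = to_ssml(["Hello & welcome.", "How can I help?"])
print(ssml)
```

The resulting string would be sent as the `ssml` input instead of `text`.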

Configuration and Verification Commands

Translation API:

# Translate text using v3 (beta command group; confirm flags with --help for your SDK version)
gcloud beta ml translate translate-text --content="Hello world" --target-language=es --source-language=en --project=my-project

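# Batch translation (needs a regional location; each --source entry maps a Cloud Storage file to its MIME type; confirm flags with --help)
gcloud beta ml translate batch-translate-text --source=gs://my-bucket/input.txt=text/plain --destination=gs://my-bucket/output/ --source-language=en --target-language-codes=es --zone=us-central1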

Speech-to-Text:

# Synchronous recognition (audio up to 60 seconds)
gcloud ml speech recognize gs://my-bucket/audio.flac --language-code=en-US --encoding=flac --sample-rate=16000

# Asynchronous recognition (longer audio; returns a long-running operation)
gcloud ml speech recognize-long-running gs://my-bucket/audio.flac --language-code=en-US --encoding=flac --sample-rate=16000

Text-to-Speech:

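# Synthesize speech via the REST endpoint; the JSON response carries base64 audio in "audioContent" (requires jq)
curl -s -X POST -H "Authorization: Bearer $(gcloud auth print-access-token)" -H "x-goog-user-project: my-project" -H "Content-Type: application/json" -d '{"input":{"text":"Hello world"},"voice":{"languageCode":"en-US","name":"en-US-Wavenet-D"},"audioConfig":{"audioEncoding":"MP3"}}' https://texttospeech.googleapis.com/v1/text:synthesize | jq -r .audioContent | base64 --decode > output.mp3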

Interaction with Related Technologies

These APIs are often used together in a pipeline: audio -> STT -> Translation -> TTS. This is common in multilingual customer support chatbots. They also integrate with:
- Cloud Storage: For storing audio and translation input/output files.
- Cloud Functions / Cloud Run: For serverless orchestration.
- Dialogflow CX: For building conversational agents that use translation for multilingual support.
- BigQuery: For analyzing translation usage and quality.
- Vertex AI: For custom model training (AutoML Translation).
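The audio -> STT -> Translation -> TTS pipeline can be sketched as three interchangeable stages with per-stage timing. This is an illustrative skeleton only: `run_pipeline` and the stand-in stages are hypothetical, and in a real service each stage would wrap the corresponding API call.

```python
import time

def run_pipeline(audio, stt, translate, tts):
    """Run audio -> STT -> Translation -> TTS and time each stage."""
    timings = {}
    value = audio
    for name, stage in (("stt", stt), ("translate", translate), ("tts", tts)):
        start = time.perf_counter()
        value = stage(value)  # output of one stage feeds the next
        timings[name] = time.perf_counter() - start
    return value, timings

# Stand-in stages so the sketch runs without credentials.
def fake_stt(audio):
    return "hello"

def fake_translate(text):
    return {"hello": "hola"}.get(text, text)

def fake_tts(text):
    return text.encode("utf-8")

audio_out, timings = run_pipeline(b"\x00\x01", fake_stt, fake_translate, fake_tts)
print(audio_out, sorted(timings))
```

Keeping the stages as injected callables makes it easy to insert custom logic between steps and to see which call dominates end-to-end latency.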

Exam-Relevant Details

The GCDL exam expects you to know the difference between Translation API basic (v2) and advanced (v3). Advanced supports glossaries and batch translation.

For Speech-to-Text, remember that synchronous recognition is limited to 60 seconds of audio; longer audio requires asynchronous.

Text-to-Speech offers WaveNet and standard voices; WaveNet is more natural but more expensive.

AutoML Translation is for domain-specific custom models, not general translation.

All APIs are accessed via REST or gRPC, and the caller must hold an appropriate IAM role (e.g., roles/cloudtranslate.user for Translation, roles/speech.client for Speech-to-Text).

Walk-Through

1. Configure Speech-to-Text Request

First, you define the audio source and recognition parameters. You must specify the encoding (e.g., LINEAR16, FLAC, MULAW) and sample rate (e.g., 16000 Hz). If the audio is from a phone call, use MULAW at 8000 Hz. You also set the language code (e.g., 'en-US'). Optional features include `enableAutomaticPunctuation`, `enableSpeakerDiarization` (to separate speakers), and `maxAlternatives` (number of transcript hypotheses). The audio can be provided inline (base64-encoded) or as a Cloud Storage URI. For audio longer than 60 seconds, you must use the `LongRunningRecognize` method.

2. Call Speech-to-Text API

You send the request to the Speech-to-Text API endpoint. For synchronous recognition, the API processes the audio and returns a response within seconds. The response includes a list of `SpeechRecognitionResult` objects, each containing `alternatives` with the transcript text and a confidence score between 0 and 1. Each result can also carry a `languageCode`, which matters when you supplied `alternativeLanguageCodes` and let the service pick the best-matching language. For asynchronous recognition, you get an operation name that you can poll to check status. Once complete, you retrieve the results from the operation.
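Polling an asynchronous operation follows the usual long-running-operation pattern: check `done`, then read either `error` or `response`. The sketch below assumes that standard shape; `wait_for_operation`, the injected `get_operation` callable, and the operation name are hypothetical, and the example feeds it canned states instead of real API calls.

```python
import time

def wait_for_operation(get_operation, name, interval_s=5.0, timeout_s=3600.0,
                       sleep=time.sleep):
    """Poll until the operation reports done, then return its response."""
    waited = 0.0
    while True:
        op = get_operation(name)
        if op.get("done"):
            if "error" in op:
                raise RuntimeError(op["error"].get("message", "operation failed"))
            return op.get("response", {})
        if waited >= timeout_s:
            raise TimeoutError(f"{name} not finished after {timeout_s} s")
        sleep(interval_s)
        waited += interval_s

# Stand-in for a real operations lookup: pending twice, then done.
_states = iter([
    {"name": "123", "done": False},
    {"name": "123", "done": False},
    {"name": "123", "done": True,
     "response": {"results": [{"alternatives": [{"transcript": "hi"}]}]}},
])
result = wait_for_operation(lambda name: next(_states), "123",
                            sleep=lambda seconds: None)
print(result["results"][0]["alternatives"][0]["transcript"])
```

Injecting `sleep` keeps the example instant to run; production code would keep the default and probably add backoff.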

3. Process Speech-to-Text Output

You extract the best transcript (highest confidence) from the response. If speaker diarization was enabled, you also get speaker tags for each word. You may need to clean the text (e.g., remove punctuation if not needed). This transcript is then passed to the Translation API. If the source language is unknown, you can use the `languageCode` from the STT response or use Translation API's detection feature.
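Extracting the best transcript might look like the following. It assumes the response has already been parsed into a dict with the `results` / `alternatives` layout described above; the sample response is made up for illustration.

```python
def best_transcript(response):
    """Join the highest-confidence alternative of each result into one string."""
    pieces = []
    for result in response.get("results", []):
        alternatives = result.get("alternatives", [])
        if not alternatives:
            continue
        # Missing confidence counts as 0.0; ties keep the first alternative.
        top = max(alternatives, key=lambda alt: alt.get("confidence", 0.0))
        pieces.append(top["transcript"].strip())
    return " ".join(pieces)

sample = {"results": [
    {"alternatives": [
        {"transcript": "hello there", "confidence": 0.92},
        {"transcript": "hello their", "confidence": 0.41},
    ]},
    {"alternatives": [{"transcript": " how are you", "confidence": 0.88}]},
]}
print(best_transcript(sample))
```

Long audio usually comes back as several results, one per segment, so joining them is needed before handing the text to translation.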

4. Translate Text with Translation API

You call the Translation API v3 with the source text, source language (optional), target language, and optional glossary. The API returns the translated text. If glossaries are used, they override the model's default translations for specified terms. For batch translation, you submit a job that reads from Cloud Storage and writes output to Cloud Storage. The API supports text/plain and text/html mime types. The response includes the translated text and the detected source language if not provided.
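Reading the response is mostly a matter of picking the right list. The sketch below assumes the v3 response fields `translations`, `glossaryTranslations`, `translatedText`, and `detectedLanguageCode`; the helper name and sample payload are hypothetical.

```python
def read_translation(response):
    """Return (translated_text, detected_source_language_or_None)."""
    plain = response.get("translations", [])
    with_glossary = response.get("glossaryTranslations", [])
    chosen = with_glossary or plain  # prefer glossary output when present
    if not chosen:
        raise ValueError("response contains no translations")
    detected = plain[0].get("detectedLanguageCode") if plain else None
    return chosen[0]["translatedText"], detected

sample = {
    "translations": [
        {"translatedText": "Bienvenido a Nube de Google",
         "detectedLanguageCode": "en"},
    ],
    "glossaryTranslations": [
        {"translatedText": "Bienvenido a Google Cloud"},
    ],
}
print(read_translation(sample))
```

When a glossary is applied, the glossary-aware text is the one to pass on to Text-to-Speech, which is why it takes priority here.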

5. Synthesize Translated Text to Speech

You pass the translated text to the Text-to-Speech API. You specify the voice (language code, name, and gender) and audio encoding (e.g., MP3, LINEAR16, OGG). You can also provide SSML for prosody control. The API returns the audio content as base64-encoded bytes. You then decode and play or stream the audio to the user. For real-time applications, you may use streaming TTS, which sends audio chunks as they are generated.
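Decoding the returned audio is a one-liner, shown here with a fabricated response so it runs offline. The sketch assumes the REST response exposes the audio under `audioContent` as base64 text; `save_audio` and the output path are hypothetical.

```python
import base64
import os
import tempfile

def save_audio(response, path):
    """Decode base64 audio from a synthesize response and write it to disk."""
    audio = base64.b64decode(response["audioContent"])
    with open(path, "wb") as handle:
        handle.write(audio)
    return len(audio)

# Fabricated payload: placeholder bytes, not playable audio.
fake_response = {
    "audioContent": base64.b64encode(b"not real mp3 data").decode("ascii"),
}
out_path = os.path.join(tempfile.mkdtemp(), "output.mp3")
written = save_audio(fake_response, out_path)
print(written, out_path)
```

Client libraries typically hand back raw bytes instead, in which case the decode step disappears.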

What This Looks Like on the Job

Enterprise Scenario 1: Multilingual Customer Support Chatbot

A global e-commerce company wants to provide customer support in 10 languages. They use Dialogflow CX with phone gateway integration. When a customer calls, the audio is streamed to Speech-to-Text (using MULAW at 8000 Hz). The transcript is then sent to Translation API to convert to English (the backend language). Dialogflow processes the intent and generates a response in English, which is then translated to the customer's language and synthesized via Text-to-Speech. The entire pipeline must complete in under 2 seconds to avoid noticeable delay. In production, they use Cloud Run to orchestrate the APIs, with Cloud Storage for logging. A common issue is latency from Translation API; they mitigate by using a cache for frequent phrases. Misconfiguration often occurs when glossaries are not applied consistently, leading to incorrect translations of product names.
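The phrase cache mentioned in this scenario could be as simple as an in-memory LRU wrapper. This is a sketch under that assumption; `make_cached_translator` and the stub translator are hypothetical, and a multi-instance Cloud Run deployment would more likely use a shared cache.

```python
from functools import lru_cache

def make_cached_translator(translate, maxsize=1024):
    """Wrap a translate(text, target) callable with an in-memory LRU cache."""
    @lru_cache(maxsize=maxsize)
    def cached(text, target):
        return translate(text, target)
    return cached

calls = []

def fake_translate(text, target):
    # Stand-in for a real Translation API call; records each invocation.
    calls.append((text, target))
    return f"[{target}] {text}"

translate = make_cached_translator(fake_translate)
first = translate("Where is my order?", "es")
second = translate("Where is my order?", "es")
print(first == second, len(calls))
```

The second lookup never reaches the stub, which is the latency and cost saving the scenario is after.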

Enterprise Scenario 2: Video Content Localization

A media company wants to automatically generate subtitles and dubbing for videos in multiple languages. They use Speech-to-Text (asynchronous) to transcribe the original audio, then Translation API (batch) to translate the transcript into 20 languages. The translated text is then used to generate subtitles (stored as VTT files) and synthesized via Text-to-Speech for dubbing. The audio files are large (up to 2 hours), so they use asynchronous STT with Cloud Storage I/O. They also use AutoML Translation for domain-specific terms (e.g., movie titles). A common mistake is not setting the correct sample rate for STT; if the video audio is 48000 Hz but the request declares 16000 Hz, recognition can fail or the transcript quality degrades sharply. They also must handle punctuation carefully: STT's automatic punctuation is crucial for natural-sounding TTS.
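Turning translated segments into a VTT subtitle file is mostly timestamp formatting. A minimal sketch, assuming each cue is a `(start_seconds, end_seconds, text)` tuple that you derived from the transcript timing; both helper names are hypothetical.

```python
def vtt_timestamp(seconds):
    """Format seconds as a WebVTT timestamp, HH:MM:SS.mmm."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"

def to_vtt(cues):
    """Build a WebVTT document from (start_s, end_s, text) tuples."""
    lines = ["WEBVTT", ""]
    for start, end, text in cues:
        lines.append(f"{vtt_timestamp(start)} --> {vtt_timestamp(end)}")
        lines.append(text)
        lines.append("")  # blank line terminates the cue
    return "\n".join(lines)

cues = [(0.0, 1.3, "Hola y bienvenidos."), (1.3, 3.0, "Empecemos.")]
vtt = to_vtt(cues)
print(vtt)
```

Working in integer milliseconds avoids the floating-point drift that shows up when formatting fractional seconds directly.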

Enterprise Scenario 3: Real-Time Meeting Transcription and Translation

A multinational corporation uses Google Meet with live captions and translation. The Meet client sends audio to Speech-to-Text in real time (streaming recognition). The transcript is displayed as captions and also fed to Translation API for real-time translation into each participant's language. The translated text is then displayed as overlays. This requires low latency (under 500ms per chunk). They use streaming STT with single_utterance mode for turn-based meetings. A common failure is when multiple people speak in quick succession; they turn on enableSpeakerDiarization to label each speaker. They also use custom vocabulary to recognize industry jargon. Streaming sessions need care as well: audio must arrive at close to real-time pace, and a single stream is capped at roughly 5 minutes of audio, so long meetings require the client to open a new stream before the limit is reached.
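Streaming recognition consumes audio in small frames rather than one large upload. The generator below slices raw PCM into fixed-duration chunks; the 100 ms default is an assumption chosen as a reasonable starting point, and `audio_chunks` is a hypothetical helper that only prepares the frames without opening a stream.

```python
def audio_chunks(pcm, sample_rate_hz=16000, chunk_ms=100, bytes_per_sample=2):
    """Yield fixed-duration slices of mono PCM audio."""
    step = sample_rate_hz * bytes_per_sample * chunk_ms // 1000
    if step <= 0:
        raise ValueError("chunk size must be positive")
    for offset in range(0, len(pcm), step):
        yield pcm[offset:offset + step]

one_second = bytes(32000)  # 1 s of silent 16-bit mono audio at 16 kHz
chunks = list(audio_chunks(one_second))
print(len(chunks), len(chunks[0]))
```

Each chunk would become one audio message on the stream, sent at roughly the pace the audio was captured.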

How GCDL Actually Tests This

What the GCDL Exam Tests

Objective 3.2 focuses on AI and Machine Learning services, including Translation and Speech APIs. Specifically, you must understand:

Use cases for Translation API (basic vs. advanced, AutoML)

Use cases for Speech-to-Text (transcription, real-time vs. batch)

Use cases for Text-to-Speech (voice response, accessibility)

Integration patterns (e.g., STT -> Translation -> TTS)

IAM roles (e.g., cloudtranslate.user, speech.client)

Quotas and limitations (e.g., synchronous audio limit of 60 seconds)

Common Wrong Answers

1. "Translation API can translate audio directly." Wrong. Translation API only handles text. You must first use Speech-to-Text to convert audio to text.

2. "Speech-to-Text can process unlimited audio length synchronously." Wrong. Synchronous recognition is limited to 60 seconds. Longer audio requires asynchronous LongRunningRecognize.

3. "Text-to-Speech only supports English." Wrong. It supports over 40 languages and 220 voices.

4. "AutoML Translation is better than Translation API for all use cases." Wrong. AutoML is for custom domain-specific models; Translation API's pre-trained models are sufficient for general translation and are easier to use.

Key Numbers and Terms

Synchronous STT audio limit: 60 seconds

Asynchronous STT max duration: 480 minutes

Recommended sample rate: 16000 Hz

Telephony audio: MULAW at 8000 Hz

Translation API v3 supports glossaries and batch translation

WaveNet voices are more natural but more expensive

Free tier: 500,000 characters/month for Translation API (per the pricing page at the time of writing)

Edge Cases and Exam Traps

If a question asks about translating a live phone call, the correct pipeline is: STT (streaming) -> Translation -> TTS (streaming).

If a question mentions a custom model for medical terminology, the answer is AutoML Translation, not the standard Translation API.

For audio with multiple speakers, turn on enableSpeakerDiarization in STT so that each word is tagged with a speaker.

The exam may test that Translation API basic (v2) does not support glossaries, only advanced (v3).

How to Eliminate Wrong Answers

If an answer says "use Translation API to translate audio," eliminate it—Translation API is text-only.

If an answer suggests using synchronous STT for a 2-hour recording, eliminate it—use asynchronous.

If an answer mentions "custom model" for translation, check if it says AutoML Translation; if not, it's likely wrong.

For cost questions, remember that WaveNet voices cost more per character than standard voices.

Key Takeaways

Translation API translates text only; for audio, use Speech-to-Text first.

Speech-to-Text synchronous recognition is limited to 60 seconds; use asynchronous for longer audio.

Text-to-Speech offers both standard and WaveNet voices; WaveNet is more natural and more expensive.

AutoML Translation is for custom domain-specific models, not general translation.

Common pipeline: STT -> Translation -> TTS for multilingual voice applications.

IAM roles: cloudtranslate.user for Translation API, speech.client for Speech-to-Text.

Glossaries are only supported in Translation API v3 (advanced).

Easy to Mix Up

These come up on the exam all the time. Here's how to tell them apart.

Translation API Basic (v2)

Supports plain text and HTML strings only (no document translation)

No glossaries or custom models

No batch translation

Simpler API, fewer features

Per-character pricing for standard NMT text translation

Translation API Advanced (v3)

Supports text, HTML, and document translation

Supports glossaries and AutoML models

Supports batch translation with Cloud Storage

More complex API with additional parameters

More capabilities; custom models and document translation are priced separately from standard text translation

Watch Out for These

Mistake

Translation API can directly convert speech from one language to another.

Correct

Translation API only handles text. To convert speech, you must first use Speech-to-Text to get text, then Translation API, then optionally Text-to-Speech.

Mistake

Speech-to-Text can process audio of any length synchronously.

Correct

Synchronous recognition is limited to 60 seconds of audio. For longer audio, you must use the asynchronous `LongRunningRecognize` method.

Mistake

Text-to-Speech only provides standard robotic voices.

Correct

Text-to-Speech offers both standard voices and high-quality WaveNet voices that sound very natural.

Mistake

AutoML Translation is always better than the standard Translation API.

Correct

AutoML Translation is designed for domain-specific custom models. For general translation, the standard API's pre-trained models are faster and easier to use.

Mistake

All Google Cloud Speech-to-Text features are available in every language.

Correct

Features like speaker diarization, automatic punctuation, and word-level confidence are only available for certain languages (e.g., en-US, es-ES). Always check the documentation.


Frequently Asked Questions

What is the maximum audio length for synchronous Speech-to-Text?

Synchronous recognition is limited to 60 seconds. For audio longer than 60 seconds, you must use the asynchronous `LongRunningRecognize` method, which can handle up to 480 minutes. The exam often tests this limit, so remember 60 seconds for sync, 480 minutes for async.

Can I use Translation API to translate a PDF document?

Yes, but only with Translation API Advanced (v3). It supports document translation for PDF, Word, and other formats. The basic v2 API handles only plain text and HTML strings. You can submit a document via Cloud Storage and get the translated document back.

What is the difference between WaveNet and standard voices in Text-to-Speech?

WaveNet voices use deep neural networks to produce more natural, human-like speech with better intonation and prosody. Standard voices are older, more robotic, and cheaper. The exam may ask which is higher quality (WaveNet) or which is more cost-effective (standard).

How do I handle multiple speakers in Speech-to-Text?

Enable the `enableSpeakerDiarization` feature in the `RecognitionConfig`. This returns speaker tags for each word in the transcript. The feature is available for some languages like en-US. The exam may test that you need to enable diarization to separate speakers.

What is a glossary in Translation API?

A glossary is a custom mapping of terms to their preferred translations. For example, you can map 'Google Cloud' to 'Google Cloud' in Spanish instead of 'Nube de Google'. Glossaries are only supported in Translation API v3 (advanced). The exam may ask about use cases for glossaries, like brand names.
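A glossary is ultimately a small file of term pairs stored in Cloud Storage. The sketch below writes a two-column CSV, assuming the unidirectional layout of one source term and one target term per row with no header; confirm the expected file format in the glossary documentation before uploading. The helper name and the second term pair are hypothetical.

```python
import csv
import io

def glossary_csv(pairs):
    """Serialize (source_term, target_term) pairs as two-column CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for source_term, target_term in pairs:
        writer.writerow([source_term, target_term])
    return buffer.getvalue()

content = glossary_csv([
    ("Google Cloud", "Google Cloud"),  # keep the brand name untranslated
    ("checkout", "pago"),
])
print(content)
```

Using the `csv` module rather than string joins keeps terms that contain commas or quotes correctly escaped.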

Can I use Speech-to-Text for real-time transcription?

Yes, using streaming recognition. You send audio in chunks over a bidirectional gRPC stream (streaming recognition is not offered over REST). The API returns interim results and final results. Streaming is ideal for live captioning or voice assistants. The exam may contrast streaming vs. batch (asynchronous).

What IAM roles are needed to use Translation API?

The `cloudtranslate.user` role grants access to Translation API. For advanced features like glossaries, you may also need `cloudtranslate.editor` or `cloudtranslate.admin`. For Speech-to-Text, the `speech.client` role is typically sufficient. The exam may test basic IAM.
