DVA-C02, Chapter 85 of 101, Objective 1.6

Amazon Polly, Rekognition, and AI Services

This chapter covers Amazon Polly for text-to-speech and Amazon Rekognition for image and video analysis, along with other AWS AI services like Amazon Transcribe, Translate, and Comprehend. These fully managed AI services are critical for the DVA-C02 exam because they enable developers to add advanced capabilities without building ML models. Expect 3-5 questions on these services, focusing on use cases, feature limitations, and integration patterns.

25 min read
Intermediate
Updated May 31, 2026

The Voice Actor and Photo Analyst Studio

Imagine a media production studio with two specialized departments: a voice-over booth and a photo analysis lab. The voice-over booth (Amazon Polly) takes a script (text) and a voice actor profile (speech synthesis parameters) and produces a recorded audio file. The actor can speak in different styles (neural vs. standard), adjust pitch, speed, and emphasis using markup annotations (SSML). The studio manager can request the audio in various formats (MP3, OGG) and even ask for a "whispered" or "newscaster" style. The photo analysis lab (Amazon Rekognition) receives images or video streams. A technician uploads a photo and requests a list of objects it contains, faces detected, or text extracted. The lab can also compare two faces to see if they match (face comparison) or search a database of known faces (face search). Both departments operate as fully managed services: the studio provides the infrastructure, models, and expertise; you just bring the content and pay per use. You don't need to hire voice actors or train image classifiers. The studio exposes REST APIs and SDKs so you can integrate them into your production pipeline. Just as you wouldn't build your own recording studio for a single ad, you shouldn't build your own TTS or image analysis engine when AWS offers these scalable, accurate AI services.

How It Actually Works

Amazon Polly Overview

Amazon Polly is a cloud service that converts text into lifelike speech. It uses advanced deep learning technologies to synthesize speech that sounds natural. Polly supports multiple languages and voices, including both standard (concatenative) and Neural (NTTS) voices. Neural voices produce higher quality, more natural speech and are recommended for conversational applications.

How Polly Works

You send text input to Polly via the AWS SDK, CLI, or Management Console. Polly returns an audio stream of the spoken text. You can specify output format (MP3, OGG Vorbis, PCM, etc.), sample rate, and voice. Optionally, you can use Speech Synthesis Markup Language (SSML) to control pronunciation, pitch, speed, and volume. Polly also supports lexicons (custom dictionaries) to specify how certain words are pronounced, e.g., acronyms or foreign names.
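The request flow above can be sketched in Python. This is a minimal illustration, assuming a boto3 Polly client is passed in (for example `boto3.client("polly")`); the helper names `build_ssml` and `synthesize_to_bytes`, the voice `Joanna`, and the 300 ms pause are illustrative choices rather than anything prescribed by this chapter.

```python
from xml.sax.saxutils import escape


def build_ssml(text, rate="medium", pause_ms=300):
    """Wrap plain text in SSML: a short leading pause, then the text at the given rate."""
    # escape() protects &, < and > so the text cannot break the SSML document.
    return (
        "<speak>"
        f'<break time="{pause_ms}ms"/>'
        f'<prosody rate="{rate}">{escape(text)}</prosody>'
        "</speak>"
    )


def synthesize_to_bytes(polly, text, voice_id="Joanna", engine="neural"):
    """Call SynthesizeSpeech with SSML input and return the MP3 bytes."""
    response = polly.synthesize_speech(
        Text=build_ssml(text),
        TextType="ssml",
        OutputFormat="mp3",
        VoiceId=voice_id,
        Engine=engine,
    )
    # AudioStream is a streaming body; read() drains it into memory.
    return response["AudioStream"].read()
```

A caller would typically write the returned bytes to a file or upload them to S3.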

Key Features and Limits

Voices: Over 60 voices across 30+ languages. Neural voices are available for many languages.

SSML Tags: <speak>, <break>, <prosody>, <emphasis>, <say-as>, <phoneme>, <lang>, etc.

Lexicons: Uploaded to Polly with PutLexicon (they are not stored in S3). Each lexicon can be up to 4,000 characters, and you can store up to 100 lexicons per account per region.

Speech Marks: You can request a speech marks stream that provides metadata (word boundaries, sentence boundaries, SSML events) for synchronizing with animations or subtitles.

SynthesizeSpeech API: Returns the audio as a stream in the response. Input is limited to 3,000 billed characters (6,000 total characters; SSML tags count toward the total but are not billed). For longer texts, use StartSpeechSynthesisTask, which accepts up to 100,000 billed characters (200,000 total) and writes its output to S3.

Caching: Polly does not cache results; each request generates audio. You should cache audio files yourself (e.g., in S3) if you expect repeated requests for the same text.

Pricing: Pay per character of input text; SSML tags are not counted as billed characters. Neural voices cost more per character than standard voices.
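Because Polly does not cache, caller-side caching is worth sketching. The example below is a rough outline under stated assumptions: `cache` is any dict-like object standing in for an S3 bucket, and `audio_cache_key` and `get_or_synthesize` are hypothetical helper names. The key idea is to hash everything that changes the audio (text, voice, engine, format) so identical requests map to the same stored file.

```python
import hashlib


def audio_cache_key(text, voice_id, engine, output_format="mp3"):
    """Derive a stable key from every input that changes the resulting audio."""
    fingerprint = "\x1f".join([engine, voice_id, output_format, text])
    digest = hashlib.sha256(fingerprint.encode("utf-8")).hexdigest()
    return f"polly-cache/{voice_id}/{digest}.{output_format}"


def get_or_synthesize(polly, cache, text, voice_id="Joanna", engine="standard"):
    """Return cached audio when present; otherwise call Polly once and store the result."""
    key = audio_cache_key(text, voice_id, engine)
    if key in cache:
        return cache[key]
    response = polly.synthesize_speech(
        Text=text, OutputFormat="mp3", VoiceId=voice_id, Engine=engine
    )
    audio = response["AudioStream"].read()
    cache[key] = audio  # in a real system this would be an S3 put
    return audio
```

With S3 as the backing store, the same key could double as the object key served through a CDN.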

Amazon Rekognition Overview

Amazon Rekognition is a service for image and video analysis. It can detect objects, scenes, faces, text, and activities. It also provides face comparison and face search capabilities. Rekognition uses pre-trained deep learning models that require no ML expertise.

How Rekognition Works

You provide an image or video to Rekognition via API operations. For images, you can pass the image bytes directly (max 5 MB) or reference an S3 object (max 15 MB). For videos, you must store the video in S3 and call an asynchronous API such as StartLabelDetection or StartFaceDetection to start the analysis. Rekognition returns results in JSON format.
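The two ways of supplying an image can be sketched as follows. This is an illustrative outline, assuming a boto3 Rekognition client is passed in; `image_param`, `label_names`, and the default thresholds are hypothetical names and values chosen for the example. When using the SDK you pass raw bytes and it handles the encoding for you.

```python
MAX_INLINE_BYTES = 5 * 1024 * 1024  # approximate cap for images sent as raw bytes


def image_param(bucket=None, key=None, data=None):
    """Build the Image argument shared by the Rekognition image APIs."""
    if data is not None:
        if len(data) > MAX_INLINE_BYTES:
            raise ValueError("image too large to send inline; upload it to S3 instead")
        return {"Bytes": data}
    if bucket and key:
        return {"S3Object": {"Bucket": bucket, "Name": key}}
    raise ValueError("provide either data or both bucket and key")


def label_names(rekognition, image, min_confidence=80.0, max_labels=10):
    """Call DetectLabels and return (name, confidence) pairs, highest confidence first."""
    response = rekognition.detect_labels(
        Image=image, MaxLabels=max_labels, MinConfidence=min_confidence
    )
    pairs = [(label["Name"], label["Confidence"]) for label in response["Labels"]]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)
```

For example, `label_names(client, image_param(bucket="my-bucket", key="photo.jpg"))` would analyze an object already in S3 (the bucket and key here are placeholders).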

Key Features and Limits

- Image Analysis:
  - DetectLabels: Returns labels (objects, scenes, concepts) with confidence scores.
  - DetectFaces: Detects faces and returns attributes like age range, gender, emotions, landmarks, and pose. Up to 100 faces per image.
  - CompareFaces: Compares a source face with faces in a target image. Returns a similarity score.
  - DetectText: Detects text in images (e.g., street signs, documents). Returns bounding boxes and text content.
  - DetectModerationLabels: Detects adult or offensive content. Returns confidence scores for categories like Explicit Nudity and Violence.
  - RecognizeCelebrities: Detects celebrities in images.
- Video Analysis:
  - Asynchronous operations: StartLabelDetection, StartFaceDetection, StartContentModeration, StartPersonTracking, StartCelebrityRecognition, StartTextDetection.
  - You receive a completion notification via an SNS topic, or you can check the job status by calling the matching GetXxx API, which also returns the results.
  - Video input must be in an S3 bucket in the same region as the Rekognition endpoint. Supported formats: MPEG-4 (MP4) and MOV, encoded with H.264.
- Face Search:
  - Requires creating a collection (max 20 million faces per collection). You index faces using IndexFaces, then search with SearchFaces (by face ID) or SearchFacesByImage (by image).
  - Collection ID must be unique per account per region.
- Limits:
  - Images must be JPEG or PNG: up to 5 MB when passed as bytes, up to 15 MB when referenced as an S3 object. This applies to DetectLabels, DetectModerationLabels, and DetectText alike.
  - Stored videos can be up to 10 GB in size and 6 hours long. If a video exceeds these limits, you must split it.
  - Throttling: roughly 5 requests per second (RPS) for most image APIs and 2 RPS for video APIs as a baseline; actual quotas vary by operation and region, so check Service Quotas for your account.
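Starting an asynchronous video job looks roughly like the sketch below. It assumes a boto3 Rekognition client is passed in and that the SNS topic and IAM role already exist; `start_video_label_job` and the default confidence are illustrative.

```python
def start_video_label_job(rekognition, bucket, key, topic_arn, role_arn, min_confidence=70.0):
    """Start asynchronous label detection on a video stored in S3 and return the job ID."""
    response = rekognition.start_label_detection(
        Video={"S3Object": {"Bucket": bucket, "Name": key}},
        MinConfidence=min_confidence,
        # Rekognition publishes the completion status to this SNS topic,
        # using the role to obtain permission to publish.
        NotificationChannel={"SNSTopicArn": topic_arn, "RoleArn": role_arn},
    )
    return response["JobId"]
```

The returned job ID is what you later pass to GetLabelDetection to fetch results.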

Other AI Services

Amazon Transcribe: Converts speech to text. Supports real-time streaming and batch transcription. Use for call analytics, subtitles, etc.

Amazon Translate: Neural machine translation. Supports 75+ languages. Can be used with Transcribe to translate speech.

Amazon Comprehend: NLP service that extracts insights (entities, key phrases, sentiment, syntax) from text. Dominant-language detection recognizes about 100 languages; most other features support a smaller set.

Amazon Textract: Extracts text, handwriting, and data from scanned documents. Goes beyond OCR to extract form data and tables.

Integration Patterns

These AI services integrate via SDK, CLI, and through AWS services like Lambda, Step Functions, and S3. Common patterns:

Use S3 event notifications to trigger Lambda that calls Rekognition on new images.

Use Polly to generate audio for Alexa skills or IVR systems.

Use Transcribe and Translate together to transcribe and translate customer calls in real-time.

Use Comprehend to analyze social media sentiment.
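The first pattern (S3 event to Lambda to Rekognition) might be sketched like this. It is an outline under assumptions: the extra `rekognition` parameter exists only so the handler can be exercised without AWS, and the label thresholds are arbitrary. Lambda itself calls `handler(event, context)`.

```python
from urllib.parse import unquote_plus


def objects_from_s3_event(event):
    """Yield (bucket, key) for each record in an S3 event notification."""
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded (a space becomes '+').
        key = unquote_plus(record["s3"]["object"]["key"])
        yield bucket, key


def handler(event, context, rekognition=None):
    """Lambda entry point: label every newly uploaded image."""
    if rekognition is None:
        import boto3  # imported lazily so this module loads without the SDK

        rekognition = boto3.client("rekognition")
    results = {}
    for bucket, key in objects_from_s3_event(event):
        response = rekognition.detect_labels(
            Image={"S3Object": {"Bucket": bucket, "Name": key}},
            MaxLabels=10,
            MinConfidence=80,
        )
        results[key] = [label["Name"] for label in response["Labels"]]
    return results
```

In practice the results would be written to a data store or queue rather than returned.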

Security and Encryption

Data in transit: All API calls are over HTTPS.

Data at rest: Services do not store your content unless you explicitly use features like Rekognition collections. You can use KMS to encrypt S3 buckets where you store input/output.

IAM permissions: You must grant appropriate permissions (e.g., rekognition:DetectLabels, polly:SynthesizeSpeech) to IAM roles/users.

VPC endpoints: Available for Rekognition, Transcribe, Comprehend, etc., to keep traffic within your VPC.
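As an illustration of the IAM point, the sketch below builds a policy document that allows only the two actions named above. `build_policy` is a hypothetical helper, and `"Resource": "*"` is a simplification for the example.

```python
import json


def build_policy(actions):
    """Return an identity-based policy document allowing only the listed actions."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": sorted(actions),
                # "*" keeps the example short; narrow it where an action supports it.
                "Resource": "*",
            }
        ],
    }


POLICY_JSON = json.dumps(
    build_policy(["rekognition:DetectLabels", "polly:SynthesizeSpeech"]), indent=2
)
```

The resulting JSON can be attached to the Lambda execution role or another IAM role that calls these services.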

Walk-Through

1. Select AI Service

Determine which AWS AI service fits your use case. For text-to-speech, use Polly. For image analysis, use Rekognition. For speech-to-text, use Transcribe. For translation, use Translate. For NLP, use Comprehend. For document extraction, use Textract. The exam tests your ability to choose the correct service based on requirements.

2. Prepare Input Data

For Polly: prepare the text string (max 3,000 billed characters for synchronous calls, 100,000 billed or 200,000 total for async). Optionally add SSML tags. For Rekognition: ensure the image is JPEG or PNG, max 5 MB when passed as raw bytes (the SDK handles the encoding) or 15 MB when referenced in S3. For Transcribe: an audio file in S3 or a real-time audio stream. For Translate: source text (max 10,000 bytes per request). For Comprehend: text whose size limit is measured in bytes and varies by API (for example, 5 KB for DetectSentiment).
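When text is too long for one synchronous Polly call and you still want to stay synchronous, one option is to split it first. The function below is a sketch of that idea for plain text (not SSML); `chunk_text` is a hypothetical helper, and the sentence-splitting regex is deliberately simple.

```python
import re


def chunk_text(text, limit=3000):
    """Split text into chunks of at most `limit` characters, preferring sentence boundaries."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # A single sentence longer than the limit is hard-split.
        while len(sentence) > limit:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:limit])
            sentence = sentence[limit:]
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= limit:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be synthesized separately and the audio concatenated; for very long documents the asynchronous API is usually the simpler choice.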

3. Make API Call

Use the AWS SDK (e.g., boto3 in Python, AWS SDK for JavaScript) to call the appropriate API. For synchronous operations (e.g., `DetectLabels`, `SynthesizeSpeech`), the response is immediate. For asynchronous operations (e.g., video analysis, long speech synthesis), you start a job and poll for completion or receive an SNS notification.

4. Process Response

Parse the JSON response. For Rekognition, extract labels, faces, text, etc. For Polly, save the audio stream to a file or play it directly. For Transcribe, parse the transcription JSON. For Translate, retrieve the translated text. For Comprehend, extract entities, sentiment, etc. Handle errors (e.g., invalid image format, throttling).
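Throttling is the error most worth handling explicitly. The AWS SDKs already retry some errors on their own, so the sketch below only makes the idea visible: retry with exponential backoff and jitter when the error code indicates throttling. It assumes botocore-style exceptions that carry a `response` dict; `call_with_backoff` and the delay values are illustrative.

```python
import random
import time

THROTTLE_CODES = {"ThrottlingException", "ProvisionedThroughputExceededException"}


def error_code(exc):
    """Extract the AWS error code from a botocore-style exception, if there is one."""
    response = getattr(exc, "response", None) or {}
    return response.get("Error", {}).get("Code")


def call_with_backoff(fn, max_attempts=5, base_delay=0.5, sleep=time.sleep):
    """Call fn(); on throttling errors retry with exponential backoff plus jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception as exc:
            # Re-raise anything that is not throttling, or the final failed attempt.
            if error_code(exc) not in THROTTLE_CODES or attempt == max_attempts:
                raise
            delay = base_delay * (2 ** (attempt - 1))
            sleep(delay + random.uniform(0, delay))
```

A call site would wrap the request in a lambda, for example `call_with_backoff(lambda: client.detect_labels(Image=image))`.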

5. Handle Asynchronous Jobs

For long-running tasks (Polly async synthesis, Rekognition video analysis), the `StartXxx` API returns a job or task ID. Poll the matching `GetXxx` API with that ID until the job reaches a terminal status: Rekognition reports `SUCCEEDED` or `FAILED`, while Polly reports `completed` or `failed`. Alternatively, configure an SNS topic to receive a completion notification. When polling, check every few seconds rather than in a tight loop.
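The polling approach for a Rekognition video job can be sketched as below. It assumes a boto3 Rekognition client is passed in; `wait_for_labels`, the poll interval, and the poll cap are illustrative. Results are paginated, so the sketch also follows `NextToken`.

```python
import time


def wait_for_labels(rekognition, job_id, poll_seconds=5, max_polls=120, sleep=time.sleep):
    """Poll GetLabelDetection until the job finishes, then return labels from all pages."""
    for _ in range(max_polls):
        response = rekognition.get_label_detection(JobId=job_id)
        status = response["JobStatus"]
        if status == "SUCCEEDED":
            break
        if status == "FAILED":
            raise RuntimeError(response.get("StatusMessage", "video analysis failed"))
        sleep(poll_seconds)  # still IN_PROGRESS
    else:
        raise TimeoutError(f"job {job_id} did not finish after {max_polls} polls")

    labels = list(response.get("Labels", []))
    # Follow NextToken until the last page has been read.
    while response.get("NextToken"):
        response = rekognition.get_label_detection(
            JobId=job_id, NextToken=response["NextToken"]
        )
        labels.extend(response.get("Labels", []))
    return labels
```

For production workloads the SNS notification route avoids paying for idle polling.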

What This Looks Like on the Job

Scenario 1: E-Learning Platform with Text-to-Speech

An e-learning startup wants to provide audio versions of course materials. They use Amazon Polly to convert lesson text into MP3 files. They store the generated audio in S3 and serve it via CloudFront. They use SSML to adjust pronunciation of technical terms and control pacing. They cache audio files to avoid re-synthesizing the same text. Challenges: handling long texts (use async synthesis), managing costs (neural voices are more expensive), and ensuring low latency for real-time previews. They set up a Lambda function triggered by S3 uploads to automatically generate audio for new lessons. They monitor CloudWatch metrics for errors and throttling.
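For the long-lesson case in this scenario, an asynchronous synthesis task might be started as in the sketch below. It assumes a boto3 Polly client is passed in; the function names, the `lessons/` prefix, and the default voice are illustrative.

```python
def start_long_synthesis(polly, text, bucket, prefix="lessons/", voice_id="Joanna",
                         engine="standard", topic_arn=None):
    """Start an asynchronous synthesis task; Polly writes the MP3 to the given S3 bucket."""
    params = {
        "Text": text,
        "OutputFormat": "mp3",
        "VoiceId": voice_id,
        "Engine": engine,
        "OutputS3BucketName": bucket,
        "OutputS3KeyPrefix": prefix,
    }
    if topic_arn:
        params["SnsTopicArn"] = topic_arn  # optional completion notification
    task = polly.start_speech_synthesis_task(**params)["SynthesisTask"]
    return task["TaskId"], task.get("OutputUri")


def synthesis_finished(polly, task_id):
    """Return True when done, False while running; raise if the task failed."""
    task = polly.get_speech_synthesis_task(TaskId=task_id)["SynthesisTask"]
    status = task["TaskStatus"]  # scheduled | inProgress | completed | failed
    if status == "failed":
        raise RuntimeError(task.get("TaskStatusReason", "synthesis task failed"))
    return status == "completed"
```

Note that Polly task statuses are lowercase (`completed`, `failed`), unlike the uppercase statuses Rekognition uses.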

Scenario 2: Social Media Moderation with Rekognition

A social media company needs to automatically moderate user-uploaded images. They use S3 event notifications to trigger a Lambda function that calls Rekognition's DetectModerationLabels on each image. If the confidence score for explicit content exceeds a threshold (e.g., 70%), the image is flagged for manual review or automatically rejected. They also use DetectText to catch offensive text in images. They batch process videos using StartContentModeration. Performance considerations: Rekognition has rate limits (5 RPS for image APIs); they implement a queue (SQS) to decouple uploads from processing. They also use Rekognition's face search to identify banned users from a collection.
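The thresholding logic in this scenario could look like the sketch below. The 70% review threshold comes from the scenario; the 95% auto-reject threshold and the function name `moderate_image` are assumptions made for the example.

```python
def moderate_image(rekognition, bucket, key, review_at=70.0, reject_at=95.0):
    """Return ("approve" | "review" | "reject", label names) for an image in S3."""
    response = rekognition.detect_moderation_labels(
        Image={"S3Object": {"Bucket": bucket, "Name": key}},
        MinConfidence=review_at,  # labels below this confidence are not returned
    )
    labels = response["ModerationLabels"]
    top = max((label["Confidence"] for label in labels), default=0.0)
    if top >= reject_at:
        decision = "reject"
    elif top >= review_at:
        decision = "review"
    else:
        decision = "approve"
    return decision, sorted(label["Name"] for label in labels)
```

Keeping the thresholds as parameters makes it easy to tune them per content category later.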

Scenario 3: Multilingual Customer Support with Transcribe and Translate

A global e-commerce company wants to transcribe and translate customer support calls in real-time. They use Amazon Transcribe's streaming API to transcribe audio from the call center. The transcription is streamed to a Lambda function that calls Amazon Translate to translate the text into the agent's preferred language. The translated text is displayed on the agent's screen. For post-call analytics, they store transcriptions in S3 and use Comprehend to extract sentiment and key phrases. Challenges: handling multiple languages, ensuring low latency (streaming requires sub-second processing), and dealing with accents and background noise. They use custom language models with Transcribe for domain-specific terminology.
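The translate and sentiment steps of this scenario can be sketched as two small functions. They assume boto3 Translate and Comprehend clients are passed in; the function names are hypothetical, and `"auto"` asks Translate to detect the source language.

```python
def translate_for_agent(translate, text, agent_language, source_language="auto"):
    """Translate a transcript segment into the agent's language."""
    response = translate.translate_text(
        Text=text,
        SourceLanguageCode=source_language,
        TargetLanguageCode=agent_language,
    )
    return response["TranslatedText"]


def call_sentiment(comprehend, transcript, language_code="en"):
    """Return the overall sentiment label and the per-class scores for a transcript."""
    response = comprehend.detect_sentiment(Text=transcript, LanguageCode=language_code)
    return response["Sentiment"], response["SentimentScore"]
```

In a streaming pipeline `translate_for_agent` would be called per finalized transcript segment, while `call_sentiment` fits the post-call batch step.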

How DVA-C02 Actually Tests This

What DVA-C02 Tests

Domain 1 (Development) Objective 1.6: "Integrate AWS AI and ML services into application code." The exam focuses on:

Choosing the correct service for a given use case (e.g., Polly for TTS, Rekognition for image analysis, Transcribe for STT, Translate for translation, Comprehend for NLP, Textract for document extraction).

Understanding API limits (e.g., Polly synchronous max 3,000 billed characters, Rekognition image max 5 MB as bytes or 15 MB from S3, stored video max 10 GB and 6 hours).

Knowing when to use synchronous vs. asynchronous API calls.

Recognizing integration patterns (S3 + Lambda, SNS for job completion).

Understanding IAM permissions needed (e.g., rekognition:DetectLabels, polly:SynthesizeSpeech).

Common Wrong Answers and Traps

1. Wrong service selection: Candidates often confuse Polly (text-to-speech) with Transcribe (speech-to-text). Remember: Polly speaks, Transcribe listens. Another trap: using Rekognition for document text extraction when Textract is better suited (Textract handles forms and tables).

2. Ignoring limits: The exam loves to test the 3,000-character limit for synchronous Polly. If a question mentions a 5,000-character plain-text input, you must use StartSpeechSynthesisTask (async). Similarly, a video that exceeds Rekognition's stored-video limits (10 GB or 6 hours) must be split.

3. Synchronous vs. asynchronous: For image analysis, all Rekognition image APIs are synchronous. For stored video, they are asynchronous. Candidates sometimes try to use synchronous APIs for video.

4. Face search vs. face comparison: CompareFaces compares two images and returns a similarity score. SearchFacesByImage searches a collection of faces. The exam may ask which to use for identifying a person from a database of known faces.
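The difference between the two face APIs is easier to see side by side. The sketch below assumes a boto3 Rekognition client is passed in and that faces were indexed with an `ExternalImageId`; `same_person`, `identify`, and the 90% threshold are illustrative.

```python
def same_person(rekognition, source_bytes, target_bytes, threshold=90.0):
    """One-to-one: does the largest face in the source appear in the target image?"""
    response = rekognition.compare_faces(
        SourceImage={"Bytes": source_bytes},
        TargetImage={"Bytes": target_bytes},
        SimilarityThreshold=threshold,
    )
    return len(response["FaceMatches"]) > 0


def identify(rekognition, collection_id, image_bytes, threshold=90.0):
    """One-to-many: return the best match's ExternalImageId from a collection, or None."""
    response = rekognition.search_faces_by_image(
        CollectionId=collection_id,
        Image={"Bytes": image_bytes},
        FaceMatchThreshold=threshold,
        MaxFaces=1,
    )
    matches = response["FaceMatches"]
    return matches[0]["Face"].get("ExternalImageId") if matches else None
```

`same_person` needs no setup, whereas `identify` only works after a collection has been created and populated with IndexFaces.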

Key Numbers and Terms

Polly: 3,000 billed characters synchronous (6,000 total), 100,000 billed characters async (200,000 total), 4,000 characters per lexicon, 100 lexicons per account per region.

Rekognition: 5 MB image limit when passed as bytes, 15 MB when referenced as an S3 object, 5 RPS for image APIs and 2 RPS for video APIs as a baseline (actual quotas vary by operation and region).

Transcribe: supports FLAC, MP3, WAV, etc. Real-time streaming requires WebSocket or HTTP/2.

Translate: 10,000 bytes per TranslateText request, 75+ languages.

Comprehend: request size limits are in bytes and vary by API (for example, 5 KB for DetectSentiment); dominant-language detection covers about 100 languages.

Edge Cases

Polly SSML tags count toward the total character limit (6,000 for SynthesizeSpeech) but not toward the billed character limit (3,000).

Rekognition DetectLabels returns every label at or above MinConfidence (55% by default) unless you cap the count with the MaxLabels parameter.

Rekognition face search collections are regional; you cannot search across regions.

Transcribe custom language models are trained on domain-specific text data that you upload to S3, not on audio recordings.

Key Takeaways

Amazon Polly converts text to speech (TTS); the synchronous API limit is 3,000 billed characters; the async limit is 100,000 billed characters (200,000 total).

Amazon Rekognition image APIs are synchronous; video APIs are asynchronous (start job, poll for results).

Rekognition image max size is 5 MB as bytes (15 MB for S3 objects); stored video can be up to 10 GB and 6 hours long.

Use SSML in Polly to control pronunciation, pitch, speed, and volume.

Rekognition face search requires a collection; use SearchFacesByImage to find a face in the collection.

Amazon Transcribe converts speech to text; supports real-time streaming and batch transcription.

Amazon Translate is for language translation; can be combined with Transcribe for cross-language communication.

Amazon Comprehend extracts entities, sentiment, key phrases, and syntax from text.

Amazon Textract extracts text, handwriting, and data from documents (forms and tables).

All AI services integrate with Lambda, S3, and SNS for event-driven workflows.

Easy to Mix Up

These come up on the exam all the time. Here's how to tell them apart.

Amazon Polly (Synchronous API)

Max input text: 3,000 billed characters (6,000 total including SSML tags).

Returns audio immediately in the response.

Best for short texts like notifications or greetings.

Billed per character of input.

Cannot handle very long documents.

Amazon Polly (Asynchronous API)

Max input text: 100,000 billed characters (200,000 total).

Starts a synthesis task; you poll or get SNS notification.

Best for long documents like articles or books.

Billed per character of input (same rate).

Requires S3 bucket for output storage.

Amazon Rekognition Image APIs

Synchronous: response in same call.

Max image size: 5 MB (as bytes) or 15 MB (via S3).

Supported formats: JPEG, PNG.

Rate limit: 5 RPS (varies by region).

Use cases: real-time moderation, object detection in static images.

Amazon Rekognition Video APIs

Asynchronous: start job, then poll.

Max video size: 10 GB and 6 hours per stored video.

Supported formats: MPEG-4 (MP4) and MOV, encoded with H.264.

Rate limit: 2 RPS.

Use cases: analyzing surveillance footage, video content moderation.

Watch Out for These

Mistake

Amazon Polly caches audio output, so subsequent requests for the same text are faster.

Correct

Polly does NOT cache results. Each request generates new audio. You must implement your own caching (e.g., store in S3) to avoid repeated charges and latency.

Mistake

Amazon Rekognition can analyze videos synchronously.

Correct

All video analysis APIs (e.g., StartLabelDetection) are asynchronous. You must start a job and then poll for results or receive an SNS notification.

Mistake

Amazon Transcribe can be used for text-to-speech.

Correct

Transcribe converts speech to text, not the reverse. For text-to-speech, use Amazon Polly.

Mistake

Amazon Rekognition's CompareFaces can search a face collection.

Correct

CompareFaces compares two images and returns a similarity score. To search a collection of faces, use SearchFacesByImage or SearchFaces.

Mistake

All AWS AI services support real-time streaming.

Correct

Only some services support real-time streaming: Transcribe accepts streaming audio input, and Polly streams its audio output. Rekognition image APIs are synchronous request-response, not streaming.


Frequently Asked Questions

What is the difference between Amazon Polly and Amazon Transcribe?

Polly converts text to speech (text-to-speech, TTS). Transcribe converts speech to text (speech-to-text, STT). Use Polly when you need to generate audio from text, e.g., for voice responses. Use Transcribe when you need to transcribe audio recordings, e.g., for call analytics.

Can I use Amazon Rekognition to detect text in images?

Yes, Rekognition's DetectText API can detect text in images. However, if you need to extract text from scanned documents or forms, Amazon Textract is more suitable because it can extract form data and tables.

What happens if I call Polly with more than 3000 characters?

The synchronous SynthesizeSpeech API will return a TextLengthExceededException. You must use StartSpeechSynthesisTask (asynchronous) for texts that exceed the synchronous limit (up to 100,000 billed or 200,000 total characters).

How do I get notified when an asynchronous Rekognition video analysis job completes?

You can configure an SNS topic when starting the job (e.g., StartLabelDetection) by providing a NotificationChannel parameter with an SNS topic ARN and a role ARN. Alternatively, you can poll the GetLabelDetection API using the job ID.

What is the difference between Rekognition CompareFaces and SearchFacesByImage?

CompareFaces compares a source face with faces in a target image and returns a similarity score for each match. SearchFacesByImage searches a pre-existing collection of faces (created via IndexFaces) for matches to the input image. Use CompareFaces for one-to-one comparison; use SearchFacesByImage for one-to-many identification.

Can I use Amazon Translate to translate audio?

Translate translates text, not audio. To translate speech, first transcribe the audio with Amazon Transcribe, then translate the resulting text with Translate, and optionally synthesize the translated text with Polly.

What are the IAM permissions required for Polly?

The actions you are most likely to need are polly:SynthesizeSpeech, polly:StartSpeechSynthesisTask, polly:GetSpeechSynthesisTask, and polly:DescribeVoices; grant only the ones your code actually calls. If using lexicons, add polly:GetLexicon, polly:PutLexicon, and related actions. The AWS managed policy AmazonPollyFullAccess covers all of them but is broader than most applications need.
