AI-900Chapter 75 of 100Objective 5.1

Types of Generative AI: Text, Image, Code, Audio

Which four generative AI types—text, image, code, and audio—are tested in AI-900 domain 5.1? Generative AI is a rapidly growing area of Azure AI services, and understanding the capabilities and use cases of each type is critical for the exam. Approximately 10-15% of exam questions touch on generative AI concepts, with a focus on identifying appropriate services for specific scenarios. By the end of this chapter, you will be able to distinguish between text generation, image generation, code generation, and audio generation models, and recommend the correct Azure service for each task.

25 min read

Intermediate

Updated Jul 20, 2026

Reviewed by Johnson Ajibi· Senior Network & Security Engineer · MSc IT Security

Jump to a section

Explain it to me simply Where people get tripped up Test what I know Look up key terms

Generative AI as a Versatile Artist

A master artist has studied millions of paintings, photographs, musical scores, and architectural blueprints. This artist doesn't just copy—they learn the underlying patterns, styles, and rules of each medium. When you ask for "a watercolor of a cat in the style of Monet," the artist's brain activates neural pathways specific to watercolor technique, feline anatomy, and Monet's impressionist brushstrokes. For text generation, the artist writes a novel after reading every book in the library; for code, they write a program after studying millions of code repositories; for audio, they compose a symphony after analyzing countless recordings. Each type of generative AI is like a specialized version of this artist, trained on massive datasets of that specific medium. The artist doesn't just remix—they generate novel combinations that follow the statistical patterns learned during training. Just as the artist might struggle if asked to paint in a style they never studied, generative AI models can only produce outputs within the domain of their training data. The key mechanism is that the artist (model) has learned a compressed representation of the training data's distribution, and generates new samples by sampling from that learned distribution, conditioned on the input prompt.

How It Actually Works

What is Generative AI?

Generative AI refers to a class of artificial intelligence models that can create new content—text, images, code, audio, video, and more—rather than simply analyzing or classifying existing data. Unlike discriminative models that learn boundaries between classes (e.g., is this image a cat or a dog?), generative models learn the underlying distribution of the training data and can sample from it to produce novel outputs. The AI-900 exam expects you to understand the distinction between generative and discriminative AI, and to recognize the different types of generative models available in Azure.

Text Generation

Text generation models produce human-like text based on a prompt. They are built on large language models (LLMs) like GPT-4, which are transformer-based neural networks trained on vast corpora of text from the internet, books, articles, and other sources. The core mechanism is next-token prediction: given a sequence of tokens, the model predicts the most likely next token, then iteratively generates subsequent tokens. The model uses self-attention to weigh the importance of each token in the input sequence, enabling it to maintain context over long passages. Key parameters include:

Temperature: Controls randomness. Lower values (e.g., 0.1) produce more deterministic, focused outputs; higher values (e.g., 0.9) produce more creative, varied outputs.

Top-p (nucleus sampling): Limits the cumulative probability of token choices. For example, top-p=0.9 means the model considers only tokens that make up the top 90% of probability mass.

Max tokens: Caps the length of the generated response.

Stop sequences: Tokens that signal the model to stop generating.

Azure services for text generation include Azure OpenAI Service (with models like GPT-3.5, GPT-4, and GPT-4 Turbo) and the Language Service for summarization and conversation. The exam may ask you to choose between these services based on customization needs (e.g., fine-tuning vs. pre-built models).

Image Generation

Image generation models create visual content from text descriptions (text-to-image) or from other images (image-to-image). The dominant architecture is diffusion models, such as DALL-E (available in Azure OpenAI Service) and Stable Diffusion. These models work by gradually adding noise to an image during training, then learning to reverse the process to generate images from pure noise, conditioned on a text prompt. Key steps:

Forward diffusion: During training, the model learns to predict noise added to images at various timesteps.

Reverse diffusion: At inference, the model starts with random noise and iteratively denoises it, guided by the text prompt, to produce a coherent image.

Parameters include: - Number of inference steps: More steps generally yield higher quality but take longer. - Guidance scale: Controls how strongly the model adheres to the prompt. Higher values (e.g., 15-20) produce more prompt-faithful images but may reduce creativity. - Seed: For reproducibility, a fixed seed ensures the same output for the same prompt.

Azure services: Azure OpenAI Service with DALL-E 3, and the Computer Vision service for image analysis (not generation). The exam focuses on knowing that DALL-E is the image generation model in Azure.

Code Generation

Code generation models produce programming code from natural language descriptions or partial code snippets. These models are typically LLMs fine-tuned on large code corpora, such as GitHub repositories, documentation, and Stack Overflow. Examples include GitHub Copilot (powered by OpenAI Codex) and Azure OpenAI Service with GPT-4's code capabilities. The mechanism is similar to text generation—next-token prediction—but the token vocabulary includes programming language syntax, keywords, and variable names. Code models understand context like function signatures, comments, and surrounding code to generate syntactically and semantically correct code. Key considerations:

Language support: Most models support multiple languages (Python, JavaScript, TypeScript, C#, Java, etc.).

Context window: The model uses the current file and open tabs to inform suggestions.

Completion vs. generation: Code models can generate whole functions or complete lines.

Azure services: Azure OpenAI Service (for code generation via GPT-4) and GitHub Copilot (integrated with IDEs). The exam may ask which service to use for code generation in Azure.

Audio Generation

Audio generation models create speech, music, or sound effects from text or other audio. Two main subcategories:

Text-to-Speech (TTS): Converts text into spoken audio. Azure Cognitive Services Speech Service provides neural TTS with natural-sounding voices. It uses deep neural networks to model prosody, intonation, and rhythm. Key parameters: voice (e.g., en-US-JennyNeural), speaking style, rate, pitch.

Music and sound generation: Models like OpenAI's Jukebox or Meta's MusicGen can generate music from text descriptions. Azure does not currently offer a dedicated music generation service, but the Speech Service can create custom voices and speech.

For the exam, focus on Azure's Speech Service for TTS and the ability to create custom neural voices using the Custom Voice capability. Audio generation also includes speech recognition (STT), but that is discriminative, not generative.

How Generative AI Models Are Trained

All generative models share a common training paradigm: they learn a probability distribution over the training data. For text, this is the distribution of word sequences; for images, the distribution of pixel arrangements; for code, the distribution of programming patterns. Training involves:

Data collection: Massive datasets (e.g., Common Crawl for text, LAION-5B for images).

Preprocessing: Tokenization, normalization, filtering.

Model architecture: Transformers for text/code, diffusion or GANs for images, WaveNet or Tacotron for audio.

Training objective: For LLMs, it's next-token prediction; for diffusion models, it's noise prediction; for GANs, it's adversarial loss.

Fine-tuning: Adapting a pre-trained model to a specific domain (e.g, medical text, legal documents) with smaller datasets.

Azure Services for Generative AI

Azure offers several services that support generative AI workloads:

Azure OpenAI Service: Provides access to GPT-4, GPT-3.5, DALL-E, and Embeddings models. It is the primary service for text and image generation on Azure.

Azure Cognitive Services Speech: Offers text-to-speech (neural TTS) and speech-to-text. It is the go-to for audio generation.

Azure Machine Learning: Allows you to train and deploy custom generative models (e.g., fine-tune a GPT model).

Azure Bot Service: Can integrate with Azure OpenAI to create conversational bots.

The exam expects you to match these services to use cases: use Azure OpenAI for text and image generation, Speech for audio generation, and ML for custom models.

Ethical Considerations and Responsible AI

Generative AI raises important ethical issues, including: - Bias: Models can perpetuate stereotypes present in training data. - Misinformation: Generated text/images can be used to create fake news or deepfakes. - Copyright: Training on copyrighted data raises legal questions. - Transparency: Users should know when content is AI-generated.

Azure provides tools like Content Safety and Responsible AI dashboards to mitigate these risks. The exam may test your understanding of Microsoft's responsible AI principles: fairness, reliability and safety, privacy and security, inclusiveness, transparency, and accountability.

Comparison with Discriminative AI

| Aspect | Generative AI | Discriminative AI | |--------|---------------|------------------| | Goal | Generate new data similar to training data | Classify or predict labels | | Example | GPT-4 writing an essay | A model classifying emails as spam or not spam | | Training data | Unlabeled or self-supervised | Labeled examples | | Output | New content (text, image, etc.) | A label or probability |

This distinction is fundamental for the exam.

Walk-Through

Choose a Generative AI Type

Identify the type of content you need to generate: text, image, code, or audio. This determines which Azure service and model to use. For text, use Azure OpenAI with GPT models. For images, use DALL-E via Azure OpenAI. For code, use Azure OpenAI or GitHub Copilot. For audio, use Speech Service for TTS. The exam will present scenarios where you must select the correct service based on the output type.

Select the Azure Service

For text and image generation, create an Azure OpenAI resource in the Azure portal. For audio, create a Speech service resource. For custom models, use Azure Machine Learning. Each service has its own pricing tier, region availability, and quota limits. For example, Azure OpenAI requires application for access and has rate limits (e.g., 40 requests per minute for GPT-4). The exam may ask about these limits.

Configure the Model Parameters

Set parameters like temperature, max tokens, top-p, and stop sequences for text generation. For image generation, set size (e.g., 1024x1024), quality, and style. For TTS, choose voice, language, and speaking style. These parameters significantly affect output quality and should be tuned for the specific use case. The exam expects you to know the purpose of temperature and max tokens.

Send a Prompt or Input

Submit a text prompt for text/image generation or a text string for TTS. The prompt should be clear and specific to get desired results. For code generation, include context like existing code and comments. The model processes the input through its neural network and generates output token by token. The response time depends on model size and output length; GPT-4 is slower than GPT-3.5.

Review and Refine Output

Evaluate the generated content for quality, accuracy, and safety. Use Azure's Content Safety service to filter harmful content. If output is not satisfactory, adjust parameters or prompt. For iterative generation, you can use chat completions to maintain context. The exam may test your understanding of how to improve output quality through prompt engineering.

What This Looks Like on the Job

Enterprise Scenario 1: Customer Support Chatbot

A large e-commerce company wants to deploy a chatbot that can answer customer queries, generate personalized responses, and draft emails. They use Azure OpenAI Service with GPT-4 for text generation. The chatbot is integrated with Azure Bot Service and Cognitive Search to ground responses in product documentation. The system processes thousands of queries daily, with each response costing fractions of a cent. Challenges include managing token limits (4096 tokens for GPT-4) and ensuring responses are factually accurate. The team uses prompt engineering with system messages to enforce brand tone and restrict topics. Misconfiguration—like setting temperature too high—can lead to hallucinated or off-topic responses, requiring manual review and fine-tuning.

Enterprise Scenario 2: Automated Image Creation for Marketing

A marketing agency uses DALL-E 3 via Azure OpenAI to generate product images and social media graphics from text descriptions. They need consistent branding, so they use a fixed seed and similar prompts. The service handles up to 50 requests per minute under their tier. They also use Azure Content Safety to filter inappropriate imagery. Common issues: images may not match the prompt exactly (requires prompt refinement), and generation time is ~10-20 seconds per image. The team must balance quality with cost, as higher resolution images cost more.

Enterprise Scenario 3: Code Assistance for Developers

A software development firm uses GitHub Copilot integrated with Visual Studio Code to accelerate coding. Copilot suggests code snippets and entire functions based on comments and context. It uses OpenAI Codex, which is trained on public GitHub repositories. The team observes that Copilot is most effective for boilerplate code and common patterns but may produce insecure code (e.g., SQL injection vulnerabilities) if not reviewed. They implement mandatory code review for AI-generated code. The service operates as a cloud API with local IDE integration, requiring internet connectivity. Misconfiguration of the IDE extension can lead to poor suggestions or latency.

How AI-900 Actually Tests This

AI-900 Exam Focus on Generative AI

The AI-900 exam tests your ability to identify the correct Azure service for different generative AI tasks. Specific objectives under domain 5.1 include:

Recognize the capabilities of generative AI models for text, image, code, and audio.

Identify Azure services that support generative AI: Azure OpenAI Service, Azure Cognitive Services Speech, Azure Machine Learning.

Understand the difference between generative and discriminative AI.

Know the responsible AI principles.

Common Wrong Answers

Choosing Computer Vision for image generation: Many candidates select Computer Vision because it deals with images, but Computer Vision is for analysis (classification, object detection), not generation. Image generation is done by DALL-E in Azure OpenAI.

Selecting Language Understanding (LUIS) for text generation: LUIS is for intent recognition and entity extraction, not generation. Use Azure OpenAI for text generation.

Confusing text-to-speech with speech-to-text: TTS is generative (audio from text), while STT is discriminative (text from audio). The exam may ask which is generative.

Assuming all generative AI models are in Azure OpenAI: While Azure OpenAI covers text and image, audio generation is in the Speech service, and custom models can be built in Azure ML.

Key Terms and Values

GPT-4, GPT-3.5: Text generation models.

DALL-E: Image generation model.

Neural TTS: Text-to-speech in Speech service.

Temperature, max tokens, top-p: Parameters for text generation.

Prompt engineering: Technique to improve output.

Responsible AI: Six principles.

Edge Cases

Multimodal models: GPT-4 can accept images as input (vision), but the exam may ask about output types.

Fine-tuning: Customizing models for specific domains is possible but not required for the exam basics.

Content filtering: Azure applies default filters; you can configure content filters.

Eliminating Wrong Answers

If the question asks about generating new content, eliminate any service that only analyzes or classifies.

If the output is text, eliminate Speech and Computer Vision.

If the output is code, eliminate services that don't generate code (e.g., Language Service).

If responsible AI is mentioned, look for principles like fairness, transparency.

Key Takeaways

Generative AI creates new content; discriminative AI classifies or predicts.

Azure OpenAI Service is the primary service for text (GPT-4) and image (DALL-E) generation.

Azure Speech Service provides text-to-speech (generative) and speech-to-text (discriminative).

Code generation is done via Azure OpenAI (GPT-4) or GitHub Copilot.

Key parameters: temperature (randomness), max tokens (output length), top-p (nucleus sampling).

Responsible AI principles: fairness, reliability, privacy, inclusiveness, transparency, accountability.

Prompt engineering is critical for getting quality outputs from generative models.

Easy to Mix Up

These come up on the exam all the time. Here's how to tell them apart.

Azure OpenAI Service

Generates text and images using GPT-4 and DALL-E.

Requires special access application.

Pricing based on tokens processed.

Supports fine-tuning for custom models.

Ideal for chatbots, content creation, and code generation.

Azure Cognitive Services Speech

Generates audio (speech) from text using neural TTS.

Available without special access.

Pricing based on characters or audio duration.

Offers custom voice creation (Custom Neural Voice).

Ideal for voice assistants, audiobooks, and accessibility.

Watch Out for These

Mistake

Generative AI only produces text.

Correct

Generative AI can produce text, images, code, audio, video, and more. The AI-900 exam covers text, image, code, and audio generation specifically.

Mistake

Azure Computer Vision can generate images.

Correct

Azure Computer Vision is for analyzing images (classification, object detection, OCR). Image generation is done using DALL-E via Azure OpenAI Service.

Mistake

Text-to-speech is not generative AI.

Correct

Text-to-speech is generative AI because it creates new audio content from text. The Speech Service's neural TTS is a generative model.

Mistake

All generative AI models are available in Azure OpenAI Service.

Correct

Azure OpenAI Service provides text and image generation models (GPT-4, DALL-E). Audio generation is provided by the Speech Service, and custom generative models can be built in Azure Machine Learning.

Mistake

Generative AI always produces accurate and safe content.

Correct

Generative AI can produce biased, inaccurate, or harmful content. Azure provides Content Safety and responsible AI tools to mitigate risks. Output should always be reviewed.

Do You Actually Know This?

Reveal each answer, then mark whether you got it right. Score 60%+ to unlock the next chapter.

Frequently Asked Questions

What is the difference between generative AI and discriminative AI?

Generative AI learns the distribution of training data and can create new samples (e.g., text, images). Discriminative AI learns boundaries between classes to classify or predict labels (e.g., spam detection). For the exam, remember: generative produces, discriminative decides.

Which Azure service should I use for text generation?

Use Azure OpenAI Service with GPT-4 or GPT-3.5 models. It provides powerful text generation for chatbots, summarization, and content creation. The Language Service offers pre-built text analysis but not generation.

Can Azure generate images from text?

Yes, using DALL-E 3 via Azure OpenAI Service. You provide a text prompt and receive an image. Computer Vision cannot generate images; it only analyzes them.

Is text-to-speech considered generative AI?

Yes, text-to-speech (TTS) generates new audio content from text, making it generative. Speech-to-text (STT) is discriminative because it transcribes audio to text without creating new content.

What are common parameters for controlling text generation?

Temperature (0-1): lower = more deterministic, higher = more creative. Max tokens: limits response length. Top-p (0-1): nucleus sampling. Stop sequences: tokens that halt generation. These are tested on the exam.

How does Azure ensure responsible use of generative AI?

Azure provides Content Safety for filtering harmful content, and follows six responsible AI principles: fairness, reliability and safety, privacy and security, inclusiveness, transparency, and accountability. The exam may ask you to identify these principles.

What is the difference between GPT-3.5 and GPT-4?

GPT-4 is more advanced, with better reasoning, larger context window (up to 32K tokens), and multimodal capabilities (accepts images as input). GPT-3.5 is faster and cheaper. Both are available in Azure OpenAI.

Terms Worth Knowing

Artificial intelligence Generative AI Machine learning Natural language processing Responsible AI

Ready to put this to the test?

You've just covered Types of Generative AI: Text, Image, Code, Audio — now see how well it sticks with free AI-900 practice questions. Full explanations included, no account needed.

Try AI-900 practice questions Back to all chapters

Done with this chapter?

Power Virtual Agents and Azure Bot Framework

Foundation Models and Fine-Tuning

See the full AI-900 study guide