This chapter covers DALL-E, OpenAI's generative AI model for creating images from text descriptions. For the AI-900 exam, understanding DALL-E's capabilities, use cases, and limitations is part of Objective 5.2: Describe capabilities of generative AI. While not a major focus, expect 1-2 questions that test your knowledge of what DALL-E can do, how it differs from other AI models, and its responsible AI considerations. This chapter provides the depth needed to answer those questions confidently.
Imagine you hire a world-class sketch artist who has studied millions of images. You give them a detailed description: "A cat wearing a top hat, sitting in a Victorian parlor, oil painting style." The artist doesn't just copy an existing photo—they create a new image from scratch. First, they understand the key elements: cat, top hat, Victorian parlor, oil painting. Then they generate a rough sketch, layer by layer: first the background, then the cat's shape, then the hat, then textures and lighting. At each step, they refine the details, ensuring consistency—the hat sits on the cat's head, the lighting matches the parlor. Finally, they add fine details like brush strokes to mimic oil paint. The result is a unique, coherent image that matches your description. DALL-E works similarly: it uses a transformer model to understand text prompts and a diffusion model to generate images from noise, iteratively refining them until they match the prompt. Just as the artist can't draw a cat in a top hat if they've never seen one, DALL-E relies on its training data to generate plausible images.
What is DALL-E and Why Does It Exist?
DALL-E is a generative AI model developed by OpenAI that creates original images from natural language text descriptions. The name is a portmanteau of the artist Salvador Dalí and the Pixar character WALL-E, reflecting its blend of artistic creativity and automation. DALL-E was first introduced in January 2021 (DALL-E 1), followed by DALL-E 2 in April 2022, and DALL-E 3 in October 2023. Each version improved image quality, resolution, and prompt adherence. DALL-E exists to enable anyone to generate high-quality images without needing artistic skills or expensive software, democratizing visual content creation. On the AI-900 exam, you need to recognize DALL-E as a text-to-image generative AI model under the broader category of generative AI.
How DALL-E Works Internally
DALL-E 3, the latest version, uses a combination of a transformer-based language model and a diffusion model. The process has two main stages:
Text Understanding (Language Model): The text prompt is processed by a large language model (likely GPT-4) to extract and interpret the semantic meaning. This model converts the prompt into a rich embedding vector that captures the objects, attributes, relationships, and style. For example, the prompt "a cat wearing a top hat" is parsed to identify the subject (cat), the accessory (top hat), and the action (wearing). The language model also handles rephrasing and disambiguation. DALL-E 3 is optimized for detailed prompts and can handle complex descriptions with multiple objects and spatial relationships.
Image Generation (Diffusion Model): The embedding is fed into a diffusion model, which generates an image from random noise. The diffusion model starts with a canvas of random pixels (pure noise) and iteratively removes noise over many steps (typically 50-1000 steps) to produce a coherent image. At each step, the model predicts the noise to subtract, guided by the text embedding. This process is called "denoising." The model is trained on millions of image-text pairs to learn how to map text descriptions to visual features. The final output is a 1024x1024 pixel image for DALL-E 3 (DALL-E 2 produced 1024x1024 as well, but DALL-E 1 had lower resolution).
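The denoising loop described above can be illustrated with a toy sketch. This is not DALL-E's actual model: the "noise predictor" here is a stand-in that already knows the target image, whereas a real diffusion model uses a trained network to predict the noise and is guided by the text embedding. The names `toy_denoise` and `target` are purely illustrative.

```python
import random


def toy_denoise(target, steps=50, seed=0):
    """Start from random noise and strip away a share of the remaining noise each step.

    `target` stands in for "the image the prompt describes". A real diffusion
    model does not know the target; it predicts the noise with a trained network.
    """
    rng = random.Random(seed)
    # Pure noise: one random value per "pixel".
    image = [rng.uniform(0.0, 1.0) for _ in target]
    for step in range(steps):
        # Stand-in for the model's prediction of the noise in the current image.
        predicted_noise = [pixel - goal for pixel, goal in zip(image, target)]
        # Remove a share of the predicted noise; the final step removes all that is left.
        fraction = 1.0 / (steps - step)
        image = [pixel - fraction * noise for pixel, noise in zip(image, predicted_noise)]
    return image


target = [0.0, 0.5, 1.0, 0.25]  # a tiny 4-"pixel" image
result = toy_denoise(target, steps=50)
error = max(abs(r - t) for r, t in zip(result, target))
print(f"max error after denoising: {error:.6f}")
```

The point of the sketch is the shape of the process: many small corrections applied to noise, rather than one pass that draws the image directly.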
Key Components, Values, and Defaults
Model Versions: DALL-E 3 is the current version available through Azure OpenAI Service and ChatGPT Plus. DALL-E 2 is still available but considered legacy.
Resolution: DALL-E 3 generates images at 1024x1024 pixels. Other supported sizes include 1792x1024 (landscape) and 1024x1792 (portrait).
Cost: On Azure OpenAI, pricing is per image generated. As of 2025, DALL-E 3 costs $0.040 per 1024x1024 image for standard quality and $0.080 for HD quality (finer detail at the same pixel dimensions, not a larger image). DALL-E 2 costs $0.016 per image.
Prompt Length: DALL-E 3 can handle prompts up to 4000 characters (including spaces and punctuation). Longer prompts may be truncated.
Number of Images: DALL-E 3 accepts only n=1, so each API call returns a single image. DALL-E 2 accepts 1 to 10 images per call (default is 1).
Response Format: The API returns URLs to the generated images (or base64-encoded JSON if requested). Images are stored temporarily (typically 1 hour) on OpenAI servers.
Content Filtering: DALL-E 3 includes built-in content filters that block prompts and images containing violence, hate, sexual content, or other policy violations. The filter is applied both to the prompt (before generation) and to the output (after generation). If a prompt is filtered, the API returns an error.
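The limits listed above can be checked on the client before a request is sent, which avoids a round trip for an obviously invalid call. The following is a minimal sketch that assumes the DALL-E 3 sizes, quality options, and 4000-character prompt limit given in this chapter; `check_dalle3_request` is a hypothetical helper, not part of any SDK.

```python
MAX_PROMPT_CHARS = 4000
DALLE3_SIZES = {"1024x1024", "1792x1024", "1024x1792"}
DALLE3_QUALITIES = {"standard", "hd"}


def check_dalle3_request(prompt, size="1024x1024", quality="standard"):
    """Return a list of problems found; an empty list means the request looks valid."""
    problems = []
    if not prompt.strip():
        problems.append("prompt is empty")
    if len(prompt) > MAX_PROMPT_CHARS:
        problems.append(f"prompt has {len(prompt)} characters; limit is {MAX_PROMPT_CHARS}")
    if size not in DALLE3_SIZES:
        problems.append(f"unsupported size {size!r}")
    if quality not in DALLE3_QUALITIES:
        problems.append(f"unsupported quality {quality!r}")
    return problems


print(check_dalle3_request("A cat wearing a top hat", size="512x512"))
```

A check like this does not replace the service's own validation or its content filters; it only catches mistakes that can be detected locally.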
Configuration and Verification Commands (Azure OpenAI)
To use DALL-E 3 via Azure OpenAI, you must deploy a model in your Azure resource. The deployment name is typically "dall-e-3" (or a custom name). Here's a sample Python code snippet using the OpenAI Python library (v1.x):

    from openai import AzureOpenAI

    # Client for an Azure OpenAI resource; the endpoint and key are placeholders.
    client = AzureOpenAI(
        azure_endpoint="https://your-resource-name.openai.azure.com/",
        api_key="your-api-key",
        api_version="2024-02-15-preview",  # Check for latest
    )

    response = client.images.generate(
        model="dall-e-3",  # the name of your deployment
        prompt="A cat wearing a top hat in a Victorian parlor, oil painting style",
        n=1,
        size="1024x1024",
        quality="standard",  # or "hd"
    )
    image_url = response.data[0].url
    print(image_url)

To verify the deployment, you can list models:

    for model in client.models.list():
        print(model.id)

Look for "dall-e-3" in the list.
Interaction with Related Technologies
DALL-E is part of the broader Azure OpenAI Service, which also includes GPT-4 (text generation) and Whisper (speech-to-text). These models can be combined: for example, use GPT-4 to refine a user's prompt before sending it to DALL-E, or use DALL-E to generate images for a chatbot response. DALL-E can also be used alongside Azure AI Search (formerly Azure Cognitive Search), which retrieves relevant existing images from a database rather than generating new ones. On the AI-900 exam, you should know that DALL-E is a generative AI model for images, distinct from discriminative models that classify or analyze images. DALL-E does not perform image recognition or object detection; it only generates new images.
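The "refine the prompt, then generate" combination can be sketched as a small pipeline. To keep the example runnable without credentials, the two model calls are passed in as plain functions; `refine_then_generate`, `fake_refine`, and `fake_generate` are hypothetical names, and in a real application the callables would wrap a chat completion call and an image generation call.

```python
from typing import Callable, Dict


def refine_then_generate(
    user_prompt: str,
    refine: Callable[[str], str],
    generate: Callable[[str], str],
) -> Dict[str, str]:
    """Run a text model over the prompt first, then pass the result to an image model."""
    refined = refine(user_prompt)
    image_url = generate(refined)
    # Keep both prompts so a reviewer can see what was actually sent.
    return {"original_prompt": user_prompt, "refined_prompt": refined, "image_url": image_url}


# Stand-ins for the two model calls (assumptions for demonstration only).
def fake_refine(prompt: str) -> str:
    return prompt.strip() + ", oil painting style, warm lighting"


def fake_generate(prompt: str) -> str:
    return "https://example.invalid/generated.png"


outcome = refine_then_generate("  a cat in a top hat ", fake_refine, fake_generate)
print(outcome["refined_prompt"])
```

Passing the model calls in as parameters also makes the pipeline easy to test, since the stubs can be swapped for real clients later.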
Limitations and Considerations
Prompt Adherence: DALL-E 3 is significantly better than DALL-E 2 at following complex prompts, but it can still misinterpret ambiguous phrases or ignore certain details.
Faces and Text: DALL-E 3 struggles with generating realistic human faces (often producing distorted features) and rendering legible text within images. It may generate gibberish or incorrect text.
Bias: Like all AI models, DALL-E can reflect biases present in its training data, potentially generating stereotypical or inappropriate images. OpenAI has implemented safety mitigations but biases may still appear.
Copyright: Generated images may resemble copyrighted works if the prompt describes a specific style or character. Users are responsible for ensuring they do not infringe intellectual property rights.
On the AI-900 exam, be prepared to identify appropriate use cases for DALL-E (e.g., generating marketing visuals, concept art, product prototypes) and inappropriate ones (e.g., generating realistic photos of people, medical imaging, or any application requiring high accuracy or safety).
Submit Text Prompt to API
The user provides a text description of the desired image. This prompt is sent to the DALL-E API via an HTTP POST request. For a DALL-E 3 deployment on Azure OpenAI, the endpoint has the form `https://{resource}.openai.azure.com/openai/deployments/{deployment}/images/generations?api-version=2024-02-15-preview`; the older `images/generations:submit` path belongs to the asynchronous DALL-E 2 preview API. The request includes parameters such as prompt, n (number of images), size, and quality, and the deployment name in the URL selects the model. The API validates the prompt against content filters. If the prompt violates policy, the API returns an error immediately. Otherwise, the request is queued for processing.
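The request described in this step can be assembled with the standard library alone. This sketch assumes the deployment-scoped URL form that Azure documents for DALL-E 3 and the `api-key` header used by Azure OpenAI; the resource name, deployment name, and key are placeholders, and `build_image_request` is a hypothetical helper. The request is built but not sent.

```python
import json
import urllib.request


def build_image_request(resource, deployment, api_key, prompt,
                        size="1024x1024", quality="standard",
                        api_version="2024-02-15-preview"):
    """Build (but do not send) an HTTP POST request for image generation."""
    url = (
        f"https://{resource}.openai.azure.com/openai/deployments/"
        f"{deployment}/images/generations?api-version={api_version}"
    )
    body = {"prompt": prompt, "n": 1, "size": size, "quality": quality}
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json", "api-key": api_key},
        method="POST",
    )


request = build_image_request(
    "your-resource-name", "dall-e-3", "your-api-key", "A cat wearing a top hat"
)
print(request.get_method(), request.full_url)
# Sending it would be urllib.request.urlopen(request), which needs a real resource and key.
```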
Language Model Interprets Prompt
The text prompt is processed by a large language model (likely GPT-4) to extract semantic meaning. The model parses the prompt into a structured representation, identifying key objects, attributes, relationships, and style. For example, 'a cat wearing a top hat' becomes: subject=cat, accessory=top hat, action=wearing. The model also handles rephrasing and disambiguation. The resulting representation is encoded as an embedding vector, which is then passed to the diffusion model.
Diffusion Model Generates Image
The diffusion model starts with a random noise tensor (1024x1024x3 for RGB). Over a series of steps (typically 50-1000), the model iteratively denoises the tensor, guided by the text embedding. At each step, the model predicts the noise to remove and subtracts it. This process gradually reveals a coherent image. The number of steps is a trade-off between quality and speed; more steps produce higher quality but take longer. The final output is a clean image that matches the prompt.
Content Filtering on Output
After generation, the output image is checked by a safety classifier that detects harmful content (e.g., violence, hate, sexual). If the image violates policy, it is blocked and an error is returned. This is a second layer of defense beyond prompt filtering. If the image passes, it is compressed and stored temporarily (typically for 1 hour) on OpenAI servers. The API returns a URL to the image or base64-encoded data.
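Because a filtered prompt or image surfaces as an error, calling code usually needs to tell a filter rejection apart from other failures. The sketch below works on an already-parsed error body. The `{"error": {"code": ..., "message": ...}}` shape and the two filter codes are assumptions for illustration and should be checked against the errors your own deployment returns; `classify_failure` and `user_message` are hypothetical helpers.

```python
# ASSUMPTION: these codes and the {"error": {...}} shape are illustrative;
# verify them against the error bodies your own deployment returns.
FILTER_CODES = {"contentFilter", "content_policy_violation"}


def classify_failure(error_payload):
    """Map an error payload to 'blocked' (filter rejection) or 'failed' (anything else)."""
    error = error_payload.get("error") or {}
    if error.get("code") in FILTER_CODES:
        return "blocked"
    return "failed"


def user_message(error_payload):
    """Turn an error payload into a short message suitable for showing to a user."""
    if classify_failure(error_payload) == "blocked":
        # Retrying the identical prompt would be rejected again, so ask for a rewrite.
        return "The request was blocked by the content filter. Please revise the prompt."
    detail = (error_payload.get("error") or {}).get("message", "unknown error")
    return f"Image generation failed: {detail}"


blocked = {"error": {"code": "contentFilter", "message": "Prompt was filtered."}}
other = {"error": {"code": "429", "message": "Rate limit exceeded."}}
print(user_message(blocked))
print(user_message(other))
```

Separating the two cases matters in practice: a transient failure may be worth retrying, while a filter rejection calls for a different prompt and possibly human review.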
Return Image URL to User
The API response includes a JSON object with an array of image data. Each entry contains a 'url' field pointing to the generated image (or 'b64_json' if requested). The user can then download the image or use the URL directly. The URL is valid for approximately 1 hour; after it expires, the image cannot be retrieved and must be regenerated, so download anything you want to keep. The response also includes metadata such as a creation timestamp and, for DALL-E 3, the revised prompt that was actually used for generation.
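Handling both response formats can be sketched as follows. The sample payload is made up for illustration and mirrors the 'url' and 'b64_json' fields described above; `extract_images` is a hypothetical helper that works on an already-parsed JSON response.

```python
import base64


def extract_images(response):
    """Return one ("url", str) or ("bytes", bytes) tuple per image entry."""
    images = []
    for entry in response.get("data", []):
        if entry.get("b64_json"):
            # Base64 responses carry the image itself, so nothing expires.
            images.append(("bytes", base64.b64decode(entry["b64_json"])))
        elif entry.get("url"):
            # URL responses must be downloaded before the link expires.
            images.append(("url", entry["url"]))
    return images


# Made-up sample payload; a real response holds real image data.
sample = {
    "created": 1700000000,
    "data": [
        {"url": "https://example.invalid/generated.png"},
        {"b64_json": base64.b64encode(b"fake image bytes").decode("ascii")},
    ],
}
for kind, value in extract_images(sample):
    print(kind, value if kind == "url" else f"{len(value)} bytes")
```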
Enterprise Scenario 1: Marketing and Advertising
A large retail company uses DALL-E to generate product images for e-commerce listings. Instead of hiring photographers for every product, they describe the product and desired background (e.g., 'red sneakers on a white background, studio lighting'). DALL-E generates multiple variations, which are then reviewed by a human designer. The company integrates DALL-E via Azure OpenAI Service, processing thousands of prompts daily. For hero banners, they request the 1792x1024 landscape size with the 'hd' quality option for finer detail. A common issue is that DALL-E sometimes adds extra objects or misinterprets colors, requiring prompt engineering and manual curation. The system is configured with content filtering to block inappropriate prompts, and all generated images are logged for compliance.
Enterprise Scenario 2: Game Development Concept Art
A game studio uses DALL-E to rapidly prototype character designs and environments. Designers input prompts like 'a futuristic city with neon lights, cyberpunk style, rain, reflections'. DALL-E generates concept art that inspires final designs. The studio generates batches of 10 candidate images per prompt (as separate requests, since DALL-E 3 returns one image per call), then selects the best ones. They find that DALL-E 3 handles complex scenes well but struggles with consistent character features across multiple images. To mitigate this, they use the same seed parameter (if available) to generate similar images. The studio also combines DALL-E with GPT-4 to refine prompts automatically based on feedback.
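Where an API returns only one image per request, as DALL-E 3's does, a batch of candidates can be collected with repeated calls. The sketch below takes the single-image call as a parameter so it runs without credentials; `generate_batch` and `fake_generate_one` are hypothetical names, and a real `generate_one` would wrap the image generation call.

```python
import itertools


def generate_batch(prompt, generate_one, count=10):
    """Collect up to `count` candidates by calling `generate_one(prompt)` repeatedly."""
    results = []
    for index in range(count):
        try:
            results.append(generate_one(prompt))
        except Exception as exc:
            # One failed request should not discard the rest of the batch.
            print(f"request {index + 1} failed: {exc}")
    return results


# Stand-in for a real single-image call (an assumption for demonstration only).
_counter = itertools.count(1)


def fake_generate_one(prompt):
    return f"https://example.invalid/image-{next(_counter)}.png"


candidates = generate_batch("a futuristic city with neon lights", fake_generate_one, count=10)
print(len(candidates), "candidates collected")
```

Each call is billed as a separate image, so the batch size is a direct cost multiplier.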
Enterprise Scenario 3: Educational Content Creation
An online learning platform uses DALL-E to generate illustrations for course materials. For example, a biology course might need an image of 'a cross-section of a plant cell with labeled organelles'. DALL-E can generate the image, but text rendering is poor, so labels are added manually using image editing software. The platform uses the standard quality option to keep costs low, as images are used at small sizes. They have encountered issues with DALL-E generating anatomically incorrect structures, so all images are verified by subject matter experts before publication. The platform also uses Azure's content moderation to ensure images are appropriate for all ages.
Common Misconfigurations
Overloaded Prompts: Prompts that pile on many competing details can confuse DALL-E, leading to missing elements or visual clutter. Best practice is to be specific about the elements that matter and leave out the rest.
Ignoring Content Filters: Some users try to bypass filters with euphemisms, but the model is trained to detect circumvention attempts. This results in rejected prompts or blocked images.
Assuming Perfect Accuracy: DALL-E is not suitable for applications requiring precise visual accuracy, such as medical or technical diagrams. It is a creative tool, not a simulation.
AI-900 Exam Focus on DALL-E
The AI-900 exam tests DALL-E under Objective 5.2: Describe capabilities of generative AI. You are expected to understand:
That DALL-E is a generative AI model that creates images from text.
That it is part of the Azure OpenAI Service.
Its capabilities: generating original images, varying styles, and handling complex prompts.
Its limitations: cannot recognize images, cannot generate accurate text, may produce biased or inappropriate content.
Responsible AI considerations: content filtering, bias, transparency.
Common Wrong Answers and Why Candidates Choose Them
"DALL-E can analyze images and identify objects." This is false. DALL-E generates images; it does not perform image recognition. Candidates confuse it with computer vision models like Azure Computer Vision or GPT-4 with vision.
"DALL-E can generate text with 100% accuracy." False. DALL-E often produces illegible or incorrect text. Candidates assume that because it is from OpenAI, it inherits GPT-4's text capabilities, but DALL-E's text rendering is poor.
"DALL-E is a discriminative model." False. DALL-E is generative. Candidates might mix up generative vs. discriminative AI. Remember: generative creates new content; discriminative classifies or predicts.
"DALL-E can be used for real-time video generation." False. DALL-E generates single images, not videos. Candidates may think of DALL-E as similar to video generation models like Sora (also by OpenAI), but Sora is a separate model.
Specific Numbers and Terms on the Exam
Model name: DALL-E (often spelled with hyphen).
Provider: OpenAI.
Service: Azure OpenAI Service.
Input: Text prompt (natural language).
Output: Image (URL or base64).
Resolution: 1024x1024 pixels (default).
Versions: DALL-E 2 and DALL-E 3 (DALL-E 3 is current).
Pricing: Per image (e.g., $0.040 for DALL-E 3 standard).
Content filtering: Both prompt and output are filtered.
Edge Cases and Exam Traps
DALL-E vs. GPT-4 with Vision: GPT-4 with Vision can analyze images, but DALL-E cannot. The exam may present a scenario where you need to choose between them.
DALL-E vs. Azure Computer Vision: Computer Vision extracts information from images; DALL-E creates images. They are complementary but different.
Responsible AI: The exam may ask about preventing harmful content. The correct answer is to use built-in content filters and human review.
Prompt Engineering: The exam might ask how to improve DALL-E output. Best practices include being specific, using descriptive language, and stating the style.
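The prompt engineering advice above (be specific, describe the scene, state the style) can be turned into a small helper that assembles a prompt from named parts. This is an illustrative sketch; `build_prompt` and its parameters are hypothetical and not part of any SDK.

```python
def build_prompt(subject, setting=None, style=None, lighting=None):
    """Join the non-empty parts of a description into one comma-separated prompt."""
    parts = [subject, setting, style, lighting]
    return ", ".join(part.strip() for part in parts if part and part.strip())


prompt = build_prompt(
    "A cat wearing a top hat",
    setting="in a Victorian parlor",
    style="oil painting style",
    lighting="warm candlelight",
)
print(prompt)
```

Keeping the parts separate makes it easy to vary one element, such as the style, while holding the rest of the description constant.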
How to Eliminate Wrong Answers
If the answer says DALL-E can "classify" or "detect" something, it's wrong.
If the answer says DALL-E can generate "accurate text" or "realistic faces," it's likely wrong (though DALL-E 3 improved faces, they are still imperfect).
If the answer says DALL-E is used for "image recognition," it's wrong.
If the answer mentions "real-time" or "video," it's wrong.
Focus on the core function: text-to-image generation.
DALL-E is a generative AI model that creates images from text prompts.
It is available through Azure OpenAI Service and must be deployed in an Azure resource.
DALL-E 3 is the current version; DALL-E 2 is legacy.
Default output resolution is 1024x1024 pixels.
DALL-E cannot perform image recognition or analysis.
Content filtering is applied to both prompts and generated images.
Pricing is per image generated, with different costs for standard and HD quality.
DALL-E is suitable for creative tasks but not for applications requiring high accuracy.
These come up on the exam all the time. Here's how to tell them apart.
DALL-E 3
Better prompt adherence and understanding of complex descriptions.
Supports higher resolution options (1024x1024, 1792x1024, 1024x1792).
Can generate images with more detail and fewer artifacts.
Integrated with ChatGPT for iterative prompt refinement.
Costs $0.040 per image (standard) vs $0.016 for DALL-E 2.
DALL-E 2
Less accurate at following detailed prompts; may miss elements.
Supports square sizes only (256x256, 512x512, 1024x1024); no landscape or portrait options.
Images can be less sharp and may contain more errors.
Not integrated with ChatGPT; requires manual prompt engineering.
Lower cost at $0.016 per image.
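Since pricing is per image, the cost difference between the two versions is simple arithmetic. The sketch below uses the per-image prices quoted in this chapter, which may not match current Azure pricing; `estimate_cost` is a hypothetical helper, and `Decimal` is used to avoid floating-point rounding in currency amounts.

```python
from decimal import Decimal

# Per-image prices as quoted in this chapter (check current pricing before relying on them).
PRICE_PER_IMAGE = {
    ("dall-e-3", "standard"): Decimal("0.040"),
    ("dall-e-3", "hd"): Decimal("0.080"),
    ("dall-e-2", "standard"): Decimal("0.016"),
}


def estimate_cost(model, quality, image_count):
    """Return the estimated cost in US dollars for `image_count` images."""
    return PRICE_PER_IMAGE[(model, quality)] * image_count


for (model, quality), price in PRICE_PER_IMAGE.items():
    print(f"{model} ({quality}): ${estimate_cost(model, quality, 1000)} per 1000 images")
```

At these prices, 1000 standard DALL-E 3 images cost $40 against $16 for DALL-E 2, and the HD option doubles the DALL-E 3 figure to $80.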
Mistake
DALL-E can recognize objects in images.
Correct
DALL-E is a generative model that creates images; it does not perform image recognition. Use Azure Computer Vision or GPT-4 with Vision for that.
Mistake
DALL-E generates images that are always photorealistic.
Correct
DALL-E can generate various styles (oil painting, cartoon, etc.), not just photorealistic. The style is controlled by the prompt.
Mistake
DALL-E can generate accurate text inside images.
Correct
DALL-E often produces illegible or incorrect text. For accurate text, use a text rendering engine.
Mistake
DALL-E is free to use on Azure.
Correct
DALL-E is a paid service on Azure OpenAI Service. Each image generation incurs a cost based on model and quality.
Mistake
DALL-E can generate videos.
Correct
DALL-E generates still images only. For video generation, see OpenAI's Sora or other video models.
DALL-E is used to generate original images from text descriptions. Common use cases include creating marketing visuals, concept art, product prototypes, and educational illustrations. It is not used for image analysis or recognition. For the AI-900 exam, remember that DALL-E is a generative AI model for text-to-image generation.
DALL-E generates images, while GPT-4 generates text. Both are generative AI models from OpenAI, but they serve different modalities. GPT-4 can also process images (with vision), but it analyzes them rather than generating new ones. On the exam, distinguish between text generation (GPT-4) and image generation (DALL-E).
DALL-E is not free. On Azure OpenAI Service, you pay per image generated. There is a free trial tier for new Azure accounts, but it includes limited credits. After that, you pay standard rates. On ChatGPT Plus, DALL-E 3 is included in the subscription, but usage limits apply.
DALL-E struggles with generating realistic human faces, accurate text, and complex scenes with many objects. It may produce biased or inappropriate content. It cannot recognize images or generate videos. For the exam, know that DALL-E is not suitable for tasks requiring high precision or safety.
Use detailed, specific prompts that describe the subject, style, lighting, composition, and colors. Avoid ambiguous language. Use the 'hd' quality for higher detail. For consistent results, consider using the same seed parameter (if available). Iterate by refining prompts based on output. The exam may ask about prompt engineering best practices.
Yes, DALL-E is available through Azure OpenAI Service. You need to apply for access and create a resource in the Azure portal. Then you deploy the DALL-E model and use the API to generate images. The exam may test that DALL-E is part of Azure OpenAI Service.
Content filtering is a safety mechanism that blocks prompts and images containing harmful content such as violence, hate, or sexual material. It is applied automatically before generation (prompt filtering) and after generation (output filtering). If a prompt is filtered, the API returns an error. The exam may ask about responsible AI practices.