GCDLChapter 60 of 101Objective 3.3

Google Gemini Models and Multimodal AI

Objective 3.3 of the GCDL exam (Data Analytics AI domain) covers Google's Gemini models and their multimodal AI capabilities, a key topic. Understanding Gemini is critical because it represents Google's latest foundation model family, designed to natively process and reason across text, images, audio, video, and code. Approximately 10-15% of exam questions in the Data Analytics AI domain touch on multimodal AI and foundation models, with specific focus on Gemini's architecture, capabilities, and use cases.

25 min read

Intermediate

Updated Jul 20, 2026

Reviewed by Johnson Ajibi· Senior Network & Security Engineer · MSc IT Security

Jump to a section

Explain it to me simply Where people get tripped up Test what I know Look up key terms

Gemini as a Multilingual Swiss Army Knife

A Swiss Army knife that processes multiple languages, not just text, is a remarkably versatile tool. The knife has different blades: one for text (like a standard language model), one for images (a vision encoder), one for audio (a speech encoder), and one for video (a video encoder). When you give it a task that involves a picture and a text question, the knife doesn't just use the text blade; it simultaneously uses the image blade to extract visual features and the text blade to understand the query. These blades feed into a central 'processor' (the multimodal fusion layer) that combines the information. For example, if you show it a photo of a dog and ask 'What breed is this?', the image blade identifies shapes and colors, while the text blade parses the question. The fusion layer correlates the visual features with the concept 'breed' from the text, then activates the appropriate output blade (text) to generate 'Golden Retriever'. The knife can also handle audio: you speak a question, the audio blade converts speech to text, then the fusion layer processes it alongside any visual input. Crucially, the knife doesn't need separate tools for each combination; it dynamically selects which blades to use based on the input modalities. This is exactly how Gemini works: it natively processes text, images, audio, video, and code through specialized encoders, then fuses them in a shared representation space to generate responses.

How It Actually Works

What is Gemini?

Gemini is Google's most capable and general-purpose AI model family, announced in December 2023 and developed by Google DeepMind. Unlike previous models that were primarily text-based or required separate models for different modalities, Gemini was designed from the ground up as a multimodal model. This means it can understand and reason across text, images, audio, video, and code simultaneously, without needing separate components glued together. The GCDL exam expects you to know that Gemini is a foundation model — a large-scale neural network trained on diverse data that can be adapted for a wide range of tasks.

Gemini Model Variants

Gemini comes in three sizes to serve different deployment scenarios: - Gemini Ultra: The largest and most capable model, designed for highly complex tasks. It achieves state-of-the-art results on 30 of 32 academic benchmarks used in LLM research, including MMLU (Massive Multitask Language Understanding) where it scored 90.0%, surpassing human experts. Ultra is intended for data centers and large-scale enterprise applications. - Gemini Pro: The best model for scaling across a wide range of tasks. It powers Google's Bard (now Gemini) chatbot and is available via the Gemini API. Pro balances performance and cost, making it suitable for most production use cases. - Gemini Nano: The most efficient model, designed for on-device deployment. It runs directly on smartphones, starting with the Pixel 8 Pro. Nano is optimized for tasks like summarization, smart reply, and proofreading without requiring a cloud connection.

Multimodal Architecture

Gemini's architecture is based on a Transformer decoder with modifications for multimodal processing. The key innovation is that it processes different modalities natively rather than converting everything to text. Here's how it works:

1. Input Encoders: Gemini uses separate encoders for each modality: - Text: Standard tokenizer and embedding layer. - Image: A Vision Transformer (ViT) or similar architecture that breaks images into patches and encodes them as sequences. - Audio: A convolutional encoder (e.g., USM — Universal Speech Model) that processes raw audio waveforms into spectrogram-like features. - Video: Treated as a sequence of frames, each encoded similarly to images, plus temporal embeddings to capture motion. - Code: Tokenized similarly to text but with specialized tokens for syntax and structure.

Multimodal Fusion: The encoded representations from all modalities are concatenated or interleaved into a single sequence of tokens. For example, an input with an image and text becomes: [image_token_1, image_token_2, ..., text_token_1, text_token_2, ...]. The model then applies self-attention across this combined sequence, allowing it to learn cross-modal relationships. This is fundamentally different from earlier approaches that used separate models for vision and language and fused them only at the output.

Output Generation: The model generates output tokens autoregressively, using the fused representation as context. The output can be text, but Gemini can also generate other modalities (e.g., code, structured data) depending on the task.

Key Capabilities for the Exam

The GCDL exam tests your understanding of what Gemini can do. The most important capabilities include:

Multimodal understanding: Gemini can answer questions about images, videos, and audio. For example, given a video of a basketball game, it can describe the play, identify players, and even understand the game's rules.

Code generation and understanding: Gemini can generate code in multiple languages (Python, Java, C++, etc.), explain code, debug it, and convert between languages. It achieved a 74.4% pass rate on the Python coding benchmark HumanEval.

Reasoning across modalities: Gemini can combine information from different modalities. For instance, given a chart (image) and a text question, it can extract data from the chart and perform calculations.

Long context window: Gemini 1.5 Pro introduced a context window of up to 1 million tokens, allowing it to process entire books, long videos, or massive codebases in a single prompt.

Tool use and function calling: Gemini can interact with external APIs and tools, enabling it to perform actions like searching the web, querying databases, or controlling smart devices.

Training and Data

Gemini was trained on a diverse dataset including text, images, audio, video, and code. Google used its TPU v5p accelerators for training, which are custom-designed for large-scale machine learning. The training process involved multiple stages: 1. Pre-training: The model is trained on a large corpus of multimodal data to predict the next token in a sequence. This teaches the model language, visual concepts, and cross-modal relationships. 2. Instruction tuning: The model is fine-tuned on a dataset of instructions and responses to improve its ability to follow user prompts. 3. Reinforcement learning from human feedback (RLHF): The model is further refined using human preferences to align its outputs with helpfulness and safety.

Integration with Google Cloud

Gemini is deeply integrated into Google Cloud services, which is a key point for the GCDL exam: - Vertex AI: Gemini models are available through Vertex AI, Google's unified ML platform. Developers can access Gemini Pro and Gemini Ultra via the Vertex AI API, with features like model tuning, grounding, and safety filters. - Gemini API: A dedicated API for accessing Gemini models, with SDKs for Python, JavaScript, and other languages. The API supports streaming, function calling, and multimodal inputs. - Google Workspace: Gemini is integrated into Workspace products like Gmail, Docs, Sheets, and Meet, providing features like 'Help me write', summarization, and smart compose. - Google Cloud Console: Gemini provides an AI-powered assistant within the Cloud Console to help with troubleshooting, generating CLI commands, and explaining resources.

Use Cases and Examples

The exam expects you to know real-world applications: - Customer support: A company uploads product manuals (PDFs with images) and uses Gemini to answer customer queries across text and images. For example, 'How do I replace the filter?' — Gemini can find the relevant diagram and explain the steps. - Content moderation: Gemini analyzes video streams for inappropriate content, combining visual cues (objects, actions) with audio (speech, sounds) to make moderation decisions. - Medical imaging: A radiologist uploads an X-ray and asks Gemini to describe findings. Gemini can identify anomalies and generate a preliminary report. - Code review: A developer pastes a code snippet and asks Gemini to find bugs. Gemini can analyze the code, understand the logic, and suggest fixes.

Performance Benchmarks

Memorize these key numbers for the exam: - MMLU: Gemini Ultra scored 90.0%, surpassing human experts (89.8%) and GPT-4 (86.4%). - HumanEval: Gemini Ultra scored 74.4% on Python code generation. - Natural2Code: Gemini Ultra scored 74.9% on code generation without relying on web solutions. - Math: Gemini Ultra scored 94.4% on GSM8K (grade school math) and 53.2% on MATH (competition-level math). - Video understanding: Gemini Ultra achieved 59.4% on the VATEX video captioning benchmark, a 10% improvement over prior state-of-the-art.

Safety and Responsibility

Google has implemented several safety measures for Gemini: - Safety filters: Built-in classifiers that detect and block harmful content (hate speech, violence, sexual content) in both input and output. - Grounding: The ability to ground responses in Google Search or user-provided data to reduce hallucinations. - Red-teaming: Extensive testing by internal and external teams to identify vulnerabilities. - Model cards: Detailed documentation of model capabilities, limitations, and evaluation results.

Comparison with Other Models

While the exam focuses on Gemini, it's useful to understand how it compares: - GPT-4: Both are multimodal, but Gemini was designed as native multimodal, whereas GPT-4 initially used separate vision and language models that were later combined. Gemini Ultra outperforms GPT-4 on many benchmarks. - Claude 3: Anthropic's model family also has multimodal capabilities, but Claude 3 Opus is more focused on safety and longer context. Gemini 1.5 Pro's 1M token context is unique. - LLaMA: Open-source models that are primarily text-only. Gemini offers stronger multimodal capabilities out of the box.

Exam Tips

Know the three model sizes (Ultra, Pro, Nano) and their primary use cases.

Understand that Gemini is natively multimodal, not a collection of separate models.

Remember the key benchmarks: MMLU 90.0%, HumanEval 74.4%.

Be aware of integration points: Vertex AI, Gemini API, Workspace, Cloud Console.

Know that Gemini can process video natively (not just frames extracted separately).

Walk-Through

1. Input Acquisition

The user provides input in one or more modalities: text, image, audio, video, or code. For text, the model receives a string of characters. For images, it receives raw pixel data (e.g., JPEG, PNG). For audio, it receives a waveform (e.g., WAV, MP3). For video, it receives a sequence of frames with timestamps. The input is preprocessed into a format the model can understand: text is tokenized, images are resized and normalized, audio is converted to spectrograms, and video is sampled into frames at a specific rate (e.g., 1 frame per second). This preprocessing happens on the client side or server side depending on the API.

2. Modality-Specific Encoding

Each modality is processed by a dedicated encoder. Text is encoded by a transformer-based language model that converts tokens into embedding vectors. Images are encoded by a Vision Transformer (ViT) that divides the image into patches (e.g., 16x16 pixels) and projects them into a sequence of embeddings. Audio is encoded by a convolutional neural network (e.g., USM) that extracts features like Mel-frequency cepstral coefficients (MFCCs). Video is encoded by processing each frame through the image encoder and adding positional embeddings to capture temporal order. The output of each encoder is a sequence of vectors (tokens) that represent the input in a high-dimensional space.

3. Multimodal Fusion via Attention

The encoded tokens from all modalities are concatenated into a single sequence. For example, if the input has 10 text tokens and 20 image tokens, the combined sequence has 30 tokens. The model applies a transformer decoder with self-attention across this entire sequence. During self-attention, each token can attend to every other token, allowing the model to learn relationships between modalities. For instance, the text token 'dog' can attend to image patches that contain fur or a tail. This fusion layer is the core innovation — it creates a joint representation where information from different modalities is mixed at every layer, rather than only at the output.

4. Autoregressive Output Generation

After the fused representation is computed, the model generates output tokens one at a time in an autoregressive manner. At each step, it uses the fused context and the previously generated tokens to predict the next token. The output can be text, but the model can also generate structured data like JSON or code. For multimodal outputs (e.g., generating an image), the process would require a decoder specific to that modality, but Gemini primarily outputs text. The generation continues until an end-of-sequence token is emitted or a maximum token limit is reached. The model uses techniques like top-k sampling or beam search to produce coherent responses.

5. Post-processing and Safety Filtering

The generated output is passed through safety classifiers that check for harmful content (e.g., hate speech, violence, sexually explicit material). If the output violates safety policies, it is blocked or replaced with a default response. Additionally, the output may be formatted for the specific application (e.g., markdown for chat, JSON for API responses). The model also supports grounding — the ability to verify facts against Google Search or user-provided documents to reduce hallucinations. Finally, the output is returned to the user via the API or user interface.

What This Looks Like on the Job

Enterprise Scenario 1: Multimodal Customer Support at a Retail Company

A large e-commerce company deploys Gemini Pro via Vertex AI to power its customer support chatbot. The chatbot handles queries that include text, images, and even short video clips. For example, a customer sends a photo of a damaged product and asks, 'Can I get a replacement?' The Gemini model processes the image to identify the product (e.g., a blender) and the damage (cracked base), and simultaneously understands the text query. It then generates a response that includes a return label and instructions. In production, the company uses Vertex AI's model endpoint with autoscaling to handle peak loads of 10,000 requests per minute. They also implement grounding with their product catalog to ensure the model only references available items. A common misconfiguration is not setting appropriate safety filters, leading to the model generating inappropriate responses when customers upload images with offensive content. Properly configured, Gemini reduces average handle time by 40% and increases customer satisfaction scores by 25%.

Enterprise Scenario 2: Code Analysis and Generation at a Fintech Startup

A fintech startup uses Gemini Ultra through the Gemini API to assist developers with code reviews and documentation. Developers paste code snippets (Python, Java, or Go) into an internal tool, and Gemini analyzes the code for bugs, security vulnerabilities, and adherence to best practices. It also generates unit tests and documentation. The startup uses Gemini's 1 million token context window to analyze entire codebases — for example, a pull request that modifies 500 files. They integrate Gemini via a custom Slack bot that sends code to the API and returns comments. Performance considerations include API latency (typically 2-5 seconds for complex analyses) and cost (Gemini Ultra is priced per token). A common issue is that developers rely too heavily on Gemini's suggestions without reviewing them, leading to security flaws being introduced. The startup mitigates this by requiring human review of all AI-generated code.

Enterprise Scenario 3: Video Content Moderation at a Social Media Platform

A social media platform uses Gemini Nano on-device to moderate live video streams for policy violations. The model runs directly on users' smartphones, analyzing video frames and audio in real-time. If it detects nudity, violence, or hate speech, it flags the stream for human review or automatically stops the broadcast. This architecture reduces cloud costs and latency, as no data leaves the device. The platform uses Gemini Nano because it is optimized for on-device inference with minimal power consumption. They fine-tune the model on their specific policy definitions using Vertex AI's model tuning service. A challenge is balancing accuracy with false positives — overly sensitive filtering frustrates users. The platform continuously updates the model based on feedback. When misconfigured (e.g., too low a threshold), legitimate content like news reports about violence can be incorrectly flagged.

How GCDL Actually Tests This

What GCDL Tests on Gemini and Multimodal AI (Objective 3.3)

The exam focuses on high-level understanding rather than deep technical implementation. Specifically, you need to know:

The three Gemini model sizes (Ultra, Pro, Nano) and their primary use cases.

That Gemini is natively multimodal — it processes text, images, audio, video, and code together.

Key benchmark numbers: MMLU 90.0%, HumanEval 74.4%.

Integration points: Vertex AI, Gemini API, Google Workspace, Cloud Console.

That Gemini can handle long context (1 million tokens for Gemini 1.5 Pro).

Common Wrong Answers and Why Candidates Choose Them

'Gemini is a text-only model like previous LLMs.' — This is wrong because Gemini was built as multimodal from the start. Candidates confuse it with earlier models like PaLM 2, which were text-only. The exam may list 'text-only' as a distractor.

'Gemini Pro is the most capable model.' — Wrong. Gemini Ultra is the largest and most capable. Pro is the best for scaling across tasks, but Ultra achieves the highest benchmarks. Candidates may think 'Pro' implies 'professional' or 'best'.

'Gemini can only process one modality at a time.' — Wrong. Gemini can process multiple modalities simultaneously in a single prompt. For example, you can provide an image and a text question together.

'Gemini Nano is only for cloud deployment.' — Wrong. Nano is designed for on-device deployment, not cloud. Candidates may assume all models are cloud-based.

Specific Numbers and Terms That Appear on the Exam

MMLU score of 90.0% for Gemini Ultra.

HumanEval score of 74.4% for Gemini Ultra.

1 million token context window for Gemini 1.5 Pro.

Three sizes: Ultra, Pro, Nano.

Integration with Vertex AI and the Gemini API.

Edge Cases and Exceptions the Exam Loves

Gemini is not just for text generation — it can also generate code, analyze images, and process video.

Gemini Ultra is not yet available to everyone — it was initially limited to select customers and is more expensive.

Gemini Nano runs on-device — it does not require an internet connection after initial model download.

Gemini 1.5 Pro introduced the 1M token context — earlier versions had shorter contexts (e.g., 32k tokens).

How to Eliminate Wrong Answers

If an answer says Gemini is 'text-only', eliminate it immediately.

If an answer says Gemini Pro is the 'most capable', eliminate it — that's Ultra.

If an answer says Gemini cannot process video or audio, eliminate it — it can.

If an answer says all Gemini models are cloud-based, eliminate it — Nano is on-device.

Focus on the word 'native' — Gemini is a natively multimodal model, not a collection of separate models.

Key Takeaways

Gemini is a natively multimodal foundation model family from Google DeepMind, processing text, images, audio, video, and code.

Three model sizes: Ultra (most capable), Pro (best for scaling), Nano (on-device efficiency).

Gemini Ultra achieved 90.0% on MMLU, surpassing human experts and GPT-4.

Gemini 1.5 Pro supports a context window of up to 1 million tokens.

Integration points include Vertex AI, Gemini API, Google Workspace, and Cloud Console.

Gemini Nano runs on-device, enabling offline AI features on smartphones.

The model uses separate encoders for each modality and fuses them via self-attention in a shared representation space.

Easy to Mix Up

These come up on the exam all the time. Here's how to tell them apart.

Gemini Pro

Best for scaling across a wide range of tasks.

Available via Gemini API and Vertex AI.

Lower cost per token than Ultra.

Suitable for most production use cases.

Benchmarks: MMLU 83.7%, HumanEval 67.7%.

Gemini Ultra

Most capable model for highly complex tasks.

Limited availability, higher cost.

State-of-the-art benchmarks: MMLU 90.0%, HumanEval 74.4%.

Requires more computational resources.

Best for research and high-stakes applications.

Watch Out for These

Mistake

Gemini is just a renamed version of Bard.

Correct

Bard was Google's earlier chatbot, initially powered by LaMDA and later PaLM 2. Gemini is a new, more powerful foundation model family. Bard was rebranded as Gemini, but the underlying model is now Gemini Pro.

Mistake

Gemini can only process text and images, not audio or video.

Correct

Gemini is natively multimodal and can process text, images, audio, video, and code simultaneously. For example, it can analyze a video with audio track and answer questions about both visual and auditory content.

Mistake

Gemini Ultra is the most widely available model.

Correct

Gemini Pro is the most widely available model, accessible via the Gemini API and Vertex AI. Gemini Ultra is currently limited to select customers and is more expensive.

Mistake

Gemini Nano requires a constant internet connection.

Correct

Gemini Nano is designed for on-device deployment and can run offline after the model is downloaded. It is optimized for low latency and privacy.

Mistake

All Gemini models have the same capabilities.

Correct

The three sizes (Ultra, Pro, Nano) have different capabilities, performance, and cost. Ultra is the most capable, Pro is balanced, and Nano is optimized for on-device efficiency.

Do You Actually Know This?

Reveal each answer, then mark whether you got it right. Score 60%+ to unlock the next chapter.

Frequently Asked Questions

What is Gemini and how does it differ from previous Google AI models?

Gemini is Google's most advanced foundation model family, designed as natively multimodal from the ground up. Unlike previous models like PaLM 2, which were text-only, Gemini can process and reason across text, images, audio, video, and code simultaneously. It comes in three sizes: Ultra, Pro, and Nano. For the exam, remember that Gemini is not just a text model — it's multimodal.

What are the three sizes of Gemini and their use cases?

Gemini Ultra is the largest and most capable, designed for highly complex tasks and achieving state-of-the-art benchmarks. Gemini Pro is the best for scaling across a wide range of tasks, powering the Gemini chatbot and API. Gemini Nano is the most efficient, designed for on-device deployment on smartphones. For the exam, know that Ultra is the most capable, Pro is the most widely available, and Nano is for on-device.

How does Gemini process multiple modalities simultaneously?

Gemini uses separate encoders for each modality (text, image, audio, video, code) to convert inputs into token embeddings. These tokens are then concatenated into a single sequence and processed by a transformer decoder with self-attention, allowing cross-modal relationships to be learned. This is different from earlier approaches that used separate models for each modality and fused them only at the output.

What are the key benchmark scores for Gemini Ultra?

Gemini Ultra achieved 90.0% on MMLU (massive multitask language understanding), surpassing human experts (89.8%) and GPT-4 (86.4%). On HumanEval (Python code generation), it scored 74.4%. On GSM8K (grade school math), it scored 94.4%. These numbers are exam-relevant.

How is Gemini integrated with Google Cloud services?

Gemini is available through Vertex AI for building and deploying custom AI applications, through the Gemini API for direct access, and is integrated into Google Workspace (Gmail, Docs, Sheets, Meet) for productivity features. It also powers the Cloud Console assistant. For the exam, remember these integration points.

What is the context window of Gemini 1.5 Pro?

Gemini 1.5 Pro supports a context window of up to 1 million tokens. This allows it to process entire books, long videos, or massive codebases in a single prompt. This is a key differentiator from other models that have smaller context windows (e.g., 128k tokens for GPT-4 Turbo).

Can Gemini run on-device without an internet connection?

Yes, Gemini Nano is designed for on-device deployment and can run offline after the model is downloaded. It is optimized for low latency and privacy, making it suitable for features like smart reply and summarization on smartphones like the Pixel 8 Pro.

Terms Worth Knowing

BigQuery Cloud computing Cloud IAM Cloud storage Machine learning Region

Ready to put this to the test?

You've just covered Google Gemini Models and Multimodal AI — now see how well it sticks with free GCDL practice questions. Full explanations included, no account needed.

Try GCDL practice questions Back to all chapters

Done with this chapter?

Vertex AI Studio for Generative AI

PaLM API and Google AI APIs

See the full GCDL study guide