AI-900 · Chapter 44 of 100 · Objective 2.3

RNNs and Transformer Architecture

Two foundational sequence-processing models in deep learning — Recurrent Neural Networks (RNNs) and the Transformer architecture — are the focus of this chapter. Understanding these architectures is critical for the AI-900 exam, as they underpin many Azure AI services for language, translation, and time-series analysis. Approximately 10-15% of exam questions touch on sequence model concepts, either directly or through related services like Azure Cognitive Services for text and speech. You will learn how RNNs handle sequential data, their limitations, and how Transformers overcome these limitations with self-attention, forming the basis for modern large language models.


Intermediate

Updated Jul 20, 2026

Reviewed by Johnson Ajibi· Senior Network & Security Engineer · MSc IT Security


The Office Memo Chain vs. The Conference Call

Picture a sequence of memos passing through a company's employees. In the RNN approach, each employee reads a memo, writes a brief summary, and passes both the summary and the next memo to the next employee. The summary is like the hidden state: it captures key information from all previous memos. As the chain grows, however, the summaries become vague and lose details from the earliest memos. This mirrors the trouble RNNs have with long-range dependencies, which stems from the vanishing gradient problem during training.

Now consider the Transformer approach: instead of a chain, all employees gather in a conference room. Each memo is written on a whiteboard, and each employee can directly read any memo at any time. They use a system of colored sticky notes (attention scores) to highlight which memos are most relevant to their own task. Every employee can therefore access information from any memo, regardless of its position in the sequence, without losing detail. The conference room also scales better because all employees work in parallel, while the chain forces sequential processing. This mirrors how Transformers use self-attention to process all tokens simultaneously, avoiding the sequential bottleneck and long-range dependency issues of RNNs.

How It Actually Works

What are RNNs and Why Do They Exist?

Recurrent Neural Networks (RNNs) are a class of neural networks designed to process sequential data where the order of elements matters. Unlike feedforward networks, which assume all inputs are independent, RNNs maintain a hidden state that captures information about previous elements in the sequence. This makes them suitable for tasks like time-series forecasting, natural language processing (NLP), speech recognition, and machine translation.

The key innovation of RNNs is the recurrent connection: at each time step, the network takes the current input and the previous hidden state, and produces a new hidden state and an output. This allows information to persist across time steps. However, this recurrence also introduces challenges, particularly the vanishing and exploding gradient problems, which make it difficult for RNNs to learn long-range dependencies.

How RNNs Work Internally

At each time step t, an RNN computes:

- h_t = activation(W_h * h_{t-1} + W_x * x_t + b)
- y_t = activation(W_y * h_t + b_y)

Where:

- x_t is the input at time t (e.g., a word embedding vector)
- h_{t-1} is the hidden state from the previous time step
- h_t is the new hidden state
- W_h, W_x, W_y are weight matrices (shared across all time steps)
- b, b_y are bias vectors
- activation is typically tanh or ReLU

The hidden state h_t acts as a memory that encodes information about the sequence up to time t. The output y_t can be used for prediction at each step (e.g., next character in text) or only at the final step (e.g., sentiment classification).
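The recurrence above can be sketched in a few lines of plain Python. This is only an illustrative toy, assuming tanh as the activation and small hand-picked weights; the helper names (`matvec`, `rnn_step`, `rnn_forward`) and all numbers are made up for the example and are not taken from any library.

```python
import math

def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def rnn_step(x_t, h_prev, W_x, W_h, b):
    """One recurrent step: h_t = tanh(W_h * h_{t-1} + W_x * x_t + b)."""
    wx = matvec(W_x, x_t)
    wh = matvec(W_h, h_prev)
    return [math.tanh(a + c + d) for a, c, d in zip(wh, wx, b)]

def rnn_forward(xs, W_x, W_h, b):
    """Run the same weights over every time step, carrying the hidden state."""
    h = [0.0] * len(b)          # initial hidden state h_0
    states = []
    for x_t in xs:              # strictly sequential: step t needs step t-1
        h = rnn_step(x_t, h, W_x, W_h, b)
        states.append(h)
    return states

# Toy sizes (assumed): 2-dimensional inputs, 3-dimensional hidden state
W_x = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.2]]
W_h = [[0.1, 0.0, 0.2], [0.0, 0.1, -0.1], [0.3, -0.2, 0.1]]
b = [0.0, 0.0, 0.0]
xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

states = rnn_forward(xs, W_x, W_h, b)
print(states[-1])  # final hidden state, usable for e.g. sequence classification
```

Note that the loop cannot be parallelized across time steps, which is the sequential bottleneck discussed later in this chapter.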

Key Components: Vanilla RNN, LSTM, GRU

- Vanilla RNN: The simplest form, but suffers from vanishing gradients when sequences are long (typically >10-20 steps). Gradients during backpropagation through time (BPTT) become extremely small, preventing the network from learning long-range dependencies.
- Long Short-Term Memory (LSTM): Introduces a cell state and three gates (input, forget, output) to control information flow. The cell state can carry information across many time steps with minimal decay. LSTMs are effective for sequences up to hundreds of steps.
  - Forget gate: decides what to discard from the cell state
  - Input gate: decides what new information to store
  - Output gate: decides what to output based on the cell state
- Gated Recurrent Unit (GRU): A simplified version of LSTM with two gates (reset and update). It has fewer parameters and can be faster to train, but often performs similarly to LSTM.
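To make the three LSTM gates concrete, here is a minimal sketch of a single-unit LSTM step using the standard textbook update (cell state = forget gate × old cell state + input gate × candidate). The function name `lstm_step`, the parameter dictionary, and the weights are hypothetical and chosen by hand so that the forget gate stays near 1 and the input gate near 0.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One step of a single-unit LSTM; p maps gate name -> (w_x, w_h, bias)."""
    def pre(name):
        w_x, w_h, bias = p[name]
        return w_x * x + w_h * h_prev + bias

    f = sigmoid(pre("forget"))        # what to keep from the old cell state
    i = sigmoid(pre("input"))         # how much new information to store
    o = sigmoid(pre("output"))        # how much of the cell state to expose
    g = math.tanh(pre("candidate"))   # candidate values to write
    c = f * c_prev + i * g            # new cell state (long-term memory)
    h = o * math.tanh(c)              # new hidden state (per-step output)
    return h, c

# Hand-picked toy parameters (assumed, not learned)
params = {
    "forget": (0.0, 0.0, 5.0),     # large bias -> forget gate close to 1
    "input": (0.0, 0.0, -5.0),     # negative bias -> input gate close to 0
    "output": (0.0, 0.0, 0.0),
    "candidate": (1.0, 0.0, 0.0),
}

h, c = 0.0, 1.0                    # start with something stored in the cell
for x in [0.3, -0.7, 0.5, 0.1]:
    h, c = lstm_step(x, h, c, params)

print(h, c)  # the cell state stays close to its starting value of 1.0
```

With these settings the cell state is carried almost unchanged across steps, which is the "minimal decay" behavior described above.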

The Transformer Architecture

Transformers, introduced in the paper "Attention Is All You Need" (Vaswani et al., 2017), revolutionized sequence modeling by eliminating recurrence altogether. Instead, they rely entirely on self-attention mechanisms to capture dependencies between all positions in the sequence. This allows parallel processing and better handling of long-range dependencies.

Key Components of a Transformer:

Self-Attention (Scaled Dot-Product Attention): For each input token, the model computes a weighted sum of the values of all tokens in the sequence (including the token itself), where the weights (attention scores) are determined by the similarity between the token's query and each token's key. Mathematically:

- Attention(Q,K,V) = softmax(Q * K^T / sqrt(d_k)) * V
- Q, K, V are matrices derived from the input embeddings via learned linear projections
- d_k is the dimension of the keys (the scaling keeps dot products from growing large and saturating the softmax)

Multi-Head Attention: Instead of a single attention function, the model uses multiple heads (typically 8-16) that learn different representation subspaces. The outputs are concatenated and linearly projected.

Positional Encoding: Since there is no recurrence, the model needs information about token positions. Positional encodings are added to the input embeddings; the original paper uses sine and cosine functions of different frequencies.

Feed-Forward Networks: Each layer contains a fully connected feed-forward network applied independently to each position.

Layer Normalization and Residual Connections: Used to stabilize training and allow deeper networks.
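The attention formula can be sketched without any libraries. The example below is a simplified illustration: it reuses the raw toy inputs as Q, K, and V, whereas a real model would first apply learned linear projections. The function names and the toy vectors are assumptions for the example.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """softmax(Q * K^T / sqrt(d_k)) * V, with each matrix as a list of rows."""
    d_k = len(K[0])
    outputs, weights = [], []
    for q in Q:
        # Similarity of this query with every key, scaled by sqrt(d_k)
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        w = softmax(scores)
        weights.append(w)
        # Weighted sum of all value vectors
        outputs.append(
            [sum(wj * v[col] for wj, v in zip(w, V)) for col in range(len(V[0]))]
        )
    return outputs, weights

# Three toy tokens with 2-dimensional representations (assumed values)
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, w = attention(X, X, X)

for row in w:
    print([round(p, 3) for p in row])  # each row of weights sums to 1
```

Each query row is independent of the others, so in practice all rows are computed at once as a matrix product; that independence is what allows parallel processing.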

Transformer Variants:

- Encoder-Only (e.g., BERT): Used for tasks like text classification, named entity recognition, and question answering.
- Decoder-Only (e.g., GPT): Used for autoregressive generation tasks like language modeling and text completion.
- Encoder-Decoder (e.g., T5): Used for sequence-to-sequence tasks like translation and summarization.

Comparison: RNN vs. Transformer

Sequential vs. Parallel: RNNs process tokens one by one, making them slower for long sequences. Transformers process all tokens simultaneously, enabling efficient GPU utilization.

Long-Range Dependencies: RNNs (even LSTMs) struggle with very long sequences (e.g., >500 tokens) due to vanishing gradients. Transformers can capture dependencies across any distance through attention, though computational cost grows quadratically with sequence length.

Computational Complexity: With respect to sequence length n, an RNN layer costs O(n), but its n steps must run one after another. A full self-attention layer costs O(n^2) because every token attends to every token, though optimizations like sparse attention reduce this.

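Memory: RNNs compress the entire history into a fixed-size hidden state, which limits how much they can retain. Transformers keep every token's representation available to attention, so earlier information does not have to be squeezed through a single vector.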

Training Considerations

Backpropagation Through Time (BPTT): RNNs are trained by unrolling the network through time and applying backpropagation. This is computationally expensive for long sequences.

Gradient Clipping: Used to prevent exploding gradients in RNNs by scaling gradients when their norm exceeds a threshold (e.g., 5.0).
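As a rough illustration of clipping by norm, the sketch below rescales a flat list of gradient values whenever their L2 norm exceeds the threshold. The function name `clip_by_global_norm` and the sample gradients are hypothetical; deep learning frameworks ship their own utilities for this.

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Scale gradients down so their L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads), norm      # already small enough: leave unchanged
    scale = max_norm / norm
    return [g * scale for g in grads], norm

# Assumed example: an exploding gradient with norm 50, threshold 5.0
clipped, norm_before = clip_by_global_norm([30.0, 40.0], max_norm=5.0)
print(norm_before, clipped)
```

Clipping changes the size of the update but not its direction, which is why it helps with exploding gradients without distorting the learning signal.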

Learning Rate Scheduling: Transformers typically use a warm-up phase (e.g., linear increase for first 4,000 steps) followed by decay.
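One concrete warm-up schedule is the one given in "Attention Is All You Need": the rate rises linearly for `warmup_steps` steps and then decays with the inverse square root of the step number. The sketch below assumes that formula with d_model = 512 and 4,000 warm-up steps; other Transformer training recipes use different schedules.

```python
def transformer_lr(step, d_model=512, warmup_steps=4000):
    """lr = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)."""
    step = max(step, 1)  # avoid division by zero at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

for s in (1, 1000, 4000, 16000, 100000):
    print(s, transformer_lr(s))  # rises until step 4000, then decays
```

The two branches of `min` meet exactly at `warmup_steps`, so that is where the learning rate peaks.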

Regularization: Dropout is commonly applied to attention weights and feed-forward layers. Label smoothing is also used.

Azure AI Services Using These Architectures

Azure Cognitive Services for Language: Uses Transformer-based models for sentiment analysis, key phrase extraction, language detection, and entity recognition.

Azure Translator: Employs Transformer encoder-decoder models for machine translation.

Azure Speech Services: Uses both RNNs (for acoustic modeling) and Transformers (for language modeling) in speech-to-text and text-to-speech.

Azure Machine Learning: Provides tools to train custom RNN and Transformer models using frameworks like TensorFlow and PyTorch.

Exam Relevance

For AI-900, you do not need to implement these architectures from scratch. Instead, focus on:

Understanding the difference between RNNs and Transformers in terms of sequence processing.

Knowing that Transformers enable parallel processing and handle long-range dependencies better.

Recognizing that many of Azure's pre-built language services are built on Transformer architectures.

Identifying use cases where each architecture is appropriate (e.g., RNNs for real-time streaming data, Transformers for offline batch processing).

Walk-Through

Input Embedding and Encoding

Each token in the input sequence is mapped to a dense vector representation (embedding). In RNNs, embeddings are fed in one at a time, in order. In Transformers, positional encodings are added to the embeddings to retain order information. The embedding dimension is typically 512 or 768; for example, BERT-base uses 768. The original Transformer's positional encoding uses sine and cosine functions of different frequencies, which makes it easy for the model to attend by relative position.
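The sinusoidal encoding from the original Transformer paper can be sketched as follows: even dimensions use sine, odd dimensions use cosine, and the wavelength grows with the dimension index. The tiny sizes and the constant toy embeddings are assumptions chosen only to keep the output readable.

```python
import math

def positional_encoding(seq_len, d_model):
    """PE(pos, 2k) = sin(pos / 10000^(2k/d_model)); PE(pos, 2k+1) = cos(same)."""
    pe = []
    for pos in range(seq_len):
        row = []
        for i in range(0, d_model, 2):      # i is the even dimension index 2k
            angle = pos / (10000 ** (i / d_model))
            row.append(math.sin(angle))
            row.append(math.cos(angle))
        pe.append(row[:d_model])            # trim if d_model is odd
    return pe

# Toy setup (assumed): 4 tokens, 6-dimensional embeddings
pe = positional_encoding(4, 6)
embeddings = [[0.1] * 6 for _ in range(4)]

# Order information is injected by element-wise addition
encoded = [[e + p for e, p in zip(emb, pos)] for emb, pos in zip(embeddings, pe)]
print([round(v, 3) for v in encoded[1]])
```

Even though all four toy embeddings are identical, the encoded rows differ, so the model can tell the positions apart.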

Recurrent Computation (RNN)

For each time step t, the RNN combines the current input embedding x_t and the previous hidden state h_{t-1} using weight matrices and a non-linear activation (usually tanh). The new hidden state h_t is computed. This step is repeated sequentially for all tokens. The hidden state dimension is a hyperparameter (e.g., 256). Gradients are computed via BPTT, which can lead to vanishing or exploding gradients over long sequences.

Self-Attention (Transformer)

The Transformer computes attention scores between all pairs of tokens. For each token, a query, key, and value vector are derived from the input. The dot product of the query with all keys gives attention weights after softmax. These weights are used to compute a weighted sum of values. This is done in parallel for all tokens. Multi-head attention runs multiple such attention mechanisms in parallel, allowing the model to focus on different aspects of the sequence.

Feed-Forward and Layer Normalization

After self-attention, each token's representation passes through a feed-forward network (FFN) consisting of two linear transformations with a ReLU activation in between. The FFN is applied identically to each position. Residual connections and layer normalization are applied around both the attention and FFN sublayers. This helps in training deep networks by mitigating vanishing gradients.
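A minimal sketch of the feed-forward sublayer for one token position is shown below. It assumes the post-norm arrangement, LayerNorm(x + FFN(x)), and omits the learned gain and bias that layer normalization normally carries; the weights and function names are hypothetical toy values.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize one vector to zero mean and unit variance (no learned scale)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def feed_forward(x, W1, b1, W2, b2):
    """Two linear transformations with a ReLU in between."""
    hidden = [
        max(0.0, sum(w * v for w, v in zip(row, x)) + b)
        for row, b in zip(W1, b1)
    ]
    return [sum(w * h for w, h in zip(row, hidden)) + b for row, b in zip(W2, b2)]

def ffn_sublayer(x, W1, b1, W2, b2):
    """Residual connection around the FFN, followed by layer normalization."""
    y = feed_forward(x, W1, b1, W2, b2)
    return layer_norm([a + c for a, c in zip(x, y)])

# Toy sizes (assumed): d_model = 2, inner dimension = 3
W1 = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
b1 = [0.0, 0.0, 0.0]
W2 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
b2 = [0.0, 0.0]

out = ffn_sublayer([1.0, 2.0], W1, b1, W2, b2)
print(out)
```

The same function would be applied to every position with the same weights, which is what "applied identically to each position" means.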

Output Generation

In an RNN, the output at each time step can be used for predictions (e.g., next word). In a Transformer decoder, the output of the final layer is passed through a linear layer and softmax to produce probabilities over the vocabulary. For encoder-only models like BERT, the output is used for classification or token-level tasks. The architecture determines whether outputs are generated autoregressively (one token at a time) or in parallel.

What This Looks Like on the Job

Enterprise Scenario 1: Real-Time Sentiment Analysis for Customer Support

A large e-commerce company wants to analyze customer chat messages in real-time to detect negative sentiment and escalate issues. They initially deployed an LSTM-based model because of its ability to process streaming data token by token. The model was trained on historical chat logs with sentiment labels. However, they encountered latency issues when dealing with long conversations (>200 tokens) because the LSTM processed each token sequentially. They later migrated to a Transformer-based model (DistilBERT) using Azure Cognitive Services for Language. The Transformer processed entire messages in parallel, reducing inference time from 500ms to 50ms per message. The key challenge was handling variable-length inputs; they padded all sequences to a maximum length of 512 tokens. The solution scaled to 10,000 requests per second using Azure Kubernetes Service with GPU nodes. Misconfiguration of batch sizes led to memory exhaustion, requiring careful tuning of max sequence length and batch size.
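Padding variable-length inputs usually goes hand in hand with an attention mask so that padded positions receive zero attention weight. The sketch below illustrates the idea with plain lists; it pads to the longest sequence in the batch rather than a fixed 512, and the token ids, `pad_id`, and function names are assumptions for the example. It also assumes every sequence contains at least one real token.

```python
import math

def pad_batch(sequences, pad_id=0):
    """Pad every sequence to the longest one and build a 1/0 attention mask."""
    max_len = max(len(s) for s in sequences)
    padded = [s + [pad_id] * (max_len - len(s)) for s in sequences]
    mask = [[1] * len(s) + [0] * (max_len - len(s)) for s in sequences]
    return padded, mask

def masked_softmax(scores, mask):
    """Softmax that gives masked (padding) positions a weight of exactly 0."""
    masked = [s if m else float("-inf") for s, m in zip(scores, mask)]
    top = max(masked)
    exps = [math.exp(s - top) for s in masked]   # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

# Two messages of different lengths (hypothetical token ids)
padded, mask = pad_batch([[5, 6, 7], [8]])
print(padded)  # [[5, 6, 7], [8, 0, 0]]
print(mask)    # [[1, 1, 1], [1, 0, 0]]

# Attention scores for the second message: only the real token gets weight
weights = masked_softmax([1.0, 2.0, 3.0], mask[1])
print(weights)  # [1.0, 0.0, 0.0]
```

Padding to the longest sequence in each batch, instead of a global maximum, is one common way to reduce the memory pressure described in this scenario.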

Enterprise Scenario 2: Machine Translation for Multilingual Support

A global software company uses Azure Translator to provide real-time translation in 60+ languages for their customer-facing documentation. The underlying model is a Transformer encoder-decoder architecture. During peak usage, the service receives millions of translation requests per day. The Transformer's ability to parallelize across tokens allows high throughput. However, the quadratic complexity of self-attention becomes a bottleneck for very long documents (>2000 tokens). They implemented document splitting with overlapping context to handle long texts. A common misconfiguration is not setting the correct region endpoint, causing higher latency. The company monitors translation quality using BLEU scores and periodically retrains the model with new data using Azure Machine Learning.
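Document splitting with overlapping context can be sketched as a sliding window over the token list. This is only one possible approach, not a description of how Azure Translator handles long input; the function name, the chunk size, and the overlap are hypothetical.

```python
def split_with_overlap(tokens, max_len, overlap):
    """Split tokens into chunks of at most max_len, sharing `overlap` tokens."""
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must be >= 0 and smaller than max_len")
    step = max_len - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):   # this chunk reached the end
            break
    return chunks

# Toy example: 10 "tokens", chunks of 4 with 1 token of shared context
chunks = split_with_overlap(list(range(10)), max_len=4, overlap=1)
print(chunks)  # [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9]]
```

Because attention cost grows with the square of the sequence length, several short chunks are cheaper than one long sequence, and the overlap gives each chunk some context from its predecessor.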

Scenario 3: Time-Series Forecasting with LSTMs

A utility company uses LSTMs to forecast electricity demand based on historical consumption data. The data is a univariate time series with hourly readings. The LSTM captures temporal dependencies up to 168 hours (one week). They used a sequence length of 168 (input) to predict the next 24 hours. The model was trained on 5 years of data. Key performance considerations: LSTMs require careful hyperparameter tuning (number of layers, hidden units, learning rate). Overfitting was mitigated with dropout (0.2). A common mistake is not normalizing the input data, leading to poor convergence. The model is deployed using Azure Machine Learning endpoints with autoscaling. They also monitor for concept drift using Azure Monitor and retrain monthly.
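The 168-hours-in, 24-hours-out setup amounts to slicing the series into overlapping windows after normalizing it. The sketch below uses min-max scaling and synthetic data purely for illustration; the function names and the ten-day series are assumptions, and in practice the scaling statistics would typically be computed on the training split only.

```python
def min_max_normalize(series):
    """Scale values into [0, 1]; a constant series maps to all zeros."""
    lo, hi = min(series), max(series)
    span = hi - lo
    if span == 0:
        return [0.0 for _ in series]
    return [(v - lo) / span for v in series]

def make_windows(series, input_len=168, horizon=24):
    """Build (input, target) pairs: input_len past hours -> next horizon hours."""
    pairs = []
    for start in range(len(series) - input_len - horizon + 1):
        x = series[start:start + input_len]
        y = series[start + input_len:start + input_len + horizon]
        pairs.append((x, y))
    return pairs

# Synthetic stand-in for hourly demand: 10 days with a simple daily pattern
hourly = [float(h % 24) for h in range(24 * 10)]
pairs = make_windows(min_max_normalize(hourly))

print(len(pairs))                            # 49 training examples
print(len(pairs[0][0]), len(pairs[0][1]))    # 168 24
```

Each pair would become one training example for the LSTM: the first list is the input sequence and the second is the forecast target.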

How AI-900 Actually Tests This

What AI-900 Tests on RNNs and Transformers

The AI-900 exam focuses on conceptual understanding rather than implementation details. Key objective codes: Domain 2 (Machine Learning), Objective 2.3 (Identify deep learning and conversational AI workloads). Specific topics include:

Recognizing that RNNs are used for sequential data (time series, text, audio).

Understanding that Transformers train more efficiently on long sequences because they process tokens in parallel.

Knowing that Azure Cognitive Services for Language and Translator use Transformer-based models.

Identifying that LSTM and GRU are variants of RNNs that address the vanishing gradient problem.

Common Wrong Answers and Why Candidates Choose Them

"RNNs process all tokens simultaneously" – This is false; RNNs process sequentially. Candidates confuse RNNs with Transformers. The correct answer is that Transformers process in parallel.

"Transformers cannot handle variable-length sequences" – This is false; Transformers can handle variable-length sequences through padding and attention masking. Candidates may think that because they use fixed-size embeddings, they require fixed-length inputs.

"LSTM is a type of Transformer" – This is false; LSTM is an RNN variant. Candidates may mix up terminology.

"RNNs are better for long sequences than Transformers" – This is false; Transformers handle long-range dependencies better. Candidates may recall that RNNs are designed for sequences but forget their limitations.

Specific Numbers and Terms to Know

The Transformer paper: "Attention Is All You Need" (2017).

Key values: d_model = 512 (base Transformer), d_k = 64 (dimension per head), number of heads = 8.

BERT uses 12 or 24 layers (base/large).

GPT-3 has 96 layers and 175 billion parameters.

The vanishing gradient problem is especially severe beyond 10-20 time steps for vanilla RNNs.

Edge Cases and Exceptions

For very short sequences (e.g., 2-3 tokens), RNNs and Transformers may perform similarly, but Transformers still have higher computational overhead.

In real-time streaming applications where the full sequence is not available, RNNs are often still preferred because they can process token-by-token as data arrives, whereas standard Transformer encoders attend over the full sequence.

Transformers with full attention are quadratic in memory; for extremely long sequences (>10k tokens), sparse attention or hierarchical approaches are needed.
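A quick back-of-the-envelope calculation shows why quadratic memory becomes a problem. The sketch assumes the full n × n score matrix is stored for each of 8 heads in 32-bit floats, for a single layer and a single sequence; real implementations may avoid materializing the matrix, so treat the figures as illustrative only.

```python
def attention_matrix_bytes(seq_len, num_heads=8, bytes_per_value=4):
    """Bytes needed to store one layer's attention scores for one sequence."""
    return seq_len * seq_len * num_heads * bytes_per_value

for n in (512, 2000, 10000):
    size = attention_matrix_bytes(n)
    print(n, size, f"{size / 1e9:.3f} GB")
# 512 tokens    ->     8,388,608 bytes (about 0.008 GB)
# 10,000 tokens -> 3,200,000,000 bytes (3.2 GB)
```

Going from 512 to 10,000 tokens is roughly a 20-fold increase in length but a 380-fold increase in memory under these assumptions.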

How to Eliminate Wrong Answers

If the question mentions "parallel processing" or "self-attention", the answer is likely Transformer.

If the question mentions "sequential processing" or "hidden state", the answer is likely RNN.

If the question asks about Azure service for translation or language understanding, it's likely Transformer-based.

If the question asks about time-series forecasting, both RNN and Transformer are possible, but RNN is more traditional for streaming data.

Key Takeaways

RNNs process sequences step-by-step; Transformers process all tokens simultaneously via self-attention.

Transformers use positional encodings to retain order information; RNNs inherently capture order through recurrence.

LSTM and GRU are RNN variants that address vanishing gradients using gating mechanisms.

The Transformer architecture consists of multi-head self-attention, feed-forward networks, layer normalization, and residual connections.

Azure Cognitive Services for Language and Translator are built on Transformer models.

For AI-900, focus on conceptual differences, use cases, and Azure service mappings.

The vanishing gradient problem limits vanilla RNNs to sequences of about 10-20 steps; LSTMs extend to hundreds of steps.

Easy to Mix Up

These come up on the exam all the time. Here's how to tell them apart.

Recurrent Neural Network (RNN)

Processes tokens sequentially, one at a time.

Uses hidden state to capture history.

Suffers from vanishing gradients on long sequences.

Computational complexity O(n) per layer.

Well-suited for real-time streaming data.

Transformer

Processes all tokens in parallel using self-attention.

Captures dependencies via attention weights.

Handles long-range dependencies effectively.

Computational complexity O(n^2) per layer (full attention).

Requires entire sequence for inference; higher latency.

Watch Out for These

Mistake

RNNs and Transformers are completely different and unrelated.

Correct

Both are used for sequence modeling. Transformers grew out of attention mechanisms that were first used alongside RNNs, and they replace recurrence with attention entirely. Some models combine elements of both, such as Transformer-XL, which uses recurrence across segments.

Mistake

Transformers do not need positional information because attention captures order.

Correct

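Self-attention has no built-in notion of order (strictly, it is permutation-equivariant: shuffling the input tokens simply shuffles the outputs the same way). Without positional encoding, the model would treat the sequence as a bag of words. Positional encodings are essential to inject order information.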

Mistake

LSTM solves all vanishing gradient problems.

Correct

LSTMs significantly mitigate vanishing gradients but do not eliminate them entirely. For very long sequences (e.g., >1000 steps), gradients can still vanish, and LSTMs may struggle. Transformers are generally preferred for very long sequences.

Mistake

Transformers are always better than RNNs for all sequence tasks.

Correct

For tasks with strict latency requirements or streaming input, RNNs (or their variants) can be more suitable because they process one token at a time without needing the full sequence. Transformers require the entire sequence for attention, causing higher latency.

Mistake

The hidden state in an RNN is the same as the cell state in an LSTM.

Correct

In an LSTM, the hidden state (h) and cell state (c) are separate. The cell state carries long-term memory, while the hidden state is the output for the current step. In a vanilla RNN, there is only one hidden state.


Frequently Asked Questions

What is the main difference between an RNN and a Transformer?

The main difference is how they process sequence order. RNNs process tokens one by one, maintaining a hidden state that carries information forward. Transformers process all tokens in parallel using self-attention, which computes weighted sums of all token representations. This allows Transformers to capture long-range dependencies more effectively and enables parallel computation, making them faster on modern hardware. However, RNNs are still useful for real-time streaming applications where the full sequence is not available.

Does Azure AI use RNNs or Transformers for language services?

Azure Cognitive Services for Language (e.g., sentiment analysis, key phrase extraction, language detection) and Azure Translator use Transformer-based models. These models are pre-trained on large corpora and fine-tuned for specific tasks. Azure also supports custom training of RNNs and Transformers through Azure Machine Learning using frameworks like TensorFlow and PyTorch.

What is the vanishing gradient problem in RNNs?

During training, gradients are propagated backwards through time (BPTT). For long sequences, repeated multiplication by factors that are often smaller than 1 causes the gradients to shrink exponentially, becoming vanishingly small. This prevents the network from learning long-range dependencies. LSTMs and GRUs mitigate this by introducing gating mechanisms that allow gradients to flow more easily. Transformers sidestep the through-time version of the problem because attention provides direct connections between all positions.
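The exponential effect is easy to see with plain arithmetic. The sketch below treats the per-step gradient factor as a single constant, which is a deliberate simplification (in a real network it depends on the weights and activations at each step); the factors 0.9 and 1.1 are assumed values.

```python
def gradient_scale(factor, steps):
    """Size of a gradient after being multiplied by `factor` once per time step."""
    return factor ** steps

for steps in (10, 20, 100):
    shrinking = gradient_scale(0.9, steps)   # factor below 1: vanishing
    growing = gradient_scale(1.1, steps)     # factor above 1: exploding
    print(steps, shrinking, growing)
```

After 100 steps the factor of 0.9 has reduced the gradient to a few hundred-thousandths of its original size, while the factor of 1.1 has multiplied it by more than ten thousand.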

Why are Transformers considered more efficient than RNNs for NLP tasks?

Transformers are more efficient in terms of training time because they process all tokens in parallel, leveraging GPU parallelism. RNNs require sequential computation, which is harder to parallelize. Additionally, Transformers can capture dependencies between any two tokens regardless of distance, whereas RNNs struggle with long-range dependencies. However, Transformers have higher memory complexity (O(n^2)) and may be slower for very long sequences without optimization.

What are the key components of a Transformer model?

The key components are: (1) Self-attention mechanism (multi-head attention), (2) Positional encoding, (3) Feed-forward neural networks, (4) Layer normalization, and (5) Residual connections. The encoder consists of a stack of layers each containing self-attention and feed-forward sublayers. The decoder additionally includes cross-attention to the encoder output and masked self-attention to prevent looking ahead.

Can I use RNNs for image data?

Yes, RNNs can be applied to image data by treating image patches or pixel rows as a sequence. For example, in image captioning, an RNN can generate a caption based on features extracted from a CNN. However, for most image tasks, CNNs or Transformers (Vision Transformers) are more common. RNNs are primarily designed for one-dimensional sequential data.

What is the difference between LSTM and GRU?

LSTM has three gates (input, forget, output) and a separate cell state. GRU has two gates (reset and update) and merges the cell and hidden states. GRU has fewer parameters, making it faster to train and less prone to overfitting on small datasets. In practice, both perform similarly on many tasks, but LSTM may be slightly better for very long sequences due to its explicit cell state.
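The parameter difference can be worked out directly. The sketch assumes the textbook formulation with one weight matrix over the concatenated input and hidden state and one bias vector per gate: an LSTM has four such blocks (three gates plus the candidate), a GRU has three (two gates plus the candidate). Framework implementations may count biases slightly differently, and the layer sizes below are arbitrary.

```python
def gate_block_params(input_size, hidden_size):
    """Weights over [h, x] plus one bias vector for a single gate."""
    return hidden_size * (hidden_size + input_size) + hidden_size

def lstm_params(input_size, hidden_size):
    """Input, forget, and output gates plus the candidate: 4 blocks."""
    return 4 * gate_block_params(input_size, hidden_size)

def gru_params(input_size, hidden_size):
    """Reset and update gates plus the candidate: 3 blocks."""
    return 3 * gate_block_params(input_size, hidden_size)

# Assumed sizes: 128-dimensional inputs, 256 hidden units
print(lstm_params(128, 256))  # 394240
print(gru_params(128, 256))   # 295680
```

Under these assumptions a GRU layer has exactly three quarters of the parameters of an LSTM layer of the same size.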

