Tokenization and Text Normalization

This chapter covers tokenization and text normalization, two fundamental preprocessing steps in Natural Language Processing (NLP). These techniques transform raw text into a structured format that machine learning models can process. For the AI-900 exam, these concepts matter because they underpin the NLP workloads on Azure that the exam covers. You will learn the mechanisms, key parameters, and common pitfalls.

25 min read
Intermediate
Updated May 31, 2026

Tokenization: Breaking Text into Puzzle Pieces

Imagine you are a chef following a recipe from a handwritten letter in which the words run together with few clear breaks. Before you can cook, you have to work out where each ingredient and instruction begins and ends, chopping the text into meaningful pieces. That is tokenization: the chef scans the letter and marks the boundaries that produce 'tokens', the individual words and punctuation marks. For example, the phrase '2cupsofflour' becomes ['2', 'cups', 'of', 'flour']. Some words are tricky: for 'don't', the chef must decide whether to split it into ['do', 'n't'] or keep it as ['don't'], just as a tokenization algorithm needs a rule for contractions. After tokenization, the chef normalizes the tokens: converting 'Flour' to 'flour' (lowercasing), removing punctuation such as periods, and perhaps stemming 'cups' to 'cup' to reduce variation. Normalization makes the recipe easier to follow, just as text normalization reduces vocabulary size and can improve model performance. Without proper tokenization and normalization, the chef would misread ingredients or miss steps, and the dish would fail. Similarly, NLP models struggle to learn patterns from raw, unprocessed text.

How It Actually Works

What is Tokenization and Why Does It Exist?

Tokenization is the process of splitting a string of text into smaller units called tokens. These tokens can be words, subwords, characters, or punctuation marks. The primary purpose is to convert unstructured text into a sequence of discrete elements that a machine learning model can process. Without tokenization, models would receive raw strings and would not know where one word ends and another begins — they would have to learn boundaries implicitly, which is inefficient and often inaccurate.

Tokenization exists because NLP models operate on numbers, not raw strings. A model cannot ingest a sentence directly; the sentence must first be broken into tokens, each token mapped to an integer ID via a vocabulary, and those IDs fed into the model. The choice of tokenization directly impacts vocabulary size, out-of-vocabulary (OOV) handling, and model performance.

How Tokenization Works Internally

Tokenization algorithms follow a deterministic set of rules. The simplest is whitespace tokenization: split on spaces, tabs, and newlines. For example:

Input: "Hello world!"

Tokens: ["Hello", "world!"]

Notice that punctuation is attached to the word. A more advanced approach is punctuation-aware tokenization, which splits on punctuation as well:

Tokens: ["Hello", "world", "!"]
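The two splitting strategies above can be sketched with the Python standard library. This is a minimal illustration, not how any particular Azure service tokenizes text; the function names are made up for this example.

```python
import re

def whitespace_tokenize(text):
    # str.split() with no arguments splits on any run of spaces, tabs, or newlines.
    return text.split()

def punctuation_aware_tokenize(text):
    # \w+ matches a run of word characters; [^\w\s] matches one punctuation character.
    return re.findall(r"\w+|[^\w\s]", text)

print(whitespace_tokenize("Hello world!"))         # ['Hello', 'world!']
print(punctuation_aware_tokenize("Hello world!"))  # ['Hello', 'world', '!']
```

Note that this simple pattern would split a contraction such as "don't" into three pieces, which is why production tokenizers carry extra rules for such cases.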

Subword tokenization, used in models like BERT and GPT, splits words into smaller units. For instance, the word "unhappiness" might become ["un", "happiness"] or ["un", "happi", "ness"]. A widely used subword algorithm is Byte-Pair Encoding (BPE), which the GPT family uses; BERT uses a closely related algorithm called WordPiece. BPE starts with a base vocabulary of individual characters and iteratively merges the most frequent pair of adjacent tokens. For example, if "t" followed by "h" appears often, "th" becomes a new token. This process continues until a desired vocabulary size is reached (e.g., roughly 50,000 for GPT-2).
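To make the merge loop concrete, here is a toy sketch of BPE training on a five-word corpus. The corpus, the function name, and the number of merges are assumptions chosen for illustration; real tokenizers train on large corpora and handle word boundaries and bytes more carefully.

```python
from collections import Counter

def learn_bpe_merges(words, num_merges):
    # Represent each word as a tuple of symbols, starting from single characters.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by how often the word occurs.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite the corpus with the most frequent pair fused into one symbol.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges, corpus

merges, corpus = learn_bpe_merges(["the", "then", "this", "that", "cat"], num_merges=2)
print(merges[0])  # ('t', 'h')
```

In this corpus the pair "t" + "h" occurs four times, more than any other pair, so it is merged first.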

Key Components, Values, and Defaults

Vocabulary Size: The number of unique tokens the model recognizes. Typical sizes: 30,000 (BERT), 50,000 (GPT-2), 100,000 (some multilingual models). A larger vocabulary can represent more words but increases model size and training time.

Special Tokens: Most tokenizers add special tokens like [CLS] (classification), [SEP] (separator), [PAD] (padding), [UNK] (unknown), and [MASK] (masking). For example, BERT's tokenizer always starts a sentence with [CLS] and ends with [SEP].

Maximum Sequence Length: Models have a maximum number of tokens they can process (e.g., BERT: 512, GPT-3: 2048). Inputs longer than this must be truncated or split.

Text Normalization

Text normalization is a set of preprocessing steps applied before or after tokenization to reduce variability in text. Common techniques include:

Lowercasing: Convert all characters to lowercase. Example: "Hello" -> "hello". This reduces vocabulary size but can lose information (e.g., "Apple" the company vs "apple" the fruit).

Stemming: Reduce words to a root form by stripping suffixes with heuristic rules. Example: "running" -> "run", "studies" -> "studi" (stemming often produces non-words), while an irregular form such as "ran" is left unchanged. The Porter Stemmer is a classic algorithm.

Lemmatization: Reduce words to their dictionary base form using vocabulary and morphological analysis. Example: "running" -> "run", "better" -> "good". Lemmatization is more accurate than stemming but slower.

Stop Word Removal: Remove common words like "the", "is", "at" that carry little meaning. However, modern NLP models often keep stop words because they can provide context.

Punctuation Removal: Remove or separate punctuation marks. This can help reduce noise but may affect meaning (e.g., "hello!" vs "hello?").
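Three of the techniques listed above (lowercasing, punctuation removal, and stop word removal) can be sketched with the standard library alone. The stop word set here is a tiny made-up list, not a standard one, and stemming and lemmatization are left out because they normally rely on a library such as NLTK or spaCy.

```python
import string

# Tiny illustrative stop word list (an assumption for this example).
STOP_WORDS = {"the", "is", "at", "a", "an", "of"}

def normalize(text, remove_stop_words=True):
    lowered = text.lower()
    # Delete every ASCII punctuation character.
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    tokens = no_punct.split()
    if remove_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return tokens

print(normalize("The cat is at the door!"))                    # ['cat', 'door']
print(normalize("Hello, World!", remove_stop_words=False))     # ['hello', 'world']
```

Making stop word removal a flag reflects the point above: older pipelines switch it on, while transformer pipelines usually leave it off.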

Configuration and Verification

In Azure AI Language, tokenization and normalization are built into the text analytics APIs. For example, the Azure Text Analytics service automatically tokenizes text when performing sentiment analysis or key phrase extraction. You do not configure tokenization directly; instead, you specify the language and the API handles the rest. However, when using Azure Machine Learning to train custom models, you may use libraries like Hugging Face's transformers or spaCy, where you explicitly load a tokenizer:

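# Requires the Hugging Face `transformers` package (pip install transformers).
from transformers import BertTokenizer

# Load the WordPiece vocabulary for the uncased English BERT model.
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
tokens = tokenizer.tokenize("Hello world!")  # lowercases, then splits off punctuation
print(tokens)  # Output: ['hello', 'world', '!']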

To verify tokenization, you can inspect the token IDs:

# Reuses `tokenizer` from the snippet above; encode() adds [CLS] and [SEP] by default.
ids = tokenizer.encode("Hello world!")
print(ids)  # Output: [101, 7592, 2088, 999, 102]
# 101 = [CLS], 7592 = 'hello', 2088 = 'world', 999 = '!', 102 = [SEP]

Interaction with Related Technologies

Tokenization interacts with embedding layers, which convert token IDs into dense vectors. The embedding layer is a lookup table of size (vocab_size x embedding_dim). If the vocabulary size is 30,000 and embedding dimension is 768, the embedding matrix has 30,000 * 768 = 23 million parameters. Tokenization also affects attention mechanisms: the attention layer computes relationships between every pair of tokens, so sequence length directly impacts computational cost (O(n^2)). Therefore, efficient tokenization that produces shorter sequences (e.g., subword vs character) is crucial for performance.
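The arithmetic in this paragraph is easy to check directly. The sketch below uses the same round numbers as the text (30,000 tokens, 768 dimensions); `attention_pairs` is a hypothetical helper that only counts token pairs and ignores constant factors such as heads and layers.

```python
vocab_size = 30_000
embedding_dim = 768

# One row of embedding_dim weights per vocabulary entry.
embedding_params = vocab_size * embedding_dim
print(embedding_params)  # 23040000

def attention_pairs(seq_len):
    # Self-attention scores every token against every token: n * n pairs.
    return seq_len * seq_len

print(attention_pairs(128))  # 16384
print(attention_pairs(512))  # 262144
```

Quadrupling the sequence length from 128 to 512 multiplies the number of pairs by 16, which is why shorter token sequences matter for cost.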

Edge Cases and Exceptions

Out-of-Vocabulary (OOV): If a token is not in the vocabulary, it is replaced with [UNK]. Subword tokenization reduces OOV by breaking unknown words into known subwords. For example, "xylophone" might be unknown but broken into ["x", "ylo", "phone"] if those subwords exist.

Language-Specific Rules: Tokenization differs by language. For example, Japanese does not use spaces, so tokenizers must use morphological analysis. Azure AI Language supports tokenization for over 100 languages with language-specific models.

Case Sensitivity: Lowercasing is common for English, but for languages like German where nouns are capitalized, lowercasing can lose grammatical information. Some models use cased tokenizers (e.g., 'bert-base-cased').
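The out-of-vocabulary behavior described above can be illustrated with a greedy longest-match-first splitter in the style of WordPiece. The vocabulary below is a made-up toy set, and the function is a simplified sketch rather than BERT's actual implementation; the '##' prefix marks a piece that continues a word.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, shrinking until one is in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No known piece covers this position, so the whole word is unknown.
            return [unk_token]
        tokens.append(piece)
        start = end
    return tokens

toy_vocab = {"x", "##ylo", "##phone", "un", "##happi", "##ness"}
print(wordpiece_tokenize("xylophone", toy_vocab))    # ['x', '##ylo', '##phone']
print(wordpiece_tokenize("unhappiness", toy_vocab))  # ['un', '##happi', '##ness']
print(wordpiece_tokenize("zebra", toy_vocab))        # ['[UNK]']
```

A word falls back to [UNK] only when no sequence of known pieces covers it, which is rare once single characters are in the vocabulary.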

Trap Patterns on the Exam

Confusing tokenization with embedding: Tokenization produces discrete IDs; embedding produces continuous vectors. The exam may ask which step converts text to numbers — the answer is tokenization (to IDs), not embedding (to vectors).

Stop word removal in modern models: Older NLP pipelines removed stop words, but modern transformer models keep them. The exam may test that BERT does not remove stop words.

Tokenization vs vectorization: Tokenization is the first step; vectorization (embedding) is the second. The exam may ask for the correct sequence: tokenization → mapping to IDs → embedding.
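The sequence tokenization → mapping to IDs → embedding can be sketched end to end in a few lines. The four-entry vocabulary, the embedding dimension of 4, and the random weights are all assumptions for illustration; in a real model the embedding table is learned during training.

```python
import random

# Toy vocabulary: token string -> integer ID (made up for this example).
vocab = {"[UNK]": 0, "hello": 1, "world": 2, "!": 3}
embedding_dim = 4

# Lookup table with one row of floats per vocabulary entry (vocab_size x embedding_dim).
rng = random.Random(0)
embedding_table = [[rng.uniform(-1, 1) for _ in range(embedding_dim)] for _ in vocab]

tokens = ["hello", "world", "!"]                       # step 1: tokenization (discrete strings)
ids = [vocab.get(t, vocab["[UNK]"]) for t in tokens]   # step 2: mapping to integer IDs
vectors = [embedding_table[i] for i in ids]            # step 3: embedding (continuous vectors)

print(ids)              # [1, 2, 3]
print(len(vectors[0]))  # 4
```

The IDs are plain integers with no notion of similarity; only the vectors in step 3 carry learnable meaning.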

Summary of Key Numbers

BERT vocabulary size: about 30,000 (30,522 for bert-base-uncased, 28,996 for bert-base-cased)

BERT max sequence length: 512 tokens

GPT-3 vocabulary size: 50,257

GPT-3 max sequence length: 2048 tokens

Typical special tokens: [CLS], [SEP], [PAD], [UNK], [MASK]

Walk-Through

1. Raw Text Input

The process begins with raw text, which may come from documents, web pages, or user queries. The text is a string of characters with spaces, punctuation, and possibly HTML tags or special characters. At this stage, the text is unstructured and cannot be understood by a machine learning model. The system must first clean the text, for example by removing HTML tags or decoding Unicode characters. This step ensures the tokenizer receives a clean string.

2. Text Normalization

Before tokenization, the text may be normalized. This includes lowercasing all characters (if using an uncased model), removing or separating punctuation, and possibly applying stemming or lemmatization. For example, the sentence 'I am running!' might become 'i am running' after lowercasing and punctuation removal. Normalization reduces the number of unique tokens and helps the model generalize better. However, some models (like BERT) keep punctuation as separate tokens, so normalization may be minimal.

3. Tokenization Algorithm Applied

The tokenizer applies a specific algorithm to split the normalized text into tokens. For word-level tokenization, it splits on whitespace and punctuation. For subword tokenization (e.g., BPE), it uses a learned merge table. The algorithm outputs a list of tokens. For example, using BERT's WordPiece tokenizer, the word 'unhappiness' might become ['un', '##happiness'] where '##' indicates a subword continuation. The tokenizer also adds special tokens like [CLS] at the start and [SEP] at the end.

4. Token-to-ID Mapping

Each token is looked up in a vocabulary file to find its integer ID. The vocabulary is a mapping from token strings to unique integers (e.g., 'hello' -> 7592). If a token is not found, it is replaced with the [UNK] token ID. For BERT, the [UNK] token ID is 100. The output is a sequence of integers representing the input text. This sequence is what the model actually processes.
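A minimal sketch of this lookup, using the BERT IDs quoted in this chapter ([UNK] = 100, [CLS] = 101, [SEP] = 102, 'hello' = 7592, 'world' = 2088, '!' = 999). The dictionary is a hand-written fragment, not the full vocabulary file, and `tokens_to_ids` is a hypothetical helper.

```python
# Hand-written fragment of a BERT-style vocabulary (token string -> integer ID).
vocab = {
    "[PAD]": 0, "[UNK]": 100, "[CLS]": 101, "[SEP]": 102,
    "hello": 7592, "world": 2088, "!": 999,
}

def tokens_to_ids(tokens, vocab):
    unk_id = vocab["[UNK]"]
    # Wrap the sequence in the special start and end tokens before the lookup.
    wrapped = ["[CLS]"] + tokens + ["[SEP]"]
    # Any token missing from the vocabulary falls back to the [UNK] ID.
    return [vocab.get(tok, unk_id) for tok in wrapped]

print(tokens_to_ids(["hello", "world", "!"], vocab))  # [101, 7592, 2088, 999, 102]
print(tokens_to_ids(["hello", "zyzzyva"], vocab))     # [101, 7592, 100, 102]
```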

5. Padding and Truncation

The sequence of token IDs must be the same length for all inputs in a batch. If the sequence is shorter than the maximum length (e.g., 512), it is padded with [PAD] tokens (ID 0 for BERT) to reach the maximum length. If the sequence is longer, it is truncated from the end (or from the middle for some models). The attention mask is also created to indicate which tokens are real (1) and which are padding (0). This step ensures the model can process inputs in parallel.
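Padding, truncation, and the attention mask can be sketched as one small function. This is a simplified illustration: it truncates from the end and does not re-append [SEP] after truncation, which real tokenizers typically do. The function name is an assumption.

```python
def pad_or_truncate(ids, max_length, pad_id=0):
    # Truncate from the end if the sequence is too long.
    ids = ids[:max_length]
    # Real tokens get mask value 1; padding positions added below get 0.
    attention_mask = [1] * len(ids)
    padding = max_length - len(ids)
    return ids + [pad_id] * padding, attention_mask + [0] * padding

padded, mask = pad_or_truncate([101, 7592, 102], max_length=5)
print(padded)  # [101, 7592, 102, 0, 0]
print(mask)    # [1, 1, 1, 0, 0]

truncated, mask2 = pad_or_truncate([1, 2, 3, 4, 5, 6], max_length=4)
print(truncated)  # [1, 2, 3, 4]
```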

What This Looks Like on the Job

Enterprise Scenario 1: Customer Support Chatbot

A large e-commerce company deploys an Azure AI-powered chatbot to handle customer inquiries. The chatbot uses a pre-trained BERT model fine-tuned on support tickets. The raw customer messages contain typos, emojis, and mixed case. Tokenization and normalization are critical: the tokenizer must handle misspellings (e.g., 'ordr' might become ['ord', '##r']), and lowercasing ensures 'Order' and 'order' are the same token. The company uses Azure AI Language's built-in tokenization, which automatically normalizes text. However, they discovered that emojis like 😊 are tokenized as [UNK] if not in the vocabulary, so they added a preprocessing step to remove emojis. The chatbot handles thousands of messages per second; tokenization must be fast. Azure's tokenizers are optimized for throughput, and the company uses batch processing to maximize efficiency.

Enterprise Scenario 2: Legal Document Analysis

A law firm uses Azure Machine Learning to classify legal documents. The documents are lengthy (50,000+ words) and contain specialized legal jargon. Tokenization must handle long sequences: the model's max length is 512 tokens, so the firm uses a sliding window approach. They tokenize each document, split the token sequence into overlapping windows of at most 512 tokens, run the model on each window, and aggregate the results. They also use a custom vocabulary that includes legal terms like 'habeas corpus' as single tokens to improve accuracy. Text normalization is minimal because case and punctuation carry legal meaning (e.g., 'U.S.' vs 'us'). The firm uses a cased tokenizer to preserve this information. Misconfiguration of tokenization (e.g., using an uncased model) led to a 15% drop in classification accuracy, so they carefully validated their preprocessing pipeline.
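The sliding window idea from this scenario might look like the sketch below. The stride of 256 (a 50% overlap) is an assumption, since the scenario does not state how much the windows overlap, and the function works on an already tokenized list of IDs.

```python
def sliding_windows(token_ids, window_size=512, stride=256):
    if stride <= 0 or stride > window_size:
        raise ValueError("stride must be between 1 and window_size")
    chunks = []
    start = 0
    while True:
        chunks.append(token_ids[start:start + window_size])
        # Stop once the current window reaches the end of the document.
        if start + window_size >= len(token_ids):
            break
        start += stride
    return chunks

# Small numbers make the overlap easy to see.
chunks = sliding_windows(list(range(10)), window_size=4, stride=2)
print(chunks)  # [[0, 1, 2, 3], [2, 3, 4, 5], [4, 5, 6, 7], [6, 7, 8, 9]]
```

Each window shares tokens with its neighbor, so a phrase cut at one boundary still appears whole in an adjacent window.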

Scenario 3: Multilingual News Aggregator

A media monitoring service aggregates news articles in 50 languages. They use a multilingual BERT model deployed on Azure, which supports 104 languages. Tokenization is language-specific: for Chinese, the tokenizer uses character-level tokens because words are not space-separated. For Arabic, it handles right-to-left text. The service must normalize text differently per language; for example, lowercasing is not appropriate for German nouns. They use Azure AI Language's language detection to route text to the correct tokenizer. A common issue is tokenizing mixed-language content (e.g., 'I love Berlin' in English with a German city name). The tokenizer may split 'Berlin' into ['Ber', '##lin'], which still works but may lose the entity. They mitigate this by adding entity-specific tokens to the vocabulary. Performance considerations: tokenization for 50 languages requires more memory for multiple vocabularies, but Azure's cloud infrastructure scales horizontally.

How AI-900 Actually Tests This

AI-900 Exam Focus on Tokenization and Text Normalization

The AI-900 exam (Objective 4.1: Identify features of NLP workloads on Azure) tests your understanding of preprocessing steps. Specifically, you need to know:

What tokenization is and why it is necessary

Common normalization techniques: lowercasing, stemming, lemmatization, stop word removal

How Azure AI Language handles tokenization (automatically, no configuration needed)

The difference between tokenization and vectorization (embedding)

Most Common Wrong Answers and Why Candidates Choose Them

1. Tokenization and vectorization are the same. Wrong. Tokenization produces token IDs; vectorization (embedding) maps those IDs to dense vectors. Candidates confuse them because both convert text to numbers. The exam may ask: 'Which step converts words to integers?' Answer: tokenization.

2. Stop words are always removed. Wrong. Modern NLP models like BERT do not remove stop words; they keep all tokens. Candidates remember older NLP pipelines that removed stop words. The exam tests that BERT uses all tokens.

3. Stemming and lemmatization are identical. Wrong. Stemming strips suffixes with heuristic rules and can produce non-words (e.g., 'studies' -> 'studi'), while lemmatization uses a dictionary and morphological analysis to return the base form (e.g., 'studies' -> 'study'). The exam may ask which produces real words: lemmatization.

4. Tokenization splits text into words only. Wrong. Subword tokenization splits into subwords, and character tokenization splits into characters. The exam may test that BERT uses WordPiece (subword) tokenization.

Specific Numbers and Terms That Appear Verbatim

BERT vocabulary size: about 30,000 (30,522 for the uncased base model)

BERT max sequence length: 512 tokens

Special tokens: [CLS], [SEP], [PAD], [UNK], [MASK]

WordPiece tokenization (for BERT)

Byte-Pair Encoding (for GPT)

Edge Cases and Exceptions

Unknown words: If a word is not in the vocabulary, it becomes [UNK]. Subword tokenization reduces this.

Language differences: Tokenization for Chinese is character-based; for English, it is subword-based.

Case sensitivity: Uncased models lower case; cased models preserve case. The exam may ask which model to use if case matters (e.g., named entity recognition).

How to Eliminate Wrong Answers Using the Underlying Mechanism

If a question asks about preprocessing steps, think about the purpose: to convert raw text into a format the model can process. Tokenization always comes before ID mapping and embedding; normalization (like lowercasing) may or may not be applied around it. If the question mentions 'reducing vocabulary size', the answer is likely lowercasing or stemming. If it mentions 'handling out-of-vocabulary words', the answer is subword tokenization. Use the mechanism to eliminate answers that describe embedding or model training.

Key Takeaways

Tokenization is the first step in NLP preprocessing; it splits text into tokens (words, subwords, or characters).

Subword tokenization (e.g., WordPiece for BERT) reduces out-of-vocabulary issues by breaking rare words into known subwords.

BERT uses a vocabulary of roughly 30,000 tokens (30,522 for the uncased base model) and a maximum sequence length of 512 tokens.

Special tokens [CLS], [SEP], [PAD], [UNK], and [MASK] are added during tokenization for BERT-like models.

Text normalization techniques include lowercasing, stemming, lemmatization, and punctuation removal.

Modern transformer models do not remove stop words; they use all tokens for context.

Tokenization produces integer IDs; embedding converts those IDs to dense vectors.

Azure AI Language automatically handles tokenization and normalization; no manual configuration is needed.

For multilingual models, tokenization is language-specific (e.g., character-level for Chinese).

Padding and truncation ensure all sequences in a batch have the same length.

Easy to Mix Up

These come up on the exam all the time. Here's how to tell them apart.

Word-Level Tokenization

Splits text into whole words based on spaces and punctuation.

Large vocabulary needed to cover all words (e.g., 100,000+).

Handles out-of-vocabulary words poorly: they become [UNK].

Simple to implement and fast.

Used in older models like Word2Vec and GloVe.

Subword Tokenization (BPE/WordPiece)

Splits words into smaller units based on frequency.

Smaller vocabulary (e.g., 30,000) covers most text via subwords.

Handles out-of-vocabulary words by breaking into known subwords.

More complex but more efficient for rare words.

Used in modern models like BERT, GPT, and T5.

Stemming

Uses heuristic rules to remove suffixes.

Often produces non-words (e.g., 'studies' -> 'studi').

Faster and simpler.

May merge words with different meanings (e.g., 'meeting' -> 'meet').

Common in information retrieval and search engines.

Lemmatization

Uses vocabulary and morphological analysis.

Produces real dictionary words (e.g., 'running' -> 'run').

Slower and more computationally expensive.

More accurate; preserves meaning better.

Used in tasks requiring precise word forms (e.g., question answering).
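The contrast between stemming and lemmatization can be seen with two deliberately naive helpers. Neither is a real algorithm: `toy_stem` is a bare suffix stripper (not the Porter Stemmer, which adds rules such as undoubling consonants and returns 'run' for 'running'), and `toy_lemmatize` is a four-entry lookup table standing in for a dictionary with morphological analysis.

```python
# Suffixes are tried in order; the first match is stripped.
SUFFIXES = ("ing", "ed", "es", "s")

def toy_stem(word):
    for suffix in SUFFIXES:
        # Keep at least three characters so short words like "is" are untouched.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Hand-written lemma table (an assumption standing in for a real dictionary).
LEMMA_TABLE = {"running": "run", "studies": "study", "better": "good", "ran": "run"}

def toy_lemmatize(word):
    return LEMMA_TABLE.get(word, word)

for w in ["running", "studies", "ran", "better"]:
    print(w, toy_stem(w), toy_lemmatize(w))
# running runn run
# studies studi study
# ran ran run
# better better good
```

The rule-based stemmer is fast but blind to irregular forms like 'ran' and 'better', while the lookup resolves them at the cost of needing a dictionary.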

Watch Out for These

Mistake

Tokenization and embedding are the same process.

Correct

Tokenization splits text into tokens and maps them to integer IDs. Embedding converts these integer IDs into dense vectors. They are two distinct steps; tokenization precedes embedding.

Mistake

All NLP models remove stop words.

Correct

Modern transformer models like BERT and GPT do not remove stop words. They use all tokens because stop words can provide contextual information. Stop word removal is common in older bag-of-words models.

Mistake

Stemming and lemmatization produce the same output.

Correct

Stemming uses heuristic rules to chop off suffixes, often producing non-words (e.g., 'studies' -> 'studi'). Lemmatization uses a vocabulary to return the dictionary form (e.g., 'studies' -> 'study'). Lemmatization is more accurate but slower.

Mistake

Tokenization always splits on spaces.

Correct

Many tokenizers, especially for languages like Chinese or Japanese, do not use spaces. Subword tokenizers like BPE and WordPiece split based on frequency, not spaces. Also, punctuation is often split separately.

Mistake

The vocabulary size is the number of words in the training data.

Correct

The vocabulary size is a hyperparameter chosen when the tokenizer is built, before model training (e.g., about 30,000 for BERT). It is the number of unique tokens the model can recognize. Words not in the vocabulary become [UNK] or are split into subwords.

Frequently Asked Questions

What is tokenization in NLP?

Tokenization is the process of breaking raw text into smaller units called tokens, which can be words, subwords, or characters. It is the first step in preprocessing text for machine learning models. For example, the sentence 'I love AI!' might be tokenized into ['I', 'love', 'AI', '!']. Tokenization converts unstructured text into a structured sequence that can be mapped to integer IDs for model input.

What is the difference between tokenization and embedding?

Tokenization produces a sequence of integer IDs representing tokens (e.g., word IDs). Embedding then maps each integer ID to a dense vector of real numbers. Tokenization is a discrete mapping; embedding is a continuous representation. In a typical pipeline, tokenization happens first, then the token IDs are passed to an embedding layer.

Does BERT remove stop words?

No, BERT does not remove stop words. BERT uses all tokens, including stop words, because they provide contextual information. Removing stop words would lose important signals for attention mechanisms. This is a key difference from older bag-of-words models.

What is the vocabulary size of BERT?

BERT's uncased base model has a vocabulary of 30,522 tokens, usually rounded to 30,000. The cased model has 28,996 tokens. The vocabulary includes subword units, special tokens ([CLS], [SEP], etc.), and common words. It is built with the WordPiece algorithm before pre-training and stays fixed afterward.

How does Azure AI Language handle tokenization?

Azure AI Language services (e.g., Text Analytics, Language Understanding) automatically tokenize text as part of their processing. You do not need to configure tokenization; it is built into the APIs. The service uses language-specific tokenizers for over 100 languages. For custom models in Azure Machine Learning, you must load a tokenizer explicitly (e.g., from Hugging Face).

What is the difference between stemming and lemmatization?

Stemming uses heuristic rules to remove suffixes, often producing non-words (e.g., 'studies' -> 'studi'). Lemmatization uses a dictionary and morphological analysis to return the base form (e.g., 'studies' -> 'study'). Lemmatization is more accurate but slower. For the exam, know that lemmatization produces real words.

What happens when a word is not in the tokenizer's vocabulary?

If a word is not in the vocabulary, it is either replaced with the [UNK] token or split into subwords that are in the vocabulary. Subword tokenization (like BPE or WordPiece) reduces the occurrence of [UNK] by breaking unknown words into known subwords. For example, 'xylophone' might become ['x', 'ylo', 'phone'].
