AI-900 | Chapter 43 of 100 | Objective 2.3

Convolutional Neural Networks (CNN)

This chapter covers Convolutional Neural Networks (CNNs), a specialized deep learning architecture designed for processing grid-like data such as images. CNNs are a core topic in the AI-900 exam under Objective 2.3 (Computer Vision workloads), and questions about their components, operation, and applications typically account for 5-10% of the exam. You will learn the internal mechanics of CNNs—convolution, pooling, and fully connected layers—and understand why they are exceptionally effective for image classification, object detection, and segmentation. This knowledge is essential for passing AI-900 and for building a foundation in Azure's Computer Vision services.

25 min read
Intermediate
Updated May 31, 2026

CNN as a Visual Inspection Assembly Line

Imagine a factory assembly line where products move past a series of inspection stations, each staffed by specialists who carry a specific stencil. The first station uses small stencils to detect simple features such as a vertical edge, a horizontal edge, or a diagonal edge. Each worker slides a stencil across the product, checking for a match at every position, and marks each match on a map. The next station receives those maps and uses its own stencils to detect combinations of edges, like a corner or a curve. Each subsequent station builds on the previous maps, detecting more complex patterns such as a wheel shape, a window shape, or a door shape. The final station, using the most abstract maps, decides whether the product is a car, a truck, or a bicycle. Importantly, a worker uses the same stencil at every position on the product, so one stencil finds its feature wherever it appears: this is weight sharing. No worker ever looks at the whole product at once; each checks only a small region of the previous station's map at a time, which is the receptive field. The assembly line is trained by showing it thousands of products with known labels and adjusting the stencils over time to reduce errors. Once trained, the line can inspect new products and classify them with high accuracy.

How It Actually Works

What is a Convolutional Neural Network (CNN) and Why Does It Exist?

A Convolutional Neural Network (CNN) is a class of deep neural networks specifically designed for processing data with a grid-like topology, such as images (2D grid of pixels) or time series (1D grid of samples). Traditional fully connected (dense) neural networks are impractical for image data because they treat each pixel as an independent feature, ignoring the spatial structure. For a 224x224 RGB image, a dense network would have over 150,000 input neurons, and the first hidden layer with 1,000 neurons would require 150 million weights—a computational and overfitting nightmare. CNNs solve this by exploiting three key ideas: local connectivity, parameter sharing, and pooling.
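The parameter arithmetic above is easy to check by hand. The sketch below reproduces the dense-layer figures from the text and compares them with a single convolution layer; the choice of 64 filters of size 3x3 (with one bias each) is an illustrative assumption, not something the comparison depends on.

```python
# Parameter counts for the first layer on a 224x224 RGB image.
# The dense figures come from the text; the 64-filter 3x3 convolution
# is an illustrative choice for comparison.

height, width, channels = 224, 224, 3
inputs = height * width * channels           # one input per pixel value

hidden_units = 1000
dense_weights = inputs * hidden_units        # every input connects to every unit

filter_size, num_filters = 3, 64
# Each filter has filter_size * filter_size * channels weights plus one bias
# and is reused at every image position (parameter sharing).
conv_params = (filter_size * filter_size * channels + 1) * num_filters

print(inputs)         # 150528
print(dense_weights)  # 150528000
print(conv_params)    # 1792
```

The convolution layer needs only a few thousand parameters because each filter is reused at every position rather than learning a separate weight per pixel.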

How a CNN Works Internally: The Convolution Operation

The core building block of a CNN is the convolution layer. The convolution operation applies a set of learnable filters (also called kernels) to the input. Each filter is a small matrix of weights, typically 3x3 or 5x5 in size. The filter slides (convolves) across the input image, performing an element-wise multiplication at each position and summing the result to produce a single output value. This process generates a feature map (activation map) that highlights where a particular feature (e.g., an edge, a corner, a texture) appears in the input.

For example, a 3x3 filter with weights [-1,0,1] in the first row, [-2,0,2] in the second, and [-1,0,1] in the third is a vertical Sobel edge detector. When this filter is applied to an image region where there is a vertical edge (sharp transition from dark to light), the output value is large positive; if the transition is reversed, it's large negative; if the region is uniform, the output is near zero. By learning these weights during training, the network discovers filters that are most useful for the task.
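The sliding-filter behavior described above can be sketched in plain Python. This is a minimal illustration, assuming a single-channel image, stride 1, and no padding; the function name `convolve2d` and the tiny test images are made up for the example.

```python
def convolve2d(image, kernel):
    """Slide `kernel` over `image` (stride 1, no padding) and return the feature map.

    Like most deep learning libraries, this does not flip the kernel,
    so strictly speaking it computes a cross-correlation.
    """
    k = len(kernel)
    out_h = len(image) - k + 1
    out_w = len(image[0]) - k + 1
    feature_map = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            total = 0
            # Element-wise multiply the 3x3 window by the kernel and sum.
            for di in range(k):
                for dj in range(k):
                    total += image[i + di][j + dj] * kernel[di][dj]
            row.append(total)
        feature_map.append(row)
    return feature_map


sobel_vertical = [[-1, 0, 1],
                  [-2, 0, 2],
                  [-1, 0, 1]]

# 4x6 grayscale patches: a dark-to-light vertical edge, and a flat region.
dark_to_light = [[0, 0, 0, 10, 10, 10] for _ in range(4)]
uniform = [[5] * 6 for _ in range(4)]

print(convolve2d(dark_to_light, sobel_vertical))  # [[0, 40, 40, 0], [0, 40, 40, 0]]
print(convolve2d(uniform, sobel_vertical))        # [[0, 0, 0, 0], [0, 0, 0, 0]]
```

The large positive values line up with the columns where the window straddles the edge, and the flat region produces zeros everywhere.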

Convolution Layer Parameters

Several hyperparameters control the behavior of a convolution layer:

Filter size: Typically 3x3 or 5x5. Smaller filters capture finer details; larger filters capture broader patterns.

Stride: The step size with which the filter moves. Stride 1 means the filter moves one pixel at a time; stride 2 reduces the output spatial dimensions by half.

Padding: Adding zeros around the input border. 'Valid' padding means no padding, so output size shrinks. 'Same' padding ensures output size equals input size (when stride=1).

Number of filters: The depth of the output volume (number of feature maps). Common values: 32, 64, 128, 256.

For an input of size H x W x D (height, width, depth/channels), a filter of size F x F x D, stride S, and padding P, the output spatial dimensions are:

Output height = (H - F + 2P) / S + 1

Output width = (W - F + 2P) / S + 1

Output depth = K (number of filters)

Example: Input 32x32x3, filter 5x5x3, stride 1, padding 0 → output 28x28xK.
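The output-size formula can be wrapped in a small helper to check examples quickly. This sketch assumes the common convention of rounding down when the division is not exact; the function name `conv_output_size` is made up for the example.

```python
def conv_output_size(size, filter_size, stride=1, padding=0):
    """Output height or width of a convolution along one dimension."""
    # Integer division: positions where the filter would overhang are dropped.
    return (size - filter_size + 2 * padding) // stride + 1


print(conv_output_size(32, 5, stride=1, padding=0))   # 28  (the example above)
print(conv_output_size(224, 3, stride=1, padding=1))  # 224 ('same' padding for a 3x3 filter)
print(conv_output_size(224, 3, stride=2, padding=1))  # 112 (stride 2 halves the size)
```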

Activation Function: ReLU

After each convolution, an activation function is applied element-wise. The most common is the Rectified Linear Unit (ReLU): f(x) = max(0, x). ReLU introduces non-linearity, allowing the network to learn complex patterns. It is computationally efficient and helps mitigate the vanishing gradient problem. Other activations like sigmoid or tanh are rarely used in the hidden layers of CNNs.
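As a quick illustration of f(x) = max(0, x), the sketch below applies ReLU element-wise to a small, made-up feature map.

```python
def relu(x):
    """Rectified Linear Unit: pass positive values through, zero out the rest."""
    return max(0.0, x)


feature_map = [[-3.0, 1.5],
               [0.0, -0.5]]
activated = [[relu(v) for v in row] for row in feature_map]

print(activated)  # [[0.0, 1.5], [0.0, 0.0]]
```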

Pooling Layers: Dimensionality Reduction

Pooling layers reduce the spatial dimensions of the feature maps, which cuts computation and the number of parameters needed by later layers, and provides a degree of translation invariance. The most common type is max pooling, which slides a window (e.g., 2x2) with a stride (e.g., 2) and outputs the maximum value in each window. This halves the height and width of each feature map (e.g., 32x32 → 16x16). Average pooling takes the average instead. Pooling has no learnable weights and operates independently on each feature map (depth channel).
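Max pooling is simple enough to sketch directly. The example below assumes a single feature map stored as a list of lists; the function name `max_pool2d` and the sample values are illustrative.

```python
def max_pool2d(feature_map, size=2, stride=2):
    """Downsample one feature map by taking the maximum of each window."""
    pooled = []
    for i in range(0, len(feature_map) - size + 1, stride):
        row = []
        for j in range(0, len(feature_map[0]) - size + 1, stride):
            window = [feature_map[i + di][j + dj]
                      for di in range(size) for dj in range(size)]
            row.append(max(window))
        pooled.append(row)
    return pooled


feature_map = [[1, 3, 2, 0],
               [4, 2, 1, 1],
               [0, 0, 5, 6],
               [1, 2, 7, 8]]

print(max_pool2d(feature_map))  # [[4, 2], [2, 8]]
```

A 4x4 map becomes 2x2: each output value summarizes one non-overlapping 2x2 window.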

Fully Connected Layers and Classification

After several convolution and pooling layers, the high-level reasoning in the network is performed by fully connected (dense) layers. The final feature maps are flattened into a 1D vector and fed into one or more dense layers. The last layer typically uses a softmax activation for multi-class classification, outputting a probability distribution over classes.

Training a CNN: Backpropagation and Gradient Descent

CNNs are trained using backpropagation and gradient descent, similar to other neural networks. The loss function (e.g., categorical cross-entropy) measures the difference between predicted and true labels. Gradients of the loss with respect to each weight are computed via the chain rule, and weights are updated to minimize the loss. The convolution filters are updated so that they learn to detect features that are discriminative for the task.
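The weight update at the heart of gradient descent can be shown on a toy problem. This sketch is not a CNN: it assumes a single weight and a made-up loss L(w) = (w - 3)^2, whose gradient is 2(w - 3), purely to show the update rule that backpropagation feeds.

```python
def gradient_descent_step(weights, gradients, learning_rate):
    """Move each weight a small step against its gradient."""
    return [w - learning_rate * g for w, g in zip(weights, gradients)]


# Toy loss L(w) = (w - 3)^2 with gradient 2 * (w - 3); the minimum is at w = 3.
weights = [0.0]
for _ in range(100):
    gradients = [2 * (weights[0] - 3)]
    weights = gradient_descent_step(weights, gradients, learning_rate=0.1)

print(weights)  # very close to [3.0]
```

In a real CNN the same update is applied to every filter weight, with the gradients supplied by backpropagation instead of a hand-written formula.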

Data Augmentation: Preventing Overfitting

CNNs have many parameters and can easily overfit. Data augmentation artificially increases the training set size by applying random transformations: rotations, shifts, flips, zooms, brightness adjustments, etc. This improves generalization.
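Two of the transformations listed above, flips and shifts, can be sketched without any libraries. The helper names below are hypothetical, and real pipelines would normally use a framework's augmentation utilities instead; this only shows the idea of producing a randomly altered copy of each training image.

```python
import random


def horizontal_flip(image):
    """Mirror the image left to right."""
    return [row[::-1] for row in image]


def shift_right(image, pixels, fill=0):
    """Shift the image right by `pixels`, filling the exposed edge with `fill`."""
    return [[fill] * pixels + row[:len(row) - pixels] for row in image]


def augment(image, rng):
    """Return a randomly flipped and shifted copy; the label stays the same."""
    if rng.random() < 0.5:
        image = horizontal_flip(image)
    return shift_right(image, rng.randint(0, 1))


image = [[1, 2, 3],
         [4, 5, 6]]
print(horizontal_flip(image))  # [[3, 2, 1], [6, 5, 4]]
print(shift_right(image, 1))   # [[0, 1, 2], [0, 4, 5]]

rng = random.Random(0)         # seeded so runs are repeatable
augmented = augment(image, rng)
```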

Transfer Learning: Using Pretrained CNNs

Training a CNN from scratch requires large datasets (e.g., ImageNet with 1.2 million images) and significant compute. Transfer learning takes a pretrained CNN (e.g., ResNet, VGG, Inception) and fine-tunes it on a smaller, task-specific dataset. The early layers (edge detectors, texture detectors) are often frozen, and only the later layers are retrained. This is extremely effective for many real-world applications.

Common CNN Architectures

LeNet-5 (1998): 7 layers, for handwritten digit recognition.

AlexNet (2012): 8 layers, used ReLU, dropout, and data augmentation.

VGGNet (2014): 16-19 layers, all 3x3 filters.

GoogLeNet/Inception (2014): 22 layers with inception modules.

ResNet (2015): Introduced skip connections, up to 152 layers.

MobileNet: Lightweight architecture for mobile/edge devices.

CNNs in Azure AI Services

Azure's Computer Vision service (now branded Azure AI Vision) uses CNNs for image classification, object detection, OCR, and more. Custom Vision allows users to train custom image models on their own images. The underlying models are reported to build on state-of-the-art architectures such as ResNet and EfficientNet. Azure also provides pre-trained models via the Cognitive Services APIs (now Azure AI services), which are trained on massive datasets.

Practical Considerations

Input size: Typically resize images to a fixed size (e.g., 224x224, 227x227, 299x299).

Normalization: Subtract mean and divide by standard deviation (per channel).

Batch size: Number of images processed together. Larger batches allow faster training but require more memory.

Learning rate: Controls step size in gradient descent. Common values: 0.01, 0.001, 0.0001.

Regularization: Dropout (randomly drop neurons during training) and L2 regularization help prevent overfitting.

Summary of Key Defaults and Values

Filter size: 3x3 (most common in modern networks)

Stride: 1 (convolution), 2 (pooling)

Padding: 'same' or 'valid'

Pooling: 2x2 max pooling, stride 2

Activation: ReLU

Output activation: Softmax (classification)

Loss: Categorical cross-entropy

Optimizer: Adam, SGD with momentum

Initialization: He initialization (for ReLU), Xavier (for sigmoid/tanh)

Walk-Through

1. Input Image Preprocessing

The input image is resized to a fixed dimension (e.g., 224x224 pixels) and normalized by subtracting the mean pixel value (e.g., [123.68, 116.78, 103.94] for ImageNet) and dividing by the standard deviation. For RGB images, the input shape becomes (224, 224, 3). This ensures consistent scale and zero-centered data, which helps gradient descent converge faster. Data augmentation (random flips, rotations, crops) may be applied during training to increase dataset diversity.

2. First Convolution + ReLU

A set of 64 filters of size 3x3x3 is applied to the input with stride 1 and same padding. Each filter slides across the image, performing element-wise multiplication and summation, producing a 2D activation map. The 64 filters produce an output volume of size (224, 224, 64). ReLU activation is applied element-wise, setting all negative values to zero. This introduces non-linearity and sparsity. The network learns filters that detect low-level features like edges and color blobs.

3. First Max Pooling

A 2x2 max pooling layer with stride 2 downsamples the feature maps. For each 2x2 region, the maximum value is taken, reducing the spatial dimensions from 224x224 to 112x112. The depth remains 64. Pooling provides a degree of translation invariance: small shifts in the input often leave the pooled output unchanged. It also reduces the number of parameters and computation for subsequent layers. The output shape is (112, 112, 64).

4. Second Convolution + Pooling

A second convolution layer applies 128 filters of size 3x3x64 (depth matches the input depth) with same padding and ReLU. The output shape is (112, 112, 128). Then a 2x2 max pooling with stride 2 reduces it to (56, 56, 128). This layer learns more complex patterns like corners, textures, and simple shapes. The number of filters doubles as spatial size halves, maintaining computational balance.

5. Third Convolution + Pooling

A third convolution layer applies 256 filters of size 3x3x128 with same padding, outputting (56, 56, 256). After ReLU, 2x2 max pooling reduces to (28, 28, 256). At this stage, the network detects high-level features such as parts of objects (e.g., wheels, eyes, windows). The receptive field of each neuron is larger, allowing it to see more of the original image.
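The shapes in steps 2 through 5 can be traced programmatically. The sketch below assumes every convolution is 3x3 with stride 1 and same padding, every pooling layer is 2x2 with stride 2, and each filter carries one bias term; the function names are made up for the example.

```python
def conv_params(filter_size, in_depth, num_filters):
    """Weights plus one bias per filter (the bias is an assumption here)."""
    return (filter_size * filter_size * in_depth + 1) * num_filters


def trace_shapes(input_shape, filters_per_block):
    """Follow a tensor through repeated [3x3 same-padded conv -> 2x2 pool] blocks."""
    h, w, depth = input_shape
    trace = []
    for num_filters in filters_per_block:
        params = conv_params(3, depth, num_filters)
        depth = num_filters                    # conv: same padding keeps h and w
        trace.append(("conv", (h, w, depth), params))
        h, w = h // 2, w // 2                  # pool: 2x2 window, stride 2
        trace.append(("pool", (h, w, depth), 0))
    return trace


trace = trace_shapes((224, 224, 3), [64, 128, 256])
for layer, shape, params in trace:
    print(layer, shape, params)
# conv (224, 224, 64) 1792
# pool (112, 112, 64) 0
# conv (112, 112, 128) 73856
# pool (56, 56, 128) 0
# conv (56, 56, 256) 295168
# pool (28, 28, 256) 0
```

Pooling layers contribute no parameters, and the convolution parameter counts grow with depth even as the spatial size shrinks.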

6. Flatten and Fully Connected Layers

The final feature maps (e.g., 7x7x512 from a deeper network) are flattened into a 1D vector of size 7*7*512 = 25088. This vector is fed into one or more fully connected (dense) layers. A common configuration is two dense layers: first with 4096 neurons and ReLU activation, second with 1000 neurons (for 1000-class ImageNet) and softmax activation. The dense layers perform high-level reasoning, combining features from the entire image to make a classification decision.
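To see why dense layers dominate the parameter count, the flatten-and-connect arithmetic above can be worked through directly. This sketch uses the sizes from the text and assumes one bias per neuron; the helper name `dense_params` is illustrative.

```python
def dense_params(inputs, units):
    """Every input connects to every unit, plus one bias per unit (assumed)."""
    return inputs * units + units


flattened = 7 * 7 * 512              # feature maps unrolled into a 1D vector
fc1 = dense_params(flattened, 4096)  # first dense layer
fc2 = dense_params(4096, 1000)       # output layer for 1000 classes

print(flattened)  # 25088
print(fc1)        # 102764544
print(fc2)        # 4097000
```

The first dense layer alone holds over 100 million parameters, far more than the convolution layers that precede it.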

7. Output Classification

The final fully connected layer uses softmax activation to produce a probability distribution over classes. For each class, the output is a value between 0 and 1, and all outputs sum to 1. The class with the highest probability is the predicted label. During training, the loss (e.g., categorical cross-entropy) is computed between the predicted distribution and the true one-hot encoded label. Backpropagation updates all weights in the network to minimize this loss.
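The softmax and cross-entropy calculations described above can be sketched with the standard library. The three raw scores (logits) below are made-up values for a three-class example, and class 0 is assumed to be the true label.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    largest = max(logits)
    # Subtracting the largest score avoids overflow without changing the result.
    exps = [math.exp(z - largest) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, true_index):
    """Loss against a one-hot label: negative log of the true class probability."""
    return -math.log(probs[true_index])


logits = [2.0, 1.0, 0.1]
probs = softmax(logits)               # roughly [0.66, 0.24, 0.10]
predicted = probs.index(max(probs))   # index of the most probable class
loss = cross_entropy(probs, true_index=0)

print(probs, predicted, loss)
```

The loss shrinks toward zero as the probability assigned to the true class approaches 1, which is what training pushes the network toward.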

What This Looks Like on the Job

Enterprise Scenario 1: Automated Defect Detection in Manufacturing

A car manufacturer deploys a CNN-based vision system on the assembly line to detect surface defects (scratches, dents, paint bubbles) on car bodies. The system uses a high-resolution camera capturing 1920x1080 images. A pretrained ResNet-50, fine-tuned on 50,000 labeled defect images, runs on Azure GPU VMs (NC series). The CNN processes each image in under 100ms, classifying regions as 'defect' or 'no defect' using a sliding window approach. The model is deployed via Azure Kubernetes Service for scalability, handling up to 200 images per second. Common issues: false positives due to lighting variations (solved by data augmentation with brightness and contrast changes) and class imbalance (defects are rare, requiring weighted loss functions). Misconfiguration: using too large a stride can miss small defects; too small a stride causes redundant computations.

Enterprise Scenario 2: Medical Image Analysis for Radiology

A hospital network uses a CNN to analyze chest X-rays for signs of pneumonia, lung nodules, and tuberculosis. The model is based on DenseNet-121 pretrained on ImageNet, then fine-tuned on a dataset of 100,000 X-rays. Input images are resized to 224x224 and normalized using dataset-specific mean and std. The CNN outputs probabilities for multiple pathologies (multi-label classification). The system is deployed on Azure Healthcare APIs with HIPAA compliance. Performance considerations: inference must be fast (<2 seconds per image) for clinical workflow; batch processing is used during off-peak hours. Misconfiguration: using a model with too many parameters can cause overfitting on small medical datasets; transfer learning mitigates this. Edge case: X-rays with implants or foreign objects can confuse the model if not represented in training data.

Enterprise Scenario 3: Retail Inventory Management

A retail chain uses a CNN to recognize products on shelves from store camera feeds. The system uses a lightweight MobileNetV2 backbone deployed on Azure IoT Edge devices (NVIDIA Jetson) for real-time inference. The model is trained on 500,000 product images across 10,000 SKUs. For each detected item, it outputs a product ID and bounding box (object detection using a YOLO-style detection head on top of the backbone). Each store processes 30 frames per second, sending only alerts (out-of-stock, misplaced items) to the cloud. Misconfiguration: using a model trained on studio-lit images fails in store lighting conditions; data augmentation with varying light levels is essential. Common failure: similar-looking products (e.g., different flavors of the same brand) cause confusion; adding a triplet loss for fine-grained recognition helps.

How AI-900 Actually Tests This

What AI-900 Tests on CNNs

The AI-900 exam (Objective 2.3: Computer Vision workloads) expects you to understand the purpose and basic components of CNNs, not to design or implement them. Specifically:

Recognize that CNNs are used for image classification, object detection, and segmentation.

Know the role of convolution, pooling, and fully connected layers.

Understand that CNNs are trained using labeled images (supervised learning).

Identify that Azure Custom Vision and Computer Vision APIs are built on CNNs.

Know that transfer learning allows using pretrained models for custom tasks.

Common Wrong Answers and Why Candidates Choose Them

1. "CNNs are used for time series forecasting" – Candidates confuse CNNs with RNNs. CNNs can be applied to time series, but the exam context is computer vision. The correct answer for image tasks is CNN.

2. "Pooling layers increase the size of feature maps" – This is backward. Pooling reduces spatial dimensions. Candidates may think pooling 'summarizes' information but mistakenly believe it enlarges.

3. "Fully connected layers are the first layers in a CNN" – The order is convolution/pooling first, then fully connected. Candidates might think all layers are fully connected because of earlier neural network topics.

4. "CNNs require all images to be the same size as the training images" – While resizing is common, CNNs can handle variable sizes by using global average pooling or fully convolutional architectures. The exam may test that input size flexibility exists.

Specific Numbers and Terms That Appear on the Exam

Filter size: typically 3x3 or 5x5

Stride: 1 or 2

Pooling: 2x2 max pooling with stride 2

ReLU activation function

Softmax for multi-class classification

Transfer learning: using a pretrained model (e.g., ResNet) and fine-tuning

Azure services: Custom Vision, Computer Vision API (now Azure AI Vision), Form Recognizer (now Azure AI Document Intelligence; uses CNNs for OCR)

Edge Cases the Exam Loves

What if you have a small dataset? Answer: Use transfer learning or data augmentation.

What if you need real-time inference on a mobile device? Answer: Use a lightweight architecture like MobileNet or Azure's on-device models.

What if images are not labeled? Answer: Use unsupervised or semi-supervised learning, or use a pretrained model and label a small subset.

How to Eliminate Wrong Answers Using the Underlying Mechanism

If an answer says 'CNNs process images as 1D vectors', eliminate it because CNNs preserve spatial structure.

If an answer says 'convolution reduces the number of parameters compared to dense layers', that is correct.

If an answer says 'pooling introduces non-linearity', that is wrong; activation functions (ReLU) introduce non-linearity.

If an answer says 'the number of filters equals the number of classes', eliminate it; the number of filters is independent of classes.

Key Takeaways

CNNs are specialized for grid-like data, especially images, using convolution, pooling, and fully connected layers.

Convolution layers apply learnable filters (kernels) to detect features like edges and textures.

Pooling layers reduce spatial dimensions, providing translation invariance and reducing computation.

Common filter size: 3x3; common pooling: 2x2 max pooling with stride 2.

ReLU is the standard activation function in CNN hidden layers; softmax is used for multi-class classification.

Transfer learning uses pretrained CNNs (e.g., ResNet) to achieve high accuracy with small datasets.

Azure Custom Vision and Computer Vision APIs are built on CNNs.

CNNs require labeled data for supervised training.

Easy to Mix Up

These come up on the exam all the time. Here's how to tell them apart.

Convolutional Neural Network (CNN)

Preserves spatial structure: uses 2D/3D input shape

Parameter sharing: same filter applied across entire input

Local connectivity: each neuron connects to only a local region

Translation invariance via pooling

Typically fewer parameters due to weight sharing

Fully Connected (Dense) Neural Network

Flattens input to 1D, losing spatial relationships

Each neuron has a unique weight for every input, no sharing

Global connectivity: every neuron connects to all inputs

No inherent invariance to translations

Very large number of parameters for image-sized inputs

Watch Out for These

Mistake

CNNs are only for images.

Correct

CNNs can process any grid-like data, including 1D time series, 3D medical scans, and even text (character-level). However, the AI-900 exam focuses on image data.

Mistake

Pooling layers are required in every CNN.

Correct

Modern architectures like ResNet sometimes use strided convolutions instead of pooling to downsample. Pooling is common but not mandatory.

Mistake

Every filter in a CNN is the same size and detects the same feature.

Correct

Within a layer, all filters have the same spatial size (e.g., 3x3), but they differ in weight values. Different layers can have different filter sizes.

Mistake

CNNs cannot handle color images because they are 2D.

Correct

Color images are 3D (height, width, channels). Filters have depth equal to the input channels, so they process all channels simultaneously.

Mistake

Training a CNN from scratch always performs better than transfer learning.

Correct

Transfer learning often yields better results with limited data and less compute. Training from scratch requires massive datasets and careful tuning.

Frequently Asked Questions

What is the difference between a CNN and a regular neural network?

A CNN uses convolution layers that exploit spatial structure by applying small filters that slide across the input. This results in local connectivity and parameter sharing, drastically reducing the number of parameters compared to a fully connected network. CNNs are specifically designed for grid-like data such as images, whereas regular neural networks treat each input feature independently and are not spatially aware.

What is the purpose of pooling in a CNN?

Pooling reduces the spatial dimensions (height and width) of the feature maps, which decreases computation and the number of parameters needed by later layers. It also provides a form of translation invariance: small shifts in the input do not significantly change the pooled output. The most common type is max pooling, which takes the maximum value in each window, typically 2x2 with stride 2.

What is transfer learning in the context of CNNs?

Transfer learning is the practice of taking a CNN that has been pretrained on a large dataset (like ImageNet) and fine-tuning it on a smaller, task-specific dataset. The early layers (which detect generic features like edges) are often frozen, and only the later layers are retrained. This allows achieving high accuracy with limited data and reduced training time.

What is the role of the fully connected layers in a CNN?

After several convolution and pooling layers, the high-level reasoning is performed by one or more fully connected (dense) layers. They take the flattened feature maps and learn to combine the detected features to produce the final classification. The last fully connected layer typically uses softmax activation to output a probability distribution over classes.

How does Azure Custom Vision use CNNs?

Azure Custom Vision is a service that allows users to train custom image classifiers and object detectors without needing deep learning expertise. It uses pretrained CNNs (like ResNet) as a starting point and fine-tunes them on the user's labeled images. The training process is automated, and the resulting model can be deployed as a REST API endpoint.

What is the difference between image classification and object detection in CNNs?

Image classification assigns a single label to the entire image (e.g., 'cat'). Object detection identifies multiple objects in an image and draws bounding boxes around them (e.g., 'cat' at coordinates (x1,y1,x2,y2)). Object detection architectures extend CNNs with bounding box regression: Faster R-CNN adds a region proposal stage, while YOLO predicts boxes and classes directly in a single pass.

Why is ReLU preferred over sigmoid in CNNs?

ReLU (max(0,x)) is computationally efficient and helps mitigate the vanishing gradient problem, where gradients become very small in deep networks using sigmoid. ReLU also introduces sparsity by zeroing out negative values, which can improve learning. Sigmoid is rarely used in hidden layers of CNNs due to saturation and vanishing gradients.
