AI-900 | Chapter 7 of 100 | Objective 2.3

Deep Learning and Neural Networks

This chapter covers deep learning and neural networks, the engine behind modern AI breakthroughs such as image recognition and natural language processing. On the AI-900 exam, approximately 15-20% of questions touch on machine learning concepts, and deep learning is a key subset of that domain. You will learn the architecture, the training process, and the key terminology, all of which are essential for answering scenario-based questions on Azure's AI services.

25 min read
Intermediate
Updated May 31, 2026

Neural Networks as a Factory Assembly Line

Imagine a factory assembly line that produces a complex product, say a smartphone. The raw materials (input data) arrive at the first station. Each station is staffed by specialized workers (neurons), and each worker performs one specific task on the material, such as installing a screen, adding a battery, or testing sound. A worker doesn't see the whole product, only the part that arrives at their station. They apply their operation (weighted sum + activation) and pass the partially assembled product on. The stations are arranged in sequence (layers), and each station can have several workers operating side by side (the neurons in a layer). The output of one station becomes the input to the next. At the end of the line, the final product (prediction) is inspected. If it is defective (error), the factory manager (backpropagation) sends feedback from the end of the line back to the beginning, and each worker's technique is adjusted slightly. Over time, the workers learn the best way to handle their part. Deep learning means having many stations (layers), so the product is refined step by step. Just as a factory can be retooled for many different products by adjusting what its workers do, a neural network with enough neurons and training data can approximate a very wide range of functions by adjusting its weights and biases.

How It Actually Works

What is Deep Learning and Why Does It Exist?

Deep learning is a subset of machine learning that uses neural networks with multiple layers (hence "deep") to model complex patterns in data. Traditional machine learning algorithms like linear regression or decision trees often require manual feature engineering—where a human must decide which input variables are important. Deep learning automates feature extraction: the network learns hierarchical representations directly from raw data. For example, in image recognition, early layers detect edges, middle layers detect shapes like eyes or wheels, and final layers combine these into objects like faces or cars.

How Neural Networks Work Internally

A neural network consists of layers of interconnected nodes (neurons). Each neuron receives inputs, multiplies them by weights, adds a bias, and passes the sum through an activation function. The output becomes input for the next layer.

Step-by-step forward pass:

1. Input layer: Each neuron represents one feature of the input data (e.g., pixel value).

2. Hidden layers: Each neuron computes z = Σ(w_i * x_i) + b, where w are weights, x are inputs, and b is the bias. It then applies an activation function f(z) (e.g., ReLU: f(z) = max(0, z)).

3. Output layer: Produces the final prediction (e.g., a probability for each class using softmax).
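The forward pass can be sketched in a few lines of plain Python. This is a minimal illustration, not how a framework implements it: the layer sizes, the weights, and the helper names (`dense`, `relu`, `softmax`) are all made up for the example.

```python
import math

def relu(z):
    return max(0.0, z)

def softmax(scores):
    # Subtract the max score first for numerical stability.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dense(inputs, weights, biases, activation=None):
    """One fully connected layer: z_j = sum_i(w_ji * x_i) + b_j, then f(z_j)."""
    outputs = []
    for w_row, b in zip(weights, biases):
        z = sum(w * x for w, x in zip(w_row, inputs)) + b
        outputs.append(activation(z) if activation else z)
    return outputs

# Hypothetical network: 3 input features -> 2 hidden neurons -> 2 classes.
x = [0.5, -1.0, 2.0]
W1 = [[0.2, -0.4, 0.1], [-0.3, 0.8, 0.1]]
b1 = [0.0, 0.1]
W2 = [[1.0, -1.0], [0.5, 0.5]]
b2 = [0.0, 0.0]

hidden = dense(x, W1, b1, relu)          # second neuron has z < 0, so ReLU outputs 0
probs = softmax(dense(hidden, W2, b2))   # class probabilities that sum to 1
```

Note that nothing is learned here: the weights are fixed, and the code only computes a prediction from them.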

Training via backpropagation:

1. Forward pass computes output and loss (error) using a loss function (e.g., cross-entropy for classification).

2. Backward pass calculates the gradient of the loss with respect to each weight using the chain rule.

3. Weights are updated using an optimizer (e.g., stochastic gradient descent) with a learning rate (typical default: 0.01) to minimize loss.
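To see the three training steps working together, here is a small sketch that trains a single sigmoid neuron on a toy logical-OR dataset. The dataset, the learning rate of 0.5, and the 200 epochs are illustrative assumptions; for a sigmoid output with cross-entropy loss, the gradient with respect to z simplifies to (p - y).

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def mean_loss(data, w, b):
    """Average binary cross-entropy over the dataset."""
    total = 0.0
    for x, y in data:
        p = sigmoid(w[0] * x[0] + w[1] * x[1] + b)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(data)

def train(data, learning_rate=0.5, epochs=200):
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, y in data:
            p = sigmoid(w[0] * x[0] + w[1] * x[1] + b)  # 1. forward pass
            grad_z = p - y                               # 2. backward pass (dL/dz)
            w[0] -= learning_rate * grad_z * x[0]        # 3. weight update
            w[1] -= learning_rate * grad_z * x[1]
            b -= learning_rate * grad_z
    return w, b

# Toy data (assumed for illustration): output is 1 if either input is 1.
or_data = [([0.0, 0.0], 0), ([0.0, 1.0], 1), ([1.0, 0.0], 1), ([1.0, 1.0], 1)]

loss_before = mean_loss(or_data, [0.0, 0.0], 0.0)  # ln(2) with all-zero weights
w, b = train(or_data)
loss_after = mean_loss(or_data, w, b)              # lower than loss_before
```

Real networks repeat the same loop with many layers and many more weights; frameworks compute the backward pass automatically.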

Key Components, Values, and Defaults

Neuron/Unit: Basic computational unit.

Layer: Collection of neurons. Common types: Dense (fully connected), Convolutional (for images), Recurrent (for sequences).

Activation Functions: ReLU (default for hidden layers), Sigmoid (output for binary classification), Softmax (output for multi-class).

Loss Functions: Mean Squared Error (regression), Cross-Entropy (classification).

Optimizer: Adam (adaptive learning rate, default learning rate 0.001), SGD (momentum often 0.9).

Epochs: One full pass through the training data (default varies, often 10-100).

Batch Size: Number of samples processed before weight update (typical powers of 2: 32, 64, 128).

Learning Rate: Controls step size (common default 0.01 for SGD, 0.001 for Adam). Too high = overshoot, too low = slow convergence.
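The effect of the learning rate is easy to see on a deliberately simple problem. The sketch below minimizes f(w) = w² (gradient 2w) with plain gradient descent; the function and the three learning rates are chosen only for illustration and are not values you would use for a real network.

```python
def descend(learning_rate, steps=20, w=1.0):
    """Minimize f(w) = w**2 with gradient descent, starting from w = 1."""
    for _ in range(steps):
        gradient = 2.0 * w
        w = w - learning_rate * gradient
    return w

too_high = descend(1.1)     # each step multiplies w by -1.2: overshoots and diverges
reasonable = descend(0.1)   # each step multiplies w by 0.8: converges toward 0
too_low = descend(0.001)    # each step multiplies w by 0.998: barely moves in 20 steps
```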

Configuration and Verification in Azure

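In Azure Machine Learning, you can build and train a neural network using the Designer (drag-and-drop) or the Python SDK. For example, the following SDK v1 snippet configures a TensorFlow training job; the network itself (for instance, a two-layer model) is defined in train.py: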

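# SDK v1 estimator; deprecated in favor of ScriptRunConfig.
from azureml.train.dnn import TensorFlow

# Configures the training job only; the model architecture lives in train.py.
est = TensorFlow(
    source_directory='./',
    entry_script='train.py',
    framework_version='2.4',
    compute_target='gpu-cluster',  # name of an existing compute target
    # script_params takes a dict of argument names to values, not a string.
    script_params={'--epochs': 50, '--batch-size': 32},
)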

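To verify training, log metrics such as loss and accuracy from the training script and review them on the job's Metrics tab in Azure Machine Learning studio. Azure Monitor is better suited to tracking the resource-level health of the workspace and its compute.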

Interaction with Related Technologies

Deep learning models often use GPUs (e.g., NVIDIA Tesla V100) for parallel computation. In Azure, you can provision GPU-based compute targets. Transfer learning (using pre-trained models like ResNet) reduces training time. Azure Cognitive Services (since rebranded as Azure AI services) offers pre-built deep learning models, while Azure Machine Learning enables custom model training.

Walk-Through

1. Forward Propagation

Input data flows from the input layer through each hidden layer to the output layer. At each neuron, the weighted sum of inputs plus bias is computed, then an activation function is applied. For example, in a dense layer, each neuron connects to all neurons in the previous layer. The output of layer L becomes the input of layer L+1. No training occurs during forward propagation; it simply computes the network's prediction given current weights.

2. Loss Calculation

The network's output (prediction) is compared to the true label using a loss function. For a classification task with 10 classes, softmax outputs probabilities, and cross-entropy loss is computed: L = -Σ y_i * log(p_i), where y_i is 1 for the correct class and 0 otherwise. The loss quantifies how far the prediction is from the truth. A perfect model would have loss 0.
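As a quick worked example of the cross-entropy formula, the sketch below uses three classes instead of ten to keep the numbers readable. The probability vectors are invented for illustration.

```python
import math

def cross_entropy(y_true, probs):
    """L = -sum(y_i * log(p_i)) for a one-hot label y_true."""
    # Only the correct class (y_i = 1) contributes to the sum.
    return -sum(y * math.log(p) for y, p in zip(y_true, probs) if y > 0)

label = [1, 0, 0]  # the first class is the correct one

confident_right = cross_entropy(label, [0.7, 0.2, 0.1])  # -ln(0.7), about 0.357
confident_wrong = cross_entropy(label, [0.1, 0.2, 0.7])  # -ln(0.1), about 2.303
```

The loss grows sharply as the probability assigned to the correct class falls, which is why confident wrong answers are penalized so heavily.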

3. Backpropagation

The gradient of the loss with respect to each weight is calculated using the chain rule of calculus. Starting from the output layer, the error is propagated backward through the network. For each weight, the partial derivative ∂L/∂w indicates how much a small change in that weight would affect the loss. This step is computationally intensive and is why GPUs excel—they parallelize matrix operations.
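The chain rule can be traced by hand on a tiny network. The sketch below assumes a two-weight chain (x → w1 → ReLU → w2 → squared-error loss) with made-up values, and checks the hand-derived gradient against a numerical estimate. Squared error is used here only to keep the derivatives short.

```python
def relu(z):
    return max(0.0, z)

def loss_fn(w1, w2, x, y):
    h = relu(w1 * x)
    return (w2 * h - y) ** 2

def backprop(w1, w2, x, y):
    # Forward pass, keeping the intermediate values.
    z1 = w1 * x
    h = relu(z1)
    z2 = w2 * h
    # Backward pass: multiply local derivatives from the loss back to each weight.
    dL_dz2 = 2.0 * (z2 - y)
    dL_dw2 = dL_dz2 * h
    dL_dh = dL_dz2 * w2
    dL_dz1 = dL_dh * (1.0 if z1 > 0 else 0.0)  # derivative of ReLU
    dL_dw1 = dL_dz1 * x
    return dL_dw1, dL_dw2

x, y, w1, w2 = 2.0, 1.0, 0.5, -1.5
grad_w1, grad_w2 = backprop(w1, w2, x, y)   # 15.0 and -5.0

# Numerical check: nudge w1 slightly and measure how the loss changes.
eps = 1e-6
numeric_w1 = (loss_fn(w1 + eps, w2, x, y) - loss_fn(w1 - eps, w2, x, y)) / (2 * eps)
```

The numerical estimate needs two extra forward passes per weight, whereas backpropagation obtains every gradient from one backward sweep.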

4. Weight Update

Weights are adjusted in the opposite direction of the gradient to minimize loss. Using stochastic gradient descent: w_new = w_old - learning_rate * gradient. The learning rate controls step size. With Adam optimizer, adaptive learning rates per parameter are computed using moving averages of gradients and squared gradients. After update, the next batch of data is fed forward, and the cycle repeats.
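The two update rules can be compared on a single weight. This is a simplified sketch of one step of each, using the commonly cited Adam defaults (beta1 = 0.9, beta2 = 0.999, eps = 1e-8); the starting weight and the deliberately large gradient of 100 are assumptions for illustration.

```python
import math

def sgd_step(w, grad, lr=0.01):
    # w_new = w_old - learning_rate * gradient
    return w - lr * grad

def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad          # moving average of gradients
    v = beta2 * v + (1 - beta2) * grad * grad   # moving average of squared gradients
    m_hat = m / (1 - beta1 ** t)                # bias correction for early steps
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v

w_sgd = sgd_step(1.0, grad=100.0)                               # moves by lr * grad = 1.0
w_adam, m, v = adam_step(1.0, grad=100.0, m=0.0, v=0.0, t=1)    # moves by about lr = 0.001
```

With plain SGD the step scales with the raw gradient, while Adam's first step has a size of roughly the learning rate regardless of how large the gradient is.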

5. Iteration and Convergence

Steps 1-4 are repeated for many batches (mini-batch gradient descent) across multiple epochs. One epoch means the entire training dataset has been processed once. Training stops when loss converges (plateaus) or after a fixed number of epochs. Overfitting is monitored by evaluating on a validation set; if validation loss increases, early stopping can be applied.
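Early stopping is usually implemented with a "patience" setting: stop once validation loss has failed to improve for a set number of epochs. The sketch below shows one possible version of that logic; the function name, the patience value, and the loss history are hypothetical.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (stop_epoch, best_epoch), both 0-indexed."""
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` epochs in a row: stop training.
            return epoch, best_epoch
    return len(val_losses) - 1, best_epoch

# Validation loss falls, then starts rising as the model overfits.
history = [0.90, 0.70, 0.60, 0.62, 0.65, 0.70, 0.50]
stop_epoch, best_epoch = early_stopping_epoch(history, patience=2)
```

Here training stops at epoch 4 and the weights from epoch 2 would be kept. A larger patience value tolerates longer plateaus at the cost of extra training time.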

What This Looks Like on the Job

Enterprise Scenario 1: Image Classification for Medical Diagnostics

A hospital wants to automatically classify X-ray images into categories (e.g., normal, pneumonia). They use a deep convolutional neural network (CNN) like ResNet-50 pre-trained on ImageNet. The problem: manual feature extraction is infeasible due to high variability in images. Solution: transfer learning. They freeze the early layers and retrain the final layers on their X-ray dataset (10,000 images). They use Azure Machine Learning with GPU compute (NC6s_v3) for training. Configuration: batch size 32, learning rate 0.0001, 50 epochs. Common pitfall: class imbalance (90% normal, 10% pneumonia), which they address with a weighted loss or oversampling. Misconfiguration: a learning rate that is too high causes training to diverge, and validation accuracy drops. They monitor training with Azure ML metrics and set early stopping on validation loss.
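One common way to set the weights for a weighted loss is inverse class frequency, the same heuristic scikit-learn uses for `class_weight="balanced"`. The sketch below applies it to the 90/10 split in this scenario; the function name is hypothetical, and the scenario does not say which weighting scheme the hospital actually used.

```python
def balanced_class_weights(class_counts):
    """weight_c = total_samples / (num_classes * samples_in_class_c)"""
    total = sum(class_counts.values())
    num_classes = len(class_counts)
    return {
        label: total / (num_classes * count)
        for label, count in class_counts.items()
    }

# 10,000 X-rays split 90% normal, 10% pneumonia.
weights = balanced_class_weights({"normal": 9000, "pneumonia": 1000})
# Each pneumonia example now counts 9 times as much as a normal one in the loss.
```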

Enterprise Scenario 2: Customer Sentiment Analysis

A retail company wants to analyze customer reviews for sentiment (positive/negative). They use a recurrent neural network (RNN) with LSTM layers to handle variable-length text. Problem: traditional bag-of-words ignores word order. Solution: Embedding layer (300-dimensional GloVe vectors) followed by LSTM. They use Azure ML with CPU compute (Standard_D4s_v3) for training. Configuration: sequence length 100, batch size 64, learning rate 0.001, 10 epochs. They deploy as a real-time endpoint using Azure Kubernetes Service. Common issue: overfitting due to small dataset (5,000 reviews). They add dropout (0.5) and early stopping. Misconfiguration: forgetting to tokenize and pad sequences to fixed length causes shape mismatch.
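The tokenize-and-pad step mentioned at the end of this scenario can be sketched as follows. The tiny vocabulary, the whitespace tokenizer, and the choice to pad at the end are all simplifying assumptions; real pipelines use a library tokenizer, and the scenario's sequence length is 100 rather than 4.

```python
def encode(text, vocab, unknown_id=1):
    """Map each lowercase, whitespace-separated token to an integer id."""
    return [vocab.get(token, unknown_id) for token in text.lower().split()]

def pad_or_truncate(ids, length, pad_id=0):
    """Force every sequence to the same length so batches have one shape."""
    return ids[:length] + [pad_id] * max(0, length - len(ids))

# Hypothetical vocabulary; id 0 is reserved for padding, id 1 for unknown words.
vocab = {"great": 2, "product": 3, "terrible": 4, "service": 5}
reviews = ["Great product", "Terrible service and terrible product overall"]

batch = [pad_or_truncate(encode(r, vocab), length=4) for r in reviews]
# Every row now has exactly 4 ids, so the batch can be stacked into one tensor.
```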

Performance Considerations

Training deep networks requires significant compute. For large models (e.g., GPT-3), distributed training across multiple GPUs or nodes is used. In Azure, you can use Horovod or PyTorch DistributedDataParallel. Inference latency matters: for real-time applications, you might use ONNX Runtime to optimize the model. When the architecture is poorly chosen (e.g., too many layers without residual connections), gradients vanish (shrink toward zero as they propagate backward) and the early layers stop learning. Batch normalization helps mitigate this.

How AI-900 Actually Tests This

Exactly What AI-900 Tests

Objective 2.3: Identify deep learning and neural network concepts. The exam expects you to:

Define neural network, neuron, layer, activation function, weights, bias.

Explain how deep learning differs from traditional ML (automated feature extraction vs. manual).

Identify scenarios suitable for deep learning (e.g., image, text, audio) vs. simpler algorithms.

Recognize that deep learning requires large datasets and computational resources.

Know that Azure offers pre-built deep learning models via Cognitive Services and custom models via Azure Machine Learning.

Common Wrong Answers and Why Candidates Choose Them

1. "Deep learning always outperforms other ML algorithms." Wrong because deep learning needs large data; on small datasets, simpler models like decision trees may perform better.

2. "Neural networks with one hidden layer are deep learning." False; deep learning implies multiple hidden layers. A single hidden layer is a shallow network.

3. "Training a neural network requires no data preprocessing." Incorrect; normalization, handling missing values, and encoding are still needed.

4. "Deep learning models are immune to overfitting." False; they are prone to overfitting without regularization (dropout, early stopping).

Specific Numbers and Terms on the Exam

Epoch: one full pass over the dataset.

Batch size: number of samples per update (e.g., 32).

Learning rate: typical values 0.01, 0.001.

Activation functions: ReLU, Sigmoid, Tanh, Softmax.

Loss functions: Cross-entropy, Mean Squared Error.

Optimizer: SGD, Adam.

GPU: typically needed to train large models in a reasonable time.

Edge Cases and Exceptions

Deep learning can be used for regression (e.g., predicting house prices) with linear output activation.

Not all problems need deep learning; for tabular data with few features, gradient boosting may be better.

Transfer learning works best when the pre-trained model's domain is similar to the new task.

How to Eliminate Wrong Answers

If a question asks "which scenario is best for deep learning?" and options include a small dataset (100 rows) or simple linear relationship, eliminate those. Look for keywords like "large dataset," "complex patterns," "images," "text," "audio." If the answer mentions "requires manual feature engineering," it's not deep learning.

Key Takeaways

Deep learning uses neural networks with multiple hidden layers to automatically extract features from raw data.

Key components: neurons, weights, biases, activation functions (ReLU, Softmax), loss functions (cross-entropy), optimizers (Adam).

Training involves forward propagation, loss calculation, backpropagation, and weight update over multiple epochs and batches.

Deep learning requires large datasets and significant computational resources (GPUs) for training.

In Azure, pre-built deep learning models are available via Cognitive Services; custom models can be trained using Azure Machine Learning.

Common pitfalls: overfitting (use dropout, early stopping), vanishing gradients (use ReLU, batch normalization), and improper learning rate.

AI-900 focuses on understanding concepts, not implementing algorithms. Be able to identify appropriate use cases and distinguish from traditional ML.

Easy to Mix Up

These come up on the exam all the time. Here's how to tell them apart.

Neural Network

Requires large amounts of data for good performance.

Automatically learns feature representations.

Computationally expensive to train; often needs GPU.

Handles unstructured data (images, text) well.

Model is a black box; low interpretability.

Decision Tree

Works well with small to medium datasets.

Requires manual feature engineering.

Fast to train on CPU.

Primarily used for structured/tabular data.

Highly interpretable; can visualize tree structure.

Watch Out for These

Mistake

Deep learning is the same as artificial intelligence.

Correct

Deep learning is a subset of machine learning, which is a subset of AI. AI encompasses rule-based systems, symbolic reasoning, etc. Deep learning specifically uses multi-layer neural networks.

Mistake

Neural networks can only be used for classification.

Correct

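They can also be used for regression (predicting continuous values) and for unsupervised tasks (e.g., autoencoders, which learn compressed representations that can support clustering). For supervised tasks, the output layer activation matches the task: linear for regression, sigmoid for binary classification, softmax for multi-class classification.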

Mistake

The more layers, the better the model.

Correct

Adding layers increases capacity but also risk of overfitting and vanishing gradients. A model with too many layers may perform worse than a shallower one if data is limited. Residual connections (ResNet) help train very deep networks.

Mistake

Deep learning models cannot handle structured/tabular data.

Correct

They can, but they often outperform simpler models only on very large datasets. For small tabular data, gradient boosting or random forests are more efficient and interpretable.

Mistake

Backpropagation and gradient descent are the same algorithm.

Correct

Backpropagation is the method to compute gradients efficiently; gradient descent is the optimization algorithm that uses those gradients to update weights. They work together.

Frequently Asked Questions

What is the difference between deep learning and regular machine learning?

Deep learning uses multi-layer neural networks to automatically learn hierarchical feature representations from raw data, whereas traditional machine learning requires manual feature engineering. Deep learning excels on unstructured data like images, audio, and text but needs large datasets and more compute. Traditional ML algorithms (e.g., decision trees, SVM) work well on structured data with moderate size and are more interpretable.

What is the role of an activation function in a neural network?

An activation function introduces non-linearity into the network, allowing it to learn complex patterns. Without it, the network would be a linear model regardless of depth. Common functions: ReLU (f(x)=max(0,x)) for hidden layers (fast, helps mitigate vanishing gradients), Sigmoid (outputs 0-1) for binary classification, Softmax (outputs probabilities summing to 1) for multi-class classification.
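The claim that a network without activations is just a linear model can be checked directly. In this sketch (single-input layers with arbitrary made-up weights), two stacked linear layers give exactly the same output as one linear layer, while inserting ReLU between them breaks that equivalence.

```python
def relu(z):
    return max(0.0, z)

w1, b1 = 2.0, 1.0    # first layer
w2, b2 = -3.0, 0.5   # second layer

def two_linear_layers(x):
    return w2 * (w1 * x + b1) + b2

def one_linear_layer(x):
    # The same function collapsed: weight w2*w1, bias w2*b1 + b2.
    return (w2 * w1) * x + (w2 * b1 + b2)

def with_relu(x):
    return w2 * relu(w1 * x + b1) + b2

inputs = [-2.0, -0.5, 0.0, 1.0, 3.0]
stacked = [two_linear_layers(x) for x in inputs]
collapsed = [one_linear_layer(x) for x in inputs]   # identical to stacked
nonlinear = [with_relu(x) for x in inputs]          # differs where w1*x + b1 < 0
```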

How does backpropagation work?

Backpropagation calculates the gradient of the loss function with respect to each weight using the chain rule. It propagates the error backward from the output layer to the input layer. For each weight, it computes how much a small change would affect the loss. These gradients are then used by an optimizer (e.g., SGD) to update weights in the direction that minimizes loss.

What is transfer learning and why is it useful?

Transfer learning takes a pre-trained model (e.g., ResNet trained on ImageNet) and fine-tunes it on a new, often smaller, dataset. It saves time and compute because the model already knows useful features (edges, shapes). It's particularly effective when the new dataset is similar to the original. In Azure, you can use pre-trained models from the Model Catalog.

What hardware is typically used for training deep learning models?

Graphics Processing Units (GPUs) are the standard because they can perform many matrix operations in parallel. NVIDIA GPUs like Tesla V100 or A100 are common. In Azure, you can provision GPU compute instances (e.g., NC, ND series). For very large models, multiple GPUs or even TPUs (Tensor Processing Units) are used.

What is overfitting and how can it be prevented?

Overfitting occurs when a model learns noise in the training data and performs poorly on new data. Symptoms: high training accuracy, low validation accuracy. Prevention techniques: use more data, reduce model complexity (fewer layers/neurons), apply regularization (L1/L2), dropout (randomly drop neurons during training), early stopping (stop when validation loss increases), and data augmentation.
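Dropout, one of the techniques listed above, can be sketched in a few lines. This version is the "inverted dropout" variant, which is an assumption on my part since the text does not specify one: kept activations are scaled up during training so that nothing needs to change at inference time. The function signature is hypothetical.

```python
import random

def dropout(values, rate, training, rng):
    """Randomly zero a fraction `rate` of activations during training only."""
    if not training or rate == 0.0:
        return list(values)  # inference: pass activations through unchanged
    keep = 1.0 - rate
    # Scale kept values by 1/keep so the expected activation stays the same.
    return [v / keep if rng.random() < keep else 0.0 for v in values]

rng = random.Random(0)  # seeded so the example is repeatable
activations = [1.0] * 10

train_out = dropout(activations, rate=0.5, training=True, rng=rng)   # each value is 0.0 or 2.0
eval_out = dropout(activations, rate=0.5, training=False, rng=rng)   # unchanged
```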

What is the difference between an epoch and a batch?

An epoch is one complete pass through the entire training dataset. A batch is a subset of the dataset used in one iteration of weight update. For example, if you have 1000 samples and batch size 100, one epoch consists of 10 iterations (batches). Mini-batch gradient descent (batch size >1 and <full dataset) is common because it balances speed and stability.
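The arithmetic is worth checking for a batch size that does not divide the dataset evenly. This sketch uses the 1000-sample example from the answer above plus a batch size of 32; the `batches` helper is hypothetical.

```python
import math

def batches(samples, batch_size):
    """Yield consecutive slices of the dataset, one per weight update."""
    for start in range(0, len(samples), batch_size):
        yield samples[start:start + batch_size]

dataset = list(range(1000))

iterations_100 = math.ceil(len(dataset) / 100)          # 10 iterations per epoch
sizes_32 = [len(b) for b in batches(dataset, 32)]       # 32 iterations per epoch
last_batch = sizes_32[-1]                               # final batch holds only 8 samples
```

With batch size 32, one epoch takes 32 iterations: 31 full batches plus a final partial batch of 8 samples.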
