AI-900 | Chapter 93 of 100 | Objective 1.1

AI Infrastructure Costs and GPU Compute

This chapter covers the critical topic of AI infrastructure costs and GPU compute in Microsoft Azure, which is a key area for the AI-900 exam (Domain: AI Workloads, Objective 1.1). Understanding the cost models, GPU VM series, and scaling strategies is essential because roughly 10-15% of exam questions touch on infrastructure cost considerations for AI workloads. You will learn exactly what GPU compute is, how Azure pricing works for AI, and the common cost traps that candidates often misunderstand.

25 min read
Intermediate
Updated May 31, 2026

GPU Compute as Construction Crane Fleet

Imagine you are a construction company that builds skyscrapers (AI models). Each skyscraper requires lifting massive steel beams (large data batches) high into the air. You have two options: use many human workers (CPU cores) to carry beams up the stairs, or rent a fleet of tower cranes (GPU compute). Cranes are expensive to rent by the hour, but each crane can lift a beam in one smooth motion that would take 100 workers an hour. Cranes also require special setup: a concrete base (NVIDIA CUDA drivers), a power supply (high-wattage power cables), and a certified operator (CUDA-enabled code). If you only need to lift one beam per week, renting a crane is wasteful because you pay for idle time. But if you are lifting thousands of beams in parallel, the crane fleet is dramatically cheaper per beam.

In Azure, you rent GPU virtual machines (cranes) and are billed for as long as they stay allocated. You can choose a single massive crane (an NDv2-series VM with 8 GPUs) for one huge lift, or many smaller cranes (NC-series VMs) for many parallel lifts. The key is that GPU compute is specialized: it excels at matrix multiplications (lifting beams) but is inefficient for sequential tasks like sorting nails (CPU tasks). Just as you would not use a crane to hammer a nail, you would not use a GPU to run a web server. Azure provides GPU instances only in specific regions and availability zones, and you pay for the entire VM, not just the GPU. If you forget to turn off the crane after the lift, you keep paying. This is the #1 cost trap in AI workloads.

How It Actually Works

What Is GPU Compute and Why Does It Exist?

GPU (Graphics Processing Unit) compute refers to using specialized processors originally designed for rendering graphics to perform parallel mathematical computations. Unlike CPUs, which have a few powerful cores optimized for sequential tasks, GPUs contain thousands of smaller cores designed for simultaneous operations. For AI workloads—especially deep learning training and inference—the bulk of computation involves matrix multiplications and tensor operations. A single GPU can outperform a CPU by 10-100x on these tasks. Azure offers GPU-optimized virtual machines (VMs) that include one or more NVIDIA GPUs (e.g., Tesla K80, P40, P100, V100, A100, or newer models). These VMs are part of the N-series (NC, ND, NV, etc.) and are available in specific Azure regions.

How GPU Compute Works Internally

When you run an AI training job on a GPU VM, the following steps occur at a high level:

1. Data Preparation: The CPU loads data (e.g., images, text) from storage (Azure Blob, Data Lake) and preprocesses it (resize, normalize). This is often done in parallel using multiple CPU cores.

2. Transfer to GPU: The preprocessed data is copied from CPU RAM to GPU memory (VRAM) over the PCIe bus. This transfer is a common bottleneck, and although VRAM is fast, it is limited (e.g., 16 GB or 32 GB on V100, 40 GB or 80 GB on A100).

3. Kernel Execution: The GPU executes CUDA kernels (small programs) that perform the matrix multiplications, convolutions, and similar operations. Thousands of threads run simultaneously.

4. Result Retrieval: After computation, results are copied back to CPU RAM for further processing or saving.

Azure GPU VMs use NVIDIA drivers and CUDA libraries. The VM size determines how many GPUs you get, the amount of VRAM per GPU, and the CPU/RAM configuration.
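To put the transfer step in rough numbers, the sketch below estimates how long one batch takes to cross the PCIe bus. It assumes the 16 GB/s figure this chapter quotes for PCIe gen3 x16 as an idealized peak; the batch shape and the function name are illustrative assumptions, not measurements from a real VM.

```python
def transfer_seconds(num_bytes: int, bandwidth_gb_per_s: float = 16.0) -> float:
    """Idealized host-to-GPU copy time; ignores latency and protocol overhead."""
    return num_bytes / (bandwidth_gb_per_s * 1e9)


# Hypothetical batch: 256 RGB images of 224x224 pixels stored as float32 (4 bytes).
batch_bytes = 256 * 3 * 224 * 224 * 4

print(f"Batch size: {batch_bytes / 1e6:.1f} MB")
print(f"Ideal PCIe copy time: {transfer_seconds(batch_bytes) * 1e3:.2f} ms")
```

If the GPU finishes its kernels for a batch faster than the next batch can be prepared and copied, the GPU sits idle while you keep paying for the VM, which is why data loading deserves as much attention as the model itself.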

Key Components, Values, and Defaults

Azure GPU VM Series:
- NC-series: Older, based on NVIDIA Tesla K80. Good for entry-level GPU workloads. Example: Standard_NC6 (6 vCPU, 56 GB RAM, 1 GPU with 12 GB VRAM).
- NCv3-series: Based on NVIDIA Tesla V100. Example: Standard_NC6s_v3 (6 vCPU, 112 GB RAM, 1 GPU with 16 GB VRAM).
- ND-series: Based on NVIDIA Tesla P40. Designed for deep learning training. Example: Standard_ND6s (6 vCPU, 112 GB RAM, 1 GPU with 24 GB VRAM).
- NDv2-series: Based on NVIDIA Tesla V100 with NVLink for high-speed GPU-to-GPU communication. Example: Standard_ND40s_v2 (40 vCPU, 672 GB RAM, 8 GPUs each with 32 GB VRAM).
- NV-series: Based on NVIDIA Tesla M60. Designed for visualization and remote desktop. Not ideal for compute-heavy AI.

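Pricing Model: Azure quotes an hourly rate for the entire VM (CPU, RAM, GPU, and local temporary storage) and bills for the time the VM stays allocated. The GPU is not billed separately, and you pay for the VM whether you use the GPU or not. Managed disks attached to the VM are billed separately. Spot VMs can reduce costs by up to 90% but can be evicted with 30 seconds' notice.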

Region Availability: Not all regions have GPU VMs. For example, NCv3 is available in East US, West US 2, North Europe, etc. Always check regional availability before deployment.

Reserved Instances: You can reserve GPU VMs for 1 or 3 years to save up to 72% compared to pay-as-you-go.
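The three pricing models can be compared with a few lines of arithmetic. This is a minimal sketch: the $3.00 pay-as-you-go rate is an assumed illustrative figure, and the 72% and 90% values are the "up to" ceilings quoted in this chapter, so real discounts will often be smaller.

```python
def effective_hourly_rate(payg_rate: float, discount: float) -> float:
    """Apply a fractional discount (0.0 to 1.0) to a pay-as-you-go hourly rate."""
    if not 0.0 <= discount <= 1.0:
        raise ValueError("discount must be between 0.0 and 1.0")
    return payg_rate * (1.0 - discount)


PAYG_RATE = 3.00  # assumed illustrative hourly rate in USD
BEST_CASE_DISCOUNT = {
    "pay-as-you-go": 0.0,
    "reserved (best case)": 0.72,
    "spot (best case)": 0.90,
}

for model, discount in BEST_CASE_DISCOUNT.items():
    print(f"{model}: ${effective_hourly_rate(PAYG_RATE, discount):.2f}/hr")
```

The lowest number is not automatically the right choice: reservations only pay off if the VM is actually used for most of the term, and Spot capacity can disappear mid-job.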

Configuration and Verification Commands

To create a GPU VM via Azure CLI:

az vm create \
  --resource-group myResourceGroup \
  --name myGPUVM \
  --image Ubuntu2204 \
  --size Standard_NC6s_v3 \
  --admin-username azureuser \
  --generate-ssh-keys

To verify GPU presence inside the VM:

nvidia-smi

Example output:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.86    Driver Version: 470.86    CUDA Version: 11.4           |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  On   | 00000000:00:04.0 Off |                    0 |
| N/A   36C    P0    23W / 250W |      0MiB / 16384MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
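If you want the same information in a script rather than a table, `nvidia-smi` also offers a CSV query mode. The sketch below wraps it in Python; the helper names are hypothetical, and it simply returns an empty list when `nvidia-smi` is not installed, so it is safe to run on a non-GPU machine.

```python
import shutil
import subprocess
from typing import Dict, List

QUERY = [
    "nvidia-smi",
    "--query-gpu=name,memory.total,driver_version",
    "--format=csv,noheader",
]


def parse_gpu_csv(text: str) -> List[Dict[str, str]]:
    """Turn 'name, memory, driver' CSV lines into one dict per GPU."""
    gpus = []
    for line in text.strip().splitlines():
        name, memory, driver = [field.strip() for field in line.split(",")]
        gpus.append({"name": name, "memory_total": memory, "driver_version": driver})
    return gpus


def list_gpus() -> List[Dict[str, str]]:
    """Return detected GPUs, or an empty list if nvidia-smi is missing or fails."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(QUERY, capture_output=True, text=True, check=False)
    return parse_gpu_csv(result.stdout) if result.returncode == 0 else []


if __name__ == "__main__":
    print(list_gpus() or "No NVIDIA GPU detected")
```

An empty result on an N-series VM usually points to a missing or failed driver installation rather than a missing GPU.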

How GPU Compute Interacts with Related Technologies

Azure Machine Learning: When you use Azure ML, you can specify a GPU cluster as the compute target. Azure ML manages the VM lifecycle, scaling, and job distribution.

Azure Kubernetes Service (AKS): You can deploy GPU-enabled node pools for containerized AI workloads. This allows multi-tenant GPU sharing via NVIDIA MIG (Multi-Instance GPU) or time-slicing.

Azure Batch: For large-scale parallel training, Azure Batch can spin up a pool of GPU VMs, run your job, and tear them down automatically.

Cost Management: Azure Cost Management + Billing can tag GPU VMs and set budgets. Use Azure Advisor to get recommendations for reserved instances or right-sizing.

Cost Optimization Strategies

Right-size VMs: Choose the smallest GPU VM that fits your model's VRAM requirements. A model that fits in 16 GB VRAM does not need a 32 GB GPU.

Use Spot VMs: For non-critical training jobs that can be interrupted, Spot VMs reduce cost drastically. However, they can be evicted when Azure needs capacity.

Turn off idle VMs: GPU VMs are expensive when running but not doing work. Use Azure Automation or DevTest Labs auto-shutdown.

Use Reserved Instances: If you have predictable, steady-state GPU workloads, reserve for 1 or 3 years.

Consider managed, on-demand GPU compute: Azure Machine Learning serverless compute or Azure Databricks GPU clusters with auto-scaling can be more cost-effective for intermittent workloads, because nodes are provisioned only while jobs run.
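Right-sizing can be expressed as a simple selection rule: among the sizes whose per-GPU VRAM fits the model, pick the cheapest. The sketch below uses three sizes described in this chapter with the approximate hourly rates from its FAQ; treat the catalog as illustrative data, not a current price list.

```python
from typing import List, NamedTuple, Optional


class GpuSize(NamedTuple):
    name: str
    vram_gb_per_gpu: int
    approx_hourly_usd: float


# Approximate figures taken from this chapter; verify with the Azure Pricing Calculator.
CATALOG: List[GpuSize] = [
    GpuSize("Standard_NC6", 12, 0.90),
    GpuSize("Standard_NC6s_v3", 16, 3.00),
    GpuSize("Standard_ND40s_v2", 32, 20.00),
]


def smallest_fit(required_vram_gb: float, catalog: List[GpuSize] = CATALOG) -> Optional[GpuSize]:
    """Return the cheapest size whose per-GPU VRAM covers the requirement, if any."""
    candidates = [size for size in catalog if size.vram_gb_per_gpu >= required_vram_gb]
    return min(candidates, key=lambda size: size.approx_hourly_usd, default=None)


for need in (8, 14, 64):
    choice = smallest_fit(need)
    print(f"{need} GB ->", choice.name if choice else "no single-GPU fit; shard the model")
```

This rule only looks at VRAM and price. In practice, GPU generation matters too, since an older, cheaper GPU can take so much longer that the total job cost ends up higher.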

Common Cost Traps

Leaving VMs running overnight: Forgetting to deallocate a GPU VM can cost hundreds of dollars per month. Always set auto-shutdown.

Overprovisioning VRAM: Choosing a VM with 32 GB VRAM when your model only needs 8 GB. You pay for the entire VM.

Ignoring data transfer costs: Moving large training datasets between regions incurs bandwidth (egress) charges. Keep storage in the same region as the GPU VM; for large on-premises datasets, a dedicated connection such as Azure ExpressRoute can help.

Overpaying for disk tiers: Premium SSDs for OS and data disks add cost. Reserve them for data that training reads heavily, and use standard HDD for infrequently accessed data.
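The first trap is easy to quantify. The sketch below multiplies an hourly rate by the hours a VM sits idle; the 14 idle hours per night and the 30-day month are assumptions chosen for illustration, and the rates are the approximate ones quoted in this chapter's FAQ.

```python
def idle_cost(hourly_rate: float, idle_hours_per_day: float, days: int) -> float:
    """Compute charges for time a running VM spends doing no useful work."""
    return hourly_rate * idle_hours_per_day * days


# Assumed pattern: VM left running from 6 PM to 8 AM (14 hours) every day for 30 days.
for name, rate in (("Standard_NC6", 0.90), ("Standard_NC6s_v3", 3.00)):
    print(f"{name}: ${idle_cost(rate, 14, 30):,.2f} wasted per month")
```

Auto-shutdown removes this cost entirely, provided the shutdown deallocates the VM.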

Walk-Through

1. Select GPU VM Size

Begin by determining the VRAM requirements of your AI model. For example, a BERT-base model requires about 1.5 GB VRAM for inference, while GPT-3 175B requires multiple A100 GPUs with 80 GB each. Choose the smallest VM size that fits your model's peak VRAM usage. Azure offers NC, NCv3, ND, NDv2, and NV series. Use the Azure Pricing Calculator to estimate hourly costs. For multi-GPU training, prefer the NDv2-series, which connects its GPUs with NVLink. For inference, NCv3 is often sufficient. Verify availability in your region using `az vm list-skus --location eastus --size Standard_NC6s_v3 --output table`.
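A quick lower bound on VRAM comes from the parameter count alone: parameters times bytes per parameter. The sketch below applies that rule; it counts weights only, so activations, optimizer state, and the CUDA context (all of which add to real usage) are deliberately ignored, and the function names are illustrative.

```python
import math

BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}


def weight_memory_gb(num_params: float, dtype: str = "fp16") -> float:
    """Memory for model weights only, in decimal gigabytes."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


def min_gpus_for_weights(num_params: float, vram_gb_per_gpu: float, dtype: str = "fp16") -> int:
    """Fewest GPUs whose combined VRAM can hold the weights (a lower bound)."""
    return math.ceil(weight_memory_gb(num_params, dtype) / vram_gb_per_gpu)


# GPT-3 scale: 175 billion parameters in fp16 on 80 GB GPUs.
print(f"Weights: {weight_memory_gb(175e9):.0f} GB")
print(f"Minimum 80 GB GPUs: {min_gpus_for_weights(175e9, 80)}")
```

Because this is a floor rather than an estimate of peak usage, measure actual memory with `nvidia-smi` on a small test run before committing to a size.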

2. Deploy VM with GPU Drivers

Create the VM using Azure CLI, PowerShell, or the portal. Choose a Linux distribution that supports NVIDIA drivers (Ubuntu 18.04/20.04 LTS, CentOS 7). After creation, SSH into the VM and install the NVIDIA drivers, or let the NVIDIA GPU Driver Extension for Azure install them. Use the CUDA toolkit version your framework was built against (e.g., CUDA 11.2 for TensorFlow 2.7). Run `nvidia-smi` to verify the GPU is detected. The output shows GPU model, driver version, memory usage, and temperature. If the GPU is not listed, confirm that you chose an N-series size and that the driver installation succeeded. Every N-series VM has a GPU, but NV-series GPUs are intended for visualization rather than compute.

3. Install AI Framework and Libraries

Install a deep learning framework such as TensorFlow or PyTorch. (scikit-learn is a classical machine learning library and does not use the GPU.) Use conda or pip to create a virtual environment. For example, `conda create -n tf-gpu tensorflow-gpu=2.7 cudatoolkit=11.2`. Verify that the framework recognizes the GPU: in TensorFlow, `tf.config.list_physical_devices('GPU')` should return a non-empty list. In PyTorch, `torch.cuda.is_available()` should return True. If the framework does not see the GPU, check that the CUDA and cuDNN versions match the framework's requirements. This step is critical: it is easy to skip the check, silently run on CPU, and pay GPU prices for CPU performance.
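Both framework checks can be combined into one script that you run right after setting up the environment. This is a minimal sketch with a hypothetical helper name; it reports `None` for any framework that is not installed instead of failing, so it works in either a TensorFlow or a PyTorch environment.

```python
from typing import Dict, Optional


def gpu_visible_to_frameworks() -> Dict[str, Optional[bool]]:
    """Report whether each installed framework can see a GPU (None = not installed)."""
    report: Dict[str, Optional[bool]] = {"pytorch": None, "tensorflow": None}

    try:
        import torch
        report["pytorch"] = bool(torch.cuda.is_available())
    except ImportError:
        pass  # PyTorch is not installed in this environment

    try:
        import tensorflow as tf
        report["tensorflow"] = len(tf.config.list_physical_devices("GPU")) > 0
    except ImportError:
        pass  # TensorFlow is not installed in this environment

    return report


if __name__ == "__main__":
    for framework, visible in gpu_visible_to_frameworks().items():
        status = "not installed" if visible is None else ("GPU visible" if visible else "CPU only")
        print(f"{framework}: {status}")
```

A "CPU only" result on a VM where `nvidia-smi` works usually means the framework build and the installed CUDA or cuDNN versions do not match.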

4. Run Training Job and Monitor Costs

Execute your training script. Monitor GPU utilization with `nvidia-smi -l 1` (updates every second). Low GPU utilization (e.g., <50%) indicates a bottleneck, often data loading or CPU preprocessing. Use `nvtop` or `watch -n 1 nvidia-smi` for real-time monitoring. Meanwhile, set up Azure Cost Management alerts for the VM resource group. Use tags like `Environment:Training` to track costs. If the job is long-running, consider using Azure Machine Learning to automatically manage the compute target and scale down when idle. After training, deallocate the VM to stop compute billing. Shutting the VM down from inside the guest OS stops it without deallocating it, and a stopped-but-allocated VM still incurs compute costs.
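The 50% rule of thumb can be turned into an automated check. The sketch below averages a list of utilization samples and flags a likely bottleneck; the sample values are made up, and in practice they could be collected with `nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader,nounits`.

```python
from statistics import mean
from typing import Sequence


def is_underutilized(samples_pct: Sequence[float], threshold_pct: float = 50.0) -> bool:
    """True when average GPU utilization falls below the threshold."""
    if not samples_pct:
        raise ValueError("need at least one utilization sample")
    return mean(samples_pct) < threshold_pct


# Hypothetical one-second samples from two different training runs.
input_bound_run = [22, 35, 18, 41, 30]
healthy_run = [91, 88, 95, 93, 90]

print("input-bound run underutilized:", is_underutilized(input_bound_run))
print("healthy run underutilized:", is_underutilized(healthy_run))
```

Averaging over a window matters because utilization naturally dips between batches; a single low reading is not evidence of a problem.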

5. Scale Out with Multi-GPU or Cluster

If a single GPU is insufficient, scale to multiple GPUs within a VM (e.g., ND40s_v2 has 8 GPUs) or across VMs using distributed training frameworks like Horovod or PyTorch DDP. For multi-VM scaling, use Azure CycleCloud or AKS with GPU node pools. Each additional GPU adds cost linearly. Use Azure Batch for elastic scaling—it spins up VMs only when needed. Monitor network latency between GPUs; NVLink provides 300 GB/s bandwidth, while PCIe gen3 x16 provides 16 GB/s. For cross-VM communication, use InfiniBand (available on NDv2) for low latency. Always test scaling efficiency; adding GPUs does not always linearly reduce training time due to communication overhead.
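Scaling efficiency is the measured speedup divided by the number of GPUs, and it determines whether extra GPUs lower or raise the total bill. The sketch below works through one case; the training times and the per-GPU hourly rate are hypothetical numbers chosen to make the arithmetic easy to follow.

```python
def scaling_efficiency(single_gpu_hours: float, multi_gpu_hours: float, num_gpus: int) -> float:
    """Speedup divided by GPU count; 1.0 would be perfect linear scaling."""
    speedup = single_gpu_hours / multi_gpu_hours
    return speedup / num_gpus


def run_cost(hours: float, num_gpus: int, rate_per_gpu_hour: float) -> float:
    """Total cost of a run when cost grows linearly with GPU count."""
    return hours * num_gpus * rate_per_gpu_hour


# Hypothetical measurements: 80 hours on 1 GPU, 12.5 hours on 8 GPUs, $3.00 per GPU-hour.
print(f"Efficiency on 8 GPUs: {scaling_efficiency(80, 12.5, 8):.0%}")
print(f"1-GPU run cost: ${run_cost(80, 1, 3.00):.2f}")
print(f"8-GPU run cost: ${run_cost(12.5, 8, 3.00):.2f}")
```

In this example the 8-GPU run finishes much sooner but costs more in total, because efficiency is below 100%. Whether that trade is worth it depends on how much the earlier result is worth to you.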

What This Looks Like on the Job

Enterprise Scenario 1: Large-Scale Model Training with Azure ML

A healthcare company needs to train a medical image classification model on 10 million CT scans. They use Azure Machine Learning with a GPU compute cluster of 8 ND40s_v2 VMs (each with 8 V100 GPUs, total 64 GPUs). The training takes 2 weeks. The cost, at an illustrative $5.00 per hour per VM: $5 * 8 * 24 * 14 = $13,440. (The FAQ in this chapter quotes about $20.00/hr for this size, so treat the figure as arithmetic practice rather than a price quote.) To optimize, they use Spot VMs for 80% of the cluster, reducing cost to ~$4,000, but they must implement checkpointing to handle evictions. They also use Azure Blob storage with premium tier for fast data loading. The key challenge: data transfer from storage to GPU is a bottleneck. They use Azure Data Lake Storage Gen2 with high throughput and mount it via blobfuse. They also use NVIDIA DALI for optimized data loading. Misconfiguration: Initially they used standard HDD for the OS disk, causing slow boot times and driver installation failures. They switched to Premium SSD.
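The scenario's arithmetic can be reproduced in a few lines. This sketch uses the scenario's own figures (8 VMs, $5.00 per VM-hour, 14 days, 80% of the cluster on Spot) and assumes the best-case 90% Spot discount, which is why it lands slightly under the scenario's rounded ~$4,000.

```python
def cluster_cost(num_vms: float, hourly_rate: float, hours: float) -> float:
    """Pay-as-you-go cost for a fixed-size cluster."""
    return num_vms * hourly_rate * hours


def mixed_cluster_cost(num_vms: int, hourly_rate: float, hours: float,
                       spot_fraction: float, spot_discount: float) -> float:
    """Cost when part of the cluster runs on discounted Spot VMs."""
    on_demand = cluster_cost(num_vms * (1 - spot_fraction), hourly_rate, hours)
    spot = cluster_cost(num_vms * spot_fraction, hourly_rate * (1 - spot_discount), hours)
    return on_demand + spot


HOURS = 24 * 14  # two weeks of continuous training

print(f"All pay-as-you-go: ${cluster_cost(8, 5.00, HOURS):,.2f}")
print(f"80% Spot at 90% off: ${mixed_cluster_cost(8, 5.00, HOURS, 0.8, 0.9):,.2f}")
```

The calculation assumes evictions cost nothing. In reality, each eviction replays work since the last checkpoint, so frequent checkpointing is what keeps the real bill close to this estimate.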

Enterprise Scenario 2: Real-Time Inference with GPU

A financial services firm deploys a fraud detection model that must score transactions in under 10 ms. They use Azure Kubernetes Service (AKS) with a GPU node pool of NC6s_v3 VMs (each with 1 V100). They use NVIDIA Triton Inference Server for model serving. The cluster initially auto-scaled on CPU utilization, but GPU memory is the limiting factor, since each V100 can handle 4 concurrent inference requests. They therefore set a custom Horizontal Pod Autoscaler based on GPU memory usage. Cost, at an illustrative $0.90 per hour per node: 10 nodes running 24/7 = $6,480/month. (The FAQ in this chapter quotes about $3.00/hr for NC6s_v3, so use the Azure Pricing Calculator for real estimates.) They optimize by using reserved instances for the base load (1 year, 60% discount) and spot instances for burst. Problem: When a new model version is deployed, the GPU driver must be compatible; they once deployed a model requiring CUDA 11.0 on a node with CUDA 10.2, causing inference failures. They now use containerized environments with specific CUDA base images.
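Capacity planning for this scenario follows directly from the concurrency limit. The sketch below sizes the node pool and reproduces the monthly cost using the scenario's figures (4 concurrent requests per GPU, $0.90 per node-hour, a 30-day month); the peak load of 40 concurrent requests is an assumption chosen to match the 10 nodes described.

```python
import math


def nodes_needed(peak_concurrent_requests: int, requests_per_gpu: int = 4) -> int:
    """Single-GPU nodes required to serve the peak concurrency."""
    return math.ceil(peak_concurrent_requests / requests_per_gpu)


def monthly_cost(num_nodes: int, hourly_rate: float, hours: int = 24 * 30) -> float:
    """Cost of running the node pool around the clock for a 30-day month."""
    return num_nodes * hourly_rate * hours


nodes = nodes_needed(40)  # assumed peak load
print(f"Nodes required: {nodes}")
print(f"Monthly cost: ${monthly_cost(nodes, 0.90):,.2f}")
```

Sizing for peak around the clock is the expensive default. Splitting the pool into a reserved base and a burst tier, as the firm does, only pays for peak capacity while the peak lasts.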

Enterprise Scenario 3: GPU Compute for Research and Development

A university research lab uses Azure for AI experiments. They have limited budget and need to run many short jobs. They use Azure Batch with a pool of NC6 VMs (K80 GPUs) and set the pool to auto-scale down to zero when idle. They also use Azure DevTest Labs to set auto-shutdown at 6 PM daily. They store datasets on Azure Files with standard tier to minimize storage costs. The challenge: students often forget to deallocate VMs. They implement Azure Policy to enforce tags and auto-shutdown. They also use Azure Cost Management budgets with alerts at 80% and 100% of monthly budget. This scenario highlights the importance of governance in GPU cost management.

How AI-900 Actually Tests This

What AI-900 Tests on This Topic (Objective 1.1)

The AI-900 exam focuses on identifying appropriate Azure services for AI workloads, including compute options. Specifically, you need to know:

The difference between CPU and GPU for AI workloads.

Which Azure VM series are GPU-optimized (N-series).

The primary use case for GPU compute: training deep learning models.

Basic cost considerations: pay-as-you-go, reserved instances, spot VMs.

How to choose between GPU and CPU based on workload type (training vs. inference, batch vs. real-time).

Common Wrong Answers and Why Candidates Choose Them

1. "GPUs are always faster and cheaper than CPUs for AI." This is false because for small models or inference on simple data, CPU may be sufficient and cheaper. Candidates overgeneralize from training use cases.

2. "Azure charges extra for GPU usage." Actually, the GPU cost is bundled into the VM hourly rate. Candidates think it's an add-on like a software license.

3. "You can only use GPU VMs with Azure Machine Learning." GPU VMs can be used directly via IaaS, or with AKS, Batch, etc. Azure ML is just one option.

4. "Spot VMs are always the cheapest option." While cheaper, they can be evicted, so they are not suitable for critical production workloads. Candidates ignore the reliability trade-off.

Specific Numbers and Terms That Appear on the Exam

N-series: The family name for GPU VMs. Know that NC, ND, NV are sub-series.

CUDA: NVIDIA's parallel computing platform. The exam may ask which framework enables GPU acceleration (CUDA).

Pay-as-you-go vs. Reserved vs. Spot: Understand the trade-offs.

Auto-shutdown: A feature to reduce costs by automatically deallocating VMs.

Azure Cost Management: The tool to monitor and control spending.

Edge Cases and Exceptions

GPU not available in all regions: Always check regional SKU availability.

Driver compatibility: Not all AI frameworks work with all GPU drivers. The exam may present a scenario where a framework fails to detect GPU due to driver mismatch.

NV-series for visualization: NV-series GPUs (M60) are for remote desktop, not compute. Candidates might think all N-series are for AI.

How to Eliminate Wrong Answers Using the Underlying Mechanism

If a question asks about the best compute for training a large neural network, eliminate options that mention CPU-only VMs, or services like Azure Functions (which have time limits and no GPU). If the question emphasizes cost savings for intermittent workloads, look for Spot VMs or auto-scaling. If the question mentions real-time inference with low latency, GPU is preferred but consider also Azure ML endpoints with GPU. The key is to match the workload characteristics (batch vs. real-time, training vs. inference, critical vs. non-critical) to the appropriate compute and pricing model.

Key Takeaways

GPU compute is essential for training deep learning models due to massive parallelism.

Azure GPU VMs are part of the N-series: NC, ND, NV, etc.

GPU cost is included in the VM hourly rate; no separate GPU charge.

Spot VMs can reduce costs by up to 90% but risk eviction.

Reserved instances offer up to 72% discount for 1- or 3-year commitments.

Always verify GPU driver and CUDA compatibility with your AI framework.

Use auto-shutdown and Azure Cost Management to avoid runaway costs.

Data transfer to GPU VM can be a bottleneck; keep data in same region.

Multi-GPU scaling requires NVLink or InfiniBand for low latency.

For inference, consider CPU if latency and throughput requirements are low.

Easy to Mix Up

These come up on the exam all the time. Here's how to tell them apart.

Pay-as-you-go GPU VM

No upfront cost; pay per hour of usage.

Highest hourly rate; no discount.

Flexible: can stop/start anytime without penalty.

Best for short-term or unpredictable workloads.

Spot VMs are a separate usage-based option that trades eviction risk for a deeper discount.

Reserved GPU VM Instance

Requires 1- or 3-year commitment upfront.

Up to 72% discount compared to pay-as-you-go.

Less flexible; early termination incurs fees.

Best for steady-state, predictable GPU workloads.

Cannot be used with Spot pricing.

Watch Out for These

Mistake

GPU VMs are only available in Azure Machine Learning.

Correct

GPU VMs are available as regular IaaS VMs (N-series) in Azure. You can deploy them via portal, CLI, or ARM templates. Azure ML is one service that can use GPU compute, but it's not the only way.

Mistake

You pay extra for the GPU on top of the VM cost.

Correct

The GPU is included in the VM hourly rate. There is no separate charge for the GPU itself. However, you pay for the entire VM, including CPU, RAM, and GPU.

Mistake

All N-series VMs have the same GPU performance.

Correct

N-series includes different GPU models: NC (K80), NCv3 (V100), ND (P40), NDv2 (V100 with NVLink), NV (M60). Performance varies significantly. Among these, NDv2 is the strongest choice for AI training; NV is appropriate for visualization.

Mistake

Spot VMs are ideal for production AI workloads.

Correct

Spot VMs can be evicted with 30 seconds notice, so they are not suitable for production workloads that require reliability. They are best for batch jobs, dev/test, or fault-tolerant training with checkpointing.

Mistake

GPU compute is always cheaper than CPU for AI.

Correct

For small models or infrequent inference, CPU may be more cost-effective because GPU VMs are more expensive per hour. GPU is only cost-effective when utilization is high and parallelism is exploited.


Frequently Asked Questions

What Azure VM sizes are available for GPU compute?

Azure offers N-series VMs for GPU compute. The main sub-series are NC (NVIDIA K80), NCv3 (NVIDIA V100), ND (NVIDIA P40), NDv2 (NVIDIA V100 with NVLink), and NV (NVIDIA M60 for visualization). For AI training, NCv3 and NDv2 are most common. Always check regional availability, as not all sizes are available in every region.

How much does a GPU VM cost per hour?

Costs vary by size and region. For example, Standard_NC6 (1 K80) is approximately $0.90/hr, Standard_NC6s_v3 (1 V100) is about $3.00/hr, and Standard_ND40s_v2 (8 V100) is about $20.00/hr. Use the Azure Pricing Calculator for exact rates. Reserved instances can reduce costs significantly.

Can I use GPU VMs with Azure Kubernetes Service (AKS)?

Yes, AKS supports GPU-enabled node pools. You can create a node pool with N-series VMs and deploy GPU workloads using Kubernetes. You need to install NVIDIA device plugin and configure your containers to request GPU resources. This is common for inference serving.

What is the difference between stopping and deallocating a GPU VM?

Stopping a VM from inside the guest OS, or with `az vm stop`, powers it off but keeps it allocated, so you continue to pay for compute. Deallocating releases the underlying hardware and stops compute billing, but you still pay for attached storage. To stop compute charges, deallocate the VM: use Stop in the Azure portal (which deallocates) or run `az vm deallocate` from the CLI.

How do I check if my AI framework is using the GPU?

Use framework-specific commands: in TensorFlow, run `tf.config.list_physical_devices('GPU')`; in PyTorch, `torch.cuda.is_available()`. Also, run `nvidia-smi` to see GPU utilization. If the framework does not detect the GPU, check driver installation and CUDA version compatibility.

What is a Spot VM and when should I use it for GPU workloads?

A Spot VM uses unused Azure capacity and offers up to 90% discount but can be evicted with 30 seconds notice. Use Spot VMs for batch training jobs that are fault-tolerant (e.g., with checkpointing), dev/test, or non-critical workloads. Do not use for production inference or long-running critical training.

How can I reduce GPU VM costs?

Use right-sizing (choose the smallest GPU VM that fits the model), auto-shutdown, Spot VMs for interruptible jobs, reserved instances for steady workloads, and Azure Cost Management budgets. Also consider managed options such as Azure Machine Learning serverless compute or Azure Databricks with auto-scaling.
