A company is using Amazon SageMaker to train a deep learning model. The training job fails with a 'CUDA out of memory' error. The training instance is an ml.p3.2xlarge with 16 GB of GPU memory. The model architecture and batch size are appropriate for this instance size. Which change is most likely to resolve the error?
Trap 1: Reduce the number of epochs.
Epochs do not affect per-step memory usage; they affect training time.
Trap 2: Increase the number of GPUs by using a distributed training…
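Adding more GPUs may not help if the memory per GPU is the same: in standard data-parallel training each GPU holds a full copy of the model, and the out-of-memory error is raised per GPU.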
Trap 3: Use a smaller instance type to force lower memory usage.
A smaller instance has even less memory, making the problem worse.
- A
Reduce the number of epochs.
Why wrong: Epochs do not affect per-step memory usage; they affect training time.
- B
Increase the number of GPUs by using a distributed training instance type.
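Why wrong: Adding more GPUs may not help if the memory per GPU is the same: in standard data-parallel training each GPU holds a full copy of the model, and the out-of-memory error is raised per GPU.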
- C
Enable automatic mixed precision (AMP) training to reduce memory usage.
Why correct: AMP runs eligible operations in FP16, which can cut activation memory roughly in half (the master weights stay in FP32), and that is often enough to resolve out-of-memory errors.
- D
Use a smaller instance type to force lower memory usage.
Why wrong: A smaller instance has even less memory, making the problem worse.
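
To make the reasoning behind option C concrete, here is a rough back-of-the-envelope sketch of per-GPU memory for one training step, with and without AMP. Everything in it is an illustrative assumption rather than a measurement: the function names, the parameter count, and the activation count are hypothetical, and the model assumes that all stored activations become FP16 under AMP (optimistic, since some operations stay in FP32) while weights and gradients stay in FP32. Optimizer state, the CUDA context, and allocator fragmentation are ignored.

```python
# Rough per-GPU memory estimate for one training step.
# Assumptions (hypothetical, for illustration only):
#   - weights and gradients are always stored in FP32 (4 bytes each)
#   - under AMP, every stored activation is FP16 (2 bytes); in practice
#     some operations keep FP32 activations, so real savings are smaller
#   - optimizer state and framework overhead are not modeled

BYTES_PER_ELEMENT = {"fp32": 4, "fp16": 2}


def activation_bytes(batch_size: int, elements_per_sample: int, dtype: str) -> int:
    """Bytes needed to keep the forward-pass activations for one step."""
    return batch_size * elements_per_sample * BYTES_PER_ELEMENT[dtype]


def step_bytes(num_params: int, batch_size: int,
               elements_per_sample: int, amp: bool) -> int:
    """Estimated bytes for weights, gradients, and activations on one GPU."""
    weights_and_grads = 2 * num_params * BYTES_PER_ELEMENT["fp32"]
    act_dtype = "fp16" if amp else "fp32"
    return weights_and_grads + activation_bytes(
        batch_size, elements_per_sample, act_dtype
    )


if __name__ == "__main__":
    # Hypothetical model: 100M parameters, 100M activation elements per sample.
    for amp in (False, True):
        total = step_bytes(100_000_000, 32, 100_000_000, amp)
        print(f"amp={amp}: {total / 1e9:.2f} GB")
```

Note that the number of epochs appears nowhere in the estimate, which is why option A has no effect, and that the total shrinks by less than half under AMP because the FP32 weights and gradients are unchanged.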