Your Vertex AI custom training job is failing with an out-of-memory error on a single GPU. You need to reduce memory usage without changing the model architecture. Which approach should you try first?
Decreasing batch size reduces memory but may affect convergence; it is a valid approach but mixed precision is often tried first.
Why this answer
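Mixed precision (FP16) training stores activations in 16-bit rather than 32-bit floats, which roughly halves activation memory. The overall saving is usually less than half, because frameworks typically keep an FP32 master copy of the weights and the optimizer state. Gradient accumulation does not reduce memory on its own: it preserves the effective batch size while each step processes a smaller micro-batch, so it is a way to offset a batch size reduction rather than a separate memory fix. Reducing batch size directly reduces activation memory, but it also changes the effective batch size unless it is paired with gradient accumulation.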
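The arithmetic behind these trade-offs can be sketched in a few lines. This is a rough back-of-the-envelope estimate, not a profiler: the function names, the batch sizes, and the figure of 50 million stored activation values per sample are all hypothetical, and it ignores weights, optimizer state, and framework overhead.

```python
# Bytes per element for the two floating-point formats discussed above.
BYTES_PER_ELEMENT = {"fp32": 4, "fp16": 2}


def activation_bytes(batch_size, activations_per_sample, dtype):
    """Rough activation memory for one training step (hypothetical model)."""
    return batch_size * activations_per_sample * BYTES_PER_ELEMENT[dtype]


def effective_batch_size(micro_batch_size, accumulation_steps):
    """Samples contributing to each optimizer update under gradient accumulation."""
    return micro_batch_size * accumulation_steps


# Illustrative assumption: 50 million stored activation values per sample.
ACTIVATIONS_PER_SAMPLE = 50_000_000
GIB = 1024 ** 3

baseline = activation_bytes(64, ACTIVATIONS_PER_SAMPLE, "fp32")
mixed_precision = activation_bytes(64, ACTIVATIONS_PER_SAMPLE, "fp16")
smaller_batch = activation_bytes(32, ACTIVATIONS_PER_SAMPLE, "fp32")

print(f"fp32, batch 64: {baseline / GIB:.2f} GiB")
print(f"fp16, batch 64: {mixed_precision / GIB:.2f} GiB")
print(f"fp32, batch 32: {smaller_batch / GIB:.2f} GiB")

# Two accumulation steps of 32 samples restore the original batch of 64.
print("effective batch:", effective_batch_size(32, 2))
```

In this simplified model, FP16 activations and a halved batch size save the same amount of activation memory, but only mixed precision leaves the batch size untouched. Halving the batch needs two accumulation steps to recover the original effective batch size.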
Model parallelism is more complex: it splits the model across multiple devices, which means moving beyond the single-GPU setup. The simplest first step is to use mixed precision.