A team is fine-tuning a large language model (LLaMA 2) on Vertex AI, using a custom container on a multi-node GPU cluster. The model does not fit into a single GPU's memory, so they need model parallelism to spread it across multiple GPUs. Which distributed training strategy should they use?
Pipeline parallelism is the appropriate model parallelism technique for large models; it must be manually configured.
Why this answer
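Pipeline parallelism is a form of model parallelism that splits the model's layers into consecutive stages and places each stage on a different device, so no single GPU has to hold the whole model. When a model does not fit on one GPU, some form of partitioning like this is necessary. Plain data parallelism does not help, because every device keeps a full replica of the model. ZeRO stages 1 and 2 shard only optimizer states and gradients, so the full set of parameters still sits on each device; ZeRO stage 3 does shard the parameters as well, but it is a sharded form of data parallelism rather than the model parallelism the question asks for.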
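To make the layer split concrete, here is a minimal sketch in plain Python. It only models the bookkeeping and moves no tensors. The function names partition_layers and forward_schedule are hypothetical, and the figures (32 transformer layers, 4 pipeline stages, 8 micro-batches) are illustrative assumptions, not values from the question. In practice a framework such as DeepSpeed or Megatron-LM performs the partitioning and scheduling.

```python
def partition_layers(num_layers: int, num_stages: int) -> list[range]:
    """Split layer indices into contiguous, near-equal pipeline stages."""
    if not 1 <= num_stages <= num_layers:
        raise ValueError("need 1 <= num_stages <= num_layers")
    base, extra = divmod(num_layers, num_stages)
    stages, start = [], 0
    for stage in range(num_stages):
        # The first `extra` stages take one additional layer each.
        size = base + (1 if stage < extra else 0)
        stages.append(range(start, start + size))
        start += size
    return stages


def forward_schedule(num_stages: int, num_microbatches: int) -> list[list[tuple[int, int]]]:
    """Forward-pass schedule: at tick t, stage s processes micro-batch t - s.

    Returns one list per tick containing (stage, microbatch) pairs that
    run concurrently on different devices.
    """
    ticks = []
    for t in range(num_stages + num_microbatches - 1):
        active = [
            (s, t - s)
            for s in range(num_stages)
            if 0 <= t - s < num_microbatches
        ]
        ticks.append(active)
    return ticks


# Illustrative assumption: 32 layers over 4 GPUs, 8 micro-batches per step.
stages = partition_layers(num_layers=32, num_stages=4)
schedule = forward_schedule(num_stages=4, num_microbatches=8)

for gpu, layers in enumerate(stages):
    print(f"stage {gpu}: layers {layers.start}..{layers.stop - 1}")
print(f"forward pass takes {len(schedule)} ticks")
```

Each stage holds only its own slice of layers, which is what lets the model fit. Splitting the batch into micro-batches keeps several stages busy at once; with a single micro-batch, only one GPU would work at any given tick.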
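Vertex AI custom training provisions the multi-node GPU cluster and runs the container, but it does not apply model parallelism for you. The partitioning must be configured manually in the training code, using a framework such as Megatron-LM or DeepSpeed.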