A data scientist is fine-tuning a 7-billion-parameter large language model from Hugging Face using Vertex AI Training on GPUs. The model does not fit in the memory of a single GPU. They need to split the model across multiple GPUs and also train with data parallelism. Which strategy should they use?
Option B combines model parallelism with data parallelism, which suits models too large to fit on a single GPU.
Why this answer
Option B is correct because it combines pipeline parallelism (via DeepSpeed), which splits the 7B-parameter model into stages placed on different GPUs, with data parallelism (via PyTorch DDP), which replicates the partitioned model across groups of workers so that each replica trains on a different slice of every batch. Vertex AI distributed training coordinates the multi-worker setup. Of the options listed, this is the only one that handles a model exceeding single-GPU memory while still providing data parallelism.
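
A rough back-of-the-envelope estimate shows why a 7B-parameter model exceeds one GPU during training. The sketch below assumes mixed-precision training with Adam at roughly 16 bytes per parameter (fp16 weights and gradients plus fp32 master weights, momentum, and variance), ignores activation memory, and uses a hypothetical 80 GB GPU with 80% of memory usable. The function names and the usable fraction are illustrative assumptions, not values from the question.

```python
import math

# Assumed cost per parameter for mixed-precision Adam:
# 2 bytes fp16 weights + 2 bytes fp16 gradients
# + 12 bytes fp32 master weights, momentum, and variance.
BYTES_PER_PARAM = 2 + 2 + 12


def training_state_gb(num_params: int, bytes_per_param: int = BYTES_PER_PARAM) -> float:
    """Approximate memory (GB) for weights, gradients, and optimizer state."""
    return num_params * bytes_per_param / 1e9


def min_model_shards(num_params: int, gpu_memory_gb: float,
                     usable_fraction: float = 0.8) -> int:
    """Smallest number of GPUs the training state must be split across.

    usable_fraction is a hypothetical allowance for activations and
    framework overhead; tune it for a real workload.
    """
    budget_gb = gpu_memory_gb * usable_fraction
    return math.ceil(training_state_gb(num_params) / budget_gb)


if __name__ == "__main__":
    params = 7_000_000_000
    print(f"training state: {training_state_gb(params):.0f} GB")
    print(f"minimum shards on an 80 GB GPU: {min_model_shards(params, 80.0)}")
```

Under these assumptions the training state alone is larger than the memory of one GPU, so the model has to be partitioned before any replica can be created for data parallelism.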
Exam trap
The trap is confusing 'data parallelism', which replicates the whole model on every device, with 'model parallelism', which splits the model across devices. Candidates then assume a single strategy such as DDP or a mirrored strategy is enough. It is not: a model that does not fit on one GPU must first be partitioned with pipeline or tensor parallelism, and only then can data parallelism be applied on top.
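
The difference between the two kinds of parallelism can be sketched as a rank layout. The example below is a minimal illustration, assuming a simple ordering in which consecutive ranks form one pipeline replica; the function name and the ordering are hypothetical and are not DeepSpeed's actual topology code.

```python
from typing import List, Tuple


def build_hybrid_layout(world_size: int,
                        pipeline_stages: int) -> Tuple[List[List[int]], List[List[int]]]:
    """Group GPU ranks for pipeline parallelism combined with data parallelism.

    Returns (pipeline_groups, data_parallel_groups):
    - each pipeline group holds one full copy of the model, split into stages;
    - each data-parallel group holds the same stage from every replica, and
      those ranks would average gradients with each other.
    """
    if world_size % pipeline_stages != 0:
        raise ValueError("world_size must be a multiple of pipeline_stages")

    replicas = world_size // pipeline_stages

    # Assumed ordering: rank = replica_index * pipeline_stages + stage_index.
    pipeline_groups = [
        [replica * pipeline_stages + stage for stage in range(pipeline_stages)]
        for replica in range(replicas)
    ]
    data_parallel_groups = [
        [replica * pipeline_stages + stage for replica in range(replicas)]
        for stage in range(pipeline_stages)
    ]
    return pipeline_groups, data_parallel_groups


if __name__ == "__main__":
    pipelines, dp_groups = build_hybrid_layout(world_size=8, pipeline_stages=4)
    print("pipeline replicas:", pipelines)
    print("data-parallel groups:", dp_groups)
```

With 8 GPUs and 4 pipeline stages, this layout yields two model replicas of four stages each. Pure data parallelism corresponds to pipeline_stages=1, where every rank must hold the entire model, which is exactly what fails for a model that exceeds single-GPU memory.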
How to eliminate wrong answers
Option A is wrong because Vertex AI AutoML is a no-code automated ML service that does not support custom model architectures or manual distribution of large language models; it cannot handle a 7B-parameter model that requires custom parallelism strategies.
Option C is wrong because hyperparameter tuning optimizes training hyperparameters (e.g., learning rate) across multiple trials, but does not address the fundamental need to split a model across GPUs or enable data parallelism.
Option D is wrong because a multi-worker mirrored strategy with TensorFlow mirrors the entire model on every GPU, so the model must fit on a single GPU, and the 7B-parameter model exceeds that limit; additionally, TF_CONFIG setup does not provide pipeline parallelism to split the model across devices.