A machine learning engineer is using Amazon SageMaker to train a deep learning model. The training job is taking longer than expected. The engineer notices that the GPU utilization is low (around 30%) while CPU utilization is high. Which action is most likely to improve training speed?
Increasing the number of data loading workers parallelizes data loading, which reduces the I/O bottleneck and improves GPU utilization.
Why this answer
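Low GPU utilization combined with high CPU utilization suggests a data loading bottleneck: the GPU sits idle while the CPU prepares the next batch. Increasing the number of data loading workers prepares batches in parallel and keeps the GPU fed. Reducing the batch size or using a smaller instance would not help, because neither addresses the input pipeline.
Pipe mode, which streams training data from Amazon S3 instead of downloading it before training starts, might also help, but less directly than increasing the number of workers.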
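As a rough illustration of what "increasing workers" can look like in a training script, here is a minimal sketch. It assumes the script uses PyTorch, which the question does not state. The helper names suggest_num_workers and build_loader, the default batch size, and the "leave one core free" rule are hypothetical choices made for the example, not a SageMaker recommendation.

```python
import os
from typing import Optional


def suggest_num_workers(num_gpus: int = 1,
                        cpu_count: Optional[int] = None,
                        reserve: int = 1) -> int:
    """Hypothetical heuristic: split CPU cores across GPUs,
    keeping `reserve` cores free for the main training process."""
    cores = cpu_count if cpu_count is not None else (os.cpu_count() or 1)
    usable = max(cores - reserve, 1)
    return max(usable // max(num_gpus, 1), 1)


def build_loader(dataset, batch_size: int = 64,
                 num_workers: Optional[int] = None):
    """Build a DataLoader that prepares batches in parallel worker processes."""
    # Imported lazily so the heuristic above works without PyTorch installed.
    from torch.utils.data import DataLoader

    workers = suggest_num_workers() if num_workers is None else num_workers
    kwargs = dict(
        batch_size=batch_size,
        shuffle=True,
        num_workers=workers,  # number of parallel loading processes
        pin_memory=True,      # page-locked memory for faster host-to-GPU copies
    )
    if workers > 0:
        # These options are only valid when worker processes are used.
        kwargs.update(persistent_workers=True, prefetch_factor=2)
    return DataLoader(dataset, **kwargs)
```

The worker count is best treated as a starting point and tuned while watching GPU utilization. If every CPU core is already saturated, adding workers may not help, and moving preprocessing offline or choosing an instance with more vCPUs may be needed instead.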