A data scientist is preparing a large dataset (50 GB) for training a TensorFlow model on SageMaker. The dataset consists of many small CSV files. Training is slow due to I/O bottlenecks. Which data preparation strategy most effectively accelerates training?
Trap 1: Convert the dataset to Parquet format and use Apache Arrow for…
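Parquet is a columnar format built for analytical queries that read a few columns at a time. Training reads whole records sequentially, which TFRecord and the native tf.data reader handle more efficiently.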
Trap 2: Compress the CSV files and decompress during data loading
Decompression adds CPU overhead during loading, and the dataset is still split across the same many small files, so the per-file overhead remains.
Trap 3: Use a larger instance type with more vCPUs
More vCPUs add compute capacity, but the bottleneck is reading many small files, so the I/O problem remains.
- A
Convert the dataset to TFRecord format and use tf.data pipeline with prefetching
TFRecord packs many records into a few large files that can be read sequentially, and prefetching overlaps data loading with training so that training steps spend less time waiting for input.
- B
Convert the dataset to Parquet format and use Apache Arrow for loading
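Why wrong: Parquet is a columnar format built for analytical queries that read a few columns at a time. Training reads whole records sequentially, which TFRecord and the native tf.data reader handle more efficiently.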
- C
Compress the CSV files and decompress during data loading
Why wrong: Decompression adds CPU overhead during loading, and the dataset is still split across the same many small files, so the per-file overhead remains.
- D
Use a larger instance type with more vCPUs
Why wrong: More vCPUs add compute capacity, but the bottleneck is reading many small files, so the I/O problem remains.
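
The approach in option A can be sketched roughly as follows. This is a minimal illustration, not a definitive implementation: the helper names (csv_to_tfrecord_shards, make_dataset), the shard naming scheme, and the CSV layout (no header row, float feature columns followed by an integer label in the last column) are all assumptions made for the example.

```python
import csv
import os

import tensorflow as tf


def csv_to_tfrecord_shards(csv_paths, out_dir, num_shards=4):
    """Pack rows from many small CSV files into a few TFRecord shards.

    Assumed CSV layout (hypothetical): no header row, every column except
    the last is a float feature, and the last column is an integer label.
    """
    os.makedirs(out_dir, exist_ok=True)
    shard_paths = [
        os.path.join(out_dir, f"train-{i:05d}-of-{num_shards:05d}.tfrecord")
        for i in range(num_shards)
    ]
    writers = [tf.io.TFRecordWriter(path) for path in shard_paths]
    row_index = 0
    for path in csv_paths:
        with open(path, newline="") as f:
            for row in csv.reader(f):
                features = [float(value) for value in row[:-1]]
                label = int(row[-1])
                example = tf.train.Example(
                    features=tf.train.Features(
                        feature={
                            "features": tf.train.Feature(
                                float_list=tf.train.FloatList(value=features)
                            ),
                            "label": tf.train.Feature(
                                int64_list=tf.train.Int64List(value=[label])
                            ),
                        }
                    )
                )
                # Round-robin rows across shards so the shards stay similar in size.
                writers[row_index % num_shards].write(example.SerializeToString())
                row_index += 1
    for writer in writers:
        writer.close()
    return shard_paths


def make_dataset(shard_paths, num_features, batch_size=32):
    """Build a tf.data pipeline that reads the shards and prefetches batches."""
    feature_spec = {
        "features": tf.io.FixedLenFeature([num_features], tf.float32),
        "label": tf.io.FixedLenFeature([], tf.int64),
    }

    def parse(serialized):
        parsed = tf.io.parse_single_example(serialized, feature_spec)
        return parsed["features"], parsed["label"]

    # Read several shards in parallel; record order is then not guaranteed.
    dataset = tf.data.TFRecordDataset(shard_paths, num_parallel_reads=4)
    dataset = dataset.map(parse, num_parallel_calls=tf.data.AUTOTUNE)
    dataset = dataset.batch(batch_size)
    # Prefetch so the next batch is prepared while the current one is in use.
    return dataset.prefetch(tf.data.AUTOTUNE)
```

The conversion would typically run once as a preprocessing step, with the resulting shards uploaded to S3 for the training job. The dataset returned by make_dataset can then be passed to model.fit. For a 50 GB dataset, the shard count would need to be much larger than the default used here, and tuning it is left to the reader.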