A data science team uses Amazon SageMaker to train models on a large dataset stored in S3. The dataset is 500 GB in CSV format and is updated daily. The team wants to optimize data loading for training jobs to reduce I/O wait time. Which data ingestion strategy is MOST effective?
Trap 1: Use SageMaker File input mode and increase the EBS volume size to 1…
A larger EBS volume adds capacity, but File mode still has to copy the full dataset from S3 before training starts, so the download wait remains.
Trap 2: Convert the CSV files to Parquet format and use File input mode.
Parquet reduces storage and improves read speed, but File mode still downloads the entire dataset to the EBS volume before training starts, so the I/O wait remains.
Trap 3: Load the data into an Amazon EFS file system and mount it to the…
EFS adds network latency and cost without addressing the fundamental I/O wait issue.
- A
Use SageMaker File input mode and increase the EBS volume size to 1 TB.
Why wrong: A larger EBS volume adds capacity, but File mode still has to copy the full dataset from S3 before training starts, so the download wait remains.
- B
Use SageMaker Pipe input mode to stream data directly from S3.
Pipe mode streams data from S3 on the fly, so training can start without first downloading the full dataset, which reduces I/O wait time.
- C
Convert the CSV files to Parquet format and use File input mode.
Why wrong: Parquet reduces storage and improves read speed, but File mode still downloads the entire dataset to the EBS volume before training starts, so the I/O wait remains.
- D
Load the data into an Amazon EFS file system and mount it to the training instance.
Why wrong: EFS adds network latency and cost without addressing the fundamental I/O wait issue.
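To make option B concrete, here is a minimal sketch of how a Pipe-mode input channel might be described in the request body for the SageMaker CreateTrainingJob API. It only builds a plain dictionary and does not call AWS. The bucket name, prefix, and channel name are hypothetical placeholders, and the helper functions `pipe_channel` and `pipe_fifo_path` are illustrative names, not part of any SDK.

```python
def pipe_channel(name, s3_uri, content_type="text/csv"):
    """Build one InputDataConfig entry that requests Pipe input mode.

    The keys follow the Channel structure of the CreateTrainingJob API;
    the values passed in by the caller are assumptions for illustration.
    """
    return {
        "ChannelName": name,
        "DataSource": {
            "S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": s3_uri,
                "S3DataDistributionType": "FullyReplicated",
            }
        },
        "ContentType": content_type,
        # "Pipe" streams objects from S3 instead of copying them to the
        # training volume first ("File" is the download-first mode).
        "InputMode": "Pipe",
    }


def pipe_fifo_path(channel_name, epoch):
    """Path of the named pipe a Pipe-mode container reads for a given epoch."""
    return f"/opt/ml/input/data/{channel_name}_{epoch}"


# Hypothetical bucket and prefix holding the daily CSV files.
channel = pipe_channel("train", "s3://example-bucket/daily-csv/")
print(channel["InputMode"])
print(pipe_fifo_path("train", 0))
```

Inside the training container, the algorithm would read records sequentially from the named pipe for each epoch rather than opening files on disk, which is why the training code has to support streaming input for this mode to work.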