Streaming datasets: 100x More Efficient

https://huggingface.co/blog/streaming-datasets (huggingface.co)
Hugging Face has sped up dataset streaming in its `datasets` library, the mode enabled by passing `streaming=True`. With these changes, training on terabyte-scale datasets can start immediately, without downloading the data first, and two earlier pain points are addressed: slow startup and excessive API requests. The two main improvements are caching the list of data files across DataLoader workers, which cuts startup requests by 100x, and prefetching for Parquet files, which doubles data throughput. On the backend, Xet, a deduplication-based storage system, accelerates data transfers, and the post reports that streaming from the Hub is now faster than streaming from traditional cloud storage. Overall, the post puts streaming performance on par with reading from local SSDs, removing data transfer as a bottleneck in large-scale model training.
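To make the prefetching idea concrete, here is a minimal, standard-library-only sketch of the general technique: a background thread fetches the next chunks while the consumer is still processing the current one. This is not the `datasets` library's implementation. The names `prefetch`, `fake_row_groups`, and `buffer_size` are hypothetical and exist only for illustration, and the fake source stands in for Parquet row groups read over the network.

```python
import queue
import threading
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")
_DONE = object()  # sentinel that marks the end of the stream


def prefetch(source: Iterable[T], buffer_size: int = 2) -> Iterator[T]:
    """Yield items from `source` while a background thread fetches ahead.

    Hypothetical helper: at most `buffer_size` items wait in the buffer.
    """
    buf: queue.Queue = queue.Queue(maxsize=buffer_size)
    errors: List[BaseException] = []

    def producer() -> None:
        try:
            for item in source:
                buf.put(item)  # blocks while the buffer is full
        except Exception as exc:
            errors.append(exc)  # re-raised in the consumer below
        finally:
            buf.put(_DONE)

    # Daemon thread: it will not keep the process alive if the
    # consumer stops iterating early.
    threading.Thread(target=producer, daemon=True).start()

    while True:
        item = buf.get()
        if item is _DONE:
            break
        yield item
    if errors:
        raise errors[0]


def fake_row_groups(count: int, rows_per_group: int = 2) -> Iterator[List[int]]:
    """Stand-in for remote reads: yields `count` small chunks of row ids."""
    for group in range(count):
        start = group * rows_per_group
        yield list(range(start, start + rows_per_group))


if __name__ == "__main__":
    for chunk in prefetch(fake_row_groups(3)):
        print(chunk)
```

The bounded queue is the key design choice in this sketch: it lets fetching overlap with processing while capping how much data sits in memory at once. Items come out in the same order the source produced them.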
0 points by ogg 2 days ago
