-
Notifications
You must be signed in to change notification settings - Fork 811
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add secondary caching to newly-migrated datapipe-based datasets #1526
Comments
I would also add to Motivation that it helps improve workflow efficiency. With extracted files being cached on disk in the first iteration, datapipe would avoid doing the extraction in the consecutive iterations. Also the alternative would be In Memory caching in which case the files won't be dumped to the disk. @ejguan just to confirm this is indeed the right understanding? |
Correct. But, it requires more careful design. If we do in memory caching over file handlers, it won't actually work unless you explicitly rewind them. |
I think this can be closed since the previously-migrated datasets now have double caching and future migrations will require it as part of code review. Feel free to reopen as necessary. |
Thanks @erip for helping keep track of progress through this issue! |
This is mostly so things don't fall through the cracks...
🚀 Feature
Motivation
CI spends too much time unpacking cached datasets. We can also cache uncompressed and extracted files for relevant splits.
Pitch
Add secondary caching to datasets which have been migrated per #1494.
Alternatives
N/A
Additional context
See discussion in review of #1515
The text was updated successfully, but these errors were encountered: