Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add secondary caching to newly-migrated datapipe-based datasets #1526

Closed
erip opened this issue Jan 18, 2022 · 4 comments
Closed

Add secondary caching to newly-migrated datapipe-based datasets #1526

erip opened this issue Jan 18, 2022 · 4 comments

Comments

@erip
Copy link
Contributor

erip commented Jan 18, 2022

This is mostly so things don't fall through the cracks...

🚀 Feature

Motivation

CI spends too much time unpacking cached datasets. We can also cache uncompressed and extracted files for relevant splits.

Pitch

Add secondary caching to datasets which have been migrated per #1494.

Alternatives

N/A

Additional context

See discussion in review of #1515

@parmeet
Copy link
Contributor

parmeet commented Jan 18, 2022

I would also add to Motivation that it helps improve workflow efficiency. With extracted files being cached on disk in the first iteration, datapipe would avoid doing the extraction in the consecutive iterations. Also the alternative would be In Memory caching in which case the files won't be dumped to the disk. @ejguan just to confirm this is indeed the right understanding?

@ejguan
Copy link
Contributor

ejguan commented Jan 18, 2022

I would also add to Motivation that it helps improve workflow efficiency. With extracted files being cached on disk in the first iteration, datapipe would avoid doing the extraction in the consecutive iterations. Also the alternative would be In Memory caching in which case the files won't be dumped to the disk. @ejguan just to confirm this is indeed the right understanding?

Correct. But, it requires more careful design. If we do in memory caching over file handlers, it won't actually work unless you explicitly rewind them.

@erip
Copy link
Contributor Author

erip commented Jan 21, 2022

I think this can be closed since the previously-migrated datasets now have double caching and future migrations will require it as part of code review. Feel free to reopen as necessary.

@erip erip closed this as completed Jan 21, 2022
@parmeet
Copy link
Contributor

parmeet commented Jan 21, 2022

I think this can be closed since the previously-migrated datasets now have double caching and future migrations will require it as part of code review. Feel free to reopen as necessary.

Thanks @erip for helping keep track of progress through this issue!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants