Add secondary caching to newly-migrated datapipe-based datasets #1526

erip · 2022-01-18T17:56:46Z

This is mostly so things don't fall through the cracks...

🚀 Feature

Motivation

CI spends too much time unpacking cached datasets. In addition to the downloaded archives, we can also cache the decompressed and extracted files for the relevant splits, so the unpacking only happens once.

Pitch

Add secondary caching to datasets which have been migrated per #1494.

Alternatives

N/A

Additional context

See discussion in review of #1515

parmeet · 2022-01-18T18:19:17Z

I would also add to the Motivation that it helps improve workflow efficiency. With the extracted files cached on disk during the first iteration, the datapipe avoids redoing the extraction in subsequent iterations. The alternative would be in-memory caching, in which case the files wouldn't be written to disk. @ejguan, just to confirm, is this the right understanding?
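For illustration, the on-disk idea can be sketched with the standard library alone. This is not the datapipe API used by the migrated datasets; `extract_once`, `build_demo_archive`, `run_demo`, the `stats` counter, and the file names are all hypothetical names invented for the sketch. It only shows the behaviour described above: the first pass extracts and writes to disk, later passes find the file and skip the extraction.

```python
import io
import os
import tarfile
import tempfile


def extract_once(archive_path, member_name, cache_dir, stats):
    """Return the path of the extracted member, extracting only on a cache miss."""
    target = os.path.join(cache_dir, member_name)
    if os.path.exists(target):
        # Cache hit: the extracted file is already on disk, skip the archive.
        stats["hits"] += 1
        return target
    os.makedirs(os.path.dirname(target), exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        src = tar.extractfile(member_name)
        with open(target, "wb") as dst:
            dst.write(src.read())
    stats["extractions"] += 1
    return target


def build_demo_archive(path, member_name, payload):
    """Create a tiny .tar.gz so the sketch is self-contained (hypothetical data)."""
    info = tarfile.TarInfo(name=member_name)
    info.size = len(payload)
    with tarfile.open(path, "w:gz") as tar:
        tar.addfile(info, io.BytesIO(payload))


def run_demo():
    """Read the same split three times; only the first read should extract."""
    stats = {"extractions": 0, "hits": 0}
    with tempfile.TemporaryDirectory() as root:
        archive = os.path.join(root, "demo.tar.gz")
        build_demo_archive(archive, "train.csv", b"1,good\n2,bad\n")
        cache_dir = os.path.join(root, "extracted")
        contents = []
        for _ in range(3):
            path = extract_once(archive, "train.csv", cache_dir, stats)
            with open(path, "rb") as f:
                contents.append(f.read())
    return stats, contents
```

Calling `run_demo()` reads the split three times but touches the archive only once; the other two reads are served from the extracted copy on disk.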

ejguan · 2022-01-18T18:24:08Z

> I would also add to Motivation that it helps improve workflow efficiency. With extracted files being cached on disk in the first iteration, datapipe would avoid doing the extraction in the consecutive iterations. Also the alternative would be In Memory caching in which case the files won't be dumped to the disk. @ejguan just to confirm this is indeed the right understanding?

Correct, but it requires more careful design. If we do in-memory caching of file handles, it won't actually work unless you explicitly rewind them: a handle that has already been read is left at end-of-file, so the next read returns nothing.
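A minimal sketch of the rewind problem, using `io.BytesIO` as a stand-in for a cached file handle. The two helper names are hypothetical and exist only for this illustration; they are not part of any datapipe API.

```python
import io


def cached_reads_without_rewind(handle, times):
    """Re-read the same cached handle without rewinding it."""
    # After the first read the position is at end-of-file,
    # so every later read returns an empty bytes object.
    return [handle.read() for _ in range(times)]


def cached_reads_with_rewind(handle, times):
    """Re-read the same cached handle, rewinding before each read."""
    results = []
    for _ in range(times):
        handle.seek(0)  # explicit rewind back to the start of the stream
        results.append(handle.read())
    return results
```

Without the `seek(0)`, only the first consumer of the cached handle sees any data, which is why caching the handles themselves in memory is not enough.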

erip · 2022-01-21T20:42:41Z

I think this can be closed since the previously-migrated datasets now have double caching and future migrations will require it as part of code review. Feel free to reopen as necessary.

parmeet · 2022-01-21T21:03:11Z

> I think this can be closed since the previously-migrated datasets now have double caching and future migrations will require it as part of code review. Feel free to reopen as necessary.

Thanks @erip for helping keep track of progress through this issue!

parmeet mentioned this issue Jan 18, 2022

Cache extraction for AmazonReviewPolarity #1527

Merged

This was referenced Jan 19, 2022

add double caching for yahoo to speed up extracted reading. #1528

Merged

add double caching for yelp full to speed up extracted reading. #1529

Merged

add double caching for yelp polarity to speed up extracted reading. #1530

Merged

This was referenced Jan 21, 2022

Migrate WikiText2 to datapipes #1519

Merged

Migrate WikiText103 to datapipes #1518

Merged

erip closed this as completed Jan 21, 2022
