
Support torch dataloader without torch formatting #5357

Merged
merged 7 commits into main from support-torch-dataloader-without-torch-formatting on Dec 15, 2022

Conversation

lhoestq (Member) commented Dec 13, 2022

In #5084 we made the torch formatting consistent with the map-style dataset formatting: a torch-formatted iterable dataset now yields torch tensors.

The previous behavior of the torch formatting for iterable datasets was simply to make the iterable dataset inherit from torch.utils.data.Dataset so it would work in a torch DataLoader. However, ideally an unformatted dataset should also work with a DataLoader. To fix that, datasets.IterableDataset should inherit from torch.utils.data.IterableDataset.

Since we don't want to import torch on startup, I created this PR to dynamically make the datasets.IterableDataset class inherit from the torch one when a datasets.IterableDataset is instantiated, and only if PyTorch is available.

```python
>>> from datasets import load_dataset
>>> ds = load_dataset("c4", "en", streaming=True, split="train")
>>> import torch.utils.data
>>> isinstance(ds, torch.utils.data.IterableDataset)
True
>>> dataloader = torch.utils.data.DataLoader(ds, batch_size=32, num_workers=4)
>>> for example in dataloader:
...     ...
```
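Under the hood, the trick looks roughly like the following. This is a minimal sketch of the dynamic-inheritance idea; the class and helper names are illustrative, not the PR's exact code:

```python
import importlib.util


class _InfoMixin:
    """Stand-in for datasets' DatasetInfoMixin; starting from a non-object
    base keeps the __bases__ assignment below legal in CPython."""


def _maybe_add_torch_parent(cls):
    # torch is only imported if the user already has it installed, so
    # importing this module alone never triggers `import torch`.
    if importlib.util.find_spec("torch") is not None:
        import torch.utils.data

        if torch.utils.data.IterableDataset not in cls.__bases__:
            cls.__bases__ += (torch.utils.data.IterableDataset,)


class MyIterableDataset(_InfoMixin):
    def __init__(self, generator):
        self._generator = generator
        _maybe_add_torch_parent(type(self))  # graft the parent at instantiation

    def __iter__(self):
        yield from self._generator()
```

Since the class object itself is patched, every instance created after the first instantiation (and even before it) passes the isinstance check.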

HuggingFaceDocBuilderDev commented Dec 13, 2022

The documentation is not available anymore as the PR was closed or merged.

lhoestq (Member, Author) commented Dec 14, 2022

Need some more time to fix the tests, especially with pickle

@lhoestq lhoestq marked this pull request as draft December 14, 2022 14:13
@lhoestq lhoestq force-pushed the support-torch-dataloader-without-torch-formatting branch from 4907aaf to be49a81 on December 14, 2022 15:24
@lhoestq lhoestq marked this pull request as ready for review December 14, 2022 15:45
mariosasko (Collaborator) left a comment


Looks good!

This is probably the least hacky we can get here :)

Review thread on src/datasets/iterable_dataset.py (outdated, resolved)
Co-authored-by: Mario Šaško <mariosasko777@gmail.com>
polinaeterna (Contributor) left a comment


I like the hack :)

I just left a fix in the docs (not related to this PR).

And I actually don't quite understand the idea - what's the motivation behind making only IterableDataset compatible with torch DataLoader without setting the format explicitly?

Review threads on docs/source/use_with_pytorch.mdx and src/datasets/iterable_dataset.py (outdated, resolved)
lhoestq (Member, Author) commented Dec 15, 2022

> And I actually don't quite understand the idea - what's the motivation behind making only IterableDataset compatible with torch DataLoader without setting the format explicitly?

Setting the format to PyTorch means setting the output types of the dataset to PyTorch tensors. However, sometimes your dataset isn't made of tensors (e.g. it contains raw text), but you still want to be able to use a PyTorch DataLoader.
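Concretely, a hedged sketch reusing the c4 example above (the "text" column comes from that dataset): without .with_format("torch"), examples stay plain Python dicts, and torch's default collate_fn batches the strings into lists rather than tensors.

```python
from torch.utils.data import DataLoader

from datasets import load_dataset

# No .with_format("torch"): each example stays a plain Python dict like
# {"text": "...", "timestamp": "...", "url": "..."}.
ds = load_dataset("c4", "en", streaming=True, split="train")

dataloader = DataLoader(ds, batch_size=4)
batch = next(iter(dataloader))
print(type(batch["text"]), len(batch["text"]))  # <class 'list'> 4 -> raw strings, no tensors
```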

mariosasko (Collaborator) commented

A bit more context.

The arrow-backed Dataset supports DataLoader(ds) (even if the format is not "torch"), and we want to be able to do the same with IterableDataset for consistency. However, this is when the PyTorch internals come into play - an iterable dataset needs to be an instance of torch.utils.data.IterableDataset due to this check (notice there is no check for the map-style version). Hence the explicit subclassing in this PR.
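To illustrate the consequence of that check, here is a toy sketch (class names made up, not torch internals): a class that only defines __iter__ goes down DataLoader's map-style code path and fails, while the same iterator wrapped in a torch.utils.data.IterableDataset subclass is consumed correctly.

```python
import torch.utils.data


class PlainStream:
    """Defines __iter__ but does NOT subclass torch.utils.data.IterableDataset."""

    def __iter__(self):
        return iter(range(8))


class TorchStream(torch.utils.data.IterableDataset):
    """Same iterator, but recognized by the isinstance check in DataLoader."""

    def __iter__(self):
        return iter(range(8))


# Map-style path: DataLoader tries to sample indices, which requires
# __len__/__getitem__ that PlainStream doesn't have.
try:
    next(iter(torch.utils.data.DataLoader(PlainStream(), batch_size=4)))
except TypeError as err:
    print("map-style path failed:", err)

# Iterable path: DataLoader simply consumes __iter__.
print(next(iter(torch.utils.data.DataLoader(TorchStream(), batch_size=4))))
# tensor([0, 1, 2, 3])
```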

lhoestq (Member, Author) commented Dec 15, 2022

Exactly :) Btw I just took your comments into account @polinaeterna, so feel free to review again.

@lhoestq lhoestq merged commit 0bec9f3 into main Dec 15, 2022
@lhoestq lhoestq deleted the support-torch-dataloader-without-torch-formatting branch December 15, 2022 19:15
corbyrosset commented Jan 4, 2023

@lhoestq just checking, does this change still preserve the fix to the "data duplicate when setting num_works > 1 with streaming data" issue from before?

#3423

lhoestq (Member, Author) commented Jan 4, 2023

Yes :)
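For context, the idea behind that fix can be illustrated with the generic PyTorch worker-sharding pattern. This is a toy sketch only; datasets implements it internally by splitting the dataset's shards across workers rather than striding over items.

```python
from torch.utils.data import DataLoader, IterableDataset, get_worker_info


class ShardedStream(IterableDataset):
    """Each DataLoader worker yields a disjoint slice of the stream,
    so num_workers > 1 produces every example exactly once."""

    def __init__(self, n):
        self.n = n

    def __iter__(self):
        info = get_worker_info()
        if info is None:  # single-process data loading
            yield from range(self.n)
        else:  # worker `info.id` of `info.num_workers` takes every k-th item
            yield from range(info.id, self.n, info.num_workers)


if __name__ == "__main__":  # guard needed when workers are spawned
    loader = DataLoader(ShardedStream(8), num_workers=2, batch_size=None)
    print(sorted(int(x) for x in loader))  # [0, 1, ..., 7] -> each item exactly once
```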
