
TorchData 0.3.0 Beta Release

@ejguan ejguan released this 10 Mar 18:43

0.3.0 Release Notes

We are delighted to present the Beta release of TorchData, a library of common modular data loading primitives for easily constructing flexible and performant data pipelines. Based on community feedback, we have found that the existing DataLoader bundles too many features together and can be difficult to extend. Moreover, different use cases often have to rewrite the same data loading utilities over and over again. The goal here is to enable composable data loading through Iterable-style and Map-style building blocks called “DataPipes” that work well out of the box with PyTorch’s DataLoader.

  • Highlights
    • What are DataPipes?
    • Usage Example
  • New Features
  • Documentation
  • Usage in Domain Libraries
  • Future Plans
  • Beta Usage Note

Highlights

We are releasing DataPipes, which come in two flavors: the Iterable-style DataPipe (IterDataPipe) and the Map-style DataPipe (MapDataPipe).

What are DataPipes?

Early on, we observed widespread confusion between PyTorch DataSets that represented reusable loading tooling (e.g. TorchVision's ImageFolder) and those that represented pre-built iterators/accessors over actual data corpora (e.g. TorchVision's ImageNet). This led to an unfortunate pattern of siloed inheritance of data tooling rather than composition.

DataPipe is simply a renaming and repurposing of the PyTorch DataSet for composed usage. A DataPipe takes in some access function over Python data structures (__iter__ for IterDataPipes and __getitem__ for MapDataPipes) and returns a new access function with a slight transformation applied. For example, take a look at this JsonParser, which accepts an IterDataPipe over file names and raw streams, and produces a new iterator over the file names and deserialized data:

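import json

from torchdata.datapipes.iter import IterDataPipe


class JsonParserIterDataPipe(IterDataPipe):
    def __init__(self, source_datapipe, **kwargs) -> None:
        self.source_datapipe = source_datapipe
        # Extra keyword arguments are forwarded to json.loads.
        self.kwargs = kwargs

    def __iter__(self):
        # Each source item is a (file name, readable stream) pair.
        for file_name, stream in self.source_datapipe:
            data = stream.read()
            yield file_name, json.loads(data, **self.kwargs)

    def __len__(self):
        return len(self.source_datapipe)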

You can see in this example how DataPipes can be easily chained together to compose graphs of transformations that reproduce sophisticated data pipelines, with streamed operation as a first-class citizen.

Under this naming convention, DataSet simply refers to a graph of DataPipes, and a dataset module like ImageNet can be rebuilt as a factory function returning the requisite composed DataPipes.
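To make the composition idea concrete, here is a minimal sketch in plain Python. It deliberately does not import TorchData, so the classes `ListSource`, `JsonParser`, and `Mapper` and the factory `toy_dataset` are hypothetical stand-ins written for illustration, not the library's own implementations. It shows three iterable-style stages chained together and a "dataset" expressed as a factory function that returns the composed graph.

```python
import io
import json


class ListSource:
    """Hypothetical source stage: yields (file_name, stream) pairs."""

    def __init__(self, named_payloads):
        self.named_payloads = named_payloads

    def __iter__(self):
        for file_name, payload in self.named_payloads:
            # A fresh stream per pass, so the stage can be iterated repeatedly.
            yield file_name, io.StringIO(payload)

    def __len__(self):
        return len(self.named_payloads)


class JsonParser:
    """Mirrors the JsonParser shown earlier: parses each incoming stream."""

    def __init__(self, source_datapipe, **kwargs):
        self.source_datapipe = source_datapipe
        self.kwargs = kwargs

    def __iter__(self):
        for file_name, stream in self.source_datapipe:
            yield file_name, json.loads(stream.read(), **self.kwargs)

    def __len__(self):
        return len(self.source_datapipe)


class Mapper:
    """Applies fn to every item produced by the wrapped stage."""

    def __init__(self, source_datapipe, fn):
        self.source_datapipe = source_datapipe
        self.fn = fn

    def __iter__(self):
        for item in self.source_datapipe:
            yield self.fn(item)

    def __len__(self):
        return len(self.source_datapipe)


def toy_dataset(split):
    """Factory function: the 'dataset' is just the composed graph of stages."""
    files = {
        "train": [("a.json", '{"label": 1}'), ("b.json", '{"label": 0}')],
        "test": [("c.json", '{"label": 1}')],
    }
    pipe = ListSource(files[split])
    pipe = JsonParser(pipe)
    return Mapper(pipe, lambda pair: (pair[0], pair[1]["label"]))


train_samples = list(toy_dataset("train"))
```

Each stage only knows about the stage it wraps, so swapping the source or adding another transformation does not require subclassing a monolithic dataset class.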

Usage Example

In this example, we have a compressed TAR archive file stored in Google Drive and accessible via a URL. We demonstrate how you can use DataPipes to download the archive, cache the result, decompress the archive, filter for specific files, and parse and return the CSV content. The full example with detailed explanation is included in the example folder.

from torchdata.datapipes.iter import FileOpener, GDriveReader, IterableWrapper

# Excerpt from the body of a dataset function; URL, _EXTRACTED_FILES, and
# split are defined in the full example.
url_dp = IterableWrapper([URL])
# Downloads the archive from Google Drive (caching setup omitted here).
cache_compressed_dp = GDriveReader(url_dp)
# cache_decompressed_dp = ... # See source file for full code example
# Opens and loads the content of the TAR archive file.
cache_decompressed_dp = FileOpener(cache_decompressed_dp, mode="b").load_from_tar()
# Filters for specific files based on the file name.
cache_decompressed_dp = cache_decompressed_dp.filter(
    lambda fname_and_stream: _EXTRACTED_FILES[split] in fname_and_stream[0]
)
# Saves the decompressed file onto disk.
cache_decompressed_dp = cache_decompressed_dp.end_caching(mode="wb", same_filepath_fn=True)
data_dp = FileOpener(cache_decompressed_dp, mode="b")
# Parses content of the decompressed CSV file and returns the result line by line.
return data_dp.parse_csv().map(fn=lambda t: (int(t[0]), " ".join(t[1:])))
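For readers who want to see the same sequence of steps without network access or TorchData installed, the sketch below reproduces the "open TAR, filter by file name, parse CSV, map rows" portion using only the Python standard library. It is an analog, not the TorchData API: the helpers `make_tar_bytes`, `iter_tar_members`, and `parse_rows` and the sample file contents are assumptions made up for illustration, and the in-memory archive stands in for the downloaded file.

```python
import csv
import io
import tarfile


def make_tar_bytes(files):
    """Builds an in-memory TAR archive from {name: text} (stand-in for the download)."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as archive:
        for name, text in files.items():
            payload = text.encode("utf-8")
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            archive.addfile(info, io.BytesIO(payload))
    return buffer.getvalue()


def iter_tar_members(tar_bytes):
    """Yields (file_name, binary stream) for each regular file in the archive."""
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r") as archive:
        for member in archive:
            if member.isfile():
                yield member.name, archive.extractfile(member)


def parse_rows(tar_bytes, wanted_name):
    """Filters members by name, parses CSV, and maps each row to (label, text)."""
    for file_name, stream in iter_tar_members(tar_bytes):
        if wanted_name not in file_name:
            continue
        text = stream.read().decode("utf-8")
        for row in csv.reader(io.StringIO(text, newline="")):
            # Same mapping as the example above: first column is the label,
            # remaining columns are joined into one text field.
            yield int(row[0]), " ".join(row[1:])


archive_bytes = make_tar_bytes(
    {
        "data/train.csv": '2,Great,"Loved it, really"\n1,Bad,Broke quickly\n',
        "data/readme.txt": "not a csv\n",
    }
)
rows = list(parse_rows(archive_bytes, "train.csv"))
```

Because every step is a generator, rows are produced one at a time, which matches the streamed style of the DataPipe version.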

New Features

[Beta] IterDataPipe

We have implemented over 50 Iterable-style DataPipes across 10 different categories. They cover different functionalities, such as opening files, parsing text, transforming samples, caching, shuffling, and batching. For users who are interested in connecting to cloud providers (such as Google Drive or AWS S3), the fsspec and iopath DataPipes will allow you to do so. The documentation provides detailed explanations and usage examples of each IterDataPipe.

[Beta] MapDataPipe

As with IterDataPipe, MapDataPipes are available for a variety of functionalities, although there are fewer of them for now. Support for more MapDataPipes will come later. If the existing ones do not meet your needs, you can write a custom DataPipe.
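As a rough illustration of what a custom map-style stage involves, the sketch below implements the `__getitem__`/`__len__` pair described earlier in plain Python. `SequenceSource` and `MapStyleMapper` are hypothetical names invented here; in TorchData itself you would subclass MapDataPipe rather than write standalone classes like these.

```python
class SequenceSource:
    """Hypothetical map-style source: index -> item over an in-memory sequence."""

    def __init__(self, items):
        self.items = list(items)

    def __getitem__(self, index):
        return self.items[index]

    def __len__(self):
        return len(self.items)


class MapStyleMapper:
    """Wraps a map-style source and applies fn lazily on each lookup."""

    def __init__(self, source_datapipe, fn):
        self.source_datapipe = source_datapipe
        self.fn = fn

    def __getitem__(self, index):
        # The transformation runs only for the requested index.
        return self.fn(self.source_datapipe[index])

    def __len__(self):
        return len(self.source_datapipe)


squares = MapStyleMapper(SequenceSource([1, 2, 3]), lambda x: x * x)
```

Unlike the iterable-style stages, random access is preserved: `squares[2]` computes only that one element.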

Documentation

The documentation for TorchData is now live. It contains a tutorial that covers how to use DataPipes, use them with DataLoader, and implement custom ones.

Usage in Domain Libraries

In this release, some of the PyTorch domain libraries have migrated their datasets to use DataPipes. In TorchText, the popular datasets provided by the library are implemented using DataPipes, and a section of its SST-2 binary text classification tutorial demonstrates how you can use DataPipes to preprocess data for your model. There are also other prototype implementations of datasets with DataPipes in TorchVision (available in nightly releases) and in TorchRec. You can find more specific examples here.

Future Plans

There will be a new version of DataLoader in the next release. At a high level, the plan is that DataLoader V2 will only be responsible for multiprocessing, distributed, and similar functionalities, not data processing logic. All data processing features, such as shuffling and batching, will be moved out of DataLoader and into DataPipes. At the same time, the current version of DataLoader should still be available, and you can use DataPipes with it as well.
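The division of labor described above can be sketched in plain Python. This is not the DataLoader V2 API, which these notes do not specify; `shuffled`, `batched`, and `simple_loader` are hypothetical names used only to show shuffling and batching living in pipeline stages while the loader does nothing but drive iteration.

```python
import random


def shuffled(source, seed):
    """Pipeline stage: yields the source items in a seeded random order."""
    items = list(source)
    random.Random(seed).shuffle(items)
    yield from items


def batched(source, batch_size, drop_last=False):
    """Pipeline stage: groups consecutive items into lists of batch_size."""
    batch = []
    for item in source:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    # Emit the final partial batch unless the caller asked to drop it.
    if batch and not drop_last:
        yield batch


def simple_loader(pipeline):
    """Stand-in loader: contains no data processing logic of its own."""
    yield from pipeline


pipeline = batched(shuffled(range(10), seed=0), batch_size=4)
batches = list(simple_loader(pipeline))
```

With processing expressed as stages, changing the batch size or removing the shuffle is an edit to the pipeline, not to the loader.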

Beta Usage Note

This library is currently in the Beta stage and does not yet have a stable release. The API may change based on user feedback or performance. We are committed to bringing this library to a stable release, but future changes may not be completely backward compatible. If you install from source or use the nightly version of this library, use it along with the PyTorch nightly binaries. If you have suggestions on the API or use cases you'd like to be covered, please open a GitHub issue. We'd love to hear your thoughts and feedback.