[mxnet 2.0][item 4.8][RFC] Gluon Data API Extension and Fixes(Part 2) #17269

zhreshold · 2020-01-10T23:15:29Z

Description

This is part 2 of the Gluon Data API extension and fixes, which mainly focuses on speeding up the current data loading pipeline built on the Gluon Dataset and DataLoader.

Motivation

The current data loading pipeline is the major bottleneck for many training tasks. We can summarize the entire flow as:

| Dataset.__getitem__ -> 
| Transform.__call__()/forward() ->
| Batchify ->
| (optional communicate through shared_mem) ->
| split_and_load(ctxs) ->
| <training on GPUs>
->

where there are several performance concerns:

the performance of Python dataset/transform functions isn't satisfactory
it's not easy to use multithreading to speed up data loading because of the global interpreter lock (GIL)
Python multiprocessing is unfortunately slow and error-prone, and the shared memory implementations on different OSes differ considerably and are hard to work with (e.g., it's very likely to run out of shared memory if it isn't managed carefully)
there is currently no memory planning for batchify, causing frequent allocation/deallocation of large chunks of memory when the batch size is big
batchify followed by split and load can be optimized into partial_batchify
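
To make the last concern concrete, here is a minimal pure-Python sketch of what "partial batchify" could mean: instead of stacking the whole batch and then slicing it per device, each per-device slice is stacked directly. The function names and the list-based "stacking" are illustrative assumptions, not the proposed C++ implementation.

```python
def batchify(samples):
    """Stack samples into one batch (here simply a list of rows)."""
    return [list(s) for s in samples]

def _chunk_size(n, num_ctx):
    # Ceiling division, at least 1, so that range() never gets a zero step.
    return max(1, (n + num_ctx - 1) // num_ctx)

def split(batch, num_ctx):
    """Slice an already-built batch into up to num_ctx per-device parts."""
    step = _chunk_size(len(batch), num_ctx)
    return [batch[i:i + step] for i in range(0, len(batch), step)]

def partial_batchify(samples, num_ctx):
    """Build one small batch per device directly, never materializing the full batch."""
    step = _chunk_size(len(samples), num_ctx)
    return [batchify(samples[i:i + step]) for i in range(0, len(samples), step)]

samples = [(i, i * 10) for i in range(8)]
two_step = split(batchify(samples), num_ctx=2)   # batchify, then split
one_step = partial_batchify(samples, num_ctx=2)  # split first, batchify each part
```

Both paths yield the same per-device batches; the second one avoids allocating the large intermediate batch, which is the saving this item is after.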

Proposal

To alleviate these problems, I propose a hybrid solution, that is, to

provide C++ Datasets that cover the most common use cases

from gluon.data.dataset import TupleDataset, ImageSequenceDataset, ArrayDataset
# as long as TupleDataset, ImageSequenceDataset, ArrayDataset are supported by backend
dataset = TupleDataset([ImageSequenceDataset(img_paths), ArrayDataset(image_labels)])
# dataset is an image classification dataset that is fully supported in C++
# with TupleDataset we can combine as many datasets as needed

# a C++ backed Dataset can have a magic __handle__ method to return the C++ handle for reference
class TupleDataset:
    def __init__(self, datasets):
        # getattr with a default avoids AttributeError for pure Python datasets
        if all(callable(getattr(dataset, '__handle__', None)) for dataset in datasets):
            # all supported by backend
            self._tuple_dataset = check_call(_LIB.MXTupleDatasetCreate([dataset.__handle__() for dataset in datasets]))
        else:
            self._tuple_dataset = None

    def __handle__(self):
        return self._tuple_dataset
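
The `__handle__` convention above can be tried without the backend. The following is a rough, self-contained mock: `BackendDataset`, `PythonDataset`, `backend_handle`, and the integer handles are all hypothetical stand-ins, and a plain tuple of child handles replaces the `MXTupleDatasetCreate` call.

```python
class BackendDataset:
    """Stand-in for a C++-backed dataset; the integer handle is a placeholder."""
    def __init__(self, items, handle):
        self._items = list(items)
        self._handle = handle

    def __len__(self):
        return len(self._items)

    def __getitem__(self, idx):
        return self._items[idx]

    def __handle__(self):
        return self._handle


class PythonDataset:
    """Custom Python dataset that exposes no backend handle."""
    def __init__(self, items):
        self._items = list(items)

    def __len__(self):
        return len(self._items)

    def __getitem__(self, idx):
        return self._items[idx]


def backend_handle(dataset):
    """Return the dataset's backend handle, or None if it has none."""
    getter = getattr(dataset, '__handle__', None)
    return getter() if callable(getter) else None


class TupleDataset:
    def __init__(self, datasets):
        self._datasets = list(datasets)
        handles = [backend_handle(d) for d in self._datasets]
        # Stand-in for MXTupleDatasetCreate: keep the child handles as a tuple.
        all_backed = all(h is not None for h in handles)
        self._tuple_dataset = tuple(handles) if all_backed else None

    def __len__(self):
        return min(len(d) for d in self._datasets)

    def __getitem__(self, idx):
        return tuple(d[idx] for d in self._datasets)

    def __handle__(self):
        return self._tuple_dataset


images = BackendDataset(['a.jpg', 'b.jpg'], handle=1)
labels = BackendDataset([0, 1], handle=2)
custom = PythonDataset([0, 1])

fast = TupleDataset([images, labels])  # every child has a handle
slow = TupleDataset([images, custom])  # one child is pure Python
```

In this mock, `fast.__handle__()` is a tuple of child handles while `slow.__handle__()` is `None`, so a single Python-only child is enough to push the combined dataset onto the fallback path. Indexing works the same way in both cases.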

provide common C++ batchify functions that are split- and context-aware. Batchify with a memory planner is TBD.
provide a C++ MultithreadingDataLoader which inherits the same arguments as gluon.data.DataLoader but uses mxnet internal multithreading rather than Python multiprocessing.
fall back to Python multiprocessing whenever
- the dataset is not fully supported by the backend (e.g., there are custom Python datasets)
- the Transform is not fully hybridizable
- the Batchify is not fully supported by the backend

Users will continue to use the existing gluon.data.DataLoader, and the conversion will be applied automatically:

loader = gluon.data.DataLoader(hybrid_dataset.transform(hybrid_transform), batch_size=32, batchify_fn=hybrid_batchify)

class DataLoader:
    def __init__(self, dataset, ...):
        # default to None so that __iter__ can always test the attribute
        self._mt_dataloader = None
        if isinstance(dataset, _LazyTransformDataset) and is_hybrid(dataset._transform) and is_hybrid(dataset) and is_hybrid(batchify_fn):
            self._mt_dataloader = check_call(_LIB.MXMultiThreadDataLoaderCreate(...))

    def __iter__(self):
        if self._mt_dataloader:
            return self._mt_dataloader
        else:
            # fallback to single thread normal dataloader or multiprocessing dataloader
            ...
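
The dispatch logic can be exercised end to end with a small mock. This is only a sketch under several assumptions: `is_hybrid` is reduced to checking a hypothetical `backend_supported` marker attribute, the transform is passed as a separate argument rather than through `_LazyTransformDataset`, and a Python `ThreadPoolExecutor` stands in for the backend threads purely to illustrate the shared interface (unlike the proposed C++ loader, it does not escape the GIL).

```python
from concurrent.futures import ThreadPoolExecutor


def is_hybrid(obj):
    """Hypothetical check: a marker attribute says the backend supports obj."""
    return bool(getattr(obj, 'backend_supported', False))


class DataLoader:
    def __init__(self, dataset, batch_size, transform, batchify_fn, num_workers=2):
        self._dataset = dataset
        self._batch_size = batch_size
        self._transform = transform
        self._batchify_fn = batchify_fn
        self._num_workers = num_workers
        # Use the threaded path only if every stage is backend-supported.
        stages = (dataset, transform, batchify_fn)
        self.backend = 'threaded' if all(is_hybrid(s) for s in stages) else 'fallback'

    def _load_batch(self, indices):
        return self._batchify_fn([self._transform(self._dataset[i]) for i in indices])

    def __iter__(self):
        n, bs = len(self._dataset), self._batch_size
        batches = [range(i, min(i + bs, n)) for i in range(0, n, bs)]
        if self.backend == 'threaded':
            with ThreadPoolExecutor(self._num_workers) as pool:
                # pool.map preserves batch order.
                yield from pool.map(self._load_batch, batches)
        else:
            # Single-threaded fallback; multiprocessing would slot in here.
            yield from map(self._load_batch, batches)


class HybridList(list):
    """A list marked as backend-supported, for demonstration only."""
    backend_supported = True


def double(x):
    return 2 * x


def stack(samples):
    return list(samples)


double.backend_supported = True
stack.backend_supported = True

fast = DataLoader(HybridList(range(10)), 4, double, stack)
slow = DataLoader(list(range(10)), 4, double, stack)  # plain list: not marked
```

Both loaders produce identical batches, so the choice of backend stays invisible to user code, which is the property the proposal relies on for a transparent conversion.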

With this change, mxnet 2.0 will get a smooth transition to mixed data loaders. Please comment with specific examples that this proposal fails to accommodate.


zhreshold · 2020-01-10T23:17:33Z

@szha @eric-haibin-lin @sxjscience @szhengac Request for comments regarding NLP dataloading

zhreshold · 2020-05-13T00:48:21Z

closed by #17841

zhreshold added the Feature request label Jan 10, 2020

zachgk added Data-loading Gluon RFC Post requesting for comments labels Jan 23, 2020

zhreshold mentioned this issue Jan 29, 2020

[WIP][mxnet 2.0] Pure C++ threaded dataloader that shared the interface with gluon dataloader #17464

Closed

6 tasks

zhreshold mentioned this issue Mar 15, 2020

Gluon data 2.0: c++ dataloader and built-in image/bbox transforms #17841

Merged

zhreshold closed this as completed May 13, 2020
