diff --git a/docs/source/tutorial.rst b/docs/source/tutorial.rst
index 8e4a794e8..fd33beac8 100644
--- a/docs/source/tutorial.rst
+++ b/docs/source/tutorial.rst
@@ -146,8 +146,7 @@ The reason why ``n_sample = 12`` is because ``ShardingFilter`` (``datapipe.shard
 each worker will independently return all samples. In this case, there are 10 rows per file and 3 files, with
 a batch size of 5, that gives us 6 batches per worker. With 2 workers, we get 12 total batches from the ``DataLoader``.

-In order for DataPipe sharding to work with ``DataLoader``, we need to add the following. It is crucial to add
-``ShardingFilter`` after ``Shuffler`` to ensure that all worker processes have the same order of data for sharding.
+In order for DataPipe sharding to work with ``DataLoader``, we need to add the following.

 .. code:: python

@@ -169,6 +168,12 @@ Note:
 - Place ``ShardingFilter`` (``datapipe.sharding_filter``) as early as possible in the pipeline, especially before
   expensive operations such as decoding, in order to avoid repeating these expensive operations across
   worker/distributed processes.
+- For a data source that needs to be sharded, it is crucial to add ``Shuffler`` before ``ShardingFilter``
+  to ensure that data are globally shuffled before being split into shards. Otherwise, each worker process
+  would always process the same shard of data for all epochs, meaning every batch would only consist of data
+  from the same shard, which leads to low accuracy during training. This does not apply to a data source
+  that has already been sharded for each worker/distributed process, since ``ShardingFilter`` is no longer
+  required in the pipeline.
 - There may be cases where placing ``Shuffler`` earlier in the pipeline lead to worse performance, because some
   operations (e.g. decompression) are faster with sequential reading. In those cases, we recommend decompressing
   the files prior to shuffling (potentially prior to any data loading).
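
For reference, the ordering the new note recommends looks roughly like the sketch below. This is a minimal,
hypothetical example built on a toy in-memory source (``IterableWrapper(range(30))``) with a placeholder
``double`` transform, not the tutorial's CSV pipeline: ``shuffle()`` runs before ``sharding_filter()``, so every
``DataLoader`` worker receives a freshly shuffled shard each epoch.

.. code:: python

    from torch.utils.data import DataLoader
    from torchdata.datapipes.iter import IterableWrapper


    def double(x):
        # Placeholder for whatever per-sample processing the real pipeline does.
        return x * 2


    def build_datapipe():
        dp = IterableWrapper(range(30))  # toy in-memory source with 30 samples
        dp = dp.shuffle()                # global shuffle happens before sharding
        dp = dp.sharding_filter()        # each DataLoader worker keeps only its own shard
        dp = dp.map(double)              # per-sample work is not repeated across workers
        return dp


    if __name__ == "__main__":
        dl = DataLoader(build_datapipe(), batch_size=5, num_workers=2)
        for batch in dl:
            print(batch)

With 30 samples, 2 workers, and ``batch_size=5``, this yields 6 batches per epoch; because the global shuffle
happens before sharding, the shard a given sample falls into changes from epoch to epoch.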