
fix prototype datasets data loading tests #5711

Merged
merged 14 commits on Apr 5, 2022
7 changes: 3 additions & 4 deletions .circleci/config.yml
(Generated file; diff not rendered.)

7 changes: 3 additions & 4 deletions .circleci/config.yml.in
@@ -152,11 +152,10 @@ commands:
args: --no-build-isolation <<# parameters.editable >> --editable <</ parameters.editable >> .
descr: Install torchvision <<# parameters.editable >> in editable mode <</ parameters.editable >>

# Installs all extra dependencies that are needed in the torchvision.prototype namespace, but are not tracked in the
# project requirements.
install_prototype_dependencies:
steps:
- pip_install:
args: iopath
descr: Install third-party dependencies
- pip_install:
args: --pre torchdata --extra-index-url https://download.pytorch.org/whl/nightly/cpu
descr: Install torchdata from nightly releases
@@ -366,7 +365,7 @@ jobs:
- install_torchvision
- install_prototype_dependencies
- pip_install:
args: scipy pycocotools h5py
args: scipy pycocotools h5py dill
descr: Install optional dependencies
- run_tests_selective:
file_or_dir: test/test_prototype_*.py
21 changes: 17 additions & 4 deletions test/test_prototype_builtin_datasets.py
@@ -7,6 +7,7 @@
import torch
from builtin_dataset_mocks import parametrize_dataset_mocks, DATASET_MOCKS
from torch.testing._comparison import assert_equal, TensorLikePair, ObjectPair
from torch.utils.data._utils.serialization import DILL_AVAILABLE
from torch.utils.data.graph import traverse
from torch.utils.data.graph_settings import get_all_graph_pipes
from torchdata.datapipes.iter import Shuffler, ShardingFilter
@@ -109,19 +110,31 @@ def test_transformable(self, test_home, dataset_mock, config):

next(iter(dataset.map(transforms.Identity())))

@pytest.mark.xfail(reason="See https://github.com/pytorch/data/issues/237")
@parametrize_dataset_mocks(DATASET_MOCKS)
def test_serializable(self, test_home, dataset_mock, config):
dataset_mock.prepare(test_home, config)
def test_serializable_pickle(self, mocker, test_home, dataset_mock, config):
if DILL_AVAILABLE:
mocker.patch("torch.utils.data.datapipes.datapipe.DILL_AVAILABLE", new=False)
Collaborator Author (pmeier):
@ejguan @NivekT Since the serialization backend is automatically selected based on the presence of dill, it is impossible to test pickle serialization without patching this flag. For now we only need to patch a single module, but as pytorch/pytorch#74958 (comment) implies, we will need to do this in multiple places in the future.

Would it be possible to give users the option to set the serialization backend? The default could still be dill if it is available and pickle otherwise.
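
For illustration, a minimal sketch of why this gets brittle (it assumes pytest-mock's mocker fixture, as used in the test above; the second module path is hypothetical):

# The flag has to be patched in every module that *uses* it, not where it is defined.
DILL_USAGE_SITES = [
    "torch.utils.data.datapipes.datapipe",
    # "torch.utils.data.datapipes.iter.combining",  # hypothetical future usage site
]

def force_pickle_backend(mocker):
    for module in DILL_USAGE_SITES:
        mocker.patch(f"{module}.DILL_AVAILABLE", new=False)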

Contributor:
I see. I can confirm that we will rely on this DILL_AVAILABLE flag anywhere in the TorchData project to determine whether dill is available.

> Would it be possible to give users the option to set the serialization backend? The default could still be dill if it is available and pickle otherwise.

It's doable, but I am not sure we want to do so, because the goal of automatically using dill is to spare users the work of figuring out whether a DataPipe containing lambda functions is serializable.
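
As a standalone illustration of that point (not part of the PR): pickle cannot serialize lambdas, while dill can, which is why dill is picked up automatically when it is installed:

import pickle
import dill  # requires dill to be installed

fn = lambda x: x + 1

try:
    pickle.dumps(fn)  # pickle refuses the lambda
except Exception as exc:
    print(type(exc).__name__)

print(dill.loads(dill.dumps(fn))(1))  # dill can round-trip the lambda: prints 2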

Collaborator Author (pmeier):
> I see. I can confirm that we will rely on this DILL_AVAILABLE flag anywhere in the TorchData project to determine whether dill is available.

The problem is that we will need to patch every single module where this flag is imported. You cannot patch the place where it is defined; you have to patch the places where it is used. If you look above, we are not patching ._utils.serialization but rather .datapipes.datapipe, because that is where the flag is used. If this flag is used in multiple modules in the future, we will need to patch all of them. This is very brittle.

> It's doable, but I am not sure we want to do so, because the goal of automatically using dill is to spare users the work of figuring out whether a DataPipe containing lambda functions is serializable.

Not sure I understand. If we just keep the same detection we have now, users who don't care will not see any difference: if dill is available it will be picked up, and otherwise pickle will be used. But it would give users the option to enforce a particular backend if they need to. Without this option, the environment you use affects the functionality and there is no way to change that. I don't think this is good design.

Even if you don't do it for the users, think about how you want to test pickle vs. dill yourself. Right now the only option is to have two separate workflows, one with dill installed and one without.

Contributor:
> The problem is that we will need to patch every single module where this flag is imported. You cannot patch the place where it is defined; you have to patch the places where it is used. If you look above, we are not patching ._utils.serialization but rather .datapipes.datapipe, because that is where the flag is used. If this flag is used in multiple modules in the future, we will need to patch all of them. This is very brittle.

You can still add the following code to override the method rather than using a patch:

from torch.utils.data import IterDataPipe

def state_fn(self):
    return self.__dict__

IterDataPipe.set_getstate_hook(state_fn)  # every DataPipe now pickles its plain __dict__

> But it would give users the option to enforce a particular backend if they need to. Without this option, the environment you use affects the functionality and there is no way to change that. I don't think this is good design.

This is actually a good argument for users who fully understand what they want to achieve. We may be able to expose an API to switch the backend if needed, similar to set_getstate_hook but with syntactic sugar.
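
A hypothetical sketch of what such a switch could look like (the function and module-level flag below are made up for illustration and do not exist in torch or torchdata):

import pickle

_SERIALIZATION_BACKEND = None  # None = auto-detect, otherwise "pickle" or "dill"

def set_serialization_backend(backend=None):
    # None keeps today's auto-detection; "pickle" or "dill" forces that backend.
    global _SERIALIZATION_BACKEND
    if backend not in (None, "pickle", "dill"):
        raise ValueError(f"unknown serialization backend: {backend}")
    _SERIALIZATION_BACKEND = backend

def _dumps(obj):
    use_dill = _SERIALIZATION_BACKEND == "dill"
    if _SERIALIZATION_BACKEND is None:
        try:
            import dill  # noqa: F401
            use_dill = True
        except ImportError:
            use_dill = False
    if use_dill:
        import dill
        return dill.dumps(obj)
    return pickle.dumps(obj)

With something like this in place, the pickle and dill tests in this diff could call set_serialization_backend("pickle") / set_serialization_backend("dill") instead of patching DILL_AVAILABLE.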

Contributor:
Adding an issue pytorch/data#341

Contributor:
cc: @NivekT Overriding set_getstate_hook with the above function won't actually work for all DataPipes, e.g. Forker: https://github.com/pytorch/pytorch/blob/835cc66e5dd26db558931b4fe47b45e08a3a09f7/torch/utils/data/datapipes/iter/combining.py#L158-L167

Then we should definitely support the backend switch that Philip suggested.
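
For context, a minimal self-contained sketch of why a globally registered getstate hook cannot reach a DataPipe that implements its own __getstate__, as the linked Forker code does (the classes below are stand-ins, not the actual torch classes):

import pickle

class Base:
    getstate_hook = None  # stands in for the hook set via IterDataPipe.set_getstate_hook

    def __getstate__(self):
        if Base.getstate_hook is not None:
            return Base.getstate_hook(self)
        return self.__dict__

class LikeForker(Base):
    def __getstate__(self):
        # Defines its own __getstate__, so the hook on the base class is never consulted.
        return {"came_from": "custom __getstate__"}

Base.getstate_hook = lambda dp: dp.__dict__
restored = pickle.loads(pickle.dumps(LikeForker()))
print(restored.__dict__)  # {'came_from': 'custom __getstate__'}: the hook was bypassed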


dataset_mock.prepare(test_home, config)
dataset = datasets.load(dataset_mock.name, **config)

pickle.dumps(dataset)

@pytest.mark.skipif(not DILL_AVAILABLE, reason="Package `dill` is not available.")
# TODO: remove this as soon as dill is fully supported
@pytest.mark.xfail(reason="See https://github.com/pytorch/data/issues/237", raises=RecursionError)
@parametrize_dataset_mocks(DATASET_MOCKS)
def test_serializable_dill(self, test_home, dataset_mock, config):
import dill

dataset_mock.prepare(test_home, config)
dataset = datasets.load(dataset_mock.name, **config)

dill.dumps(dataset)

# TODO: we need to enforce not only that both a Shuffler and a ShardingFilter are part of the datapipe, but also
# that the Shuffler comes before the ShardingFilter. Early commits in https://github.com/pytorch/vision/pull/5680
# contain a custom test for that, but we opted to wait for a potential solution / test from torchdata for now.
@pytest.mark.xfail(reason="See https://github.com/pytorch/data/issues/237")
@parametrize_dataset_mocks(DATASET_MOCKS)
@pytest.mark.parametrize("annotation_dp_type", (Shuffler, ShardingFilter))
def test_has_annotations(self, test_home, dataset_mock, config, annotation_dp_type):