-
Notifications
You must be signed in to change notification settings - Fork 149
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Graph traversal is broken for custom iter datapipes #237
Comments
I believe this is due to the function |
One of the reasons we are looking into this architecture is to be able to access instance attributes in these functions. Thus, moving the methods to functions would somewhat make the implementation more verbose. cc @NicolasHug |
Another way would be changing it to staticmethod as this function doesn't really use |
Thanks for the info @ejguan
In general, we'll be using non-static functions that rely on arbitrary So that we can plan on how to best move forward on our side, how long do you think a fix would take @ejguan ? |
|
It will be my top priority next week. I need to prepare all release-related stuff this week. |
I was able to provide a work around for
In order to provide solution for it, please correct me if I am wrong about the requirements:
cc: @pmeier @NicolasHug |
diff --git a/torch/utils/data/datapipes/datapipe.py b/torch/utils/data/datapipes/datapipe.py
index 0a90cec8d6..f1dd0ab88b 100644
--- a/torch/utils/data/datapipes/datapipe.py
+++ b/torch/utils/data/datapipes/datapipe.py
@@ -93,13 +93,7 @@ class IterDataPipe(IterableDataset[T_co], metaclass=_DataPipeMeta):
def __getstate__(self):
if IterDataPipe.getstate_hook is not None:
return IterDataPipe.getstate_hook(self)
- state_dict = {}
- for k, v in self.__dict__.items():
- if callable(v):
- state_dict[k] = serialize_fn(v)
- else:
- state_dict[k] = v
- return state_dict
+ return self.__dict__
def __setstate__(self, state_dict):
for k, v in state_dict.items():
serialization works as expected: import pickle
from torchdata.datapipes.iter import IterDataPipe, IterableWrapper
class CustomIterDataPipe(IterDataPipe):
def noop(self, x):
return x + 1
def __init__(self):
self._dp = IterableWrapper([1, 2, 4]).map(self.noop)
def __iter__(self):
yield from self._dp
assert list(CustomIterDataPipe()) == list(pickle.loads(pickle.dumps(CustomIterDataPipe()))) The logic I patched above was introduced in pytorch/pytorch#72896 cc @NivekT. |
Wondering what would be the visualized graph with such circular reference? |
Yeah. We might need to use a different method to handle dispatch pickling method for callable inside IterDataPipe. |
Actually, I am not 100% sure this issue can be resolved by using We could add a test for such circular reference DataPipe and make sure the |
I can take a look after I finished landing S3 |
FYI, this PR is work in progress (basically add graph traversal support for |
@ejguan @NivekT , going back to @pmeier 's #237 (comment):
Fixing the cyclic reference issue would allow us to move forward with our preferred design for torchvision new datasets. |
No. We can revert the custom serialization code to unblock you. cc: @NivekT about if reverting will cause BC breaking? |
The most common use case for For context, prior to that PR, we also used Without https://github.com/pytorch/pytorch/blob/master/test/test_datapipe.py#L645 I am not sure how many users use |
IIUC, the problem arises only because we are currently trying to support both, correct? Would it be possible to hard depend on |
I think adding |
Thanks @NivekT
This means that the cyclic reference issue is still happening for users that have Do you think there is a way to fix the |
Yep, the PR is meant to be a quick fix to unblock your work and I marked that as a TODO. I agree with your point and am looking into it. |
…nstalled" This is a quick fix. Only applies the custom serialization logic when `dill` is installed. The specific issue mentioned below should be resolved if `dill` is not installed in the local environment. See the [issue in this comment](pytorch/data#237 (comment)). I also plan to add some test related to circular dependency in the serialization process. [ghstack-poisoned]
This PR should resolve the recursion problem of But, for the problem of |
…rse (#75034) Summary: - Fix _Demux can not be pickled with DILL presented #74958 (comment) - And add cache to traverse function to prevent infinite recursion for circular reference of DataPipe (Fixes pytorch/data#237) Pull Request resolved: #75034 Approved by: https://github.com/wenleix Test Plan: contbuild & OSS CI, see https://hud.pytorch.org/commit/pytorch/pytorch/841a7f5187c7e5ed681777a8276b49a34b3d29fb Reviewed By: b0noI Differential Revision: D35295370 Pulled By: ejguan fbshipit-source-id: 4182476bd2ca36b3dbc6beb6eb41bc9e45337f32
…for non-method functions" This PR further restricts the application of our custom serialization function in `DataPipe`. It will only be applied if: 1. `dill` is installed 2. The object in question is a non-method function If there isn't any edge case, I believe this should address the circular dependency issue raised in pytorch/data#237. Please let me know if you can think of anything that can potentially break this. [ghstack-poisoned]
…unctions" This PR further restricts the application of our custom serialization function in `DataPipe`. It will only be applied if: 1. `dill` is installed 2. The object in question is a non-method function If there isn't any edge case, I believe this should address the circular dependency issue raised in pytorch/data#237. Please let me know if you can think of anything that can potentially break this. [ghstack-poisoned]
…for non-method functions" This PR further restricts the application of our custom serialization function in `DataPipe`. It will only be applied if: 1. `dill` is installed 2. The object in question is a non-method function If there isn't any edge case, I believe this should address the circular dependency issue raised in pytorch/data#237. Please let me know if you can think of anything that can potentially break this. [ghstack-poisoned]
…unctions" This PR further restricts the application of our custom serialization function in `DataPipe`. It will only be applied if: 1. `dill` is installed 2. The object in question is a non-method function If there isn't any edge case, I believe this should address the circular dependency issue raised in pytorch/data#237. Please let me know if you can think of anything that can potentially break this. [ghstack-poisoned]
…for non-method functions" This PR further restricts the application of our custom serialization function in `DataPipe`. It will only be applied if: 1. `dill` is installed 2. The object in question is a non-method function If there isn't any edge case, I believe this should address the circular dependency issue raised in pytorch/data#237. Please let me know if you can think of anything that can potentially break this. [ghstack-poisoned]
…unctions" This PR further restricts the application of our custom serialization function in `DataPipe`. It will only be applied if: 1. `dill` is installed 2. The object in question is a non-method function If there isn't any edge case, I believe this should address the circular dependency issue raised in pytorch/data#237. Please let me know if you can think of anything that can potentially break this. [ghstack-poisoned]
…for non-method functions" This PR further restricts the application of our custom serialization function in `DataPipe`. It will only be applied if: 1. `dill` is installed 2. The object in question is a non-method lambda function If there isn't any edge case, I believe this should address the circular dependency issue raised in pytorch/data#237. Please let me know if you can think of anything that can potentially break this. [ghstack-poisoned]
…unctions" This PR further restricts the application of our custom serialization function in `DataPipe`. It will only be applied if: 1. `dill` is installed 2. The object in question is a non-method lambda function If there isn't any edge case, I believe this should address the circular dependency issue raised in pytorch/data#237. Please let me know if you can think of anything that can potentially break this. [ghstack-poisoned]
…for non-method functions" This PR further restricts the application of our custom serialization function in `DataPipe`. It will only be applied if: 1. `dill` is installed 2. The object in question is a non-method lambda function If there isn't any edge case, I believe this should address the circular dependency issue raised in pytorch/data#237. Please let me know if you can think of anything that can potentially break this. [ghstack-poisoned]
…unctions" This PR further restricts the application of our custom serialization function in `DataPipe`. It will only be applied if: 1. `dill` is installed 2. The object in question is a non-method lambda function If there isn't any edge case, I believe this should address the circular dependency issue raised in pytorch/data#237. Please let me know if you can think of anything that can potentially break this. [ghstack-poisoned]
…for non-method functions" This PR further restricts the application of our custom serialization function in `DataPipe`. It will only be applied if: 1. `dill` is installed 2. The object in question is a non-method lambda function If there isn't any edge case, I believe this should address the circular dependency issue raised in pytorch/data#237. Please let me know if you can think of anything that can potentially break this. [ghstack-poisoned]
…unctions" This PR further restricts the application of our custom serialization function in `DataPipe`. It will only be applied if: 1. `dill` is installed 2. The object in question is a non-method lambda function If there isn't any edge case, I believe this should address the circular dependency issue raised in pytorch/data#237. Please let me know if you can think of anything that can potentially break this. [ghstack-poisoned]
…ataPipes" This PR removes the custom serialization logic (i.e. try with `pickle` then with `dill`) inside `__getstate__` and `__setstate__` of DataPipe classes because it causes circular reference and exceptions. Instead, the attempt to use `dill` instead of `pickle` is moved to the `traversal` function for DataPipe graphs and will be included within DataLoader2 as well. This should fix pytorch/data#237. Differential Revision: [D36111757](https://our.internmc.facebook.com/intern/diff/D36111757) [ghstack-poisoned]
This PR removes the custom serialization logic (i.e. try with `pickle` then with `dill`) inside `__getstate__` and `__setstate__` of DataPipe classes because it causes circular reference and exceptions. Instead, the attempt to use `dill` instead of `pickle` is moved to the `traversal` function for DataPipe graphs and will be included within DataLoader2 as well. This should fix pytorch/data#237. Differential Revision: [D36111757](https://our.internmc.facebook.com/intern/diff/D36111757) [ghstack-poisoned]
…ataPipes" This PR removes the custom serialization logic (i.e. try with `pickle` then with `dill`) inside `__getstate__` and `__setstate__` of DataPipe classes because it causes circular reference and exceptions. Instead, the attempt to use `dill` instead of `pickle` is moved to the `traversal` function for DataPipe graphs and will be included within DataLoader2 as well. As a result, some tests need to be changed (to use `dill` explicitly when necessary) and `serialization.py` has mostly been cleaned up as the functions are no longer needed. This should fix pytorch/data#237. Differential Revision: [D36111757](https://our.internmc.facebook.com/intern/diff/D36111757) [ghstack-poisoned]
This PR removes the custom serialization logic (i.e. try with `pickle` then with `dill`) inside `__getstate__` and `__setstate__` of DataPipe classes because it causes circular reference and exceptions. Instead, the attempt to use `dill` instead of `pickle` is moved to the `traversal` function for DataPipe graphs and will be included within DataLoader2 as well. As a result, some tests need to be changed (to use `dill` explicitly when necessary) and `serialization.py` has mostly been cleaned up as the functions are no longer needed. This should fix pytorch/data#237. Differential Revision: [D36111757](https://our.internmc.facebook.com/intern/diff/D36111757) [ghstack-poisoned]
…ataPipes" This PR removes the custom serialization logic (i.e. try with `pickle` then with `dill`) inside `__getstate__` and `__setstate__` of DataPipe classes because it causes circular reference and exceptions. Instead, the attempt to use `dill` instead of `pickle` is moved to the `traversal` function for DataPipe graphs and will be included within DataLoader2 as well. As a result, some tests need to be changed (to use `dill` explicitly when necessary) and `serialization.py` has mostly been cleaned up as the functions are no longer needed. This should fix pytorch/data#237. Differential Revision: [D36111757](https://our.internmc.facebook.com/intern/diff/D36111757) [ghstack-poisoned]
This PR removes the custom serialization logic (i.e. try with `pickle` then with `dill`) inside `__getstate__` and `__setstate__` of DataPipe classes because it causes circular reference and exceptions. Instead, the attempt to use `dill` instead of `pickle` is moved to the `traversal` function for DataPipe graphs and will be included within DataLoader2 as well. As a result, some tests need to be changed (to use `dill` explicitly when necessary) and `serialization.py` has mostly been cleaned up as the functions are no longer needed. This should fix pytorch/data#237. Differential Revision: [D36111757](https://our.internmc.facebook.com/intern/diff/D36111757) [ghstack-poisoned]
…ataPipes" This PR removes the custom serialization logic (i.e. try with `pickle` then with `dill`) inside `__getstate__` and `__setstate__` of DataPipe classes because it causes circular reference and exceptions. Instead, the attempt to use `dill` instead of `pickle` is moved to the `traversal` function for DataPipe graphs and will be included within DataLoader2 as well. As a result, some tests need to be changed (to use `dill` explicitly when necessary) and `serialization.py` has mostly been cleaned up as the functions are no longer needed. This should fix pytorch/data#237. Differential Revision: [D36111757](https://our.internmc.facebook.com/intern/diff/D36111757) [ghstack-poisoned]
This PR removes the custom serialization logic (i.e. try with `pickle` then with `dill`) inside `__getstate__` and `__setstate__` of DataPipe classes because it causes circular reference and exceptions. Instead, the attempt to use `dill` instead of `pickle` is moved to the `traversal` function for DataPipe graphs and will be included within DataLoader2 as well. As a result, some tests need to be changed (to use `dill` explicitly when necessary) and `serialization.py` has mostly been cleaned up as the functions are no longer needed. This should fix pytorch/data#237. Differential Revision: [D36111757](https://our.internmc.facebook.com/intern/diff/D36111757) [ghstack-poisoned]
…ataPipes" This PR removes the custom serialization logic (i.e. try with `pickle` then with `dill`) inside `__getstate__` and `__setstate__` of DataPipe classes because it causes circular reference and exceptions. Instead, the attempt to use `dill` instead of `pickle` is moved to the `traversal` function for DataPipe graphs and will be included within DataLoader2 as well. As a result, some tests need to be changed (to use `dill` explicitly when necessary) and `serialization.py` has mostly been cleaned up as the functions are no longer needed. This should fix pytorch/data#237. Differential Revision: [D36111757](https://our.internmc.facebook.com/intern/diff/D36111757) [ghstack-poisoned]
This PR removes the custom serialization logic (i.e. try with `pickle` then with `dill`) inside `__getstate__` and `__setstate__` of DataPipe classes because it causes circular reference and exceptions. Instead, the attempt to use `dill` instead of `pickle` is moved to the `traversal` function for DataPipe graphs and will be included within DataLoader2 as well. As a result, some tests need to be changed (to use `dill` explicitly when necessary) and `serialization.py` has mostly been cleaned up as the functions are no longer needed. This should fix pytorch/data#237. Differential Revision: [D36111757](https://our.internmc.facebook.com/intern/diff/D36111757) [ghstack-poisoned]
Summary: Pull Request resolved: #74984 This PR removes the custom serialization logic (i.e. try with `pickle` then with `dill`) inside `__getstate__` and `__setstate__` of DataPipe classes because it causes circular reference and exceptions. Instead, the attempt to use `dill` instead of `pickle` is moved to the `traversal` function for DataPipe graphs and will be included within DataLoader2 as well. This should fix pytorch/data#237. Test Plan: Imported from OSS Reviewed By: ejguan Differential Revision: D36111757 Pulled By: NivekT fbshipit-source-id: ff0c0d58dffd4cf3a839e83adee7033638046a49
Summary: Pull Request resolved: #74984 This PR removes the custom serialization logic (i.e. try with `pickle` then with `dill`) inside `__getstate__` and `__setstate__` of DataPipe classes because it causes circular reference and exceptions. Instead, the attempt to use `dill` instead of `pickle` is moved to the `traversal` function for DataPipe graphs and will be included within DataLoader2 as well. This should fix pytorch/data#237. Test Plan: Imported from OSS Reviewed By: ejguan Differential Revision: D36111757 Pulled By: NivekT fbshipit-source-id: ff0c0d58dffd4cf3a839e83adee7033638046a49
I believe this is fixed by pytorch/pytorch#74984 |
Without the
.map()
call it works fine. I don't think this is specific to.map()
though. From trying a few datapipes, this always happens ifself._dp
is composed in some way.The text was updated successfully, but these errors were encountered: