Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix prototype datasets data loading tests #5711
fix prototype datasets data loading tests #5711
Changes from 6 commits
1106c57
84cde3a
0b77ea8
f382ba8
4846095
b3a630f
e1aaf7a
d8aeb6d
f9b682c
0696613
d237f2f
5c3646d
fcdae07
50ff31e
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ejguan @NivekT since the serialization backend is automatically selected based on the presence of
dill
, it is impossible to test pickle serialization without patching this. For now we only need to patch a single module, but as pytorch/pytorch#74958 (comment) implies, we need to do this in multiple places in the future.Would it be possible give users the option to set the serialization backend? The default can still be
dill
if available otherwisepickle
.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. I can confirm that we will rely on this
DILL_AVAILABLE
for any place in TorchData project to determine ifdill
is available or not.It's doable, but I am not sure if we want to do so because the goal of automatically using
dill
is to reduce the work users need to figure out if the DataPipe is serializable with lambda function.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem is that we will need to patch every single module where this is imported. You cannot patch the place where it is defined, but rather where it is used. If you look above, we are not patching
._utils.serialization
but rather.datapipes.datapipe
, because this is where the flag is used. If we now need use this flag in multiple modules, we need to patch all of them. This is very brittle.Not sure I understand. If we just keep the same detection as we have now, users that don't care should not see any difference. If
dill
is available, it will be picked up and otherwisepickle
will be used. But it would give users the option to enforce a particular backend if they need to. Without this option, the environment you use has an effect on the functionality and there is no way change that. I don't think this is good design.Even if you don't do it for the users, think about how you want to test
pickle
vsdill
yourself. Right now the only option is to have two separate workflows one withdill
installed and one without.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You still can add following code to override the method rather than using
patch
.This is actually a good argument for users who have fully understanding about what they want to achieve. We may be able to expose an API to switch backend if needed, similar to the
set_getstate_hook
but with syntax sugar.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adding an issue pytorch/data#341
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cc: @NivekT Overriding
set_getstate_hook
with the above function won't actually work for all DataPipe likeForker
https://github.com/pytorch/pytorch/blob/835cc66e5dd26db558931b4fe47b45e08a3a09f7/torch/utils/data/datapipes/iter/combining.py#L158-L167Then, we should definitely support backend switch that Philip suggested