DynUnet get_feature_maps() issue when using distributed training #1564

Closed
yiheng-wang-nv opened this issue Feb 8, 2021 · 10 comments · Fixed by #1596
Labels: bug (Something isn't working)
yiheng-wang-nv (Contributor) commented Feb 8, 2021

Describe the bug
If we use DistributedDataParallel to wrap the network, calling the get_feature_maps() function raises the following error:

[Screenshot: traceback of the error raised when calling get_feature_maps() on the DDP-wrapped network]
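For reference, a minimal sketch of why the call fails (a stand-in module, not the actual DynUNet code; it assumes a process group and a CUDA device are already initialised): DistributedDataParallel only proxies forward(), so custom methods have to be reached through .module.

import torch
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

class TinyNet(nn.Module):
    # stand-in for DynUNet: caches intermediate maps and exposes them via a method
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv3d(1, 2, 3, padding=1)
        self.maps = []

    def forward(self, x):
        out = self.conv(x)
        self.maps = [out]  # cache feature maps during forward
        return out

    def get_feature_maps(self):
        return self.maps

model = DDP(TinyNet().cuda(), device_ids=[0])
model(torch.rand(1, 1, 8, 8, 8, device="cuda"))
# model.get_feature_maps()       # raises AttributeError: DDP does not expose the wrapped module's methods
model.module.get_feature_maps()  # common workaround: reach into the wrapped module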

Returning multiple variables from a network's forward function is common, not only in this case but also in other situations such as computing a triplet loss in metric-learning tasks. Therefore, I think we should reconsider what the forward function of DynUnet returns.

Let me think about it and submit a PR, then we can discuss it further.
@Nic-Ma @wyli @rijobro

Nic-Ma (Contributor) commented Feb 8, 2021

I recommend the following change to fix this issue:
Remove the get_feature_maps() method and change the logic of forward():

  1. if training is True and deep_supervision is True, return a list of Tensors.
  2. otherwise, return the output Tensor.

This is slightly different from the original code, which always returns a list of Tensors.
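A rough sketch of that logic, reusing the output_block / heads / deep_supr_num attribute names that appear elsewhere in this thread (hypothetical, not the final implementation):

def forward(self, x):
    out = self.skip_layers(x)
    out = self.output_block(out)
    if self.training and self.deep_supervision:
        # training with deep supervision: main output plus the extra supervision heads
        return [out] + self.heads[1 : self.deep_supr_num + 1]
    # validation / inference: a single output Tensor
    return out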

@wyli @rijobro @ericspod , what do you guys think?

Thanks.

Nic-Ma added the bug (Something isn't working) label on Feb 8, 2021
rijobro (Contributor) commented Feb 8, 2021

This will revert the behaviour of #1393. I suppose this is fine as long as we make the documentation clear so that future users aren't confused.

Nic-Ma (Contributor) commented Feb 9, 2021

Hi @rijobro ,

I think most of the previous problems were due to the list output during validation or inference.
So I suggest returning a list of data during training, and returning only the output tensor instead of [out] during validation or inference.

Thanks.

yiheng-wang-nv (Contributor, Author) commented Feb 10, 2021

> (quoting @Nic-Ma's comment above)

Hi @Nic-Ma @rijobro , actually at the beginning, in val or infer modes, it did return only the output tensor. However, returning different types is not permitted by TorchScript, so starting from this version list-based results are returned in all modes.
If we always return a list, then our default inferrer is not compatible, since it relies on tensor-based input.
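For context, a minimal standalone sketch of the TorchScript constraint (unrelated to DynUNet itself): torch.jit.script rejects a forward() whose branches return different types.

import torch
import torch.nn as nn

class Net(nn.Module):
    def forward(self, x: torch.Tensor):
        if self.training:
            return [x, x * 2]  # List[Tensor] in one branch ...
        return x               # ... but Tensor in the other

torch.jit.script(Net())  # fails: TorchScript requires a single, consistent return type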

Nic-Ma (Contributor) commented Feb 18, 2021

@ericspod @wyli @rijobro ,

Do you guys have any better ideas for solving this issue? It seems it isn't easy to find a perfect solution...
Thanks in advance.

rijobro (Contributor) commented Feb 18, 2021

So it seems that we need to be consistent with what we return, regardless of train mode, etc., is that correct?

If so, what if we return:

if self.deep_supervision:
    return {"data": self.output_block(out), "feature_map": self.heads[1 : self.deep_supr_num + 1]}
else:
    return self.output_block(out)

In this fashion the output is always consistent (since deep_supervision is set at construction time and will not change over the lifetime of the object).

By returning a dictionary, hopefully it will be clearer to users what we are returning (since users were confused by the list previously returned).

wyli (Contributor) commented Feb 18, 2021

> (quoting @rijobro's dict-return suggestion above)

Thanks, this looks good to me as well. We could make the default value self.deep_supervision=False, so that the user wouldn't hit this error by default:

seg_prob = predictor(window_data, *args, **kwargs).to(device) # batched patch segmentation
AttributeError: 'list' object has no attribute 'to'

Nic-Ma (Contributor) commented Feb 18, 2021

@yiheng-wang-nv ,

Please try the dict return first; I am afraid it can't fix the TorchScript issue...
Thanks.

rijobro (Contributor) commented Feb 18, 2021

If the default inferrer requires a tensor, then presumably you could wrap that function:

def default_dict_inferrer(input, key, *args, **kwargs):
    return default_inferrer(input[key], *args, **kwargs)
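Along the same lines, a hedged sketch of adapting MONAI's sliding_window_inference to the hypothetical dict output (net, image and the "data" key are illustrative placeholders, not existing API):

from monai.inferers import sliding_window_inference

# select the main prediction out of the dict before the sliding-window stitching,
# which expects the predictor to return a plain Tensor
seg_prob = sliding_window_inference(
    inputs=image,
    roi_size=(96, 96, 96),
    sw_batch_size=4,
    predictor=lambda x: net(x)["data"],
)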

yiheng-wang-nv (Contributor, Author) commented Feb 19, 2021

Hi @rijobro , thanks for the advice, but returning this kind of dict doesn't fix the TorchScript issue, as @Nic-Ma mentioned. I suggest we still return a list here and add the corresponding docstrings. To help users work with this network, I will also add/update the tutorials. Let me submit a PR first for you to review.
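For what it's worth, a hedged sketch of how a training loop might consume a list-based output for deep supervision (model, images, labels and loss_fn are placeholders; the list ordering and the 0.5 weighting are purely illustrative, not documented behaviour):

import torch.nn.functional as F

preds = model(images)             # list of Tensors: [main output, auxiliary heads ...]
loss = loss_fn(preds[0], labels)
for aux in preds[1:]:
    aux = F.interpolate(aux, size=labels.shape[2:])  # match the label resolution
    loss = loss + 0.5 * loss_fn(aux, labels)
loss.backward()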
