
MMDataParallel Implementation inconsistency between CPU / GPU #792

Closed
24hours opened this issue Jan 16, 2021 · 5 comments


24hours commented Jan 16, 2021

It seems that MMDataParallel supports a CPU-only mode.

However, when self.device_ids is falsy, the line at https://github.com/open-mmlab/mmcv/blob/master/mmcv/parallel/data_parallel.py#L52 is executed and raises an error:

 return self.module.train_step(*inputs, **kwargs)

In GPU mode, the following code at https://github.com/open-mmlab/mmcv/blob/master/mmcv/parallel/data_parallel.py#L67 is executed instead, with no error:

return self.module.train_step(*inputs[0], **kwargs[0])

Modifying line 52 to (*inputs[0], **kwargs[0]) seems to resolve the error with no other issues. Is there a reason for the difference between line 52 and line 67?
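To see why the indexing matters, here is a minimal sketch of the shape of scatter's output. The names fake_scatter and train_step are illustrative stand-ins, not mmcv's actual API: scatter-style helpers return one args tuple and one kwargs dict per target device, so even a single-device run wraps everything one level deep.

```python
def fake_scatter(args, kwargs, num_devices=1):
    # One copy of args/kwargs per device, mimicking the nesting of
    # scatter-style output: a list of per-device argument tuples.
    return [args] * num_devices, [kwargs] * num_devices

inputs, kwargs_list = fake_scatter((1, 2), {"flag": True})

def train_step(a, b, flag=False):
    return a + b if flag else 0

# The GPU branch indexes into the per-device wrapper before unpacking:
result = train_step(*inputs[0], **kwargs_list[0])
print(result)  # 3

# Calling train_step(*inputs, **kwargs_list) without the [0] index would
# pass the wrapped containers themselves and raise a TypeError, which is
# consistent with the error seen on line 52.
```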

In addition,

why does CPU mode unsqueeze an additional dimension? https://github.com/open-mmlab/mmcv/blob/master/mmcv/parallel/_functions.py#L28
output = output.unsqueeze(0)

The comment mentions:

# unsquzee the first dimension thus the tensor's shape is the
# same as those scattered with GPU.

But it is not clear how the GPU path ends up with that additional dimension.
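For concreteness, here is a torch-free sketch of what tensor.unsqueeze(0) does to a shape (the shape helper below is purely illustrative, computing the dimensions of a nested list):

```python
def shape(nested):
    # Walk down the nesting levels, recording the length at each level,
    # to mimic reading off a tensor's shape.
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0]
    return tuple(dims)

batch = [[0.0] * 3 for _ in range(20)]  # stands in for a (20, 3) tensor
print(shape(batch))    # (20, 3)
print(shape([batch]))  # (1, 20, 3) -- the "unsqueezed" shape, with a
                       # leading dimension of size 1 added in front
```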


24hours commented Jan 22, 2021

I have found another inconsistency between CPU and GPU:

from mmcv.parallel import scatter_kwargs, DataContainer
import torch

inputs = (torch.zeros([20, 3, 128, 128]), )
output, _ = scatter_kwargs(inputs, {}, [-1], 0)
print('CPU', output[0][0].size())
output, _ = scatter_kwargs(inputs, {}, [0], 0)
print('GPU', output[0][0].size())

print('------------')
inputs = (DataContainer([torch.zeros([20, 3, 128, 128])]),)
output, _ = scatter_kwargs(inputs, {}, [-1], 0)
print('CPU', output[0][0].size())
output, _ = scatter_kwargs(inputs, {}, [0], 0)
print('GPU', output[0][0].size())

Output:

CPU torch.Size([20, 3, 128, 128])
GPU torch.Size([20, 3, 128, 128])
------------
CPU torch.Size([1, 20, 3, 128, 128])
GPU torch.Size([20, 3, 128, 128])

You can see that the CPU path returns the wrong size compared to the GPU implementation.

@ycxioooong (Contributor)

Thanks for reporting this issue, we'll carefully check the inconsistency.


24hours commented Jan 25, 2021

Hi,

I found that updating
https://github.com/open-mmlab/mmcv/blob/master/mmcv/parallel/scatter_gather.py#L24

to

if obj.cpu_only or target_gpus == [-1]:

fixes the issue; however, I have not run the unit tests on the change yet.
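The effect of the proposed condition can be sketched as follows. FakeDataContainer and route are illustrative stand-ins, not mmcv's classes; the point is only that with the extra `target_gpus == [-1]` check, CPU-targeted scatters take the same branch as cpu_only data and so avoid the extra unsqueeze:

```python
class FakeDataContainer:
    # Minimal stand-in exposing only the cpu_only flag that the
    # routing condition inspects.
    def __init__(self, data, cpu_only=False):
        self.data = data
        self.cpu_only = cpu_only

def route(obj, target_gpus):
    # The proposed condition: also route device -1 (CPU) scatters
    # through the cpu_only branch.
    if obj.cpu_only or target_gpus == [-1]:
        return "cpu_branch"
    return "gpu_branch"

print(route(FakeDataContainer([0]), [-1]))  # cpu_branch
print(route(FakeDataContainer([0]), [0]))   # gpu_branch
```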

@ycxioooong (Contributor)

Thanks for the information. We'll update the code accordingly.


24hours commented Feb 16, 2021

I have investigated the issue further and found that mmcv does not intend to support CPU training.
I tried modifying some code to allow CPU training, but it would require extensive changes to the MMCV code base, and it is probably not wise to do that without discussing it with the MMCV team.

I will close this issue.
