Describe the bug
I adapted the Vertical Federated Split Learning CIFAR10 example to a ResNet50 and aimed to split the network as close to the middle as possible. My code works when the second client only has the final flatten layer and the linear layer. It also works when I place the split closer to the middle and only run training. However, when _validate() is called during training, I get a torch.cuda.OutOfMemoryError: "CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 23.64 GiB, of which 21.81 MiB is free. Process 87262 has 1.17 GiB in use, process 87261 has 5.39 GiB in use, and process 4034216 has 2.70 GiB in use. Including non-PyTorch memory, this process has 13.91 GiB in use. Of the allocated memory, 12.27 GiB is allocated by PyTorch, and 505.31 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large, try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF."
To Reproduce
Code: https://github.com/eshatkeinensinn/NVFlare/tree/main/examples/advanced/vertical_federated_learning/cifar10-split-res
Run the regular README steps and launch Jupyter Lab.
Additional context
My suspicion is that, because the tensors passed between the clients are larger than when the split is set later in the network, they are held differently in _validate() than in train(), which leads to the out-of-memory error.
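A minimal way to check this would be to log PyTorch's GPU memory counters around the two calls; the placement comments below are hypothetical and refer to my adapted learner, not the original example:

```python
import torch

def log_gpu_mem(tag: str) -> None:
    # Print PyTorch's view of GPU memory; fragmentation shows up as a large
    # gap between reserved and allocated memory.
    alloc_mib = torch.cuda.memory_allocated() / 2**20
    reserved_mib = torch.cuda.memory_reserved() / 2**20
    print(f"[{tag}] allocated={alloc_mib:.1f} MiB reserved={reserved_mib:.1f} MiB")

# Hypothetical placement inside the split-learning training loop:
# log_gpu_mem("after train step")
# ... _validate() ...
# log_gpu_mem("after _validate")
```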
I think this issue is because the GPU does not have enough memory to run both training and validation at the same time.
So one option is simply not to do them at the same time.
If you still want to do them both at the same time, then you can try:
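For example, something along these lines (a generic sketch of common PyTorch OOM mitigations; `validate` below is a stand-in for the learner's `_validate` and is not the exact NVFlare API):

```python
import os
import torch

# Reduce allocator fragmentation, as the error message suggests; this must be
# set before the first CUDA allocation in the process.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:128")

def validate(model: torch.nn.Module, loader, device: torch.device) -> float:
    """Run validation without autograd so intermediate activations are freed."""
    model.eval()
    correct, total = 0, 0
    with torch.no_grad():
        for x, y in loader:
            preds = model(x.to(device)).argmax(dim=1)
            correct += (preds.cpu() == y).sum().item()
            total += y.numel()
    model.train()
    # Hand cached blocks back to the driver so the other processes sharing the
    # GPU can use them.
    torch.cuda.empty_cache()
    return correct / max(total, 1)
```

Running validation under torch.no_grad() is usually the biggest win here, since it stops PyTorch from keeping the intermediate activations, which are much larger with a mid-network split.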