Hi there,

Thanks for making a fantastic bit of code. I've got a question about what to do if your problem won't fit into GPU RAM on one device, but will on two. You make reference to using cudaMallocManaged, which I naïvely understand would handle host/GPU page faulting in case the problem is too large, whereas there is much pain (!) to be had in spanning multiple devices and keeping them synchronised.
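(For context, my rough mental model of what cudaMallocManaged buys you is the standalone sketch below. This is just illustrative CUDA, not anything from BART; the kernel and buffer size are made up.)

```c
#include <stdio.h>
#include <cuda_runtime.h>

__global__ void scale(float *x, size_t n, float a)
{
	size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;

	if (i < n)
		x[i] *= a;
}

int main(void)
{
	size_t n = 1UL << 28;	// 2^28 floats (~1 GiB); could be made larger than one device's memory
	float *x = NULL;

	// one allocation, one pointer, valid on host and device;
	// the driver migrates pages on demand instead of failing outright
	if (cudaSuccess != cudaMallocManaged((void**)&x, n * sizeof(float), cudaMemAttachGlobal))
		return 1;

	for (size_t i = 0; i < n; i++)	// first touch on the host
		x[i] = 1.f;

	scale<<<(unsigned int)((n + 255) / 256), 256>>>(x, n, 2.f);	// pages fault over to the GPU on access
	cudaDeviceSynchronize();

	printf("x[0] = %f\n", x[0]);	// and migrate back on host access
	cudaFree(x);

	return 0;
}
```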
a) You have explicitly disabled this and gone for manual memory management by hardcoding a parameter (cuda_global_memory) as false. Is this for performance reasons? I do find that CPU-only approaches are indeed much faster.
b) Do you have any plans to permit multi-GPU usage, spanning devices with some sort of NUMA-style architecture? My problem is about 110 GB in RAM – don't ask! – and I realise this is a huge amount of work and the answer is probably 'no'.
Thanks for your help,
a) In my experience, there are some rare situations where global memory is slower than pure GPU memory. For example, host-to-device copies do not overlap with compute tasks. Previously, some tools used global memory and some didn't; we unified this and turned it off by default. However, you can activate global memory via an environment variable by setting BART_GPU_GLOBAL_MEMORY=1.
This does not seem to be documented, but we'll add it soon.
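Conceptually, the switch amounts to something like the sketch below. This is illustrative only, not our actual code, and gpu_alloc is just a made-up helper name; only the environment variable BART_GPU_GLOBAL_MEMORY and the CUDA calls are real.

```c
#include <stdlib.h>
#include <string.h>
#include <cuda_runtime.h>

/* Illustrative sketch: the environment variable selects between plain
 * device memory and CUDA managed ("global") memory for GPU buffers. */
static void* gpu_alloc(size_t size)
{
	void *ptr = NULL;
	const char *env = getenv("BART_GPU_GLOBAL_MEMORY");
	int use_global = (NULL != env) && (0 == strcmp(env, "1"));

	if (use_global)
		cudaMallocManaged(&ptr, size, cudaMemAttachGlobal);	// pageable, can exceed one device's RAM
	else
		cudaMalloc(&ptr, size);	// plain device memory

	return ptr;
}
```

In practice it just means prefixing your existing call, e.g. something like `BART_GPU_GLOBAL_MEMORY=1 bart pics -g ...` (with whatever pics options you already use).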
b) We have support for multi-GPU based on MPI. So far, it is available in the pics tool via command-line options and for training in the deep-learning tools (reconet and nlinvnet). If you tell us more, maybe your problem is already covered.