
CUDA – memory spanning multiple devices and a question about cuda_global_memory #335

Open
NeutralKaon opened this issue Aug 9, 2024 · 1 comment

Comments

NeutralKaon commented Aug 9, 2024

Hi there,

Thanks for making a fantastic bit of code. I've got a question about what to do if your problem won't fit into GPU RAM on one device, but will on two. You make reference to using cudaMallocManaged, which I naïvely understand would handle host/GPU page faulting in case the problem is too large, and I realise there is much pain (!) to be had in spanning multiple devices and keeping them synchronised.
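
For context, this is the kind of managed allocation I have in mind (a minimal sketch only; the buffer size is purely illustrative):

```c
#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    // Illustrative size only: a buffer larger than a single GPU's memory,
    // relying on unified-memory paging to oversubscribe the device.
    size_t n = ((size_t)110 << 30) / sizeof(float);

    float* buf = NULL;
    cudaError_t err = cudaMallocManaged((void**)&buf, n * sizeof(float), cudaMemAttachGlobal);

    if (cudaSuccess != err) {
        fprintf(stderr, "cudaMallocManaged failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Pages migrate between host and device on demand as kernels touch them.
    // ... launch kernels on buf here ...

    cudaFree(buf);
    return 0;
}
```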

a) You have explicitly disabled this and gone for manual memory management by hardcoding a parameter (cuda_global_memory) to false. Is this for performance reasons? I do find that CPU-only approaches are indeed much faster.

b) Do you have any plans to permit multi-GPU usage and to span devices with some sort of NUMA-style architecture? My problem is about 110 GB in RAM – don't ask! – and I realise this is a huge amount of work and the answer is probably 'no'.

Thanks for your help,

mblum94 (Contributor) commented Aug 12, 2024

Hi there,

a) In my experience, there are some rare situations where global memory is slower than pure GPU memory; for example, host-to-device copies do not overlap with compute tasks. Previously, some tools used global memory and some didn't. We unified this and turned it off by default. However, you can activate global memory via an environment variable by setting BART_GPU_GLOBAL_MEMORY=1.
This does not seem to be documented yet, but we will add it soon.
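
Conceptually, the switch just selects between the two allocation paths. A simplified sketch of that idea (not the actual BART source) would be:

```c
#include <cuda_runtime.h>
#include <stdlib.h>
#include <string.h>

// Simplified sketch: pick managed ("global") memory or plain device memory
// depending on the environment variable. Illustrates the idea, not BART's code.
static void* gpu_alloc(size_t size)
{
    void* ptr = NULL;
    const char* env = getenv("BART_GPU_GLOBAL_MEMORY");

    if ((NULL != env) && (0 == strcmp(env, "1")))
        cudaMallocManaged(&ptr, size, cudaMemAttachGlobal); // pageable, can oversubscribe the GPU
    else
        cudaMalloc(&ptr, size);                             // resident on a single device

    return ptr;
}
```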

b) We have support for multi-GPU based on MPI. So far, it is available in the pics tool via command-line options and for training in the deep-learning tools (reconet and nlinvnet). If you tell us more about your use case, your problem may already be covered.
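
As a rough illustration of the general pattern (not our actual implementation), one MPI rank per GPU, each holding its own slice of the data, looks like this:

```c
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);

    int rank = 0, size = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    // Bind each rank to one GPU (round-robin if there are more ranks than devices).
    int ndev = 0;
    cudaGetDeviceCount(&ndev);

    if (0 == ndev) {
        fprintf(stderr, "No CUDA devices found.\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    cudaSetDevice(rank % ndev);

    // Each rank allocates only its share of the full problem.
    size_t total_elems = (size_t)1 << 30;                   // illustrative total size
    size_t local_elems = (total_elems + size - 1) / size;

    float* local = NULL;
    cudaMalloc((void**)&local, local_elems * sizeof(float));

    // ... each rank works on its slice; partial results are combined with MPI reductions ...

    cudaFree(local);
    MPI_Finalize();
    return 0;
}
```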

Best,
Moritz
