
Shared pinned buffers #120

Open · bengioe wants to merge 25 commits into trunk from bengioe-mp-with-batch-buffers
Conversation

@bengioe (Collaborator) commented Feb 23, 2024

This PR implements a better way of sharing torch tensors between processes: (large enough) shared tensors are created once and reused as a transfer mechanism. Doing this on the fragment environment seh_frag.py, I'm getting a 30% wall-time improvement in simple settings with batch size 64 (I'm sure we could have fun maxing that out and seeing how far we can take GPU utilization). A sketch of the idea is below.
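
Roughly, the mechanism works like this (a minimal sketch with illustrative names, not the PR's actual API): allocate one large buffer up front, share it with the worker process, copy each tensor's data into a slice of it, and send only (offset, shape) metadata through the queue, so tensor payloads never go through pickle per message.

```python
# Minimal sketch of the transfer mechanism (illustrative names, not the
# PR's actual API). The buffer is allocated once and shared; only a few
# ints per tensor cross the process boundary.
import math
import torch
import torch.multiprocessing as mp


def worker(buf, q):
    tensors = [torch.randn(64, 32), torch.randn(64)]  # e.g. a batch's fields
    offset, meta = 0, []
    for t in tensors:
        n = t.numel()
        buf[offset:offset + n].copy_(t.reshape(-1))  # write into shared memory
        meta.append((offset, tuple(t.shape)))
        offset += n
    q.put(meta)  # metadata only; the payload is already in `buf`


if __name__ == "__main__":
    # share_memory_() makes the storage visible to child processes. In the
    # PR the buffers are additionally pinned (page-locked), which makes
    # host-to-GPU copies with non_blocking=True asynchronous.
    buf = torch.empty(2 ** 20).share_memory_()
    q = mp.Queue()
    p = mp.Process(target=worker, args=(buf, q))
    p.start()
    batch = [buf[off:off + math.prod(shape)].view(shape) for off, shape in q.get()]
    p.join()
    # Since the buffer is reused, a real implementation must ensure the
    # consumer is done reading (or has cloned) before the producer writes again.
```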

Some notes:

  • The effect is mostly felt when sampling (which is where most of the time is spent in the first place); sending Batch and GraphActionCategorical objects through shared buffers is what cuts wall time
  • Passing batches to the training loop (which are much bigger and rarer) doesn't show a significant speedup, but I've implemented it nonetheless for future-proofing

Other changes:

  • Removed local grad clipping, which is not quite correct; the difference is minimal but relevant, and removing it also gives a nice speedup
  • Made all algorithms inherit from GFNAlgorithm
  • global_cfg is set for all algorithms
  • cond_info is now folded into the batch object rather than being passed as an argument everywhere
  • Fixed GraphActionCategorical.entropy when masks are used; the gradients w.r.t. the logits would be NaN (see the sketch after this list)
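
For context on that last bullet, here is a hedged sketch of the pitfall (the PR's actual fix lives in GraphActionCategorical.entropy and may differ in detail): with masked logits set to -inf, the p * log p terms evaluate to 0 * -inf = NaN, and masking the result afterwards still leaves NaN gradients, so the log-probs have to be sanitized before the product.

```python
import torch

logits = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
mask = torch.tensor([True, True, False])
logp = logits.masked_fill(~mask, -float("inf")).log_softmax(-1)

# Naive entropy: the masked term is exp(-inf) * -inf = 0 * -inf = NaN,
# poisoning both the value and the gradients.
naive = -(logp.exp() * logp).sum()

# Safe entropy: zero out the masked log-probs *before* the product, so
# -inf never enters the terms being summed and gradients stay finite.
safe_logp = torch.where(mask, logp, torch.zeros_like(logp))
entropy = -(logp.exp() * safe_logp).sum()
entropy.backward()
assert torch.isfinite(logits.grad).all()
```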

Note: EnvelopeQL is still in a broken state; will fix in #127.

@bengioe (Collaborator, Author) commented Mar 1, 2024

I'm of a mind to merge this, actually. It's not the cleanest implementation possible, but there are significant gains here (as mentioned, a 30% speedup with the default settings on seh_frag.py). Will test across tasks and report back.

@bengioe force-pushed the bengioe-mp-with-batch-buffers branch from 648961f to acfe070 on March 11, 2024, 16:00
@bengioe (Collaborator, Author) commented Mar 11, 2024

Made significant simplifications to the method by subclassing Pickler/Unpickler, and found some very tricky bugs along the way (I was misusing pinned CUDA buffers and ended up with rare race conditions). The speedups remain (it might even be a bit faster). A sketch of the approach is below.
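
For the curious, the Pickler/Unpickler trick looks roughly like this (hypothetical class names and bodies; the PR's actual classes may differ): persistent_id intercepts tensors during pickling and diverts their payload into the shared buffer, while everything else pickles normally; persistent_load reconstructs views on the other side.

```python
import io
import math
import pickle
import torch


class BufferPickler(pickle.Pickler):
    """Pickles tensors as (offset, shape) references into a shared buffer."""

    def __init__(self, file, buf):
        super().__init__(file)
        self.buf, self.offset = buf, 0

    def persistent_id(self, obj):
        if not isinstance(obj, torch.Tensor):
            return None  # fall back to ordinary pickling
        n = obj.numel()
        self.buf[self.offset:self.offset + n].copy_(obj.reshape(-1))
        pid = ("tensor", self.offset, tuple(obj.shape))
        self.offset += n
        return pid


class BufferUnpickler(pickle.Unpickler):
    """Resolves (offset, shape) references back into views of the buffer."""

    def __init__(self, file, buf):
        super().__init__(file)
        self.buf = buf

    def persistent_load(self, pid):
        _, offset, shape = pid
        return self.buf[offset:offset + math.prod(shape)].view(shape)


buf = torch.empty(2 ** 20).share_memory_()
f = io.BytesIO()
BufferPickler(f, buf).dump({"x": torch.randn(8, 4), "note": "plain objects pass through"})
f.seek(0)
obj = BufferUnpickler(f, buf).load()
```

The rare race condition mentioned above is likely the classic pinned-memory pitfall: a `.to(device, non_blocking=True)` copy out of a pinned buffer returns before the transfer completes, so reusing or overwriting the buffer without first synchronizing the CUDA stream can silently corrupt in-flight data.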

@bengioe marked this pull request as ready for review on May 9, 2024, 14:55
@bengioe changed the title from "Proof of concept of using shared pinned buffers" to "Shared pinned buffers" on May 9, 2024
@bengioe requested a review from julienroyd on May 9, 2024, 15:05
@bengioe (Collaborator, Author) commented May 9, 2024

Merged with trunk + made a few fixes. Pretty happy with this now!
