Add pytorch profiler and additional tensorboard logs for GPU memory usage. #3607
Conversation
One clarifying question: is enabling the pytorch profiler the only way to get GPU utilization metrics? Feels like this should be pretty cheap to obtain and ideally something we always collect. I can understand why the more granular memory metrics would be expensive, but just want to try and find a way to decouple those things.
I tested logging torch.cuda.* data as tfevents on a Colab T4 to produce additional GPU memory utilization graphs in tensorboard. The additional logging doesn't slow down training. I've attached an image from fine-tuning llama-2-7b on codealpaca for 1K steps. I've updated the PR to keep the option to use the full torch profiler (off by default) and to always log the torch.cuda.* stats.
LGTM! I think the only thing to add is a logger.warning call at profiler initialization time to indicate that profiling is expensive and will slow down training.
Add two memory profiling mechanisms to Ludwig.
1. Official pytorch profiler
Usage
Add
enable_profiling: True
to the trainer config, as in the sketch below.
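For illustration, a trainer section with profiling enabled might look like the following; every field other than enable_profiling is a placeholder, not something introduced by this PR:

```yaml
trainer:
  epochs: 3
  batch_size: 8
  enable_profiling: True
```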
Notes
The profiler is off by default; enabling it is expensive and will slow down training.

2. Memory profiling using torch cuda memory stats
Notes
torch.cuda.* memory stats are logged as tfevents on every run; this is always enabled and does not noticeably slow down training.
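As a rough illustration of this kind of torch.cuda.* logging (a minimal sketch, not the actual code in this PR; the writer, log directory, and tag names are assumptions), something like the following writes these stats as tfevents that tensorboard can graph:

```python
import torch
from torch.utils.tensorboard import SummaryWriter

# Illustrative sketch only: log torch.cuda memory stats as scalars so they
# appear as graphs in tensorboard. Log dir and tag names are assumptions.
writer = SummaryWriter(log_dir="logs/gpu_memory")


def log_cuda_memory(step: int) -> None:
    if not torch.cuda.is_available():
        return
    writer.add_scalar("cuda/memory_allocated", torch.cuda.memory_allocated(), step)
    writer.add_scalar("cuda/max_memory_allocated", torch.cuda.max_memory_allocated(), step)
    writer.add_scalar("cuda/memory_reserved", torch.cuda.memory_reserved(), step)
    writer.add_scalar("cuda/max_memory_reserved", torch.cuda.max_memory_reserved(), step)


# e.g. call log_cuda_memory(step) once per training step
```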
Inspecting using tensorboard
Launch tensorboard:
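For example (the --logdir path is a placeholder; point it at the experiment's tensorboard logs directory):

```sh
tensorboard --logdir ./results
```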
In colab:
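In a Colab notebook the tensorboard extension and magic can be used instead (the log directory is again a placeholder):

```
%load_ext tensorboard
%tensorboard --logdir ./results
```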