Add pytorch profiler and additional tensorboard logs for GPU memory usage. #3607

Merged (8 commits, Sep 18, 2023)

Conversation

@justinxzhao (Contributor) commented Sep 14, 2023

Adds two memory profiling mechanisms to Ludwig.

1. Official pytorch profiler

[screenshot]

Usage

Add enable_profiling: True to the trainer config.

trainer:
  enable_profiling: True
  profiler:
    repeat: 5
    active: 3

Notes

  • Code is modeled off of this tutorial.
  • The full pytorch profiler can be rather heavyweight, so it is disabled by default.
  • There was some trial and error to determine default values for the schedule, which seem to work OK with basic models (see the sketch below).
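
For reference, here is a minimal sketch of the torch.profiler setup the tutorial describes, using a toy model and a placeholder output directory; it is not the exact Ludwig integration.

import torch
from torch import nn
from torch.profiler import ProfilerActivity, profile, schedule, tensorboard_trace_handler

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Linear(128, 2).to(device)  # toy model standing in for a real Ludwig model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

activities = [ProfilerActivity.CPU]
if torch.cuda.is_available():
    activities.append(ProfilerActivity.CUDA)

with profile(
    activities=activities,
    # Wait 1 step, warm up for 1 step, record `active` steps, repeat the cycle `repeat` times.
    schedule=schedule(wait=1, warmup=1, active=3, repeat=5),
    on_trace_ready=tensorboard_trace_handler("results/profiler"),  # traces tensorboard can render
    profile_memory=True,  # track tensor allocations/deallocations
    record_shapes=True,
) as prof:
    for step in range(25):
        x = torch.randn(32, 128, device=device)
        y = torch.randint(0, 2, (32,), device=device)
        loss_fn(model(x), y).backward()
        optimizer.step()
        optimizer.zero_grad()
        prof.step()  # advance the profiling schedule after each training step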

2. Memory profiling using torch.cuda memory stats

[screenshot]

Notes

  • Collecting torch.cuda memory data is lightweight, and the additional logging does not seem to negatively affect training time.
  • torch.cuda memory information is logged every training step under its own section in tensorboard called cuda (sketched below).
  • I'm unsure whether these logs will always be consistent with the official pytorch profiler. They may also be hard to compare, since the official pytorch profiler shows memory usage over a time window rather than on a per-step basis.
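
As a rough illustration (not Ludwig's exact code), per-step logging of torch.cuda memory stats to tensorboard can look like the following; the log directory and scalar tag names are assumptions.

import torch
from torch.utils.tensorboard import SummaryWriter

def log_cuda_memory(writer: SummaryWriter, step: int) -> None:
    # Read lightweight counters from the CUDA caching allocator and log them
    # under a dedicated "cuda" section in tensorboard.
    if not torch.cuda.is_available():
        return
    writer.add_scalar("cuda/memory_allocated", torch.cuda.memory_allocated(), step)
    writer.add_scalar("cuda/max_memory_allocated", torch.cuda.max_memory_allocated(), step)
    writer.add_scalar("cuda/memory_reserved", torch.cuda.memory_reserved(), step)
    writer.add_scalar("cuda/max_memory_reserved", torch.cuda.max_memory_reserved(), step)

writer = SummaryWriter(log_dir="results/cuda_example")  # placeholder log directory
for step in range(10):
    log_cuda_memory(writer, step)  # call once per training step
writer.close()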

Inspecting with tensorboard

Launch tensorboard:

tensorboard --logdir=results/

In colab:

!pip install tensorflow  # tensorflow must be installed for the tensorboard extension to load correctly.
%load_ext tensorboard
%tensorboard --logdir=results

@github-actions bot commented Sep 14, 2023

Unit Test Results

  6 files  ±0    6 suites  ±0   39m 48s ⏱️ + 1m 25s
31 tests ±0  26 ✔️ ±0    5 💤 ±0  0 ±0 
82 runs  ±0  66 ✔️ ±0  16 💤 ±0  0 ±0 

Results for commit 48139cf. ± Comparison against base commit 6198693.

♻️ This comment has been updated with latest results.

@arnavgarg1 arnavgarg1 self-requested a review September 14, 2023 15:30
@justinxzhao justinxzhao marked this pull request as ready for review September 14, 2023 16:34
@tgaddair (Collaborator) left a comment


One clarifying question: is enabling the pytorch profiler the only way to get GPU utilization metrics? Feels like this should be pretty cheap to obtain and ideally something we always collect. I can understand why the more granular memory metrics would be expensive, but just want to try and find a way to decouple those things.

@justinxzhao justinxzhao changed the title Add pytorch profiler on tensorboard. Add pytorch profiler and additional tensorboard logs for GPU memory usage. Sep 15, 2023
@justinxzhao (Contributor, Author) replied:

> One clarifying question: is enabling the pytorch profiler the only way to get GPU utilization metrics? Feels like this should be pretty cheap to obtain and ideally something we always collect. I can understand why the more granular memory metrics would be expensive, but just want to try and find a way to decouple those things.

I tested logging torch.cuda.* data as tfevents on a colab T4 to produce additional GPU memory utilization graphs in tensorboard. The additional logging doesn’t slow down training.

I've attached an image from fine-tuning llama-2-7b on codealpaca for 1K steps.

I've updated the PR to keep the option to use the full torch profiler (off by default) and to always log the torch.cuda.* memory stats.

@arnavgarg1 (Contributor) left a comment


LGTM! I think the only thing to add is a logger.warning call at profiler initialization time to indicate that profiling is expensive and will slow down training.
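
For illustration, the suggested warning could be as simple as the following; the config attribute name is an assumption that mirrors the trainer config above.

import logging

logger = logging.getLogger(__name__)

# At profiler initialization time:
if trainer_config.enable_profiling:  # assumed flag, matching `enable_profiling` in the trainer config
    logger.warning(
        "Torch profiling is enabled. Profiling is expensive and will slow down training."
    )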

@justinxzhao justinxzhao merged commit e4fc68a into master Sep 18, 2023
@justinxzhao justinxzhao deleted the profiler branch September 18, 2023 22:49
Infernaught pushed a commit to Infernaught/nightlyfix that referenced this pull request Sep 19, 2023