Add pytorch profiler and additional tensorboard logs for GPU memory usage. #3607

Merged (8 commits, Sep 18, 2023)

Conversation

@justinxzhao (Contributor) commented Sep 14, 2023

Adds two memory profiling mechanisms to Ludwig.

1. Official pytorch profiler

[screenshot]

Usage

Add enable_profiling: True to the trainer config.

trainer:
  enable_profiling: True
  profiler:
    repeat: 5
    active: 3

Notes

  • Code is modeled off of this tutorial.
  • The full pytorch profiler can be rather heavyweight, so it is disabled by default.
  • There was some trial and error to determine default values for the schedule, which seem to work OK with basic models (see the sketch below).
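
For reference, here is a minimal sketch of the torch.profiler setup the tutorial describes, using a toy model and a placeholder output directory; it is not the exact Ludwig integration.

import torch
from torch import nn
from torch.profiler import ProfilerActivity, profile, schedule, tensorboard_trace_handler

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Linear(128, 2).to(device)  # toy model standing in for a real Ludwig model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

activities = [ProfilerActivity.CPU]
if torch.cuda.is_available():
    activities.append(ProfilerActivity.CUDA)

with profile(
    activities=activities,
    # Wait 1 step, warm up for 1 step, record `active` steps, repeat the cycle `repeat` times.
    schedule=schedule(wait=1, warmup=1, active=3, repeat=5),
    on_trace_ready=tensorboard_trace_handler("results/profiler"),  # traces tensorboard can render
    profile_memory=True,  # track tensor allocations/deallocations
    record_shapes=True,
) as prof:
    for step in range(25):
        x = torch.randn(32, 128, device=device)
        y = torch.randint(0, 2, (32,), device=device)
        loss_fn(model(x), y).backward()
        optimizer.step()
        optimizer.zero_grad()
        prof.step()  # advance the profiling schedule after each training step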

2. Memory profiling using torch.cuda memory stats

[screenshot]

Notes

  • Collecting torch.cuda memory data is lightweight, and the additional logging does not seem to negatively affect training time.
  • torch.cuda memory information is logged every training step under its own section in tensorboard called cuda (sketched below).
  • I'm unsure whether these logs will always be consistent with the official pytorch profiler. They may also be hard to compare, since the official pytorch profiler shows memory usage over a time window rather than on a per-step basis.
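
As a rough illustration (not Ludwig's exact code), per-step logging of torch.cuda memory stats to tensorboard can look like the following; the log directory and scalar tag names are assumptions.

import torch
from torch.utils.tensorboard import SummaryWriter

def log_cuda_memory(writer: SummaryWriter, step: int) -> None:
    # Read lightweight counters from the CUDA caching allocator and log them
    # under a dedicated "cuda" section in tensorboard.
    if not torch.cuda.is_available():
        return
    writer.add_scalar("cuda/memory_allocated", torch.cuda.memory_allocated(), step)
    writer.add_scalar("cuda/max_memory_allocated", torch.cuda.max_memory_allocated(), step)
    writer.add_scalar("cuda/memory_reserved", torch.cuda.memory_reserved(), step)
    writer.add_scalar("cuda/max_memory_reserved", torch.cuda.max_memory_reserved(), step)

writer = SummaryWriter(log_dir="results/cuda_example")  # placeholder log directory
for step in range(10):
    log_cuda_memory(writer, step)  # call once per training step
writer.close()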

Inspecting with tensorboard

Launch tensorboard:

tensorboard --logdir=results/

In colab:

!pip install tensorflow  # tensorflow must be installed for the tensorboard extension to load correctly.
%load_ext tensorboard
%tensorboard --logdir=results

@github-actions bot commented Sep 14, 2023

Unit Test Results

  6 files  ±0    6 suites  ±0   39m 48s ⏱️ + 1m 25s
31 tests ±0  26 ✔️ ±0    5 💤 ±0  0 ±0 
82 runs  ±0  66 ✔️ ±0  16 💤 ±0  0 ±0 

Results for commit 48139cf. ± Comparison against base commit 6198693.

♻️ This comment has been updated with latest results.

@arnavgarg1 arnavgarg1 self-requested a review September 14, 2023 15:30
@justinxzhao justinxzhao marked this pull request as ready for review September 14, 2023 16:34
@tgaddair (Collaborator) left a comment


One clarifying question: is enabling the pytorch profiler the only way to get GPU utilization metrics? Feels like this should be pretty cheap to obtain and ideally something we always collect. I can understand why the more granular memory metrics would be expensive, but just want to try and find a way to decouple those things.

@justinxzhao justinxzhao changed the title Add pytorch profiler on tensorboard. Add pytorch profiler and additional tensorboard logs for GPU memory usage. Sep 15, 2023
@justinxzhao (Contributor, Author) replied:

> One clarifying question: is enabling the pytorch profiler the only way to get GPU utilization metrics? Feels like this should be pretty cheap to obtain and ideally something we always collect. I can understand why the more granular memory metrics would be expensive, but just want to try and find a way to decouple those things.

I tested logging torch.cuda.* data as tfevents on a colab T4 to produce additional GPU memory utilization graphs in tensorboard. The additional logging doesn’t slow down training.

I've attached an image from fine-tuning llama-2-7b on codealpaca for 1K steps.

I've updated the PR to keep the option to use the full torch profiler (off by default) and to always log the torch.cuda.* memory stats.

@arnavgarg1 (Contributor) left a comment


LGTM! I think the only thing to add is a logger.warning call at profiler initialization time to indicate that profiling is expensive and will slow down training.
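
For illustration, the suggested warning could be as simple as the following; the config attribute name is an assumption that mirrors the trainer config above.

import logging

logger = logging.getLogger(__name__)

# At profiler initialization time:
if trainer_config.enable_profiling:  # assumed flag, matching `enable_profiling` in the trainer config
    logger.warning(
        "Torch profiling is enabled. Profiling is expensive and will slow down training."
    )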

@justinxzhao justinxzhao merged commit e4fc68a into master Sep 18, 2023
@justinxzhao justinxzhao deleted the profiler branch September 18, 2023 22:49
Infernaught pushed a commit to Infernaught/nightlyfix that referenced this pull request Sep 19, 2023