This repository has been archived by the owner on Dec 20, 2024. It is now read-only.

Mlflow benchmark profiler update #38

Merged
merged 55 commits into from
Nov 4, 2024

Conversation

anaprietonem
Contributor

@anaprietonem anaprietonem commented Aug 23, 2024

Describe your changes
PR to update the benchmark profiler to be able to:

  • Run and track results with mlflow (including a summary of system metrics)
  • Add the training and validation rates as metrics monitored by mlflow
  • Replace the memray-based memory profiler with the torch profiler
  • Include an option to generate a model summary using torchinfo (already a listed dependency)

Please also include relevant motivation and context.
The previous version of the benchmark profiler (BP) did not allow configuring exactly which reports are generated and did not make use of the torch profiler. This PR addresses both issues: one can now choose which reports to generate, the PyTorch profiler is used for a more in-depth breakdown of GPU/CPU memory usage, and there is a new option to generate a model summary.

List any dependencies that are required for this change.
When generating the memory report, it is possible to perform a Holistic Trace Analysis (HTA), but this requires installing HTA (https://hta.readthedocs.io/en/latest/source/intro/installation.html), so it is now included as an extra for the profiler in setup.py.

Type of change
Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • * This change requires a documentation update

Issue ticket number and link
We did not create a ticket when we first started working on this task. Will do next time, apologies!

Checklist before requesting a review

  • I have performed a self-review of my code
  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have updated the documentation and docstrings to reflect the changes
  • I have added tests that prove my fix is effective or that my feature works
  • I have ensured that the code is still pip-installable after the changes and runs
  • I have not introduced new dependencies in the inference portion of the model
  • I have run this on single GPU
  • I have run this on multi-GPU or multi-node
  • I have run this to work on LUMI (or made sure the changes work independently.)
  • I have run the Benchmark Profiler against the old version of the code

* The Confluence page still needs to be updated, but docstrings and comments are up to date.

Tag possible reviewers
You can @-tag people to review this PR in addition to formal review requests.
@cathalobrien @mchantry @JesperDramsch

[PR Migrated from aifs-mono to anemoi-training]


📚 Documentation preview 📚: https://anemoi-training--38.org.readthedocs.build/en/38/

@cathalobrien
Contributor

Hi, I tested this branch and the various profilers all worked as expected. Nice work!

At first, some of the memory profiler output can be confusing: it shows negative numbers, as well as positive numbers greater than the maximum device memory. It turns out this is because the profiler tracks deallocations and aggregates allocations across the entire program runtime. Maybe we could add a note above the memory profiler output explaining this. Or it may be better to link to the profiler documentation in stdout, since I understand there are more options that can be enabled if the user digs into the code.

| Name | ... | CUDA Mem | Self CUDA Mem | # of Calls | ...
| autograd::engine::evaluate_function: AddmmBackward0 | ... | 72.27 Gb | -44.86 Gb | 340 | ...
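
To make the point above concrete, here is a minimal pure-Python sketch of how summing per-call net memory over many calls can give both a negative "self" figure and a total larger than the device capacity. It does not call the torch profiler. The `OpCall` record, the `aggregate` helper, the 40 GiB capacity, and the per-call byte counts are all hypothetical values chosen for illustration, and the way "self" and "total" are combined is an assumption about how such a table is built, not a description of the profiler's internals.

```python
from dataclasses import dataclass

GIB = 1024**3
DEVICE_CAPACITY = 40 * GIB  # hypothetical device size, for comparison only


@dataclass
class OpCall:
    """One invocation of a profiled operator (hypothetical record)."""

    self_alloc_bytes: int  # bytes allocated directly by this op
    self_freed_bytes: int  # bytes released directly by this op
    child_net_bytes: int   # net bytes (allocated minus freed) in child ops


def aggregate(calls):
    """Sum net memory over all calls, as an aggregated report row would.

    Returns (total_mem, self_mem), where self_mem counts only the op's own
    allocations and frees, and total_mem also includes its children.
    """
    self_mem = sum(c.self_alloc_bytes - c.self_freed_bytes for c in calls)
    total_mem = self_mem + sum(c.child_net_bytes for c in calls)
    return total_mem, self_mem


# 340 calls: each call frees 0.25 GiB itself while its children
# allocate a net 0.5 GiB (all numbers invented for this sketch).
calls = [
    OpCall(self_alloc_bytes=0, self_freed_bytes=GIB // 4, child_net_bytes=GIB // 2)
    for _ in range(340)
]

total_mem, self_mem = aggregate(calls)
print(f"total: {total_mem / GIB:.2f} GiB, self: {self_mem / GIB:.2f} GiB")
print("total exceeds device capacity:", total_mem > DEVICE_CAPACITY)
```

Neither aggregated number is a peak: at no point in this toy run is more than a fraction of a GiB live per call, yet the summed total is well above the assumed capacity and the summed self value is negative.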

@anaprietonem anaprietonem added the enhancement New feature or request label Sep 13, 2024
mchantry
mchantry previously approved these changes Oct 25, 2024
Member

@mchantry mchantry left a comment

LGTM thanks very much.

Member

@gmertes gmertes left a comment

Very nice! Just a minor comment about the class inheritance.

pyproject.toml Show resolved Hide resolved
src/anemoi/training/commands/profiler.py Outdated Show resolved Hide resolved
@anaprietonem anaprietonem requested a review from gmertes November 4, 2024 09:02
@anaprietonem anaprietonem merged commit 009301a into develop Nov 4, 2024
116 checks passed
@anaprietonem anaprietonem deleted the mlflow_benchmark_profiler_update branch November 4, 2024 21:04
6 participants