This repository has been archived by the owner on Dec 20, 2024. It is now read-only.

Mlflow benchmark profiler update #38

Merged
merged 55 commits into from
Nov 4, 2024
55 commits
6507424
fix: saving frequency bug for inference checkpoints
anaprietonem Aug 20, 2024
7d2d620
Merge branch 'develop' into 257-bug-inference-checkpoints-saving-freq…
anaprietonem Aug 20, 2024
0027046
chore: update CHANGELOG
anaprietonem Aug 20, 2024
8cf698b
feat: add anemoi profiler with mlflow compatibility
anaprietonem Aug 20, 2024
d647bf9
fix: format error
anaprietonem Aug 20, 2024
352cd29
fix: removed atos path from noteook and fixed update_paths function
anaprietonem Aug 23, 2024
c7ab208
add hta functionality in documentation
anaprietonem Oct 7, 2024
ebe33bd
updating docs for profiler
anaprietonem Oct 7, 2024
9c67f3e
update profiler docs
anaprietonem Oct 7, 2024
2bcf957
update profiler docs
anaprietonem Oct 7, 2024
2e6a168
update profiler docs
anaprietonem Oct 7, 2024
29232ce
update profiler docs
anaprietonem Oct 7, 2024
c646e38
update profiler docs
anaprietonem Oct 7, 2024
4d9610b
update profiler docs
anaprietonem Oct 7, 2024
0a4070c
update profiler docs
anaprietonem Oct 7, 2024
45e7a7b
update profiler docs
anaprietonem Oct 7, 2024
3cea9d9
update profiler docs
anaprietonem Oct 7, 2024
3c2f2d9
update profiler docs
anaprietonem Oct 7, 2024
b8fcf99
update profiler docs
anaprietonem Oct 7, 2024
80e5522
Merge branch 'develop' into mlflow_benchmark_profiler_update
anaprietonem Oct 7, 2024
990aea9
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Oct 7, 2024
5aeeca4
fixing pre-commits on docs
anaprietonem Oct 7, 2024
b85eac2
fix pre-commit docs
anaprietonem Oct 7, 2024
ef54ffb
fix pre-commit docs
anaprietonem Oct 7, 2024
56e222f
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Oct 7, 2024
4aa225a
minor updates
anaprietonem Oct 7, 2024
81b57d8
Merge branch 'mlflow_benchmark_profiler_update' of github.com:ecmwf/a…
anaprietonem Oct 7, 2024
86e58ba
added docs for anemoi profiler
anaprietonem Oct 7, 2024
e943782
add section about profiling in overview
anaprietonem Oct 8, 2024
e177bd6
add section about profiling in overview
anaprietonem Oct 8, 2024
328ca19
add comment to avoid confussion with profiler for troubleshooting
anaprietonem Oct 8, 2024
702287e
added note about limit batches
anaprietonem Oct 9, 2024
36dc645
Merge branch 'develop' into mlflow_benchmark_profiler_update
anaprietonem Oct 24, 2024
a7280ab
updated changelog
anaprietonem Oct 24, 2024
05289e4
making sure anemoi-training profiler commands works in interactive gp…
anaprietonem Oct 25, 2024
df76686
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Oct 25, 2024
977c3e4
update docs
anaprietonem Oct 25, 2024
d71d7c1
Merge branch 'mlflow_benchmark_profiler_update' of github.com:ecmwf/a…
anaprietonem Oct 25, 2024
442dd9a
removed comment based on refactor callbacks PR
anaprietonem Oct 25, 2024
60368ae
adapted patchedProfile to not break HTA
anaprietonem Oct 25, 2024
9c50023
avoid code duplication in commands and fix copyright notice
anaprietonem Oct 25, 2024
34494b8
refactor to have base command
anaprietonem Oct 28, 2024
a279ac0
consistency with train command and docs update
anaprietonem Oct 28, 2024
c00e477
added abstractmethod
anaprietonem Oct 28, 2024
8310190
can't have profile as command, due to ruff conflicts
anaprietonem Oct 29, 2024
a7898d5
missing file from renaming
anaprietonem Oct 29, 2024
a8cb7bb
Merge branch 'develop' into mlflow_benchmark_profiler_update
anaprietonem Oct 30, 2024
231eb7f
updated profiler with develop and new callbacks structure
anaprietonem Oct 30, 2024
9596132
specific command property
anaprietonem Oct 31, 2024
afa6d1a
docs update
anaprietonem Oct 31, 2024
bf4bc90
update property to command
anaprietonem Nov 1, 2024
e2a51eb
updating to new release of anemoi-utils
anaprietonem Nov 4, 2024
101ffe3
Merge branch 'develop' into mlflow_benchmark_profiler_update
anaprietonem Nov 4, 2024
f60fd66
correct changelog
anaprietonem Nov 4, 2024
814367e
fix command
anaprietonem Nov 4, 2024
5 changes: 5 additions & 0 deletions CHANGELOG.md
@@ -21,6 +21,8 @@ Keep it human-readable, your future self will thank you!
- Fix that applies the metric_ranges in the post-processed variable space [#116](https://github.com/ecmwf/anemoi-training/pull/116)
- Sub-hour datasets [#63](https://github.com/ecmwf/anemoi-training/pull/63)
- Add synchronisation workflow [#92](https://github.com/ecmwf/anemoi-training/pull/92)
- Feat: Anemoi Profiler compatible with mlflow and using Pytorch (Kineto) Profiler for memory report [38](https://github.com/ecmwf/anemoi-training/pull/38/)


### Changed
- Modified training configuration to support max_steps and tied lr iterations to max_steps by default [#67](https://github.com/ecmwf/anemoi-training/pull/67)
@@ -32,6 +34,8 @@ Keep it human-readable, your future self will thank you!

- Lock python version <3.13 [#107](https://github.com/ecmwf/anemoi-training/pull/107)



## [0.2.1 - Bugfix: resuming mlflow runs](https://github.com/ecmwf/anemoi-training/compare/0.2.0...0.2.1) - 2024-10-24

### Added
@@ -82,6 +86,7 @@ Keep it human-readable, your future self will thank you!
- Feature: `AnemoiMlflowClient`, an mlflow client with authentication support [#86](https://github.com/ecmwf/anemoi-training/pull/86)
- Long Rollout Plots


### Fixed

- Fix `TypeError` raised when trying to JSON serialise `datetime.timedelta` object - [#43](https://github.com/ecmwf/anemoi-training/pull/43)
Binary file added docs/images/profiler/anemoi_profiler_config.png
Binary file added docs/images/profiler/example_memory_report.png
Binary file added docs/images/profiler/example_memory_timeline.png
Binary file added docs/images/profiler/example_model_summary.png
Binary file added docs/images/profiler/example_model_summary_2.png
Binary file added docs/images/profiler/example_system_report.png
Binary file added docs/images/profiler/example_time_report.png
Binary file added docs/images/profiler/idle_time_breakdown.png
Binary file added docs/images/profiler/kernel_breakdown_dfs.png
Binary file added docs/images/profiler/kernel_breakdown_plots.png
Binary file added docs/images/profiler/memory_snapshot_output.png
Binary file added docs/images/profiler/temporal_breakdown.png
1 change: 1 addition & 0 deletions docs/index.rst
@@ -43,6 +43,7 @@ This package provides the *Anemoi* training functionality.
user-guide/training
user-guide/models
user-guide/tracking
user-guide/benchmarking
user-guide/distributed
user-guide/debugging

12 changes: 12 additions & 0 deletions docs/overview.rst
@@ -91,6 +91,18 @@ and resolve issues during the training process, including:
- Debug configurations for quick error identification
- Guidance on isolating and addressing common problems

8. Benchmarking and HPC Profiling
=================================

Anemoi Training offers tools and configurations to support benchmarking
and High-Performance Computing (HPC) profiling, allowing users to
optimize training performance. This includes:

- Benchmarking configurations for evaluating training efficiency across
  different hardware setups.
- Profiling tools for monitoring resource utilization (CPU, GPU,
  memory) and identifying performance bottlenecks.

**************************
Components and Structure
**************************