Multistep training with batch_size >=1 per GPU #139

jsschreck · 2024-12-24T00:36:26Z

No description provided.

dkimpara · 2024-12-25T17:40:39Z

can you add some details? is this going to replace all trainers/datasets or just the multistep ones?

jsschreck · 2024-12-25T17:44:50Z

> can you add some details? is this going to replace all trainers/datasets or just the multistep ones?

I just added the single-step dataset. I am aiming to consolidate down to a single trainer, with config options to specify what you need (single- or multi-step). I'm working on that and should have it ready by the end of the week. So far, the train_multistep.py in this PR has been tested with the new multi-step datasets. I will update the documentation and so on once it's all working.
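
For illustration only, here is a minimal sketch of how a single trainer could switch between single- and multi-step training from config options. Nothing below is taken from this PR: the `trainer`, `batch_size`, and `forecast_steps` keys and the `resolve_trainer_settings` helper are hypothetical names chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class TrainerSettings:
    """Hypothetical container for the options one trainer would need."""
    mode: str            # "single" or "multi"
    batch_size: int      # per-GPU batch size
    forecast_steps: int  # number of forecast steps per training sample


def resolve_trainer_settings(conf: dict) -> TrainerSettings:
    """Derive the training mode from a config dict (assumed layout)."""
    trainer = conf.get("trainer", {})
    batch_size = int(trainer.get("batch_size", 1))
    forecast_steps = int(trainer.get("forecast_steps", 1))

    if batch_size < 1:
        raise ValueError("batch_size must be >= 1 per GPU")
    if forecast_steps < 1:
        raise ValueError("forecast_steps must be >= 1")

    # More than one forecast step means multi-step training.
    mode = "multi" if forecast_steps > 1 else "single"
    return TrainerSettings(mode, batch_size, forecast_steps)


settings = resolve_trainer_settings(
    {"trainer": {"batch_size": 4, "forecast_steps": 3}}
)
print(settings.mode)
```

With this layout, one training script can read the mode from the config instead of each mode needing its own script.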

jsschreck · 2024-12-28T17:12:25Z

@kanz76 I think the bug involving batch size and history len is now corrected.

kanz76

I tested

era5_multistep_batcher.py (except MultiprocessingBatcherPrefetch)
load_dataset_and_dataloader.py
train_universal.py

They all look good!

jsschreck added 2 commits December 23, 2024 15:03

Initial commit of loading sequence for the new datasets and dataloaders with correct settings

ae5bf40

Updated train_multi + bug fixes

8755d28

kanz76 self-requested a review December 24, 2024 03:50

jsschreck added 2 commits December 24, 2024 12:18

Bug updates post multi-step training tests

8232708

Still working out daemon issues main vs imported

428cc1d

jsschreck requested a review from dkimpara December 25, 2024 15:37

jsschreck added 3 commits December 26, 2024 13:55

Added and tested single-step within the new scheme; added train_universal.py; deprecation of many scripts coming soon

f59ff3c

Fixed tqdm bug and tested this trainer against grad-accum for single step

d50849e

Adding (deprecated) singlestep dataset to datasets directory

6b6093a

kanz76 reviewed Dec 27, 2024

credit/datasets/load_dataset_and_dataloader.py (outdated, resolved)

jsschreck added 2 commits December 27, 2024 09:52

Cleaning up redundant method calls, adding logging details

cb3d266

Updating logging messages for edge cases

fd62daa

kanz76 reviewed Dec 27, 2024

credit/datasets/load_dataset_and_dataloader.py (outdated, resolved)

jsschreck added 2 commits December 27, 2024 15:42

Fixed import error

9113420

Fixed the batch size * history len bug

3c181c0

jsschreck requested a review from djgagne December 28, 2024 17:12

Fixed bug in MultiprocessingBatcher with indices assigned to workers

cf8427f

kanz76 self-requested a review December 29, 2024 19:00

jsschreck added 4 commits December 29, 2024 13:05

Removed prefetch option from ERA5_MultiStep_Batcher dataloader b/c of len conflict

e83484d

Added pseudo-sampler to enable prefetch with ERA5_MultiStep_Batcher, docstrings

6e80b86

Linting

08420fb

Fixed a few bugs related to dataset len and batches per epoch

c85c987

kanz76 approved these changes Dec 31, 2024

jsschreck added 2 commits December 31, 2024 11:39

Added universal key for the trainer to use grad_accum

4721f30

Added example config for version 2.0 which will support the changes in the PR

c37eafb

kanz76 self-requested a review December 31, 2024 21:59

kanz76 approved these changes Dec 31, 2024

Final update of the config before merging

0be3d9a

jsschreck merged commit 082fe64 into main Jan 1, 2025
1 check passed
