Ensemble capability for single-step training #137

Open · wants to merge 2 commits into base: main

Conversation

@dkimpara (Collaborator) commented Dec 20, 2024

  • supports batch_size > 1
  • currently only KCRPS (unbiased CRPS) is available (a short sketch of the estimator follows this list)
  • train and validation use the same loss in this mode
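
For reference, the KCRPS referred to above is the fair (unbiased) kernel CRPS estimator, roughly E|X - y| - 0.5 E|X - X'|, with the pairwise term normalized by m * (m - 1) rather than m^2. A minimal, self-contained sketch for a single observation (the function name and shapes are illustrative only, not the actual KCRPSLoss API):

import torch

def fair_kcrps(ens: torch.Tensor, obs: torch.Tensor) -> torch.Tensor:
    # ens: (m, ...) ensemble members; obs: (...) a single observation
    m = ens.shape[0]
    # skill term: mean absolute error of the members against the observation
    skill = (ens - obs).abs().mean(dim=0)
    # spread term: mean pairwise distance between members, with the
    # unbiased 1 / (m * (m - 1)) normalization
    spread = (ens.unsqueeze(0) - ens.unsqueeze(1)).abs().sum(dim=(0, 1)) / (m * (m - 1))
    # average the pointwise CRPS over all remaining dimensions
    return (skill - 0.5 * spread).mean()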

new config field:
trainer.ensemble_size (int), defaults to 1 (also set by the parser)

high-level code design (a rough end-to-end sketch follows this list):

  • era5trainer_v2: torch.repeat_interleave copies samples to transform batches from (b, ...) to (b * ensemble_size, ...); this expanded tensor is then passed into the model
  • loss.py, KCRPSLoss: torch.vmap vectorizes and lifts the single-observation modulus loss to handle ensembles with multiple target observations (batch_size > 1)
  • metrics.py: computes the ensemble mean, which the metrics are then evaluated on
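
The sketch below strings these three steps together (conf keys and tensor shapes follow the snippets in this PR; ensemble_step, model, and crps_loss are stand-ins, not the actual class or function names):

import torch

def ensemble_step(conf, model, crps_loss, x, y):
    # x: (b, c, t, lat, lon) inputs; y: (b, c, t, lat, lon) targets
    ensemble_size = conf["trainer"]["ensemble_size"]

    # trainer: expand the batch from (b, ...) to (b * ensemble_size, ...),
    # duplicating each sample ensemble_size times back to back
    x = torch.repeat_interleave(x, ensemble_size, 0)
    pred = model(x)  # (b * ensemble_size, c, t, lat, lon)

    # loss: KCRPSLoss receives b * ensemble_size predictions and b targets;
    # internally it regroups them per sample and uses torch.vmap to apply the
    # single-observation kernel CRPS across the batch
    loss = crps_loss(pred, y)

    # metrics: collapse the ensemble back to its mean so the existing
    # deterministic metrics can be reused unchanged
    pred_mean = pred.view(y.shape[0], ensemble_size, *y.shape[1:]).mean(dim=1)
    return loss, pred_mean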

Future features (left out of this PR to avoid cluttering the config):

  • different train/validation losses (including ensemble/not-ensemble pairings)
  • different ensemble sizes for train/validation

Testing (on casper)

applications/train.py -c config/test_cesm_ensemble.yml -l 1

@dkimpara marked this pull request as ready for review December 20, 2024 18:36
@jsschreck (Collaborator) commented:

@dkimpara regarding batch_size > 1: does this mean it will support the new datasets I added, or are you doing it yourself here?

@dkimpara (Collaborator, Author) commented Dec 25, 2024 via email

@jsschreck (Collaborator) commented:

@dkimpara trainerERA5_v2.py will be deprecated in my PR; we will only need trainerERA5_multistep_grad_accum.py going forward, so you should add the ensemble change to that script as well. Keep what you have b/c I am not going to remove _v2 just yet since other people are still using it.

@yingkaisha (Collaborator) left a comment:

Finished review. My request is that deterministic training should skip the CRPS code blocks.

# if samples in the batch are ordered (x,y,z) then the result tensor is (x, x, ..., y, y, ..., z,z ...)
# WARNING: needs to be used with a loss that can handle x with b * ensemble_size samples and y with b samples
x = torch.repeat_interleave(x, conf["trainer"]["ensemble_size"], 0)

@yingkaisha (Collaborator) commented Dec 29, 2024:

This place can have an if condition, so the deterministic training routine will not touch this part.

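One possible shape for that guard (illustrative only; the exact condition or flag is left to the author):

# only expand the batch when ensemble training is requested, so the
# deterministic training routine never touches this branch
if conf["trainer"]["ensemble_size"] > 1:
    x = torch.repeat_interleave(x, conf["trainer"]["ensemble_size"], 0)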

A collaborator commented:

if condition

# calculate ensemble mean, if ensemble_size=1, does nothing
pred = pred.view(y.shape[0], self.ensemble_size, *y.shape[1:]) #b, ensemble, c, t, lat, lon
pred = pred.mean(dim=1)

@yingkaisha (Collaborator) commented Dec 29, 2024:

if condition
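
A sketch of what that guard could look like in metrics.py (assuming self.ensemble_size defaults to 1 for deterministic runs):

# skip the reshape entirely for deterministic runs
if self.ensemble_size > 1:
    # regroup predictions as (b, ensemble, c, t, lat, lon) and average the members
    pred = pred.view(y.shape[0], self.ensemble_size, *y.shape[1:])
    pred = pred.mean(dim=1)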

A collaborator commented:

Yeah -- will also fail here.

@@ -710,6 +710,9 @@ def credit_main_parser(
"train_batch_size" in conf["trainer"]
), "Training set batch size ('train_batch_size') is missing from onf['trainer']"

if "ensemble_size" not in conf["trainer"]:
conf["trainer"]["ensemble_size"] = 1

@yingkaisha (Collaborator) commented Dec 29, 2024:

Can you have an option to skip the ensemble routine? For those who do not use ensemble training, going through the ensemble code can cause a slow-down.

@@ -28,11 +28,17 @@ def __init__(self, conf, predict_mode=False):
# DO NOT apply these weights during metrics computations, only on the loss during
self.w_var = None

self.ensemble_size = conf["trainer"]["ensemble_size"]

A collaborator commented:

If I do not use ensemble_size, this will fail here. Probably better to have something like an 'if "ensemble_size" in conf["trainer"]' conditional.
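
A sketch of the suggested fallback (illustrative; dict.get keeps deterministic configs working without the new key):

# fall back to a single "member" when the config does not define ensemble_size
self.ensemble_size = conf["trainer"].get("ensemble_size", 1)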
