🍱 Extra data and pre-batch shuffle on train datapipe #14
Conversation
More sample imagery datasets for training, added in https://huggingface.co/datasets/chabud-team/chabud-extra/commit/7da36fcb240ef39beed1f877acc837b98746f35b.
Randomizing the order of the chips before creating mini-batches, because train_eval.hdf5 contains all the non-zero labels while the california_*.hdf5 files contain only zero labels. The shuffling causes a roughly 2x slowdown, from ~1 s/it to ~2 s/it. Also cherry-picked a9b3b95 to use a buffer_size of -1 in the demux DataPipe.
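A minimal sketch of the idea, assuming the torchdata DataPipe API (the actual change is in the quoted diff below); the chip list, split rule, and sizes here are made up for illustration:

from torchdata.datapipes.iter import IterableWrapper

# Hypothetical stand-in for the real chip sources; the actual pipeline reads
# pre/post fire chips out of HDF5 files hosted on Hugging Face.
chips = [{"idx": i, "has_burn": i % 3 == 0} for i in range(20)]
dp = IterableWrapper(chips)

# demux with buffer_size=-1 (unlimited), as in the cherry-picked commit, so the
# splitter does not raise a BufferError while one branch waits on the other.
dp_train, dp_val = dp.demux(
    num_instances=2,
    classifier_fn=lambda chip: int(chip["idx"] % 5 == 0),  # made-up split rule
    buffer_size=-1,
)

# Shuffle individual chips *before* batching, so a mini-batch can mix
# non-zero-label chips (train_eval.hdf5) with all-zero-label ones (california_*).
dp_train = dp_train.shuffle(buffer_size=100).batch(batch_size=6)

for batch in dp_train:
    print([chip["idx"] for chip in batch])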
self.datapipe_train = (
    dp_train.map(fn=_pre_post_mask_tuple)
    dp_train.shuffle(buffer_size=100)
Default buffer size of 10000 was too slow (waited for minutes but the model never started training). @srmsoumya, could you try a few other variations of this buffer_size and see how performant the model is?
Sure, I was facing some errors with buffers and set it to -1 in my experiment; I will look at other options as well.
Looks good to me, feel free to merge.
chabud/datapipe.py
"https://huggingface.co/datasets/chabud-team/chabud-extra/resolve/main/california_0.hdf5", | ||
"https://huggingface.co/datasets/chabud-team/chabud-extra/resolve/main/california_1.hdf5", | ||
"https://huggingface.co/datasets/chabud-team/chabud-extra/resolve/main/california_2.hdf5", | ||
"https://huggingface.co/datasets/chabud-team/chabud-extra/resolve/main/california_3.hdf5", | ||
"https://huggingface.co/datasets/chabud-team/chabud-extra/resolve/main/california_4.hdf5", |
@weiji14 we can ignore the california_*.hdf5 files for now, as the dataset is currently imbalanced. We can add them back once we implement the mixup & cutmix augmentations.
Commented out the extra california_*.hdf5 data for now.
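For context on the mixup idea mentioned above, a generic sketch of what a mixup-style blend could look like for pre/post image pairs and burn masks, so the all-zero california_* chips would still contribute signal; this is purely illustrative (tensor shapes, alpha, and the soft-mask choice are assumptions, not this repo's implementation):

import torch

def mixup(pre_a, post_a, mask_a, pre_b, post_b, mask_b, alpha=0.2):
    # Blend two chips; the mixing weight is drawn from a Beta(alpha, alpha) distribution.
    lam = torch.distributions.Beta(alpha, alpha).sample()
    pre = lam * pre_a + (1 - lam) * pre_b
    post = lam * post_a + (1 - lam) * post_b
    mask = lam * mask_a.float() + (1 - lam) * mask_b.float()  # soft target
    return pre, post, mask

# Made-up tensors standing in for 12-band pre/post chips and binary burn masks.
pre_a, post_a = torch.rand(2, 12, 128, 128)
pre_b, post_b = torch.rand(2, 12, 128, 128)
mask_a = torch.randint(0, 2, (128, 128))
mask_b = torch.zeros(128, 128, dtype=torch.long)  # an all-zero california_* style label
pre, post, mask = mixup(pre_a, post_a, mask_a, pre_b, post_b, mask_b)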
What I am changing
How I did it
Use .shuffle instead of .in_batch_shuffle on the train datapipe.
How you can test it
Run python trainer.py fit --trainer.max_epochs=30 --data.batch_size=6 locally.
Related Issues
Note that the shuffling operation is slower than in-batch shuffling. There is a longer delay at the start as the image chips are added to the shuffle buffer, and each mini-batch now takes about 2x longer to process (one iteration used to take ~1s, now it takes ~2s).
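To illustrate the difference, a toy sketch assuming the torchdata API (not code from this repo): .shuffle mixes elements across a buffer before batches are formed, while .in_batch_shuffle only reorders elements within already-formed batches.

from torchdata.datapipes.iter import IterableWrapper

dp = IterableWrapper(range(12))

# Pre-batch shuffle: elements are mixed across the shuffle buffer before
# batching, so a batch can combine chips from anywhere in the stream. The
# buffer has to fill first, which is where the startup delay comes from.
pre_batch = dp.shuffle(buffer_size=100).batch(batch_size=4)

# In-batch shuffle: batches are formed from consecutive elements first and only
# the order *within* each batch is randomized - fast, but every batch keeps the
# same neighbouring chips together.
in_batch = dp.batch(batch_size=4).in_batch_shuffle()

print(list(pre_batch))  # e.g. [[7, 0, 9, 3], [1, 5, 11, 2], [4, 6, 8, 10]]
print(list(in_batch))   # e.g. [[2, 0, 3, 1], [6, 7, 4, 5], [9, 11, 8, 10]]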