Replies: 1 comment
-
Hi @JJumSSu. Re. 1, all shards are shuffled. Re. 2, the advantage here is that it allows us to save checkpoints more frequently (at fractions of an epoch) by setting --train-num-samples to a lower value. This is important for larger datasets.
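To make the arithmetic concrete, here is a back-of-the-envelope sketch in plain Python (not open_clip code; the dataset size, sample count, and epoch count are made-up numbers, and it assumes the default behaviour of one checkpoint per nominal epoch): with resampling, an "epoch" is simply --train-num-samples samples, so a smaller value means each checkpoint lands at a smaller fraction of a full pass over the data.

```python
# Illustrative sketch (not open_clip code); all numbers are hypothetical.
# With --dataset-resampled, an "epoch" is just --train-num-samples samples drawn
# from the resampled shard stream, and a checkpoint is typically saved per epoch.
dataset_size = 400_000_000       # total samples actually present in the shards
train_num_samples = 10_000_000   # hypothetical --train-num-samples
epochs = 32                      # hypothetical --epochs

samples_seen = train_num_samples * epochs
print(f"one checkpoint every {train_num_samples / dataset_size:.1%} of the dataset")
print(f"{epochs} checkpoints over {samples_seen / dataset_size:.2f} full passes")
```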
-
Hi, thank you for the amazing repo!
I'm currently trying to train a CLIP model on multiple datasets in webdataset format, and I have some questions about shuffling. As I understand it, --dataset-resampled samples the shards with replacement.

1. Does that mean some instances will be seen more than once during training while others are not seen at all? (See the toy sketch below for what I mean.)
2. If so, what is the advantage of using this parameter?

Thank you :)
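To clarify what I mean by "with replacement", here is a toy sketch in plain Python (not the actual open_clip/webdataset code): shards are drawn independently, so within a fixed budget some shards repeat while others are never drawn at all.

```python
import random
from collections import Counter

random.seed(0)
shards = [f"shard-{i:05d}.tar" for i in range(10)]

# Draw 10 shards *with* replacement: repeats are possible, and so are omissions.
drawn = random.choices(shards, k=10)
counts = Counter(drawn)

print("times each shard was drawn:", dict(counts))
print("never drawn this round:", [s for s in shards if s not in counts])
```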