Error when using train-data-upsampling-factors #632

Closed

humzaiqbal opened this issue Sep 14, 2023 · 2 comments

@humzaiqbal
Contributor

Hi! I'm trying to use multiple datasets when fine-tuning OpenCLIP, and I want to use --train-data-upsampling-factors while doing so. However, I ran into an issue when trying to get data["val"]:

    data["val"] = get_dataset_fn(args.val_data, args.dataset_type)(
  File "/home/ubuntu/research_nfs/humza/open_clip/src/training/data.py", line 357, in get_wds_dataset
    assert args.train_data_upsampling_factors is None,\
AssertionError: --train_data_upsampling_factors is only supported when sampling with replacement (with --dataset-resampled).
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 9027) of binary: /home/ubuntu/miniconda3/envs/rapids-22.06/bin/python3.9
Traceback (most recent call last):
  File "/home/ubuntu/miniconda3/envs/rapids-22.06/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/ubuntu/miniconda3/envs/rapids-22.06/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/home/ubuntu/miniconda3/envs/rapids-22.06/lib/python3.9/site-packages/torch/distributed/run.py", line 724, in main
    run(args)
  File "/home/ubuntu/miniconda3/envs/rapids-22.06/lib/python3.9/site-packages/torch/distributed/run.py", line 715, in run
    elastic_launch(
  File "/home/ubuntu/miniconda3/envs/rapids-22.06/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/ubuntu/miniconda3/envs/rapids-22.06/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
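
For reference, here's roughly what the relevant part of get_wds_dataset looks like (paraphrased from src/training/data.py, so exact names and line numbers may differ slightly):

# Paraphrased sketch of the logic in get_wds_dataset; not the verbatim source.
input_shards = args.train_data if is_train else args.val_data

# resampled is forced to False for the val split, regardless of flags
resampled = getattr(args, 'dataset_resampled', False) and is_train

if resampled:
    ...  # shards are sampled with replacement; upsampling factors apply here
else:
    # val always falls into this branch, so the assert fires even though
    # train_data_upsampling_factors is never consulted for val
    assert args.train_data_upsampling_factors is None, \
        "--train_data_upsampling_factors is only supported when sampling with replacement (with --dataset-resampled)."
    ...  # plain shard list without replacement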

Specifically, if you look at the get_wds_dataset function, you'll see that resampled is automatically set to False when we are looking at val rather than train, and if resampled is False the function complains when args.train_data_upsampling_factors is passed in, even though val never uses it. I think a better way to capture this would be to change the assert to something like:

if is_train and args.train_data_upsampling_factors is not None:
    assert resampled, "--train_data_upsampling_factors is only supported when sampling with replacement (with --dataset-resampled)."

This way the check happens properly: only when we actually want to use train_data_upsampling_factors do we verify that we're in the resampled case. Let me know if this makes sense or if I'm missing something. Assuming the former, I'm happy to put up a PR with the fix. Thanks!
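
In context, the reworked branch would look something like this (same paraphrased sketch as above, so details may differ from the actual file):

resampled = getattr(args, 'dataset_resampled', False) and is_train

# validate the flag only when it would actually be used, i.e. for train
if is_train and args.train_data_upsampling_factors is not None:
    assert resampled, \
        "--train_data_upsampling_factors is only supported when sampling with replacement (with --dataset-resampled)."

if resampled:
    ...  # sample shards with replacement, weighted by the upsampling factors
else:
    ...  # plain shard list; val now reaches this branch without tripping the assert

With that, requesting data["val"] while --train-data-upsampling-factors is set no longer asserts, but misusing the flag without --dataset-resampled on the train split is still caught.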

@rwightman
Collaborator

@gabrielilharco

@gabrielilharco
Collaborator

Thanks @humzaiqbal. Your fix looks good to me; can you open a PR? Thanks!
