Error when using train-data-upsampling-factors #632

Closed

humzaiqbal opened this issue Sep 14, 2023 · 2 comments

@humzaiqbal
Contributor

Hi! I'm trying to use multiple datasets when fine-tuning OpenCLIP, and I want to use --train-data-upsampling-factors while doing so. However, I ran into an issue when trying to get data["val"]:

    data["val"] = get_dataset_fn(args.val_data, args.dataset_type)(
  File "/home/ubuntu/research_nfs/humza/open_clip/src/training/data.py", line 357, in get_wds_dataset
    assert args.train_data_upsampling_factors is None,\
AssertionError: --train_data_upsampling_factors is only supported when sampling with replacement (with --dataset-resampled).
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 9027) of binary: /home/ubuntu/miniconda3/envs/rapids-22.06/bin/python3.9
Traceback (most recent call last):
  File "/home/ubuntu/miniconda3/envs/rapids-22.06/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/ubuntu/miniconda3/envs/rapids-22.06/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/home/ubuntu/miniconda3/envs/rapids-22.06/lib/python3.9/site-packages/torch/distributed/run.py", line 724, in main
    run(args)
  File "/home/ubuntu/miniconda3/envs/rapids-22.06/lib/python3.9/site-packages/torch/distributed/run.py", line 715, in run
    elastic_launch(
  File "/home/ubuntu/miniconda3/envs/rapids-22.06/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/ubuntu/miniconda3/envs/rapids-22.06/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
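
For reference, here's roughly what the relevant part of get_wds_dataset looks like (paraphrased from src/training/data.py, so exact names and line numbers may differ slightly):

# Paraphrased sketch of the logic in get_wds_dataset; not the verbatim source.
input_shards = args.train_data if is_train else args.val_data

# resampled is forced to False for the val split, regardless of flags
resampled = getattr(args, 'dataset_resampled', False) and is_train

if resampled:
    ...  # shards are sampled with replacement; upsampling factors apply here
else:
    # val always falls into this branch, so the assert fires even though
    # train_data_upsampling_factors is never consulted for val
    assert args.train_data_upsampling_factors is None, \
        "--train_data_upsampling_factors is only supported when sampling with replacement (with --dataset-resampled)."
    ...  # plain shard list without replacement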

Specifically, if you look at the get_wds_dataset function, you'll see that resampled is automatically set to False when we are looking at val rather than train, and if resampled is False the function complains when args.train_data_upsampling_factors is passed in, even though val never uses it. I think a better way to capture this would be to change the assert to something like:

if is_train and args.train_data_upsampling_factors is not None:
    assert resampled, "--train_data_upsampling_factors is only supported when sampling with replacement (with --dataset-resampled)."

This way the check happens properly: only when we actually want to use train_data_upsampling_factors do we verify that we're in the resampled case. Let me know if this makes sense or if I'm missing something. Assuming the former, I'm happy to put up a PR with the fix. Thanks!
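
In context, the reworked branch would look something like this (same paraphrased sketch as above, so details may differ from the actual file):

resampled = getattr(args, 'dataset_resampled', False) and is_train

# validate the flag only when it would actually be used, i.e. for train
if is_train and args.train_data_upsampling_factors is not None:
    assert resampled, \
        "--train_data_upsampling_factors is only supported when sampling with replacement (with --dataset-resampled)."

if resampled:
    ...  # sample shards with replacement, weighted by the upsampling factors
else:
    ...  # plain shard list; val now reaches this branch without tripping the assert

With that, requesting data["val"] while --train-data-upsampling-factors is set no longer asserts, but misusing the flag without --dataset-resampled on the train split is still caught.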

@rwightman
Collaborator

@gabrielilharco

@gabrielilharco
Collaborator

Thanks @humzaiqbal. Your fix looks good to me; can you open a PR? Thanks!
