This repository has been archived by the owner on Sep 1, 2024. It is now read-only.

A problem of training a new model #86

Open
TanYuChen1 opened this issue Mar 8, 2023 · 2 comments

Comments

@TanYuChen1

Hi,
I finished data preprocessing and feature extraction for the LRS3 dataset, but ran into a problem when pre-training a new model. My command is:

fairseq-hydra-train --config-dir /path/to/conf/pretrain/ --config-name base_lrs3_iter5.yaml \
task.data=/path/to/LRS3/30h_data/ task.label_dir=/path/to/km_labels \
model.label_rate=100 hydra.run.dir=/path/to/pretrain/ \
common.user_dir=`pwd`

/path/to/km_labels contains {train,valid}.km and dict.km.txt, which were generated by /av_hubert-new/avhubert/clustering/submit_cluster.py.

But when I run it, the following error occurs:

Traceback (most recent call last):
  File "/path/to/av_hubert/fairseq/fairseq_cli/hydra_train.py", line 45, in hydra_main
    distributed_utils.call_main(cfg, pre_main)
  File "/path/to/av_hubert/fairseq/fairseq/distributed/utils.py", line 369, in call_main
    main(cfg, **kwargs)
  File "/path/to/av_hubert/fairseq/fairseq_cli/train.py", line 138, in main
    trainer = Trainer(cfg, task, model, criterion, quantizer)
  File "/path/to/av_hubert/fairseq/fairseq/trainer.py", line 148, in __init__
    if self.data_parallel_rank == 0:
  File "/path/to/av_hubert/fairseq/fairseq/trainer.py", line 181, in data_parallel_rank
    return distributed_utils.get_data_parallel_rank()
  File "/path/to/av_hubert/fairseq/fairseq/distributed/utils.py", line 463, in get_data_parallel_rank
    return get_rank(get_data_parallel_group())
  File "/path/to/av_hubert/fairseq/fairseq/distributed/utils.py", line 405, in get_rank
    return dist.get_rank(group=group)
  File "/path/to/anaconda3/envs/avhubert/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 822, in get_rank
    default_pg = _get_default_group()
  File "/path/to/anaconda3/envs/avhubert/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 410, in _get_default_group
    raise RuntimeError(
RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.

I think the data may not have been loaded, because I saw the following output before the error.
[screenshot of the training log]

Could you give me some suggestions? Thank you!

@chevalierNoir
Contributor

Hi,

There seem to be two different issues. For the distributed error, your distributed configuration is probably not set correctly. The default config you are using assumes 32 GPUs across 4 nodes. You may need to change those settings to match your machine.
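
For example, to run on a single machine with one GPU, you could override the distributed settings on the command line. This is only a sketch: the override keys below come from fairseq's distributed_training config group, and the single-GPU values are an assumption about your setup.

fairseq-hydra-train --config-dir /path/to/conf/pretrain/ --config-name base_lrs3_iter5.yaml \
  task.data=/path/to/LRS3/30h_data/ task.label_dir=/path/to/km_labels \
  model.label_rate=100 hydra.run.dir=/path/to/pretrain/ \
  common.user_dir=`pwd` \
  distributed_training.distributed_world_size=1 \
  distributed_training.nprocs_per_node=1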

Also, from the screenshot it looks like only one data item was loaded. Please check that the tsv and km files contain enough data.
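
A quick way to sanity-check this, assuming the standard manifest layout (the first line of the tsv is the data root and each remaining line is one utterance, matched one-to-one with a line in the km file):

# count utterances in the manifest and the corresponding label lines
wc -l /path/to/LRS3/30h_data/train.tsv /path/to/km_labels/train.km
# the tsv should have one more line than the km file (the root-directory header)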

@JinChow

JinChow commented Jul 22, 2023


Hello Chen, I am running av_hubert and have some questions. Could you share your phone number or WeChat? I would like to chat with you. Thank you!
