This repository has been archived by the owner on Sep 1, 2024. It is now read-only.

A problem of training a new model #86

Open
TanYuChen1 opened this issue Mar 8, 2023 · 2 comments

Comments

@TanYuChen1

Hi,
I finished data preprocessing and feature extraction for the LRS3 dataset, but ran into a problem when pre-training a new model. My command is:

fairseq-hydra-train --config-dir /path/to/conf/pretrain/ --config-name base_lrs3_iter5.yaml \
task.data=/path/to/LRS3/30h_data/ task.label_dir=/path/to/km_labels \
model.label_rate=100 hydra.run.dir=/path/to/pretrain/ \
common.user_dir=`pwd`

/path/to/km_labels contains {train,valid}.km and dict.km.txt, which were generated by /av_hubert-new/avhubert/clustering/submit_cluster.py.

But when I run it, the following error occurs:

Traceback (most recent call last):
  File "/path/to/av_hubert/fairseq/fairseq_cli/hydra_train.py", line 45, in hydra_main
    distributed_utils.call_main(cfg, pre_main)
  File "/path/to/av_hubert/fairseq/fairseq/distributed/utils.py", line 369, in call_main
    main(cfg, **kwargs)
  File "/path/to/av_hubert/fairseq/fairseq_cli/train.py", line 138, in main
    trainer = Trainer(cfg, task, model, criterion, quantizer)
  File "/path/to/av_hubert/fairseq/fairseq/trainer.py", line 148, in __init__
    if self.data_parallel_rank == 0:
  File "/path/to/av_hubert/fairseq/fairseq/trainer.py", line 181, in data_parallel_rank
    return distributed_utils.get_data_parallel_rank()
  File "/path/to/av_hubert/fairseq/fairseq/distributed/utils.py", line 463, in get_data_parallel_rank
    return get_rank(get_data_parallel_group())
  File "/path/to/av_hubert/fairseq/fairseq/distributed/utils.py", line 405, in get_rank
    return dist.get_rank(group=group)
  File "/path/to/anaconda3/envs/avhubert/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 822, in get_rank
    default_pg = _get_default_group()
  File "/path/to/anaconda3/envs/avhubert/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 410, in _get_default_group
    raise RuntimeError(
RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.

I think the data may not have been loaded, because I saw the following output before the error.
[screenshot of the training log]

Could you give me some suggestions? Thank you!

@chevalierNoir
Contributor

Hi,

There seem to be two different issues. For the distributed error, your distributed configuration is probably not set correctly. The default config you are using assumes 32 GPUs across 4 nodes. You may need to change those settings to match your machine.
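
For example, to run on a single machine with one GPU, you could override the distributed settings on the command line. This is only a sketch: the override keys below come from fairseq's distributed_training config group, and the single-GPU values are an assumption about your setup.

fairseq-hydra-train --config-dir /path/to/conf/pretrain/ --config-name base_lrs3_iter5.yaml \
  task.data=/path/to/LRS3/30h_data/ task.label_dir=/path/to/km_labels \
  model.label_rate=100 hydra.run.dir=/path/to/pretrain/ \
  common.user_dir=`pwd` \
  distributed_training.distributed_world_size=1 \
  distributed_training.nprocs_per_node=1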

Also, from the screenshot it looks like only one data item was loaded. Please check that the tsv and km files contain enough data.
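
A quick way to sanity-check this, assuming the standard manifest layout (the first line of the tsv is the data root and each remaining line is one utterance, matched one-to-one with a line in the km file):

# count utterances in the manifest and the corresponding label lines
wc -l /path/to/LRS3/30h_data/train.tsv /path/to/km_labels/train.km
# the tsv should have one more line than the km file (the root-directory header)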

@JinChow

JinChow commented Jul 22, 2023


Hello Chen, I am running av_hubert and have some questions. Could you share your phone number or WeChat? I would like to chat with you. Thank you!
