
Fix torch multiprocessing error in gptneox conversion script #587

Open · wants to merge 1 commit into base: main
Conversation

rohithkrn
Contributor

Fixes:

  1. "Context has already been set" error from torch multiprocessing, similar to [BUG FIX] place multi-processing init to main method #443.
  2. I think the device_count is incorrectly set in the example script.

@lanking520
Contributor

@byshiue

@zhang-ge-hao
Contributor

> Fixes:
>
>   1. "Context has already been set" error from torch multiprocessing, similar to [BUG FIX] place multi-processing init to main method #443.
>   2. I think the device_count is incorrectly set in the example script.

Can you provide some buggy cases that the current script cannot handle?

@rohithkrn
Contributor Author

rohithkrn commented May 4, 2023

@AkiyamaYummy
Could you explain why this is needed: https://github.com/NVIDIA/FasterTransformer/blob/main/examples/pytorch/gptneox/utils/huggingface_gptneox_convert.py#L141?

The command I am using is:
python huggingface_gptneox_convert.py -i EleutherAI/gpt-neox-20b -o $PWD/artifacts -i_g 2 -m_n gptneox -weight_data_type fp16

It fails at the above line #L141 because that directory doesn't exist locally. I see that the gptneox_guide suggests cloning the model with git lfs clone https://huggingface.co/<MODEL_GROUP>/<MODEL_NAME>. Why is this needed here when it isn't required for other models? GPTNeoXForCausalLM.from_pretrained() will download the model from the HF hub anyway.

File "/workspace/fastertransformer/FasterTransformer/python/fastertransformer/examples/gptneox/huggingface_gptneox_convert.py", line 141, in split_and_convert
    huggingface_model_file_list = [__fn for __fn in os.listdir(args.in_file) if __fn.endswith(".bin")]
FileNotFoundError: [Errno 2] No such file or directory: 'EleutherAI/gpt-neox-20b'

To get around the error,

I added the check and ran the same command.

if os.path.exists(args.in_file):
    huggingface_model_file_list = [__fn for __fn in os.listdir(args.in_file) if __fn.endswith(".bin")]
else:
    huggingface_model_file_list = []
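The check above can be packaged as a small helper so the converter accepts either a local checkpoint directory or a Hub model ID (the helper name is mine, not part of the PR's diff):

```python
import os

def list_checkpoint_files(in_file):
    """Return the .bin shard names if in_file is a local directory.

    If no such directory exists, in_file is presumably a Hub model ID
    (e.g. "EleutherAI/gpt-neox-20b"); return an empty list and let
    GPTNeoXForCausalLM.from_pretrained() download the weights instead.
    """
    if os.path.isdir(in_file):
        # Sort for a deterministic shard order across filesystems
        return sorted(fn for fn in os.listdir(in_file) if fn.endswith(".bin"))
    return []
```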

It then runs into an error similar to #443:

Traceback (most recent call last):
  File "/workspace/fastertransformer/FasterTransformer/python/fastertransformer/examples/gptneox/huggingface_gptneox_convert.py", line 254, in <module>
    split_and_convert(args)
  File "/workspace/fastertransformer/FasterTransformer/python/fastertransformer/examples/gptneox/huggingface_gptneox_convert.py", line 149, in split_and_convert
    torch.multiprocessing.set_start_method("spawn")
  File "/usr/lib/python3.9/multiprocessing/context.py", line 243, in set_start_method
    raise RuntimeError('context has already been set')
RuntimeError: context has already been set

Therefore, I introduced the change in this PR to fix the errors.
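For reference, a guarded initialization in the spirit of #443's fix might look like the sketch below (my wording and helper name, not the PR's exact diff): the call moves under the main entry point and the start method is only set when no context has been chosen yet.

```python
import multiprocessing as mp  # torch.multiprocessing exposes the same API

def init_spawn_context():
    # set_start_method() raises RuntimeError("context has already been set")
    # on a second call, so only set it if no context exists yet.
    if mp.get_start_method(allow_none=True) is None:
        mp.set_start_method("spawn")

if __name__ == "__main__":
    init_spawn_context()  # safe even if a caller already picked a method
```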

@rohithkrn
Contributor Author

@byshiue @AkiyamaYummy

@zhang-ge-hao
Contributor

@rohithkrn

Even when using GPTNeoXForCausalLM.from_pretrained(), you can still use the current script: after the default download finishes, locate the model's storage path on disk and pass that directory to the script.

Of course, you could also make this script more convenient by handling this scenario directly.

PS: Some of us, myself included, prefer to manage Hugging Face's model files ourselves, because Hugging Face downloads models to its cache folder by default, and I don't want large files taking up a lot of disk space without being visible in my workspace.

@rohithkrn
Contributor Author

@AkiyamaYummy that makes sense; my change will not break that behavior.

@rohithkrn
Contributor Author

@byshiue @AkiyamaYummy reminder for review.
