RuntimeError: Socket Timeout #97
Getting the same error here.
Some of the other parameters need to be adjusted for a single GPU: --num-layers 4 --embedding-dim 4096. It got as far as "Initialize NCCLCommunicator: < pipeline_group_0 >; rank: 0", but I forgot to download the pretrained model (as per the training instructions), so it stopped there. Will post results once that step is complete. Cheers
Hi Darrin, thanks.
It won't train on my 12 GB GPU; it runs out of memory and requires more VRAM than I currently have.
@darrinh The fine-tuning script will most likely not work with 12 GB of VRAM. I'd recommend using LoRA for fine-tuning instead. Here's some sample code to get you started: https://github.com/togethercomputer/OpenChatKit/blob/ecfe4d5d9b5f4b1a533c4468cc1b7e1107b9a819/training/lora/redpajama-incite-chat-3b.py
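For readers wondering why LoRA helps here: instead of updating every weight, LoRA learns a low-rank update delta_W = B @ A per layer, which shrinks the optimizer/gradient memory dramatically. A minimal sketch of the parameter arithmetic, using a hypothetical 4096 x 4096 projection and rank 8 (illustrative shapes, not the linked script's actual config):

```python
# LoRA replaces a full d_out x d_in weight update with two low-rank
# factors: delta_W = B @ A, where B is (d_out x r) and A is (r x d_in).

def full_update_params(d_out: int, d_in: int) -> int:
    # Trainable values for a full fine-tune of one weight matrix.
    return d_out * d_in

def lora_update_params(d_out: int, d_in: int, r: int) -> int:
    # Trainable values for the two LoRA factors at rank r.
    return d_out * r + r * d_in

d = 4096  # hypothetical hidden size of one attention projection
full = full_update_params(d, d)        # 16,777,216 trainable values
lora = lora_update_params(d, d, r=8)   # 65,536 trainable values
print(f"reduction: {full // lora}x")   # prints "reduction: 256x"
```

Gradients and optimizer states only need to exist for the B and A factors, which is why a model that OOMs during full fine-tuning can fit on a 12 GB card with LoRA.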
Thanks @orangetin, it starts but quickly runs out of memory. Thanks for the link, will check it out.
@yxy123 The arguments provided are invalid. Change this line > |
@orangetin Got it, thanks very much, it worked. |
sh training/finetune_Pythia-Chat-Base-7B.sh
Namespace(use_cuda=True, cuda_id=0, cuda_num=1, debug_mem=True, dist_backend='cupy_nccl', dp_backend='nccl', dist_url='tcp://127.0.0.1:7033', world_size= train_data=['./glue_dataset/data/QQP/train.tsv'], valid_data=['./glue_dataset/data/QQP/test.tsv'], tokenizer_type='BertWordPieceLowerCase', vocab_file='', train_log_backend='print', project_name='together', batch_size=32, micro_batch_size=1, lr=1e-05, num_iters=10, fp16=True, loss_scale=0, initial_loss_slreduce', gradient_accumulate_step=1, model_name='/data/app/OpenChatKit/training/../pretrained/Pythia-6.9B-deduped/EleutherAI_pythia-6.9b-deduped/', toketype='gptneox', checkpoint_path='/data/app/OpenChatKit/training/../model_ckpts/Pythia-Chat-Base-7B', task_name='/data/app/OpenChatKit/training/../data/OI_checkpoint=True, seed=42, profiling='no-profiling', trace_postfix='default', evaluation_steps=0, evaluation_data=None, evaluation_num_batch=None, checkp
Traceback (most recent call last):
File "/data/app/OpenChatKit/training/dist_clm_train.py", line 358, in
main()
File "/data/app/OpenChatKit/training/dist_clm_train.py", line 275, in main
init_communicators(args)
File "/data/app/OpenChatKit/training/comm/comm_utils.py", line 85, in init_communicators
default_init(args)
File "/data/app/OpenChatKit/training/comm/comm_utils.py", line 81, in default_init
dist.init_process_group(backend='gloo', timeout=datetime.timedelta(seconds=5*60), init_method=args.dist_url, world_size=args.world_size, rank=args.rank)
File "/data/anaconda3/envs/OpenChatKit/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 761, in init_process_group
default_pg = _new_process_group_helper(
File "/data/anaconda3/envs/OpenChatKit/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 862, in _new_process_group_helper
pg = ProcessGroupGloo(prefix_store, group_rank, group_size, timeout=timeout)
RuntimeError: Socket Timeout
This error is reported when running with a single GPU.
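For context on the traceback above: `dist.init_process_group` performs a TCP rendezvous that blocks until `world_size` ranks have connected to `dist_url`. If the launch script sets `world_size` for a multi-GPU run but only one process is actually started, rank 0 waits for peers that never arrive until the 5-minute timeout fires. A stdlib sketch of that failure mode (the `wait_for_peers` helper is hypothetical, not OpenChatKit code):

```python
import socket

def wait_for_peers(expected_peers: int, timeout_s: float) -> int:
    # Stand-in for the rendezvous: listen on a local port and accept
    # connections until `expected_peers` have joined or we time out.
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind(("127.0.0.1", 0))
    srv.listen()
    srv.settimeout(timeout_s)
    joined = 0
    try:
        while joined < expected_peers:
            srv.accept()
            joined += 1
        return joined
    except socket.timeout:
        # Mirrors the RuntimeError in the traceback above.
        raise RuntimeError("Socket Timeout")
    finally:
        srv.close()

# With more expected peers than launched processes, no peer ever
# connects, so the rendezvous times out:
try:
    wait_for_peers(expected_peers=1, timeout_s=0.5)
except RuntimeError as e:
    print(e)  # prints "Socket Timeout"
```

The practical fix for a single-GPU run is to make sure the script's world size matches the number of processes actually launched (one), rather than the multi-GPU default.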