Training procedure unclear #1
Comments
In case someone is in the same situation as me, this is what I ended up with:
[ -z "${MASTER_PORT}" ] && MASTER_PORT=10086
[ -z "${n_gpu}" ] && n_gpu=1 #$(nvidia-smi -L | wc -l)
[ -z "${OMPI_COMM_WORLD_SIZE}" ] && OMPI_COMM_WORLD_SIZE=1
[ -z "${OMPI_COMM_WORLD_RANK}" ] && OMPI_COMM_WORLD_RANK=0
[ -z "${len_penalty}" ] && len_penalty=0.0
[ -z "${beam_size}" ] && beam_size=10
[ -z "${beam_size_second}" ] && beam_size_second=5
[ -z "${beam_head_second}" ] && beam_head_second=2
# SimpleGenerator SequenceGeneratorBeamSearch SequenceGeneratorBeamSearch_test
[ -z "${search_strategies}" ] && search_strategies=SequenceGeneratorBeamSearch
[ -z "${temperature}" ] && temperature=1
[ -z "${num_workers}" ] && num_workers=12
export TORCH_NCCL_ASYNC_ERROR_HANDLING=1
export OMP_NUM_THREADS=12
export TORCH_USE_CUDA_DSA=1
export CUDA_LAUNCH_BLOCKING=1
export TORCH_DISTRIBUTED_DEBUG=DETAIL
run_name=nag2g
save_dir="./save/${run_name}"
mkdir -p "${save_dir}"
python -m torch.distributed.run \
--nnodes=1 --nproc_per_node=$n_gpu $(which unicore-train) ./USPTO50K_brief_20230227 --user-dir . --valid-subset valid,test \
--num-workers 12 --ddp-backend=no_c10d --user-dir ./NAG2G \
--task G2G_unimolv2 --loss G2G --arch NAG2G_G2G --encoder-type unimolv2 \
--optimizer adam --adam-betas '(0.9, 0.999)' --adam-eps 1e-8 --clip-norm 1.0 \
--lr-scheduler polynomial_decay --lr 2.5e-4 --warmup-updates 12000 --batch-size 16 \
--update-freq 1 --seed 1 --power 1 \
--tensorboard-logdir $save_dir/tsb \
--log-interval 100 --log-format simple \
--save-interval-updates 5000 --validate-interval-updates 5000 --keep-interval-updates 1 --no-epoch-checkpoints --max-epoch 0 \
--save-dir $save_dir --use-decoder \
--encoder-embed-dim 768 --encoder-ffn-embed-dim 768 --decoder-embed-dim 768 --decoder-ffn-embed-dim 768 \
--config_file config.ini \
--max-update 120000 --decoder_type new --want_decoder_attn --end-learning-rate 1e-09 --total-num-update 120000 --activation-fn gelu --laplacian_pe_dim 0 --auto-regressive --position-type relative --use_reorder --want_charge_h --shufflegraph randomsmiles --want_h_degree --decoder-layers 6 --decoder-attention-heads 24 --pair-embed-dim 256 --encoder-layers 12 --encoder-attention-heads 48 \
--fp16 --fp16-init-scale 4 --fp16-scale-window 256
I haven't managed to replicate it exactly, as I'm probably still missing something.
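The `[ -z "${VAR}" ] && VAR=default` lines at the top of the script only assign a value when the variable is unset or empty, so each setting can be overridden from the environment at launch time. A minimal sketch of the same idiom using the `${VAR:=default}` parameter expansion; the function name `set_defaults` is hypothetical, and only a few of the variables are shown:

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of NAG2G): assign defaults only to
# variables that are unset or empty, like `[ -z "${VAR}" ] && VAR=default`.
set_defaults() {
  : "${MASTER_PORT:=10086}"
  : "${n_gpu:=1}"
  : "${beam_size:=10}"
}

set_defaults
echo "MASTER_PORT=${MASTER_PORT} n_gpu=${n_gpu} beam_size=${beam_size}"
```

Assuming the script is saved as `train.sh` (a hypothetical name), `n_gpu=4 bash train.sh` would then launch with `--nproc_per_node=4` while leaving the other defaults in place.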
I also tried to install it. Validation works, but honestly this requires advanced Linux expertise. Did you manage to train it?
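Since the launch command above depends on `unicore-train` being on the PATH and on the dataset and `--user-dir` directories existing, a quick check before launching can narrow down setup problems. This is only a sketch: the function `preflight` is hypothetical and not part of NAG2G or Uni-Core, and it checks nothing beyond the presence of commands and directories.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report which prerequisites of the launch command are missing.
# Arguments: data directory, user-dir, then any number of command names.
preflight() {
  local data_dir=$1 user_dir=$2
  shift 2
  local missing=0 cmd dir
  for cmd in "$@"; do
    # `command -v` succeeds only if the command can be resolved by the shell.
    if ! command -v "$cmd" >/dev/null 2>&1; then
      echo "missing command: $cmd" >&2
      missing=1
    fi
  done
  for dir in "$data_dir" "$user_dir"; do
    if [ ! -d "$dir" ]; then
      echo "missing directory: $dir" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

With the paths used in the script above, the call would be `preflight ./USPTO50K_brief_20230227 ./NAG2G python unicore-train || exit 1`, placed before the `python -m torch.distributed.run` line.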
I apologize for the delayed response. I have updated the
@lennart-rth Were you able to train the model with the new script? I am having issues training it as well; could you guide me towards a solution? #6
Hi,
can you give some additional information about the training process?
I would like to train NAG2G from scratch, but lack a training script.
Can I use the same arguments as in validate.sh without the --infer_step flag and with lr and optimizer as in the paper?
Best regards,
Lennart