Training procedure unclear #1

Closed
lennart-rth opened this issue Jun 26, 2024 · 4 comments

@lennart-rth

Hi,
Can you give some additional information about the training process?
I would like to train NAG2G from scratch, but could not find a training script.
Can I use the same arguments as in validate.sh, without the --infer_step flag, and with the learning rate and optimizer settings from the paper?
Best regards,
Lennart

@lennart-rth
Author

In case someone is in the same situation as me, this is what I ended up with:

[ -z "${MASTER_PORT}" ] && MASTER_PORT=10086
[ -z "${n_gpu}" ] && n_gpu=1 #$(nvidia-smi -L | wc -l)
[ -z "${OMPI_COMM_WORLD_SIZE}" ] && OMPI_COMM_WORLD_SIZE=1
[ -z "${OMPI_COMM_WORLD_RANK}" ] && OMPI_COMM_WORLD_RANK=0
[ -z "${len_penalty}" ] && len_penalty=0.0
[ -z "${beam_size}" ] && beam_size=10
[ -z "${beam_size_second}" ] && beam_size_second=5
[ -z "${beam_head_second}" ] && beam_head_second=2
# SimpleGenerator SequenceGeneratorBeamSearch SequenceGeneratorBeamSearch_test
[ -z "${search_strategies}" ] && search_strategies=SequenceGeneratorBeamSearch
[ -z "${temperature}" ] && temperature=1
[ -z "${num_workers}" ] && num_workers=12
export TORCH_NCCL_ASYNC_ERROR_HANDLING=1
export OMP_NUM_THREADS=12
export TORCH_USE_CUDA_DSA=1
export CUDA_LAUNCH_BLOCKING=1
export TORCH_DISTRIBUTED_DEBUG=DETAIL
run_name=nag2g
save_dir="./save/${run_name}"
mkdir -p ${save_dir}

python -m torch.distributed.run \
       --nnodes=1 --nproc_per_node=$n_gpu $(which unicore-train) ./USPTO50K_brief_20230227 --valid-subset valid,test \
       --num-workers 12 --ddp-backend=no_c10d --user-dir ./NAG2G \
       --task G2G_unimolv2 --loss G2G --arch NAG2G_G2G --encoder-type unimolv2 \
       --optimizer adam --adam-betas '(0.9, 0.999)' --adam-eps 1e-8 --clip-norm 1.0 \
       --lr-scheduler polynomial_decay --lr 2.5e-4 --warmup-updates 12000 --batch-size 16 \
       --update-freq 1 --seed 1 --power 1 \
       --tensorboard-logdir $save_dir/tsb \
       --log-interval 100 --log-format simple \
       --save-interval-updates 5000 --validate-interval-updates 5000 --keep-interval-updates 1 --no-epoch-checkpoints --max-epoch 0 \
       --save-dir $save_dir --use-decoder \
       --encoder-embed-dim 768 --encoder-ffn-embed-dim 768 --decoder-embed-dim 768 --decoder-ffn-embed-dim 768 \
       --config_file config.ini \
       --max-update 120000 --total-num-update 120000 --end-learning-rate 1e-09 \
       --decoder_type new --want_decoder_attn --activation-fn gelu --laplacian_pe_dim 0 \
       --auto-regressive --position-type relative --use_reorder --want_charge_h \
       --shufflegraph randomsmiles --want_h_degree \
       --decoder-layers 6 --decoder-attention-heads 24 --pair-embed-dim 256 \
       --encoder-layers 12 --encoder-attention-heads 48 \
       --fp16 --fp16-init-scale 4 --fp16-scale-window 256

I haven't managed to replicate the paper's results exactly, as I'm probably still missing something.
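
For reference, here is a minimal sketch of how one might launch the script above, assuming it is saved as train.sh (a hypothetical filename) in the repository root next to the USPTO50K_brief_20230227 data directory; the n_gpu override relies on the [ -z ... ] defaults at the top of the script:

# hypothetical launch; falls back to a single GPU if n_gpu is unset
n_gpu=2 bash train.sh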

@thegodone

I also tried to install it; validation works, but honestly this is for advanced Linux experts. Did you manage to train it?

@synsis
Collaborator

synsis commented Aug 9, 2024

I apologize for the delayed response. I have updated the train.sh file.

@synsis closed this as completed Aug 9, 2024
@mujeebarshad

@lennart-rth Were you able to train the model with the new script? I am having issues training it as well; could you guide me towards a solution? #6
