
compare CIDEr optimization and training time with BUTD paper #100

Open
HYPJUDY opened this issue Jul 5, 2019 · 4 comments

HYPJUDY commented Jul 5, 2019

Hi, thanks for your contribution! I have several questions:

I noticed that the Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering (BUTD) paper mentions completing CIDEr optimization in a single epoch, as follows:

We have observed when decoding using beam search that the resulting beam typically contains at least one very high quality (and high scoring) caption – although frequently the best caption does not have the highest log-probability of the set. Therefore, we make one additional approximation. Rather than sampling captions from the entire probability distribution, for more rapid training we take the captions in the decoded beam as a sample set. Using this approach, we complete CIDEr optimization in a single epoch.

I wonder why you don't implement this method/trick, since it would save a huge amount of time! Do you have any plans for it? I haven't figured out what "take the captions in the decoded beam as a sample set" means or how to implement it. Could you shed some light on it?
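My rough guess at what that might look like is sketched below. This is only an approximation of the idea, and names like model.beam_search and get_cider_reward are placeholders, not this repo's actual API:

import torch

def scst_loss_from_beam(model, fc_feats, att_feats, gts, beam_size=5):
    # Decode a beam per image and keep the per-token log-probs of each candidate.
    # Shapes assumed: seqs [batch, beam_size, seq_len], log_probs the same.
    seqs, log_probs = model.beam_search(fc_feats, att_feats, beam_size=beam_size)

    # Score every beam candidate with CIDEr against the ground-truth captions,
    # and use the mean reward of the beam as a per-image baseline.
    rewards = get_cider_reward(seqs, gts)            # [batch, beam_size]
    baseline = rewards.mean(dim=1, keepdim=True)
    advantage = (rewards - baseline).detach()

    # REINFORCE-style loss: weight each candidate's log-likelihood by its advantage.
    mask = (seqs > 0).float()                        # assumes 0 is the pad token
    seq_logp = (log_probs * mask).sum(dim=2)         # [batch, beam_size]
    return -(advantage * seq_logp).mean()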

As for training time, the BUTD paper says:

Training using two Nvidia Titan X GPUs takes around 9 hours (including less than one hour for CIDEr optimization).

Using the BUTD model in this repo, training with 4 Tesla M40 GPUs takes around 9 hours and 40 minutes (mine: 30 epochs of cross-entropy training on 4 M40s vs. the BUTD paper: 60 epochs of cross-entropy training plus less than 1 hour for 1 epoch of CIDEr optimization on 2 Titan Xs).

I think the original Caffe implementation is much faster. Do you have any idea why?

FYI (and for others who might be interested in the complete commands to get comparable BUTD model results), my training details are as follows:

# Training command:
CUDA_VISIBLE_DEVICES=0,1,2,3 python train.py --id topdown --caption_model topdown --input_json data/cocotalk.json  --input_label_h5 data/cocotalk_label.h5 --batch_size 100 --learning_rate 0.001 --learning_rate_decay_start 0 --checkpoint_path log_topdown --save_checkpoint_every 1100 --val_images_use 5000 --max_epochs 30 --rnn_size 1000 --input_encoding_size 1000 --att_feat_size 2048 --att_hid_size 512  --language_eval 1 --scheduled_sampling_start 0 --use_bn 1 --learning_rate_decay_every 4

I added --use_bn 1, referring to here.

I set the learning rate and optimization a little differently (--learning_rate 0.001 --learning_rate_decay_every 4) from #31 and ruotianluo/ImageCaptioning.pytorch#10 (they got good results but I didn't, so I tried a larger learning rate). The performance of several models under my setting is:

# Evaluate command:
################ Evaluate with the "best" model
CUDA_VISIBLE_DEVICES=0 python eval.py --dump_images 0 --num_images 5000 --model log_topdown/model-best.pth --infos_path log_topdown/infos_topdown-best.pkl --language_eval 1 --input_json data/cocotalk.json --input_label_h5 data/cocotalk_label.h5
# --beam_size 2 (default for evaluation)
{'bad_count_rate': 0.0004, 'CIDEr': 1.1148884011098006, 'Bleu_4': 0.3603408467287395, 'Bleu_3': 0.4694829933002577, 'Bleu_2': 0.607562860084038, 'Bleu_1': 0.766113185189927, 'ROUGE_L': 0.5653451933953932, 'METEOR': 0.2740137487261984, 'SPICE': 0.2055183043111108, 'WMD': 0.24464028099799806}

# Result (with --beam_size 5):
{'bad_count_rate': 0.0006, 'CIDEr': 1.1211965870603595, 'Bleu_4': 0.36460243862631503, 'Bleu_3': 0.4688452157360828, 'Bleu_2': 0.6018223507223015, 'Bleu_1': 0.7579787093708146, 'ROUGE_L': 0.5650916933775064, 'METEOR': 0.27389245734291195, 'SPICE': 0.2041505771510281, 'WMD': 0.24872474615271603}

################ Evaluate with the newest model at iteration 33000 (epoch 30)
CUDA_VISIBLE_DEVICES=0 python eval.py --dump_images 0 --num_images 5000 --model log_topdown/model-33000.pth --infos_path log_topdown/infos_topdown-33000.pkl --language_eval 1 --input_json data/cocotalk.json --input_label_h5 data/cocotalk_label.h5 
# --beam_size 1
{'bad_count_rate': 0.0084, 'CIDEr': 1.1029445535221165, 'Bleu_4': 0.34359710944020483, 'Bleu_3': 0.4572806495576515, 'Bleu_2': 0.6022717203909068, 'Bleu_1': 0.7631783352576728, 'ROUGE_L': 0.5601634990771689, 'METEOR': 0.2697002153724845, 'SPICE': 0.20273633154399562, 'WMD': 0.23520550249768166}

# --beam_size 2 (default for evaluation)
{'bad_count_rate': 0.003, 'CIDEr': 1.1407954473531186, 'Bleu_4': 0.3674616063653743, 'Bleu_3': 0.47547687916365766, 'Bleu_2': 0.6128676587816714, 'Bleu_1': 0.7717025247216635, 'ROUGE_L': 0.5695801619395764, 'METEOR': 0.277191528089664, 'SPICE': 0.20851787975533756, 'WMD': 0.25000557793758726}

# --beam_size 5
{'bad_count_rate': 0.001, 'CIDEr': 1.1237630664163487, 'Bleu_4': 0.36522950227562834, 'Bleu_3': 0.4684915848374431, 'Bleu_2': 0.6020135937253345, 'Bleu_1': 0.759595847236902, 'ROUGE_L': 0.5648983177964815, 'METEOR': 0.2758252654067539, 'SPICE': 0.20507711989921631, 'WMD': 0.25113003325702343}

You mentioned in the README that

Beam search can increase the performance of the search for greedy decoding sequence by ~5%.

But I only got a slight improvement (<1%) on a few metrics in my experiment. Is that reasonable, and do you know why?

BUTD uses this learning rate schedule:

In training, we use a simple learning rate schedule, beginning with a learning rate of 0.01 which is reduced to zero on a straight-line basis over 60K iterations using a batch size of 100 and a momentum parameter of 0.9.

Could you tell me how to set this up with your code?
In addition to setting --learning_rate 0.01 --optim sgdmom and changing

elif opt.optim == 'sgdmom':
    return optim.SGD(params, opt.learning_rate, opt.optim_alpha, weight_decay=opt.weight_decay, nesterov=True)

to

elif opt.optim == 'sgdmom':
    # pass momentum once, as a keyword, instead of also passing opt.optim_alpha positionally
    return optim.SGD(params, opt.learning_rate, momentum=0.9, weight_decay=opt.weight_decay, nesterov=True)

what else should I modify?

ruotianluo (Owner) commented:

1. I actually didn't notice that. I may explore it later.
2. Due to the legacy scheduled sampling issue, at training time I run the LSTM step by step in a for loop. Using the native decoder may help; I don't know. The time also depends on data loading.
3. I do think there should be more of an improvement; I ran on a checkpoint and got 1.104 -> 1.127.
4. I don't have that kind of learning rate decay. You can add such an lr scheduler (see the sketch below).
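A minimal sketch of such a scheduler for the BUTD schedule (lr 0.01 decayed linearly to zero over 60K iterations, SGD with momentum 0.9). This is a standalone illustration rather than code from this repo, and the nn.Linear model is just a placeholder:

import torch.nn as nn
import torch.optim as optim

model = nn.Linear(10, 10)          # placeholder for the captioning model
max_iters = 60000

optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9, nesterov=True)
# LambdaLR multiplies the base lr by the returned factor at every scheduler.step().
scheduler = optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda it: max(0.0, 1.0 - it / max_iters))

for it in range(max_iters):
    # ... forward pass, loss.backward(), optimizer.step(), optimizer.zero_grad() ...
    scheduler.step()               # step once per iteration so the decay is per-iteration, not per-epoch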

HYPJUDY commented Jul 6, 2019

1. Thanks! Hoping to see this feature soon!
2. What if I turn off scheduled sampling? It seems the time doesn't decrease (I didn't perform a strict ablation study). I did load data directly from the source tsv feature files (instead of creating data/cocobu_fc, data/cocobu_att and data/cocobu_box), but I think that only increases the initial loading time (a rough sketch of that tsv decoding is below). BTW, why don't you load the tsv files directly? I remember you offered this option in the README once, but I cannot find it now. What's the disadvantage?
3. My mistake. The default beam size for evaluation is 2, not 1. I've updated my post with the newest results and strikethrough. One more question related to the results: I found that model-33000.pth is usually better than model-best.pth on most metrics. Why is the val_loss in the cross-entropy training stage inconsistent with most metrics? Which iteration/epoch gives the real best model?
4. Gotcha~
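The tsv decoding sketch mentioned above, assuming the standard FIELDNAMES layout used by the bottom-up-attention feature files (the helper name and details are illustrative only):

import base64
import csv
import sys

import numpy as np

csv.field_size_limit(sys.maxsize)
# Standard column layout of the bottom-up-attention feature tsv files.
FIELDNAMES = ['image_id', 'image_w', 'image_h', 'num_boxes', 'boxes', 'features']

def read_tsv(path):
    """Yield (image_id, att_feats) pairs; att_feats has shape [num_boxes, feat_dim]."""
    with open(path) as f:
        reader = csv.DictReader(f, delimiter='\t', fieldnames=FIELDNAMES)
        for item in reader:
            num_boxes = int(item['num_boxes'])
            feats = np.frombuffer(base64.b64decode(item['features']),
                                  dtype=np.float32).reshape(num_boxes, -1)
            yield int(item['image_id']), feats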

ruotianluo (Owner) commented:

2. It's not about turning it on or off. In order to support scheduled sampling, the current implementation does not use the default approach of feeding the whole sequence to the LSTM at once, which may be faster (see the sketch below). Yes, only the first epoch. I never supported directly loading tsv files.
3. Just choose the one with the best CIDEr score; that's what I usually do.
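A rough, self-contained illustration of the scheduled sampling point (made-up module sizes, not this repo's model): because the next input may be the model's own previous prediction, the LSTM must be unrolled step by step in Python instead of receiving the whole ground-truth sequence in one call.

import torch
import torch.nn as nn

vocab_size, hidden = 10000, 512
embed = nn.Embedding(vocab_size, hidden)
lstm = nn.LSTMCell(hidden, hidden)
proj = nn.Linear(hidden, vocab_size)

def unroll_with_scheduled_sampling(tokens, ss_prob=0.25):
    """tokens: [batch, seq_len] ground-truth caption indices (long)."""
    batch = tokens.size(0)
    h = torch.zeros(batch, hidden)
    c = torch.zeros(batch, hidden)
    prev = tokens[:, 0]
    logits = []
    for t in range(1, tokens.size(1)):
        h, c = lstm(embed(prev), (h, c))
        step_logits = proj(h)
        logits.append(step_logits)
        # With probability ss_prob the next input is the model's own prediction,
        # which is only known after this step, hence the per-step loop.
        use_sample = torch.rand(batch) < ss_prob
        prev = torch.where(use_sample, step_logits.argmax(dim=1), tokens[:, t])
    return torch.stack(logits, dim=1)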


HYPJUDY commented Jul 7, 2019

Thanks!
