
compare CIDEr optimization and training time with BUTD paper #100

Open
HYPJUDY opened this issue Jul 5, 2019 · 4 comments

HYPJUDY commented Jul 5, 2019

Hi, thanks for your contribution! I have several questions:

I noticed that the Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering (BUTD) paper mentions completing CIDEr optimization in a single epoch, as follows:

We have observed when decoding using beam search that the resulting beam typically contains at least one very high quality (and high scoring) caption – although frequently the best caption does not have the highest log-probability of the set. Therefore, we make one additional approximation. Rather than sampling captions from the entire probability distribution, for more rapid training we take the captions in the decoded beam as a sample set. Using this approach, we complete CIDEr optimization in a single epoch.

I wonder why you don't implement this method/trick, since it would save a huge amount of time! Do you have any plans for it? I haven't figured out what "take the captions in the decoded beam as a sample set" means or how to implement it. Could you shed some light on it?
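My rough guess at what that might look like is sketched below. This is only an approximation of the idea, and names like model.beam_search and get_cider_reward are placeholders, not this repo's actual API:

import torch

def scst_loss_from_beam(model, fc_feats, att_feats, gts, beam_size=5):
    # Decode a beam per image and keep the per-token log-probs of each candidate.
    # Shapes assumed: seqs [batch, beam_size, seq_len], log_probs the same.
    seqs, log_probs = model.beam_search(fc_feats, att_feats, beam_size=beam_size)

    # Score every beam candidate with CIDEr against the ground-truth captions,
    # and use the mean reward of the beam as a per-image baseline.
    rewards = get_cider_reward(seqs, gts)            # [batch, beam_size]
    baseline = rewards.mean(dim=1, keepdim=True)
    advantage = (rewards - baseline).detach()

    # REINFORCE-style loss: weight each candidate's log-likelihood by its advantage.
    mask = (seqs > 0).float()                        # assumes 0 is the pad token
    seq_logp = (log_probs * mask).sum(dim=2)         # [batch, beam_size]
    return -(advantage * seq_logp).mean()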

As for training time, the BUTD paper says:

Training using two Nvidia Titan X GPUs takes around 9 hours (including less than one hour for CIDEr optimization).

Using the BUTD model in this repo, training with 4 Tesla M40 GPUs takes around 9 hours and 40 minutes (mine: 30 epochs of cross-entropy training on 4 M40s vs. the BUTD paper: 60 epochs of cross-entropy training plus less than 1 hour for 1 epoch of CIDEr optimization on 2 Titan Xs).

I think the original Caffe implementation is much faster. Do you have any idea why?

FYI (and for others who might be interested in the complete commands to get comparable BUTD model results), my training details are as follows:

# Training command:
CUDA_VISIBLE_DEVICES=0,1,2,3 python train.py --id topdown --caption_model topdown --input_json data/cocotalk.json  --input_label_h5 data/cocotalk_label.h5 --batch_size 100 --learning_rate 0.001 --learning_rate_decay_start 0 --checkpoint_path log_topdown --save_checkpoint_every 1100 --val_images_use 5000 --max_epochs 30 --rnn_size 1000 --input_encoding_size 1000 --att_feat_size 2048 --att_hid_size 512  --language_eval 1 --scheduled_sampling_start 0 --use_bn 1 --learning_rate_decay_every 4

I added --use_bn 1, referring to here.

I set the learning rate and optimization a little differently (--learning_rate 0.001 --learning_rate_decay_every 4) from #31 and ruotianluo/ImageCaptioning.pytorch#10 (they got good results but I didn't, so I tried a larger learning rate). The performance of several models under my setting is:

# Evaluate command:
################ Evaluate with the "best" model
CUDA_VISIBLE_DEVICES=0 python eval.py --dump_images 0 --num_images 5000 --model log_topdown/model-best.pth --infos_path log_topdown/infos_topdown-best.pkl --language_eval 1 --input_json data/cocotalk.json --input_label_h5 data/cocotalk_label.h5
# --beam_size 2 (default for evaluation)
{'bad_count_rate': 0.0004, 'CIDEr': 1.1148884011098006, 'Bleu_4': 0.3603408467287395, 'Bleu_3': 0.4694829933002577, 'Bleu_2': 0.607562860084038, 'Bleu_1': 0.766113185189927, 'ROUGE_L': 0.5653451933953932, 'METEOR': 0.2740137487261984, 'SPICE': 0.2055183043111108, 'WMD': 0.24464028099799806}

# Result (with --beam_size 5):
{'bad_count_rate': 0.0006, 'CIDEr': 1.1211965870603595, 'Bleu_4': 0.36460243862631503, 'Bleu_3': 0.4688452157360828, 'Bleu_2': 0.6018223507223015, 'Bleu_1': 0.7579787093708146, 'ROUGE_L': 0.5650916933775064, 'METEOR': 0.27389245734291195, 'SPICE': 0.2041505771510281, 'WMD': 0.24872474615271603}

################ Evaluate with the newest model at iteration 33000 (epoch 30)
CUDA_VISIBLE_DEVICES=0 python eval.py --dump_images 0 --num_images 5000 --model log_topdown/model-33000.pth --infos_path log_topdown/infos_topdown-33000.pkl --language_eval 1 --input_json data/cocotalk.json --input_label_h5 data/cocotalk_label.h5 
# --beam_size 1
{'bad_count_rate': 0.0084, 'CIDEr': 1.1029445535221165, 'Bleu_4': 0.34359710944020483, 'Bleu_3': 0.4572806495576515, 'Bleu_2': 0.6022717203909068, 'Bleu_1': 0.7631783352576728, 'ROUGE_L': 0.5601634990771689, 'METEOR': 0.2697002153724845, 'SPICE': 0.20273633154399562, 'WMD': 0.23520550249768166}

# --beam_size 2 (default for evaluation)
{'bad_count_rate': 0.003, 'CIDEr': 1.1407954473531186, 'Bleu_4': 0.3674616063653743, 'Bleu_3': 0.47547687916365766, 'Bleu_2': 0.6128676587816714, 'Bleu_1': 0.7717025247216635, 'ROUGE_L': 0.5695801619395764, 'METEOR': 0.277191528089664, 'SPICE': 0.20851787975533756, 'WMD': 0.25000557793758726}

# --beam_size 5
{'bad_count_rate': 0.001, 'CIDEr': 1.1237630664163487, 'Bleu_4': 0.36522950227562834, 'Bleu_3': 0.4684915848374431, 'Bleu_2': 0.6020135937253345, 'Bleu_1': 0.759595847236902, 'ROUGE_L': 0.5648983177964815, 'METEOR': 0.2758252654067539, 'SPICE': 0.20507711989921631, 'WMD': 0.25113003325702343}

You mentioned in the README that

Beam search can increase the performance of the search for greedy decoding sequence by ~5%.

But I only got a slight improvement (<1%) on a few metrics in my experiment. Is that reasonable, and do you know why?

BUTD uses this learning rate schedule:

In training, we use a simple learning rate schedule, beginning with a learning rate of 0.01 which is reduced to zero on a straight-line basis over 60K iterations using a batch size of 100 and a momentum parameter of 0.9.

Could you tell me how to set this up with your code?
In addition to setting --learning_rate 0.01 --optim sgdmom and changing

elif opt.optim == 'sgdmom':
    return optim.SGD(params, opt.learning_rate, opt.optim_alpha, weight_decay=opt.weight_decay, nesterov=True)

to

elif opt.optim == 'sgdmom':
    # pass momentum once, as a keyword, instead of also passing opt.optim_alpha positionally
    return optim.SGD(params, opt.learning_rate, momentum=0.9, weight_decay=opt.weight_decay, nesterov=True)

what else should I modify?

ruotianluo (Owner) commented:

1. I actually didn't notice that. I may explore it later.
2. Due to the legacy scheduled sampling issue, at training time I run the LSTM step by step in a for loop. Using the native decoder may help; I don't know. The time also depends on data loading.
3. I do think there should be more of an improvement; I ran on a checkpoint and got 1.104 -> 1.127.
4. I don't have that kind of learning rate decay. You can add such an lr scheduler (see the sketch below).
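A minimal sketch of such a scheduler for the BUTD schedule (lr 0.01 decayed linearly to zero over 60K iterations, SGD with momentum 0.9). This is a standalone illustration rather than code from this repo, and the nn.Linear model is just a placeholder:

import torch.nn as nn
import torch.optim as optim

model = nn.Linear(10, 10)          # placeholder for the captioning model
max_iters = 60000

optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9, nesterov=True)
# LambdaLR multiplies the base lr by the returned factor at every scheduler.step().
scheduler = optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda it: max(0.0, 1.0 - it / max_iters))

for it in range(max_iters):
    # ... forward pass, loss.backward(), optimizer.step(), optimizer.zero_grad() ...
    scheduler.step()               # step once per iteration so the decay is per-iteration, not per-epoch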

HYPJUDY commented Jul 6, 2019

1. Thanks! Hoping to see this feature soon!
2. What if I turn off scheduled sampling? It seems the time doesn't decrease (I didn't perform a strict ablation study). I did load data directly from the source tsv feature files (instead of creating data/cocobu_fc, data/cocobu_att and data/cocobu_box), but I think that only increases the initial loading time (a rough sketch of that tsv decoding is below). BTW, why don't you load the tsv files directly? I remember you offered this option in the README once, but I cannot find it now. What's the disadvantage?
3. My mistake. The default beam size for evaluation is 2, not 1. I've updated my post with the newest results and strikethrough. One more question related to the results: I found that model-33000.pth is usually better than model-best.pth on most metrics. Why is the val_loss in the cross-entropy training stage inconsistent with most metrics? Which iteration/epoch gives the real best model?
4. Gotcha~
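The tsv decoding sketch mentioned above, assuming the standard FIELDNAMES layout used by the bottom-up-attention feature files (the helper name and details are illustrative only):

import base64
import csv
import sys

import numpy as np

csv.field_size_limit(sys.maxsize)
# Standard column layout of the bottom-up-attention feature tsv files.
FIELDNAMES = ['image_id', 'image_w', 'image_h', 'num_boxes', 'boxes', 'features']

def read_tsv(path):
    """Yield (image_id, att_feats) pairs; att_feats has shape [num_boxes, feat_dim]."""
    with open(path) as f:
        reader = csv.DictReader(f, delimiter='\t', fieldnames=FIELDNAMES)
        for item in reader:
            num_boxes = int(item['num_boxes'])
            feats = np.frombuffer(base64.b64decode(item['features']),
                                  dtype=np.float32).reshape(num_boxes, -1)
            yield int(item['image_id']), feats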

ruotianluo (Owner) commented:

2. It's not about turning it on or off. In order to support scheduled sampling, the current implementation does not use the default approach of feeding the whole sequence to the LSTM at once, which may be faster (see the sketch below). Yes, only the first epoch. I never supported directly loading tsv files.
3. Just choose the one with the best CIDEr score; that's what I usually do.
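A rough, self-contained illustration of the scheduled sampling point (made-up module sizes, not this repo's model): because the next input may be the model's own previous prediction, the LSTM must be unrolled step by step in Python instead of receiving the whole ground-truth sequence in one call.

import torch
import torch.nn as nn

vocab_size, hidden = 10000, 512
embed = nn.Embedding(vocab_size, hidden)
lstm = nn.LSTMCell(hidden, hidden)
proj = nn.Linear(hidden, vocab_size)

def unroll_with_scheduled_sampling(tokens, ss_prob=0.25):
    """tokens: [batch, seq_len] ground-truth caption indices (long)."""
    batch = tokens.size(0)
    h = torch.zeros(batch, hidden)
    c = torch.zeros(batch, hidden)
    prev = tokens[:, 0]
    logits = []
    for t in range(1, tokens.size(1)):
        h, c = lstm(embed(prev), (h, c))
        step_logits = proj(h)
        logits.append(step_logits)
        # With probability ss_prob the next input is the model's own prediction,
        # which is only known after this step, hence the per-step loop.
        use_sample = torch.rand(batch) < ss_prob
        prev = torch.where(use_sample, step_logits.argmax(dim=1), tokens[:, t])
    return torch.stack(logits, dim=1)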


HYPJUDY commented Jul 7, 2019

Thanks!
