Abstractive Summarization Results #340
Hey, so we are getting close to these results, but still a little bit below.

Summarization Experiment Description

This document describes how to replicate summarization experiments on the CNNDM and Gigaword datasets using OpenNMT-py. An example article-title pair from Gigaword should look like this:

Input:
Output:

Preprocessing the data

Since we are using copy attention [1] in the model, we need to preprocess the dataset so that source and target are aligned and use the same dictionary. This is achieved by passing the corresponding options to the preprocessing script.

Commands used (a hedged sketch of such a call is given after this list):

(1) CNNDM
(2) Gigaword
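The exact preprocessing commands are not preserved above. Below is a minimal sketch of what the CNNDM call might look like, assuming OpenNMT-py's standard preprocess.py interface and the -dynamic_dict/-share_vocab options typically used with copy attention; all paths and truncation lengths are placeholders, and the Gigaword call would be analogous.

```bash
# Hedged sketch: CNNDM preprocessing with copy-attention-friendly options.
# All paths and length limits are assumptions, not the thread's exact values.
python preprocess.py \
    -train_src data/cnndm/train.txt.src \
    -train_tgt data/cnndm/train.txt.tgt \
    -valid_src data/cnndm/val.txt.src \
    -valid_tgt data/cnndm/val.txt.tgt \
    -save_data data/cnndm/CNNDM \
    -src_seq_length_trunc 400 \
    -tgt_seq_length_trunc 100 \
    -dynamic_dict \
    -share_vocab
```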
Training

The training procedure described in this section for the most part follows parameter choices and an implementation similar to that of See et al. [2]. As mentioned above, we use copy attention as a mechanism for the model to decide whether to generate a new word or to copy it from the source. For training, we use SGD with an initial learning rate of 1 for a total of 16 epochs. In most cases, the lowest validation perplexity is achieved around epoch 10-12. We also use OpenNMT's default learning rate decay, which halves the learning rate after every epoch once the validation perplexity has increased from one epoch to the next (or after epoch 8).

Commands used (a hedged sketch follows this list):

(1) CNNDM
(2) Gigaword
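The original training commands are likewise not preserved here. As a rough guide, below is a sketch of an epoch-based train.py call matching the description above (SGD, learning rate 1, 16 epochs, copy attention); the model sizes, seed, and -gpuid flag are assumptions, and newer OpenNMT-py versions use -train_steps instead of -epochs.

```bash
# Hedged sketch: RNN seq2seq with copy attention, trained with SGD for 16 epochs.
# Sizes and most flag values are assumptions based on the surrounding discussion.
python train.py \
    -data data/cnndm/CNNDM \
    -save_model models/cnndm \
    -copy_attn \
    -global_attention mlp \
    -encoder_type brnn \
    -layers 1 \
    -word_vec_size 128 \
    -rnn_size 512 \
    -optim sgd \
    -learning_rate 1 \
    -max_grad_norm 2 \
    -epochs 16 \
    -batch_size 16 \
    -seed 777 \
    -gpuid 0
```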
Inference

During inference, we use beam search with a beam size of 10.

Commands used (a hedged sketch follows this list):

(1) CNNDM
(2) Gigaword
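Again, the exact decoding commands are missing here; below is a minimal sketch with beam size 10, where the checkpoint name, paths, and length constraints are placeholders.

```bash
# Hedged sketch: beam-search decoding with a beam size of 10.
# Checkpoint name and paths are placeholders.
python translate.py \
    -gpu 0 \
    -batch_size 1 \
    -beam_size 10 \
    -model models/cnndm_acc_XX_ppl_XX_eXX.pt \
    -src data/cnndm/test.txt.src \
    -output testout/cnndm_predictions.txt \
    -min_length 35 \
    -max_length 100 \
    -verbose
```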
Evaluation

CNNDM

To evaluate the ROUGE scores on CNNDM, we extended the pyrouge wrapper with additional evaluations such as the amount of repeated n-grams (typically found in models with copy attention), found here. It can be run with the following command:
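That command is not shown above. As a rough guide, here is a sketch assuming the falcondai/rouge-baselines baseline.py script linked later in this thread; the -s/-t flag names and file paths are assumptions, and only the script name and the -m mode appear in the discussion below.

```bash
# Hedged sketch: scoring predictions against tagged gold targets with baseline.py.
# Flag names and paths are assumptions.
python baseline.py \
    -s testout/cnndm_predictions.txt \
    -t data/cnndm/test.txt.tgt.tagged \
    -m sent_tag_verbatim
```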
Note that the

Gigaword

For evaluation of large test sets such as Gigaword, we use a parallel Python wrapper around ROUGE, found here.

Command used:

Running the commands above should yield the following scores:
References

[1] Vinyals, O., Fortunato, M. and Jaitly, N., 2015. Pointer Networks. NIPS.
[2] See, A., Liu, P.J. and Manning, C.D., 2017. Get To The Point: Summarization with Pointer-Generator Networks. ACL.
[3] Bahdanau, D., Cho, K. and Bengio, Y., 2014. Neural Machine Translation by Jointly Learning to Align and Translate. ICLR.
[4] Luong, M.T., Pham, H. and Manning, C.D., 2015. Effective Approaches to Attention-based Neural Machine Translation. EMNLP. |
Cool. Can you let us know what results you got? When you say "better", do you mean compared to what? |
Hey, you wrote:
Can't seem to find this file. Can you link me to the project? By "better" I meant comparing the accuracy results on the original dataset to runs on See's preprocessed data. |
The script we've been using is this one: https://github.com/falcondai/pyrouge/ Thanks for the note about See's dataset. I will try and compare models with the different datasets. |
Still not sure where this |
Interesting discussion. @srush your example shows the |
@mataney I linked the wrong repo - https://github.com/falcondai/rouge-baselines is what we use (that in turn uses pyrouge). @pltrdy You're absolutely right; I copied the commands from a time before the |
@sebastianGehrmann And in order to get just the big files, I ran some of See's code (because I wanted to get another thing that is not just the article and the abstract): https://gist.github.com/mataney/67cfb05b0b84e88da3e0fe04fb80cfc8 So you can do something like this, or you can just concatenate them (the latter will be shorter). |
Thanks, I'll check it out. To make sure we use the same exact files, could you upload yours and send me a download link via email? That'd be great! (gehrmann (at) seas.harvard.edu) |
Huh, this is the code I ran to make the dataset; it was forked from hers: https://github.com/OpenNMT/cnn-dailymail I wonder if she changed anything... |
Oh I see, this is after the files are created. Huh, so the only thing I see that could be different is that she drops blank lines and does some unicode encoding. @mataney Could you run "sdiff " and confirm that? I don't see anything else in this gist, but I could be missing something. |
@srush These files should be the same (sdiff shouldn't work, as I have more data about each article than just the article and abstract). I can conclude this was a false alarm, as I didn't know you were using See's preprocessing, but you do :) |
Another question: after training and translating, I only get 1-sentence summaries. This seems strange. |
Oh, shoot. I forgot to mention this. See uses |
Why not just replace BTW, it seems that there is no |
Hey guys, |
I think we are basically there. What scores are you getting? |
@sebastianGehrmann (when he gets back from vacation) |
Using the hyperparameters you mentioned above, @srush, I get the following ROUGE scores on CNN/DM (after 16 epochs): |
Getting about the same, although I'm getting better results when embedding and hidden sizes are 500. (Obviously this is said without taking anything from the brilliant work that has been done here! 😄 ) |
Okay, let me post our model, we're doing a lot better. Think we need to update the docs. (Although, worrisome that you are getting different results with the same args. I will check into that. ) |
Okay, here are his args:
(See's RNN is split 512/256, which we don't support at the moment.) And then during translation, use Wu-style coverage (a hedged sketch of such a translate call is shown below). We're seeing a train ppl of 12.84, a val ppl of 11.98, and ROUGE-1/2 of 0.38 / 0.168 |
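Here is that sketch; the penalty values (alpha/beta), checkpoint name, and paths are placeholders, since the exact settings are not shown above.

```bash
# Hedged sketch: decoding with Wu-style length and coverage penalties.
# alpha/beta values are placeholders, not the settings referred to above.
python translate.py \
    -gpu 0 \
    -beam_size 10 \
    -model models/cnndm_acc_XX_ppl_XX_eXX.pt \
    -src data/cnndm/test.txt.src \
    -output testout/cnndm_predictions.txt \
    -length_penalty wu \
    -alpha 0.9 \
    -coverage_penalty wu \
    -beta 0.25 \
    -min_length 35
```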
Hey :) So the only thing that might be different is the data that is being passed to M. |
Thanks @LeenaShekhar for the detail. I guess I agree with your point regarding batch_size not making any difference during inference. But I was curious to know why we did not use a batch size of 16 during inference as it could have made the inference faster. |
Hi, |
Hi, |
I updated the document describing the summarization experiments here. To directly answer some questions by @ratishsp from above:
|
Thank you so much for updating the document. |
Thanks @sebastianGehrmann for updating the document. |
@ratishsp To answer your second question: a dropout of 0.3 is the default: `group.add_argument('-dropout', type=float, default=0.3,` |
Hi, everyone. Very nice discussion! I ran the baseline model nocopy_acc_51.33_ppl_12.74_e20 on the Gigaword test set with the report_rouge param and got |
Nice work! Thank you all! After reading the thread, I still have one question though. Is the coverage layer introduced by See not recommended during training? @sebastianGehrmann |
Is there something wrong with the script "python baseline.py -m no_sent_tag ..."? I tried it but got a low score. I ran "python baseline.py -m sent_tag_verbatim ...", and the result seems more normal. |
Hi @Maggione, thanks for the question. The baseline.py script supports multiple formats for your src and tgt data. In the one I describe in the tutorial, we have sentence-boundary tags in the gold but remove them from the prediction. Depending on your format, you might have to use a different one. I'll put it on my list to better format the different modes. |
Closing this thread; it's fully documented in the FAQ now.

Test set @ 200k steps, averaged over the last 10 checkpoints:

Running ROUGE...
1 ROUGE-1 Average_R: 0.37251 (95%-conf.int. 0.37023 - 0.37478) |
@vince62s I am also trying to run the transformer on CNNDM (on 2 GPUs). Could you share the set of train parameters you used? Are they the same as the parameters reported for the transformer here: http://opennmt.net/OpenNMT-py/Summarization.html ? |
Yes, same as there. |
Thanks. I just noticed that the results reported for CNN in this thread and in summarization.md were different, and so I asked. Also, you have used -copy_attn, which is different from the transformer paper's setting. Was that to improve the score? |
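For readers following along, here is a hedged sketch of a transformer-with-copy-attention training command of the kind discussed above; all sizes, schedules, and GPU settings are illustrative assumptions rather than the documented configuration.

```bash
# Hedged sketch: transformer training with copy attention on CNNDM (2 GPUs).
# Every value below is an assumption; consult the linked Summarization docs
# for the settings actually used.
python train.py \
    -data data/cnndm/CNNDM \
    -save_model models/cnndm_transformer \
    -encoder_type transformer \
    -decoder_type transformer \
    -position_encoding \
    -copy_attn \
    -share_embeddings \
    -layers 4 \
    -rnn_size 512 \
    -word_vec_size 512 \
    -dropout 0.2 \
    -label_smoothing 0.1 \
    -optim adam \
    -adam_beta2 0.998 \
    -decay_method noam \
    -learning_rate 2 \
    -warmup_steps 8000 \
    -max_grad_norm 0 \
    -param_init 0 \
    -param_init_glorot \
    -batch_size 4096 \
    -batch_type tokens \
    -normalization tokens \
    -accum_count 4 \
    -train_steps 200000 \
    -world_size 2 \
    -gpu_ranks 0 1
```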
Great work on all these summarization implementations, thanks a bunch! The results presented in the paper Bottom-Up Abstractive Summarization are based on this implementation, is that correct? When I follow the summarization example with the hyperparameters used there, I would expect my results to be equivalent to the "Pointer-Generator + Coverage Penalty (our implementation)" entry in Table 1. However, I obtain a drop of ~2.5 ROUGE points, as shown in the evaluation output below. Am I missing something, or did the current implementation diverge from the one used in the paper?

1 ROUGE-1 Average_R: 0.37577 (95%-conf.int. 0.37294 - 0.37851)
1 ROUGE-2 Average_R: 0.16530 (95%-conf.int. 0.16269 - 0.16789)
1 ROUGE-L Average_R: 0.34268 (95%-conf.int. 0.33975 - 0.34540) |
Which one did you run? RNN or transformer? |
I ran the RNN on CNNDM and evaluated using files2rouge, with the predictions and targets stripped of tags. |
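For anyone reproducing this step, a minimal sketch of that kind of evaluation; the <t>/</t> tag names, file paths, and files2rouge argument order are assumptions.

```bash
# Hedged sketch: strip sentence tags, then compute ROUGE with files2rouge.
sed -e 's/<t>//g' -e 's/<\/t>//g' testout/cnndm_predictions.txt > pred.clean.txt
sed -e 's/<t>//g' -e 's/<\/t>//g' data/cnndm/test.txt.tgt.tagged > gold.clean.txt
files2rouge pred.clean.txt gold.clean.txt
```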
@AIJoris Yes, the results in the paper are all from OpenNMT-py and the summarization example provides the exact commands I ran. |
@sebastianGehrmann Thanks a lot for your quick response. I ran the inference overnight with the model ada6_bridge_oldcopy_tagged_acc_54.17_ppl_11.17_e20.pt and the results are as follows:

1 ROUGE-1 Average_R: 0.37917 (95%-conf.int. 0.37638 - 0.38201)
1 ROUGE-2 Average_R: 0.16934 (95%-conf.int. 0.16684 - 0.17189)
1 ROUGE-L Average_R: 0.34806 (95%-conf.int. 0.34531 - 0.35070)

I used the following command for inference: python OpenNMT-py/translate.py -gpu 0 |
@sebastianGehrmann After some more testing I noticed that when loading the pretrained model, the input documents are not truncated to 400 tokens. This is the only difference I have been able to find between the pre-trained model and my own. Apart from that, it looks like the inference and/or test procedure is different, as the reported results above are different from the paper. |
I still haven't managed to obtain the reported results. I have tested both the pre-trained Transformer model as well as a Transformer model trained from scratch using the parameters from the documentation. I am using the following parameters for inference:
When evaluating with rouge-baselines using
Just to be clear, the above results are from the pre-trained model located here. Do you have any idea what can cause this difference in performance and how to improve it? |
Hi all, thanks for the code and all the replies in this discussion! I noticed that the command line for training has no '-seed' setting, which means the default value is used. However, I observed different results across launches when I did not set the seed myself in the train command line. |
@lauhaide could you report how long you trained the model, and the final scores? In fact, I can't reproduce it either, neither by training from scratch nor by just running inference (I get the exact same results as @AIJoris). @sebastianGehrmann could you help us on this? I even tried running inference on an old commit (since the repo has been moving), using commit bde7f83 from 2018-04-27; I get slightly different results, but still not the same as yours (old F1: 0.38689 / 0.17099 / 0.35935). |
Thanks for your prompt reply @pltrdy ! |
@lauhaide could you provide the checkpoint file? |
Yes, can be downloaded from here: |
Hey guys, looking at recent pull requests and issues, it looks like a common interest of contributors (on top of NMT, obviously) is abstractive summarization.
Any suggestions on how to train a model that will get results close to recent papers on the CNN-Daily Mail dataset? Any additional preprocessing?
Thanks!