NER with BERT #21
Hello @byzhang! Yes, I plan to reproduce BERT NER when I find the time (also FLAIR). Did you use the fine-tuning approach or the "ELMo-like" feature-based approach they describe in section 5.4 of their paper? In the NER evaluation of their paper, it is unclear whether they used the CoNLL 2003 dev section for training, which can make quite a big difference in the final f-score (but not as big as what you mention).
I used the fine-tuning approach, and the dev set is used for hyperparameter tuning and early stopping only.
See the ongoing work in PR #78.
The best run I could get with BERT-base-en (cased) is 91.68 on the CoNLL 2003 NER test set, tuning with the dev set and training only with the train set - but I added a CRF layer for fine-tuning instead of the default softmax (the CRF brings around +0.3 to the f-score). So this is the same as you. Averaged over 10 training+eval runs, this gives 91.20 - so very far from the reported 92.4. As discussed there, the results reported in the paper for NER are likely token-level scores, not entity-level - very misleading of course.
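For reference, a minimal sketch of what replacing the softmax head with a CRF decoding layer can look like, using the pytorch-crf package; the class and wiring below are illustrative assumptions, not DeLFT's actual implementation:

```python
from torch import nn
from transformers import BertModel
from torchcrf import CRF  # pip install pytorch-crf

class BertCrfTagger(nn.Module):
    """Illustrative BERT encoder with a CRF layer instead of a softmax head."""
    def __init__(self, num_labels, model_name="bert-base-cased"):
        super().__init__()
        self.bert = BertModel.from_pretrained(model_name)
        self.classifier = nn.Linear(self.bert.config.hidden_size, num_labels)
        self.crf = CRF(num_labels, batch_first=True)

    def forward(self, input_ids, attention_mask, labels=None):
        hidden = self.bert(input_ids, attention_mask=attention_mask).last_hidden_state
        emissions = self.classifier(hidden)  # per-token label scores
        mask = attention_mask.bool()
        if labels is not None:
            # training: negative log-likelihood of the gold tag sequence under the CRF
            return -self.crf(emissions, labels, mask=mask)
        # inference: Viterbi decoding of the best tag sequence per sentence
        return self.crf.decode(emissions, mask=mask)
```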
In order to reproduce the CoNLL score reported in the BERT paper (92.4 BERT-base and 92.8 BERT-large), one trick is to apply a truecaser on article titles (all upper case sentences) as a preprocessing step for the CoNLL train/dev/test sets. This can be done simply with a method like the following.
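The method itself is not reproduced above; below is a minimal sketch of the idea, assuming the `truecase` Python package and restricting recasing to all-uppercase sentences such as CoNLL article titles:

```python
import truecase  # pip install truecase

def truecase_upper_sentence(tokens):
    """Recase a sentence only if its alphabetic tokens are all upper-cased
    (e.g. CoNLL-2003 article titles); otherwise leave it untouched."""
    alpha = [t for t in tokens if t.isalpha()]
    if alpha and all(t.isupper() for t in alpha):
        recased = truecase.get_true_case(" ".join(tokens)).split()
        # the truecaser applies its own tokenization; keep the original
        # tokens if the token counts no longer line up
        if len(recased) == len(tokens):
            return recased
    return tokens

# ['SOCCER', '-', 'JAPAN', 'GET', 'LUCKY', 'WIN'] would become something
# like ['Soccer', '-', 'Japan', 'get', 'lucky', 'win']
```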
Also, I found it useful to use a very small learning rate (5e-6), a large batch size (128), and a high number of epochs (>40). With these configurations and this preprocessing, I was able to reach 92.8 with BERT-large.
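Expressed as HuggingFace `TrainingArguments` (the exact training stack used here is not specified in the thread, so this mapping is an assumption), the configuration would look roughly like:

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="bert-large-conll03-ner",  # hypothetical output path
    learning_rate=5e-6,                   # very small learning rate
    per_device_train_batch_size=128,      # large batch; may need gradient accumulation
    num_train_epochs=40,                  # high epoch count (>40)
)
```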
Hello @ghaddarAbs! Thank you for your message and for spending the time to share your experiments on reproducing the reported BERT results. Sorry it took me some time to come back to this. I've tried to see the impact of the truecase pre-processing with bert-base-en (cased), with the reported 92.4 f-score in mind (I am using bert-base because I don't readily have the GPU for bert-large). Below, I didn't touch the hyper-parameters:
The scores are averaged over 10 train/eval runs, with worst-best scores in parentheses. So the gain from the pre-processing alone is significant (+0.22) but not big. Apparently the truecase has no impact on BidLSTM-CRF, but it does have an impact with BERT. I guess it's because in BERT the vocabulary is case-sensitive and does not consider extra casing variants of the 30,522 sub-tokens, while BidLSTM has a dedicated char input channel which deals very well with generalizing over casing (which also explains why adding "casing" features to the BidLSTM-CRF has zero effect).

In terms of evaluation, I think we are no longer really comparing just NER algorithms here; we are also evaluating the truecasing tool - it's what people usually call "using external knowledge".

I've started to experiment with your indicated hyper-parameters, but it takes a lot of time (>40 epochs with such a low learning rate is really different from the usual 3-6 epochs selected with BERT; it takes days and days with 10 runs :/). Regarding your results, may I ask you the following questions:
On my side, when I add all these "tricks", I am not very far from the reported score (but still 0.3-0.4 missing). But, from the reproducibility point of view, according to the original BERT paper they are not using any of them (thus the 91.20 f-score versus the reported 92.4). From the evaluation point of view, I must say that using these tricks makes the evaluation no longer comparable with other reported numbers, unless we add them to the other algorithms too.
@kermitt2 ... I used a GPU with 32 GB for these experiments. To answer your 3 questions:
My own intuition is that the authors of BERT applied truecasing on CoNLL-2003 for the NER fine-tuning. It was the only way for me to reproduce their results, but I don't actually know whether they did it or not. Of course, if truecasing is applied then the results are not comparable with previous work.
I trained the NER model with bert-base-cased and truecase as well, and found that it gets a 91.72 F1 score on average, but it is still far from the score reported in the BERT paper.
I tried truecase with bert-base-cased and it gave a little improvement, but the test f1 was still limited to below 92.0. The BERT paper says that they used a maximal document context for NER. That means, I think, that they used left/right sentence context for predicting the target sentence. I tried this document context and could get around 92.4 test f1.
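A minimal sketch of this "maximal document context" idea: grow a window of neighbouring sentences around the target sentence until the sub-token budget is exhausted, then predict (and score) labels only for the target span. The function and its return convention are illustrative assumptions, not the code used above:

```python
def with_document_context(doc_sentences, idx, tokenizer, max_tokens=510):
    """doc_sentences: list of token lists for one document; idx: target sentence.
    Returns the context words and the word-level span of the target inside them."""
    left, right = idx, idx + 1  # sentence range [left, right) currently included

    def fits(lo, hi):
        words = [w for sent in doc_sentences[lo:hi] for w in sent]
        return len(tokenizer.tokenize(" ".join(words))) <= max_tokens

    grown = True
    while grown:
        grown = False
        if left > 0 and fits(left - 1, right):  # extend one sentence to the left
            left -= 1
            grown = True
        if right < len(doc_sentences) and fits(left, right + 1):  # and to the right
            right += 1
            grown = True

    words = [w for sent in doc_sentences[left:right] for w in sent]
    start = sum(len(sent) for sent in doc_sentences[left:idx])
    return words, (start, start + len(doc_sentences[idx]))
```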
@pinesnow72
@BCWang93 |
@pinesnow72, hi, can you share some code for how you process the data with this method? Thanks!
Do you have a plan to reproduce the BERT NER model? I tried, but with BERT-base the best micro-avg test F1 on CoNLL-2003 is 91.37, while the score reported in the paper is 92.4.