
Bert sequence labelling #78

Merged
merged 36 commits into master on Jan 16, 2020

Conversation

@kermitt2 kermitt2 commented Dec 28, 2019

Add BERT architecture for sequence labelling.

As noted here, the original CoNLL-2003 NER results reported by the Google Research paper are far from reproducible, and they probably report token-level metrics instead of entity-level metrics (as computed by conlleval and previous works). In general, generic pre-trained transformer models appear to perform poorly for information extraction and NER tasks (both with fine-tuning and as contextual embedding features), as compared to ELMo.

Still, it's a good exercise, and using SciBERT/BioBERT for scientific text achieves very good results, and faster, even compared to ELMo+BidLSTM-CRF.

As with the usage of BERT for text classification in DeLFT, we use a data generator to feed BERT when predicting (instead of the file-based input function of the original BERT implementation), avoiding a reload of the whole TF graph for each batch. This is made possible by the FastPredict class in model.py, which is adapted from https://github.com/marcsto/rl/blob/master/src/fast_predict2.py by Marc Stogaitis.
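
For reference, a minimal sketch of that pattern, following the upstream fast_predict2.py rather than the exact code in model.py (names like next_features and input_fn_builder are illustrative):

import tensorflow as tf  # TF 1.x Estimator API, as used by the original BERT code

class FastPredict:
    """Keep a single estimator.predict() call alive via a generator,
    so the TF graph and checkpoint are loaded only once."""

    def __init__(self, estimator, input_fn_builder):
        self.estimator = estimator
        self.input_fn_builder = input_fn_builder
        self.first_run = True
        self.closed = False

    def _generator(self):
        # Yields the most recent feature batch until close() is called.
        while not self.closed:
            yield self.next_features

    def predict(self, feature_batch):
        self.next_features = feature_batch
        if self.first_run:
            self.batch_size = len(feature_batch)
            self.predictions = self.estimator.predict(
                input_fn=self.input_fn_builder(self._generator))
            self.first_run = False
        return [next(self.predictions) for _ in range(self.batch_size)]

    def close(self):
        self.closed = True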

Using an Nvidia GeForce 1080 GPU, we can process around 1000 tokens per second with this approach, which is 3 times faster than BidLSTM-CRF+ELMo, but 30 times slower than a BidLSTM-CRF alone (and 100 times slower than what we get with a Wapiti CRF model on a modern workstation ;).

@kermitt2 kermitt2 self-assigned this Dec 28, 2019
@kermitt2 kermitt2 added the enhancement New feature or request label Dec 28, 2019
@kermitt2

I will also add an optional CRF activation layer as an alternative to the current softmax layer (which is not well suited to sequence labelling).
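
For reference, a minimal sketch of what such a CRF layer on top of the per-token BERT logits could look like with TF 1.x tf.contrib.crf (illustrative only, not the actual DeLFT code; the variable names are assumptions):

import tensorflow as tf

def crf_layer(logits, label_ids, sequence_lengths, num_labels):
    # Learned transition scores between labels
    transitions = tf.get_variable("crf_transitions", shape=[num_labels, num_labels])
    log_likelihood, transitions = tf.contrib.crf.crf_log_likelihood(
        inputs=logits, tag_indices=label_ids,
        sequence_lengths=sequence_lengths, transition_params=transitions)
    loss = tf.reduce_mean(-log_likelihood)
    # Viterbi decoding of the most likely label sequence
    predictions, _ = tf.contrib.crf.crf_decode(logits, transitions, sequence_lengths)
    return loss, predictions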

@kermitt2 kermitt2 changed the title Bert sequence labeling Bert sequence labelling Dec 28, 2019

kermitt2 commented Dec 28, 2019

CoNLL 2003 NER
with softmax activation layer for fine-tuning
no parameter tuning, dev set ignored
BERT-base-en cased

average over 10 folds
            precision    recall  f1-score   support

       ORG     0.8804    0.8999    0.8900      1661
      MISC     0.7702    0.8181    0.7934       702
       PER     0.9604    0.9533    0.9568      1617
       LOC     0.9240    0.9258    0.9249      1668

	macro f1 = 0.9068
	macro precision = 0.9010
	macro recall = 0.9127 


** Worst ** model scores - 7
                  precision    recall  f1-score   support

             PER     0.9643    0.9524    0.9583      1617
             ORG     0.8642    0.8892    0.8766      1661
             LOC     0.9188    0.9293    0.9240      1668
            MISC     0.7510    0.8077    0.7783       702

all (micro avg.)     0.8932    0.9090    0.9010      5648


** Best ** model scores - 1
                  precision    recall  f1-score   support

             PER     0.9628    0.9592    0.9610      1617
             ORG     0.8937    0.9013    0.8975      1661
             LOC     0.9169    0.9257    0.9212      1668
            MISC     0.7867    0.8248    0.8053       702

all (micro avg.)     0.9062    0.9155    0.9109      5648


kermitt2 commented Dec 28, 2019

CoNLL 2003 NER
with CRF activation layer for fine-tuning (slightly improved compared to softmax)
no parameter tuning, dev set ignored
BERT-base-en cased

average over 10 folds
            precision    recall  f1-score   support

       ORG     0.8793    0.9043    0.8916      1661
      MISC     0.7741    0.8201    0.7964       702
       PER     0.9632    0.9573    0.9602      1617
       LOC     0.9258    0.9257    0.9258      1668

	macro f1 = 0.9089
	macro precision = 0.9026
	macro recall = 0.9153 


** Worst ** model scores - 5
                  precision    recall  f1-score   support

             PER     0.9603    0.9573    0.9588      1617
             ORG     0.8776    0.8977    0.8875      1661
            MISC     0.7507    0.8148    0.7814       702
             LOC     0.9218    0.9257    0.9237      1668

all (micro avg.)     0.8968    0.9127    0.9047      5648


** Best ** model scores - 8
                  precision    recall  f1-score   support

             PER     0.9628    0.9604    0.9616      1617
             ORG     0.8735    0.9103    0.8915      1661
            MISC     0.7846    0.8248    0.8042       702
             LOC     0.9336    0.9269    0.9302      1668

all (micro avg.)     0.9045    0.9189    0.9116      5648


kermitt2 commented Dec 29, 2019

After painfully tuning the hyperparameters on the dev set, this is the best I get with BERT-base+CRF:

CoNLL 2003 NER
with CRF activation layer for fine-tuning
hyperparameter tuning with dev set
BERT-base-en cased

average over 10 folds
            precision    recall  f1-score   support

       ORG     0.8804    0.9114    0.8957      1661
      MISC     0.7823    0.8189    0.8002       702
       PER     0.9633    0.9576    0.9605      1617
       LOC     0.9290    0.9316    0.9303      1668

	macro f1 = 0.9120
	macro precision = 0.9050
	macro recall = 0.9191 


** Worst ** model scores - 9
                  precision    recall  f1-score   support

             ORG     0.8736    0.9073    0.8901      1661
             PER     0.9596    0.9555    0.9575      1617
             LOC     0.9221    0.9293    0.9256      1668
            MISC     0.7757    0.8177    0.7961       702

all (micro avg.)     0.8992    0.9164    0.9078      5648


** Best ** model scores - 2
                  precision    recall  f1-score   support

             ORG     0.8897    0.9229    0.9060      1661
             PER     0.9627    0.9573    0.9600      1617
             LOC     0.9375    0.9353    0.9364      1668
            MISC     0.7862    0.8120    0.7989       702

all (micro avg.)     0.9110    0.9226    0.9168      5648

nerTagger.py Outdated
@@ -422,6 +531,10 @@ def annotate(output_format,


if __name__ == "__main__":

architectures = ['BidLSTM_CRF', 'BidLSTM_CNN_CRF', 'BidLSTM_CNN_CRF', 'BidGRU_CRF', 'BidLSTM_CNN', 'BidLSTM_CRF_CASING',
'bert-base-en', 'bert-base-en', 'scibert', 'biobert']
Collaborator

I noticed there are two 'bert-base-en'

Owner Author

Oops, the second one should be bert-large-en!

@kermitt2 kermitt2 merged commit 94d45e8 into master Jan 16, 2020
@lfoppiano

I'm adding the results for the quantities model and for the superconductors model (run with bert-base-en): using BERT or SciBERT is actually not bringing any improvement.

Quantities

(base) [lfoppian0@mdpfdm005 delft-bert]$ grep "verage over 10 folds" -A 14  DelftQuantities*
DelftQuantitiesNormal.o4743:Average over 10 folds
DelftQuantitiesNormal.o4743-                  precision    recall  f1-score   support
DelftQuantitiesNormal.o4743-
DelftQuantitiesNormal.o4743-      <unitLeft>     0.9473    0.9667    0.9569       258
DelftQuantitiesNormal.o4743-     <unitRight>     0.9400    0.8000    0.8633        11
DelftQuantitiesNormal.o4743-   <valueAtomic>     0.8219    0.8868    0.8530       304
DelftQuantitiesNormal.o4743-     <valueBase>     0.9714    0.7500    0.8457         8
DelftQuantitiesNormal.o4743-    <valueLeast>     0.8772    0.8063    0.8396        80
DelftQuantitiesNormal.o4743-     <valueList>     0.7391    0.7883    0.7617        60
DelftQuantitiesNormal.o4743-     <valueMost>     0.8961    0.8234    0.8580        77
DelftQuantitiesNormal.o4743-    <valueRange>     0.9889    0.9250    0.9522         8
DelftQuantitiesNormal.o4743-
DelftQuantitiesNormal.o4743-all (macro avg.)     0.8706    0.8888    0.8796          
DelftQuantitiesNormal.o4743-
DelftQuantitiesNormal.o4743-model config file saved
--
DelftQuantitiesSciBert.o4594:Average over 10 folds
DelftQuantitiesSciBert.o4594-                  precision    recall  f1-score   support
DelftQuantitiesSciBert.o4594-
DelftQuantitiesSciBert.o4594-      <unitLeft>     0.9491    0.8810    0.9138       258
DelftQuantitiesSciBert.o4594-     <unitRight>     0.8332    0.6727    0.7411        11
DelftQuantitiesSciBert.o4594-   <valueAtomic>     0.8599    0.8635    0.8617       304
DelftQuantitiesSciBert.o4594-     <valueBase>     0.9875    0.9875    0.9875         8
DelftQuantitiesSciBert.o4594-    <valueLeast>     0.9080    0.8475    0.8766        80
DelftQuantitiesSciBert.o4594-     <valueList>     0.7585    0.8583    0.8053        60
DelftQuantitiesSciBert.o4594-     <valueMost>     0.9026    0.8870    0.8946        77
DelftQuantitiesSciBert.o4594-    <valueRange>     0.9500    0.9500    0.9500         8
DelftQuantitiesSciBert.o4594-
DelftQuantitiesSciBert.o4594-all (macro avg.)     0.8886    0.8689    0.8786      

Superconductors:

(base) [lfoppian0@mdpfdm005 delft-bert]$ grep "verage over 10 folds" -A 14  DelftSuper*
DelftSuperBert.o4593:Average over 10 folds
DelftSuperBert.o4593-                  precision    recall  f1-score   support
DelftSuperBert.o4593-
DelftSuperBert.o4593-         <class>     0.4298    0.3107    0.3598        28
DelftSuperBert.o4593-      <material>     0.6509    0.7420    0.6932       143
DelftSuperBert.o4593-     <me_method>     0.5057    0.3167    0.3842        12
DelftSuperBert.o4593-      <pressure>     0.6522    0.3500    0.4472        10
DelftSuperBert.o4593-            <tc>     0.6988    0.5697    0.6271        76
DelftSuperBert.o4593-       <tcValue>     0.5219    0.5312    0.5257        16
DelftSuperBert.o4593-
DelftSuperBert.o4593-all (macro avg.)     0.6314    0.6102    0.6205          
DelftSuperBert.o4593-
DelftSuperBert.o4593-
DelftSuperBert.o4593-Leaving TensorFlow...
--
DelftSuperNormal.o4742:Average over 10 folds
DelftSuperNormal.o4742-                  precision    recall  f1-score   support
DelftSuperNormal.o4742-
DelftSuperNormal.o4742-         <class>     0.4324    0.2857    0.3394        28
DelftSuperNormal.o4742-      <material>     0.7915    0.7594    0.7750       143
DelftSuperNormal.o4742-     <me_method>     0.5980    0.2917    0.3820        12
DelftSuperNormal.o4742-      <pressure>     0.9333    0.4100    0.5646        10
DelftSuperNormal.o4742-            <tc>     0.8432    0.7461    0.7912        76
DelftSuperNormal.o4742-       <tcValue>     0.6232    0.6562    0.6375        16
DelftSuperNormal.o4742-
DelftSuperNormal.o4742-all (macro avg.)     0.7628    0.6716    0.7140          
DelftSuperNormal.o4742-
DelftSuperNormal.o4742-model config file saved
DelftSuperNormal.o4742-preprocessor saved
DelftSuperNormal.o4742-model saved
--
DelftSuperSciBert.o4590:Average over 10 folds
DelftSuperSciBert.o4590-                  precision    recall  f1-score   support
DelftSuperSciBert.o4590-
DelftSuperSciBert.o4590-         <class>     0.3579    0.3321    0.3419        28
DelftSuperSciBert.o4590-      <material>     0.7876    0.8126    0.7997       143
DelftSuperSciBert.o4590-     <me_method>     0.6787    0.5333    0.5950        12
DelftSuperSciBert.o4590-      <pressure>     0.5710    0.4900    0.5229        10
DelftSuperSciBert.o4590-            <tc>     0.7339    0.6079    0.6648        76
DelftSuperSciBert.o4590-       <tcValue>     0.6233    0.6500    0.6355        16
DelftSuperSciBert.o4590-
DelftSuperSciBert.o4590-all (macro avg.)     0.7129    0.6786    0.6951       

@lfoppiano

Another thing I noticed here is that the model config of bert-base-en has some strange values:

  1. useBert = false
  2. embeddings_name: glove-...

See below:

{
    "model_name": "superconductors-bert-bert-base-en",
    "model_type": "bert-base-en",
    "embeddings_name": "glove-840B",
    "char_vocab_size": 179,
    "case_vocab_size": 8,
    "char_embedding_size": 25,
    "num_char_lstm_units": 25,
    "max_char_length": 30,
    "max_sequence_length": 512,
    "word_embedding_size": 300,
    "num_word_lstm_units": 100,
    "case_embedding_size": 5,
    "dropout": 0.5,
    "recurrent_dropout": 0.5,
    "use_char_feature": true,
    "use_crf": false,
    "fold_number": 10,
    "batch_size": 6,
    "use_ELMo": false,
    "use_BERT": false,
    "labels": {
        "<PAD>": 0,
        "O": 1,
        "B-<tc>": 2,
        "I-<tc>": 3,
        "B-<material>": 4,
        "I-<material>": 5,
        "B-<tcValue>": 6,
        "I-<tcValue>": 7,
        "B-<class>": 8,
        "I-<class>": 9,
        "B-<me_method>": 10,
        "I-<me_method>": 11,
        "B-<pressure>": 12,
        "I-<pressure>": 13
    }
}

@kermitt2

Thanks @lfoppiano! What is the "normal" you're comparing with?

For sequence labelling, BERT-base indeed gives results similar to BidLSTM-CRF with GloVe in general (but only after tuning the parameters), so this is in line with what is usually observed. With the recognition of software mentions, I saw that SciBERT was much better on scientific text than the normal BERT, but it still gives significantly lower scores than ELMo+BidLSTM-CRF (minus 1-2 points of f-score). On the contrary, for classification, SciBERT gives the best results on scholarly texts.

About the config: useBert = false is correct, because this parameter is for using BERT as "extracted features", i.e. as dynamic embeddings on top of GloVe for instance in the usual architectures (it is the functional equivalent of the useELMo parameter). We could think about a better name to avoid confusion.

"embeddings_name": "glove-840B" because glove is the default value of the word embedding to be used. As there is no word embedding involved with fine-tuned BERT architecture, this parameter is ignored - but we could set it to None when a BERT architecture is selected.

@lfoppiano lfoppiano deleted the bert-sequence-labeling branch January 28, 2020 07:27