- evaluation scripts
  - etc/token_eval.py : per-token (partial) f1
  - etc/chunk_eval.py : per-chunk (exact) f1
  - etc/conlleval : per-chunk (exact) f1, the standard CoNLL-2003 script
- The BERT results below are not valid anymore, because BERT is currently used feature-based only.
  - check out the code for BERT fine-tuning: https://github.com/dsindex/etagger/tree/7354971552bbf204a4357369637b687c1704bdcc
  - for feature-based BERT results, see 'BERT(new result, aligned wordpiece + word embeddings)' below.
- QRNN
  - Glove
    - setting : `experiments 14, test 8`
    - per-token(partial) f1 : 0.8892680845877263
    - per-chunk(exact) f1 : 0.8809544851966417 (conlleval)
    - average processing time per bucket
      - 1 GPU(TITAN X (Pascal), 12196MiB)
        - restore version : 0.013028464151645457 sec
      - 32 processor CPU(multi-threading)
        - python : 0.004297458387741437 sec
        - C++ : 0.004124 sec
      - 1 CPU(single-thread)
        - python : 0.004832443533451109 sec
        - C++ : 0.004734 sec
- Transformer
  - Glove
    - setting : `experiments 7, test 9`
    - per-token(partial) f1 : 0.9083215796897038
    - per-chunk(exact) f1 : 0.904078014184397 (chunk_eval)
    - average processing time per bucket
      - 1 GPU(TITAN X (Pascal), 12196MiB)
        - restore version : 0.013825567226844812 sec
        - frozen version : 0.015376264122228799 sec
        - tensorRT(FP16) version : no meaningful difference
      - 32 processor CPU(multi-threading)
        - python : 0.017238136546748987 sec
        - C++ : 0.013 sec
      - 1 CPU(single-thread)
        - python : 0.03358284470571628 sec
        - C++ : 0.021510 sec
- BiLSTM
  - Glove
    - setting : `experiments 9, test 1`
    - per-token(partial) f1 : 0.9152852267186738
    - per-chunk(exact) f1 : 0.9094911075893644 (chunk_eval)
    - average processing time per bucket
      - 1 GPU(TITAN X (Pascal), 12196MiB)
        - restore version : 0.010454932072004718 sec
        - frozen version : 0.011339560587942018 sec
        - tensorRT(FP16) version : no meaningful difference
      - 32 processor CPU(multi-threading)
        - rnn_num_layers 2 : 0.006132203450549827 sec
        - rnn_num_layers 1
          - python
            - 0.0041805055967241884 sec
            - 0.003053264560968687 sec (`experiments 12, test 5`)
          - C++
            - 0.002735 sec
            - 0.002175 sec (`experiments 9, test 2`), f1 0.8800
            - 0.002783 sec (`experiments 9, test 3`), f1 0.8858
            - 0.004407 sec (`experiments 9, test 4`), f1 0.8887
            - 0.003687 sec (`experiments 9, test 5`), f1 0.8835
            - 0.002976 sec (`experiments 9, test 6`), f1 0.8782
            - 0.002855 sec (`experiments 9, test 7`), f1 0.8906
            - 0.002697 sec with optimizations for FMA, AVX and SSE. no meaningful difference.
            - 0.002040 sec (`experiments 12, test 5`), f1 0.9047
      - 1 CPU(single-thread)
        - rnn_num_layers 2 : 0.008001159379070668 sec
        - rnn_num_layers 1
          - python
            - 0.0051817628640952506 sec
            - 0.0042755354628630235 sec (`experiments 12, test 5`)
          - C++
            - 0.003998 sec
            - 0.002853 sec (`experiments 9, test 2`)
            - 0.003474 sec (`experiments 9, test 3`)
            - 0.005118 sec (`experiments 9, test 4`)
            - 0.004139 sec (`experiments 9, test 5`)
            - 0.004133 sec (`experiments 9, test 6`)
            - 0.003334 sec (`experiments 9, test 7`)
            - 0.003078 sec with optimizations for FMA, AVX and SSE. no meaningful difference.
            - 0.002683 sec (`experiments 12, test 5`)
- ELMo
  - setting : `experiments 8, test 2`
  - per-token(partial) f1 : 0.9322728663199756
  - per-chunk(exact) f1 : 0.9253625751680227 (chunk_eval)
    ```
    $ etc/conlleval < pred.txt
    processed 46666 tokens with 5648 phrases; found: 5662 phrases; correct: 5234.
    accuracy:  98.44%; precision:  92.44%; recall:  92.67%; FB1:  92.56
                  LOC: precision:  94.29%; recall:  92.99%; FB1:  93.63  1645
                 MISC: precision:  84.38%; recall:  84.62%; FB1:  84.50  704
                  ORG: precision:  89.43%; recall:  91.69%; FB1:  90.55  1703
                  PER: precision:  97.27%; recall:  96.85%; FB1:  97.06  1610
    ```
  - average processing time per bucket
    - 1 GPU(TITAN X (Pascal), 12196MiB) : 0.06133532517637155 sec -> needs to be recomputed
    - 1 GPU(Tesla V100) : 0.029950057644797457 sec
    - 32 processor CPU(multi-threading) : 0.40098162731570347 sec
    - 1 CPU(single-thread) : 0.7398052649182165 sec
- ELMo + Glove
  - setting : `experiments 10, test 16`
  - per-token(partial) f1 : 0.9322386962382061
  - per-chunk(exact) f1 : 0.928729526339088 (chunk_eval)
    ```
    processed 46666 tokens with 5648 phrases; found: 5657 phrases; correct: 5247.
    accuracy:  98.44%; precision:  92.75%; recall:  92.90%; FB1:  92.83
                  LOC: precision:  93.89%; recall:  94.00%; FB1:  93.95  1670
                 MISC: precision:  85.03%; recall:  83.33%; FB1:  84.17  688
                  ORG: precision:  90.17%; recall:  91.63%; FB1:  90.89  1688
                  PER: precision:  97.58%; recall:  97.22%; FB1:  97.40  1611
    ```
  - average processing time per bucket
    - 1 GPU(TITAN X (Pascal), 12196MiB) : 0.036233977567360014 sec
    - 1 GPU(Tesla V100, 32510MiB) : 0.031166194639816864 sec
- BERT(new result, aligned wordpiece + word embeddings)
  - BERT(large) + Glove + ELMo
    - setting : `experiments 15, test 7`
    - per-token(partial) f1 : 0.9306700873495816
    - per-chunk(exact) f1 : 0.9264420532721821 (chunk_eval), 92.64 (conlleval)
    - average processing time per bucket
      - 1 GPU(Tesla V100) : pass
  - BERT(large) + Glove
    - setting : `experiments 15, test 6`
    - per-token(partial) f1 : 0.9217156200073737
    - per-chunk(exact) f1 : 0.9158398299078666 (chunk_eval), 91.58 (conlleval)
    - average processing time per bucket
      - 1 GPU(Tesla V100) : pass
  - BERT(large)
    - BERT + LSTM + CRF only
    - setting : `experiments 15, test 2`
    - per-token(partial) f1 : 0.9120832058733557
    - per-chunk(exact) f1 : 0.9015151515151516 (chunk_eval), 90.14 (conlleval)
    - average processing time per bucket
      - 1 GPU(Tesla V100) : pass
- BERT(old result, extending word embeddings for wordpieces)
  - BERT(base)
    - setting : `experiments 11, test 1`
    - per-token(partial) f1 : 0.9234725113260683
    - per-chunk(exact) f1 : 0.9131509267431598 (chunk_eval)
    - average processing time per bucket
      - 1 GPU(Tesla V100) : 0.026964144585057526 sec
  - BERT(base) + Glove
    - setting : `experiments 11, test 2`
    - per-token(partial) f1 : 0.921535076998289
    - per-chunk(exact) f1 : 0.9123210182075304 (chunk_eval)
    - average processing time per bucket
      - 1 GPU(Tesla V100) : 0.029030597688838533 sec
  - BERT(large)
    - BERT + CRF only
    - setting : `experiments 11, test 15`
    - per-token(partial) f1 : 0.929012534393152
    - per-chunk(exact) f1 : 0.9215426705498191 (chunk_eval), 92.00 (conlleval)
    - average processing time per bucket
      - 1 GPU(Tesla V100) : pass
  - BERT(large)
    - BERT + LSTM + CRF only
    - setting : `experiments 11, test 19`
    - per-token(partial) f1 : 0.9310957309977338
    - per-chunk(exact) f1 : 0.9240976645435245 (chunk_eval), 92.23 (conlleval)
    - average processing time per bucket
      - 1 GPU(Tesla V100) : pass
  - BERT(large) + Glove
    - setting : `experiments 11, test 3`
    - per-token(partial) f1 : 0.9278869778869779
    - per-chunk(exact) f1 : 0.918813634351483 (chunk_eval)
    - average processing time per bucket
      - 1 GPU(Tesla V100) : 0.040225753178425645 sec
  - BERT(large) + Glove + Transformer
    - setting : `experiments 11, test 7`
    - per-token(partial) f1 : 0.9244949032533724
    - per-chunk(exact) f1 : 0.9170714474962465 (chunk_eval)
    - average processing time per bucket
      - 1 GPU(Tesla V100) : 0.05737522856032033 sec
- BiLSTM + Transformer
  - Glove
    - setting : `experiments 7, test 10`
    - per-token(partial) f1 : 0.910979409787988
    - per-chunk(exact) f1 : 0.9047451049567825 (chunk_eval)
- BiLSTM + multi-head attention
  - Glove
    - setting : `experiments 6, test 7`
    - per-token(partial) f1 : 0.9157317073170732
    - per-chunk(exact) f1 : 0.9102156238953694 (chunk_eval)
- implementations
  - Named-Entity-Recognition-with-Bidirectional-LSTM-CNNs
    - tested with Glove6B.100
      - Prec: 0.887, Rec: 0.902, F1: 0.894
  - sequence_tagging
    - tested with Glove6B.100
      - F1: 0.8998
  - tf_ner
    - tested with Glove840B.300
      - F1 : 0.905 ~ 0.907 (chars_conv_lstm_crf)
    - reported F1 : 0.9118
  - torchnlp
    - tested with Glove6B.200
      - F1 : 0.8845 (just 1 block of Transformer encoder)
- SOTA
  - SOTA on named-entity-recognition-ner-on-conll-2003
  - Cloze-driven Pretraining of Self-attention Networks
    - reported F1 : 0.935
  - GCDT: A Global Context Enhanced Deep Transition Architecture for Sequence Labeling
    - reported F1 : 0.9347
  - Contextual String Embeddings for Sequence Labeling
    - reported F1 : 0.9309
  - BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
    - reported F1 : 0.928
  - Semi-Supervised Sequence Modeling with Cross-View Training
    - reported F1 : 0.926
  - Deep contextualized word representations
    - reported F1 : 0.9222
  - Semi-supervised sequence tagging with bidirectional language models
    - reported F1 : 0.9193
- why?
  - i guess the softmax (applied in the multi-head attention functions) was corrupted by paddings.
    -> so, i replaced the multi-head attention code with `https://github.com/Kyubyong/transformer/blob/master/modules.py`, which applies key and query masking for paddings.
    -> however, similar corruption happened.
    -> it was caused by tf.contrib.layers.layer_norm(), which normalizes over the [begin_norm_axis ~ R-1] dimensions.
    -> what about removing layer_norm()? performance goes down!
    -> try the other layer normalization code from `https://github.com/Kyubyong/transformer/blob/master/modules.py`, which normalizes over the last dimension only. this code perfectly matches my intention (a sketch follows below).
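  a minimal sketch of layer normalization over the last dimension only, in the spirit of the masking-safe code mentioned above (the epsilon value and variable names here are assumptions):
  ```python
  import tensorflow as tf

  def layer_norm_last_dim(inputs, epsilon=1e-8, scope="ln"):
      # normalize over the last dimension only, so statistics are computed
      # per position and padded time steps cannot corrupt other positions
      with tf.variable_scope(scope, reuse=tf.AUTO_REUSE):
          params_shape = inputs.get_shape()[-1:]
          mean, variance = tf.nn.moments(inputs, [-1], keep_dims=True)
          beta = tf.get_variable("beta", params_shape, initializer=tf.zeros_initializer())
          gamma = tf.get_variable("gamma", params_shape, initializer=tf.ones_initializer())
          normalized = (inputs - mean) / ((variance + epsilon) ** 0.5)
          return gamma * normalized + beta
  ```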
- filter out words that do not appear in the train/dev/test data from the Glove840B word embeddings. but this is not applicable when serving.
- use LSTMBlockFusedCell for the bidirectional LSTM. it is faster than LSTMCell (see the sketch below).
  - about 3.13 times faster during training.
    - 297.6699993610382 sec -> 94.96637988090515 sec for 1 epoch
  - about 1.26 times faster during inference.
    - 0.010652577061606541 sec -> 0.008411417501886556 sec for 1 sentence
  - where is LSTMBlockFusedCell() defined?
    - https://github.com/tensorflow/tensorflow/blob/r1.11/tensorflow/contrib/rnn/python/ops/lstm_ops.py
    - vi ../lib/python3.6/site-packages/tensorflow/contrib/rnn/ops/gen_lstm_ops.py
    - https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/rnn/ops/lstm_ops.cc
    - https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/rnn/kernels/lstm_ops.cc
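  a minimal sketch (function name, shapes, and the time-major transpose are assumptions) of building a bidirectional LSTM from LSTMBlockFusedCell, with the backward direction provided by tf.contrib.rnn.TimeReversedFusedRNN:
  ```python
  import tensorflow as tf
  from tensorflow.contrib.rnn import LSTMBlockFusedCell, TimeReversedFusedRNN

  def bi_lstm_fused(inputs, sequence_lengths, num_units):
      # fused cells run the whole sequence in a single op and expect
      # time-major input: (time, batch, input_dim)
      inputs_tm = tf.transpose(inputs, [1, 0, 2])
      fw = LSTMBlockFusedCell(num_units)
      bw = TimeReversedFusedRNN(LSTMBlockFusedCell(num_units))
      out_fw, _ = fw(inputs_tm, dtype=tf.float32, sequence_length=sequence_lengths)
      out_bw, _ = bw(inputs_tm, dtype=tf.float32, sequence_length=sequence_lengths)
      # back to batch-major: (batch, time, 2 * num_units)
      return tf.transpose(tf.concat([out_fw, out_bw], axis=-1), [1, 0, 2])
  ```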
- use early stopping
- start with a small learning rate.
- be careful when using a residual connection after multi-head attention or the feed-forward net: add the normalized input, not the raw input.
  - `x = tf.nn.dropout(x + y)` -> `x = tf.nn.dropout(x_norm + y)`
- the per-token f1 on train/dev is relatively lower than that of the BiLSTM, but after applying the CRF layer the per-token f1 increases very sharply.
  - does this mean the Transformer is weak at collecting the context needed to decide the label at the current position? if so, how can we overcome it?
- try revising the position-wise feed-forward net (see the sketch below)
  - pad one step before and after the time axis
    - (batch_size, sentence_length, model_dim) -> (batch_size, 1+sentence_length+1, model_dim)
  - change the conv1d kernel size from 1 -> 3, applied after the padding
    - this is the key to sequence tagging problems.
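  a minimal sketch of the revised position-wise feed-forward net (function and dimension names are assumptions):
  ```python
  import tensorflow as tf

  def position_wise_ffn(x, model_dim, hidden_dim):
      # x: (batch_size, sentence_length, model_dim)
      # pad one step before and after the time axis:
      # -> (batch_size, 1 + sentence_length + 1, model_dim)
      padded = tf.pad(x, [[0, 0], [1, 1], [0, 0]])
      # kernel_size 3 with 'valid' padding consumes the two extra steps, so the
      # output is (batch_size, sentence_length, hidden_dim) and each position
      # sees its left/right neighbours
      h = tf.layers.conv1d(padded, hidden_dim, kernel_size=3,
                           padding='valid', activation=tf.nn.relu)
      return tf.layers.conv1d(h, model_dim, kernel_size=1, padding='valid')
  ```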
- save the best model by token-based f1. selecting on token-based f1 works slightly better than on chunk-based f1.
- be careful about word lowercasing when using Glove6B embeddings. they are all lowercased.
- feed the max sentence length to the session. this yields a huge improvement in inference speed (see the sketch below).
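  a hedged sketch of the idea (the placeholder names and the padded width of 180 are assumptions): feed the batch's true max sentence length and slice away padded time steps before the expensive layers.
  ```python
  import tensorflow as tf

  input_data = tf.placeholder(tf.int32, [None, 180], name='input_data')
  max_sentence_length = tf.placeholder(tf.int32, [], name='max_sentence_length')
  # slice away padded time steps; downstream ops only process real tokens
  trimmed = input_data[:, :max_sentence_length]

  with tf.Session() as sess:
      batch = [[7, 8, 9] + [0] * 177]  # one sentence of length 3, zero-padded
      print(sess.run(trimmed, feed_dict={input_data: batch,
                                         max_sentence_length: 3}))
  ```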
- when using import_meta_graph(), you should run global_variables_initializer() before restore() (see the sketch below).
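  a minimal sketch of that order of operations (the checkpoint path is a hypothetical example):
  ```python
  import tensorflow as tf

  with tf.Session() as sess:
      # rebuild the graph from the meta file
      saver = tf.train.import_meta_graph('model.ckpt.meta')
      # initialize first so variables not covered by the checkpoint
      # still have values; restore() then overwrites the trained ones
      sess.run(tf.global_variables_initializer())
      saver.restore(sess, 'model.ckpt')
  ```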
- articles
- tensorflow impl
- keras impl
- pytorch impl
- articles
- tensorflow impl
- articles
- tensorflow impl
- pytorch impl
- articles
- tensorflow impl
- pytorch impl
- articles
- tensorflow impl
- pytorch impl
- tensorflow save and restore from python/C/C++
- save, restore tensorflow models quick complete tutorial
- tensorflow-cmake
- Training a Tensorflow graph in C++ API
- label_image in C++
- how to invoke tf.initialize_all_variables in c tensorflow
- TensorFlow: How to freeze a model and serve it with a python API
- how to read a frozen graph from C++
- reducing model loading time and/or memory footprint
  - convert_graphdef_memmapped_format
- inference speed up
  - GPU
    - tensorRT
      - install tensorRT
        - Speed up TensorFlow Inference on GPUs with TensorRT
      - how to use tensorRT
        - Speed up Inference by TensorRT
      - experiments
        - no meaningful difference. is it not effective for batch size 1? (a conversion sketch follows below)
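      for reference, a hedged sketch of the tf.contrib.tensorrt conversion path in TF 1.x (the frozen-graph path and output node name are assumptions):
      ```python
      import tensorflow as tf
      import tensorflow.contrib.tensorrt as trt

      # load a frozen graph (hypothetical path)
      graph_def = tf.GraphDef()
      with tf.gfile.GFile('frozen_graph.pb', 'rb') as f:
          graph_def.ParseFromString(f.read())

      # rewrite supported subgraphs into TensorRT engines
      trt_graph = trt.create_inference_graph(
          input_graph_def=graph_def,
          outputs=['logits'],   # hypothetical output node name
          max_batch_size=1,     # batch size 1, as in the experiments above
          precision_mode='FP16')
      ```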
  - CPU
    - quantizing graph
      - tf.contrib.quantize
        - Quantizing neural network to 8-bit using Tensorflow (pdf)
        - Quantizing deep convolutional networks for efficient inference: A whitepaper
        - experiments
          - tf.import_graph_def() error after training with tf.contrib.quantize.create_training_graph(), freezing, and exporting (the intended flow is sketched below).
            - hmm... something messy.
      - optimize_for_inference, quantize_graph, transform_graph
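      as far as i understand, the intended tf.contrib.quantize flow (which failed here) inserts fake-quantization ops into the training graph before freezing; a minimal sketch with an assumed toy model:
      ```python
      import tensorflow as tf

      g = tf.Graph()
      with g.as_default():
          # toy model standing in for the real tagger (assumption)
          x = tf.placeholder(tf.float32, [None, 10])
          w = tf.get_variable('w', [10, 5])
          logits = tf.nn.relu(tf.matmul(x, w))
          # rewrite the graph with fake-quantization ops for training; after
          # training, create_eval_graph() is applied before freezing/exporting
          tf.contrib.quantize.create_training_graph(input_graph=g, quant_delay=0)
      ```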
    - tensorflow MKL
      - optimizing tensorflow for cpu
      - conda tensorflow distribution
      - experiments
        - no meaningful improvement.
- tensorflow summary
- tfrecord, tf.data api
- tensorflow runtime include path, library path; check whether built_with_cuda is enabled
  ```
  $ python -c "import tensorflow as tf; print(tf.sysconfig.get_lib())"
  $ python -c "import tensorflow as tf; print(tf.sysconfig.get_include())"
  $ python -c "import tensorflow as tf; print(int(tf.test.is_built_with_cuda()))"
  ```
- tensorflow backend
  - implementations of the BLAS specification
    - OpenBLAS, Intel MKL, Eigen (more functionality; a high-level C++ library)
  - Nvidia GPU
    - CUDA language specification and libraries
      - cuDNN (more functionality; a high-level library)
  - tensorflow
    - GPU
      - mainly uses cuDNN
      - some cuBLAS, GOOGLE_CUDA (customized by google)
    - CPU
      - uses Eigen by default
      - supports MKL, MKL-DNN
        - or Eigen with the MKL-DNN backend