- evaluation scripts
  - etc/token_eval.py : per-token (partial) f1
  - etc/chunk_eval.py : per-chunk (exact) f1
  - etc/conlleval : per-chunk (exact) f1, the standard CoNLL-2003 script
- The BERT results below are not valid anymore, because BERT is currently used feature-based only.
  - check out the code for BERT fine-tuning: https://github.com/dsindex/etagger/tree/7354971552bbf204a4357369637b687c1704bdcc
  - for feature-based BERT results, see 'BERT(new result, aligned wordpiece + word embeddings)' below.
- QRNN
  - Glove
    - setting : `experiments 14, test 8`
    - per-token(partial) f1 : 0.8892680845877263
    - per-chunk(exact) f1 : 0.8809544851966417 (conlleval)
    - average processing time per bucket
      - 1 GPU(TITAN X (Pascal), 12196MiB)
        - restore version : 0.013028464151645457 sec
      - 32 processor CPU(multi-threading)
        - python : 0.004297458387741437 sec
        - C++ : 0.004124 sec
      - 1 CPU(single-thread)
        - python : 0.004832443533451109 sec
        - C++ : 0.004734 sec
- Transformer
  - Glove
    - setting : `experiments 7, test 9`
    - per-token(partial) f1 : 0.9083215796897038
    - per-chunk(exact) f1 : 0.904078014184397 (chunk_eval)
    - average processing time per bucket
      - 1 GPU(TITAN X (Pascal), 12196MiB)
        - restore version : 0.013825567226844812 sec
        - frozen version : 0.015376264122228799 sec
        - tensorRT(FP16) version : no meaningful difference
      - 32 processor CPU(multi-threading)
        - python : 0.017238136546748987 sec
        - C++ : 0.013 sec
      - 1 CPU(single-thread)
        - python : 0.03358284470571628 sec
        - C++ : 0.021510 sec
- BiLSTM
  - Glove
    - setting : `experiments 9, test 1`
    - per-token(partial) f1 : 0.9152852267186738
    - per-chunk(exact) f1 : 0.9094911075893644 (chunk_eval)
    - average processing time per bucket
      - 1 GPU(TITAN X (Pascal), 12196MiB)
        - restore version : 0.010454932072004718 sec
        - frozen version : 0.011339560587942018 sec
        - tensorRT(FP16) version : no meaningful difference
      - 32 processor CPU(multi-threading)
        - rnn_num_layers 2 : 0.006132203450549827 sec
        - rnn_num_layers 1
          - python
            - 0.0041805055967241884 sec
            - 0.003053264560968687 sec (`experiments 12, test 5`)
          - C++
            - 0.002735 sec
            - 0.002175 sec (`experiments 9, test 2`), f1 0.8800
            - 0.002783 sec (`experiments 9, test 3`), f1 0.8858
            - 0.004407 sec (`experiments 9, test 4`), f1 0.8887
            - 0.003687 sec (`experiments 9, test 5`), f1 0.8835
            - 0.002976 sec (`experiments 9, test 6`), f1 0.8782
            - 0.002855 sec (`experiments 9, test 7`), f1 0.8906
            - 0.002697 sec with optimizations for FMA, AVX and SSE. no meaningful difference.
            - 0.002040 sec (`experiments 12, test 5`), f1 0.9047
      - 1 CPU(single-thread)
        - rnn_num_layers 2 : 0.008001159379070668 sec
        - rnn_num_layers 1
          - python
            - 0.0051817628640952506 sec
            - 0.0042755354628630235 sec (`experiments 12, test 5`)
          - C++
            - 0.003998 sec
            - 0.002853 sec (`experiments 9, test 2`)
            - 0.003474 sec (`experiments 9, test 3`)
            - 0.005118 sec (`experiments 9, test 4`)
            - 0.004139 sec (`experiments 9, test 5`)
            - 0.004133 sec (`experiments 9, test 6`)
            - 0.003334 sec (`experiments 9, test 7`)
            - 0.003078 sec with optimizations for FMA, AVX and SSE. no meaningful difference.
            - 0.002683 sec (`experiments 12, test 5`)
- ELMo
  - setting : `experiments 8, test 2`
  - per-token(partial) f1 : 0.9322728663199756
  - per-chunk(exact) f1 : 0.9253625751680227 (chunk_eval)
    ```
    $ etc/conlleval < pred.txt
    processed 46666 tokens with 5648 phrases; found: 5662 phrases; correct: 5234.
    accuracy:  98.44%; precision:  92.44%; recall:  92.67%; FB1:  92.56
                  LOC: precision:  94.29%; recall:  92.99%; FB1:  93.63  1645
                 MISC: precision:  84.38%; recall:  84.62%; FB1:  84.50  704
                  ORG: precision:  89.43%; recall:  91.69%; FB1:  90.55  1703
                  PER: precision:  97.27%; recall:  96.85%; FB1:  97.06  1610
    ```
  - average processing time per bucket
    - 1 GPU(TITAN X (Pascal), 12196MiB) : 0.06133532517637155 sec -> needs to be recomputed
    - 1 GPU(Tesla V100) : 0.029950057644797457 sec
    - 32 processor CPU(multi-threading) : 0.40098162731570347 sec
    - 1 CPU(single-thread) : 0.7398052649182165 sec
- ELMo + Glove
  - setting : `experiments 10, test 16`
  - per-token(partial) f1 : 0.9322386962382061
  - per-chunk(exact) f1 : 0.928729526339088 (chunk_eval)
    ```
    processed 46666 tokens with 5648 phrases; found: 5657 phrases; correct: 5247.
    accuracy:  98.44%; precision:  92.75%; recall:  92.90%; FB1:  92.83
                  LOC: precision:  93.89%; recall:  94.00%; FB1:  93.95  1670
                 MISC: precision:  85.03%; recall:  83.33%; FB1:  84.17  688
                  ORG: precision:  90.17%; recall:  91.63%; FB1:  90.89  1688
                  PER: precision:  97.58%; recall:  97.22%; FB1:  97.40  1611
    ```
  - average processing time per bucket
    - 1 GPU(TITAN X (Pascal), 12196MiB) : 0.036233977567360014 sec
    - 1 GPU(Tesla V100, 32510MiB) : 0.031166194639816864 sec
- BERT(new result, aligned wordpiece + word embeddings)
  - BERT(large) + Glove + ELMo
    - setting : `experiments 15, test 7`
    - per-token(partial) f1 : 0.9306700873495816
    - per-chunk(exact) f1 : 0.9264420532721821 (chunk_eval), 92.64 (conlleval)
    - average processing time per bucket
      - 1 GPU(Tesla V100) : pass
  - BERT(large) + Glove
    - setting : `experiments 15, test 6`
    - per-token(partial) f1 : 0.9217156200073737
    - per-chunk(exact) f1 : 0.9158398299078666 (chunk_eval), 91.58 (conlleval)
    - average processing time per bucket
      - 1 GPU(Tesla V100) : pass
  - BERT(large)
    - BERT + LSTM + CRF only
    - setting : `experiments 15, test 2`
    - per-token(partial) f1 : 0.9120832058733557
    - per-chunk(exact) f1 : 0.9015151515151516 (chunk_eval), 90.14 (conlleval)
    - average processing time per bucket
      - 1 GPU(Tesla V100) : pass
- BERT(old result, extending word embeddings for wordpieces)
  - BERT(base)
    - setting : `experiments 11, test 1`
    - per-token(partial) f1 : 0.9234725113260683
    - per-chunk(exact) f1 : 0.9131509267431598 (chunk_eval)
    - average processing time per bucket
      - 1 GPU(Tesla V100) : 0.026964144585057526 sec
  - BERT(base) + Glove
    - setting : `experiments 11, test 2`
    - per-token(partial) f1 : 0.921535076998289
    - per-chunk(exact) f1 : 0.9123210182075304 (chunk_eval)
    - average processing time per bucket
      - 1 GPU(Tesla V100) : 0.029030597688838533 sec
  - BERT(large)
    - BERT + CRF only
    - setting : `experiments 11, test 15`
    - per-token(partial) f1 : 0.929012534393152
    - per-chunk(exact) f1 : 0.9215426705498191 (chunk_eval), 92.00 (conlleval)
    - average processing time per bucket
      - 1 GPU(Tesla V100) : pass
  - BERT(large)
    - BERT + LSTM + CRF only
    - setting : `experiments 11, test 19`
    - per-token(partial) f1 : 0.9310957309977338
    - per-chunk(exact) f1 : 0.9240976645435245 (chunk_eval), 92.23 (conlleval)
    - average processing time per bucket
      - 1 GPU(Tesla V100) : pass
  - BERT(large) + Glove
    - setting : `experiments 11, test 3`
    - per-token(partial) f1 : 0.9278869778869779
    - per-chunk(exact) f1 : 0.918813634351483 (chunk_eval)
    - average processing time per bucket
      - 1 GPU(Tesla V100) : 0.040225753178425645 sec
  - BERT(large) + Glove + Transformer
    - setting : `experiments 11, test 7`
    - per-token(partial) f1 : 0.9244949032533724
    - per-chunk(exact) f1 : 0.9170714474962465 (chunk_eval)
    - average processing time per bucket
      - 1 GPU(Tesla V100) : 0.05737522856032033 sec
- BiLSTM + Transformer
  - Glove
    - setting : `experiments 7, test 10`
    - per-token(partial) f1 : 0.910979409787988
    - per-chunk(exact) f1 : 0.9047451049567825 (chunk_eval)
- BiLSTM + multi-head attention
  - Glove
    - setting : `experiments 6, test 7`
    - per-token(partial) f1 : 0.9157317073170732
    - per-chunk(exact) f1 : 0.9102156238953694 (chunk_eval)
- implementations
  - Named-Entity-Recognition-with-Bidirectional-LSTM-CNNs
    - tested with Glove6B.100
      - Prec: 0.887, Rec: 0.902, F1: 0.894
  - sequence_tagging
    - tested with Glove6B.100
      - F1: 0.8998
  - tf_ner
    - tested with Glove840B.300
      - F1 : 0.905 ~ 0.907 (chars_conv_lstm_crf)
    - reported F1 : 0.9118
  - torchnlp
    - tested with Glove6B.200
      - F1 : 0.8845 (just 1 block of Transformer encoder)
- SOTA
  - SOTA on named-entity-recognition-ner-on-conll-2003
  - Cloze-driven Pretraining of Self-attention Networks
    - reported F1 : 0.935
  - GCDT: A Global Context Enhanced Deep Transition Architecture for Sequence Labeling
    - reported F1 : 0.9347
  - Contextual String Embeddings for Sequence Labeling
    - reported F1 : 0.9309
  - BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
    - reported F1 : 0.928
  - Semi-Supervised Sequence Modeling with Cross-View Training
    - reported F1 : 0.926
  - Deep contextualized word representations
    - reported F1 : 0.9222
  - Semi-supervised sequence tagging with bidirectional language models
    - reported F1 : 0.9193
- why?
  - i guess the softmax (applied in the multi-head attention functions) was corrupted by paddings.
    -> so, i replaced the multi-head attention code with `https://github.com/Kyubyong/transformer/blob/master/modules.py`, which applies key and query masking for paddings.
    -> however, similar corruption happened.
    -> it was caused by tf.contrib.layers.layer_norm(), which normalizes over the [begin_norm_axis ~ R-1] dimensions.
    -> what about removing layer_norm()? performance goes down!
    -> try the other layer normalization code from `https://github.com/Kyubyong/transformer/blob/master/modules.py`, which normalizes over the last dimension only. this code perfectly matches my intention (a sketch follows below).
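  a minimal sketch of layer normalization over the last dimension only, in the spirit of the masking-safe code mentioned above (the epsilon value and variable names here are assumptions):
  ```python
  import tensorflow as tf

  def layer_norm_last_dim(inputs, epsilon=1e-8, scope="ln"):
      # normalize over the last dimension only, so statistics are computed
      # per position and padded time steps cannot corrupt other positions
      with tf.variable_scope(scope, reuse=tf.AUTO_REUSE):
          params_shape = inputs.get_shape()[-1:]
          mean, variance = tf.nn.moments(inputs, [-1], keep_dims=True)
          beta = tf.get_variable("beta", params_shape, initializer=tf.zeros_initializer())
          gamma = tf.get_variable("gamma", params_shape, initializer=tf.ones_initializer())
          normalized = (inputs - mean) / ((variance + epsilon) ** 0.5)
          return gamma * normalized + beta
  ```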
- filter out words that do not appear in the train/dev/test data from the Glove840B word embeddings. but this is not applicable when serving.
- use LSTMBlockFusedCell for the bidirectional LSTM. it is faster than LSTMCell (see the sketch below).
  - about 3.13 times faster during training.
    - 297.6699993610382 sec -> 94.96637988090515 sec for 1 epoch
  - about 1.26 times faster during inference.
    - 0.010652577061606541 sec -> 0.008411417501886556 sec for 1 sentence
  - where is LSTMBlockFusedCell() defined?
    - https://github.com/tensorflow/tensorflow/blob/r1.11/tensorflow/contrib/rnn/python/ops/lstm_ops.py
    - vi ../lib/python3.6/site-packages/tensorflow/contrib/rnn/ops/gen_lstm_ops.py
    - https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/rnn/ops/lstm_ops.cc
    - https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/rnn/kernels/lstm_ops.cc
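  a minimal sketch (function name, shapes, and the time-major transpose are assumptions) of building a bidirectional LSTM from LSTMBlockFusedCell, with the backward direction provided by tf.contrib.rnn.TimeReversedFusedRNN:
  ```python
  import tensorflow as tf
  from tensorflow.contrib.rnn import LSTMBlockFusedCell, TimeReversedFusedRNN

  def bi_lstm_fused(inputs, sequence_lengths, num_units):
      # fused cells run the whole sequence in a single op and expect
      # time-major input: (time, batch, input_dim)
      inputs_tm = tf.transpose(inputs, [1, 0, 2])
      fw = LSTMBlockFusedCell(num_units)
      bw = TimeReversedFusedRNN(LSTMBlockFusedCell(num_units))
      out_fw, _ = fw(inputs_tm, dtype=tf.float32, sequence_length=sequence_lengths)
      out_bw, _ = bw(inputs_tm, dtype=tf.float32, sequence_length=sequence_lengths)
      # back to batch-major: (batch, time, 2 * num_units)
      return tf.transpose(tf.concat([out_fw, out_bw], axis=-1), [1, 0, 2])
  ```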
- use early stopping
- start with a small learning rate.
- be careful when using a residual connection after multi-head attention or the feed-forward net: add the normalized input, not the raw input.
  - `x = tf.nn.dropout(x + y)` -> `x = tf.nn.dropout(x_norm + y)`
- the per-token f1 on train/dev is relatively lower than that of the BiLSTM, but after applying the CRF layer the per-token f1 increases very sharply.
  - does this mean the Transformer is weak at collecting the context needed to decide the label at the current position? if so, how can we overcome it?
- try revising the position-wise feed-forward net (see the sketch below)
  - pad one step before and after the time axis
    - (batch_size, sentence_length, model_dim) -> (batch_size, 1+sentence_length+1, model_dim)
  - change the conv1d kernel size from 1 -> 3, applied after the padding
    - this is the key to sequence tagging problems.
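  a minimal sketch of the revised position-wise feed-forward net (function and dimension names are assumptions):
  ```python
  import tensorflow as tf

  def position_wise_ffn(x, model_dim, hidden_dim):
      # x: (batch_size, sentence_length, model_dim)
      # pad one step before and after the time axis:
      # -> (batch_size, 1 + sentence_length + 1, model_dim)
      padded = tf.pad(x, [[0, 0], [1, 1], [0, 0]])
      # kernel_size 3 with 'valid' padding consumes the two extra steps, so the
      # output is (batch_size, sentence_length, hidden_dim) and each position
      # sees its left/right neighbours
      h = tf.layers.conv1d(padded, hidden_dim, kernel_size=3,
                           padding='valid', activation=tf.nn.relu)
      return tf.layers.conv1d(h, model_dim, kernel_size=1, padding='valid')
  ```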
- save the best model by token-based f1. selecting on token-based f1 works slightly better than on chunk-based f1.
- be careful about word lowercasing when using Glove6B embeddings. they are all lowercased.
- feed the max sentence length to the session. this yields a huge improvement in inference speed (see the sketch below).
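  a hedged sketch of the idea (the placeholder names and the padded width of 180 are assumptions): feed the batch's true max sentence length and slice away padded time steps before the expensive layers.
  ```python
  import tensorflow as tf

  input_data = tf.placeholder(tf.int32, [None, 180], name='input_data')
  max_sentence_length = tf.placeholder(tf.int32, [], name='max_sentence_length')
  # slice away padded time steps; downstream ops only process real tokens
  trimmed = input_data[:, :max_sentence_length]

  with tf.Session() as sess:
      batch = [[7, 8, 9] + [0] * 177]  # one sentence of length 3, zero-padded
      print(sess.run(trimmed, feed_dict={input_data: batch,
                                         max_sentence_length: 3}))
  ```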
- when using import_meta_graph(), you should run global_variables_initializer() before restore() (see the sketch below).
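  a minimal sketch of that order of operations (the checkpoint path is a hypothetical example):
  ```python
  import tensorflow as tf

  with tf.Session() as sess:
      # rebuild the graph from the meta file
      saver = tf.train.import_meta_graph('model.ckpt.meta')
      # initialize first so variables not covered by the checkpoint
      # still have values; restore() then overwrites the trained ones
      sess.run(tf.global_variables_initializer())
      saver.restore(sess, 'model.ckpt')
  ```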
- articles
- tensorflow impl
- keras impl
- pytorch impl
- articles
- tensorflow impl
- articles
- tensorflow impl
- pytorch impl
- articles
- tensorflow impl
- pytorch impl
- articles
- tensorflow impl
- pytorch impl
- tensorflow save and restore from python/C/C++
- save, restore tensorflow models quick complete tutorial
- tensorflow-cmake
- Training a Tensorflow graph in C++ API
- label_image in C++
- how to invoke tf.initialize_all_variables in c tensorflow
- TensorFlow: How to freeze a model and serve it with a python API
- how to read a frozen graph from C++
- reducing model loading time and/or memory footprint
  - convert_graphdef_memmapped_format
- inference speed up
  - GPU
    - tensorRT
      - install tensorRT
        - Speed up TensorFlow Inference on GPUs with TensorRT
      - how to use tensorRT
        - Speed up Inference by TensorRT
      - experiments
        - no meaningful difference. is it not effective for batch size 1? (a conversion sketch follows below)
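      for reference, a hedged sketch of the tf.contrib.tensorrt conversion path in TF 1.x (the frozen-graph path and output node name are assumptions):
      ```python
      import tensorflow as tf
      import tensorflow.contrib.tensorrt as trt

      # load a frozen graph (hypothetical path)
      graph_def = tf.GraphDef()
      with tf.gfile.GFile('frozen_graph.pb', 'rb') as f:
          graph_def.ParseFromString(f.read())

      # rewrite supported subgraphs into TensorRT engines
      trt_graph = trt.create_inference_graph(
          input_graph_def=graph_def,
          outputs=['logits'],   # hypothetical output node name
          max_batch_size=1,     # batch size 1, as in the experiments above
          precision_mode='FP16')
      ```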
  - CPU
    - quantizing graph
      - tf.contrib.quantize
        - Quantizing neural network to 8-bit using Tensorflow (pdf)
        - Quantizing deep convolutional networks for efficient inference: A whitepaper
        - experiments
          - tf.import_graph_def() error after training with tf.contrib.quantize.create_training_graph(), freezing, and exporting (the intended flow is sketched below).
            - hmm... something messy.
      - optimize_for_inference, quantize_graph, transform_graph
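      as far as i understand, the intended tf.contrib.quantize flow (which failed here) inserts fake-quantization ops into the training graph before freezing; a minimal sketch with an assumed toy model:
      ```python
      import tensorflow as tf

      g = tf.Graph()
      with g.as_default():
          # toy model standing in for the real tagger (assumption)
          x = tf.placeholder(tf.float32, [None, 10])
          w = tf.get_variable('w', [10, 5])
          logits = tf.nn.relu(tf.matmul(x, w))
          # rewrite the graph with fake-quantization ops for training; after
          # training, create_eval_graph() is applied before freezing/exporting
          tf.contrib.quantize.create_training_graph(input_graph=g, quant_delay=0)
      ```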
    - tensorflow MKL
      - optimizing tensorflow for cpu
      - conda tensorflow distribution
      - experiments
        - no meaningful improvement.
- tensorflow summary
- tfrecord, tf.data api
- tensorflow runtime include path, library path; check whether built_with_cuda is enabled
  ```
  $ python -c "import tensorflow as tf; print(tf.sysconfig.get_lib())"
  $ python -c "import tensorflow as tf; print(tf.sysconfig.get_include())"
  $ python -c "import tensorflow as tf; print(int(tf.test.is_built_with_cuda()))"
  ```
- tensorflow backend
  - implementations of the BLAS specification
    - OpenBLAS, Intel MKL, Eigen (more functionality; a high-level C++ library)
  - Nvidia GPU
    - CUDA language specification and libraries
      - cuDNN (more functionality; a high-level library)
  - tensorflow
    - GPU
      - mainly uses cuDNN
      - some cuBLAS, GOOGLE_CUDA (customized by google)
    - CPU
      - uses Eigen by default
      - supports MKL, MKL-DNN
        - or Eigen with the MKL-DNN backend