couldn't replicate scores on en_conll2003 #16

Open
zhaoxf4 opened this issue Oct 24, 2020 · 11 comments


zhaoxf4 commented Oct 24, 2020

I'm sorry to bother you, but I couldn't replicate the reported scores on the en_conll2003 dataset.
I only reached 92.12, which is 1.4 points lower than yours.
I checked my dataset and made sure the labels match the CoNLL-2003 annotation, so I believe my en_conll2003 copy is complete. The dataset was downloaded from here.
I then wrote a script named "data_process.py" (Python 3) to convert the CoNLL-X format into your dict format, using '-DOCSTART- -X- -X- O' to split documents. I think this script is fine, because extract_feature.sh and evaluate.py run without errors and I also checked the processed dataset, so the difference between your dataset and mine should be very small. A rough sketch of the conversion is shown below.
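Roughly, the conversion does something like this (a simplified sketch, not the exact data_process.py from the share links; the output field names "doc_key", "sentences" and "ners" are what I think the repo expects and may need adjusting):

```python
import json

def bio_to_spans(tags):
    """Turn a BIO/IOB tag sequence into [start, end, type] spans (inclusive end offsets)."""
    spans, start, ent_type = [], None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel "O" closes a trailing entity
        if start is not None and (tag == "O" or tag.startswith("B-") or tag[2:] != ent_type):
            spans.append([start, i - 1, ent_type])
            start, ent_type = None, None
        if start is None and tag != "O":
            start, ent_type = i, tag[2:]
    return spans

def sentence_to_record(doc_key, tokens, tags):
    """One tokenized sentence with its tags -> one jsonlines record (hypothetical field names)."""
    return json.dumps({"doc_key": doc_key,
                       "sentences": [tokens],
                       "ners": [bio_to_spans(tags)]})

# Example:
# sentence_to_record("conll03_train_0", ["EU", "rejects", "German", "call"],
#                    ["I-ORG", "O", "I-MISC", "O"])
# -> one jsonlines line whose spans are [[0, 0, "ORG"], [2, 2, "MISC"]]
```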

environment:

cuda 9.0 + cudnn 7.6.5 + python 2.7 + tensorflow-gpu 1.12
the Pre-trained models from your link "acl2020 best models"
bert_large_case
glove 6B.300d for embedding

The only difference is that I replaced fastText with GloVe 6B, because I couldn't find fasttext/cc.en.300.vec.filtered; I only found a vector file named "cc.en.300.vec" on the fastText site. I hit an error when I tried "cc.en.300.vec"; this is the first time I have used fastText and I don't know what happened. The error log is as follows:

(biaffine-ner) zhaoxiaofeng@omnisky:/data1/zhaoxiaofeng/biaffine-ner$ CUDA_VISIBLE_DEVICES=4 python evaluate.py eng_conll03
Running experiment: eng_conll03
ffnn_size = 150
ffnn_depth = 2
filter_widths = [
  3
  4
  5
]
filter_size = 50
char_embedding_size = 8
char_vocab_path = "logs/char_vocab.eng.conll03.txt"
context_embeddings {
  path = "../pretrained/cc.en.300.vec"
  size = 300
}
contextualization_size = 200
contextualization_layers = 3
lm_size = 1024
lm_layers = 4
lm_path = "bert-model/bert_conll2003_features.hdf5"
max_gradient_norm = 5.0
lstm_dropout_rate = 0.4
lexical_dropout_rate = 0.5
dropout_rate = 0.2
optimizer = "adam"
learning_rate = 0.001
decay_rate = 0.999
decay_frequency = 100
train_path = "corpus/en_conll2003/train.json"
eval_path = "corpus/en_conll2003/dev.json"
test_path = "corpus/en_conll2003/test.json"
ner_types = [
  "ORG"
  "MISC"
  "PER"
  "LOC"
]
eval_frequency = 500
report_frequency = 100
log_root = "logs"
max_step = 80000
flat_ner = true
log_dir = "logs/eng_conll03"
Loading word embeddings from ../pretrained/cc.en.300.vec...
Traceback (most recent call last):
  File "evaluate.py", line 15, in <module>
    model = biaffine_ner_model.BiaffineNERModel(config)
  File "/data1/zhaoxiaofeng/biaffine-ner/biaffine_ner_model.py", line 17, in __init__
    self.context_embeddings = util.EmbeddingDictionary(config["context_embeddings"])
  File "/data1/zhaoxiaofeng/biaffine-ner/util.py", line 234, in __init__
    self._embeddings = self.load_embedding_dict(self._path)
  File "/data1/zhaoxiaofeng/biaffine-ner/util.py", line 251, in load_embedding_dict
    assert len(embedding) == self.size
AssertionError

Then I continued with GloVe 6B because I assumed the embedding shouldn't have a huge impact.
This is the log of the en_conll2003 pre-trained model from your link "acl2020 best models":

(biaffine-ner) zhaoxiaofeng@omnisky:/data1/zhaoxiaofeng/biaffine-ner$ CUDA_VISIBLE_DEVICES=6 python evaluate.py eng_conll03
Running experiment: eng_conll03
ffnn_size = 150
ffnn_depth = 2
filter_widths = [
  3
  4
  5
]
filter_size = 50
char_embedding_size = 8
char_vocab_path = "logs/char_vocab.eng.conll03.txt"
context_embeddings {
  path = "../glove/glove.6B.300d.txt"
  size = 300
}
contextualization_size = 200
contextualization_layers = 3
lm_size = 1024
lm_layers = 4
lm_path = "bert-model/bert_conll2003_features.hdf5"
max_gradient_norm = 5.0
lstm_dropout_rate = 0.4
lexical_dropout_rate = 0.5
dropout_rate = 0.2
optimizer = "adam"
learning_rate = 0.001
decay_rate = 0.999
decay_frequency = 100
train_path = "corpus/en_conll2003/train.json"
eval_path = "corpus/en_conll2003/dev.json"
test_path = "corpus/en_conll2003/test.json"
ner_types = [
  "ORG"
  "MISC"
  "PER"
  "LOC"
]
eval_frequency = 500
report_frequency = 100
log_root = "logs"
max_step = 80000
flat_ner = true
log_dir = "logs/eng_conll03"
Loading word embeddings from ../glove/glove.6B.300d.txt...
Done loading word embeddings.
/data0/zhaoxiaofeng/.conda/envs/biaffine-ner/lib/python2.7/site-packages/tensorflow/python/ops/gradients_impl.py:112: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.
  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "
2020-10-24 00:13:43.106943: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-10-24 00:13:45.888032: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties: 
name: TITAN Xp major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:88:00.0
totalMemory: 11.91GiB freeMemory: 11.76GiB
2020-10-24 00:13:45.888082: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2020-10-24 00:13:46.248727: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-10-24 00:13:46.248776: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0 
2020-10-24 00:13:46.248784: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N 
2020-10-24 00:13:46.249496: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 11378 MB memory) -> physical GPU (device: 0, name: TITAN Xp, pci bus id: 0000:88:00.0, compute capability: 6.1)
Restoring from logs/eng_conll03/model.max.ckpt
Loaded 231 eval examples.
Evaluated 1/231 examples.
Evaluated 11/231 examples.
Evaluated 21/231 examples.
Evaluated 31/231 examples.
Evaluated 41/231 examples.
Evaluated 51/231 examples.
Evaluated 61/231 examples.
Evaluated 71/231 examples.
Evaluated 81/231 examples.
Evaluated 91/231 examples.
Evaluated 101/231 examples.
Evaluated 111/231 examples.
Evaluated 121/231 examples.
Evaluated 131/231 examples.
Evaluated 141/231 examples.
Evaluated 151/231 examples.
Evaluated 161/231 examples.
Evaluated 171/231 examples.
Evaluated 181/231 examples.
Evaluated 191/231 examples.
Evaluated 201/231 examples.
Evaluated 211/231 examples.
Evaluated 221/231 examples.
Evaluated 231/231 examples.
Time used: 21 second, 2207.99 w/s 
Mention F1: 90.30%
Mention recall: 89.24%
Mention precision: 91.39%
****************SUB NER TYPES********************
ORG F1: 89.76%
ORG recall: 91.81%
ORG precision: 87.80%
MISC F1: 77.93%
MISC recall: 73.93%
MISC precision: 82.38%
PER F1: 95.99%
PER recall: 96.29%
PER precision: 95.70%
LOC F1: 90.25%
LOC recall: 86.27%
LOC precision: 94.61%

This is the log of the model I trained myself:

(biaffine-ner) zhaoxiaofeng@omnisky:/data1/zhaoxiaofeng/biaffine-ner$ CUDA_VISIBLE_DEVICES=6 python evaluate.py eng_conll03
Running experiment: eng_conll03
ffnn_size = 150
ffnn_depth = 2
filter_widths = [
  3
  4
  5
]
filter_size = 50
char_embedding_size = 8
char_vocab_path = "logs/char_vocab.eng.conll03.txt"
context_embeddings {
  path = "../glove/glove.6B.300d.txt"
  size = 300
}
contextualization_size = 200
contextualization_layers = 3
lm_size = 1024
lm_layers = 4
lm_path = "bert-model/bert_conll2003_features.hdf5"
max_gradient_norm = 5.0
lstm_dropout_rate = 0.4
lexical_dropout_rate = 0.5
dropout_rate = 0.2
optimizer = "adam"
learning_rate = 0.001
decay_rate = 0.999
decay_frequency = 100
train_path = "corpus/en_conll2003/train.json"
eval_path = "corpus/en_conll2003/dev.json"
test_path = "corpus/en_conll2003/test.json"
ner_types = [
  "ORG"
  "MISC"
  "PER"
  "LOC"
]
eval_frequency = 500
report_frequency = 100
log_root = "logs"
max_step = 80000
flat_ner = true
log_dir = "logs/eng_conll03"
Loading word embeddings from ../glove/glove.6B.300d.txt...
Done loading word embeddings.
/data0/zhaoxiaofeng/.conda/envs/biaffine-ner/lib/python2.7/site-packages/tensorflow/python/ops/gradients_impl.py:112: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.
  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "
2020-10-24 16:22:56.141437: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-10-24 16:22:59.562773: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties: 
name: TITAN Xp major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:88:00.0
totalMemory: 11.91GiB freeMemory: 11.76GiB
2020-10-24 16:22:59.562834: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2020-10-24 16:22:59.977785: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-10-24 16:22:59.977846: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0 
2020-10-24 16:22:59.977854: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N 
2020-10-24 16:22:59.978513: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 11378 MB memory) -> physical GPU (device: 0, name: TITAN Xp, pci bus id: 0000:88:00.0, compute capability: 6.1)
Restoring from logs/eng_conll03/model.max.ckpt
Loaded 231 eval examples.
Evaluated 1/231 examples.
Evaluated 11/231 examples.
Evaluated 21/231 examples.
Evaluated 31/231 examples.
Evaluated 41/231 examples.
Evaluated 51/231 examples.
Evaluated 61/231 examples.
Evaluated 71/231 examples.
Evaluated 81/231 examples.
Evaluated 91/231 examples.
Evaluated 101/231 examples.
Evaluated 111/231 examples.
Evaluated 121/231 examples.
Evaluated 131/231 examples.
Evaluated 141/231 examples.
Evaluated 151/231 examples.
Evaluated 161/231 examples.
Evaluated 171/231 examples.
Evaluated 181/231 examples.
Evaluated 191/231 examples.
Evaluated 201/231 examples.
Evaluated 211/231 examples.
Evaluated 221/231 examples.
Evaluated 231/231 examples.
Time used: 21 second, 2133.06 w/s 
Mention F1: 92.12%
Mention recall: 92.30%
Mention precision: 91.94%
****************SUB NER TYPES********************
ORG F1: 91.08%
ORG recall: 92.17%
ORG precision: 90.01%
MISC F1: 81.81%
MISC recall: 82.62%
MISC precision: 81.01%
PER F1: 96.64%
PER recall: 97.03%
PER precision: 96.26%
LOC F1: 93.16%
LOC recall: 91.91%
LOC precision: 94.45%

The raw_en_conll2003, the en_conll2003 I processed, and data_process.py (Python 3) are in sharelink1 or sharelink2, together with the extract_feature.sh and experiments.conf I used. You can check them.

Do you know what went wrong on my side?
Is it the dataset, the fastText embedding, or the hyperparameters?
Can you help me? Thank you.


juntaoy commented Oct 26, 2020

@zhaoxf4 I've checked your data; it is identical to mine, and I also tested my best model on it, which gives the same results I reported in the paper. The log is attached below. I got at least 93 over 5 runs, so 92.1 is way too low :) I can see three differences:

  1. The fastText embedding, but I don't think this would be a big issue if you retrain the system yourself. cc.en.300.vec.filtered is just a subset of cc.en.300.vec that only contains the words appearing in the dataset, so it loads faster. The reason you got the error is that you need to remove the first line of cc.en.300.vec, which is not an embedding (see the sketch below this list).

  2. I train the system on the concatenation of the train and dev sets, without early stopping.

  3. Due to a typo in an early version of extract_bert_features.sh, the --window_size I actually used was not 128 but its default value of 511 in extract_features.py, so you need to change it to 511 in order to get the same results. I've just updated extract_bert_features.sh to set --window_size = 511 to avoid confusion.

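Something like this sketch would do for point 1 (not code from this repo; the helper name and the vocab argument are just illustrative):

```python
import io

def filter_fasttext(src, dst, vocab=None):
    """Strip the fastText header line and optionally keep only dataset words."""
    with io.open(src, encoding="utf-8") as fin, io.open(dst, "w", encoding="utf-8") as fout:
        next(fin)  # first line is "<num_words> <dim>", not an embedding row
        for line in fin:
            word = line.split(u" ", 1)[0]
            if vocab is None or word in vocab:
                fout.write(line)

# e.g. filter_fasttext("cc.en.300.vec", "cc.en.300.vec.filtered", vocab=dataset_vocab)
```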
WARNING:tensorflow:From /homes/juntao/miniconda2/envs/tf15/lib/python2.7/site-packages/tensorflow/contrib/learn/python/learn/datasets/base.py:198: retry (from tensorflow.contrib.learn.python.learn.datasets.base) is deprecated and will be removed in a future version.
Instructions for updating:
Use the retry module or similar alternatives.
Running experiment: eng_conll03
ffnn_size = 150
ffnn_depth = 2
filter_widths = [
  3
  4
  5
]
filter_size = 50
char_embedding_size = 8
char_vocab_path = "char_vocab.eng.conll03.txt"
context_embeddings {
  path = "../fasttext/cc.en.300.vec.filtered"
  size = 300
}
contextualization_size = 200
contextualization_layers = 3
lm_size = 1024
lm_layers = 4
lm_path = "../cogsci/bert-model/bert_conll03_features.hdf5"
max_gradient_norm = 5.0
lstm_dropout_rate = 0.4
lexical_dropout_rate = 0.5
dropout_rate = 0.2
optimizer = "adam"
learning_rate = 0.001
decay_rate = 0.999
decay_frequency = 100
train_path = "train_dev.conll03.jsonlines"
eval_path = ""
test_path = "test.yu.json"
ner_types = [
  "ORG"
  "MISC"
  "PER"
  "LOC"
]
eval_frequency = 500
report_frequency = 100
log_root = "logs"
max_step = 80000
flat_ner = true
log_dir = "logs/eng_conll03"
Loading word embeddings from ../fasttext/cc.en.300.vec.filtered...
Done loading word embeddings.
use_lee_lstm
/homes/juntao/miniconda2/envs/tf15/lib/python2.7/site-packages/tensorflow/python/ops/gradients_impl.py:100: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.
  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "
2020-10-26 11:15:59.377855: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-10-26 11:15:59.683691: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1344] Found device 0 with properties: 
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:09:00.0
totalMemory: 10.92GiB freeMemory: 10.77GiB
2020-10-26 11:15:59.683785: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1423] Adding visible gpu devices: 0
2020-10-26 11:16:00.103129: I tensorflow/core/common_runtime/gpu/gpu_device.cc:911] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-10-26 11:16:00.103192: I tensorflow/core/common_runtime/gpu/gpu_device.cc:917]      0 
2020-10-26 11:16:00.103212: I tensorflow/core/common_runtime/gpu/gpu_device.cc:930] 0:   N 
2020-10-26 11:16:00.103399: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1041] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10428 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:09:00.0, compute capability: 6.1)
Restoring from logs/eng_conll03/model.max.ckpt
Loaded 231 eval examples.
Evaluated 1/231 examples.
Evaluated 11/231 examples.
Evaluated 21/231 examples.
Evaluated 31/231 examples.
Evaluated 41/231 examples.
Evaluated 51/231 examples.
Evaluated 61/231 examples.
Evaluated 71/231 examples.
Evaluated 81/231 examples.
Evaluated 91/231 examples.
Evaluated 101/231 examples.
Evaluated 111/231 examples.
Evaluated 121/231 examples.
Evaluated 131/231 examples.
Evaluated 141/231 examples.
Evaluated 151/231 examples.
Evaluated 161/231 examples.
Evaluated 171/231 examples.
Evaluated 181/231 examples.
Evaluated 191/231 examples.
Evaluated 201/231 examples.
Evaluated 211/231 examples.
Evaluated 221/231 examples.
Evaluated 231/231 examples.
Time used: 29 second, 1588.77 w/s 
test Mention F1: 93.46%
test Mention recall: 93.31%
test Mention precision: 93.62%
93.6&93.3&93.5
****************SUB NER TYPES********************
ORG F1: 92.51%
ORG recall: 93.02%
ORG precision: 92.02%
MISC F1: 84.77%
MISC recall: 84.47%
MISC precision: 85.08%
PER F1: 97.55%
PER recall: 97.40%
PER precision: 97.70%
LOC F1: 94.11%
LOC recall: 93.35%
LOC precision: 94.88%

@wangxinyu0922

@juntaoy
Hi, do you have any results for a BiLSTM-Softmax/CRF model on CoNLL 2003 English? I'm trying to reproduce the results reported in the BERT paper with document context, but have failed so far. I wonder whether the embeddings extracted by biaffine-ner can reproduce them. Thank you.


juntaoy commented Oct 27, 2020

@wangxinyu0922 I did 3 runs with a BiLSTM+CRF model on CoNLL 2003 English; the best I can get is 91.9, a bit lower than the BERT paper's 92.8. They might have gotten lucky with that number :)

@wangxinyu0922

Thank you. Could you give me some details of your CRF model, so that I can do more experiments on this topic?

  • Is it trained on the train+dev set?
  • Does it follow the configuration/hyperparameters in your paper?


juntaoy commented Oct 27, 2020

Yes, it is the same as in my paper, except that the window_size is not 128 but 512, so make sure you download the latest extract_bert_features.sh.
I trained it on train+dev for 80k steps, which is about 80 epochs.


wangxinyu0922 commented Oct 28, 2020

Could you share your extracted BERT embeddings? I used the latest version of extract_bert_features.sh, but I cannot reproduce the score of your trained NER model. Maybe the embedding files differ.

export BERT_MODEL_PATH="./cased_L-24_H-1024_A-16"
PYTHONPATH=. python extract_features.py --input_file="train.eng.conll03.jsonlines;dev.eng.conll03.jsonlines;test.eng.conll03.jsonlines" --output_file=../bert_features.hdf5 --bert_config_file $BERT_MODEL_PATH/bert_config.json --init_checkpoint $BERT_MODEL_PATH/bert_model.ckpt --vocab_file  $BERT_MODEL_PATH/vocab.txt --do_lower_case=False --stride 1 --window_size 511
$ CUDA_VISIBLE_DEVICES=0 python2 evaluate.py eng_conll03
Running experiment: eng_conll03
ffnn_size = 150
ffnn_depth = 2
filter_widths = [
  3
  4
  5
]
filter_size = 50
char_embedding_size = 8
char_vocab_path = "best_models/char_vocab.eng.conll03.txt"
context_embeddings {
  path = "./cc.en.300.vec.filtered"
  size = 300
}
contextualization_size = 200
contextualization_layers = 3
lm_size = 1024
lm_layers = 4
lm_path = "./bert_features.hdf5"
max_gradient_norm = 5.0
lstm_dropout_rate = 0.4
lexical_dropout_rate = 0.5
dropout_rate = 0.2
optimizer = "adam"
learning_rate = 0.001
decay_rate = 0.999
decay_frequency = 100
train_path = "extract_bert_features/train_dev.conll03.jsonlines"
eval_path = ""
test_path = "extract_bert_features/test.eng.conll03.jsonlines"
ner_types = [
  "ORG"
  "MISC"
  "PER"
  "LOC"
]
eval_frequency = 500
report_frequency = 100
log_root = "logs"
max_step = 80000
flat_ner = true
log_dir = "logs/eng_conll03"
Loading word embeddings from ./cc.en.300.vec.filtered...
Done loading word embeddings.
/home/wangxy/anaconda3/envs/neuronlp/lib/python2.7/site-packages/tensorflow/python/ops/gradients_impl.py:112: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.
  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "
2020-10-28 10:18:26.246744: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
2020-10-28 10:18:26.545662: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties:
name: TITAN V major: 7 minor: 0 memoryClockRate(GHz): 1.455
pciBusID: 0000:18:00.0
totalMemory: 11.75GiB freeMemory: 8.94GiB
2020-10-28 10:18:26.545706: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2020-10-28 10:18:26.988961: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-10-28 10:18:26.989008: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0
2020-10-28 10:18:26.989031: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N
2020-10-28 10:18:26.989214: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 8617 MB memory) -> physical GPU (device: 0, name: TITAN V, pci bus id: 0000:18:00.0, compute capability: 7.0)
Restoring from logs/eng_conll03/model.max.ckpt
Loaded 231 eval examples.
Evaluated 1/231 examples.
Evaluated 11/231 examples.
Evaluated 21/231 examples.
Evaluated 31/231 examples.
Evaluated 41/231 examples.
Evaluated 51/231 examples.
Evaluated 61/231 examples.
Evaluated 71/231 examples.
Evaluated 81/231 examples.
Evaluated 91/231 examples.
Evaluated 101/231 examples.
Evaluated 111/231 examples.
Evaluated 121/231 examples.
Evaluated 131/231 examples.
Evaluated 141/231 examples.
Evaluated 151/231 examples.
Evaluated 161/231 examples.
Evaluated 171/231 examples.
Evaluated 181/231 examples.
Evaluated 191/231 examples.
Evaluated 201/231 examples.
Evaluated 211/231 examples.
Evaluated 221/231 examples.
Evaluated 231/231 examples.
Time used: 21 second, 2165.41 w/s
Mention F1: 50.09%
Mention recall: 43.34%
Mention precision: 59.33%
****************SUB NER TYPES********************
ORG F1: 52.18%
ORG recall: 45.67%
ORG precision: 60.87%
MISC F1: 59.15%
MISC recall: 53.27%
MISC precision: 66.48%
PER F1: 23.45%
PER recall: 18.72%
PER precision: 31.39%
LOC F1: 76.24%
LOC recall: 71.13%
LOC precision: 82.15%

Apart from the embeddings, I noticed one difference between the logs: use_lee_lstm appears in your log. I wonder whether it affects the result.

@wangxinyu0922

Well, I successfully reproduced the F1 after using the jsonlines files provided by @zhaoxf4.

Running experiment: eng_conll03_new
ffnn_size = 150
ffnn_depth = 2
filter_widths = [
  3
  4
  5
]
filter_size = 50
char_embedding_size = 8
char_vocab_path = "best_models/char_vocab.eng.conll03.txt"
context_embeddings {
  path = "./cc.en.300.vec"
  size = 300
}
contextualization_size = 200
contextualization_layers = 3
lm_size = 1024
lm_layers = 4
lm_path = "./bert_features.hdf5"
max_gradient_norm = 5.0
lstm_dropout_rate = 0.4
lexical_dropout_rate = 0.5
dropout_rate = 0.2
optimizer = "adam"
learning_rate = 0.001
decay_rate = 0.999
decay_frequency = 100
train_path = "extract_bert_features/train_dev.eng.conll03.jsonlines"
eval_path = ""
test_path = "new_conll_03_test.json"
ner_types = [
  "ORG"
  "MISC"
  "PER"
  "LOC"
]
eval_frequency = 500
report_frequency = 100
log_root = "logs"
max_step = 80000
flat_ner = true
log_dir = "logs/eng_conll03_new"
Loading word embeddings from ./cc.en.300.vec...
Done loading word embeddings.
/home/wangxy/anaconda3/envs/neuronlp/lib/python2.7/site-packages/tensorflow/python/ops/gradients_impl.py:112: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.
  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "
2020-10-28 11:30:31.376861: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
2020-10-28 11:30:31.654372: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties:
name: TITAN V major: 7 minor: 0 memoryClockRate(GHz): 1.455
pciBusID: 0000:3b:00.0
totalMemory: 11.75GiB freeMemory: 6.33GiB
2020-10-28 11:30:31.654416: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2020-10-28 11:30:32.224585: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-10-28 11:30:32.224634: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0
2020-10-28 11:30:32.224646: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N
2020-10-28 11:30:32.224883: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 6076 MB memory) -> physical GPU (device: 0, name: TITAN V, pci bus id: 0000:3b:00.0, compute capability: 7.0)
Restoring from logs/eng_conll03_new/model.max.ckpt
Loaded 231 eval examples.
Evaluated 1/231 examples.
Evaluated 11/231 examples.
Evaluated 21/231 examples.
Evaluated 31/231 examples.
Evaluated 41/231 examples.
Evaluated 51/231 examples.
Evaluated 61/231 examples.
Evaluated 71/231 examples.
Evaluated 81/231 examples.
Evaluated 91/231 examples.
Evaluated 101/231 examples.
Evaluated 111/231 examples.
Evaluated 121/231 examples.
Evaluated 131/231 examples.
Evaluated 141/231 examples.
Evaluated 151/231 examples.
Evaluated 161/231 examples.
Evaluated 171/231 examples.
Evaluated 181/231 examples.
Evaluated 191/231 examples.
Evaluated 201/231 examples.
Evaluated 211/231 examples.
Evaluated 221/231 examples.
Evaluated 231/231 examples.
Time used: 22 second, 2070.13 w/s
Mention F1: 93.47%
Mention recall: 93.33%
Mention precision: 93.62%
****************SUB NER TYPES********************
ORG F1: 92.51%
ORG recall: 93.02%
ORG precision: 92.02%
MISC F1: 84.86%
MISC recall: 84.62%
MISC precision: 85.10%
PER F1: 97.55%
PER recall: 97.40%
PER precision: 97.70%
LOC F1: 94.11%
LOC recall: 93.35%
LOC precision: 94.88%


juntaoy commented Oct 28, 2020

@wangxinyu0922 use_lee_lstm will not affect the results :) The final version does use the custom LSTM from the Lee et al. 2018 system. It was just an experiment I did to see whether we could get the same results with the default TensorFlow LSTM; the answer is yes, the improvement from the custom LSTM is minimal (0.1-0.2).
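Roughly, the default-TensorFlow-LSTM variant looks like this (just an illustration with made-up names, not the exact code I used):

```python
import tensorflow as tf

def stock_bilstm(inputs, seq_len, hidden_size, num_layers, keep_prob):
    """Stacked BiLSTM built from plain tf.nn.rnn_cell cells,
    instead of the custom Lee et al. (2018) LSTM cell."""
    outputs = inputs
    for layer in range(num_layers):
        with tf.variable_scope("bilstm_%d" % layer):
            cell_fw = tf.nn.rnn_cell.DropoutWrapper(
                tf.nn.rnn_cell.LSTMCell(hidden_size), output_keep_prob=keep_prob)
            cell_bw = tf.nn.rnn_cell.DropoutWrapper(
                tf.nn.rnn_cell.LSTMCell(hidden_size), output_keep_prob=keep_prob)
            (fw_out, bw_out), _ = tf.nn.bidirectional_dynamic_rnn(
                cell_fw, cell_bw, outputs, sequence_length=seq_len, dtype=tf.float32)
            outputs = tf.concat([fw_out, bw_out], axis=-1)
    return outputs
```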


zhaoxf4 commented Oct 28, 2020

@juntaoy OK, I think I have approximately replicated it on eng_conll2003.
The evaluation results of six re-trained models (window_size=511, trained on train+dev for 80,000 steps) are as follows:

| metric \ run | 1 | 2 | 3 | 4 | 5 | 6 |
|--------------|-------|-------|-------|-------|-------|-------|
| F1           | 93.14 | 93.07 | 92.83 | 93.16 | 93.26 | 93.00 |
| recall       | 93.43 | 92.30 | 92.30 | 93.22 | 93.48 | 92.81 |
| precision    | 92.86 | 93.07 | 93.37 | 93.10 | 93.04 | 93.19 |

and evaluating your ACL 2020 best model gives:

Mention F1: 93.47%
Mention recall: 93.33%
Mention precision: 93.62%
****************SUB NER TYPES********************
ORG F1: 92.51%
ORG recall: 93.02%
ORG precision: 92.02%
MISC F1: 84.86%
MISC recall: 84.62%
MISC precision: 85.10%
PER F1: 97.55%
PER recall: 97.40%
PER precision: 97.70%
LOC F1: 94.11%
LOC recall: 93.35%
LOC precision: 94.88%

Are there any random operations for which no seed is set? I want to understand the differences between the six runs, and I'm checking the code.


juntaoy commented Oct 28, 2020

I didn't set a seed for the final version. I tried some time ago, but for an unknown reason, even after fixing both the Python seed and the TensorFlow seed I still got different results; it somehow doesn't work for my code. I'm not sure if this is because of the multithreading I use.
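For reference, full seeding in a TF 1.x setup would look roughly like this (the released code does not do this, and even with it, parallel op scheduling can still make results vary unless parallelism is also pinned down):

```python
import random
import numpy as np
import tensorflow as tf

SEED = 42  # illustrative value
random.seed(SEED)
np.random.seed(SEED)
tf.set_random_seed(SEED)

# Parallel op scheduling is another source of run-to-run variation,
# so a fully deterministic run would also need a single-threaded session:
config = tf.ConfigProto(intra_op_parallelism_threads=1,
                        inter_op_parallelism_threads=1)
session = tf.Session(config=config)
```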


lzf00 commented Sep 4, 2022


Hi, I reproduced the eng_conll2003 experiment using the dataset and model settings you provided, but I ran into a problem with the model save path: I don't know how to change where the model checkpoints are saved...
(screenshot attached)
