
Plans to release sequence tagging task fine-tuning code? #33

Closed

egez opened this issue Nov 2, 2018 · 13 comments

@egez

egez commented Nov 2, 2018

It seems that the fine-tuning code for the CoNLL-2003 NER task (as described in the paper) isn't in the current release. Are there any plans to release that part?

@jacobdevlin-google
Contributor

For maintainability and simplicity, this is all we're planning on releasing other than the multilingual models and the GPU memory workaround (keep in mind we don't actually have clean standalone implementations of NER, SWAG, or the other GLUE tasks; the implementations we used in the paper are much messier). For NER we used the recipe at the bottom of the Tokenization section of the README.
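
For reference, the README recipe mentioned above aligns each original token to its first wordpiece, roughly along these lines (a sketch, not the verbatim README code; the vocab path and example tokens are placeholders):

```python
import tokenization  # from the google-research/bert repo

orig_tokens = ["John", "Johanson", "'s", "house"]
labels      = ["NNP", "NNP", "POS", "NN"]

tokenizer = tokenization.FullTokenizer(vocab_file="vocab.txt", do_lower_case=True)

bert_tokens = ["[CLS]"]
orig_to_tok_map = []  # index of the first wordpiece of each original token
for orig_token in orig_tokens:
    orig_to_tok_map.append(len(bert_tokens))
    bert_tokens.extend(tokenizer.tokenize(orig_token))
bert_tokens.append("[SEP]")

# bert_tokens     ~ ["[CLS]", "john", "johan", "##son", "'", "s", "house", "[SEP]"]
# orig_to_tok_map ~ [1, 2, 4, 6]  -> predict a label only at these positions
```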

@egez
Author

egez commented Nov 2, 2018

Understood. Thanks for the quick clarification.

@ymcdull

ymcdull commented Nov 7, 2018

For the sequence tagging task, how should we design the loss function?
"run_classifier.py" just uses "loss = tf.reduce_mean(per_example_loss)", where each "example" is a sentence or a pair of sentences.
But for a sequence tagging task, each sentence has multiple labeled tokens. I'm a bit confused about how to design the loss function here. Any ideas? Thanks.

@jacobdevlin-google
Contributor

Generally for tagging tasks you do a reduce-mean across all tokens in the batch. But you will need to handle padding (and potentially wordpieces that don't have a label) correctly, i.e., by having a weight of 1.0 for the tokens that have a prediction and 0.0 for the tokens that don't. You can use tf.losses.softmax_cross_entropy to do this correctly (make sure to pass a tensor to weights as well).
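
A minimal sketch of that masked token-level loss (TF 1.x; the tensor names and the convention that padded/unlabeled positions get weight 0.0 are assumptions, not code from this repo):

```python
import tensorflow as tf

def token_level_loss(sequence_output, label_ids, label_weights, num_labels):
    # sequence_output: [batch, seq_len, hidden] from model.get_sequence_output()
    # label_ids:       [batch, seq_len] integer tag ids
    # label_weights:   [batch, seq_len] float, 1.0 where a token has a label, else 0.0

    # one dense projection shared across token positions -> [batch, seq_len, num_labels]
    logits = tf.layers.dense(sequence_output, num_labels)
    onehot = tf.one_hot(label_ids, depth=num_labels, dtype=tf.float32)

    # the default reduction averages the per-token losses over the non-zero
    # weights, so padding and unlabeled wordpieces don't contribute
    loss = tf.losses.softmax_cross_entropy(
        onehot_labels=tf.reshape(onehot, [-1, num_labels]),
        logits=tf.reshape(logits, [-1, num_labels]),
        weights=tf.reshape(label_weights, [-1]))
    return loss, logits
```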

@ymcdull

ymcdull commented Nov 7, 2018

Awesome. Thanks for the quick reply :)

@wkevwang

Hi, thanks so much for this impressive work! In order to do NER, what kind of model could I use on top of the token embeddings from model.get_sequence_output() to get token-level classification? Could I simply use a dense layer on top of model.get_sequence_output() (with weights of dimensions [seq_length, hidden_size]) to output one classification per token? I assume I would also have to pad the input sequence as well.

@maksna

maksna commented Nov 16, 2018

> Awesome. Thanks for the quick reply :)

Do you use two segments in the NER task, i.e., the original sequence as segment A and the label sequence as segment B?

@ymcdull

ymcdull commented Nov 16, 2018

> Do you use two segments in the NER task, i.e., the original sequence as segment A and the label sequence as segment B?

I think segment B is used for question-answering-style tasks. What I am doing here is to put the original sequence in segment A, leave segment B empty, and make the labels a list with the same length as A.
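
Concretely, that single-segment setup might look like the following sketch (a hypothetical helper, not from the repo; tag id 0 is assumed to be the padding/"O" label, and only the first wordpiece of each word gets a non-zero loss weight):

```python
def make_ner_features(tokenizer, words, word_label_ids, max_seq_len=128):
    tokens, label_ids, label_weights = ["[CLS]"], [0], [0.0]
    for word, label_id in zip(words, word_label_ids):
        for i, piece in enumerate(tokenizer.tokenize(word)):
            tokens.append(piece)
            # label only the first wordpiece; later pieces get weight 0.0
            label_ids.append(label_id if i == 0 else 0)
            label_weights.append(1.0 if i == 0 else 0.0)
    tokens.append("[SEP]")
    label_ids.append(0)
    label_weights.append(0.0)

    input_ids = tokenizer.convert_tokens_to_ids(tokens)
    input_mask = [1] * len(input_ids)
    segment_ids = [0] * len(input_ids)  # everything in segment A, segment B left empty

    pad = max_seq_len - len(input_ids)  # pad all features to max_seq_len
    input_ids += [0] * pad
    input_mask += [0] * pad
    segment_ids += [0] * pad
    label_ids += [0] * pad
    label_weights += [0.0] * pad
    return input_ids, input_mask, segment_ids, label_ids, label_weights
```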

@maksna

maksna commented Nov 19, 2018

> I think segment B is used for question-answering-style tasks. What I am doing here is to put the original sequence in segment A, leave segment B empty, and make the labels a list with the same length as A.

Thank you.

@Albert-Ma

Hi, can you release the hyperparameters you used when fine-tuning for the NER task?

I used the defaults and the F1 is not as good as reported in your paper.

Thanks.
@jacobdevlin-google

@TanyaZhao

@jacobdevlin-google
Hi, I don't quite understand how to use the BERT hidden states as input to the downstream NER task (on CoNLL-2003).
My current fine-tuning process is as follows:

(1) Given a batch of sentences, first convert them into wordpiece tokens and pad to the batch max_seq_len.
(2) Feed these wordpiece tokens into a pre-trained BERT-base model (cased) and get the 12 layers of hidden states.
(3) Fetch the hidden states of the last layer, and gather the hidden state of the first sub-token of each word according to the orig_to_tok_map mentioned in https://github.com/google-research/bert#tokenization (a sketch of this gather follows the list).
(4) Finally, feed those hidden states into a classifier (or a BiLSTM NER model).
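
A minimal sketch of the gather in step (3) (TF 1.x; the padded [batch, max_words] shape of orig_to_tok_map is an assumption):

```python
import tensorflow as tf

def gather_first_subtokens(sequence_output, orig_to_tok_map):
    # sequence_output: [batch, seq_len, hidden] hidden states of the last layer
    # orig_to_tok_map: [batch, max_words] int32 index of each word's first sub-token
    batch_size = tf.shape(sequence_output)[0]
    max_words = tf.shape(orig_to_tok_map)[1]
    batch_idx = tf.tile(tf.expand_dims(tf.range(batch_size), 1), [1, max_words])
    indices = tf.stack([batch_idx, orig_to_tok_map], axis=-1)  # [batch, max_words, 2]
    return tf.gather_nd(sequence_output, indices)              # [batch, max_words, hidden]
```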

I adopted the hyperparameters described in the paper (batch_size=16, lr(Adam)=5e-5) and trained for 100 epochs on a Tesla P100 GPU, but I can't reach the results reported in the paper.

Would you mind giving me some advice? Thank you so much!

@maziyarpanahi

> Hi, I don't quite understand how to use the BERT hidden states as input to the downstream NER task (on CoNLL-2003). [...] I adopted the hyperparameters described in the paper (batch_size=16, lr(Adam)=5e-5) and trained for 100 epochs on a Tesla P100 GPU, but I can't reach the results reported in the paper.

Hi @TanyaZhao,

If you don't mind me asking, what was the maximum F1 you achieved with these steps? (I am just wondering how far it is from the one in the paper and other state-of-the-art scores.)

Many thanks

@ghaddarAbs

In order to reproduce the CoNLL scores reported in the BERT paper (92.4 for BERT-base and 92.8 for BERT-large), one trick is to apply a truecaser to article titles (all-uppercase sentences) as a preprocessing step for the CoNLL train/dev/test sets. This can be done with the following method.

```python
# https://github.com/daltonfury42/truecase
# pip install truecase
import re

import truecase

# original tokens:
# ['FULL', 'FEES', '1.875', 'REOFFER', '99.32', 'SPREAD', '+20', 'BP']

def truecase_sentence(tokens):
    word_lst = [(w, idx) for idx, w in enumerate(tokens) if all(c.isalpha() for c in w)]
    lst = [w for w, _ in word_lst if re.match(r'\b[A-Z\.\-]+\b', w)]

    # only truecase if every purely alphabetic word is upper-cased
    if len(lst) and len(lst) == len(word_lst):
        parts = truecase.get_true_case(' '.join(lst)).split()

        # the truecaser has its own tokenization ...
        # skip if the number of words doesn't match
        if len(parts) != len(word_lst):
            return tokens

        for (w, idx), nw in zip(word_lst, parts):
            tokens[idx] = nw

    return tokens

# truecased tokens:
# ['Full', 'fees', '1.875', 'Reoffer', '99.32', 'spread', '+20', 'BP']
```

With these configurations and preprocessing, I was able to reach 92.8 with SpanBERT-large:

  • SpanBERT-large
  • lr = 5e-6
  • train_epochs = 40
  • batch_size = 96
  • max_seq_len = 64 (you need to intelligently split the sequences)
  • dropout (hidden/attention) = 0.2
  • CRF (gather the first sub-tokens and apply a CRF; see the sketch below)
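
A rough sketch of that CRF step (TF 1.x, using tf.contrib.crf; the tensor names are assumptions, not the commenter's code):

```python
import tensorflow as tf

def crf_layer(word_reprs, tag_ids, seq_lengths, num_tags):
    # word_reprs:  [batch, max_words, hidden] gathered first sub-token states
    # tag_ids:     [batch, max_words] gold tag ids
    # seq_lengths: [batch] number of real words per sentence
    logits = tf.layers.dense(word_reprs, num_tags)  # unary potentials
    log_likelihood, transition_params = tf.contrib.crf.crf_log_likelihood(
        logits, tag_ids, seq_lengths)
    loss = tf.reduce_mean(-log_likelihood)
    pred_tags, _ = tf.contrib.crf.crf_decode(logits, transition_params, seq_lengths)
    return loss, pred_tags
```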

To use a large batch_size (96 or 128) for fine-tuning, you can use the method below:

```python
import math
import re


def split_long_sequence(tokenizer, tokens, tags, max_seq_len=64):
    """Recursively split (tokens, tags) so each part fits in max_seq_len wordpieces."""
    # start counting at 1 because of the [CLS] token
    count, punct_index, tag_window = 1, [], []
    tmp_tags = [0, 0] + tags + [0, 0]

    for idx, token in enumerate(tokens):
        bert_lst = tokenizer.tokenize(token)
        count += len(bert_lst)

        # candidate split points: punctuation tokens ...
        if re.match(r'[^a-zA-Z0-9]', token):
            punct_index.append(idx)

        # ... and positions surrounded only by tag id 0 ('O')
        t_idx = idx + 2
        if idx and all(t == 0 for t in tmp_tags[t_idx - 2:t_idx + 2]):
            tag_window.append(idx)

    if count < max_seq_len:
        return [(tokens, tags)]

    # prefer splitting inside an all-'O' window, then at punctuation,
    # otherwise just cut in the middle
    pick_lst = tag_window if tag_window else punct_index
    if not pick_lst:
        mid = len(tokens) // 2
    else:
        index_lst = [(i, math.fabs(i - len(tokens) // 2)) for i in pick_lst]
        index_lst.sort(key=lambda x: x[1])
        mid = index_lst[0][0]

    l1 = split_long_sequence(tokenizer, tokens[:mid], tags[:mid], max_seq_len)
    l2 = split_long_sequence(tokenizer, tokens[mid:], tags[mid:], max_seq_len)

    return l1 + l2


# usage
seq_lst = split_long_sequence(tokenizer, tokens, tags)
for sent_part_num, (tok_lst, tag_lst) in enumerate(seq_lst):
    if not tok_lst:
        continue
    # ... run the model on each part; after decoding just concatenate the parts
```
