
Plans to release sequence tagging task fine-tuning code? #33

Closed

egez opened this issue Nov 2, 2018 · 13 comments

@egez

egez commented Nov 2, 2018

It seems that the fine-tuning code for the CoNLL-2003 NER task (as described in the paper) isn't in the current release. Are there any plans to release that part?

@jacobdevlin-google
Contributor

For maintainability and simplicity, this is all we're planning on releasing other than the multilingual models and the GPU memory workaround (keep in mind we don't actually have clean standalone implementations of NER, SWAG, or the other GLUE tasks; the implementations we used in the paper are much messier). For NER we used the recipe at the bottom of the Tokenization section of the README.
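
For reference, the README recipe mentioned above aligns each original token to its first wordpiece, roughly along these lines (a sketch, not the verbatim README code; the vocab path and example tokens are placeholders):

```python
import tokenization  # from the google-research/bert repo

orig_tokens = ["John", "Johanson", "'s", "house"]
labels      = ["NNP", "NNP", "POS", "NN"]

tokenizer = tokenization.FullTokenizer(vocab_file="vocab.txt", do_lower_case=True)

bert_tokens = ["[CLS]"]
orig_to_tok_map = []  # index of the first wordpiece of each original token
for orig_token in orig_tokens:
    orig_to_tok_map.append(len(bert_tokens))
    bert_tokens.extend(tokenizer.tokenize(orig_token))
bert_tokens.append("[SEP]")

# bert_tokens     ~ ["[CLS]", "john", "johan", "##son", "'", "s", "house", "[SEP]"]
# orig_to_tok_map ~ [1, 2, 4, 6]  -> predict a label only at these positions
```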

@egez
Author

egez commented Nov 2, 2018

Understood. Thanks for the quick clarification.

@ymcdull

ymcdull commented Nov 7, 2018

For the sequence tagging task, how should we design the loss function?
"run_classifier.py" just uses "loss = tf.reduce_mean(per_example_loss)", where each "example" is a sentence or a pair of sentences.
But for a sequence tagging task, each sentence has multiple labeled tokens. I'm a bit confused about how to design the loss function here. Any ideas? Thanks.

@jacobdevlin-google
Contributor

Generally for tagging tasks you do a reduce-mean across all tokens in the batch. But you will need to handle padding (and potentially wordpieces that don't have a label) correctly, i.e., by having a weight of 1.0 for the tokens that have a prediction and 0.0 for the tokens that don't. You can use tf.losses.softmax_cross_entropy to do this correctly (make sure to pass a tensor to weights as well).
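
A minimal sketch of that masked token-level loss (TF 1.x; the tensor names and the convention that padded/unlabeled positions get weight 0.0 are assumptions, not code from this repo):

```python
import tensorflow as tf

def token_level_loss(sequence_output, label_ids, label_weights, num_labels):
    # sequence_output: [batch, seq_len, hidden] from model.get_sequence_output()
    # label_ids:       [batch, seq_len] integer tag ids
    # label_weights:   [batch, seq_len] float, 1.0 where a token has a label, else 0.0

    # one dense projection shared across token positions -> [batch, seq_len, num_labels]
    logits = tf.layers.dense(sequence_output, num_labels)
    onehot = tf.one_hot(label_ids, depth=num_labels, dtype=tf.float32)

    # the default reduction averages the per-token losses over the non-zero
    # weights, so padding and unlabeled wordpieces don't contribute
    loss = tf.losses.softmax_cross_entropy(
        onehot_labels=tf.reshape(onehot, [-1, num_labels]),
        logits=tf.reshape(logits, [-1, num_labels]),
        weights=tf.reshape(label_weights, [-1]))
    return loss, logits
```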

@ymcdull

ymcdull commented Nov 7, 2018

Awesome. Thanks for the quick reply :)

@wkevwang

Hi, thanks so much for this impressive work! In order to do NER, what kind of model could I use on top of the token embeddings from model.get_sequence_output() to get token-level classification? Could I simply use a dense layer on top of model.get_sequence_output() (with weights of dimensions [seq_length, hidden_size]) to output one classification per token? I assume I would also have to pad the input sequence as well.

@maksna

maksna commented Nov 16, 2018

> Awesome. Thanks for the quick reply :)

Do you use two segments in the NER task, i.e., the original sequence as segment A and the label sequence as segment B?

@ymcdull

ymcdull commented Nov 16, 2018

> Do you use two segments in the NER task, i.e., the original sequence as segment A and the label sequence as segment B?

I think segment B is used for question-answering-style tasks. What I am doing here is to put the original sequence in segment A, leave segment B empty, and make the labels a list with the same length as A.
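
Concretely, that single-segment setup might look like the following sketch (a hypothetical helper, not from the repo; tag id 0 is assumed to be the padding/"O" label, and only the first wordpiece of each word gets a non-zero loss weight):

```python
def make_ner_features(tokenizer, words, word_label_ids, max_seq_len=128):
    tokens, label_ids, label_weights = ["[CLS]"], [0], [0.0]
    for word, label_id in zip(words, word_label_ids):
        for i, piece in enumerate(tokenizer.tokenize(word)):
            tokens.append(piece)
            # label only the first wordpiece; later pieces get weight 0.0
            label_ids.append(label_id if i == 0 else 0)
            label_weights.append(1.0 if i == 0 else 0.0)
    tokens.append("[SEP]")
    label_ids.append(0)
    label_weights.append(0.0)

    input_ids = tokenizer.convert_tokens_to_ids(tokens)
    input_mask = [1] * len(input_ids)
    segment_ids = [0] * len(input_ids)  # everything in segment A, segment B left empty

    pad = max_seq_len - len(input_ids)  # pad all features to max_seq_len
    input_ids += [0] * pad
    input_mask += [0] * pad
    segment_ids += [0] * pad
    label_ids += [0] * pad
    label_weights += [0.0] * pad
    return input_ids, input_mask, segment_ids, label_ids, label_weights
```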

@maksna

maksna commented Nov 19, 2018

> I think segment B is used for question-answering-style tasks. What I am doing here is to put the original sequence in segment A, leave segment B empty, and make the labels a list with the same length as A.

Thank you.

@Albert-Ma

Hi, can you release the hyperparameters you used when fine-tuning for the NER task?

I used the defaults and the F1 is not as good as reported in your paper.

Thanks.
@jacobdevlin-google

@TanyaZhao

@jacobdevlin-google
Hi, I don't quite understand how to use the BERT hidden states as input to the downstream NER task (on CoNLL-2003).
My current fine-tuning process is as follows:

(1) Given a batch of sentences, first convert them into wordpiece tokens and pad to the batch max_seq_len.
(2) Feed these wordpiece tokens into a pre-trained BERT-base model (cased) and get the 12 layers of hidden states.
(3) Fetch the hidden states of the last layer, and gather the hidden state of the first sub-token of each word according to the orig_to_tok_map mentioned in https://github.com/google-research/bert#tokenization (a sketch of this gather follows the list).
(4) Finally, feed those hidden states into a classifier (or a BiLSTM NER model).
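
A minimal sketch of the gather in step (3) (TF 1.x; the padded [batch, max_words] shape of orig_to_tok_map is an assumption):

```python
import tensorflow as tf

def gather_first_subtokens(sequence_output, orig_to_tok_map):
    # sequence_output: [batch, seq_len, hidden] hidden states of the last layer
    # orig_to_tok_map: [batch, max_words] int32 index of each word's first sub-token
    batch_size = tf.shape(sequence_output)[0]
    max_words = tf.shape(orig_to_tok_map)[1]
    batch_idx = tf.tile(tf.expand_dims(tf.range(batch_size), 1), [1, max_words])
    indices = tf.stack([batch_idx, orig_to_tok_map], axis=-1)  # [batch, max_words, 2]
    return tf.gather_nd(sequence_output, indices)              # [batch, max_words, hidden]
```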

I adopted the hyperparameters described in the paper (batch_size=16, lr(Adam)=5e-5) and trained for 100 epochs on a Tesla P100 GPU, but I can't reach the results reported in the paper.

Would you mind giving me some advice? Thank you so much!

@maziyarpanahi

> Hi, I don't quite understand how to use the BERT hidden states as input to the downstream NER task (on CoNLL-2003). [...] I adopted the hyperparameters described in the paper (batch_size=16, lr(Adam)=5e-5) and trained for 100 epochs on a Tesla P100 GPU, but I can't reach the results reported in the paper.

Hi @TanyaZhao,

If you don't mind me asking, what was the maximum F1 you achieved with these steps? (I am just wondering how far it is from the one in the paper and other state-of-the-art scores.)

Many thanks

@ghaddarAbs

In order to reproduce the CoNLL scores reported in the BERT paper (92.4 for BERT-base and 92.8 for BERT-large), one trick is to apply a truecaser to article titles (all-uppercase sentences) as a preprocessing step for the CoNLL train/dev/test sets. This can be done with the following method.

```python
# https://github.com/daltonfury42/truecase
# pip install truecase
import re

import truecase

# original tokens:
# ['FULL', 'FEES', '1.875', 'REOFFER', '99.32', 'SPREAD', '+20', 'BP']

def truecase_sentence(tokens):
    word_lst = [(w, idx) for idx, w in enumerate(tokens) if all(c.isalpha() for c in w)]
    lst = [w for w, _ in word_lst if re.match(r'\b[A-Z\.\-]+\b', w)]

    # only truecase if every purely alphabetic word is upper-cased
    if len(lst) and len(lst) == len(word_lst):
        parts = truecase.get_true_case(' '.join(lst)).split()

        # the truecaser has its own tokenization ...
        # skip if the number of words doesn't match
        if len(parts) != len(word_lst):
            return tokens

        for (w, idx), nw in zip(word_lst, parts):
            tokens[idx] = nw

    return tokens

# truecased tokens:
# ['Full', 'fees', '1.875', 'Reoffer', '99.32', 'spread', '+20', 'BP']
```

With these configurations and preprocessing, I was able to reach 92.8 with SpanBERT-large:

  • SpanBERT-large
  • lr = 5e-6
  • train_epochs = 40
  • batch_size = 96
  • max_seq_len = 64 (you need to intelligently split the sequences)
  • dropout (hidden/attention) = 0.2
  • CRF (gather the first sub-tokens and apply a CRF; see the sketch below)
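
A rough sketch of that CRF step (TF 1.x, using tf.contrib.crf; the tensor names are assumptions, not the commenter's code):

```python
import tensorflow as tf

def crf_layer(word_reprs, tag_ids, seq_lengths, num_tags):
    # word_reprs:  [batch, max_words, hidden] gathered first sub-token states
    # tag_ids:     [batch, max_words] gold tag ids
    # seq_lengths: [batch] number of real words per sentence
    logits = tf.layers.dense(word_reprs, num_tags)  # unary potentials
    log_likelihood, transition_params = tf.contrib.crf.crf_log_likelihood(
        logits, tag_ids, seq_lengths)
    loss = tf.reduce_mean(-log_likelihood)
    pred_tags, _ = tf.contrib.crf.crf_decode(logits, transition_params, seq_lengths)
    return loss, pred_tags
```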

To use a large batch_size (96 or 128) for fine-tuning, you can use the method below:

```python
import math
import re


def split_long_sequence(tokenizer, tokens, tags, max_seq_len=64):
    """Recursively split (tokens, tags) so each part fits in max_seq_len wordpieces."""
    # start counting at 1 because of the [CLS] token
    count, punct_index, tag_window = 1, [], []
    tmp_tags = [0, 0] + tags + [0, 0]

    for idx, token in enumerate(tokens):
        bert_lst = tokenizer.tokenize(token)
        count += len(bert_lst)

        # candidate split points: punctuation tokens ...
        if re.match(r'[^a-zA-Z0-9]', token):
            punct_index.append(idx)

        # ... and positions surrounded only by tag id 0 ('O')
        t_idx = idx + 2
        if idx and all(t == 0 for t in tmp_tags[t_idx - 2:t_idx + 2]):
            tag_window.append(idx)

    if count < max_seq_len:
        return [(tokens, tags)]

    # prefer splitting inside an all-'O' window, then at punctuation,
    # otherwise just cut in the middle
    pick_lst = tag_window if tag_window else punct_index
    if not pick_lst:
        mid = len(tokens) // 2
    else:
        index_lst = [(i, math.fabs(i - len(tokens) // 2)) for i in pick_lst]
        index_lst.sort(key=lambda x: x[1])
        mid = index_lst[0][0]

    l1 = split_long_sequence(tokenizer, tokens[:mid], tags[:mid], max_seq_len)
    l2 = split_long_sequence(tokenizer, tokens[mid:], tags[mid:], max_seq_len)

    return l1 + l2


# usage
seq_lst = split_long_sequence(tokenizer, tokens, tags)
for sent_part_num, (tok_lst, tag_lst) in enumerate(seq_lst):
    if not tok_lst:
        continue
    # ... run the model on each part; after decoding just concatenate the parts
```
