Plans to release sequence tagging task fine-tuning code? #33
It seems that the fine-tuning code for the CoNLL-2003 NER task (as described in the paper) isn't in the current release. Any plan for releasing that part?

Comments
For maintainability and simplicity, this is all we're planning on releasing other than the multilingual models and the GPU memory workaround (keep in mind we don't actually have clean standalone implementations of NER, SWAG, or the other GLUE tasks; the implementations we used in the paper are much messier). For NER we used the recipe at the bottom of the Tokenization section of the README.
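For reference, that recipe keeps a map from each original token to its first wordpiece, so labels are assigned to (and predictions read from) only the first sub-token of each word. A rough sketch along those lines, assuming the repo's `tokenization` module and a `vocab_file` path (the example tokens and labels are illustrative):

```python
import tokenization  # from the BERT repo

vocab_file = "vocab.txt"  # assumed path to the released BERT vocabulary
tokenizer = tokenization.FullTokenizer(vocab_file=vocab_file, do_lower_case=True)

orig_tokens = ["John",  "Johanson", "'s", "house"]
labels      = ["B-PER", "I-PER",    "O",  "O"]

bert_tokens = ["[CLS]"]
orig_to_tok_map = []  # index of the first wordpiece of each original token
for orig_token in orig_tokens:
    orig_to_tok_map.append(len(bert_tokens))
    bert_tokens.extend(tokenizer.tokenize(orig_token))
bert_tokens.append("[SEP]")

# Labels (and predictions) are attached only to the positions in
# orig_to_tok_map; continuation wordpieces and [CLS]/[SEP] get no label.
```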
Understood. Thanks for the quick clarification.
For the sequence tagging task, how should we design the loss function?
Generally for tagging tasks you do a reduce-mean across all tokens in the batch. But you will need to handle the padding examples (and potentially wordpieces that don't have a label) correctly, i.e., by having a weight of 1.0 for the tokens that have a prediction and 0.0 for the tokens that don't. You can use
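A minimal sketch of that weighting scheme, in TF 1.x style to match the repo (the function and tensor names here are assumptions, not part of the released code):

```python
import tensorflow as tf  # TF 1.x, as used by the BERT repo

def tagging_loss(logits, label_ids, label_weights, num_labels):
    """Weighted token-level cross-entropy.

    logits:        [batch, seq_len, num_labels] float scores
    label_ids:     [batch, seq_len] int label ids (arbitrary at unlabeled positions)
    label_weights: [batch, seq_len] float, 1.0 where the token has a real label
                   (first wordpiece of a word), 0.0 for padding and continuation
                   wordpieces.
    """
    log_probs = tf.nn.log_softmax(logits, axis=-1)
    one_hot_labels = tf.one_hot(label_ids, depth=num_labels, dtype=tf.float32)
    per_token_loss = -tf.reduce_sum(one_hot_labels * log_probs, axis=-1)
    numerator = tf.reduce_sum(label_weights * per_token_loss)
    denominator = tf.reduce_sum(label_weights) + 1e-5
    return numerator / denominator
```

This mirrors how the repo's masked-LM loss averages only over the positions that carry a label.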
Awesome. Thanks for the quick reply :)
Hi, thanks so much for this impressive work! In order to do NER, what kind of model could I use on top of the token embeddings from BERT?
Do you use two segments in the NER task, with the original sequence as segment A and the label sequence as segment B?
I think segment B is used for question-answering-style tasks. What I am doing here is to put the original sequence in segment A, leave segment B empty, and make the labels a list with the same length as A.
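For concreteness, a sketch of the single-segment feature layout that approach implies (all field names and id values here are illustrative, not from the released code):

```python
# One padded example, max_seq_len = 8 (illustrative ids; label ids: O=1, B-PER=3, I-PER=4)
tokens        = ["[CLS]", "john", "johan", "##son", "lives", "[SEP]", "[PAD]", "[PAD]"]
input_ids     = [   101,   2198,    4216,    3385,    3268,     102,       0,       0]
input_mask    = [     1,      1,       1,       1,       1,       1,       0,       0]
segment_ids   = [     0,      0,       0,       0,       0,       0,       0,       0]  # segment A only
label_ids     = [     0,      3,       4,       0,       1,       0,       0,       0]
label_weights = [   0.0,    1.0,     1.0,     0.0,     1.0,     0.0,     0.0,     0.0]  # first wordpieces only
```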
Thank you.
Hi, can you release the hyperparameters you used when fine-tuning on the NER task? I used the defaults and the F1 is not as good as reported in the paper. Thanks.
@jacobdevlin-google Given a batch of sentences, I first convert them into wordpiece tokens and pad to the batch max_seq_len. I adopted the hyperparameters described in the paper (batch_size=16, lr(Adam)=5e-5) and trained for 100 epochs on a Tesla P100 GPU, but I can't get the results mentioned in the paper. Would you mind giving me some advice? Thank you so much!
Hi @TanyaZhao, if you don't mind me asking, what was the maximum F1 you achieved with those steps? (I am just wondering how far it is from the one in the paper or other state-of-the-art scores.) Many thanks.
In order to reproduce the CoNLL scores reported in the BERT paper (92.4 for BERT-base and 92.8 for BERT-large), one trick is to apply a truecaser to article titles (all-uppercase sentences) as a preprocessing step for the CoNLL train/dev/test sets. This can be done with a method like the one below.
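The commenter's original snippet isn't reproduced here; a minimal sketch of that kind of truecasing step, assuming the `truecase` package from PyPI (`pip install truecase`) and applying it only to sentences that are entirely uppercase:

```python
import truecase  # assumption: the PyPI `truecase` package, not the commenter's original code

def maybe_truecase(tokens):
    """Truecase all-uppercase sentences (e.g. CoNLL article titles); leave others unchanged."""
    alpha = [t for t in tokens if t.isalpha()]
    if alpha and all(t.isupper() for t in alpha):
        recased = truecase.get_true_case(" ".join(tokens)).split()
        # Accept the result only if tokenization is preserved 1:1,
        # so the token/label alignment stays intact.
        if len(recased) == len(tokens):
            return recased
    return tokens
```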
With this preprocessing and these configurations, I was able to reach 92.8 with SpanBERT-large.
To use a large batch_size (96 or 128) for fine-tuning, you can use the method below:
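The original method isn't included above; one common way to fit such batch sizes on a single GPU is gradient accumulation. A rough TF 1.x sketch under that assumption (the helper and its arguments are illustrative, not the commenter's code):

```python
import tensorflow as tf  # TF 1.x style, matching the BERT repo

def build_accumulation_ops(loss, accum_steps=8, learning_rate=5e-5):
    """Build ops that average gradients over `accum_steps` micro-batches
    before applying a single optimizer update (simulating a larger batch)."""
    tvars = tf.trainable_variables()
    grads = tf.gradients(loss, tvars)
    accum_vars = [tf.Variable(tf.zeros_like(v), trainable=False) for v in tvars]
    zero_op = [av.assign(tf.zeros_like(av)) for av in accum_vars]
    accum_op = [av.assign_add(g / accum_steps)
                for av, g in zip(accum_vars, grads) if g is not None]
    apply_op = tf.train.AdamOptimizer(learning_rate).apply_gradients(
        zip(accum_vars, tvars))
    return zero_op, accum_op, apply_op
```

In the training loop, run `zero_op`, then `accum_op` on each of the `accum_steps` micro-batches, then `apply_op` once; the effective batch size is `accum_steps` times the per-step batch size.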