Feature extraction for sequential labelling #64
Well that seems like a good approach. Maybe you can find some inspiration in the code of the |
Thanks. It worked. However, an interesting issue with BERT is that it's highly sensitive to the learning rate, which makes it very difficult to combine with other models. |
@zhaoxy92 what sequence labeling task are you doing? I've got CoNLL'03 NER running with the [...]. The best dev F1 score I've gotten after [...]. The best configuration for me so far is:
Also, properly averaging the loss is important: not just [...]. Another tip: truncating the input (#66) enables much larger batch sizes. Without it the largest possible batch size was 56, but with truncation 160 is possible. |
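As an illustration of the loss-averaging point, here is a sketch only; the shapes, names, and the -1 padding label are assumptions consistent with the tagger code later in this thread. The loss is summed over real tokens and divided by the number of real predictions rather than the batch size:

```python
import torch
import torch.nn.functional as F

# Assumed shapes: pred_logits (batch, seq_len, n_labels); true_labels (batch, seq_len) with -1 at padded positions.
mask = true_labels != -1
loss = F.cross_entropy(pred_logits[mask], true_labels[mask], reduction="sum")
loss = loss / mask.float().sum()  # average over the number of real tokens, not the batch size
```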
I am also working on CoNLL'03, with similar results to what you got. |
@bheinzerling with the risk of going off topic here, would you mind sharing your code? I'd love to read and adapt it for a similar sequential classification task. |
I have some code for preparing batches here: [...] The important methods are subword_tokenize_to_ids and subword_tokenize, you can probably ignore the other stuff. With this, feature extraction for each sentence, i.e. a list of tokens, is simply:

```python
bert = dougu.bert.Bert.Model("bert-base-cased")
featurized_sentences = []
for tokens in sentences:
    features = {}
    features["bert_ids"], features["bert_mask"], features["bert_token_starts"] = bert.subword_tokenize_to_ids(tokens)
    featurized_sentences.append(features)
```

Then I use a custom collate function for a DataLoader that turns featurized_sentences into batches:

```python
def collate_fn(featurized_sentences_batch):
    bert_batch = [
        torch.cat([features[key] for features in featurized_sentences_batch], dim=0)
        for key in ("bert_ids", "bert_mask", "bert_token_starts")]
    return bert_batch
```

A simple sequence tagger module would look something like this:

```python
import torch
from torch.nn.functional import cross_entropy
from torch.nn.utils.rnn import pad_sequence
from pytorch_pretrained_bert import BertModel  # package name at the time of this thread


class SequenceTagger(torch.nn.Module):
    def __init__(self, data_parallel=True):
        super().__init__()
        bert = BertModel.from_pretrained("bert-base-cased").to(device=torch.device("cuda"))
        if data_parallel:
            self.bert = torch.nn.DataParallel(bert)
        else:
            self.bert = bert
        bert_dim = 768  # (or get the dim from BertEmbeddings)
        n_labels = 5  # need to set this for your task
        self.out = torch.nn.Linear(bert_dim, n_labels)
        ...  # dropout, log_softmax...

    def forward(self, bert_batch, true_labels):
        bert_ids, bert_mask, bert_token_starts = bert_batch
        # truncate to longest sequence length in batch (usually much smaller than 512) to save GPU RAM
        max_length = (bert_mask != 0).max(0)[0].nonzero()[-1].item()
        if max_length < bert_ids.shape[1]:
            bert_ids = bert_ids[:, :max_length]
            bert_mask = bert_mask[:, :max_length]

        segment_ids = torch.zeros_like(bert_mask)  # dummy segment IDs, since we only have one sentence
        bert_last_layer = self.bert(bert_ids, segment_ids)[0][-1]
        # select the states representing each token start, for each instance in the batch
        bert_token_reprs = [
            layer[starts.nonzero().squeeze(1)]
            for layer, starts in zip(bert_last_layer, bert_token_starts)]
        # need to pad because sentence length varies
        padded_bert_token_reprs = pad_sequence(
            bert_token_reprs, batch_first=True, padding_value=-1)
        # output/classification layer: input bert states and get log probabilities for cross-entropy loss
        pred_logits = self.log_softmax(self.out(self.dropout(padded_bert_token_reprs)))
        mask = true_labels != -1  # I did set label = -1 for all padding tokens somewhere else
        loss = cross_entropy(pred_logits, true_labels)
        # average/reduce the loss according to the actual number of predictions (i.e. one prediction per token)
        loss /= mask.float().sum()
        return loss
```

Wrote this without checking if it runs (my actual code is tied into some other things, so I cannot just copy&paste it), but it should help you get started. |
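For readers trying to wire the above together, here is a minimal, unverified sketch of the surrounding glue code; the dataset layout, hyperparameters, and the collate_with_labels helper are assumptions added here, not part of the original comment:

```python
import torch
from torch.utils.data import DataLoader
from torch.nn.utils.rnn import pad_sequence

def collate_with_labels(batch):
    # batch: list of (features_dict, label_tensor) pairs -- an assumed data layout
    featurized, labels = zip(*batch)
    bert_batch = collate_fn(featurized)
    true_labels = pad_sequence(labels, batch_first=True, padding_value=-1)  # -1 marks padding, as in the tagger above
    return bert_batch, true_labels

dataset = list(zip(featurized_sentences, label_tensors))  # label_tensors: assumed per-sentence label tensors
loader = DataLoader(dataset, batch_size=16, shuffle=True, collate_fn=collate_with_labels)

tagger = SequenceTagger()
optimizer = torch.optim.Adam(tagger.parameters(), lr=5e-5)  # assumed optimizer and learning rate

for bert_batch, true_labels in loader:
    bert_batch = [t.cuda() for t in bert_batch]  # the tagger above keeps BERT on the GPU
    true_labels = true_labels.cuda()
    loss = tagger(bert_batch, true_labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```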
@bheinzerling Thanks a lot for the starter, got awesome results! |
Thanks for sharing these tips here! It helps a lot. I tried to fine-tune BERT on multiple imbalanced datasets and found the results quite unstable... By an imbalanced dataset, I mean there are many more O labels than other labels under the {B, I, O} tagging scheme. I tried a weighted cross-entropy loss but the performance is still not as expected. Has anyone met the same issue? Thanks! |
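For the weighted cross-entropy mentioned above, a minimal sketch in PyTorch; the label indexing, the weight values, and the -1 padding label are assumptions, not settings from the commenter:

```python
import torch

# Assumed label indexing: 0 = O, 1 = B, 2 = I; weights chosen arbitrarily to down-weight the frequent O class.
class_weights = torch.tensor([0.1, 1.0, 1.0])
criterion = torch.nn.CrossEntropyLoss(weight=class_weights, ignore_index=-1)  # -1 = padding label, as used earlier in the thread

# logits: (batch, seq_len, n_labels); labels: (batch, seq_len)
loss = criterion(logits.view(-1, logits.size(-1)), labels.view(-1))
```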
Hi~@bheinzerling |
@kugwzk I didn't do any more CoNLL'03 runs since the numbers reported in the BERT paper were apparently achieved by using document context, which is different from the standard sentence-based evaluation. You can find more details here: allenai/allennlp#2067 (comment) |
Hmmm... I think they should state that in the paper... And do you know where to find that they used document context? |
That's what the folks over at allennlp said. I don't know where they got this information, maybe personal communication with one of the BERT authors? |
Anyway, thank you very much for telling me that. |
https://github.com/kamalkraj/BERT-NER |
https://github.com/JianLiu91/bert_ner gives a solution that is very easy to understand. |
Hi all, I am trying to train the BERT model on some data that I have. However, I am having trouble understanding how to adjust the labels following tokenization. I am trying to perform word-level classification (similar to NER). If I have the following tokenized sentence and its labels:
Then after using the BERT tokenizer I get the following: [...] Also, I adjust my label array as follows: [...] N.B. Tokens such as eng-30-01258617-a are not tokenized further, as I included an ignore list which contains words and tokens that I do not want tokenized, and I swapped them with the [unusedXXX] tokens found in the vocab.txt file. Notice how the last word 'frailty' is transformed into ['frail', '##ty'] and the label '1', which was used for the whole word, is now placed under each word piece. Is this the correct way of doing it? If you would like a more in-depth explanation of what I am trying to achieve, you can read the following: https://stackoverflow.com/questions/56129165/how-to-handle-labels-when-using-the-berts-wordpiece-tokenizer Any help would be greatly appreciated! Thanks in advance |
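For illustration only (this is not the asker's actual data): copying a word-level label onto every word piece, as described above, would look roughly like this, with an assumed sentence, label array, and tokenizer:

```python
from pytorch_pretrained_bert import BertTokenizer  # the package name at the time of this thread

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")  # assumed model/vocab

words = ["he", "shows", "signs", "of", "frailty"]
word_labels = [0, 0, 0, 0, 1]  # assumed word-level labels

subwords, subword_labels = [], []
for word, label in zip(words, word_labels):
    pieces = tokenizer.tokenize(word)             # e.g. "frailty" -> ["frail", "##ty"], as described above
    subwords.extend(pieces)
    subword_labels.extend([label] * len(pieces))  # copy the word's label onto each piece

# subwords:       ['he', 'shows', 'signs', 'of', 'frail', '##ty']
# subword_labels: [0, 0, 0, 0, 1, 1]
```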
@dangal95, adjusting the original labels is probably not the best way. A simpler method that works well is described in this issue, here #64 (comment) |
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. |
@nijianmo Hi, I am currently considering using a weighted loss for an NER task. I wonder if you have tried a weighted CRF or weighted softmax in a PyTorch implementation. If so, did you get good performance? Thanks in advance. |
This repository does not use a CRF for NER classification? Anyway, the parameters of a CRF depend on the data distribution you have. These links might be useful: https://towardsdatascience.com/conditional-random-field-tutorial-in-pytorch-ca0d04499463 and https://pytorch.org/tutorials/beginner/nlp/advanced_tutorial.html |
@srslynow Thanks for your answer! I am familiar with CRFs, but I am somewhat confused about how to set the weight decay when the CRF is connected to BERT. Neither the authors nor huggingface seem to have mentioned how to set weight decay for parts besides the BERT structure. |
Thanks to #64 (comment), I could get the implementation to work - for anyone else that's struggling to reproduce the results: https://github.com/chnsh/BERT-NER-CoNLL |
BERT-NER in Tensorflow 2.0 |
Hi, I am trying to make your code work, and here is my setup: I re-declare as free functions and constants everything that is needed,
and then I try to add your extra code.
It is [...]. Some questions:
Is this the same with [...]? Also, I do not understand what the comment means (# dummy segment IDs, since we only have one sentence).
|
Does this development make this conversation outdated? Can you please clarify? transformers/examples/utils_ner.py Line 85 in 93d2fff
|
I guess so |
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. |
Hi @nijianmo, did you find any workaround for this? Thanks! |
Hi everyone! Thanks for your posts! I was wondering - could anyone post an explicit example of what properly formatted data for NER using BERT would look like? It is not entirely clear to me from the paper and the comments I've found. Let's say we have the following sentence and labels:
Would data that we input to the model be something like this:
? Thank you! |
|
@bheinzerling |
When I wrote that code, |
Hi, could you explain why adjusting the original labels is not suggested? It seems quite easy and straightforward.

```python
# reference: https://github.com/huggingface/transformers/issues/64#issuecomment-443703063
def flatten(list_of_lists):
    for list in list_of_lists:
        for item in list:
            yield item


def subword_tokenize(tokens, labels):
    assert len(tokens) == len(labels)
    subwords = list(map(tokenizer.tokenize, tokens))
    subword_lengths = list(map(len, subwords))
    subwords = [CLS] + list(flatten(subwords)) + [SEP]
    token_start_idxs = 1 + np.cumsum([0] + subword_lengths[:-1])
    bert_labels = [[label] + (sublen - 1) * ["X"] for sublen, label in zip(subword_lengths, labels)]
    bert_labels = ["O"] + list(flatten(bert_labels)) + ["O"]
    assert len(subwords) == len(bert_labels)
    return subwords, token_start_idxs, bert_labels
```
|
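A quick, hypothetical usage example for the subword_tokenize function quoted above; the tokenizer, the CLS/SEP constants, and the sample sentence are assumptions added here for illustration, not part of the original comment:

```python
import numpy as np
from pytorch_pretrained_bert import BertTokenizer  # the package name at the time of this thread

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
CLS, SEP = "[CLS]", "[SEP]"

tokens = ["John", "lives", "in", "Ramat", "Gan"]
labels = ["B-PER", "O", "O", "B-LOC", "I-LOC"]

subwords, token_start_idxs, bert_labels = subword_tokenize(tokens, labels)
# subwords: the word pieces wrapped in [CLS]/[SEP]
# token_start_idxs: index of the first piece of each original token
# bert_labels: the original label on each first piece, "X" on continuation pieces
```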
Hello, if we have the following sentence:
Would “Johanson” be processed like this?
or like this?
thank you! |
The middle one is right; you need to add an 'I-PERS' label to the labels. |
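To make the answer above concrete, a small illustration (the word split and tag names are assumed here, since the original examples were not preserved in this thread):

```python
# Word-level annotation (assumed example):
words = ["Johanson", "lives", "here"]
word_labels = ["B-PERS", "O", "O"]

# After WordPiece tokenization, the continuation piece receives an I-PERS label,
# as suggested in the reply above:
subwords = ["johan", "##son", "lives", "here"]
subword_labels = ["B-PERS", "I-PERS", "O", "O"]
```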
Hello, I'm confused about the labels for the [CLS] and [PAD] tokens. Assume that I originally have 4 labels for each word, [0, 1, 2, 3, 4]; should I add [CLS] and [PAD] as another label? I see that in the example here [CLS] and [SEP] take label '2'. Does making the attention 0 for those positions solve this? |
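One common pattern (not an answer from this thread's participants, just a sketch consistent with the label = -1 masking used earlier in the thread): give [CLS], [SEP], and [PAD] a sentinel label and exclude those positions from the loss instead of adding a new label class. The label values and lengths below are assumptions:

```python
# Assumed: real word labels are 0..4; special and padded positions get the sentinel -1.
IGNORE = -1

word_labels = [0, 3, 3, 1]                    # assumed labels for a 4-word sentence
max_len = 8                                   # assumed padded sequence length
labels = [IGNORE] + word_labels + [IGNORE]    # [CLS] ... [SEP]
labels += [IGNORE] * (max_len - len(labels))  # [PAD] positions
# -> [-1, 0, 3, 3, 1, -1, -1, -1]
# Positions labelled -1 are masked out of the loss, so no extra class is needed for [CLS], [SEP], or [PAD].
```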
This repository has shown how to add a CRF layer on top of transformers to get better performance on token classification tasks. |
Thanks a lot @shushanxingzhe |
@shushanxingzhe: I think you are using the label 'O' as the padding label in your code. From my point of view, you should have a separate 'PAD' label for padding instead of using the 'O' label. |
Could someone please tell me how to use CRF decoding with padding? When I code it as below, I always get the error expected seq=18 but got 13 on the next line, "tags = torch.Tensor(tags)". |
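A guess at what goes wrong, assuming the pytorch-crf package is being used: crf.decode returns one Python list of tags per sequence with padding already stripped, so the lists have different lengths and cannot be turned into a tensor directly; padding them to a common length first avoids the "expected seq=18 but got 13" error. A sketch with assumed names and shapes:

```python
import torch
from torchcrf import CRF  # pip install pytorch-crf (assumed to be the CRF in use)

num_tags = 5  # assumed
crf = CRF(num_tags, batch_first=True)

# emissions: (batch, seq_len, num_tags); mask: (batch, seq_len), 1 for real tokens, 0 for padding
decoded = crf.decode(emissions, mask=mask)  # list of lists with different lengths

# Pad each decoded sequence to the batch's max length before stacking into a tensor.
pad_tag = 0  # assumed padding tag id
max_len = emissions.size(1)
tags = torch.tensor(
    [seq + [pad_tag] * (max_len - len(seq)) for seq in decoded],
    dtype=torch.long)
```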
Can we just remove the non-first subtokens during feature processing if we are treating the NER problem as a classification problem? Example: cleaned_sent = ['[CLS]', 'john', 'johan', 'lives', 'in', 'ramat', 'gan', '.', '[SEP]'] |
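A sketch of the filtering described in the question above (the token list is assumed; whether discarding continuation pieces is advisable is a separate question):

```python
tokens = ['[CLS]', 'john', 'johan', '##son', 'lives', 'in', 'ramat', 'gan', '.', '[SEP]']

# Keep special tokens and first sub-tokens only; WordPiece marks continuations with '##'.
cleaned_sent = [tok for tok in tokens if not tok.startswith("##")]
# -> ['[CLS]', 'john', 'johan', 'lives', 'in', 'ramat', 'gan', '.', '[SEP]']
```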
Hi, I have a question about using BERT for a sequence labeling task.
Please correct me if I'm wrong.
My understanding is:
Is this entire process correct? I followed this procedure but could not get any results.
Thank you!