
bert-crf #29

Open
Astudnew opened this issue Feb 4, 2021 · 3 comments

Comments


Astudnew commented Feb 4, 2021

Hello,
In this part of your code in the BertCRF class (forward function), you write that it passes only the first subtoken of each word, but I don't understand how this happens (the lengths of seq_logits and seq_labels do not change; they are the same length as the subtokens, including CLS and SEP):

"for seq_logits, seq_labels, seq_mask in zip(logits, labels, mask):
# Index logits and labels using prediction mask to pass only the
# first subtoken of each word to CRF.
seq_logits = seq_logits[seq_mask].unsqueeze(0)
seq_labels = seq_labels[seq_mask].unsqueeze(0)
loss -= self.crf(seq_logits, seq_labels,
reduction='token_mean')"

@fabiocapsouza (Contributor)

Hi @Phd-Student2018,
I don't know if I understood your question, but here is an example of this indexing:

Suppose we have the following words, tokens and labels:

words = ["My", "name", "is", "Fabio"]
tokens = ["[CLS]", "My", "name", "is", "Fa", "##bio", "[SEP]"]
label_tags = ["X", "O", "O", "O", "B-PERSON", "X", "X"]  # X is ignore
labels = [-100, 0, 0, 0, 1, -100, -100]   # label tags converted to int ids
seq_mask = [False, True, True, True, True, False, False]   # False for special tokens and word continuations ("##")

# The CRF layer must receive only the logits and labels of the tokens ["My", "name", "is", "Fa"]
# B = batch size
# S = sequence length
# C = number of classes/tags
# logits.shape == (B, S, C)
# labels.shape == (B, S)
# After zip:
# seq_logits.shape == (S, C)
# seq_labels.shape == (S,)

# The indexing of seq_logits and seq_labels by seq_mask will produce:
# seq_logits.shape == (P, C)
# seq_labels.shape == (P,)
# The unsqueeze adds back the batch dimension: (1, P, C) and (1, P)

P is the number of words given by basic whitespace and punctuation tokenization, P = seq_mask.sum()
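To make the shapes concrete, here is a minimal runnable sketch of that indexing on the sentence above. The emissions are random and the CRF layer is assumed to be pytorch-crf's torchcrf.CRF (its reduction='token_mean' matches the call in the snippet), so this is an illustration rather than the repository's actual forward pass:

import torch
from torchcrf import CRF   # assumed: the pytorch-crf package

C = 2                                    # number of tags: O and B-PERSON in this toy example
crf = CRF(C, batch_first=True)

# One sequence following the example above:
# tokens = ["[CLS]", "My", "name", "is", "Fa", "##bio", "[SEP]"]
logits = torch.randn(1, 7, C)                                            # (B, S, C), random emissions for illustration
labels = torch.tensor([[-100, 0, 0, 0, 1, -100, -100]])                  # (B, S), -100 on ignored positions
mask = torch.tensor([[False, True, True, True, True, False, False]])     # (B, S), prediction mask

loss = 0.0
for seq_logits, seq_labels, seq_mask in zip(logits, labels, mask):
    seq_logits = seq_logits[seq_mask].unsqueeze(0)   # (1, P, C), here P = seq_mask.sum() = 4
    seq_labels = seq_labels[seq_mask].unsqueeze(0)   # (1, P); the -100 positions are gone
    loss -= crf(seq_logits, seq_labels, reduction='token_mean')

print(loss)   # negative mean log-likelihood over the 4 first-subtoken positions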

Hope it helps


Astudnew commented Feb 5, 2021

Yes, it is very helpful.
Thank you very much.


Astudnew commented Feb 5, 2021

Please, another question:
For testing, to compare the prediction list with the original label list (y_true), how can we get y_true?
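One possible way to build y_true, sketched under the assumption that the gold label ids and the same prediction mask used for the loss are available at test time; the id2tag mapping and variable names below are illustrative, not from the repository:

import torch

id2tag = {0: "O", 1: "B-PERSON"}   # illustrative mapping; adjust to your tag set

labels = torch.tensor([[-100, 0, 0, 0, 1, -100, -100]])                  # (B, S) gold ids
mask = torch.tensor([[False, True, True, True, True, False, False]])     # (B, S) prediction mask

y_true = []
for seq_labels, seq_mask in zip(labels, mask):
    # Keep only the first subtoken of each word, exactly as done for the loss,
    # then map the ids back to tag strings.
    y_true.append([id2tag[i.item()] for i in seq_labels[seq_mask]])

print(y_true)   # [['O', 'O', 'O', 'B-PERSON']] -- same length as the prediction list for that sentence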
