Explicit context boundaries in Transformer embeddings #3073
Conversation
Hi @alanakbik,
flair/embeddings/transformer.py (Outdated)

@@ -624,6 +627,10 @@ def __expand_sentence_with_context(self, sentence) -> Tuple[List[Token], int]:
        left_context = sentence.left_context(self.context_length, self.respect_document_boundaries)
        right_context = sentence.right_context(self.context_length, self.respect_document_boundaries)

        if self.use_context_separator:
            left_context = left_context + [Token("[KONTEXT]")]
            right_context = [Token("[KONTEXT]")] + right_context
Can we call this token [CONTEXT] instead of [KONTEXT]? I think we benefit from not mixing up two languages.
From my understanding, changing the text of the context token won't change the training, as it is added as a special token.
Besides that, I would prefer that the text be extracted to a constant (similar to the START_TOKEN in https://github.com/flairNLP/flair/blob/master/flair/models/sequence_tagger_utils/crf.py#L5), so that a typo in a single string cannot break the code in the future.
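A minimal sketch of what this suggestion could look like (the constant name SENTENCE_BOUNDARY_TAG is hypothetical, not from the PR; Token is flair's real token class):

```python
from flair.data import Token

# Hypothetical module-level constant, mirroring START_TAG in
# flair/models/sequence_tagger_utils/crf.py, so the separator text lives
# in exactly one place and a typo cannot silently diverge elsewhere.
SENTENCE_BOUNDARY_TAG = "[CONTEXT]"

# ... inside TransformerEmbeddings.__expand_sentence_with_context ...
if self.use_context_separator:
    left_context = left_context + [Token(SENTENCE_BOUNDARY_TAG)]
    right_context = [Token(SENTENCE_BOUNDARY_TAG)] + right_context
```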
Thanks for the feedback! I trained without the dev set, so the numbers are actually pretty much the same as in the FLERT paper.
This PR adds the option of setting a "context separator" in Transformer embeddings. If set, a new special token is added to the transformer tokenizer's vocabulary. This token is placed at the beginning and end of each Sentence to separate it from its surrounding context.
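For reference, a minimal sketch (not the PR's actual code) of how a special token is typically registered with a Hugging Face tokenizer so that it is never split into subwords:

```python
from transformers import AutoModel, AutoTokenizer

# Illustrative model choice; any transformer works the same way.
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-large")
model = AutoModel.from_pretrained("xlm-roberta-large")

# Register the separator as an additional special token and grow the
# model's embedding matrix by one row to cover the new vocabulary entry.
tokenizer.add_special_tokens({"additional_special_tokens": ["[KONTEXT]"]})
model.resize_token_embeddings(len(tokenizer))
```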
To use the example from #3063: assume the current sentence is "Peter Blackburn", the previous sentence ends with "to boycott British lamb .", and the next sentence starts with "BRUSSELS 1996-08-22 The European Commission".

In this case:

1. With use_context_separator=False, the embedding is produced from this string: "to boycott British lamb . Peter Blackburn BRUSSELS 1996-08-22 The European Commission"
2. With use_context_separator=True, the embedding is produced from this string: "to boycott British lamb . [KONTEXT] Peter Blackburn [KONTEXT] BRUSSELS 1996-08-22 The European Commission"

Option 1 is how it worked before. Option 2 is the new default and explicitly marks the boundaries of the current sentence using the [KONTEXT] token.
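A hedged usage sketch of how the option might be enabled (the parameter name use_context_separator comes from this PR; the model choice and other arguments are illustrative):

```python
from flair.data import Sentence
from flair.embeddings import TransformerWordEmbeddings

# use_context enables FLERT-style neighboring-sentence context;
# use_context_separator (added in this PR) additionally wraps the
# current sentence in the special boundary token.
embeddings = TransformerWordEmbeddings(
    "xlm-roberta-large",
    use_context=True,
    use_context_separator=True,
)

sentence = Sentence("Peter Blackburn")
embeddings.embed(sentence)  # tokens now carry context-aware embeddings

# Note: neighboring-sentence context only takes effect when a sentence
# has previous/next sentences set, e.g. when it comes from a flair corpus.
```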
To evaluate, we trained FLERT models on CoNLL-03 for two transformers and different context markers. Each setup was trained with 7 different seeds; the reported test F1 is averaged over all 7 runs, with standard deviation.
The context marker generally does not seem to hurt the F1 score, and it will potentially enable us to address the context issue described in #3063.