
Explicit context boundaries in Transformer embeddings #3073

Merged 6 commits into master on Jan 30, 2023

Conversation

@alanakbik (Collaborator) commented Jan 27, 2023

This PR adds the option of setting a "context separator" in Transformer embeddings. If set, a new special token is added to the transformer tokenizer's dictionary. This token is inserted at the beginning and end of a Sentence to separate it from its context.

To use the example from #3063:

Assume the current sentence is "Peter Blackburn", the previous sentence ends with "to boycott British lamb .", and the next sentence starts with "BRUSSELS 1996-08-22 The European Commission".

In this case,

  1. if use_context_separator=False, the embedding is produced from this string: to boycott British lamb . Peter Blackburn BRUSSELS 1996-08-22 The European Commission
  2. if use_context_separator=True, the embedding is produced from this string: to boycott British lamb . [KONTEXT] Peter Blackburn [KONTEXT] BRUSSELS 1996-08-22 The European Commission

Option 1 is how it worked before. Option 2 is the new default and explicitly marks the boundaries of the current sentence with the [KONTEXT] token.
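The assembly of these two strings can be sketched in plain Python (build_input is an illustrative helper, not Flair's actual API):

```python
# Illustrative sketch of how the two options assemble the transformer
# input; build_input is a made-up helper, not Flair's actual API.
def build_input(left, sentence, right, use_context_separator):
    separator = ["[KONTEXT]"] if use_context_separator else []
    return " ".join(left + separator + sentence + separator + right)

left = ["to", "boycott", "British", "lamb", "."]
sentence = ["Peter", "Blackburn"]
right = ["BRUSSELS", "1996-08-22", "The", "European", "Commission"]

# Option 1: context and sentence run together without any boundary
print(build_input(left, sentence, right, use_context_separator=False))
# Option 2: [KONTEXT] explicitly marks where the current sentence begins and ends
print(build_input(left, sentence, right, use_context_separator=True))
```

Because [KONTEXT] is registered as a special token, it stays a single unit under the transformer tokenizer rather than being split into subwords.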

To evaluate, we trained FLERT models on CoNLL-03 for two transformers and different context markers. Each setup was trained with 7 different seeds; the reported test F1 is the average over all 7 runs, with standard deviation:

| Transformer | Context-Marker | CoNLL-03 Test F1 |
|---|---|---|
| bert-base-uncased | none | 91.52 +- 0.16 |
| bert-base-uncased | [SEP] | 91.38 +- 0.18 |
| bert-base-uncased | [KONTEXT] | 91.56 +- 0.17 |
| xlm-roberta-large | none | 93.73 +- 0.20 |
| xlm-roberta-large | [SEP] | 93.76 +- 0.13 |
| xlm-roberta-large | [KONTEXT] | 93.92 +- 0.14 |
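The aggregation behind each table row can be sketched as follows (the per-seed scores below are illustrative, not the actual run results):

```python
# Sketch of the per-row aggregation: mean and sample standard deviation
# of test F1 over 7 seeded runs. The scores below are illustrative only.
from statistics import mean, stdev

f1_scores = [91.3, 91.5, 91.6, 91.4, 91.7, 91.5, 91.6]  # one value per seed
print(f"{mean(f1_scores):.2f} +- {stdev(f1_scores):.2f}")
```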

The context marker generally does not seem to hurt the F1 score, and it will potentially enable us to address the context issue described in #3063.

@helpmefindaname (Collaborator) commented:

Hi @alanakbik
This looks like a good solution for TARS embeddings and even an improvement for non-TARS embeddings.
I noticed the values you reported are lower than the final score reported in the Flert paper. Did you use the default train_with_dev=False? That would explain the discrepancy.
If you did use train_with_dev=True instead, do you know why the values are lower than the original?

@@ -624,6 +627,10 @@ def __expand_sentence_with_context(self, sentence) -> Tuple[List[Token], int]:
left_context = sentence.left_context(self.context_length, self.respect_document_boundaries)
right_context = sentence.right_context(self.context_length, self.respect_document_boundaries)

if self.use_context_separator:
    left_context = left_context + [Token("[KONTEXT]")]
    right_context = [Token("[KONTEXT]")] + right_context
A Collaborator commented on this diff:

Can we call this token [CONTEXT] instead of [KONTEXT]? I think we benefit from not mixing two languages.
From my understanding, changing the text of the separator won't change the training, as it is added as a special token.

Besides that, I would prefer that this text be extracted to a constant (similar to the START_TOKEN in https://github.com/flairNLP/flair/blob/master/flair/models/sequence_tagger_utils/crf.py#L5), so that a typo in a single string cannot break the code in the future.
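This suggestion could look like the following sketch (CONTEXT_SEPARATOR is a hypothetical constant name mirroring START_TOKEN, not Flair's actual code):

```python
# Sketch of the suggested refactoring: define the separator once as a
# module-level constant so a typo cannot silently diverge between uses.
# CONTEXT_SEPARATOR is a hypothetical name, not Flair's actual constant.
CONTEXT_SEPARATOR = "[CONTEXT]"

def expand_with_separator(left_context, right_context):
    """Wrap the current sentence by appending/prepending the separator."""
    left_context = left_context + [CONTEXT_SEPARATOR]
    right_context = [CONTEXT_SEPARATOR] + right_context
    return left_context, right_context
```

With this, both the tokenizer registration and the context expansion refer to the same string, so they cannot drift apart.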

@alanakbik (Collaborator, Author) commented Jan 30, 2023

> Did you use the default train_with_dev=False? That would explain the discrepancy.

Thanks for the feedback! I trained without dev, so the numbers are actually pretty much the same as in the FLERT paper.

@alanakbik alanakbik merged commit f99ad60 into master Jan 30, 2023
@alanakbik alanakbik deleted the context_boundaries branch January 30, 2023 12:24
alanakbik added a commit that referenced this pull request Jan 31, 2023