
Explicit context boundaries in Transformer embeddings #3073

Merged 6 commits into master on Jan 30, 2023

Conversation

@alanakbik (Collaborator) commented Jan 27, 2023

This PR adds the option of setting a "context separator" in Transformer embeddings. If set, a new special token is added to the transformer tokenizer's dictionary. This token is inserted at the beginning and end of a Sentence to separate it from its context.

To use the example from #3063:

Assume the current sentence is "Peter Blackburn", the previous sentence ends with "to boycott British lamb .", and the next sentence starts with "BRUSSELS 1996-08-22 The European Commission".

In this case,

  1. if use_context_separator=False, the embedding is produced from this string: to boycott British lamb . Peter Blackburn BRUSSELS 1996-08-22 The European Commission
  2. if use_context_separator=True, the embedding is produced from this string: to boycott British lamb . [KONTEXT] Peter Blackburn [KONTEXT] BRUSSELS 1996-08-22 The European Commission

Option 1 is how it worked before. Option 2 is the new default and explicitly marks the boundaries of the current sentence with the [KONTEXT] token.
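The assembly of these two strings can be sketched in plain Python (build_input is an illustrative helper, not Flair's actual API):

```python
# Illustrative sketch of how the two options assemble the transformer
# input; build_input is a made-up helper, not Flair's actual API.
def build_input(left, sentence, right, use_context_separator):
    separator = ["[KONTEXT]"] if use_context_separator else []
    return " ".join(left + separator + sentence + separator + right)

left = ["to", "boycott", "British", "lamb", "."]
sentence = ["Peter", "Blackburn"]
right = ["BRUSSELS", "1996-08-22", "The", "European", "Commission"]

# Option 1: context and sentence run together without any boundary
print(build_input(left, sentence, right, use_context_separator=False))
# Option 2: [KONTEXT] explicitly marks where the current sentence begins and ends
print(build_input(left, sentence, right, use_context_separator=True))
```

Because [KONTEXT] is registered as a special token, it stays a single unit under the transformer tokenizer rather than being split into subwords.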

To evaluate, we trained FLERT models on CoNLL-03 for two transformers and different context markers. Each setup was trained with 7 different seeds; the reported test F1 is the average over all 7 runs, with standard deviation:

| Transformer | Context-Marker | CoNLL-03 Test F1 |
|---|---|---|
| bert-base-uncased | none | 91.52 +- 0.16 |
| bert-base-uncased | [SEP] | 91.38 +- 0.18 |
| bert-base-uncased | [KONTEXT] | 91.56 +- 0.17 |
| xlm-roberta-large | none | 93.73 +- 0.20 |
| xlm-roberta-large | [SEP] | 93.76 +- 0.13 |
| xlm-roberta-large | [KONTEXT] | 93.92 +- 0.14 |
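The aggregation behind each table row can be sketched as follows (the per-seed scores below are illustrative, not the actual run results):

```python
# Sketch of the per-row aggregation: mean and sample standard deviation
# of test F1 over 7 seeded runs. The scores below are illustrative only.
from statistics import mean, stdev

f1_scores = [91.3, 91.5, 91.6, 91.4, 91.7, 91.5, 91.6]  # one value per seed
print(f"{mean(f1_scores):.2f} +- {stdev(f1_scores):.2f}")
```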

The context marker generally does not seem to hurt the F1 score, and it will potentially enable us to address the context issue described in #3063.

@helpmefindaname (Collaborator) commented:

Hi @alanakbik
This looks like a good solution for TARS embeddings and even an improvement for non-TARS embeddings.
I noticed the values you reported are lower than the final score reported in the Flert paper. Did you use the default train_with_dev=False? That would explain the discrepancy.
If you did use train_with_dev=True instead, do you know why the values are lower than the original?

@@ -624,6 +627,10 @@ def __expand_sentence_with_context(self, sentence) -> Tuple[List[Token], int]:
left_context = sentence.left_context(self.context_length, self.respect_document_boundaries)
right_context = sentence.right_context(self.context_length, self.respect_document_boundaries)

if self.use_context_separator:
    left_context = left_context + [Token("[KONTEXT]")]
    right_context = [Token("[KONTEXT]")] + right_context
A Collaborator commented on this diff:

Can we call this token [CONTEXT] instead of [KONTEXT]? I think we benefit from not mixing two languages.
From my understanding, changing the text of the separator won't change the training, as it is added as a special token.

Besides that, I would prefer that this text be extracted to a constant (similar to the START_TOKEN in https://github.com/flairNLP/flair/blob/master/flair/models/sequence_tagger_utils/crf.py#L5), so that a typo in a single string cannot break the code in the future.
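This suggestion could look like the following sketch (CONTEXT_SEPARATOR is a hypothetical constant name mirroring START_TOKEN, not Flair's actual code):

```python
# Sketch of the suggested refactoring: define the separator once as a
# module-level constant so a typo cannot silently diverge between uses.
# CONTEXT_SEPARATOR is a hypothetical name, not Flair's actual constant.
CONTEXT_SEPARATOR = "[CONTEXT]"

def expand_with_separator(left_context, right_context):
    """Wrap the current sentence by appending/prepending the separator."""
    left_context = left_context + [CONTEXT_SEPARATOR]
    right_context = [CONTEXT_SEPARATOR] + right_context
    return left_context, right_context
```

With this, both the tokenizer registration and the context expansion refer to the same string, so they cannot drift apart.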

@alanakbik (Collaborator, Author) commented Jan 30, 2023

> Did you use the default train_with_dev=False? That would explain the discrepancy.

Thanks for the feedback! I trained without dev, so the numbers are actually pretty much the same as in the FLERT paper.

@alanakbik alanakbik merged commit f99ad60 into master Jan 30, 2023
@alanakbik alanakbik deleted the context_boundaries branch January 30, 2023 12:24
alanakbik added a commit that referenced this pull request Jan 31, 2023