Relation Classifier Encoding Strategies #3023
Conversation
- …being fully configurable
- Rename mentions of `masked_sentence` to `encoded_sentence`
- …n into one test
- Add transformation test for entity-mask and typed-entity-mask
Thanks @dobbersc for improving this!
In testing this PR, I noticed some discrepancies between the parameters of `RelationExtractor` and `RelationClassifier`, related to the (kind of confusing) issue of whether an RE dataset is fully annotated like `RE_ENGLISH_CONLL04`, and thus usable without a prior "relation identification" step, or whether only a subset of entity pairs is annotated, as in TACRED or SemEval. For `RE_ENGLISH_CONLL04`, one would instantiate the `RelationClassifier` like this:
```python
model: RelationClassifier = RelationClassifier(
    embeddings=embeddings,
    label_dictionary=label_dictionary,
    label_type="relation",
    entity_label_types="ner",
    cross_augmentation=True,
)
```
and the `RelationExtractor` like this:
```python
model: RelationExtractor = RelationExtractor(
    embeddings=embeddings,
    label_dictionary=label_dictionary,
    label_type="relation",
    entity_label_type="ner",
    train_on_gold_pairs_only=False,
)
```
So the params `entity_label_type` / `entity_label_types` and `train_on_gold_pairs_only` / `cross_augmentation` have different names. Perhaps align the naming?
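The conceptual difference behind `cross_augmentation` / `train_on_gold_pairs_only` can be illustrated with a small standalone sketch (the function and data below are hypothetical illustrations, not flair's internals):

```python
from itertools import permutations

def candidate_pairs(entities, gold_pairs, cross_augmentation):
    """Build training pairs of entities.

    With cross_augmentation=True, every ordered entity pair becomes a
    candidate; pairs without a gold relation receive a synthetic
    negative label ("O"). With cross_augmentation=False, only the
    gold-annotated pairs are used (the analogue of
    train_on_gold_pairs_only=True).
    """
    if cross_augmentation:
        return [
            (head, tail, gold_pairs.get((head, tail), "O"))
            for head, tail in permutations(entities, 2)
        ]
    return [(head, tail, label) for (head, tail), label in gold_pairs.items()]

entities = ["Obama", "Hawaii", "USA"]
gold = {("Obama", "Hawaii"): "born_in"}

# Gold pairs only: a single training example.
print(candidate_pairs(entities, gold, cross_augmentation=False))
# Cross augmentation: all 6 ordered pairs, unannotated ones labeled "O".
print(len(candidate_pairs(entities, gold, cross_augmentation=True)))
```

On a fully annotated corpus like `RE_ENGLISH_CONLL04`, the unannotated pairs really are negatives, so cross augmentation is safe; on TACRED-style corpora it would wrongly label unannotated gold relations as "O".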
Thanks @alanakbik for testing the PR. I agree that the …
@dobbersc thanks, merging now!
This PR implements several encoding strategies from this paper for the relation classifier. The original relation classifier was designed for the (typed) entity mask encoding strategy. Now, the model is more general, with the encoding strategy as a hyperparameter. The interface also enables us to easily add more encoding strategies.
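The (typed) entity mask strategy mentioned above can be sketched as a plain-Python transformation (a simplified standalone illustration with hypothetical names, not flair's actual interface):

```python
def entity_mask(tokens, head, tail, typed=False):
    """Replace the head/tail entity spans with mask tokens.

    head and tail are (start, end, type) spans over `tokens`. With
    typed=True the mask carries the entity type (typed entity mask);
    otherwise a generic [HEAD]/[TAIL] mask is used (entity mask).
    """
    out = []
    for i, tok in enumerate(tokens):
        if head[0] <= i < head[1]:
            if i == head[0]:  # emit one mask token per entity span
                out.append(f"[HEAD-{head[2]}]" if typed else "[HEAD]")
        elif tail[0] <= i < tail[1]:
            if i == tail[0]:
                out.append(f"[TAIL-{tail[2]}]" if typed else "[TAIL]")
        else:
            out.append(tok)
    return " ".join(out)

tokens = "Barack Obama was born in Hawaii".split()
head = (0, 2, "PER")  # span "Barack Obama"
tail = (5, 6, "LOC")  # span "Hawaii"
print(entity_mask(tokens, head, tail))              # [HEAD] was born in [TAIL]
print(entity_mask(tokens, head, tail, typed=True))  # [HEAD-PER] was born in [TAIL-LOC]
```

Making the strategy a hyperparameter means the classifier only ever sees the transformed sentence, so new strategies (e.g. entity markers instead of masks) can be added without touching the model itself.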