This repository has been archived by the owner on Dec 16, 2022. It is now read-only.

Add support of tokenized input for coref and srl predictors #2076

Merged
3 commits merged into allenai:master on Nov 18, 2018

Conversation

AlonEirew (Contributor):

Adds methods to support predicting on tokenized input in the coref.py and semantic_role_labeler.py predictor APIs.

Added methods:
CorefPredictor.predict_from_list(self, tokenized_document: List[str]) -> JsonDict:
SemanticRoleLabelerPredictor.predict_from_list(self, tokenized_sentence: List[str]) -> JsonDict:

This is very useful when the user's data is already tokenized (for example, an annotated corpus for some NLP task) and the user wants to predict on that tokenized data without losing the original token IDs.

It avoids the workflow of: create document/sentence text from the tokenized data -> pass the created text to the predict method -> predict re-tokenizes the input text -> align the output tokens with the original tokens.

Usage code examples:
for coref: allennlp/tests/predictors/coref_test.py, test test_uses_named_inputs
for srl: allennlp/tests/predictors/srl_test.py, test test_uses_named_inputs
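
A minimal usage sketch of what this looks like from the caller's side. This is hypothetical: the method is shown under the predict_tokenized name suggested in the review below, the archive path is a placeholder, and Predictor.from_path is assumed as the loading entry point.

from allennlp.predictors.predictor import Predictor

# Tokens produced by an upstream annotation pipeline; we want the predictor's
# output to stay aligned with exactly these tokens.
tokens = ["The", "squirrel", "wrote", "a", "unit", "test", "to", "make",
          "sure", "its", "nuts", "worked", "as", "designed", "."]

# Placeholder path; point this at a real SRL (or coref) model archive.
predictor = Predictor.from_path("/path/to/srl-model.tar.gz")

# No string reconstruction, no re-tokenization, no post-hoc alignment:
# the output is keyed to the tokens passed in.
result = predictor.predict_tokenized(tokens)
print(result["words"])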

matt-gardner (Contributor) left a comment:

I think this is a great thing to add, thanks for the PR. A few things need fixing in how you did it, though.

@@ -154,6 +155,14 @@ def split_words(self, sentence: str) -> List[Token]:
# This works because our Token class matches spacy's.
return _remove_spaces(self.spacy(sentence))

def tokens_from_list(self, words: List[str]) -> List[Token]:
matt-gardner (Contributor):

We can't just add methods like this to a subclass without adding them to the base class. This breaks the API. You call self._tokenizer.tokens_from_list() in the Predictor below, but that will crash with any tokenizer except the spacy tokenizer.
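
A minimal sketch of the failure mode being described, assuming a predictor configured with a non-spacy word splitter such as JustSpacesWordSplitter (hypothetical snippet, not part of the PR):

from allennlp.data.tokenizers.word_splitter import JustSpacesWordSplitter

splitter = JustSpacesWordSplitter()

# Fine: split_words() is part of the WordSplitter base class API.
splitter.split_words("a pre tokenized sentence")

# Not fine: tokens_from_list() is only defined on SpacyWordSplitter in this PR,
# so any predictor that calls self._tokenizer.tokens_from_list() with another
# WordSplitter implementation crashes with an AttributeError at runtime.
splitter.tokens_from_list(["a", "pre", "tokenized", "sentence"])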

@@ -53,6 +55,38 @@ def predict(self, document: str) -> JsonDict:
"""
return self.predict_json({"document" : document})

def predict_from_list(self, tokenized_document: List[str]) -> JsonDict:
matt-gardner (Contributor):

I'd prefer this to be called predict_tokenized - it's a lot more obvious what the "list" is supposed to be.

"""
return self.predict_words_list(tokenized_document)

def predict_words_list(self, words_list: List[str]) -> JsonDict:
matt-gardner (Contributor):

I don't know why you have this extra method - it looks like it's doing exactly the same thing as predict_from_list, except it's not documented. Just remove this entirely.

Create an instance from words list represent an already tokenized document,
for skipping tokenization when that information already exist for the user
"""
spacy_document = self._spacy.tokenizer.tokens_from_list(document_list)
matt-gardner (Contributor):

It looks like you're calling the same spacy pipeline on the document twice here; once inside tokenizer.tokens_from_list, and then once again here. I think all you need to do is call spacy_document = self._spacy.tokens_from_list(words) here (without adding the method onto the SpacyWordSplitter), then continue with this logic, and it should just work. We don't need the extra method on SpacyWordSplitter at all.

AlonEirew (Contributor, Author):

Here I'm using the actual spacy tokenizer, not the WordSplitter one (which I use in SRL). It returns a spacy Doc object, and I then run the pipeline on that Doc object only once.

matt-gardner (Contributor):

Oh, I see - so yeah, you really don't need that extra method at all.
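
For context, a hedged sketch of the approach agreed on here, written as a standalone snippet against the spacy 2.x API that AllenNLP used at the time (newer spacy versions drop tokens_from_list in favor of Doc(nlp.vocab, words=...)). Inside the predictor, nlp corresponds to the self._spacy model:

import spacy

nlp = spacy.load("en_core_web_sm")  # stands in for the predictor's self._spacy

words = ["The", "squirrel", "wrote", "a", "unit", "test", "."]

# Build the Doc directly from the pre-split words, so spacy never re-splits
# the text and the token boundaries stay exactly as the user provided them.
doc = nlp.tokenizer.tokens_from_list(words)

# Then run the rest of the pipeline (tagger, parser, ner) over that Doc once;
# the parse supplies the sentence boundaries the coref dataset reader needs.
for _, pipe in nlp.pipeline:
    pipe(doc)

print([[token.text for token in sentence] for sentence in doc.sents])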

instance = self._words_list_to_instance(words_list)
return self.predict_instance(instance)

def _words_list_to_instance(self, document_list: List[str]) -> Instance:
matt-gardner (Contributor):

words is a better name here than document_list.


def predict_from_list(self, tokenized_sentence: List[str]) -> JsonDict:
matt-gardner (Contributor):

predict_tokenized here too.

"description": description,
"tags": tags,
})
def predict_words_list(self, words_list: List[str]) -> JsonDict:
matt-gardner (Contributor):

Again the logic in this method should just be moved into predict_tokenized - this is a duplicate method.

Create an instance list of works document, for skipping tokenization when that
information already exist for the user
"""
tokens = self._tokenizer.tokens_from_list(words_list)
matt-gardner (Contributor):

You can't rely on this method actually existing, because you didn't create the method on the base class (I'm a little surprised that mypy didn't catch this; maybe because we're playing a little loose with the tokenizers inside of a predictor already...). But it's a lot easier than this: just do tokens = [Token(word) for word in words].

AlonEirew (Contributor, Author):

Did you mean that I should just add tokens = [Token(word) for word in words] to the WordSplitter base class?

matt-gardner (Contributor):

Instead of the line that you have, you should have tokens = [Token(word) for word in words]. There's no need for any extra methods on WordSplitter for any of what you're implementing.

AlonEirew (Contributor, Author) on Nov 18, 2018:

Spacy POS tagging is required here, so I tried to follow the class design where all spacy tokenization and pipeline steps happen under the hood via the SpacyWordSplitter class.

matt-gardner (Contributor):

That's especially non-obvious from the diff I was looking at. And now that I look at the things that were hidden, I understand why mypy didn't catch this - I didn't realize that we specifically instantiated a SpacyWordSplitter here. I'd just do what you did in the coref predictor - call self._tokenizer.tokenizer.tokens_from_list(), or whatever the method is.
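
A hedged sketch of what that amounts to for the SRL predictor: reach through the SpacyWordSplitter to its underlying spacy model (the self.spacy attribute visible in the first diff hunk, i.e. roughly self._tokenizer.spacy inside the predictor; the exact attribute path is an assumption), so the pre-split words still receive the POS tags that verb detection relies on. Standalone snippet, spacy 2.x API:

import spacy

nlp = spacy.load("en_core_web_sm")  # stands in for self._tokenizer.spacy

words = ["The", "squirrel", "wrote", "a", "unit", "test", "."]

# Keep the given word boundaries, but still run the tagger so each token
# carries a POS tag; a bare [Token(word) for word in words] would leave the
# tags empty and the SRL predictor could not find any verbs.
doc = nlp.tokenizer.tokens_from_list(words)
for _, pipe in nlp.pipeline:
    pipe(doc)

# Because AllenNLP's Token class matches spacy's (see the comment in the first
# diff hunk), these tokens can feed the predictor's existing instance-building
# logic directly.
print([(token.text, token.tag_) for token in doc])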

words = result.get("words")
assert words == ["The", "squirrel", "wrote", "a", "unit", "test",
"to", "make", "sure", "its", "nuts", "worked", "as", "designed", "."]
num_words = len(words)

matt-gardner (Contributor):

I'd not remove these blank lines. They separate coherent segments.

AlonEirew (Contributor, Author):

Hi Matt, I've addressed all of the comments; you can review.

matt-gardner (Contributor) left a comment:

Thanks!

matt-gardner merged commit af902a3 into allenai:master on Nov 18, 2018