Add support for tokenized input for coref and srl predictors #2076
Conversation
Reviewer:
I think this is a great thing to add, thanks for the PR. A few things need fixing in how you did it, though.
@@ -154,6 +155,14 @@ def split_words(self, sentence: str) -> List[Token]:
        # This works because our Token class matches spacy's.
        return _remove_spaces(self.spacy(sentence))

    def tokens_from_list(self, words: List[str]) -> List[Token]:
Reviewer:
We can't just add methods like this to a subclass without adding them to the base class. This breaks the API. You call self._tokenizer.tokens_from_list() in the Predictor below, but that will crash with any tokenizer except the spacy tokenizer.
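To make that failure mode concrete, here is a minimal illustrative sketch; the class names mirror the discussion, but the bodies are stand-ins, not AllenNLP's actual implementations:

    from typing import List

    class WordSplitter:
        """Base class: it declares no tokens_from_list method."""
        def split_words(self, sentence: str) -> List[str]:
            raise NotImplementedError

    class SpacyWordSplitter(WordSplitter):
        """Subclass that adds an extra method the base class doesn't have."""
        def split_words(self, sentence: str) -> List[str]:
            return sentence.split()  # stand-in for the real spacy logic

        def tokens_from_list(self, words: List[str]) -> List[str]:
            return words

    class SimpleWordSplitter(WordSplitter):
        """Any other subclass of the base class."""
        def split_words(self, sentence: str) -> List[str]:
            return sentence.split()

    tokenizer: WordSplitter = SimpleWordSplitter()
    tokenizer.tokens_from_list(["Hello", "world"])  # AttributeError at runtime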
allennlp/predictors/coref.py
@@ -53,6 +55,38 @@ def predict(self, document: str) -> JsonDict:
        """
        return self.predict_json({"document" : document})

    def predict_from_list(self, tokenized_document: List[str]) -> JsonDict:
Reviewer:
I'd prefer this to be called predict_tokenized - it's a lot more obvious what the "list" is supposed to be.
allennlp/predictors/coref.py
""" | ||
return self.predict_words_list(tokenized_document) | ||
|
||
def predict_words_list(self, words_list: List[str]) -> JsonDict: |
Reviewer:
I don't know why you have this extra method - it looks like it's doing exactly the same thing as predict_from_list, except it's not documented. Just remove this entirely.
allennlp/predictors/coref.py
        Create an instance from a list of words representing an already tokenized document,
        skipping tokenization when that information already exists for the user.
        """
        spacy_document = self._spacy.tokenizer.tokens_from_list(document_list)
Reviewer:
It looks like you're calling the same spacy pipeline on the document twice here; once inside tokenizer.tokens_from_list, and then once again here. I think all you need to do is call spacy_document = self._spacy.tokens_from_list(words) here (without adding the method onto the SpacyWordSplitter), then continue with this logic, and it should just work. We don't need the extra method on SpacyWordSplitter at all.
Author:
Here I'm using the actual Spacy tokenizer and not the WordSplitter one (which I use in srl); it returns a spacy.Doc object, and I then run the pipeline on the doc object only once.
Reviewer:
Oh, I see - so yeah, you really don't need that extra method at all.
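For context, a minimal sketch of the single-pipeline-run approach described above, using the spacy Doc constructor (which spacy 2.x recommends over the deprecated tokenizer.tokens_from_list); the helper name is hypothetical, and self._spacy is assumed to be the predictor's spacy Language object as in the diff above:

    from typing import List
    from spacy.tokens import Doc

    def _words_list_to_spacy_doc(self, words: List[str]) -> Doc:  # hypothetical helper
        # Build a Doc directly from the pre-tokenized words, bypassing
        # spacy's own tokenizer entirely.
        doc = Doc(self._spacy.vocab, words=words)
        # Run the remaining pipeline components (tagger, parser, ...) exactly once.
        for _name, pipe in self._spacy.pipeline:
            doc = pipe(doc)
        return doc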
allennlp/predictors/coref.py
        instance = self._words_list_to_instance(words_list)
        return self.predict_instance(instance)

    def _words_list_to_instance(self, document_list: List[str]) -> Instance:
Reviewer:
words is a better name here than document_list.

    def predict_from_list(self, tokenized_sentence: List[str]) -> JsonDict:
Reviewer:
predict_tokenized here too.
"description": description, | ||
"tags": tags, | ||
}) | ||
def predict_words_list(self, words_list: List[str]) -> JsonDict: |
Reviewer:
Again the logic in this method should just be moved into predict_tokenized - this is a duplicate method.
        Create an instance list from a document given as a list of words,
        skipping tokenization when that information already exists for the user.
        """
        tokens = self._tokenizer.tokens_from_list(words_list)
Reviewer:
You can't rely on this method actually existing, because you didn't create the method on the base class (I'm a little surprised that mypy didn't catch this; maybe because we're playing a little loose with the tokenizers inside of a predictor already...). But it's a lot easier than this: just do tokens = [Token(word) for word in words].
Author:
Did you mean I should just add tokens = [Token(word) for word in words] to the WordSplitter base class?
Reviewer:
Instead of the line that you have, you should have tokens = [Token(word) for word in words]. There's no need for any extra methods on WordSplitter for any of what you're implementing.
Author:
Spacy POS tagging is required here; I've tried to follow the class logic where all spacy tokenization and pipeline calls happen under the hood via the SpacyWordSplitter class.
Reviewer:
That's especially non-obvious from the diff I was looking at. And now that I look at the things that were hidden, I understand why mypy didn't catch this - I didn't realize that we specifically instantiated a SpacyWordSplitter here. I'd just do what you did in the coref predictor - call self._tokenizer.tokenizer.tokens_from_list(), or whatever the method is.
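Putting the thread together, a hedged sketch of what the SRL-side method could look like; predict_tokenized follows the rename suggested above, self._tokenizer.spacy is the SpacyWordSplitter's spacy Language object per the split_words diff, and _predict_from_tokens is a hypothetical stand-in for the predictor's existing instance-creation and prediction path:

    from typing import List
    from allennlp.common.util import JsonDict
    from spacy.tokens import Doc

    def predict_tokenized(self, tokenized_sentence: List[str]) -> JsonDict:
        # Build a spacy Doc from the pre-tokenized words instead of re-tokenizing;
        # self._tokenizer is the SpacyWordSplitter this predictor instantiates.
        spacy_doc = Doc(self._tokenizer.spacy.vocab, words=tokenized_sentence)
        # Run the pipeline once so the POS tags the SRL model needs are attached.
        for _name, pipe in self._tokenizer.spacy.pipeline:
            spacy_doc = pipe(spacy_doc)
        # Hand the tagged tokens to the existing prediction logic (elided here).
        return self._predict_from_tokens(list(spacy_doc))  # hypothetical helper

The test snippet below then checks that the tokens come back from the predictor unchanged.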
        words = result.get("words")
        assert words == ["The", "squirrel", "wrote", "a", "unit", "test",
                         "to", "make", "sure", "its", "nuts", "worked", "as", "designed", "."]
        num_words = len(words)
Reviewer:
I'd not remove these blank lines. They separate coherent segments.
Author:
Hi Matt, I've fixed all the comments; you can review.
Reviewer:
Thanks!
Adding methods to support predicting on tokenized input in the coref.py and semantic_role_labeler.py API methods.
Added methods:
CorefPredictor.predict_from_list(self, tokenized_document: List[str]) -> JsonDict:
SemanticRoleLabelerPredictor.predict_from_list(self, tokenized_sentence: List[str]) -> JsonDict:
This is very useful when user data is already tokenized (for example, an annotated corpus for some NLP task) and the user would like to predict on that tokenized data without losing the original token IDs.
This avoids the workflow of: creating document/sentence text from tokenized data -> feeding the created text to the predict method -> the predictor re-tokenizing the input text -> aligning output tokens with the original tokens.
Usage code examples:
for coref: allennlp/tests/predictors/coref_test.py, test test_uses_named_inputs
for srl: allennlp/tests/predictors/srl_test.py, test test_uses_named_inputs
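For illustration, a hedged end-to-end usage sketch: the archive path is a placeholder, "semantic-role-labeling" is AllenNLP's registered name for the SRL predictor, and the method name follows the predict_tokenized rename suggested in the review:

    from allennlp.models.archival import load_archive
    from allennlp.predictors.predictor import Predictor

    # Placeholder path; substitute a real trained SRL model archive.
    archive = load_archive("/path/to/srl-model.tar.gz")
    predictor = Predictor.from_archive(archive, "semantic-role-labeling")

    # The input is already tokenized, so the predictor skips tokenization and
    # the original token boundaries are preserved in the output.
    result = predictor.predict_tokenized(
        ["The", "squirrel", "wrote", "a", "unit", "test",
         "to", "make", "sure", "its", "nuts", "worked", "as", "designed", "."])
    print(result["words"])  # same tokens that were passed in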