Add support for tokenized input for coref and srl predictors #2076
Conversation
Reviewer:
I think this is a great thing to add, thanks for the PR. A few things need fixing in how you did it, though.
@@ -154,6 +155,14 @@ def split_words(self, sentence: str) -> List[Token]:
        # This works because our Token class matches spacy's.
        return _remove_spaces(self.spacy(sentence))

    def tokens_from_list(self, words: List[str]) -> List[Token]:
Reviewer:
We can't just add methods like this to a subclass without adding them to the base class. This breaks the API. You call self._tokenizer.tokens_from_list() in the Predictor below, but that will crash with any tokenizer except the spacy tokenizer.
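To make that failure mode concrete, here is a minimal illustrative sketch; the class names mirror the discussion, but the bodies are stand-ins, not AllenNLP's actual implementations:

    from typing import List

    class WordSplitter:
        """Base class: it declares no tokens_from_list method."""
        def split_words(self, sentence: str) -> List[str]:
            raise NotImplementedError

    class SpacyWordSplitter(WordSplitter):
        """Subclass that adds an extra method the base class doesn't have."""
        def split_words(self, sentence: str) -> List[str]:
            return sentence.split()  # stand-in for the real spacy logic

        def tokens_from_list(self, words: List[str]) -> List[str]:
            return words

    class SimpleWordSplitter(WordSplitter):
        """Any other subclass of the base class."""
        def split_words(self, sentence: str) -> List[str]:
            return sentence.split()

    tokenizer: WordSplitter = SimpleWordSplitter()
    tokenizer.tokens_from_list(["Hello", "world"])  # AttributeError at runtime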
allennlp/predictors/coref.py
@@ -53,6 +55,38 @@ def predict(self, document: str) -> JsonDict:
        """
        return self.predict_json({"document" : document})

    def predict_from_list(self, tokenized_document: List[str]) -> JsonDict:
Reviewer:
I'd prefer this to be called predict_tokenized - it's a lot more obvious what the "list" is supposed to be.
allennlp/predictors/coref.py
""" | ||
return self.predict_words_list(tokenized_document) | ||
|
||
def predict_words_list(self, words_list: List[str]) -> JsonDict: |
Reviewer:
I don't know why you have this extra method - it looks like it's doing exactly the same thing as predict_from_list, except it's not documented. Just remove this entirely.
allennlp/predictors/coref.py
        Create an instance from a list of words representing an already tokenized document,
        skipping tokenization when that information already exists for the user.
        """
        spacy_document = self._spacy.tokenizer.tokens_from_list(document_list)
Reviewer:
It looks like you're calling the same spacy pipeline on the document twice here; once inside tokenizer.tokens_from_list, and then once again here. I think all you need to do is call spacy_document = self._spacy.tokens_from_list(words) here (without adding the method onto the SpacyWordSplitter), then continue with this logic, and it should just work. We don't need the extra method on SpacyWordSplitter at all.
Author:
Here I'm using the actual Spacy tokenizer and not the WordSplitter one (which I use in srl); it returns a spacy.Doc object, and I then run the pipeline on the doc object only once.
Reviewer:
Oh, I see - so yeah, you really don't need that extra method at all.
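For context, a minimal sketch of the single-pipeline-run approach described above, using the spacy Doc constructor (which spacy 2.x recommends over the deprecated tokenizer.tokens_from_list); the helper name is hypothetical, and self._spacy is assumed to be the predictor's spacy Language object as in the diff above:

    from typing import List
    from spacy.tokens import Doc

    def _words_list_to_spacy_doc(self, words: List[str]) -> Doc:  # hypothetical helper
        # Build a Doc directly from the pre-tokenized words, bypassing
        # spacy's own tokenizer entirely.
        doc = Doc(self._spacy.vocab, words=words)
        # Run the remaining pipeline components (tagger, parser, ...) exactly once.
        for _name, pipe in self._spacy.pipeline:
            doc = pipe(doc)
        return doc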
allennlp/predictors/coref.py
        instance = self._words_list_to_instance(words_list)
        return self.predict_instance(instance)

    def _words_list_to_instance(self, document_list: List[str]) -> Instance:
Reviewer:
words is a better name here than document_list.

    def predict_from_list(self, tokenized_sentence: List[str]) -> JsonDict:
Reviewer:
predict_tokenized here too.
"description": description, | ||
"tags": tags, | ||
}) | ||
def predict_words_list(self, words_list: List[str]) -> JsonDict: |
Reviewer:
Again the logic in this method should just be moved into predict_tokenized - this is a duplicate method.
        Create an instance list from a document given as a list of words,
        skipping tokenization when that information already exists for the user.
        """
        tokens = self._tokenizer.tokens_from_list(words_list)
Reviewer:
You can't rely on this method actually existing, because you didn't create the method on the base class (I'm a little surprised that mypy didn't catch this; maybe because we're playing a little loose with the tokenizers inside of a predictor already...). But it's a lot easier than this: just do tokens = [Token(word) for word in words].
Author:
Did you mean I should just add tokens = [Token(word) for word in words] to the WordSplitter base class?
Reviewer:
Instead of the line that you have, you should have tokens = [Token(word) for word in words]. There's no need for any extra methods on WordSplitter for any of what you're implementing.
Author:
Spacy POS tagging is required here; I've tried to follow the class logic where all spacy tokenization and pipeline calls happen under the hood via the SpacyWordSplitter class.
Reviewer:
That's especially non-obvious from the diff I was looking at. And now that I look at the things that were hidden, I understand why mypy didn't catch this - I didn't realize that we specifically instantiated a SpacyWordSplitter here. I'd just do what you did in the coref predictor - call self._tokenizer.tokenizer.tokens_from_list(), or whatever the method is.
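Putting the thread together, a hedged sketch of what the SRL-side method could look like; predict_tokenized follows the rename suggested above, self._tokenizer.spacy is the SpacyWordSplitter's spacy Language object per the split_words diff, and _predict_from_tokens is a hypothetical stand-in for the predictor's existing instance-creation and prediction path:

    from typing import List
    from allennlp.common.util import JsonDict
    from spacy.tokens import Doc

    def predict_tokenized(self, tokenized_sentence: List[str]) -> JsonDict:
        # Build a spacy Doc from the pre-tokenized words instead of re-tokenizing;
        # self._tokenizer is the SpacyWordSplitter this predictor instantiates.
        spacy_doc = Doc(self._tokenizer.spacy.vocab, words=tokenized_sentence)
        # Run the pipeline once so the POS tags the SRL model needs are attached.
        for _name, pipe in self._tokenizer.spacy.pipeline:
            spacy_doc = pipe(spacy_doc)
        # Hand the tagged tokens to the existing prediction logic (elided here).
        return self._predict_from_tokens(list(spacy_doc))  # hypothetical helper

The test snippet below then checks that the tokens come back from the predictor unchanged.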
        words = result.get("words")
        assert words == ["The", "squirrel", "wrote", "a", "unit", "test",
                         "to", "make", "sure", "its", "nuts", "worked", "as", "designed", "."]
        num_words = len(words)
Reviewer:
I'd not remove these blank lines. They separate coherent segments.
Author:
Hi Matt, I've fixed all the comments; you can review.
Reviewer:
Thanks!
Adding methods to support predicting on tokenized input in the coref.py and semantic_role_labeler.py API methods.
Added methods:
CorefPredictor.predict_from_list(self, tokenized_document: List[str]) -> JsonDict:
SemanticRoleLabelerPredictor.predict_from_list(self, tokenized_sentence: List[str]) -> JsonDict:
This is very useful when user data is already tokenized (for example, an annotated corpus for some NLP task) and the user would like to predict on that tokenized data without losing the original token IDs.
This avoids the workflow of: creating document/sentence text from tokenized data -> feeding the created text to the predict method -> the predictor re-tokenizing the input text -> aligning output tokens with the original tokens.
Usage code examples:
for coref: allennlp/tests/predictors/coref_test.py, test test_uses_named_inputs
for srl: allennlp/tests/predictors/srl_test.py, test test_uses_named_inputs
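For illustration, a hedged end-to-end usage sketch: the archive path is a placeholder, "semantic-role-labeling" is AllenNLP's registered name for the SRL predictor, and the method name follows the predict_tokenized rename suggested in the review:

    from allennlp.models.archival import load_archive
    from allennlp.predictors.predictor import Predictor

    # Placeholder path; substitute a real trained SRL model archive.
    archive = load_archive("/path/to/srl-model.tar.gz")
    predictor = Predictor.from_archive(archive, "semantic-role-labeling")

    # The input is already tokenized, so the predictor skips tokenization and
    # the original token boundaries are preserved in the output.
    result = predictor.predict_tokenized(
        ["The", "squirrel", "wrote", "a", "unit", "test",
         "to", "make", "sure", "its", "nuts", "worked", "as", "designed", "."])
    print(result["words"])  # same tokens that were passed in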