inconsistent sentence boundaries before and after serialization #322

Closed
thricedotted opened this issue Apr 6, 2016 · 7 comments
Labels: bug (Bugs and behaviour differing from documentation)

@thricedotted
I've been running into a problem where a parse's sentence boundaries change after converting it to a bytestring:

    >>> import spacy
    >>> from spacy.tokens import Doc
    >>> nlp = spacy.load('en')
    >>> text = u"I bought a couch from IKEA. It wasn't very comfortable."
    >>> parse = nlp(text)
    >>> parse_from_bytes = Doc(nlp.vocab).from_bytes(parse.to_bytes())
    >>> [s.text for s in parse.sents]
    [u"I bought a couch from IKEA. It wasn't very comfortable."]
    >>> [s.text for s in parse_from_bytes.sents]
    [u'I bought a couch from IKEA.', u"It wasn't very comfortable."]
    >>> parse.to_bytes() == parse_from_bytes.to_bytes()
    True

This happened to be a case where the sentence boundaries were more correct after the conversion, but I have other examples where it actually breaks the parse. EDIT: the parse is already broken; in the original, two ROOTs appear in the same sentence, whereas in the from_bytes version the ROOTs are forced into different sentences.

Not sure if this means there is a bug in the serialization or initial sentence boundary detection!

honnibal added the bug (Bugs and behaviour differing from documentation) label Apr 7, 2016
@honnibal
Member

honnibal commented Apr 7, 2016

Thanks, there's definitely something wrong here.

@honnibal
Member

(cc @wbwseeker, since we were talking about this bug on Slack)

I've just gone back over the code and realised that I'd forgotten how my transition system works, with respect to the Break transition. It's really not written down anywhere, and it's in fact rather different from the paper that the code cites as inspiration. So, I'll give some background here.

The intention is that all sentences are connected trees, so there's one word per sentence that is its own head, and that has the label ROOT. Mostly, sentence boundaries are inserted by the Break action. The Break action flags the first word of the buffer as the start of the next sentence. The parser then acts as though the buffer is exhausted until the stack has only one word. That is, it continues parsing using the "Unshift" action to connect the stack, until only one word is left. That word then becomes the root of the sentence, it's popped, and parsing continues.
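
To make the Break semantics concrete, here is a toy Python sketch of the behaviour described above. This is not spaCy's actual implementation (which lives in Cython in arc_eager.pyx), and apply_break, stack, buffer, sent_starts, and heads are all hypothetical names; in particular, the one-line reduction below is a stand-in for the real Unshift bookkeeping, which is more involved.

    def apply_break(stack, buffer, sent_starts, heads):
        # Flag the first word of the buffer as the start of the next sentence.
        sent_starts[buffer[0]] = True
        # Act as though the buffer were exhausted: keep reducing the stack
        # until only one word remains (stand-in for the "Unshift" action).
        while len(stack) > 1:
            child = stack.pop()
            heads[child] = stack[-1]  # attach the popped word into the tree
        # The last word left becomes the sentence root and is popped.
        root = stack.pop()
        heads[root] = root  # a root is its own head, with the label ROOT
        return root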

There is, however, another way we can get a sentence boundary. If the buffer is fully exhausted (i.e. we're really at the end of the input), the parser might end up with two root words on the stack. It's then allowed to join them with a left or right arc, using the label ROOT. This should be interpreted as saying "These are both root words, of different sentences. Insert a sentence boundary between them." In the code, there's a flag USE_ROOT_ARC_SEGMENT that toggles this behaviour. It was used as a baseline strategy when I was experimenting with the definition of the Break transition.

At some point, the code to actually insert the sentence boundaries under this USE_ROOT_ARC_SEGMENT strategy got dropped. I think the place to make the change is here: https://github.com/spacy-io/spaCy/blob/master/spacy/syntax/arc_eager.pyx#L396 . All we should need there is something like st._sent[st._sent[i].left_edge].sent_start = True.
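
At the Doc level, the intended effect of that fix can be sketched in plain Python: a token that carries the ROOT label but is attached to another word must have been joined by this error-correction arc, so a sentence boundary belongs at its left edge. This is only an illustration of the behaviour, not the actual Cython patch; mark_root_arc_boundaries is a hypothetical name.

    def mark_root_arc_boundaries(doc):
        # A token labelled ROOT whose head is a different word was joined
        # to another root by the USE_ROOT_ARC_SEGMENT arc, so the start of
        # its subtree (its left edge) begins a new sentence.
        sent_starts = [False] * len(doc)
        for token in doc:
            if token.dep_ == 'ROOT' and token.head is not token:
                sent_starts[token.left_edge.i] = True
        return sent_starts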

Below you can find the transition sequence taken by the current model for the example sentence. You can see the final R-ROOT action, which connects the two root words. Note that the tokenization error ("IKEA." is kept as a single token, so the parser never sees a sentence-final period there) is the underlying cause of the model's initial mistake, which is how it ends up using this error-correction mechanism to arrive at the correct parse.

Another important part of the post-mortem: the transition system contains a lot of fairly intricate logic that has only been supported by informal experiments, and hasn't been written up anywhere. This isn't very satisfying. I really wanted to write a paper explaining the joint sentence boundary detection and parsing mechanism and presenting the whole-document evaluations, but I never got the CoreNLP comparison done, and the priority was always to keep developing. The decisions should at least be written up somewhere, with whatever results are available.

    >>> import spacy
    >>> nlp = spacy.load('en')
    >>> string = u"I bought a couch from IKEA. It wasn't very comfortable."
    >>> doc = nlp.tokenizer(string)
    >>> nlp.tagger(doc)
    >>> with nlp.parser.step_through(doc) as state:
    ...   while not state.is_final:
    ...     action = state.predict()
    ...     print(action)
    ...     state.transition(action)
    ... 
    L-nsubj
    S
    L-det
    R-dobj
    D
    R-prep
    R-pobj
    S
    L-nsubj
    D
    D
    S
    R-neg
    S
    L-advmod
    D
    R-acomp
    D
    R-punct
    R-ROOT

@robclewley

This bug is hurting one of my projects too, which relies on serializing large docs to avoid re-parsing them when they're used later. Is there any external workaround or internal patch you can think of?

@honnibal
Member

Haven't tested this yet, but you could replace the call to doc.sents with this:

    def iter_sents(doc):
        for token in doc:
            if token.dep_ == 'ROOT':
                sent_start = token.left_edge
                sent_end = token.right_edge + 1
                yield doc[sent_start : sent_end]

This might not do what you want, though: it inserts extra sentence boundaries. If you want to keep the boundaries that were set before serialisation, you'll have a tougher time. The bug is that the sent_start flag isn't being set correctly during parsing in a minority of cases, so the planned fix is to have those sentence boundaries always inserted.

@robclewley

Yes, I don't want the extra boundaries. Could I extract these edges before serializing, and then force new sentences from those edges afterwards, using __setstate__ and __getstate__?

@robclewley

Thank you, this does seem to work for now. For future reference, you just need the .i on the two edge attributes to get the required integers.
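
For future readers, here is the helper from the earlier comment with that correction applied. It's still a sketch, and it still re-derives boundaries from root words rather than preserving the pre-serialization ones:

    def iter_sents(doc):
        # One sentence per root word: slice from the root's left edge to
        # its right edge. Doc slicing needs integer offsets, hence the .i.
        for token in doc:
            if token.dep_ == 'ROOT':
                yield doc[token.left_edge.i : token.right_edge.i + 1]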

honnibal pushed a commit that referenced this issue Apr 25, 2016
…responsibility for setting sentence boundaries. Re Issue #322
honnibal closed this as completed May 4, 2016
@lock

lock bot commented May 9, 2018

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked as resolved and limited conversation to collaborators May 9, 2018