inconsistent sentence boundaries before and after serialization #322

Closed
thricedotted opened this issue Apr 6, 2016 · 7 comments
Labels: bug (Bugs and behaviour differing from documentation)

@thricedotted
I've been running into a problem where a parse's sentence boundaries change after converting it to a bytestring:

    >>> import spacy
    >>> from spacy.tokens import Doc
    >>> nlp = spacy.load('en')
    >>> text = u"I bought a couch from IKEA. It wasn't very comfortable."
    >>> parse = nlp(text)
    >>> parse_from_bytes = Doc(nlp.vocab).from_bytes(parse.to_bytes())
    >>> [s.text for s in parse.sents]
    [u"I bought a couch from IKEA. It wasn't very comfortable."]
    >>> [s.text for s in parse_from_bytes.sents]
    [u'I bought a couch from IKEA.', u"It wasn't very comfortable."]
    >>> parse.to_bytes() == parse_from_bytes.to_bytes()
    True

This happened to be a case where the sentence boundaries were more correct after the conversion, but I have other examples where it actually breaks the parse. EDIT: the parse is already broken; in the original, two ROOTs appear in the same sentence, whereas in the from_bytes version the ROOTs are forced into different sentences.

Not sure if this means there is a bug in the serialization or initial sentence boundary detection!

honnibal added the bug (Bugs and behaviour differing from documentation) label Apr 7, 2016
@honnibal
Member

honnibal commented Apr 7, 2016

Thanks, there's definitely something wrong here.

@honnibal
Member

(cc @wbwseeker, since we were talking about this bug on Slack)

I've just gone back over the code and realised that I'd forgotten how my transition system works, with respect to the Break transition. It's really not written down anywhere, and it's in fact rather different from the paper that the code cites as inspiration. So, I'll give some background here.

The intention is that all sentences are connected trees, so there's one word per sentence that is its own head, and that has the label ROOT. Mostly, sentence boundaries are inserted by the Break action. The Break action flags the first word of the buffer as the start of the next sentence. The parser then acts as though the buffer is exhausted until the stack has only one word. That is, it continues parsing using the "Unshift" action to connect the stack, until only one word is left. That word then becomes the root of the sentence, it's popped, and parsing continues.
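
To make the Break semantics concrete, here is a toy Python sketch of the behaviour described above. This is not spaCy's actual implementation (which lives in Cython in arc_eager.pyx), and apply_break, stack, buffer, sent_starts, and heads are all hypothetical names; in particular, the one-line reduction below is a stand-in for the real Unshift bookkeeping, which is more involved.

    def apply_break(stack, buffer, sent_starts, heads):
        # Flag the first word of the buffer as the start of the next sentence.
        sent_starts[buffer[0]] = True
        # Act as though the buffer were exhausted: keep reducing the stack
        # until only one word remains (stand-in for the "Unshift" action).
        while len(stack) > 1:
            child = stack.pop()
            heads[child] = stack[-1]  # attach the popped word into the tree
        # The last word left becomes the sentence root and is popped.
        root = stack.pop()
        heads[root] = root  # a root is its own head, with the label ROOT
        return root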

There is, however, another way we can get a sentence boundary. If the buffer is fully exhausted (i.e. we're really at the end of the input), the parser might end up with two root words on the stack. It's then allowed to join them with a left or right arc, using the label ROOT. This should be interpreted as saying "These are both root words, of different sentences. Insert a sentence boundary between them." In the code, there's a flag USE_ROOT_ARC_SEGMENT that toggles this behaviour. It was used as a baseline strategy when I was experimenting with the definition of the Break transition.

At some point, the code to actually insert the sentence boundaries under this USE_ROOT_ARC_SEGMENT strategy got dropped. I think the place to make the change is here: https://github.com/spacy-io/spaCy/blob/master/spacy/syntax/arc_eager.pyx#L396 . All we should need there is something like st._sent[st._sent[i].left_edge].sent_start = True.
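
At the Doc level, the intended effect of that fix can be sketched in plain Python: a token that carries the ROOT label but is attached to another word must have been joined by this error-correction arc, so a sentence boundary belongs at its left edge. This is only an illustration of the behaviour, not the actual Cython patch; mark_root_arc_boundaries is a hypothetical name.

    def mark_root_arc_boundaries(doc):
        # A token labelled ROOT whose head is a different word was joined
        # to another root by the USE_ROOT_ARC_SEGMENT arc, so the start of
        # its subtree (its left edge) begins a new sentence.
        sent_starts = [False] * len(doc)
        for token in doc:
            if token.dep_ == 'ROOT' and token.head is not token:
                sent_starts[token.left_edge.i] = True
        return sent_starts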

Below you can find the transition sequence taken by the current model for the example sentence. You can see the final R-ROOT action, which connects the two root words. Note that the tokenization error ("IKEA." is kept as a single token, so the parser never sees a sentence-final period there) is the underlying cause of the model's initial mistake, which is how it ends up using this error-correction mechanism to arrive at the correct parse.

Another important part of the post-mortem: the transition system contains a lot of fairly intricate logic that has only been supported by informal experiments, and hasn't been written up anywhere. This isn't very satisfying. I really wanted to write a paper explaining the joint sentence boundary detection and parsing mechanism and presenting the whole-document evaluations, but I never got the CoreNLP comparison done, and the priority was always to keep developing. The decisions should at least be written up somewhere, with whatever results are available.

    >>> import spacy
    >>> nlp = spacy.load('en')
    >>> string = u"I bought a couch from IKEA. It wasn't very comfortable."
    >>> doc = nlp.tokenizer(string)
    >>> nlp.tagger(doc)
    >>> with nlp.parser.step_through(doc) as state:
    ...   while not state.is_final:
    ...     action = state.predict()
    ...     print(action)
    ...     state.transition(action)
    ... 
    L-nsubj
    S
    L-det
    R-dobj
    D
    R-prep
    R-pobj
    S
    L-nsubj
    D
    D
    S
    R-neg
    S
    L-advmod
    D
    R-acomp
    D
    R-punct
    R-ROOT

@robclewley

This bug is hurting one of my projects too, which relies on serializing large docs to avoid re-parsing them when they're used later. Is there any external workaround or internal patch you can think of?

@honnibal
Member

Haven't tested this yet, but you could replace the call to doc.sents with this:

    def iter_sents(doc):
        for token in doc:
            if token.dep_ == 'ROOT':
                sent_start = token.left_edge
                sent_end = token.right_edge + 1
                yield doc[sent_start : sent_end]

This might not do what you want, though: it inserts extra sentence boundaries. If you want to keep the boundaries that were set before serialisation, you'll have a tougher time. The bug is that the sent_start flag isn't being set correctly during parsing in a minority of cases, so the planned fix is to have those sentence boundaries always inserted.

@robclewley

Yes, I don't want the extra boundaries. Could I extract these edges before serializing, and then force new sentences from those edges afterwards, using __setstate__ and __getstate__?

@robclewley

Thank you, this does seem to work for now. For future reference, you just need the .i on the two edge attributes to get the required integers.
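
For future readers, here is the helper from the earlier comment with that correction applied. It's still a sketch, and it still re-derives boundaries from root words rather than preserving the pre-serialization ones:

    def iter_sents(doc):
        # One sentence per root word: slice from the root's left edge to
        # its right edge. Doc slicing needs integer offsets, hence the .i.
        for token in doc:
            if token.dep_ == 'ROOT':
                yield doc[token.left_edge.i : token.right_edge.i + 1]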

honnibal pushed a commit that referenced this issue Apr 25, 2016
…responsibility for setting sentence boundaries. Re Issue #322
honnibal closed this as completed May 4, 2016
@lock

lock bot commented May 9, 2018

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked as resolved and limited conversation to collaborators May 9, 2018