
Error from adding arbitrary fixup rules to pipeline #600

Closed
cchu613 opened this issue Nov 2, 2016 · 3 comments
Labels
bug (Bugs and behaviour differing from documentation), docs (Documentation and website)

Comments


cchu613 commented Nov 2, 2016

Hello! I'm a newbie to natural language processing and am trying to use spaCy for an information extraction project. So far everything has been great, except that in sentences like "One killed in Bucks County shooting", shooting gets tagged as a verb instead of a noun.

Here is my code (only slightly modified from the tutorial titled Customizing the Pipeline):

import spacy

def arbitrary_fixup_rules(doc):
    # Retag any occurrence of "shooting" as a noun.
    for token in doc:
        if token.lower_ == u'shooting':
            token.tag_ = u'NN'

def custom_pipeline(nlp):
    # Run the fixup rules after the tagger, before the parser and entity recognizer.
    return (nlp.tagger, arbitrary_fixup_rules, nlp.parser, nlp.entity)

nlp = spacy.load('en', create_pipeline=custom_pipeline)

However, running

doc = nlp(u'One dead in Bucks County shooting.')

resulted in
AttributeError: attribute 'tag_' of 'spacy.tokens.token.Token' objects is not writable

Python 2.7, spaCy 1.1.2

honnibal added the docs and bug labels Nov 2, 2016
honnibal (Member) commented Nov 2, 2016

Hm! There's a gap in the API there — a missing attribute setter. Thanks.

honnibal (Member) commented Nov 2, 2016

This should be fixed in master. We also noticed a page missing from the docs, which we've just put up.

The missing page describes the API for the tokenizer. It's relevant here because it offers another way to do what you want: the tokenizer.add_special_case() method lets you add a rule specifying how some string should be segmented into component tokens. You can then attach custom attributes to those tokens.

For instance, you can do something like this:

nlp.tokenizer.add_special_case('shooting', [{"F": "shooting", "pos": "NN"}])

The attribute keys are currently a bit idiosyncratic. It recognises:

  • F: The string of the subtoken.
  • pos: The part-of-speech to assign to the subtoken.
  • L: The lemma (base form) to assign to the subtoken.

Soon this will be fixed, and it'll support the same token attributes as the rest of the library.
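
For reference, here's a minimal end-to-end sketch against the spaCy 1.x API (assuming the 'en' model is installed; the F/pos/L keys are the ones listed above):

import spacy

nlp = spacy.load('en')

# Special-case rule: tokenize the exact string "shooting" as a single
# subtoken, pre-assigned a noun tag and the lemma "shooting".
nlp.tokenizer.add_special_case('shooting', [
    {"F": "shooting", "pos": "NN", "L": "shooting"},
])

doc = nlp(u'One dead in Bucks County shooting.')
for token in doc:
    # %-formatting keeps this print compatible with Python 2.7.
    print('%s\t%s\t%s' % (token.orth_, token.tag_, token.lemma_))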

@honnibal honnibal closed this as completed Nov 2, 2016
honnibal added a commit that referenced this issue Nov 2, 2016

lock bot commented May 9, 2018

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked as resolved and limited conversation to collaborators May 9, 2018