Pickle support for matcher object #930

Closed
m93s opened this issue Mar 27, 2017 · 4 comments
Labels
usage General spaCy usage

Comments

@m93s

m93s commented Mar 27, 2017

I'm adding a number of entities and patterns (about 2000) to the matcher, and this number will keep increasing. I've tried several ways (pickle, dill, etc.) to pickle this trained matcher so I can avoid retraining every time, but nothing has worked so far.

I know that support for pickle is lined up for version 2.0, but is there any workaround right now?

My Environment

  • Python version: 2.7.13
  • Platform: Darwin-16.0.0-x86_64-i386-64bit
  • spaCy version: 1.7.3
  • Installed models: cache, en, en-1.1.0, en_glove_cc_300_1m_vectors-1.0.
@honnibal
Member

honnibal commented Mar 27, 2017

Why can't you just reuse the function that adds the entities on load? Like, presumably you have something like:

add_patterns(matcher, patterns_file)

Why is this worse than:

matcher = pickle.load(pickle_file)

There's no real "training" of the matcher. Once pickle support is added, it'll just re-add the patterns to the matcher, in the same way your add function will.
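A minimal sketch of that idea for anyone who still wants a pickle file: pickle the plain pattern data rather than the matcher, and re-add it to a fresh matcher on load. The function names `save_patterns` and `load_into_matcher` and the `(entity_id, attrs, token_specs)` tuple layout are assumptions made for illustration; `add_entity` and `add_pattern` are the spaCy 1.x matcher methods quoted elsewhere in this thread.

```python
import pickle


def save_patterns(patterns, path):
    """Pickle the plain pattern data (not the matcher object itself).

    `patterns` is assumed to be a list of (entity_id, attrs, token_specs)
    tuples, where token_specs is a list of patterns and each pattern is a
    list of token-specifier dicts. This layout is hypothetical.
    """
    with open(path, "wb") as f:
        pickle.dump(patterns, f, protocol=2)


def load_into_matcher(matcher, path):
    """Re-add every pickled pattern to an already constructed matcher."""
    with open(path, "rb") as f:
        patterns = pickle.load(f)
    for entity_id, attrs, token_specs in patterns:
        matcher.add_entity(entity_id, attrs)
        for specs in token_specs:
            matcher.add_pattern(entity_id, specs)
    return matcher
```

Because only lists, dicts, strings and integers are pickled, this avoids needing the matcher itself to be picklable. The cost on load is the same loop of `add_pattern` calls described above.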

@m93s
Author

m93s commented Mar 27, 2017

I was looking to ship it so it would start matching right out of the box, but since there isn't any real training and hence no real time taken, I guess it's perfectly fine to do it like you mentioned.
Also, is there a default JSON template that I can give directly to the matcher, or should I create a JSON file, write a parser, iterate over it, and add entities as in the snippet below from the docs:

from spacy.attrs import ORTH  # attribute ID used as the token specifier key

# `matcher` is assumed to be an existing spacy.matcher.Matcher instance.
matcher.add_entity(
    "GoogleNow",  # Entity ID: helps you act on the match.
    {"ent_type": "PRODUCT", "wiki_en": "Google_Now"},  # Arbitrary attributes (optional)
)

matcher.add_pattern(
    "GoogleNow",  # Entity ID: created if it doesn't exist.
    [  # The pattern is a list of *Token Specifiers*.
        {ORTH: "Google"},  # Matches tokens whose orth field is "Google"
        {ORTH: "Now"},  # Matches tokens whose orth field is "Now"
    ],
    label=None,  # Can associate a label to the pattern-match, to handle it better.
)

Is there any documentation for all the attributes available to the matcher?
I see that a number of them are listed here, but I have only been able to use a few of them.
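One way the JSON approach asked about above could be sketched, assuming a hand-rolled file layout. Nothing in this thread indicates that the matcher accepts a JSON template directly, so the `entity`, `attrs` and `patterns` keys and the `attr_ids` parameter below are hypothetical. Only `add_entity` and `add_pattern` come from the docs snippet quoted above.

```python
import json

# Hypothetical file layout:
# [
#   {"entity": "GoogleNow",
#    "attrs": {"ent_type": "PRODUCT", "wiki_en": "Google_Now"},
#    "patterns": [[{"ORTH": "Google"}, {"ORTH": "Now"}]]}
# ]


def add_patterns(matcher, patterns_file, attr_ids):
    """Read entities and patterns from a JSON file and add them to `matcher`.

    `attr_ids` maps the attribute names used in the file (e.g. "ORTH") to
    the attribute IDs the matcher expects (e.g. spacy.attrs.ORTH).
    """
    with open(patterns_file) as f:
        entries = json.load(f)
    for entry in entries:
        matcher.add_entity(entry["entity"], entry.get("attrs", {}))
        for pattern in entry["patterns"]:
            # Translate string keys such as "ORTH" into attribute IDs.
            token_specs = [
                {attr_ids[name]: value for name, value in spec.items()}
                for spec in pattern
            ]
            matcher.add_pattern(entry["entity"], token_specs)
    return matcher
```

With spaCy 1.x this might be called as `add_patterns(matcher, "patterns.json", {"ORTH": spacy.attrs.ORTH, "LOWER": spacy.attrs.LOWER})`, extending the mapping with whichever attributes the patterns use. An unknown attribute name in the file raises `KeyError`, which surfaces typos early.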

On an unrelated topic, if I wanted to add a new entity through GoldParse, what is the minimum number of training examples you would recommend? I've seen around 5000 examples mentioned somewhere on the repo.

@honnibal honnibal added the usage General spaCy usage label Mar 31, 2017
@honnibal
Member

honnibal commented Apr 7, 2017

Documentation for the attributes is still pending unfortunately.

Also, "on an unrelated topic" isn't great on the tracker :). Life is a little easier if we keep the threads well organised. I've written some things about this in other NER threads, or you can ask for tips on Gitter.

@lock

lock bot commented May 9, 2018

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked as resolved and limited conversation to collaborators May 9, 2018