How to ADD extra named entities #187
Comments
Hey, All the code for training is there, but the documentation is lacking, and you'll need a substantial amount of training data. This is the training script that trains the tagger, parser and NER: https://github.com/honnibal/spaCy/blob/master/bin/parser/train.py#L82 I agree that there needs to be documentation for this. Sorry for the delay on getting that done.
Hi, Many thanks for the reply. Will go through the script at the earliest opportunity. Cheers,
As of v0.100, it should be possible to train new classes over the top of the old model. I don't know whether this will actually be nice for accuracy. The API for GoldParse isn't so nice, but for now this should work:

import plac
from spacy.en import English
from spacy.gold import GoldParse


def main(out_loc):
    nlp = English(parser=False)  # Avoid loading the parser, for quick load times
    # Run the tokenizer and tagger (but not the entity recognizer)
    doc = nlp.tokenizer(u'Lions and tigers and grizzly bears!')
    nlp.tagger(doc)
    nlp.entity.add_label('ANIMAL')  # <-- New in v0.100
    # Create a GoldParse object. This should have a better API...
    indices = tuple(range(len(doc)))
    words = [w.text for w in doc]
    tags = [w.tag_ for w in doc]
    heads = [0 for _ in doc]
    deps = ['' for _ in doc]
    # This is the only part we care about. We want BILOU format
    ner = ['U-ANIMAL', 'O', 'U-ANIMAL', 'O', 'B-ANIMAL', 'L-ANIMAL', 'O']
    # Create the GoldParse
    annot = GoldParse(doc, (indices, words, tags, heads, deps, ner))
    # Update the weights with the example
    # Here we iterate until we get it entirely correct. In practice this is probably a bad idea.
    # Note that we've added a class to the existing model here! We "resume"
    # training the previous model. Whether this is good or not I can't say, you'll have to
    # experiment.
    loss = nlp.entity.train(doc, annot)
    i = 0
    while loss != 0 and i < 1000:
        loss = nlp.entity.train(doc, annot)
        i += 1
    print("Used %d iterations" % i)
    nlp.entity(doc)
    for ent in doc.ents:
        print(ent.text, ent.label_)
    nlp.entity.model.dump(out_loc)


if __name__ == '__main__':
    plac.call(main)
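For anyone writing the BILOU list by hand: it can also be derived from token-level entity spans. This is only a minimal sketch in plain Python, not part of spaCy. The helper name spans_to_bilou and the (start, end, label) span format with an exclusive end are assumptions made for illustration, and the token list is hardcoded on the assumption that the tokenizer splits the sentence into these seven tokens.

```python
def spans_to_bilou(n_tokens, spans):
    """Convert (start, end, label) token spans (end exclusive) into BILOU tags.

    Hypothetical helper for illustration; not a spaCy function.
    """
    tags = ['O'] * n_tokens
    for start, end, label in spans:
        if not 0 <= start < end <= n_tokens:
            raise ValueError("span out of range: %r" % ((start, end, label),))
        if any(tag != 'O' for tag in tags[start:end]):
            raise ValueError("overlapping span: %r" % ((start, end, label),))
        if end - start == 1:
            # Single-token entity: Unit
            tags[start] = 'U-' + label
        else:
            # Multi-token entity: Begin, zero or more In, then Last
            tags[start] = 'B-' + label
            for i in range(start + 1, end - 1):
                tags[i] = 'I-' + label
            tags[end - 1] = 'L-' + label
    return tags


# Assumed tokenization of u'Lions and tigers and grizzly bears!'
words = ['Lions', 'and', 'tigers', 'and', 'grizzly', 'bears', '!']
spans = [(0, 1, 'ANIMAL'), (2, 3, 'ANIMAL'), (4, 6, 'ANIMAL')]
ner = spans_to_bilou(len(words), spans)
```

The resulting ner list could then be passed to GoldParse in place of the hand-written one, provided the real tokenization matches the assumed one.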
Thanks for the discussion. I'm new both to Python and spaCy (and NLP in general), so apologies in advance if I've missed something obvious here, but I did notice that the example provided by @honnibal doesn't work with the latest version of spaCy running under Python 3.5.
1. The example has:
but that method is no longer available - the code should be
When I make that change, I get this error:
which, from an examination of ner.pyx, looks as if the exception is being thrown here:
I tried passing both BILOU format and entity formats into the GoldParse constructor, with the exact same result.
2. There is also an example, train_ner, which offers an alternative way of training the Entity Recognizer. This worked for me except that, crucially, I was unable to modify it to accept my own Entity Type. Here are my relevant modifications (I had to take the code around 'loss' from the other example to make it work, sort of)
and
... and when feeding in a couple of sentences containing 'Acute Peptic Ulcer', the code:
prints this to the console:
So why don't I see something like:
Again, I may be out of my league here, being new to both Python and spaCy, but any help would be much appreciated!
Why do you think you can no longer use nlp.entity.train(doc, annot)? Unless I've missed something, this is still present in the most recent version of spaCy. I successfully added new entities and got my results back with a new instantiation of spaCy by using code very similar to honnibal's. The simplified code to load the saved model looks like this:

import spacy

nlp = spacy.load('en')
# train spacy with custom data
# Add the tags and training data
nlp.entity.add_label(entlabel)
nlp.entity.model.load(trainingfile.model)

My point is really that you need to add the label again when you re-instantiate spaCy. Simply loading the training file is not enough.
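One way to avoid forgetting that step might be to store the custom labels next to the saved model and re-add them on load. The following is only a sketch under assumptions: save_labels, restore_labels and the JSON file are hypothetical and not part of spaCy. The only thing taken from the thread is that the entity recognizer exposes add_label(label).

```python
import json


def save_labels(labels, path):
    """Write the custom entity labels to a JSON file (hypothetical helper)."""
    with open(path, 'w') as f:
        json.dump(sorted(set(labels)), f)


def restore_labels(entity_recognizer, path):
    """Re-add every stored label to an object that has add_label(label).

    In the thread's code that object would be nlp.entity; any object with
    the same method works here.
    """
    with open(path) as f:
        labels = json.load(f)
    for label in labels:
        entity_recognizer.add_label(label)
    return labels
```

After training you would call save_labels once; after re-instantiating spaCy you would call restore_labels(nlp.entity, path) before loading the model weights, in the same order as the snippet above.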
Thanks for the feedback @DomHudson
The exact exception stack trace I get with that code in @honnibal's example above (whether using BILOU or positional training data) is:
File "{redacted}}/spacy/train_ner_from_taxonomy-with-Bilou.py", line 94, in main
I'll have to admit that my experience of Python is limited (I'm primarily a Java developer) and so I may be missing something obvious, but a quick search of the spaCy repository reveals that there is no instance of a train(...) function. I pulled the master branch on 11/16/2016.
So, for example, after training, I run this test inline, using the simple input text file:
The test code is:
and that yields:
Unfortunately, this is not consistent, even when loading the previously persisted model. I was wondering whether somewhere along the way I'm failing to explicitly associate the previously generated config.json with the previously trained and persisted model? For now I'll have to assume that this is due to an inadequate volume of training data. I will post further in a separate thread as soon as I have this stabilized. Thanks again for your help.
@jewellcj - Looks like you were able to train a model to identify DiseaseorSyndrome. Thanks.
@BrijeshKaria - yes we were able to train the model (in prototype/tryout only code) initially focusing on just one entity. I guess the key piece of code was this:
where we invoke the above function as follows:
We more or less abandoned this, however, because to train a spaCy model effectively you need a very large Gold Standard corpus. Instead we have been focusing on plain old entity matching, and we have successfully used the spaCy Matcher with a spaCy Gazetteer generated from our Taxonomy, allowing us to use spaCy to normalize terms and so improve the accuracy of our internal taxonomy search function.
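To give a rough idea of what gazetteer-style matching does, here is a plain Python stand-in. It is not the spaCy Matcher API and not our actual code: build_gazetteer and match_entities are hypothetical names, tokens are assumed to be already split, and matching is a simple case-insensitive greedy longest match.

```python
def build_gazetteer(terms):
    """Map lowercased token tuples to (label, canonical term)."""
    gazetteer = {}
    for term, label in terms:
        key = tuple(term.lower().split())
        gazetteer[key] = (label, term)
    return gazetteer


def match_entities(tokens, gazetteer):
    """Greedy longest-match lookup over a token list.

    Returns (start, end, label, canonical) tuples, end exclusive.
    """
    max_len = max((len(key) for key in gazetteer), default=0)
    lowered = [t.lower() for t in tokens]
    matches = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate phrase first, then shorter ones
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            hit = gazetteer.get(tuple(lowered[i:i + n]))
            if hit is not None:
                label, canonical = hit
                matches.append((i, i + n, label, canonical))
                i += n
                break
        else:
            # No phrase starts at this token; move on
            i += 1
    return matches


gazetteer = build_gazetteer([('Acute Peptic Ulcer', 'DiseaseorSyndrome')])
tokens = 'The patient has an acute peptic ulcer .'.split()
matches = match_entities(tokens, gazetteer)
```

Returning the canonical term alongside the label is what makes normalization possible: differently cased mentions all map back to the same taxonomy entry.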
Hi,
First, I would like to thank you for your great work.
I was wondering whether there is any way to add extra named entities like 'animal' to the model.
I looked through the documentation without any success. All I could find there is a mention that you can add your own entity recogniser, and only that it should accept a doc and label entities. I have also seen #144, but it does not provide any example of how to retrain the model or how to add your own model. I think it would be of much benefit if examples of how to train your model and/or how to specify your own NER entities (with positive or negative examples) were added to the documentation.
Many thanks,
Jakub