How to ADD extra named entities #187

Closed
jaksmid opened this issue Nov 23, 2015 · 9 comments
Labels
docs Documentation and website

Comments

@jaksmid

jaksmid commented Nov 23, 2015

Hi,

First, I would like to thank you for your great work.

I was wondering whether there is any way to add extra named entities, such as 'animal', to the model.
I looked through the documentation without success. All I could find is a mention that you can add your own entity recogniser, and that it should accept a doc and label entities. I have also seen #144, but it does not give an example of how to retrain the model or how to add your own. I think it would be of much benefit if the documentation included examples of how to train a model and/or how to specify your own NER entity types (with positive or negative examples).

Many thanks,
Jakub

@honnibal
Member

Hey,

All the code for training is there, but the documentation is lacking, and you'll need a substantial amount of training data.

This is the training script that trains the tagger, parser and NER:

https://github.com/honnibal/spaCy/blob/master/bin/parser/train.py#L82

I agree that there needs to be documentation for this. Sorry for the delay on getting that done.

@jaksmid
Author

jaksmid commented Nov 30, 2015

Hi,

Many thanks for the reply. Will go through the script at the earliest opportunity.

Cheers,
Jakub

@honnibal honnibal added the docs Documentation and website label Jan 16, 2016
@honnibal
Member

As of v0.100, it should be possible to train new classes over the top of the old model. I don't know whether this will actually be nice for accuracy. The API for GoldParse isn't so nice, but for now this should work:

import plac

from spacy.en import English
from spacy.gold import GoldParse


def main(out_loc):
    nlp = English(parser=False) # Avoid loading the parser, for quick load times
    # Run the tokenizer and tagger (but not the entity recognizer)
    doc = nlp.tokenizer(u'Lions and tigers and grizzly bears!')
    nlp.tagger(doc) 

    nlp.entity.add_label('ANIMAL') # <-- New in v0.100

    # Create a GoldParse object. This should have a better API...
    indices = tuple(range(len(doc)))
    words = [w.text for w in doc]
    tags = [w.tag_ for w in doc]
    heads = [0 for _ in doc]
    deps = ['' for _ in doc]
    # This is the only part we care about. We want BILOU format
    ner = ['U-ANIMAL', 'O', 'U-ANIMAL', 'O', 'B-ANIMAL', 'L-ANIMAL', 'O']

    # Create the GoldParse
    annot = GoldParse(doc, (indices, words, tags, heads, deps, ner))

    # Update the weights with the example
    # Here we iterate until we get it entirely correct. In practice this is probably a bad idea.
    # Note that we've added a class to the existing model here! We "resume"
    # training the previous model. Whether this is good or not I can't say, you'll have to
    # experiment.
    loss = nlp.entity.train(doc, annot)
    i = 0
    while loss != 0 and i < 1000:
        loss = nlp.entity.train(doc, annot)
        i += 1
    print("Used %d iterations" % i)

    nlp.entity(doc)
    for ent in doc.ents:
        print(ent.text, ent.label_)
    nlp.entity.model.dump(out_loc)

if __name__ == '__main__':
    plac.call(main)
$ python examples/add_entity_type.py /tmp/animals.model
Used 2 iterations
(u'Lions', u'ANIMAL')
(u'tigers', u'ANIMAL')
(u'grizzly bears', u'ANIMAL')
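
For readers unfamiliar with the BILOU scheme used in the `ner` list above: B marks the first token of a multi-token entity, I a token inside it, L its last token, U a single-token entity, and O a token outside any entity. Below is a minimal plain-Python sketch of deriving such tags from token-level spans. `spans_to_bilou` is a hypothetical helper, not part of spaCy, and it assumes the spans are given as (start, end, label) token indices with an exclusive end and no overlaps.

```python
def spans_to_bilou(n_tokens, spans):
    """Return one BILOU tag per token.

    spans: iterable of (start, end, label) token indices, end exclusive.
    Assumes spans do not overlap (hypothetical helper, not a spaCy API).
    """
    tags = ['O'] * n_tokens
    for start, end, label in spans:
        if end - start == 1:
            # Single-token entity
            tags[start] = 'U-' + label
        else:
            tags[start] = 'B-' + label
            for i in range(start + 1, end - 1):
                tags[i] = 'I-' + label
            tags[end - 1] = 'L-' + label
    return tags


# Tokens: Lions / and / tigers / and / grizzly / bears / !
tags = spans_to_bilou(7, [(0, 1, 'ANIMAL'), (2, 3, 'ANIMAL'), (4, 6, 'ANIMAL')])
print(tags)
```

For the seven tokens of the example sentence this reproduces the hand-written `ner` list from the script.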

@jewellcj

Thanks for the discussion. I'm new both to Python and spaCy (and NLP in general), so apologies in advance if I've missed something obvious here, but I did notice that the example provided by @honnibal doesn't work with the latest version of spaCy running under Python 3.5.

1. The example has:

nlp.entity.train(doc, annot)

but that method is no longer available. The code should be

nlp.entity.update(doc, annot)

When I make that change, I get this error:

File "spacy/syntax/parser.pyx", line 247, in spacy.syntax.parser.Parser.update (spacy/syntax/parser.cpp:7788)
File "spacy/syntax/ner.pyx", line 93, in spacy.syntax.ner.BiluoPushDown.preprocess_gold (spacy/syntax/ner.cpp:4782)
File "spacy/syntax/ner.pyx", line 112, in spacy.syntax.ner.BiluoPushDown.lookup_transition (spacy/syntax/ner.cpp:5145)
TypeError: argument of type 'NoneType' is not iterable

which, from an examination of ner.pyx, looks as if the exception is being thrown here:
for i in range(self.n_moves):

I tried passing both BILOU format and entity formats into the GoldParse constructor, with the exact same result.

2. There is also an example, train_ner, which offers an alternative way of training the entity recognizer. This worked for me except that, crucially, I was unable to modify it to accept my own entity type. Here are my relevant modifications (I had to take the code around 'loss' from the other example to make it work, sort of):

def train_ner(nlp, train_data, entity_types):
    ner = EntityRecognizer(nlp.vocab, entity_types=entity_types)
    for raw_text, entity_offsets in train_data:
        doc = nlp.make_doc(raw_text)
        gold = GoldParse(doc, entities=entity_offsets)
        # Note: these updates go to nlp.entity, not to the new `ner` above
        loss = nlp.entity.update(doc, gold)
        i = 0
        while loss != 0 and i < 1000:
            loss = nlp.entity.update(doc, gold)
            i += 1  # advance the counter so the 1000-iteration cap applies

    ner.model.end_training()
    return ner

and

...
    nlp = English()
    sty='DiseaseOrSyndrome'
    nlp.entity.add_label(sty) 
    entity = 'Acute Peptic Ulcer'
    train_data = [
        (
            'Acute peptic ulcer NOS'
          ,[(0, 18, sty)
            ]
        )
...
  ]
    ner = train_ner(nlp, train_data, [sty])

...

and when feeding in a couple of sentences containing 'Acute Peptic Ulcer', the code:

       for ent in parsedDoc.ents:
            print(ent.label, ent.label_, ' '.join(t.orth_ for t in ent))


prints this to the console:

349 ORG acute peptic ulcer
349 ORG acute peptic ulcer

So why don't I see something like:

nnn DiseaseOrSyndrome acute peptic ulcer
nnn DiseaseOrSyndrome acute peptic ulcer

Again, I may be out of my league here, being new to both Python and spaCy, but any help would be much appreciated!

@DomHudson
Contributor

DomHudson commented Nov 18, 2016

Why do you think you can no longer use:

nlp.entity.train(doc, annot)

Unless I've missed something, this is still present in the most recent version of spaCy. I successfully added new entities and got my results back with a new instantiation of spaCy, using code very similar to honnibal's.

The simplified code to load the saved model looks like this:

nlp = spacy.load('en')
# train spacy with custom data

# Add the tags and training data
nlp.entity.add_label(entlabel)
nlp.entity.model.load(trainingfile.model)

My point is really that you need to add the label again when you re-instantiate spaCy. Simply loading the training file is not enough.
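
Since the label has to be re-added before the saved weights are loaded, one option is to store the label names next to the model file. Below is a minimal sketch of that idea, assuming a JSON sidecar file. `save_labels`, `load_labels` and the `.labels.json` suffix are hypothetical conventions of this sketch, not spaCy features, and the spaCy calls appear only in comments that mirror the snippet above.

```python
import json
import os
import tempfile


def save_labels(model_path, labels):
    """Write the custom entity labels to a JSON file next to the model."""
    with open(model_path + '.labels.json', 'w') as f:
        json.dump(sorted(labels), f)


def load_labels(model_path):
    """Read back the labels written by save_labels()."""
    with open(model_path + '.labels.json') as f:
        return json.load(f)


# Temporary location used only for this demonstration
model_path = os.path.join(tempfile.mkdtemp(), 'animals.model')
save_labels(model_path, ['ANIMAL'])
restored = load_labels(model_path)
print(restored)

# After re-instantiating spaCy, one would then do, as in the snippet above:
#   for label in restored:
#       nlp.entity.add_label(label)
#   nlp.entity.model.load(model_path)
```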

@jewellcj

Thanks for the feedback @DomHudson

  1. As far as:
    nlp.entity.train(doc, annot)

the exact exception stack trace I get with that code in @honnibal's example above (whether using BILOU or positional training data) is:

File "{redacted}}/spacy/train_ner_from_taxonomy-with-Bilou.py", line 94, in main
loss = nlp.entity.train(doc, annot)
AttributeError: 'spacy.pipeline.EntityRecognizer' object has no attribute 'train'
File "{redacted}}/spacy/train_ner_from_taxonomy-with-Bilou.py", line 94, in main
loss = nlp.entity.train(doc, annot)
AttributeError: 'spacy.pipeline.EntityRecognizer' object has no attribute 'train'

I'll have to admit that my experience of Python is limited (I'm primarily a Java developer), so I may be missing something obvious, but a quick search of the spaCy repository reveals no instance of a train(...) function. I pulled the master branch on 11/16/2016.

  2. Having said all that, my original issue is resolved (thanks in part to your suggestion that I add the label again and reload the trained model to 'add' to the spaCy model). I now have a working example that correctly trains with the new entity type, at least intermittently.

So, for example, after training, I run this test inline, using the simple input text file:

The patient has an acute peptic ulcer.
There is no sign of an acute peptic ulcer.

The test code is:

if test_doc is not None:
    nlp = English()
    nlp.entity.add_label(sty)
    nlp.entity.model.load(str(model_dir / 'model'))

    with open(test_doc, 'r') as filein:
        test_doc_str = filein.read()

    parsedDoc = nlp(test_doc_str)
    for word in parsedDoc:
        print(word.text, word.tag_, word.ent_type_, word.ent_iob)
    print('\nResult of spaCy parse (named entities)')
    for ent in parsedDoc.ents:
        print(ent.label, ent.label_, ' '.join(t.orth_ for t in ent))
    print('\nResult of spaCy parse (noun chunks)')
    for np in parsedDoc.noun_chunks:
        print(np)

and that yields:

The DT 2
patient NN 2
has VBZ 2
an DT 2
acute JJ DiseaseOrSyndrome 3
peptic JJ DiseaseOrSyndrome 1
ulcer NN DiseaseOrSyndrome 1
. . 2

SP 2
There EX 2
is VBZ 2
no DT 2
sign NN 2
of IN 2
an DT 2
acute JJ DiseaseOrSyndrome 3
peptic JJ DiseaseOrSyndrome 1
ulcer NN DiseaseOrSyndrome 1
. . 2

Result of spaCy parse (named entities)
1510242 DiseaseOrSyndrome acute peptic ulcer
1510242 DiseaseOrSyndrome acute peptic ulcer

Result of spaCy parse (noun chunks)
The patient
an acute peptic ulcer
no sign
an acute peptic ulcer
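
A note on reading the per-token output above: the numeric last column is `ent_iob`, where 3 marks the first token of an entity, 1 a token inside one, and 2 a token outside any entity (0 would mean no tag was set). Below is a minimal sketch of grouping such rows back into entity spans. `group_entities` is a hypothetical helper written for this illustration, and the rows are copied from the output above.

```python
def group_entities(rows):
    """Group (text, ent_type, ent_iob) rows into (label, phrase) pairs.

    ent_iob codes: 3 = begins an entity, 1 = inside, 2 = outside, 0 = unset.
    """
    entities = []
    current = None  # (label, [tokens]) for the entity being built
    for text, ent_type, iob in rows:
        if iob == 3:
            if current:
                entities.append(current)
            current = (ent_type, [text])
        elif iob == 1 and current:
            current[1].append(text)
        else:
            if current:
                entities.append(current)
            current = None
    if current:
        entities.append(current)
    return [(label, ' '.join(tokens)) for label, tokens in entities]


rows = [
    ('has', '', 2), ('an', '', 2),
    ('acute', 'DiseaseOrSyndrome', 3),
    ('peptic', 'DiseaseOrSyndrome', 1),
    ('ulcer', 'DiseaseOrSyndrome', 1),
    ('.', '', 2),
]
print(group_entities(rows))
```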

Unfortunately, this is not consistent, even when loading the previously persisted model. I was wondering whether somewhere along the way I'm failing to explicitly associate the previously generated config.json with the previously trained and persisted model?

For now I'll have to assume that this is due to a lack of an adequate volume of training data. I will post further in a separate thread as soon as I have this stabilized.

Thanks again for your help.

@BrijeshKaria

@jewellcj - It looks like you were able to train a model to identify DiseaseOrSyndrome.
I am still not able to make it work.
Would you be able to share your training script? I tried the various things suggested above with no luck.

Thanks.

@jewellcj

@BrijeshKaria - yes, we were able to train the model (in prototype/tryout-only code), initially focusing on just one entity type. I guess the key piece of code was this:

def train_ner(nlp, train_data):
    for itn in range(10):
        random.shuffle(train_data)
        for raw_text, entity_offsets in train_data:
            doc = nlp(raw_text)
            nlp.tagger(doc)
            gold = GoldParse(doc, entities=entity_offsets)
            i = 0
            loss = nlp.entity.update(doc, gold)
            while loss != 0 and i < 1000:
                loss = nlp.entity.update(doc, gold)
                i += 1
    nlp.entity(doc)
    nlp.entity.model.end_training()
    return nlp

where we invoke the above function as follows:

    ...
    sty='T047:DiseaseOrSyndrome'
    nlp.entity.add_label(sty) 
    train_data = [
        (
            'Acute peptic ulcer NOS'
          ,[(0, 18, sty)
            ]
        )
        ,
        (
        'Acute peptic ulcer of duodenum'
          ,[(0, 18, sty)
            ]
        ),
#... etc.
         
    ]
    nlp=train_ner(nlp, train_data)
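
The `(0, 18, sty)` tuples above are character offsets into the raw text, with an inclusive start and an exclusive end. Writing them by hand is error-prone, so here is a minimal sketch of deriving them. `entity_offsets` is a hypothetical helper, not a spaCy API. It does a case-insensitive substring search, ignores token boundaries, and assumes text whose length does not change when lower-cased.

```python
def entity_offsets(text, phrase, label):
    """Return (start, end, label) for every occurrence of phrase in text.

    Case-insensitive; end is exclusive. Hypothetical helper, not a spaCy API.
    """
    offsets = []
    haystack, needle = text.lower(), phrase.lower()
    if not needle:
        return offsets  # an empty phrase would otherwise match forever
    start = haystack.find(needle)
    while start != -1:
        offsets.append((start, start + len(needle), label))
        start = haystack.find(needle, start + len(needle))
    return offsets


sty = 'T047:DiseaseOrSyndrome'
train_data = [
    (text, entity_offsets(text, 'acute peptic ulcer', sty))
    for text in ['Acute peptic ulcer NOS', 'Acute peptic ulcer of duodenum']
]
print(train_data[0])
```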

We more or less abandoned this, however, because training a spaCy model effectively requires a very large gold-standard corpus.

Instead we have been focusing on plain-old entity matching, and we have successfully used the spaCy Matcher with a spaCy Gazetteer generated from our Taxonomy, allowing us to use spaCy to normalize terms, thus improving the accuracy of our internal taxonomy search function.
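
To illustrate the general idea behind gazetteer-based matching, here is a plain-Python sketch. It is not spaCy's Matcher API and not the code described above, and the gazetteer entries are invented for the example. At each token position it looks up the longest known phrase and emits the normalized term for it.

```python
def match_gazetteer(tokens, gazetteer, max_len=5):
    """Greedy longest-match lookup of token n-grams in a gazetteer.

    gazetteer maps lower-cased phrases to a normalized term.
    Returns (start, end, normalized_term) tuples with end exclusive.
    """
    matches = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate phrase first
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = ' '.join(tokens[i:i + n]).lower()
            if phrase in gazetteer:
                matches.append((i, i + n, gazetteer[phrase]))
                i += n
                break
        else:
            i += 1  # no known phrase starts at this token
    return matches


# Invented entries; a real gazetteer would be generated from the taxonomy.
gazetteer = {
    'acute peptic ulcer': 'T047:DiseaseOrSyndrome',
    'peptic ulcer': 'T047:DiseaseOrSyndrome',
}
tokens = 'The patient has an acute peptic ulcer .'.split()
print(match_gazetteer(tokens, gazetteer))
```

Unlike a trained entity recognizer, this kind of lookup needs no annotated corpus, but it only finds phrases that are already listed.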

@lock

lock bot commented May 9, 2018

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked as resolved and limited conversation to collaborators May 9, 2018