How to ADD extra named entities #187

jaksmid · 2015-11-23T16:19:15Z

Hi,

First, I would like to thank you for your great work.

I was wondering whether there is any way to add extra named entities, such as 'animal', to the model.
I looked through the documentation without success. All I could find is a mention that you can add your own entity recogniser, and only that it should accept a doc and label entities. I have also seen #144, but it does not give any example of how to retrain the model or how to add your own. I think it would be very helpful if the documentation included examples of how to train a model and/or how to specify your own NER entities (with positive or negative examples).

Many thanks,
Jakub

honnibal · 2015-11-27T09:48:19Z

Hey,

All the code for training is there, but the documentation is lacking, and you'll need a substantial amount of training data.

This is the training script that trains the tagger, parser and NER:

https://github.com/honnibal/spaCy/blob/master/bin/parser/train.py#L82

I agree that there needs to be documentation for this. Sorry for the delay on getting that done.

jaksmid · 2015-11-30T09:46:41Z

Hi,

Many thanks for the reply. Will go through the script at the earliest opportunity.

Cheers,
Jakub

honnibal · 2016-01-19T23:51:17Z

As of v0.100, it should be possible to train new classes over the top of the old model. I don't know whether this will actually be nice for accuracy. The API for GoldParse isn't so nice, but for now this should work:

import plac

from spacy.en import English
from spacy.gold import GoldParse


def main(out_loc):
    nlp = English(parser=False) # Avoid loading the parser, for quick load times
    # Run the tokenizer and tagger (but not the entity recognizer)
    doc = nlp.tokenizer(u'Lions and tigers and grizzly bears!')
    nlp.tagger(doc) 

    nlp.entity.add_label('ANIMAL') # <-- New in v0.100

    # Create a GoldParse object. This should have a better API...
    indices = tuple(range(len(doc)))
    words = [w.text for w in doc]
    tags = [w.tag_ for w in doc]
    heads = [0 for _ in doc]
    deps = ['' for _ in doc]
    # This is the only part we care about. We want BILOU format
    ner = ['U-ANIMAL', 'O', 'U-ANIMAL', 'O', 'B-ANIMAL', 'L-ANIMAL', 'O']

    # Create the GoldParse
    annot = GoldParse(doc, (indices, words, tags, heads, deps, ner))

    # Update the weights with the example
    # Here we iterate until we get it entirely correct. In practice this is probably a bad idea.
    # Note that we've added a class to the existing model here! We "resume"
    # training the previous model. Whether this is good or not I can't say, you'll have to
    # experiment.
    loss = nlp.entity.train(doc, annot)
    i = 0
    while loss != 0 and i < 1000:
        loss = nlp.entity.train(doc, annot)
        i += 1
    print("Used %d iterations" % i)

    nlp.entity(doc)
    for ent in doc.ents:
        print(ent.text, ent.label_)
    nlp.entity.model.dump(out_loc)

if __name__ == '__main__':
    plac.call(main)

$ python examples/add_entity_type.py /tmp/animals.model
Used 2 iterations
(u'Lions', u'ANIMAL')
(u'tigers', u'ANIMAL')
(u'grizzly bears', u'ANIMAL')
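
The `ner` list in the script above is written out by hand in BILOU format (B = begin, I = inside, L = last, O = outside, U = unit-length entity). As a rough sketch, tags like these could be generated from token-index spans in plain Python. `spans_to_bilou` is a hypothetical helper for illustration, not part of spaCy:

```python
def spans_to_bilou(n_tokens, spans):
    """Return a list of BILOU tags for a document of n_tokens tokens.

    spans is a list of (start, end, label) tuples in token indices,
    with end exclusive. Overlapping spans are not checked for here.
    """
    tags = ['O'] * n_tokens
    for start, end, label in spans:
        if not 0 <= start < end <= n_tokens:
            raise ValueError('bad span: %r' % ((start, end, label),))
        if end - start == 1:
            # Single-token entity
            tags[start] = 'U-' + label
        else:
            tags[start] = 'B-' + label
            for i in range(start + 1, end - 1):
                tags[i] = 'I-' + label
            tags[end - 1] = 'L-' + label
    return tags


words = ['Lions', 'and', 'tigers', 'and', 'grizzly', 'bears', '!']
tags = spans_to_bilou(len(words), [(0, 1, 'ANIMAL'), (2, 3, 'ANIMAL'), (4, 6, 'ANIMAL')])
print(tags)
```

For the seven tokens of the example sentence this reproduces the hand-written `ner` list.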

jewellcj · 2016-11-16T21:56:29Z

Thanks for the discussion. I'm new both to Python and spaCy (and NLP in general), so apologies in advance if I've missed something obvious here, but I did notice that the example provided by @honnibal doesn't work with the latest version of spaCy running under Python 3.5.

1. The example has:

nlp.entity.train(doc, annot)

but that method is no longer available; the code should be

nlp.entity.update(doc, annot)

When I make that change, I get this error:

File "spacy/syntax/parser.pyx", line 247, in spacy.syntax.parser.Parser.update (spacy/syntax/parser.cpp:7788)
File "spacy/syntax/ner.pyx", line 93, in spacy.syntax.ner.BiluoPushDown.preprocess_gold (spacy/syntax/ner.cpp:4782)
File "spacy/syntax/ner.pyx", line 112, in spacy.syntax.ner.BiluoPushDown.lookup_transition (spacy/syntax/ner.cpp:5145)
TypeError: argument of type 'NoneType' is not iterable

which, judging from ner.pyx, appears to be thrown here:
for i in range(self.n_moves):

I tried passing both BILOU format and entity formats into the GoldParse constructor, with the exact same result.

2. There is also an example, train_ner, which offers an alternative way of training the entity recognizer. This worked for me except that, crucially, I was unable to modify it to accept my own entity type. Here are my relevant modifications (I had to take the code around 'loss' from the other example to make it work, sort of):

def train_ner(nlp, train_data, entity_types):
    ner = EntityRecognizer(nlp.vocab, entity_types=entity_types)
    for raw_text, entity_offsets in train_data:
        doc = nlp.make_doc(raw_text)
        gold = GoldParse(doc, entities=entity_offsets)
        loss = nlp.entity.update(doc, gold)
        i = 0
        while loss != 0 and i < 1000:
            loss = nlp.entity.update(doc, gold)
            i += 1

    ner.model.end_training()
    return ner

and

...
    nlp = English()
    sty='DiseaseOrSyndrome'
    nlp.entity.add_label(sty) 
    entity = 'Acute Peptic Ulcer'
    train_data = [
        (
            'Acute peptic ulcer NOS'
          ,[(0, 18, sty)
            ]
        )
...
  ]
    ner = train_ner(nlp, train_data, [sty])

...

and when feeding in a couple of sentences containing 'Acute Peptic Ulcer', the code:

for ent in parsedDoc.ents:
    print(ent.label, ent.label_, ' '.join(t.orth_ for t in ent))

prints this to the console:

349 ORG acute peptic ulcer
349 ORG acute peptic ulcer

So why don't I see something like:

nnn DiseaseOrSyndrome acute peptic ulcer
nnn DiseaseOrSyndrome acute peptic ulcer

Again, I may be out of my league here, being new to both Python and spaCy, but any help would be much appreciated!
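
The `(0, 18, sty)` tuples in the training data above are character offsets into the raw text, with an exclusive end. Before training, it can be worth confirming that each pair selects the intended substring and does not cut through a word. The following is a plain-Python sketch; `check_offsets` is a hypothetical helper, and splitting on whitespace only approximates spaCy's tokenizer:

```python
def check_offsets(text, entities):
    """Return the (substring, label) selected by each (start, end, label).

    Raises ValueError if an offset is out of range or does not fall on a
    whitespace-delimited token boundary.
    """
    # Character positions where a whitespace-delimited token starts or ends.
    starts, ends = set(), set()
    pos = 0
    for token in text.split():
        pos = text.index(token, pos)
        starts.add(pos)
        pos += len(token)
        ends.add(pos)

    selected = []
    for start, end, label in entities:
        if not 0 <= start < end <= len(text):
            raise ValueError('offsets out of range: %r' % ((start, end),))
        if start not in starts or end not in ends:
            raise ValueError('offsets cut through a token: %r' % ((start, end),))
        selected.append((text[start:end], label))
    return selected


result = check_offsets('Acute peptic ulcer NOS', [(0, 18, 'DiseaseOrSyndrome')])
print(result)
```

Here the offsets 0 to 18 select exactly 'Acute peptic ulcer'.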

DomHudson · 2016-11-18T09:21:12Z

Why do you think you can no longer use:

nlp.entity.train(doc, annot)

Unless I've missed something, this is still present in the most recent version of spaCy. I successfully added new entities and got my results back with a new instantiation of spaCy by using code very similar to honnibal's.

The simplified code to load the saved model looks like this:

nlp = spacy.load('en')
# train spacy with custom data

# Add the tags and training data
nlp.entity.add_label(entlabel)
nlp.entity.model.load(trainingfile.model)

My point is really that you need to add the label again when you re-instantiate spaCy. Simply loading the training file is not enough.
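
Since the label has to be re-added on every fresh instantiation, one option is to store the label list next to the model file. This is a minimal sketch using only the standard library. The helper names and the `labels.json` file name are hypothetical, and the spaCy calls appear only as comments mirroring the snippet above:

```python
import json
import os
import tempfile


def save_labels(model_dir, labels):
    """Write the custom entity labels to labels.json inside model_dir."""
    path = os.path.join(model_dir, 'labels.json')
    with open(path, 'w') as f:
        json.dump(sorted(labels), f)
    return path


def load_labels(model_dir):
    """Read the custom entity labels back from labels.json."""
    with open(os.path.join(model_dir, 'labels.json')) as f:
        return json.load(f)


model_dir = tempfile.mkdtemp()
save_labels(model_dir, ['ANIMAL'])
restored = load_labels(model_dir)
# After re-instantiating spaCy, re-add each label before loading the weights:
# for label in restored:
#     nlp.entity.add_label(label)
# ...then load the saved model file as in the snippet above.
print(restored)
```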

jewellcj · 2016-11-22T02:07:29Z

Thanks for the feedback @DomHudson

As for:
nlp.entity.train(doc, annot)

the exact stack trace I get with that code in @honnibal's example above (whether using BILOU or positional training data) is:

File "{redacted}}/spacy/train_ner_from_taxonomy-with-Bilou.py", line 94, in main
loss = nlp.entity.train(doc, annot)
AttributeError: 'spacy.pipeline.EntityRecognizer' object has no attribute 'train'

I'll have to admit that my experience of Python is limited (I'm primarily a Java developer), so I may be missing something obvious, but a quick search of the spaCy repository reveals that there is no instance of a train(...) function. I pulled the master branch on 11/16/2016.

Having said all that, my original issue is resolved (thanks in part to your suggestion that I add the label again and reload the training model to 'add' to the spaCy model); I now have a working example that does correctly train with the new entity type, at least intermittently so.

So, for example, after training, I run this test inline, using the simple input text file:

The patient has an acute peptic ulcer.
There is no sign of an acute peptic ulcer.

The test code is:

if test_doc is not None:
    nlp = English()
    nlp.entity.add_label(sty)
    nlp.entity.model.load(str(model_dir / 'model'))

    with open(test_doc, 'r') as filein:
        test_doc_str = filein.read()

    parsedDoc = nlp(test_doc_str)
    for word in parsedDoc:
        print(word.text, word.tag_, word.ent_type_, word.ent_iob)
    print('\nResult of spaCy parse (named entities)')
    for ent in parsedDoc.ents:
        print(ent.label, ent.label_, ' '.join(t.orth_ for t in ent))
    print('\nResult of spaCy parse (noun chunks)')
    for np in parsedDoc.noun_chunks:
        print(np)

and that yields:

The DT 2
patient NN 2
has VBZ 2
an DT 2
acute JJ DiseaseOrSyndrome 3
peptic JJ DiseaseOrSyndrome 1
ulcer NN DiseaseOrSyndrome 1
. . 2

SP 2
There EX 2
is VBZ 2
no DT 2
sign NN 2
of IN 2
an DT 2
acute JJ DiseaseOrSyndrome 3
peptic JJ DiseaseOrSyndrome 1
ulcer NN DiseaseOrSyndrome 1
. . 2

Result of spaCy parse (named entities)
1510242 DiseaseOrSyndrome acute peptic ulcer
1510242 DiseaseOrSyndrome acute peptic ulcer

Result of spaCy parse (noun chunks)
The patient
an acute peptic ulcer
no sign
an acute peptic ulcer
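
The integer in the last column of the per-token output is `ent_iob`: 3 marks the first token of an entity, 1 a token inside an entity, and 2 a token outside any entity (0 would mean no tag is set). As a small illustration, the rows for the first sentence can be grouped back into entity spans in plain Python. `group_entities` is a hypothetical helper, and the rows are copied from the output above:

```python
IOB_NAMES = {0: '', 1: 'I', 2: 'O', 3: 'B'}


def group_entities(rows):
    """Group (text, ent_type, ent_iob) rows into (label, text) entities."""
    entities = []
    current = None
    for text, ent_type, iob in rows:
        code = IOB_NAMES[iob]
        if code == 'B':
            # A new entity starts at this token.
            current = [ent_type, [text]]
            entities.append(current)
        elif code == 'I' and current is not None:
            # Continue the entity that is currently open.
            current[1].append(text)
        else:
            current = None
    return [(label, ' '.join(words)) for label, words in entities]


rows = [
    ('The', '', 2), ('patient', '', 2), ('has', '', 2), ('an', '', 2),
    ('acute', 'DiseaseOrSyndrome', 3),
    ('peptic', 'DiseaseOrSyndrome', 1),
    ('ulcer', 'DiseaseOrSyndrome', 1),
    ('.', '', 2),
]
entities = group_entities(rows)
print(entities)
```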

Unfortunately, this is not consistent, even when loading the previously persisted model. I wonder whether, somewhere along the way, I am failing to explicitly associate the previously generated config.json with the previously trained and persisted model.

For now I'll have to assume that this is due to a lack of an adequate volume of training data. I will post further in a separate thread as soon as I have this stabilized.

Thanks again for your help.

BrijeshKaria · 2017-01-15T14:28:27Z

@jewellcj - It looks like you were able to train a model to identify DiseaseOrSyndrome.
I am still not able to make it work.
Would you be able to share your training script? I tried the various things suggested above with no luck.

Thanks.

jewellcj · 2017-01-17T20:39:27Z

@BrijeshKaria - yes, we were able to train the model (in prototype/tryout-only code), initially focusing on just one entity. I guess the key piece of code was this:

def train_ner(nlp, train_data):
    for itn in range(10):
        random.shuffle(train_data)
        for raw_text, entity_offsets in train_data:
            doc = nlp(raw_text)
            nlp.tagger(doc)
            gold = GoldParse(doc, entities=entity_offsets)
            i = 0
            loss = nlp.entity.update(doc, gold)
            while loss != 0 and i < 1000:
                loss = nlp.entity.update(doc, gold)
                i += 1
    nlp.entity(doc)
    nlp.entity.model.end_training()
    return nlp

where we invoke the above function as follows:

    ...
    sty='T047:DiseaseOrSyndrome'
    nlp.entity.add_label(sty) 
    train_data = [
        (
            'Acute peptic ulcer NOS'
          ,[(0, 18, sty)
            ]
        )
        ,
        (
        'Acute peptic ulcer of duodenum'
          ,[(0, 18, sty)
            ]
        ),
#... etc.
         
    ]
    nlp=train_ner(nlp, train_data)

We sort of abandoned this, however, because to train a spaCy model effectively you need a very large gold-standard corpus.

Instead we have been focusing on plain-old entity matching, and we have successfully used the spaCy Matcher with a spaCy Gazetteer generated from our Taxonomy, allowing us to use spaCy to normalize terms, thus improving the accuracy of our internal taxonomy search function.
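
The dictionary-based approach can be illustrated without any trained model. The following is a plain-Python sketch of longest-match lookup against a small gazetteer. It is not spaCy's Matcher API, and the function name, the lowercase normalisation and the sample terms are assumptions made for illustration:

```python
def match_gazetteer(tokens, gazetteer):
    """Return (start, end, label) token spans, preferring the longest match.

    gazetteer maps tuples of lowercase tokens to a label.
    """
    if not gazetteer:
        return []
    max_len = max(len(term) for term in gazetteer)
    matches = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            key = tuple(t.lower() for t in tokens[i:i + n])
            if key in gazetteer:
                matches.append((i, i + n, gazetteer[key]))
                i += n
                break
        else:
            # No term starts at this token.
            i += 1
    return matches


gazetteer = {
    ('acute', 'peptic', 'ulcer'): 'DiseaseOrSyndrome',
    ('peptic', 'ulcer'): 'DiseaseOrSyndrome',
}
tokens = 'The patient has an acute peptic ulcer .'.split()
matches = match_gazetteer(tokens, gazetteer)
print(matches)
```

Trying the longest candidate first means 'acute peptic ulcer' is returned as one span rather than as the shorter 'peptic ulcer'.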

lock · 2018-05-09T04:38:39Z

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

honnibal added the docs Documentation and website label Jan 16, 2016

ryangrimm mentioned this issue Mar 29, 2016

List index out of range error from GoldParse #314

Closed

danielsgriffin mentioned this issue Apr 19, 2016

spaCy labeling is sometimes very poor TheGadflyProject/TheGadflyProject#44

Open

viksit mentioned this issue Jul 26, 2016

Annotating BILOU tags from another system #461

Closed

This was referenced Aug 12, 2016

Named Entity Recognition : how to get accepted words for Entity type? #475

Closed

How to add and train new Entity? Clarifications #479

Closed

leanderme mentioned this issue Aug 28, 2016

Training NER model #490

Closed

ines mentioned this issue Oct 22, 2016

💫 Document workflow: Training the tagger, parser and entity recogniser #553

Closed

ines closed this as completed Oct 22, 2016

KhrystynaKosenko mentioned this issue Nov 7, 2016

Named Entity Recognition: How to add 2 new classes to existing entities if I have a list of all possible words in txt file? #612

Closed

lock bot locked as resolved and limited conversation to collaborators May 9, 2018
