Is the NER actually retrainable? #887

jcbgamboa · 2017-03-13T14:19:50Z

I have seen issues #773 and #881. I have been trying to use spaCy to train an entity recognizer like those offered by bot APIs (e.g., Google API.ai, WIT.ai, ...). I have a function like this (the main loop is basically copied from the tutorial):

    import random

    from spacy.gold import GoldParse
    from spacy.pipeline import EntityRecognizer

    def train_entities(nlp, path, n_iterations=5):
        # Replace the pipeline's entity recognizer with a fresh one
        nlp.entity = EntityRecognizer(nlp.vocab, entity_types=['GPE'])

        train_data = load_train_data(path)

        if train_data is None:
            return nlp.entity

        # Very much based on
        # https://spacy.io/docs/usage/entity-recognition#updating
        for itn in range(n_iterations):
            random.shuffle(train_data)
            for raw_text, entity_offsets in train_data:
                doc = nlp.make_doc(raw_text)
                gold = GoldParse(doc, entities=entity_offsets)

                nlp.tagger(doc)
                nlp.entity.update(doc, gold)

        nlp.entity.model.end_training()
        return nlp

where load_train_data() returns data in the format used in the tutorial. I get the entities by running processed_sentence = nlp(sentence) (where sentence is a unicode string) and then accessing processed_sentence.ents. I initially tried with just a few examples, and it didn't work. Then I read this (in #773):

I get that people want to train on a few dozen sentences. I think people shouldn't want that.

So I thought I would try to overfit the model by training it on the same sentences for a large number of epochs and see what happens. I found lots of addresses at http://results.openaddresses.io/. I chose the addresses in Thüringen (Germany) and randomly picked 25000 of them. I want the entity recognizer to label these as "GPE". Using 12 small sentence templates, I generated 23105 sentences from these addresses, with annotations giving the character offset at which each GPE starts and ends (just like in the tutorial). There are fewer sentences than addresses because some templates take two addresses (e.g., "I moved from {} to {}").
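To make the data format concrete, here is a minimal sketch of how templated sentences with character offsets could be generated. The function name fill_template is hypothetical, and this is not the exact script that produced the dataset; it only illustrates the (text, [(start, end, label)]) format from the tutorial.

```python
# -*- coding: utf-8 -*-
# Illustrative sketch only: fill_template is a hypothetical helper, not the
# script that was actually used to build the dataset described above.


def fill_template(template, addresses, label=u"GPE"):
    """Fill each '{}' slot in order and record (start, end, label) offsets.

    Offsets are character offsets into the returned text. On Python 2 the
    inputs should be unicode strings so that offsets count characters
    rather than bytes.
    """
    parts = template.split(u"{}")
    n_slots = len(parts) - 1
    if n_slots != len(addresses):
        raise ValueError("template has %d slots but %d addresses were given"
                         % (n_slots, len(addresses)))

    text = parts[0]
    offsets = []
    for address, tail in zip(addresses, parts[1:]):
        start = len(text)          # the entity starts where the text currently ends
        text += address
        offsets.append((start, len(text), label))
        text += tail
    return text, offsets


# One-slot template, using the sentence quoted later in this post.
text, offsets = fill_template(u"{} is where I live",
                              [u"Waldstraße, 10, Dermbach"])

# Two-slot template; both addresses below are made-up placeholders.
text2, offsets2 = fill_template(u"I moved from {} to {}",
                                [u"A Street, 1, Town", u"B Street, 2, City"])
```

Each returned pair has the same shape as one item of train_data in the loop above, so a quick sanity check is that text[start:end] equals the address for every offset.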

Finally, I trained the entity recognizer (using the function above and this dataset) for 5, 20, 50 and 100 epochs. Still, all this training seems to have made no difference: when I try any of the sentences in my training set (the sentences I trained it on), I get the same results as I did without any training. For example, the sentence

Waldstraße, 10, Dermbach is where I live
(which is one of the sentences in the training set)

still gives me these three entities (the print format is just for my convenience):

Entity: Waldstra├ƒe, label: ORG, start: 0, end: 1
Entity: 10, label: DATE, start: 2, end: 3
Entity: Dermbach, label: GPE, start: 4, end: 5

(see issue #858 for the Unicode strangeness -- shouldn't be a problem here)

which are the same as what it gave me originally, without any training.

Am I doing anything wrong? Is this training procedure somehow wrong? Any ideas? Should I try more epochs?

[Now I'll probably take a look at how RASA NLU does it... because if they manage to make it work, then I am probably making some silly mistake]

Your Environment

Operating System: Windows
Python Version Used: 2.7.13
spaCy Version Used: 1.6.0
Environment Information: Windows 7 Ultimate SP1, 64 bit

honnibal · 2017-03-13T14:26:21Z

Sorry that this has been a bit unstable. The code in Thinc 6.5.0 is training properly for me -- I'm currently working on getting new models up for the next release of spaCy.

So, try either updating to the latest thinc, or at least remove the call to .end_training().
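As a rough sketch of the second suggestion, the loop from the original post can be written without the final .end_training() call. This assumes the spaCy 1.x calls used in the question (nlp.make_doc, nlp.tagger, nlp.entity.update). The make_gold parameter is a hypothetical hook added here so the loop can be exercised without spaCy installed; in real use you would pass spacy.gold.GoldParse.

```python
import random


def train_entities_without_end_training(nlp, train_data, make_gold,
                                        n_iterations=5):
    """Update nlp.entity in place and return nlp.

    Same loop as in the question, minus the final
    nlp.entity.model.end_training() call.

    make_gold is a hypothetical hook (not part of spaCy): pass
    spacy.gold.GoldParse when running against spaCy 1.x.
    """
    for _ in range(n_iterations):
        random.shuffle(train_data)
        for raw_text, entity_offsets in train_data:
            doc = nlp.make_doc(raw_text)
            gold = make_gold(doc, entities=entity_offsets)

            nlp.tagger(doc)
            nlp.entity.update(doc, gold)

    # Deliberately no end_training() here.
    return nlp
```

Whether this is enough on spaCy 1.6.0 depends on the installed thinc version, so updating thinc is still the first thing to try.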

lock · 2018-05-09T01:38:49Z

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

honnibal added the usage General spaCy usage label Mar 22, 2017

honnibal closed this as completed Mar 22, 2017

ramonrod mentioned this issue Apr 6, 2017

What format to use for training data and NER-model #959

Closed

lock bot locked as resolved and limited conversation to collaborators May 9, 2018
