Is the NER *actually* retrainable? #887

Closed
jcbgamboa opened this issue Mar 13, 2017 · 2 comments
Labels
usage General spaCy usage

Comments

@jcbgamboa
Contributor

I have seen issues #773 and #881. I have been trying to use spaCy to train an entity recognizer like those offered by bot APIs (e.g., Google API.ai, WIT.ai, ...). I have a function like this (the main loop is basically copied from the tutorial):

    import random

    from spacy.gold import GoldParse
    from spacy.pipeline import EntityRecognizer

    def train_entities(nlp, path, n_iterations=5):
        # Replace the pipeline's entity recognizer with a fresh one for 'GPE'
        nlp.entity = EntityRecognizer(nlp.vocab, entity_types=['GPE'])

        train_data = load_train_data(path)

        if train_data is None:
            return nlp

        # Very much based on
        # https://spacy.io/docs/usage/entity-recognition#updating
        for itn in range(n_iterations):
            random.shuffle(train_data)
            for raw_text, entity_offsets in train_data:
                doc = nlp.make_doc(raw_text)
                gold = GoldParse(doc, entities=entity_offsets)

                # Tag the doc first, then update the recognizer on it
                nlp.tagger(doc)
                nlp.entity.update(doc, gold)

        nlp.entity.model.end_training()
        return nlp

where load_train_data() gives me the training data in the format of the tutorial. I get the entities by running processed_sentence = nlp(sentence) (where sentence is a unicode string) and then accessing processed_sentence.ents. I initially tried with just a few examples, and it didn't work. Then I read this (in #773):

I get that people want to train on a few dozen sentences. I think people shouldn't want that.

So I thought I would try to overfit the model by training it on the same sentences for a large number of epochs and see what happens. I found lots of addresses at http://results.openaddresses.io/. I chose the addresses in Thüringen (Germany) and randomly picked 25000 of them. I want the entity recognizer to label these as "GPE". Using 12 short sentence templates, I generated 23105 sentences from these addresses, each annotated with the character offsets at which a GPE starts and ends (just like in the tutorial). There are fewer sentences than addresses because some templates require two addresses (e.g., "I moved from {} to {}").
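For reference, the template-filling step described above could look roughly like the sketch below. The function names (fill_template, generate_examples) and the sampling strategy are assumptions for illustration, not the code actually used in this issue. It assumes Python 3 strings; under Python 2.7 the templates and addresses would need to be unicode objects so that offsets count characters rather than bytes.

```python
import random


def fill_template(template, addresses, label="GPE"):
    """Fill each '{}' slot with an address and record character offsets.

    Returns (text, offsets), where offsets is a list of
    (start_char, end_char, label) tuples, one per inserted address.
    """
    parts = template.split("{}")
    if len(parts) - 1 != len(addresses):
        raise ValueError("template needs %d addresses" % (len(parts) - 1))
    text = parts[0]
    offsets = []
    for address, tail in zip(addresses, parts[1:]):
        start = len(text)          # the address starts where the text ends so far
        text += address
        offsets.append((start, len(text), label))
        text += tail
    return text, offsets


def generate_examples(templates, addresses, seed=0):
    """Use each address at most once, picking a random template each time."""
    usable = [t for t in templates if t.count("{}") > 0]
    if not usable:
        raise ValueError("no template contains a '{}' slot")
    rng = random.Random(seed)
    pool = list(addresses)
    rng.shuffle(pool)
    examples = []
    while pool:
        template = rng.choice(usable)
        needed = template.count("{}")
        if needed > len(pool):
            break                  # leftover addresses are simply dropped
        chosen = [pool.pop() for _ in range(needed)]
        examples.append(fill_template(template, chosen))
    return examples
```

Because two-slot templates consume two addresses per sentence, this kind of generator naturally yields fewer sentences than addresses.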

Finally, I trained the entity recognizer (using the function above and this dataset) for 5, 20, 50 and 100 epochs. Still, none of this training seems to have made any difference: when I try any of the sentences in my training set (the sentences I trained it on), I get the same results as I did without any training. E.g., the sentence

Waldstraße, 10, Dermbach is where I live
(which is one of the sentences in the training set)

still gives me these three entities (the output format is my own, for convenience):

Entity: Waldstraße, label: ORG, start: 0, end: 1
Entity: 10, label: DATE, start: 2, end: 3
Entity: Dermbach, label: GPE, start: 4, end: 5

(see issue #858 for the Unicode strangeness -- shouldn't be a problem here)

which are the same ones it gave me originally, without any training.
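One way to turn this spot check into a number is to measure how many training sentences the model reproduces exactly. The sketch below is an assumption-laden illustration: exact_match_rate and the predict callable are hypothetical names, and predict is expected to return (start_char, end_char, label) tuples in the same character-offset format as the training data. Note that the start/end values printed above look like token indices, so they would need converting to character offsets before comparison.

```python
def exact_match_rate(train_data, predict):
    """Fraction of examples whose predicted spans equal the gold spans.

    train_data: list of (text, [(start_char, end_char, label), ...])
    predict:    callable mapping text to a list of such tuples
    """
    if not train_data:
        return 0.0
    hits = 0
    for text, gold_offsets in train_data:
        # Compare as sets so the order of entities does not matter
        if set(predict(text)) == set(gold_offsets):
            hits += 1
    return hits / float(len(train_data))
```

With spaCy, predict might be something like `lambda text: [(e.start_char, e.end_char, e.label_) for e in nlp(text).ents]`, assuming the installed version exposes those Span attributes. A model that has overfit the training set should score close to 1.0; a score that stays the same before and after training suggests the updates are not taking effect.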

Am I doing anything wrong? Is this training procedure somehow wrong? Any ideas? Should I try more epochs?

[Now I'll probably take a look at how RASA NLU does it... because if they manage to make it work, then I am probably making some silly mistake]

Your Environment

  • Operating System: Windows
  • Python Version Used: 2.7.13
  • spaCy Version Used: 1.6.0
  • Environment Information: Windows 7 Ultimate SP1, 64 bit
@honnibal
Member

Sorry that this has been a bit unstable. The code in Thinc 6.5.0 is training properly for me -- I'm currently working on getting new models up for the next release of spaCy.

So, try either updating to the latest thinc or, at the very least, removing the call to .end_training().
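As a rough sketch of the second suggestion, the training loop from the question could be written without the final .end_training() call. The make_gold parameter is a hypothetical addition so the loop does not import spaCy itself; with spaCy 1.x it could be `lambda doc, ents: GoldParse(doc, entities=ents)`. Whether this alone fixes the behaviour depends on the installed spaCy and Thinc versions.

```python
import random


def train_entities(nlp, train_data, make_gold, n_iterations=5, seed=0):
    """Update nlp.entity on train_data, without calling end_training().

    nlp is assumed to expose make_doc(), tagger() and entity.update(),
    as in the function from the question.
    """
    rng = random.Random(seed)
    data = list(train_data)        # copy, so the caller's list is not reordered
    for _ in range(n_iterations):
        rng.shuffle(data)
        for raw_text, entity_offsets in data:
            doc = nlp.make_doc(raw_text)
            gold = make_gold(doc, entity_offsets)
            nlp.tagger(doc)
            nlp.entity.update(doc, gold)
    # Deliberately no nlp.entity.model.end_training() here
    return nlp
```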

@lock

lock bot commented May 9, 2018

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked as resolved and limited conversation to collaborators May 9, 2018