What format to use for training data and NER-model #959

ramonrod · 2017-04-06T14:52:45Z

Hello,

I have been trying to train a model using the same method as in #887, just as a test case.
I have a question: what is the best format for a training corpus to import into spaCy? I have a text file with a list of entities that need new entity types for tagging.
Let me explain my case. I follow the update.training script like this:

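import random

import spacy
from spacy.gold import GoldParse
from spacy.pipeline import EntityRecognizer

# Load the model without its built-in entity recognizer and parser
nlp = spacy.load('en_core_web_md', entity=False, parser=False)

# Fresh recognizer that only knows the new FINANCE type
ner = EntityRecognizer(nlp.vocab, entity_types=['FINANCE'])

# train_data is the list of (text, entity_offsets) pairs shown below
for itn in range(5):
    random.shuffle(train_data)
    for raw_text, entity_offsets in train_data:
        doc = nlp.make_doc(raw_text)
        gold = GoldParse(doc, entities=entity_offsets)

        nlp.tagger(doc)
        ner.update(doc, gold)
ner.model.end_training()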

I add my training data as entity_offsets:

train_data = [
    ('Monetary contracts are financial instruments between parties', [(23, 44, 'FINANCE')])
]

This works fine for the one example and the new entity tag. Obviously I want to be able to add more than one example. The idea is to create a text file with tagged sentences. The question is: what format does spaCy need for training data? Should I stick with the entity offsets from the examples (which would be very tedious for thousands of sentences), or is there another way to prepare the file, like:

financial instruments   FINANCE
contracts   FINANCE
Product OBJ
of O
Microsoft ORG
   etc ...

And how can I pass the corpus to spaCy using the method above? Do I have to use the newly created model, or can I add the new entities to the old model? How can this be achieved?
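One possible way to avoid typing offsets by hand is to derive them from a phrase list like the one above. The following is only a minimal sketch using the Python standard library: the helper name find_entity_offsets and the phrase_labels dictionary are hypothetical, not part of spaCy. It labels every occurrence of a phrase regardless of context, so the output would still need manual review before training.

```python
import re

def find_entity_offsets(text, phrase_labels):
    """Return sorted, non-overlapping (start, end, label) tuples for known phrases."""
    candidates = []
    for phrase, label in phrase_labels.items():
        # \b keeps 'contracts' from matching inside 'subcontracts'
        pattern = r"\b" + re.escape(phrase) + r"\b"
        for match in re.finditer(pattern, text, flags=re.IGNORECASE):
            candidates.append((match.start(), match.end(), label))
    # Consider longer matches first so they win over shorter overlapping ones
    candidates.sort(key=lambda c: (-(c[1] - c[0]), c[0]))
    chosen = []
    for start, end, label in candidates:
        if all(end <= s or start >= e for s, e, _ in chosen):
            chosen.append((start, end, label))
    return sorted(chosen)

# Hypothetical phrase list, mirroring the 'phrase<TAB>LABEL' file idea
phrase_labels = {"financial instruments": "FINANCE", "contracts": "FINANCE"}
sentences = ["Monetary contracts are financial instruments between parties"]

# Same (text, entity_offsets) shape as the train_data list used for training
train_data = [(s, find_entity_offsets(s, phrase_labels)) for s in sentences]
print(train_data)
```

The resulting list has the same shape as the hand-written train_data, so it could be passed to the training loop unchanged.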

Thanks

Your Environment

spaCy version: 1.7.3
Platform: Windows-7-6.1.7601-SP1
Python version: 3.6.0
Installed models: en, en_core_web_md


ramonrod · 2017-04-11T07:53:43Z

Hi,
I managed to import a file with training data that is recognized by the training method described above.
The list looks like this:

Financial instruments can be real or virtual documents, 0 21 FINANCE
The number of units of the financial instrument, 27 47 FINANCE
or the number of derivative contracts in the transaction, 17 37 BANKING
Date and time when the transaction was executed, 23 34 ORDER
...
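A file in this layout could be turned into the (text, entity_offsets) pairs used by the training loop along these lines. This is a rough sketch that assumes one annotation per line and that the annotation always follows the last comma; the function name parse_annotated_line is made up for illustration.

```python
def parse_annotated_line(line):
    """Split 'sentence text, START END LABEL' into (text, [(start, end, label)])."""
    # Split on the last comma only, since the sentence itself may contain commas
    text, _, annotation = line.rstrip("\n").rpartition(",")
    start, end, label = annotation.split()
    start, end = int(start), int(end)
    if not 0 <= start < end <= len(text):
        raise ValueError("offsets out of range: %r" % line)
    return text, [(start, end, label)]

lines = [
    "Financial instruments can be real or virtual documents, 0 21 FINANCE",
    "The number of units of the financial instrument, 27 47 FINANCE",
]
train_data = [parse_annotated_line(line) for line in lines]

# Print each labelled span to check the offsets by eye before training
for text, entities in train_data:
    for start, end, label in entities:
        print(label, repr(text[start:end]))
```

Printing the sliced text makes off-by-one errors in the offsets easy to spot before they reach the recognizer.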

Now my question: the training is not performing well, which I suppose is due to the small amount of training data. All entries in the test corpus get tagged as FINANCE, or all get tagged as BANKING. How big does my training data need to be to get better performance?

I guess I will have to annotate a bigger corpus for my training data. Can this be done in a different way?

What algorithm is behind the spaCy Named Entity Recognizer?

Thanks

ines · 2017-04-16T21:42:55Z

The new version 1.8.0 comes with bug fixes to the NER training procedure and a new save_to_directory() method. We've also updated the docs with more information on training and NER training in particular:

Workflow: Training the Named Entity Recognizer
Workflow: Saving and loading models
Example: Training an additional entity type
Command line interface for initialising, training and packaging models

I hope this helps!

ramonrod · 2017-04-18T06:53:18Z

Thanks Ines, yes, this helps a lot. One last thing: I am also interested in finding relations between the entities, for example using regex to find specific words between two entities.

Is there an approachable way to do this with spaCy once I have a trained model with new entities?

DomHudson · 2017-04-23T08:31:10Z

Well, the document class returned by nlp(text) is iterable, so would simply looping through it and collecting the tokens of interest work for you?

If you want to analyse more complex relationships you could walk up the syntactic tree to deduce the distance between the words.
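As a rough illustration of the regex idea raised in the question, the text between two neighbouring entities can be searched directly once the entity character offsets are known. The sketch below works on plain (start, end, label) tuples and uses only the standard library; find_relations is a hypothetical helper. With a trained model, the tuples could presumably be built from doc.ents via each span's start_char, end_char and label_.

```python
import re

def find_relations(text, entities, pattern):
    """Return (left_entity, connecting_text, right_entity) for adjacent entity pairs."""
    ordered = sorted(entities)
    relations = []
    for (s1, e1, l1), (s2, e2, l2) in zip(ordered, ordered[1:]):
        # Only the text strictly between the two entities is searched
        between = text[e1:s2]
        match = re.search(pattern, between)
        if match:
            relations.append(((text[s1:e1], l1), match.group(0), (text[s2:e2], l2)))
    return relations

text = "Monetary contracts are financial instruments between parties"
entities = [(9, 18, "FINANCE"), (23, 44, "FINANCE")]
print(find_relations(text, entities, r"\b(is|are)\b"))
```

This only looks at surface text between adjacent entities; walking the syntactic tree, as suggested above, would be needed for relations that span intervening clauses.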

SATHVIKRAJU · 2017-09-28T19:09:18Z

Hi everyone, I have similar questions to ramonrod's about how to generate the training data. I successfully trained the model on one new entity type, with training data tagged manually as shown in the docs, but this would be tedious for 1000 or more examples. Is there a better way to do this, or are there other formats in which I can annotate the data? I am new to NLP and spaCy and would appreciate the help.

phanibulusu · 2018-01-30T13:15:37Z

Hi @SATHVIKRAJU @ramonrod.

Have you figured out a way to automate the annotation of training data?
I am stuck on the same issue; if you have a workaround, it would save me a lot of time.

I would like to train my model to recognize finance/technology-related words. I have some job names like 'capex', 'DefCalc' etc. I want my model to recognize these entities.

Appreciate the help!

Thanks,
Phani

ramonrod · 2018-02-01T17:16:11Z

Hi all,
apparently there is no fully automated way to do this, at least not to my knowledge.
I would recommend taking a look at the following packages (Python-compatible):

Snorkel (https://github.com/HazyResearch/snorkel#learning-how-to-use-snorkel)
Prodigy (https://prodi.gy/)

Hope this can help you.

amukka · 2018-04-03T01:20:09Z

Hi, could you please help with creating the training data? It was mentioned above that he managed to import the data from a text file and train on top of it. Will this work?

lock · 2018-05-07T17:53:03Z

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

honnibal added the usage General spaCy usage label Apr 13, 2017

ines closed this as completed Apr 16, 2017

lock bot locked as resolved and limited conversation to collaborators May 7, 2018
