What format to use for training data and NER-model #959

Closed
ramonrod opened this issue Apr 6, 2017 · 9 comments
Labels
usage General spaCy usage

Comments

@ramonrod

ramonrod commented Apr 6, 2017

Hello,

I have been trying to train a model with the same method used in #887, just as a test case.
My question: what is the best format for a training corpus to import into spaCy? I have a text file with a list of terms that need new entity types for tagging.
Let me explain my case. I follow the update.training script like this:

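import random

import spacy
from spacy.gold import GoldParse
from spacy.pipeline import EntityRecognizer

# spaCy 1.x API: load the model without its built-in NER and parser
nlp = spacy.load('en_core_web_md', entity=False, parser=False)

# New entity recognizer that only knows the custom label
ner = EntityRecognizer(nlp.vocab, entity_types=['FINANCE'])

for itn in range(5):
    random.shuffle(train_data)  # train_data: list of (text, entity_offsets)
    for raw_text, entity_offsets in train_data:
        doc = nlp.make_doc(raw_text)
        gold = GoldParse(doc, entities=entity_offsets)

        nlp.tagger(doc)
        ner.update(doc, gold)
ner.model.end_training()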

I add my training data as entity_offsets:

train_data = [
    ('Monetary contracts are financial instruments between parties', [(23, 44, 'FINANCE')])
]

This works fine for the one example and the new entity tag. Obviously I want to be able to add more than one example. The idea is to create a text file with tagged sentences. The question is: what format does spaCy need for training data? Should I stick with the entity_offsets from the examples (this would be a very tedious task for thousands of sentences), or is there another way to prepare the file, like:

financial instruments   FINANCE
contracts   FINANCE
Product OBJ
of O
Microsoft ORG
   etc ...

And how can I pass the corpus to spaCy using the method above? Do I have to use the newly created model, or can I add the new entities to the old model? How can this be achieved?
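
One way to avoid counting offsets by hand is to derive them from a term list like the one above. The following is a minimal sketch, not part of spaCy: `TERMS` and `annotate` are hypothetical names, and it assumes that case-insensitive whole-word matching is good enough and that a longer term should win over a shorter one it overlaps.

```python
import re

# Hypothetical gazetteer: surface form -> entity label.
TERMS = {
    "financial instruments": "FINANCE",
    "contracts": "FINANCE",
    "Microsoft": "ORG",
}

def annotate(text, terms):
    """Return (text, [(start, end, label), ...]) using character offsets."""
    spans = []
    # Try longer terms first so they take precedence over shorter overlapping ones.
    for term in sorted(terms, key=len, reverse=True):
        pattern = r"\b" + re.escape(term) + r"\b"
        for m in re.finditer(pattern, text, flags=re.IGNORECASE):
            overlaps = any(m.start() < e and s < m.end() for s, e, _ in spans)
            if not overlaps:
                spans.append((m.start(), m.end(), terms[term]))
    return text, sorted(spans)

train_data = [
    annotate("Monetary contracts are financial instruments between parties", TERMS)
]
```

Each item in `train_data` has the same `(text, entity_offsets)` shape as the hand-written example. Dictionary matching ignores context, so the generated annotations are best treated as a first draft to review rather than as gold data.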

Thanks

Your Environment

  • spaCy version: 1.7.3
  • Platform: Windows-7-6.1.7601-SP1
  • Python version: 3.6.0
  • Installed models: en, en_core_web_md
@ramonrod
Author

Hi,
I managed to import a file with training data that would be recognized by the training method described above.
The list will look like this:

Financial instruments can be real or virtual documents, 0 21 FINANCE
The number of units of the financial instrument, 27 47 FINANCE
or the number of derivative contracts in the transaction, 17 37 BANKING
Date and time when the transaction was executed, 23 34 ORDER
...
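
A file in this layout can be turned into `(text, entity_offsets)` tuples with a few lines of Python. This is only a sketch under stated assumptions: `parse_line` and `load_train_data` are hypothetical helper names, and each line is assumed to carry exactly one annotation in the form `sentence, start end LABEL`.

```python
def parse_line(line):
    """Split 'sentence, start end LABEL' into (sentence, [(start, end, label)])."""
    # Split on the last comma only, since the sentence itself may contain commas.
    text, annotation = line.rstrip("\n").rsplit(",", 1)
    start, end, label = annotation.split()
    return text, [(int(start), int(end), label)]

def load_train_data(lines):
    """Parse non-empty lines; accepts any iterable of strings, e.g. an open file."""
    return [parse_line(line) for line in lines if line.strip()]

sample = [
    "Financial instruments can be real or virtual documents, 0 21 FINANCE",
    "The number of units of the financial instrument, 27 47 FINANCE",
]
train_data = load_train_data(sample)
for text, entities in train_data:
    for start, end, label in entities:
        # Print each labelled span so wrong offsets are easy to spot.
        print(label, repr(text[start:end]))
```

Printing the sliced spans is a cheap sanity check: an off-by-one offset shows up immediately as a truncated or padded string.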

Now my question: the training is not performing well, and I suppose this is due to the small amount of training data. All entries in my test corpus get tagged as FINANCE, or all get tagged as BANKING. How big does my training data need to be to get better performance?

I guess I will have to annotate a bigger corpus for my training data. Can this be done in a different way?

What algorithm is behind the spacy Named Entity Recognizer?

Thanks

@honnibal honnibal added the usage General spaCy usage label Apr 13, 2017
@ines
Member

ines commented Apr 16, 2017

The new version 1.8.0 comes with bug fixes to the NER training procedure and a new save_to_directory() method. We've also updated the docs with more information on training and NER training in particular:

I hope this helps!

@ines ines closed this as completed Apr 16, 2017
@ramonrod
Author

Thanks Ines, yes, this helps a lot. One last thing: I am also interested in finding relations between the entities, for example using regex to find specific words between two entities.

Is there an approachable way to do this with spaCy once I have a trained model with new entities?
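
The regex idea can be sketched on plain character offsets. The code below is an illustration, not a spaCy feature: `words_between` is a hypothetical helper, and the entity spans are written by hand here, whereas with a trained model they would presumably be built from `doc.ents` via `ent.start_char`, `ent.end_char` and `ent.label_`.

```python
import re

def words_between(text, entities, pattern):
    """For each pair of adjacent entities, collect regex matches in the text between them."""
    results = []
    ordered = sorted(entities)  # sort spans by start offset
    for (s1, e1, _), (s2, e2, _) in zip(ordered, ordered[1:]):
        gap = text[e1:s2]  # text strictly between the two entities
        hits = re.findall(pattern, gap)
        if hits:
            results.append((text[s1:e1], hits, text[s2:e2]))
    return results

text = "Monetary contracts are financial instruments between parties"
entities = [(9, 18, "FINANCE"), (23, 44, "FINANCE")]
print(words_between(text, entities, r"\b(?:is|are)\b"))
```

This only looks at adjacent entity pairs and at surface text, so it suits simple patterns such as "X is/are Y".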

@DomHudson
Contributor

DomHudson commented Apr 23, 2017

Well, the document object returned by nlp(text) is iterable, so would simply looping through it and collecting the tokens of interest work for you?

If you want to analyse more complex relationships you could walk up the syntactic tree to deduce the distance between the words.
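
Walking up the tree can be sketched without a model by working on head indices. The code below assumes a list `heads` where `heads[i]` is the index of token i's syntactic head and the root points to itself, which is what `[t.head.i for t in doc]` should give for a single-sentence spaCy doc. The function names are hypothetical, and the example parse is written by hand for illustration; it is not real parser output.

```python
def path_to_root(i, heads):
    """Return token indices from token i up to the root, inclusive."""
    path = [i]
    while heads[i] != i:  # the root is its own head
        i = heads[i]
        path.append(i)
    return path

def tree_distance(a, b, heads):
    """Number of dependency arcs on the path between tokens a and b."""
    path_a = path_to_root(a, heads)
    path_b = path_to_root(b, heads)
    depth_in_b = {tok: d for d, tok in enumerate(path_b)}
    # The first ancestor of a that is also an ancestor of b is their lowest common ancestor.
    for depth_a, tok in enumerate(path_a):
        if tok in depth_in_b:
            return depth_a + depth_in_b[tok]
    raise ValueError("tokens are not in the same tree")

# Hand-written example parse (an assumption, not real parser output).
words = ["Monetary", "contracts", "are", "financial", "instruments"]
heads = [1, 2, 2, 4, 2]
print(tree_distance(1, 4, heads))
```

A small distance between two entity tokens suggests they are closely related syntactically, for instance subject and attribute of the same verb.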

@SATHVIKRAJU

Hi everyone, I have similar questions to ramonrod's about how to generate the training data. I successfully trained the model on one new entity type, with the training data manually tagged as shown in the docs, but this would be tedious for 1000 or more examples. Is there a better way of doing this, or are there other formats in which I can annotate the examples? I am new to NLP and spaCy and would appreciate the help.

@phanibulusu

Hi @SATHVIKRAJU @ramonrod.

Have you figured out any way to automate the annotation of training data?
I am stuck on the same issue; if you have a workaround, it would save me a lot of time.

I would like to train my model to recognize finance/technology-related words. I have some job names like 'capex', 'DefCalc', etc. I want my model to recognize these entities.

Appreciate the help!

Thanks,
Phani

@ramonrod
Author

ramonrod commented Feb 1, 2018

Hi all,
Apparently there is no completely automated way to do this, at least not to my knowledge.
I would recommend taking a look at the following packages (Python-compatible):

Hope this can help you.

@amukka

amukka commented Apr 3, 2018

Hi, could you please help with making the training data? It was commented that he managed to import the data from a text file and train on top of it. Will this work?

@lock

lock bot commented May 7, 2018

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked as resolved and limited conversation to collaborators May 7, 2018