Detecting New Entities In Model #13

Open
Evangel-coder opened this issue Oct 25, 2023 · 8 comments

Comments

@Evangel-coder

Hi,

I would like to ask how to train or fine-tune the model to detect new entities under the "ID" tag, e.g. Medicare numbers or phone numbers with a +65 prefix. I'd really appreciate any insights on that!

@prajwal967
Collaborator

Hi,

Sorry for the delayed response.
Do you have a dataset with these new entities that you want to train the model on?

If you do, then you would need to get the data in this form: notes.jsonl

Then you can follow the steps given in this notebook: Train.ipynb and replace the files accordingly.

Let us know if that doesn't work!
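
For reference, a minimal sketch of what writing one annotated note as a JSON Lines record could look like. The field names (`text`, `meta`, `spans`, `start`, `end`, `label`), the label names and the `make_record` helper are assumptions for illustration; check them against the notes.jsonl sample in the repo before relying on them. The example text and ID value are made up.

```python
import json


def make_record(text, entities, note_id):
    """Build one record with character-offset spans.

    `entities` is a list of (surface_string, label) pairs. Each surface
    string is searched for from the end of the previous match, so the
    pairs must be given in the order they appear in `text`.
    """
    spans = []
    cursor = 0
    for surface, label in entities:
        start = text.find(surface, cursor)
        if start == -1:
            raise ValueError(f"{surface!r} not found after offset {cursor}")
        end = start + len(surface)
        spans.append({"start": start, "end": end, "label": label})
        cursor = end
    return {"text": text, "meta": {"note_id": note_id}, "spans": spans}


# Made-up example; the ID value is a placeholder, not a real number format.
text = "Patient called from +65 0000 0000. Medicare number: XXXX-0000."
record = make_record(
    text,
    [("+65 0000 0000", "PHONE"), ("XXXX-0000", "ID")],
    note_id="note_1",
)

# One JSON object per line is what makes the file JSON Lines.
line = json.dumps(record)
```

Slicing `text[span["start"]:span["end"]]` for each span is a quick sanity check that the offsets point at the intended entity.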

@Evangel-coder
Author

Evangel-coder commented Nov 15, 2023

Hi,
That’s alright! I really appreciate the quick comment. I'm still annotating the dataset manually.
I have some questions to clarify:

  1. Do we have to use Prodigy, or can we just annotate the dataset manually?
  2. To train the model to detect Medicare numbers, can I use a dataset that contains only Medicare numbers? If so, what would be the optimal number of data points?
    Also, I saw that there was this model available on Hugging Face:
  3. Would it be possible to just use the AutoTokenizer and AutoModel classes to load and train the model? That would skip the tedious process of installing the libraries the model needs.

Really appreciate the insights given!

@prajwal967
Collaborator

Hi, sorry for the delay.

  1. We used Prodigy for annotation. While you can do it manually, it is more efficient to do it using Prodigy.
  2. Yes, if you want to train a model only to detect Medicare numbers, you can train it on a dataset with only Medicare numbers. However, this model won't be able to predict other attributes (e.g. name, date, etc.).
  3. Yes, if you are training a new model from scratch you can use the AutoClasses. If you have the dataset, you can follow the steps given here: https://github.com/huggingface/transformers/blob/main/examples/pytorch/token-classification/run_ner.py
    • Our code mostly follows their approach, but their code is more up to date and might be a better starting point if you're training something new.

Let us know if you have any other questions, thanks!
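
A minimal sketch of the AutoClasses route mentioned in point 3. The entity types, the BIO tagging scheme, and the `build_label_maps` and `load_for_token_classification` names are assumptions for illustration, not the repo's configuration. `transformers` is imported lazily so the label-map part runs without it installed.

```python
def build_label_maps(entity_types):
    """Expand entity types into BIO tags and build id/label lookups."""
    labels = ["O"]
    for entity in entity_types:
        labels.extend([f"B-{entity}", f"I-{entity}"])
    id2label = dict(enumerate(labels))
    label2id = {label: idx for idx, label in id2label.items()}
    return labels, id2label, label2id


def load_for_token_classification(checkpoint, entity_types):
    """Load a tokenizer and a token-classification model for `checkpoint`."""
    # Imported here so the helper above works without transformers.
    from transformers import AutoModelForTokenClassification, AutoTokenizer

    labels, id2label, label2id = build_label_maps(entity_types)
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForTokenClassification.from_pretrained(
        checkpoint,
        num_labels=len(labels),
        id2label=id2label,
        label2id=label2id,
        # Only matters when the checkpoint's classification head has a
        # different number of labels: mismatched weights are then
        # re-initialised instead of raising an error.
        ignore_mismatched_sizes=True,
    )
    return tokenizer, model


# Hypothetical label set with two entity types.
labels, id2label, label2id = build_label_maps(["ID", "DATE"])
```

Calling `load_for_token_classification` with a checkpoint name downloads the weights, so it is left uncalled here. Note that a re-initialised head starts untrained, which is one reason a changed label set needs data covering every label, not only the new one.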

@Evangel-coder
Author

Hi,

I appreciate the informative response and wanted to clarify: to fine-tune the model so that it is also better at detecting Medicare numbers, I would need the I2B2 data, with a variety of Medicare numbers included in it. That data also has to be in the data format described in the repo. Hope to hear from you soon!

@prajwal967
Collaborator

Yes, that sounds about right! Let us know if there are any issues, thanks!
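
The plan confirmed above, training on the original data together with notes that contain Medicare numbers, could be sketched as follows. `mix_notes`, `write_jsonl` and the record contents are hypothetical; the records stand in for notes already in the repo's JSONL format.

```python
import json
import random


def mix_notes(base_notes, new_notes, seed=0):
    """Combine the original notes with the newly annotated ones and shuffle.

    A fixed seed keeps the resulting order reproducible.
    """
    combined = list(base_notes) + list(new_notes)
    random.Random(seed).shuffle(combined)
    return combined


def write_jsonl(notes, path):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as handle:
        for note in notes:
            handle.write(json.dumps(note) + "\n")


# Stand-in records; real ones would carry full note text and spans.
base_notes = [{"text": f"base note {i}", "spans": []} for i in range(3)]
new_notes = [{"text": f"medicare note {i}", "spans": []} for i in range(2)]
mixed = mix_notes(base_notes, new_notes)
```

The mixed list can then be written with `write_jsonl(mixed, "train.jsonl")` (the file name is a placeholder) and used in place of the original training file.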

@Evangel-coder
Author

Alright, thanks for that clarification. I was under the impression that if I fine-tuned the model on just US Medicare numbers, it would add Medicare number detection to its existing capabilities and continue to detect other attributes like 'name' and 'date' at the same accuracy. Is that not the case?

@prajwal967
Collaborator

Yes, that could work, but there is a possibility that the model might forget what it has learnt previously (the accuracy of detecting other types of PHI might decrease).

Also, do you have a dataset with just Medicare numbers? If so, what does that dataset look like?
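
One way to check for the forgetting described above is to compare per-label recall on a held-out set before and after fine-tuning. This is a sketch with made-up spans; `per_label_recall` is a hypothetical helper and uses exact span matching, which may be stricter than the evaluation used in the repo.

```python
from collections import Counter


def per_label_recall(gold_spans, predicted_spans):
    """Exact-match recall per label.

    Both arguments are iterables of (note_id, start, end, label) tuples.
    """
    predicted = set(predicted_spans)
    total = Counter()
    found = Counter()
    for span in gold_spans:
        label = span[3]
        total[label] += 1
        if span in predicted:
            found[label] += 1
    return {label: found[label] / total[label] for label in total}


# Made-up spans purely to show the comparison.
gold = [("n1", 0, 4, "NAME"), ("n1", 10, 20, "DATE"), ("n2", 5, 14, "ID")]
before = [("n1", 0, 4, "NAME"), ("n1", 10, 20, "DATE")]
after = [("n1", 10, 20, "DATE"), ("n2", 5, 14, "ID")]

recall_before = per_label_recall(gold, before)
recall_after = per_label_recall(gold, after)

# Labels whose recall dropped after fine-tuning.
dropped = [
    label for label in recall_before
    if recall_after[label] < recall_before[label]
]
```

In this toy data the ID label is gained while NAME is lost, which is the pattern to watch for when fine-tuning on a dataset that contains only the new entity.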

@Evangel-coder
Author

Evangel-coder commented Jan 10, 2024

Hi, I've got a dataset; it's just in JSON format, as described in the instructions provided. I'll try it out first and will let you know the results. Is there an email address I can use to contact you (if possible)?
