Multiclass Textcat - Minimum and optimal number of samples? #13750
Unanswered
qacollective
asked this question in
Help: Model Advice
Replies: 0 comments
There is plenty of documentation for spaCy, both on data format conversion and on using spaCy as a tool, but there seems to be very little guidance on the ideal content and quantity of training data.
So my main questions here are:
1. For spaCy, what are the minimum and optimal numbers of samples needed for accurate, mutually exclusive multiclass text categorization?
2. What is the relationship between the number of samples and the number of classes?
I know this is a subject of study in the broader ML community, but I wanted guidance with spaCy's specific implementation in mind. I also know that many variables could influence the answer, such as the length of the input text in each sample and how distinctive the language used in each class is. So here is some background:
Background
My unrefined training dataset comes from production. It is imbalanced, and it was labeled through a manual hierarchical coding process. The classification is based entirely on a single descriptive text field, which varies from 10 to 100 words.
Because the classification is hierarchical, I have some flexibility to train multiple textcat models at different levels of the hierarchy, which should allow for simpler training and higher accuracy.
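To make the idea of one model per hierarchy level concrete, here is a minimal sketch of how the samples could be regrouped into one training set per parent node. It assumes each label is stored as a path string such as "plumbing/leak"; that format, the function name `split_by_level`, and the example texts are all hypothetical and not taken from my real data. It uses plain Python only and does not touch spaCy.

```python
from collections import defaultdict

def split_by_level(samples, sep="/"):
    """Group (text, label_path) pairs into one training set per parent node.

    Returns a dict that maps a parent path ("" for the root, "plumbing" for
    the children of plumbing, and so on) to a list of (text, child_label)
    pairs. Each list could then be used to train a separate classifier.
    """
    datasets = defaultdict(list)
    for text, path in samples:
        parts = path.split(sep)
        for depth, child in enumerate(parts):
            parent = sep.join(parts[:depth])
            datasets[parent].append((text, child))
    return dict(datasets)

# Hypothetical examples, for illustration only.
samples = [
    ("leaking tap in kitchen", "plumbing/leak"),
    ("blocked drain outside", "plumbing/blockage"),
    ("light switch sparks", "electrical/fault"),
]
datasets = split_by_level(samples)
for parent, rows in datasets.items():
    print(repr(parent), len(rows), sorted({label for _, label in rows}))
```

Each entry in `datasets` then has its own number of classes and its own sample count, which is exactly the ratio I am asking about.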
This is why I am keenly interested in understanding the best relationship between the number of samples and the number of classes in this case: it would let me carve up the problem, then oversample (with synthetic data) and undersample to reach an optimal state for each model.
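For reference, the kind of rebalancing I have in mind looks roughly like the sketch below. It is a simplification: it oversamples by randomly duplicating existing samples rather than generating synthetic text, and the per-class `target` is a placeholder, since the right value is precisely what I do not know. The function name `rebalance` and the toy data are hypothetical.

```python
import random
from collections import Counter, defaultdict

def rebalance(samples, target, seed=0):
    """Return a copy of samples with exactly `target` items per label.

    Classes with more than `target` items are randomly undersampled.
    Classes with fewer are oversampled by random duplication (a stand-in
    for synthetic data generation, which is not shown here).
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for text, label in samples:
        by_label[label].append((text, label))

    balanced = []
    for label, items in by_label.items():
        if len(items) >= target:
            balanced.extend(rng.sample(items, target))
        else:
            extra = rng.choices(items, k=target - len(items))
            balanced.extend(items + extra)
    rng.shuffle(balanced)
    return balanced

# Toy data for illustration: five samples of class "a", one of class "b".
toy = [(f"text {i}", "a") for i in range(5)] + [("text b", "b")]
balanced = rebalance(toy, target=3)
counts = Counter(label for _, label in balanced)
print(counts)
```

Duplicated samples add no new information, so any resampling like this should be applied to the training split only, never to the evaluation split.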
I would very much appreciate any links or other pointers, whether specific to spaCy or to more general papers that apply here.
Thanks!
QA