Multiclass Textcat - Minimum and optimal number of samples? #13750

qacollective · 2025-02-11T03:51:46Z

There is plenty of documentation out there for spaCy, and for data format conversion and using spaCy as a tool in general, but there seems to be very little guidance on the ideal content and quantity of training data.

So my main questions here are:
1. For spaCy, what are the minimum and optimal numbers of samples needed for accurate, mutually exclusive multiclass text categorization?
2. What is the relationship between the number of samples and the number of classes?

I know this is a subject of study in the broader ML community, but I would like guidance with spaCy's specific implementation in mind. I also know that many variables might influence the answer, such as the length of the input text in each sample and how distinctive the language used in each class is. So here is some background:

Background

My unrefined training dataset from production is imbalanced, and it was classified through a manual hierarchical codification process. The classification is based entirely on a single descriptive text field, which can vary from 10 to 100 words.

Because the classification is hierarchical, I have some flexibility to train multiple textcat models at different levels of the hierarchy, which should allow for simpler training and higher accuracy.
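
To make the "one model per level" idea concrete, here is a rough sketch of how the data could be carved up. It assumes each label is a code whose levels are joined by a separator (the `.` separator, the example codes, and the helper name `build_level_datasets` are all made up for illustration; my real codes look different). It uses plain Python only and says nothing about how spaCy itself should be configured.

```python
from collections import defaultdict


def build_level_datasets(samples, sep="."):
    """Group (text, code) pairs into one training set per parent node.

    The root model (key "") predicts the first code segment; each deeper
    model predicts the next segment given its parent prefix.
    """
    datasets = defaultdict(list)
    for text, code in samples:
        parts = code.split(sep)
        for depth, part in enumerate(parts):
            parent = sep.join(parts[:depth])  # "" for the root model
            datasets[parent].append((text, part))
    return dict(datasets)


# Hypothetical example data, not taken from my production set.
samples = [
    ("leaking pipe under sink", "PLUMB.LEAK"),
    ("blocked drain in shower", "PLUMB.BLOCK"),
    ("light switch sparks", "ELEC.SWITCH"),
]
datasets = build_level_datasets(samples)

for parent, rows in datasets.items():
    labels = sorted({label for _, label in rows})
    print(repr(parent), len(rows), labels)
```

Each entry in `datasets` would then become the training set for a separate exclusive textcat model, so the number of classes per model stays small even when the full code list is large.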

This is why I am keen to understand the best relationship between the number of samples and the number of classes in this case, so that I can carve up the problem and oversample (with synthetic data) and undersample to reach an optimal state for each model.
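
For reference, this is roughly the kind of rebalancing I have in mind, as a minimal sketch. The helper name `rebalance` and the `target` count per class are my own assumptions; choosing a sensible `target` is exactly what I am asking about. Oversampling here just duplicates existing texts as a stand-in for the synthetic data I would actually generate.

```python
import random
from collections import Counter, defaultdict


def rebalance(samples, target, seed=0):
    """Return a list of (text, label) pairs with exactly `target` rows per label."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for text, label in samples:
        by_label[label].append(text)

    balanced = []
    for label, texts in by_label.items():
        if len(texts) >= target:
            # Undersample: draw without replacement.
            chosen = rng.sample(texts, target)
        else:
            # Oversample: pad with duplicates (placeholder for synthetic samples).
            chosen = texts + rng.choices(texts, k=target - len(texts))
        balanced.extend((text, label) for text in chosen)

    rng.shuffle(balanced)
    return balanced


# Hypothetical imbalanced data: 50 rows of class "A", 3 rows of class "B".
raw = [(f"text a{i}", "A") for i in range(50)] + [(f"text b{i}", "B") for i in range(3)]
balanced = rebalance(raw, target=10)
print(Counter(label for _, label in balanced))
```

Any split into training and evaluation sets would need to happen before this step, so that duplicated or synthetic rows do not leak into the evaluation data.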

I would very much appreciate any links or other pointers specific to spaCy, or any more general papers that apply.

Thanks!
QA
