Create a binary classification filtering model for the top 25% of labels #69

Open
bruffridge opened this issue Jul 21, 2021 · 0 comments

The long-term plan is to train a model that classifies scientific papers by biomimicry function, using our taxonomy of 100 leaf categories. Let's start by training a model that can accurately assign the top 25% most-used labels in our ground truth dataset, with plans to train more comprehensive models in the future. During this early stage, most documents sent for classification will be "out of domain" for the initial label set; that is, their correct labels fall outside the initial top 25%. If we train a model on only the initial top 25% of labels and run it on all of our documents, it will be forced to assign one of the existing labels to every "out of domain" document, which lowers its accuracy.
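As a rough illustration of picking the initial label set, the sketch below counts label usage and keeps the most-used quarter of the distinct labels. It assumes the ground truth is available as one list of labels per paper and that "top 25%" means the top quarter of distinct labels ranked by usage count; the function name and the sample labels are hypothetical.

```python
from collections import Counter
import math


def top_label_fraction(doc_labels, fraction=0.25):
    """Return the most-used labels, keeping the top `fraction` of distinct labels.

    `doc_labels` is assumed to be an iterable with one list of labels per paper.
    """
    counts = Counter(label for labels in doc_labels for label in labels)
    # Always keep at least one label, and round up so small taxonomies are not emptied.
    keep = max(1, math.ceil(len(counts) * fraction))
    # most_common() orders labels by descending usage count.
    return [label for label, _ in counts.most_common(keep)]


# Hypothetical ground truth: four distinct labels, so the top 25% is one label.
ground_truth = [
    ["attach", "move"],
    ["attach"],
    ["protect"],
    ["attach", "sense"],
    ["move"],
]
top_labels = top_label_fraction(ground_truth)
```

With the real taxonomy of 100 leaf categories, the same call would keep 25 labels, provided all 100 appear in the ground truth.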

In scenarios where you expect your set of labels to expand over time, we recommend training two models using the initial, smaller label set:

- Classification model (Issue #70): A model that classifies documents into the current set of labels
- Filtering model (this issue): A model that predicts whether a document fits within the current set of labels or is "out of domain"

Submit each document to the filtering model first, and only send documents to the classification model that are "in domain."

In the example described above, the classification model identifies a document's biomimicry function, and the filtering model makes a binary prediction about whether the document belongs to any of the functions for which the classification model has labels.
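The filter-then-classify routing can be sketched as below. This is only an outline under stated assumptions: `is_in_domain` and `classify` are hypothetical callables standing in for the trained filtering and classification models, and the keyword-based stand-ins exist only so the example runs.

```python
from typing import Callable, Optional


def classify_with_filter(
    document: str,
    is_in_domain: Callable[[str], bool],
    classify: Callable[[str], str],
) -> Optional[str]:
    """Run the filtering model first; classify only "in domain" documents.

    Returns None for documents the filter marks as "out of domain".
    """
    if not is_in_domain(document):
        return None
    return classify(document)


# Toy stand-ins for the two trained models (assumptions, not real models).
def toy_filter(text: str) -> bool:
    return "adhesion" in text.lower()


def toy_classifier(text: str) -> str:
    return "attach"


in_domain_result = classify_with_filter(
    "Gecko adhesion on wet surfaces", toy_filter, toy_classifier
)
out_of_domain_result = classify_with_filter(
    "Quarterly market forecast", toy_filter, toy_classifier
)
```

Returning `None` rather than a label keeps "out of domain" documents out of the classifier's output, so they can be set aside until a more comprehensive model is trained.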

To train the filtering model, use the same set of documents you used for the classification model, but label each document as "in domain" instead of with a specific label from your set. Then add a roughly equal number of documents that the current label set does not fit, and label them as "out of domain."
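One way to assemble that training set is sketched below. It assumes the classification data is a list of `(text, label)` pairs and that a separate pool of papers outside the current label set is available; the function name, argument names, and sample data are hypothetical.

```python
import random


def build_filter_training_set(classifier_docs, other_docs, seed=0):
    """Build balanced ("in domain" / "out of domain") examples for the filtering model.

    classifier_docs: (text, label) pairs used to train the classification model.
    other_docs: texts of papers whose labels fall outside the current label set.
    """
    rng = random.Random(seed)
    # Reuse the classification documents, replacing each specific label.
    positives = [(text, "in domain") for text, _label in classifier_docs]
    # Sample an equal number of out-of-domain papers, or all of them if fewer exist.
    k = min(len(positives), len(other_docs))
    negatives = [(text, "out of domain") for text in rng.sample(other_docs, k)]
    examples = positives + negatives
    rng.shuffle(examples)
    return examples


# Hypothetical sample data.
classifier_docs = [("paper on gecko adhesion", "attach"), ("paper on fin motion", "move")]
other_docs = ["paper on camouflage", "paper on thermoregulation", "paper on biosensing"]
filter_examples = build_filter_training_set(classifier_docs, other_docs)
```

Fixing the seed keeps the sampled "out of domain" set reproducible between training runs.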
