
Check textcat values for validity #11763

Merged
merged 6 commits into from
Nov 17, 2022

Conversation

@polm (Contributor) commented Nov 7, 2022

Description

The textcat components expect all category values to be either 1 or 0 if present. When this expectation is not met, the resulting errors are surprising and don't make the underlying issue clear, as in #11757.

This change checks sample data during initialization and throws an error if invalid values are found. That's not perfect, since it may miss problems when only a few examples have bad values, but in the typical case it should prevent confusion by identifying the issue quickly.
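The kind of check being added can be sketched roughly like this (a minimal stand-alone illustration with hypothetical names, operating on plain dicts rather than spaCy `Example` objects; the real implementation lives in the component itself):

```python
def validate_cats_values(examples):
    """Raise a clear error if any 'cats' value is not exactly 0/False or 1/True.

    Hypothetical sketch of the validation described above, not spaCy's API.
    """
    for i, eg in enumerate(examples):
        for label, value in eg.get("cats", {}).items():
            # True == 1 and False == 0 in Python, so booleans pass this check.
            if value not in (0, 1):
                raise ValueError(
                    f"Example {i}: cats value for label {label!r} must be "
                    f"0.0/False or 1.0/True, got {value!r}"
                )

validate_cats_values([{"cats": {"POS": 1.0, "NEG": False}}])  # passes silently
```

With a check like this, a value such as 0.75 fails immediately with a message naming the offending label, instead of surfacing later as an opaque error deep in training.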

Types of change

data validation

Checklist

  • I confirm that I have the right to submit this contribution under the project's MIT license.
  • I ran the tests, and all new and existing tests passed.
  • My changes don't require a change to the documentation, or if they do, I've added all required information.

@polm added the "enhancement (Feature requests and improvements)" and "feat / textcat (Feature: Text Classifier)" labels Nov 7, 2022
Merge needed because errors added at the same time.
@adrianeboyd (Contributor)

This seems like a check that belongs in debug data rather than initialize. Is there another method that would handle this more cleanly during training, without being too slow? In get_loss maybe?

@polm (Contributor, Author) commented Nov 7, 2022

I wouldn't want to put this only in debug data, since then we'd still be stuck with unclear error messages at training time. It would make sense to also add it there, though. The idea is to check all the data, right?

Looking at get_loss: I'd overlooked this, but the current check in textcat lives in _validate_categories, which is called from both initialize and get_loss. I haven't timed it, but since we're already iterating over all the values I don't think it's that slow?

In textcat_multilabel I put the code directly in the initialize function, since _validate_categories isn't called there and this isn't strictly category validation, but I could move it there so both components work the same way. In that case not all values are currently checked, so it might make a speed difference; still, if we want to check every item, I'm not sure there's a better way. I can benchmark with one of the sample projects.
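The placement trade-off under discussion could look roughly like this (a hedged sketch with simplified signatures and a hypothetical class; the real spaCy components route this through _validate_categories with Example objects):

```python
class TextCategorizerSketch:
    """Simplified stand-in for a textcat component, illustrating where
    the validation hook sits. Not spaCy's actual API."""

    def _validate_categories(self, examples):
        for eg in examples:
            for value in eg["cats"].values():
                if value not in (0, 1):
                    raise ValueError(
                        "Unsupported cats value: expected 0.0/False or 1.0/True"
                    )

    def initialize(self, sample_examples):
        # Catches bad values in the sample data before training starts.
        self._validate_categories(sample_examples)

    def get_loss(self, examples, scores):
        # Calling the same check here covers every batch during training,
        # at the cost of re-scanning values the loss code touches anyway.
        self._validate_categories(examples)
```

Hooking the check into get_loss (rather than initialize alone) means bad values in any training batch are caught, not just those in the initialization sample.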

@adrianeboyd (Contributor) commented Nov 7, 2022

I guess if it's not in debug data it would make sense to add it there. I think this check belongs in a place that covers all training data, not just in initialize. We've also had cases where people used values like 0.75, where training and/or eval didn't work either.

@polm (Contributor, Author) commented Nov 14, 2022

Turns out this is already in debug data:

```python
if (
    gold_train_data["n_cats_bad_values"] > 0
    or gold_dev_data["n_cats_bad_values"] > 0
):
    msg.fail(
        "Unsupported values for cats: the supported values are "
        "1.0/True and 0.0/False."
    )
```

I'll go ahead and modify this to check all data during training, rather than just in initialize, and then I guess it'll be ready?

The _validate_categories is called in update, which for multilabel is inherited from the single-label component.
@polm (Contributor, Author) commented Nov 14, 2022

Tested this with the textcat_goemotions CNN config, which I would expect to be more affected by speed changes, but there was no discernible change in training speed.
