
Check textcat values for validity #11763

Merged
merged 6 commits into from
Nov 17, 2022

Conversation

@polm (Contributor) commented Nov 7, 2022

Description

The textcat components expect all category values to be either 1 or 0 if present. When this expectation is not met, the resulting errors are surprising and don't make the underlying issue clear, as in #11757.

This change checks sample data during initialization and throws an error if invalid values are found. That's not perfect, since it may miss problems when only a few examples have bad values, but in the typical case it should prevent confusion by identifying the issue quickly.
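The kind of check being added can be sketched roughly like this (a minimal stand-alone illustration with hypothetical names, operating on plain dicts rather than spaCy `Example` objects; the real implementation lives in the component itself):

```python
def validate_cats_values(examples):
    """Raise a clear error if any 'cats' value is not exactly 0/False or 1/True.

    Hypothetical sketch of the validation described above, not spaCy's API.
    """
    for i, eg in enumerate(examples):
        for label, value in eg.get("cats", {}).items():
            # True == 1 and False == 0 in Python, so booleans pass this check.
            if value not in (0, 1):
                raise ValueError(
                    f"Example {i}: cats value for label {label!r} must be "
                    f"0.0/False or 1.0/True, got {value!r}"
                )

validate_cats_values([{"cats": {"POS": 1.0, "NEG": False}}])  # passes silently
```

With a check like this, a value such as 0.75 fails immediately with a message naming the offending label, instead of surfacing later as an opaque error deep in training.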

Types of change

data validation

Checklist

  • I confirm that I have the right to submit this contribution under the project's MIT license.
  • I ran the tests, and all new and existing tests passed.
  • My changes don't require a change to the documentation, or if they do, I've added all required information.

@polm added the "enhancement (Feature requests and improvements)" and "feat / textcat (Feature: Text Classifier)" labels Nov 7, 2022
Merge needed because errors added at the same time.
@adrianeboyd (Contributor)

This seems like a check that belongs in debug data rather than initialize. Is there another method that would handle this more cleanly during training, without being too slow? In get_loss maybe?

@polm (Contributor, Author) commented Nov 7, 2022

I wouldn't want to put this only in debug data, since then we'd still be stuck with unclear error messages at training time. It would make sense to also add it there, though. The idea is to check all the data, right?

Looking at get_loss: I'd overlooked this, but the current check in textcat lives in _validate_categories, which is called from both initialize and get_loss. I haven't timed it, but since we're already iterating over all the values I don't think it's that slow?

In textcat_multilabel I put the code directly in the initialize function, since _validate_categories isn't called there and this isn't strictly category validation, but I could move it there so both components work the same way. In that case not all values are currently checked, so it might make a speed difference; still, if we want to check every item, I'm not sure there's a better way. I can benchmark with one of the sample projects.
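The placement trade-off under discussion could look roughly like this (a hedged sketch with simplified signatures and a hypothetical class; the real spaCy components route this through _validate_categories with Example objects):

```python
class TextCategorizerSketch:
    """Simplified stand-in for a textcat component, illustrating where
    the validation hook sits. Not spaCy's actual API."""

    def _validate_categories(self, examples):
        for eg in examples:
            for value in eg["cats"].values():
                if value not in (0, 1):
                    raise ValueError(
                        "Unsupported cats value: expected 0.0/False or 1.0/True"
                    )

    def initialize(self, sample_examples):
        # Catches bad values in the sample data before training starts.
        self._validate_categories(sample_examples)

    def get_loss(self, examples, scores):
        # Calling the same check here covers every batch during training,
        # at the cost of re-scanning values the loss code touches anyway.
        self._validate_categories(examples)
```

Hooking the check into get_loss (rather than initialize alone) means bad values in any training batch are caught, not just those in the initialization sample.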

@adrianeboyd (Contributor) commented Nov 7, 2022

I guess if it's not in debug data it would make sense to add it there. I think this check belongs in a place that covers all training data, not just in initialize. We've also had cases where people used values like 0.75, where training and/or eval didn't work either.

@polm (Contributor, Author) commented Nov 14, 2022

Turns out this is already in debug data:

```python
if (
    gold_train_data["n_cats_bad_values"] > 0
    or gold_dev_data["n_cats_bad_values"] > 0
):
    msg.fail(
        "Unsupported values for cats: the supported values are "
        "1.0/True and 0.0/False."
    )
```

I'll go ahead and modify this to check all data during training, rather than just in initialize, and then I guess it'll be ready?

The _validate_categories is called in update, which for multilabel is inherited from the single-label component.
@polm (Contributor, Author) commented Nov 14, 2022

Tested this with the textcat_goemotions CNN config, which I would expect to be more affected by speed changes, but there was no discernible change in training speed.
