Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

find-threshold: CLI command for multi-label classifier threshold tuning #11280

Merged

Conversation

rmitsch
Copy link
Contributor

@rmitsch rmitsch commented Aug 8, 2022

Goal

Add a find-threshold CLI command investigating different threshold values for classification models and returning the ones maximizing the specified score.

Description

  • New CLI command find-threshold; API call is find_threshold().
  • New tests in spacy.tests.test_cli.
  • Docs will be added once the code has been reviewed.

Supported options are:

  • pipe_name: Which pipe to evaluate (with pipelines with multiple MultiLabel_TextCategorizer components the name has to be specified, otherwise it's optional).
  • average: Whether to use micro or macro to compute F-score over all labels.
  • n_trials: Number of sample points in threshold space between 0 and 1.
  • beta: Beta coefficient for F-score calculation.

Types of change

New feature.

Checklist

  • I confirm that I have the right to submit this contribution under the project's MIT license.
  • I ran the tests, and all new and existing tests passed.
  • My changes don't require a change to the documentation, or if they do, I've added all required information.

@rmitsch rmitsch self-assigned this Aug 8, 2022
@rmitsch rmitsch added enhancement Feature requests and improvements feat / textcat Feature: Text Classifier feat / cli Feature: Command-line interface and removed feat / textcat Feature: Text Classifier labels Aug 8, 2022
@rmitsch rmitsch changed the title Feature/classifier threshold tuning find-threshold: CLI command for multi-label classifier threshold tuning Aug 8, 2022
@adrianeboyd
Copy link
Contributor

I realize this is a draft, but some general concerns:

  • there are multiple components with thresholds, can this be implemented more generally?
  • I'd rather see the scoring code extended (e.g., for beta in PRFScore and Scorer.score_cats) than for this to be completely reimplemented here
  • the args should look similar to spacy evaluate (note that it's a known problem in spacy evaluate that the DocBin is loaded generically rather than following [corpora.dev] from the config, which can cause issues sometimes)

The textcat component is a bit weird because the threshold only affects the scoring, so the annotation doesn't change with the threshold even though for most other components modifying the thresholds would affect the annotation.

@rmitsch
Copy link
Contributor Author

rmitsch commented Aug 9, 2022

  • there are multiple components with thresholds, can this be implemented more generally?

Should be possible. Would it be acceptable if we ditch the automatic component recognition then and always require naming the component to be evaluated?

  • I'd rather see the scoring code extended (e.g., for beta in PRFScore and Scorer.score_cats) than for this to be completely reimplemented here

Wasn't aware of this. Scorer.score_cats should be a good fit.

  • the args should look similar to spacy evaluate (note that it's a known problem in spacy evaluate that the DocBin is loaded generically rather than following [corpora.dev] from the config, which can cause issues sometimes)

I'll look into harmonizing the arguments.

The textcat component is a bit weird because the threshold only affects the scoring, so the annotation doesn't change with the threshold even though for most other components modifying the thresholds would affect the annotation.

Can you elaborate on how modifying thresholds would affect annotations for other components?

@adrianeboyd
Copy link
Contributor

In general the situation is that you have a component that has:

  • some config setting that's a threshold that's used in set_annotations or the scorer
  • some score that's returned by its scorer that you want to maximize

Examples:

  • textcat_multilabel: textcat_multilabel.threshold and the score cats_macro_auc
  • spancat: spancat.threshold and the score spans_sc_f
  • span_finder:span_finder.threshold and the score span_finder_sc_f

I would initially think that beta could be added as a scorer setting in updated versions of the default scorers so it can be customized in the configs. For the exploration here it might be useful if beta was easier to override than by modifying the configs, but I think I'd start from the point that there's an existing score in the scorer results that can be used directly and just look at the threshold here?

@rmitsch
Copy link
Contributor Author

rmitsch commented Aug 26, 2022

Are there smart(-ish) ways to...

  1. ...dynamically identify whether a component is suited for the find-threshold command?
  2. ...map the correct scorer method to the component? Like score_spans() for span_finder (?) or score_cats() for spancat and textcat_multilabel?

We can hardcode this ofc, but I was wondering whether there's a better way to do this. 1. could be done by checking for the existence of a threshold attribute, but this is not quite consistent - e.g. SpanFinder has one, but TextCategorizer stores its threshold in self.cfg (I could check both, but this makes me question whether there are other variations).

@adrianeboyd
Copy link
Contributor

No, I think we need to rely on the user to provide a path to the threshold in the config and the scores key to optimize.

In v4, I'm planning to move all these settings out of self.cfg so they're only stored in the config, but this kind of threshold could have an arbitrary name in the config.

@rmitsch
Copy link
Contributor Author

rmitsch commented Aug 31, 2022

No, I think we need to rely on the user to provide a path to the threshold in the config and the scores key to optimize.

It's not just the scores key, I think. The scoring method in Scorer also has be specified or chosen if we want to this to be generic. E.g. span_finder wouldn't work with Scorer.score_cats(). Or am I misunderstanding something here?

Clarification: I interpret "scores key" to be the attr attribute in the method to be called for scoring, e.g. Scorer.score_cats().

@adrianeboyd
Copy link
Contributor

The component already has a registered scorer, so what I mean by "scores key" is the entry in the output of Language.evaluate that you want to optimize, like cats_macro_f. You don't care how/where this was calculated by the scorer, just that it ends up in scores.

@rmitsch
Copy link
Contributor Author

rmitsch commented Sep 1, 2022

The latest commit should be closer to a generic solution. Two remarks:

  • beta hasn't been introduced yet. I'd add it as optional argument to PRFScore to pass it forward from Scorer, if it's available in the latter's config. Are there any potential pitfalls to consider when doing this?
  • The part of the test running a spancat component fails so far because nlp.evaluate() returns None for the relevant scores (upon which find_threshold() exits). It's probably related to how I set up the component, but I haven't spotted the issue yet - if there's anything obvious, I'd be thankful for a pointer.

spacy/tests/test_cli.py Outdated Show resolved Hide resolved
@rmitsch
Copy link
Contributor Author

rmitsch commented Sep 1, 2022

Added a draft for integrating beta. Feedback much appreciated.

@adrianeboyd
Copy link
Contributor

I don't think beta makes sense as a Scorer-level setting but rather as an individual component scorer setting that would be set in the config, e.g.:

@registry.scorers("spacy.textcat_scorer.v1")
def make_textcat_scorer(beta: float = 1.0):
    return partial(textcat_score, beta=beta)

The existing textcat scorer is kind of a bad example because threshold should also already be a scorer setting but isn't.

For example, you could have two spancat components with different beta settings.

@rmitsch
Copy link
Contributor Author

rmitsch commented Sep 2, 2022

Let me know if changes for textcat_multilabel/score_cats() and spancat/score_spans() match what you meant. If so, I'll update the other components and scoring functions.

@rmitsch rmitsch marked this pull request as draft October 28, 2022 11:14
@adrianeboyd
Copy link
Contributor

Can you have a look at the conflicts?

@rmitsch rmitsch closed this Nov 11, 2022
@rmitsch rmitsch force-pushed the feature/classifier-threshold-tuning branch from 34c6c3b to 188a7d0 Compare November 11, 2022 10:34
@rmitsch
Copy link
Contributor Author

rmitsch commented Nov 11, 2022

This is quite weird. Apparently my master had diverged from explosion:master (I guess due to the new approach to the release prep?). Synchronizing it eradicated all commits in this PR by an automatic force-pushed initiated by GitHub. I'm working on re-adding the changes to this branch.

@rmitsch rmitsch reopened this Nov 11, 2022
@rmitsch
Copy link
Contributor Author

rmitsch commented Nov 11, 2022

Should be fine now. Are we ok with this? Then I'd update the docs.

@rmitsch rmitsch marked this pull request as ready for review November 14, 2022 08:51
@rmitsch rmitsch marked this pull request as draft November 14, 2022 08:51
Copy link
Member

@svlandeg svlandeg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will be a useful CL tool to have, nice work!

I mainly had some comments around UX and documentation. It would be a good idea to document some standard settings for this command
(like spacy find-threshold my_nlp data.spacy textcat_multilabel threshold cats_macro_f) that users can just copy-paste if they're working with standard pipelines/configs.

spacy/cli/find_threshold.py Outdated Show resolved Hide resolved
spacy/cli/find_threshold.py Outdated Show resolved Hide resolved
spacy/cli/find_threshold.py Outdated Show resolved Hide resolved
spacy/cli/find_threshold.py Outdated Show resolved Hide resolved
spacy/cli/find_threshold.py Outdated Show resolved Hide resolved
spacy/cli/find_threshold.py Outdated Show resolved Hide resolved
spacy/cli/find_threshold.py Outdated Show resolved Hide resolved
@rmitsch
Copy link
Contributor Author

rmitsch commented Nov 17, 2022

It would be a good idea to provide some standard settings for this command

I'll include some in the docs. Do you have any suggestions other than this one you'd like to have included?

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
@svlandeg
Copy link
Member

It would be a good idea to provide some standard settings for this command

I'll include some in the docs. Do you have any suggestions other than this one you'd like to have included?

It'd be nice to include one for each of the main pipeline components we see as relevant - currently mainly multilabel textcat & spancat, no?

@rmitsch rmitsch marked this pull request as ready for review November 17, 2022 11:39
rmitsch and others added 2 commits November 17, 2022 12:53
rmitsch and others added 2 commits November 17, 2022 16:33
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Copy link
Member

@svlandeg svlandeg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me! I'll leave it open for one more day in case anyone else wanted to do a final review.

@adrianeboyd adrianeboyd merged commit c0fd8a2 into explosion:master Nov 25, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Feature requests and improvements feat / cli Feature: Command-line interface
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants