find-threshold: CLI command for multi-label classifier threshold tuning #11280

rmitsch · 2022-08-08T14:42:46Z

Goal

Add a find-threshold CLI command investigating different threshold values for classification models and returning the ones maximizing the specified score.

Description

New CLI command find-threshold; API call is find_threshold().
New tests in spacy.tests.test_cli.
Docs will be added once the code has been reviewed.

Supported options are:

pipe_name: Which pipe to evaluate (required if the pipeline contains multiple MultiLabel_TextCategorizer components, otherwise optional).
average: Whether to use micro or macro to compute F-score over all labels.
n_trials: Number of sample points in threshold space between 0 and 1.
beta: Beta coefficient for F-score calculation.
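
As a rough illustration of how these options interact, the sketch below samples n_trials thresholds between 0 and 1 and keeps the one with the highest micro- or macro-averaged F-beta score. It is not the code from this PR: the helper names (fbeta, averaged_fbeta, best_threshold) and the toy predictions are assumptions made for this example, and it uses plain Python rather than spaCy's scorer.

```python
from typing import Dict, List, Tuple


def fbeta(tp: int, fp: int, fn: int, beta: float = 1.0) -> float:
    """F-beta from raw counts; returns 0.0 when the score is undefined."""
    denom = (1 + beta**2) * tp + beta**2 * fn + fp
    return (1 + beta**2) * tp / denom if denom else 0.0


def averaged_fbeta(
    preds: List[Dict[str, float]],
    golds: List[Dict[str, bool]],
    threshold: float,
    average: str = "micro",
    beta: float = 1.0,
) -> float:
    """Micro- or macro-averaged F-beta over all labels at one threshold."""
    labels = sorted(golds[0])
    counts = {label: [0, 0, 0] for label in labels}  # tp, fp, fn per label
    for pred, gold in zip(preds, golds):
        for label in labels:
            predicted = pred[label] >= threshold
            if predicted and gold[label]:
                counts[label][0] += 1
            elif predicted:
                counts[label][1] += 1
            elif gold[label]:
                counts[label][2] += 1
    if average == "micro":
        # Pool the counts over labels, then compute a single score.
        tp, fp, fn = (sum(c[i] for c in counts.values()) for i in range(3))
        return fbeta(tp, fp, fn, beta)
    # Macro: score each label separately, then take the unweighted mean.
    return sum(fbeta(*c, beta) for c in counts.values()) / len(labels)


def best_threshold(
    preds: List[Dict[str, float]],
    golds: List[Dict[str, bool]],
    n_trials: int = 11,
    average: str = "micro",
    beta: float = 1.0,
) -> Tuple[float, float]:
    """Return (threshold, score) for the best of n_trials evenly spaced thresholds."""
    thresholds = [i / (n_trials - 1) for i in range(n_trials)]
    scored = [(averaged_fbeta(preds, golds, t, average, beta), t) for t in thresholds]
    score, threshold = max(scored, key=lambda pair: pair[0])
    return threshold, score


# Toy data (invented for this example): predicted probabilities and gold labels.
preds = [{"A": 0.9, "B": 0.2}, {"A": 0.6, "B": 0.7}, {"A": 0.3, "B": 0.4}]
golds = [{"A": True, "B": False}, {"A": False, "B": True}, {"A": False, "B": False}]
print(best_threshold(preds, golds, n_trials=11, average="micro", beta=1.0))
```

Ties are resolved in favor of the lowest threshold here, which is an arbitrary choice for the sketch.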

Types of change

New feature.

Checklist

I confirm that I have the right to submit this contribution under the project's MIT license.
I ran the tests, and all new and existing tests passed.
My changes don't require a change to the documentation, or if they do, I've added all required information.

adrianeboyd · 2022-08-09T07:47:50Z

I realize this is a draft, but some general concerns:

there are multiple components with thresholds, can this be implemented more generally?
I'd rather see the scoring code extended (e.g., for beta in PRFScore and Scorer.score_cats) than for this to be completely reimplemented here
the args should look similar to spacy evaluate (note that it's a known problem in spacy evaluate that the DocBin is loaded generically rather than following [corpora.dev] from the config, which can cause issues sometimes)

The textcat component is a bit weird because the threshold only affects the scoring, so the annotation doesn't change with the threshold even though for most other components modifying the thresholds would affect the annotation.

rmitsch · 2022-08-09T10:43:40Z

there are multiple components with thresholds, can this be implemented more generally?

Should be possible. Would it be acceptable if we ditch the automatic component recognition then and always require naming the component to be evaluated?

I'd rather see the scoring code extended (e.g., for beta in PRFScore and Scorer.score_cats) than for this to be completely reimplemented here

Wasn't aware of this. Scorer.score_cats should be a good fit.

the args should look similar to spacy evaluate (note that it's a known problem in spacy evaluate that the DocBin is loaded generically rather than following [corpora.dev] from the config, which can cause issues sometimes)

I'll look into harmonizing the arguments.

The textcat component is a bit weird because the threshold only affects the scoring, so the annotation doesn't change with the threshold even though for most other components modifying the thresholds would affect the annotation.

Can you elaborate on how modifying thresholds would affect annotations for other components?

adrianeboyd · 2022-08-09T11:04:12Z

In general the situation is that you have a component that has:

some config setting that's a threshold that's used in set_annotations or the scorer
some score that's returned by its scorer that you want to maximize

Examples:

textcat_multilabel: textcat_multilabel.threshold and the score cats_macro_auc
spancat: spancat.threshold and the score spans_sc_f
span_finder: span_finder.threshold and the score span_finder_sc_f

I would initially think that beta could be added as a scorer setting in updated versions of the default scorers so it can be customized in the configs. For the exploration here it might be useful if beta was easier to override than by modifying the configs, but I think I'd start from the point that there's an existing score in the scorer results that can be used directly and just look at the threshold here?

rmitsch · 2022-08-26T11:03:41Z

Are there smart(-ish) ways to...

...dynamically identify whether a component is suited for the find-threshold command?
...map the correct scorer method to the component? Like score_spans() for span_finder (?) or score_cats() for spancat and textcat_multilabel?

We can hardcode this ofc, but I was wondering whether there's a better way to do this. The first could be done by checking for the existence of a threshold attribute, but this is not quite consistent: e.g. SpanFinder has one, but TextCategorizer stores its threshold in self.cfg (I could check both, but this makes me question whether there are other variations).

adrianeboyd · 2022-08-29T08:35:52Z

No, I think we need to rely on the user to provide a path to the threshold in the config and the scores key to optimize.

In v4, I'm planning to move all these settings out of self.cfg so they're only stored in the config, but this kind of threshold could have an arbitrary name in the config.
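
One way to picture "a path to the threshold in the config" is a list of keys that is walked through a nested dictionary. The sketch below is only an illustration under that assumption: set_nested is a hypothetical helper, and the dictionary layout is invented for the example rather than taken from spaCy's config handling.

```python
from copy import deepcopy
from typing import Any, Dict, Sequence


def set_nested(config: Dict[str, Any], path: Sequence[str], value: Any) -> Dict[str, Any]:
    """Return a copy of `config` with the entry at `path` replaced by `value`."""
    if not path:
        raise ValueError("path must contain at least one key")
    updated = deepcopy(config)  # leave the caller's config untouched
    node = updated
    for key in path[:-1]:
        node = node[key]  # raises KeyError if an intermediate key is missing
    if path[-1] not in node:
        # Refuse to create new settings: the threshold must already exist.
        raise KeyError(f"{'.'.join(path)} not found in config")
    node[path[-1]] = value
    return updated


# Illustrative config; the threshold could have an arbitrary name and location.
config = {"components": {"spancat": {"threshold": 0.5, "max_positive": None}}}
new_config = set_nested(config, "components.spancat.threshold".split("."), 0.3)
print(new_config["components"]["spancat"]["threshold"])
```

Because the user supplies the path, nothing in this approach depends on the setting being called threshold.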

rmitsch · 2022-08-31T09:27:11Z

No, I think we need to rely on the user to provide a path to the threshold in the config and the scores key to optimize.

It's not just the scores key, I think. The scoring method in Scorer also has to be specified or chosen if we want this to be generic. E.g. span_finder wouldn't work with Scorer.score_cats(). Or am I misunderstanding something here?

Clarification: I interpret "scores key" to be the attr attribute in the method to be called for scoring, e.g. Scorer.score_cats().

adrianeboyd · 2022-08-31T09:41:59Z

The component already has a registered scorer, so what I mean by "scores key" is the entry in the output of Language.evaluate that you want to optimize, like cats_macro_f. You don't care how/where this was calculated by the scorer, just that it ends up in scores.
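
The idea can be sketched as a search that only looks at one key of the returned scores and never at how the scorer produced it. In the sketch below, search_threshold and fake_evaluate are hypothetical stand-ins: fake_evaluate plays the role of re-running evaluation with a given threshold, and its score values are made up for illustration.

```python
from typing import Callable, Dict, Optional, Tuple

ScoreDict = Dict[str, Optional[float]]


def search_threshold(
    evaluate: Callable[[float], ScoreDict],
    scores_key: str,
    n_trials: int = 11,
) -> Tuple[float, float]:
    """Try n_trials thresholds in [0, 1] and return (best threshold, best score)."""
    results: Dict[float, float] = {}
    for i in range(n_trials):
        threshold = i / (n_trials - 1)
        scores = evaluate(threshold)
        value = scores.get(scores_key)
        if value is None:
            # The key is missing or the scorer produced no value for it.
            raise ValueError(f"no score for key '{scores_key}'")
        results[threshold] = value
    best = max(results, key=lambda t: results[t])
    return best, results[best]


def fake_evaluate(threshold: float) -> ScoreDict:
    """Stand-in for an evaluation run; peaks at threshold 0.5 by construction."""
    return {"cats_macro_f": 1.0 - abs(threshold - 0.5), "cats_macro_auc": 0.9}


print(search_threshold(fake_evaluate, "cats_macro_f", n_trials=11))
```

The search stays generic because the scores key is the only component-specific input besides the evaluation callable.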

rmitsch · 2022-09-01T12:13:34Z

The latest commit should be closer to a generic solution. Two remarks:

beta hasn't been introduced yet. I'd add it as optional argument to PRFScore to pass it forward from Scorer, if it's available in the latter's config. Are there any potential pitfalls to consider when doing this?
The part of the test running a spancat component fails so far because nlp.evaluate() returns None for the relevant scores (upon which find_threshold() exits). It's probably related to how I set up the component, but I haven't spotted the issue yet - if there's anything obvious, I'd be thankful for a pointer.

rmitsch · 2022-09-01T14:53:56Z

Added a draft for integrating beta. Feedback much appreciated.

adrianeboyd · 2022-09-02T07:06:07Z

I don't think beta makes sense as a Scorer-level setting but rather as an individual component scorer setting that would be set in the config, e.g.:

@registry.scorers("spacy.textcat_scorer.v1")
def make_textcat_scorer(beta: float = 1.0):
    return partial(textcat_score, beta=beta)

The existing textcat scorer is kind of a bad example because threshold should also already be a scorer setting but isn't.

For example, you could have two spancat components with different beta settings.
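
To make the two-components case concrete, here is a small sketch of the same factory pattern without spaCy's registry. The names f_score and make_scorer are hypothetical, and the precision and recall values are invented; the point is only that each factory call binds its own beta.

```python
from functools import partial
from typing import Callable


def f_score(precision: float, recall: float, beta: float = 1.0) -> float:
    """F-beta from precision and recall; returns 0.0 when undefined."""
    denom = beta**2 * precision + recall
    return (1 + beta**2) * precision * recall / denom if denom else 0.0


def make_scorer(beta: float = 1.0) -> Callable[[float, float], float]:
    """Factory that returns a scorer with beta already bound."""
    return partial(f_score, beta=beta)


# Two scorers that could belong to two different components.
recall_heavy = make_scorer(beta=2.0)     # weights recall more strongly
precision_heavy = make_scorer(beta=0.5)  # weights precision more strongly

# The same precision/recall pair is rated differently by each scorer.
print(recall_heavy(0.5, 1.0), precision_heavy(0.5, 1.0))
```

With low precision and perfect recall, the recall-weighted scorer reports the higher value, so the threshold that maximizes the score can differ between the two components.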

rmitsch · 2022-09-02T10:37:50Z

Let me know if changes for textcat_multilabel/score_cats() and spancat/score_spans() match what you meant. If so, I'll update the other components and scoring functions.

adrianeboyd · 2022-11-11T09:52:04Z

Can you have a look at the conflicts?

rmitsch · 2022-11-11T10:50:08Z

This is quite weird. Apparently my master had diverged from explosion:master (I guess due to the new approach to the release prep?). Synchronizing it eradicated all commits in this PR by an automatic force-push initiated by GitHub. I'm working on re-adding the changes to this branch.

rmitsch · 2022-11-11T11:12:40Z

Should be fine now. Are we ok with this? Then I'd update the docs.

svlandeg

This will be a useful CLI tool to have, nice work!

I mainly had some comments around UX and documentation. It would be a good idea to document some standard settings for this command (like spacy find-threshold my_nlp data.spacy textcat_multilabel threshold cats_macro_f) that users can just copy-paste if they're working with standard pipelines/configs.

rmitsch · 2022-11-17T09:47:58Z

It would be a good idea to provide some standard settings for this command

I'll include some in the docs. Do you have any suggestions other than this one you'd like to have included?

svlandeg · 2022-11-17T09:49:16Z

It would be a good idea to provide some standard settings for this command

I'll include some in the docs. Do you have any suggestions other than this one you'd like to have included?

It'd be nice to include one for each of the main pipeline components we see as relevant - currently mainly multilabel textcat & spancat, no?

svlandeg

Looks good to me! I'll leave it open for one more day in case anyone else wanted to do a final review.

rmitsch added 4 commits August 5, 2022 16:42

Add foundation for find-threshold CLI functionality.

0e5cd6b

Finish first draft for find-threshold.

4981700

Add tests.

1d0f5d3

Revert adjusted import statements.

a7b56e8

rmitsch self-assigned this Aug 8, 2022

rmitsch added the enhancement, feat / textcat and feat / cli labels and removed the feat / textcat label Aug 8, 2022

rmitsch changed the title ~~Feature/classifier threshold tuning~~ find-threshold: CLI command for multi-label classifier threshold tuning Aug 8, 2022

rmitsch added 2 commits August 9, 2022 10:03

Fix mypy errors.

d689d97

Fix imports.

6c3ae8d

Harmonize arguments with spacy evaluate command.

63c8028

Generalize component and threshold handling. Harmonize arguments with 'spacy evaluate' CLI.

3a0a385

adrianeboyd reviewed Sep 1, 2022

spacy/tests/test_cli.py (outdated, resolved)

rmitsch added 2 commits September 1, 2022 16:01

Fix Spancat test.

51863cd

Add beta parameter to Scorer and PRFScore.

ea9737a

Make beta a component scorer setting.

110850f

adrianeboyd reviewed Sep 2, 2022

spacy/cli/find_threshold.py (outdated, resolved)

rmitsch marked this pull request as draft October 28, 2022 11:14

adrianeboyd reviewed Nov 11, 2022

spacy/cli/find_threshold.py (outdated, resolved)

Change check of if there's only one unique value in scores.

34c6c3b

rmitsch closed this Nov 11, 2022

rmitsch force-pushed the feature/classifier-threshold-tuning branch from 34c6c3b to 188a7d0 Compare November 11, 2022 10:34

Attempt merging after reconciling diverging master branches.

ba857c6

rmitsch reopened this Nov 11, 2022

kadarakos reviewed Nov 11, 2022

spacy/cli/find_threshold.py (resolved)

rmitsch marked this pull request as ready for review November 14, 2022 08:51

rmitsch marked this pull request as draft November 14, 2022 08:51

svlandeg reviewed Nov 17, 2022

Update spacy/cli/find_threshold.py

5500a58

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

rmitsch added 4 commits November 17, 2022 11:16

Incorporate feedback.

d080808

Fix test issue. Update docstring.

7b4da3f

Update docs & docstring.

809588d

Merge branch 'master' into feature/classifier-threshold-tuning

42a8208

# Conflicts: # website/docs/api/cli.md

rmitsch marked this pull request as ready for review November 17, 2022 11:39

rmitsch and others added 2 commits November 17, 2022 12:53

Update spacy/tests/test_cli.py

3f9d879

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

Add examples to docs. Rename _nlp to nlp in tests.

dd84d65

svlandeg reviewed Nov 17, 2022

spacy/cli/find_threshold.py (outdated, resolved)

svlandeg reviewed Nov 17, 2022

spacy/cli/find_threshold.py (outdated, resolved)

rmitsch and others added 2 commits November 17, 2022 16:33

Update spacy/cli/find_threshold.py

0ee2257

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

Update spacy/cli/find_threshold.py

bbfef28

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

svlandeg approved these changes Nov 22, 2022

adrianeboyd merged commit c0fd8a2 into explosion:master Nov 25, 2022
