
Change GPU efficient textcat to use CNN, not BOW in generated configs #11900

Merged: 7 commits into explosion:master on Mar 7, 2023

Conversation

@polm (Contributor) commented Nov 30, 2022

Description

If you generate a config with a textcat component using GPU (transformers), the default option (efficiency) uses a BOW architecture, which does not use tok2vec features. While that can make sense as part of a larger pipeline, in the case of just a transformer and a textcat it means the transformer is doing a lot of work for no purpose.

This changes it so that the CNN architecture is used instead. It could also be changed to be the same as the accuracy config, which uses the ensemble architecture.
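
For illustration, the relevant generated section after this change looks roughly like the sketch below. The registered names (spacy.TextCatCNN.v2, spacy-transformers.TransformerListener.v1, reduce_mean.v1) are the standard ones, but treat the exact values as an approximation of the quickstart output rather than a verbatim copy:

[components.textcat]
factory = "textcat"

[components.textcat.model]
# CNN textcat that consumes tok2vec features, so the transformer is actually used
@architectures = "spacy.TextCatCNN.v2"
exclusive_classes = true
nO = null

[components.textcat.model.tok2vec]
# Listener that shares the transformer component's output with the textcat
@architectures = "spacy-transformers.TransformerListener.v1"
grad_factor = 1.0

[components.textcat.model.tok2vec.pooling]
@layers = "reduce_mean.v1"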

I haven't tested this in the browser yet because I'm having difficulty with the site's build env, but it works fine in the CLI, and I imagine it's fine since there are no actual logic changes.

Checklist

  • I confirm that I have the right to submit this contribution under the project's MIT license.
  • I ran the tests, and all new and existing tests passed.
  • My changes don't require a change to the documentation, or if they do, I've added all required information.

@polm added the labels docs (Documentation and website), feat / textcat (Feature: Text Classifier), feat / cli (Feature: Command-line interface), and feat / transformer (Feature: Transformer) on Nov 30, 2022
@adrianeboyd (Contributor) commented:

This doesn't add the transformer to the pipeline for the GPU case (there are separate conditions in the template that remove the tok2vec parts for the BOW case):

[nlp]
lang = "en"
pipeline = ["textcat"]
batch_size = 128

[components]

[components.transformer]
factory = "transformer"

[components.transformer.model]
@architectures = "spacy-transformers.TransformerModel.v3"
name = "roberta-base"
tokenizer_config = {"use_fast": true}
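
(For context, with the fix the generated [nlp] block should also list the transformer so the listener has something to connect to; a sketch, not verbatim quickstart output:)

[nlp]
lang = "en"
pipeline = ["transformer", "textcat"]
batch_size = 128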

@polm (Contributor, Author) commented Dec 5, 2022

Thanks for pointing that out, I fixed it.

Going to mark this as a draft while the issue with #11925 is investigated, since it looks like it affects the CNN textcat with transformers.

@polm marked this pull request as draft December 5, 2022 06:09
@polm marked this pull request as ready for review February 27, 2023 04:05
@polm (Contributor, Author) commented Feb 27, 2023

Brought this out of draft since #11968 is resolved, though this probably shouldn't be merged until it's released.

@adrianeboyd (Contributor) commented:

thinc v8.1.8 hasn't been released yet, so this should fail for now.

@adrianeboyd (Contributor) commented:

This test looks good: explosion/spacy-transformers#357

So this change can be merged for v3.5.1.

@adrianeboyd closed this Mar 2, 2023
@adrianeboyd reopened this Mar 2, 2023
@adrianeboyd (Contributor) commented:

Putting this back in draft because it appears to break the training quickstart in the deploy preview.

@adrianeboyd marked this pull request as draft March 2, 2023 08:26
@adrianeboyd (Contributor) commented Mar 2, 2023

Locally it's not completely broken, but it doesn't appear to do what it's intended to do. This may take a bit longer to debug because of the buggy/frustrating jinja2js conversion.

@adrianeboyd closed this Mar 7, 2023
@adrianeboyd reopened this Mar 7, 2023
@adrianeboyd marked this pull request as ready for review March 7, 2023 10:30
@adrianeboyd (Contributor) commented:

I think the older website preview in the PR had gotten too old or out-of-sync with more recent website changes; regenerated with the minor change above, it now looks like it's working as intended to me.

Locally the issue was that npm run dev doesn't automatically regenerate the js version of the config template like it used to, so my local preview was also out-of-sync.

@adrianeboyd merged commit e656189 into explosion:master Mar 7, 2023
adrianeboyd added a commit to adrianeboyd/spaCy that referenced this pull request Mar 9, 2023
Change GPU efficient textcat to use CNN, not BOW in generated configs (explosion#11900)

* Change GPU efficient textcat to use CNN, not BOW

* Add the transformer when using a textcat with GPU

* Switch ubuntu-latest to ubuntu-20.04 in main tests (explosion#11928)

* Switch ubuntu-latest to ubuntu-20.04 in main tests

* Only use 20.04 for 3.6

* Require thinc v8.1.7

* Require thinc v8.1.8

* Break up longer expression

---------

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>