
Change GPU efficient textcat to use CNN, not BOW in generated configs #11900

Merged: 7 commits into explosion:master on Mar 7, 2023

Conversation

@polm (Contributor) commented Nov 30, 2022

Description

If you generate a config with a textcat component using GPU (transformers), the default option (efficiency) uses a BOW architecture, which does not use tok2vec features. While that can make sense as part of a larger pipeline, in the case of just a transformer and a textcat it means the transformer is doing a lot of work for no purpose.

This changes it so that the CNN architecture is used instead. It could also be changed to be the same as the accuracy config, which uses the ensemble architecture.
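
For illustration, the relevant generated section after this change looks roughly like the sketch below. The registered names (spacy.TextCatCNN.v2, spacy-transformers.TransformerListener.v1, reduce_mean.v1) are the standard ones, but treat the exact values as an approximation of the quickstart output rather than a verbatim copy:

[components.textcat]
factory = "textcat"

[components.textcat.model]
# CNN textcat that consumes tok2vec features, so the transformer is actually used
@architectures = "spacy.TextCatCNN.v2"
exclusive_classes = true
nO = null

[components.textcat.model.tok2vec]
# Listener that shares the transformer component's output with the textcat
@architectures = "spacy-transformers.TransformerListener.v1"
grad_factor = 1.0

[components.textcat.model.tok2vec.pooling]
@layers = "reduce_mean.v1"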

I haven't tested this in the browser yet because I'm having difficulty with the site's build env, but it works fine in the CLI, and I imagine it's fine since there are no actual logic changes.

Checklist

  • I confirm that I have the right to submit this contribution under the project's MIT license.
  • I ran the tests, and all new and existing tests passed.
  • My changes don't require a change to the documentation, or if they do, I've added all required information.

@polm added the labels docs (Documentation and website), feat / textcat (Feature: Text Classifier), feat / cli (Feature: Command-line interface), and feat / transformer (Feature: Transformer) on Nov 30, 2022
@adrianeboyd (Contributor) commented:

This doesn't add the transformer to the pipeline for the GPU case (there are separate conditions in the template that remove the tok2vec parts for the BOW case):

[nlp]
lang = "en"
pipeline = ["textcat"]
batch_size = 128

[components]

[components.transformer]
factory = "transformer"

[components.transformer.model]
@architectures = "spacy-transformers.TransformerModel.v3"
name = "roberta-base"
tokenizer_config = {"use_fast": true}
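
(For context, with the fix the generated [nlp] block should also list the transformer so the listener has something to connect to; a sketch, not verbatim quickstart output:)

[nlp]
lang = "en"
pipeline = ["transformer", "textcat"]
batch_size = 128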

@polm (Contributor, Author) commented Dec 5, 2022

Thanks for pointing that out, I fixed it.

Going to mark this as a draft while the issue with #11925 is investigated, since it looks like it affects the CNN textcat with transformers.

@polm marked this pull request as draft December 5, 2022 06:09
@polm marked this pull request as ready for review February 27, 2023 04:05
@polm (Contributor, Author) commented Feb 27, 2023

Brought this out of draft since #11968 is resolved, though this probably shouldn't be merged until it's released.

@adrianeboyd (Contributor) commented:

thinc v8.1.8 hasn't been released yet, so this should fail for now.

@adrianeboyd (Contributor) commented:

This test looks good: explosion/spacy-transformers#357

So this change can be merged for v3.5.1.

@adrianeboyd closed this Mar 2, 2023
@adrianeboyd reopened this Mar 2, 2023
@adrianeboyd (Contributor) commented:

Putting this back in draft because it appears to break the training quickstart in the deploy preview.

@adrianeboyd marked this pull request as draft March 2, 2023 08:26
@adrianeboyd (Contributor) commented Mar 2, 2023

Locally it's not completely broken, but it doesn't appear to do what it's intended to do. This may take a bit longer to debug because of the buggy/frustrating jinja2js conversion.

@adrianeboyd closed this Mar 7, 2023
@adrianeboyd reopened this Mar 7, 2023
@adrianeboyd marked this pull request as ready for review March 7, 2023 10:30
@adrianeboyd (Contributor) commented:

I think the older website preview in the PR had gotten too old or out-of-sync with more recent website changes; regenerated with the minor change above, it now looks like it's working as intended to me.

Locally the issue was that npm run dev doesn't automatically regenerate the js version of the config template like it used to, so my local preview was also out-of-sync.

@adrianeboyd merged commit e656189 into explosion:master Mar 7, 2023
adrianeboyd added a commit to adrianeboyd/spaCy that referenced this pull request Mar 9, 2023
Change GPU efficient textcat to use CNN, not BOW in generated configs (explosion#11900)

* Change GPU efficient textcat to use CNN, not BOW

* Add the transformer when using a textcat with GPU

* Switch ubuntu-latest to ubuntu-20.04 in main tests (explosion#11928)

* Switch ubuntu-latest to ubuntu-20.04 in main tests

* Only use 20.04 for 3.6

* Require thinc v8.1.7

* Require thinc v8.1.8

* Break up longer expression

---------

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>