Make lowercasing optional #682

Closed
grenwi opened this issue Aug 19, 2022 · 1 comment · Fixed by #668

grenwi commented Aug 19, 2022

While you can use your own CountVectorizer to enable / disable lowercasing for the document embeddings, it seems that lowercasing is currently always applied to the document text in the _preprocess_text function before words are extracted via class-based TF-IDF.

This is probably unwanted for some languages such as German, where capitalization changes the meaning of a number of words.

Can the lowercasing in this step be made optional?

@MaartenGr
Owner

Thank you for sharing this. It seems that lowercasing is quite redundant, as the CountVectorizer lowercases by default, so I think that step can be safely removed without affecting the pipeline. That would mean that you will have to define your own CountVectorizer and disable lowercasing there. There is currently a PR in the works for a new version of BERTopic, so I'll make sure to change that there.

MaartenGr mentioned this issue Aug 23, 2022
MaartenGr added a commit that referenced this issue Aug 31, 2022
MaartenGr added a commit that referenced this issue Sep 11, 2022
* Online/incremental topic modeling with .partial_fit
* Expose c-TF-IDF model for customization with bertopic.vectorizers.ClassTfidfTransformer
* Expose attributes for easier access to internal data
* Major changes to the Algorithm page of the documentation, which now contains three overviews of the algorithm
* Added an example of combining BERTopic with KeyBERT
* Added many tests with the intention of making development a bit more stable
* Fix #632, #648, #673, #682, #667, #664