Make lowercasing optional #682

grenwi · 2022-08-19T08:25:58Z

While you can use your own CountVectorizer to enable or disable lowercasing for the document embeddings, it seems that lowercasing is currently always performed on the document text in the _preprocess_text function before words are extracted via class-based TF-IDF.

This is probably unwanted for some languages such as German, where capitalization changes the meaning of a number of words.

Can the lowercasing in this step be made optional?


MaartenGr · 2022-08-23T05:36:55Z

Thank you for sharing this. It seems that the lowercasing in that step is redundant, since the CountVectorizer lowercases by default, so I think it can safely be removed without affecting the pipeline. That would mean that you will have to define your own CountVectorizer and disable lowercasing there. There is currently a PR in the works for a new version of BERTopic, so I'll make sure to change that there.
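For anyone landing here later, a minimal sketch of what that looks like, assuming a BERTopic version in which the extra lowercasing step has been removed. The two German sentences are made up for illustration, and `build_topic_model` is a hypothetical helper name, not part of BERTopic. `lowercase` is a standard scikit-learn `CountVectorizer` parameter and `vectorizer_model` is the BERTopic argument for passing a custom vectorizer.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Illustrative documents: "essen" (to eat) vs. "Essen" (the meal / the city).
docs = [
    "Wir essen heute Abend zusammen.",
    "Das Essen war sehr gut.",
]

# scikit-learn lowercases by default; lowercase=False preserves case.
cased = CountVectorizer(lowercase=False).fit(docs)
uncased = CountVectorizer().fit(docs)

cased_vocab = set(cased.vocabulary_)
uncased_vocab = set(uncased.vocabulary_)

print(sorted(cased_vocab - uncased_vocab))


def build_topic_model():
    """Hypothetical helper: a BERTopic model that keeps the original casing."""
    # Imported lazily so the vectorizer comparison above runs without BERTopic.
    from bertopic import BERTopic

    return BERTopic(vectorizer_model=CountVectorizer(lowercase=False))
```

With `lowercase=False` the vocabulary keeps "essen" and "Essen" as separate terms, while the default vectorizer merges them into one.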

* Online/incremental topic modeling with .partial_fit
* Expose c-TF-IDF model for customization with bertopic.vectorizers.ClassTfidfTransformer
* Expose attributes for easier access to internal data
* Major changes to the Algorithm page of the documentation, which now contains three overviews of the algorithm
* Added an example of combining BERTopic with KeyBERT
* Added many tests with the intention of making development a bit more stable
* Fix #632, #648, #673, #682, #667, #664

MaartenGr mentioned this issue Aug 23, 2022

v0.12 #668

Merged

MaartenGr added a commit that referenced this issue Aug 31, 2022

Fix #682

1f8362b

MaartenGr closed this as completed in #668 Sep 11, 2022
