Make lowercasing optional #682
Comments
Thank you for sharing this. It seems that lowercasing is quite redundant as the |
* Online/incremental topic modeling with .partial_fit
* Expose c-TF-IDF model for customization with bertopic.vectorizers.ClassTfidfTransformer
* Expose attributes for easier access to internal data
* Major changes to the Algorithm page of the documentation, which now contains three overviews of the algorithm
* Added an example of combining BERTopic with KeyBERT
* Added many tests with the intention of making development a bit more stable
* Fix #632, #648, #673, #682, #667, #664
While you can use your own CountVectorizer to enable / disable lowercasing for the document embeddings, it seems that lowercasing is currently always performed on the document text in the _preprocess_text function before words are extracted via class-based TF-IDF. This is probably unwanted for some languages, such as German, where a number of words change their meaning depending on capitalization.
Can lowercasing in this step be made optional?
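To illustrate the request, here is a minimal sketch of what an optional lowercasing step could look like. It is not BERTopic's actual code: the function name preprocess_text and its lowercase parameter are hypothetical, and the whitespace cleanup is only an assumed stand-in for whatever else the real _preprocess_text does.

```python
from typing import List


def preprocess_text(documents: List[str], lowercase: bool = True) -> List[str]:
    """Hypothetical stand-in for the preprocessing step discussed in this issue.

    Newlines and tabs are replaced with spaces (an assumed cleanup step);
    lowercasing is applied only when `lowercase` is True.
    """
    cleaned = [doc.replace("\n", " ").replace("\t", " ") for doc in documents]
    if lowercase:
        cleaned = [doc.lower() for doc in cleaned]
    return cleaned


# German example: "Weg" (path) and "weg" (away) differ only in capitalization.
docs = ["Der Weg ist lang.", "Er ist weg."]
cased = preprocess_text(docs, lowercase=False)    # keeps "Weg" and "weg" distinct
lowered = preprocess_text(docs, lowercase=True)   # both become "weg"
```

With lowercase=False the two words stay distinct tokens for the later word-extraction step, which is the behavior asked for here.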