In multi-aspect context, allow Main model to be chained [Issue #1846] #2002
Conversation
bertopic/_bertopic.py (Outdated):

        topics = self.representation_model["Main"].extract_topics(self, documents, c_tf_idf, topics)
    elif isinstance(self.representation_model["Main"], list):
        for tuner in self.representation_model["Main"]:
            topics = tuner.extract_topics(self, documents, c_tf_idf, topics)
    topics = {label: values[:self.top_n_words] for label, values in topics.items()}
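The chaining behavior in the diff above can be sketched in isolation. The classes below are hypothetical stand-ins for BERTopic's representation models, not the real BaseRepresentation interface (which also receives the topic model, documents, and c-TF-IDF matrix):

```python
# Toy stand-ins for representation models (hypothetical; real models
# implement BERTopic's BaseRepresentation interface).
class DoubleScoreTuner:
    """Doubles every word's score."""

    def extract_topics(self, topics):
        return {label: [(word, score * 2) for word, score in values]
                for label, values in topics.items()}


class DropLowScoreTuner:
    """Keeps only words scoring above 0.5."""

    def extract_topics(self, topics):
        return {label: [(w, s) for w, s in values if s > 0.5]
                for label, values in topics.items()}


# A chained "Main" representation is simply applied left to right,
# each model refining the previous model's output:
chain = [DoubleScoreTuner(), DropLowScoreTuner()]
topics = {0: [("apple", 0.4), ("banana", 0.2)]}
for tuner in chain:
    topics = tuner.extract_topics(topics)
```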
Although it would be nice to have chained main models, it does bring a number of difficulties.
First, whenever you chain the main representation, any additional aspects that you want to create are then created on top of that chained representation. This inhibits the functionality of creating multiple aspects. Thus, whenever you use a chained Main representation, you cannot really use additional aspects as they will use the same chained base as the Main representation.
Second, whenever you create a chained representation as the Main model, it will still apply top_n_words, which might not be appropriate depending on the last model in the chain.
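A hypothetical illustration of that second point: BERTopic slices every topic to top_n_words after the Main model(s) run (the last line of the diff above), so a chained Main representation can have its output silently truncated. The values below are invented for the sketch:

```python
top_n_words = 3  # assumed global setting for this sketch

# Suppose the last model in the chain deliberately returned 5 keyphrases:
topics = {0: [("neural network", 0.9), ("deep learning", 0.8),
              ("transformer", 0.7), ("attention", 0.6), ("embedding", 0.5)]}

# The slicing applied after the chain (mirrors the diff above) drops
# the last two keyphrases regardless of what the chain intended:
topics = {label: values[:top_n_words] for label, values in topics.items()}
```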
Interesting, thanks for explaining that. I clearly wasn't aware of the full implications of this change.
For the first issue: What is the desired behavior here? If I'm creating multiple aspects, I would think that each chain would be independent, not concatenated onto "Main". If that's the intended behavior, I think I can fix that.
For the second issue: It seems like this is a problem with chained main representations in general, whether or not we have multiple aspects. What is the intended behavior of top_n_words? Do we wish to automatically apply it after "Main" and before every non-Main aspect? Or do we wish to apply it after every non-Main aspect? Or do we prefer allowing the user to control whether/when top_n_words is applied to each aspect?
For the first issue: What is the desired behavior here? If I'm creating multiple aspects, I would think that each chain would be independent, not concatenated onto "Main". If that's the intended behavior, I think I can fix that.
The desired behavior would be that any additional aspect is built on top of the default "Main" representation which is the c-TF-IDF representation. If it would be built on top of the chained "Main" representation, then that would create strange results, especially from a user-experience perspective.
For the second issue: It seems like this is a problem with chained main representations in general, whether or not we have multiple aspects. What is the intended behavior of top_n_words? Do we wish to automatically apply it after "Main" and before every non-Main aspect? Or do we wish to apply it after every non-Main aspect? Or do we prefer allowing the user to control whether/when top_n_words is applied to each aspect?
top_n_words is used mainly for c-TF-IDF, i.e., the default representation. Other representation models might have their own top_n_words parameter.
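A hypothetical stand-in showing what a model-local parameter like that can look like: the model carries its own top_n limit, independent of BERTopic's global top_n_words (which governs the default c-TF-IDF output). The class and parameter names here are invented for the sketch:

```python
class TopNRepresentation:
    """Toy representation model with its own, model-local top_n."""

    def __init__(self, top_n=10):
        self.top_n = top_n  # local limit, not BERTopic's global top_n_words

    def extract_topics(self, topics):
        # Truncate each topic using the model's own setting.
        return {label: values[:self.top_n]
                for label, values in topics.items()}


model = TopNRepresentation(top_n=2)
topics = {0: [("a", 0.3), ("b", 0.2), ("c", 0.1)]}
result = model.extract_topics(topics)
```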
Thanks for the feedback. I've pushed a new version where I attempted the following:
For the Main representation:
- If the Main representation is missing, None, or otherwise falsy, use the default c-TF-IDF + top_n_words
- Avoid applying top_n_words in all other cases
- Process BaseRepresentation, list[BaseRepresentation], and dict[str, BaseRepresentation | list[BaseRepresentation]]
- Raise TypeError in all other cases
For other aspects:
- Copy the c-TF-IDF topics directly to use as input
- If aspect_model is falsy, use the default c-TF-IDF + top_n_words
- Avoid applying top_n_words in all other cases
- Process BaseRepresentation and list[BaseRepresentation]
- Raise TypeError in all other cases
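The dispatch described for the Main representation can be sketched roughly as follows. The function name and structure are hypothetical simplifications, not the actual code in bertopic/_bertopic.py, and BaseRepresentation is reduced to a stub here:

```python
class BaseRepresentation:
    """Stub of BERTopic's representation-model base class."""

    def extract_topics(self, topics):
        return topics


def resolve_main(representation_model):
    """Normalize the Main representation to a list of models to chain.

    An empty list signals the default c-TF-IDF + top_n_words path.
    """
    if not representation_model:  # missing, None, or otherwise falsy
        return []
    if isinstance(representation_model, BaseRepresentation):
        return [representation_model]
    if isinstance(representation_model, list):
        return representation_model
    if isinstance(representation_model, dict):
        # Only the "Main" entry matters here; other keys are aspects.
        return resolve_main(representation_model.get("Main"))
    raise TypeError(
        f"Unsupported representation model type: {type(representation_model)!r}"
    )
```

Usage: resolve_main(None) yields the default path, resolve_main({"Main": some_model}) yields a one-element chain, and anything else (e.g. an int) raises TypeError.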
@ddicato Apologies for the late reply and thank you for the changes! This is exactly the behavior as intended and it is nicely and elegantly coded.