Ability to change # of documents pulled per topic by .get_representative_docs()? #848
Issue #811 might interest you: "BERTopic: `get_representative_docs(...)` | Option to get all docs mapped to a topic beyond the default randomly selected 3".
Thank you for your kind words! Unfortunately, that is currently not possible. I have been quite hesitant to add parameters to BERTopic in order to keep the package relatively straightforward to use. Just out of curiosity, would something like the following suffice for you?

```python
import pandas as pd

# When you used `.fit_transform` (which returned `topics, probs`):
df = pd.DataFrame({"Document": docs, "Topic": topics})

# When you used `.fit`:
df = pd.DataFrame({"Document": docs, "Topic": topic_model.topics_})
```

That way, you have all documents mapped to a topic and you can go through a sample of those yourself. Or would you want the top n most representative documents extracted instead?
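As a small sketch of what you could do with such a dataframe (toy data below stands in for the real `docs` and topics a fitted model would produce), you can sample any number of documents per topic yourself:

```python
import pandas as pd

# Toy stand-ins for `docs` and the topic assignments from BERTopic
# (assumption: in a real run these come from `topic_model.fit_transform(docs)`).
docs = ["doc a", "doc b", "doc c", "doc d", "doc e"]
topics = [0, 0, 1, 1, 1]

df = pd.DataFrame({"Document": docs, "Topic": topics})

# Sample up to 2 documents per topic (increase `n` as needed).
sampled = (df.groupby("Topic")["Document"]
             .apply(lambda s: s.sample(n=min(2, len(s)), random_state=0).tolist())
             .to_dict())
print(sampled)
```

Note that this is a plain random sample per topic, not the "most representative" documents BERTopic computes.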
Extending the class to include all documents per topic is not possible, since the most representative documents in a topic are calculated rather than taken as a direct random sample from all documents. In other words, they are the documents that are most similar to the entire cluster. You can find more about that calculation here.
Yes and no, the
Yes. When you use HDBSCAN in BERTopic, we use some of the internal structure of HDBSCAN to extract the documents that best represent a cluster. However, this does not work with other clustering algorithms, since they do not share that internal structure. Instead, with v0.13, it is now possible to get the most representative documents per topic by calculating the c-TF-IDF representation of a random subset of all documents in a topic and comparing those with the topic's c-TF-IDF representation. The most similar documents per topic are then extracted.
Once again @MaartenGr, thanks for BERTopic and the responses you take time out to provide. The more I dig in, the more 'honey I ooze out'. Yes, we all have and will have feature requests. By the way, what is a great piece of software if it doesn't have feature requests? The more we use and love it, the more we request! However, for a library to remain good and fluid at what it's great at, it should be as LEAN and robust as it can be, with little fluffing. Hence, in principle, I SUPPORT @MaartenGr's hesitation to add parameters to BERTopic in order to keep the package relatively straightforward to use.
@MaartenGr Yes, this is what I was hoping for :) My code already outputs a kind of "master" dataframe of all documents with the schema | topics | docs | probs |, where documents are sorted first by ascending topic # and then by descending prob (again, thank you for making it so easy to generate & access these values!). However, I'd noticed that the documents extracted by the get_representative_docs() function seemed more coherent and similar to each other than the docs with the highest probs per topic in my aforementioned master dataframe. (This might just be my own bias, though.) Could you expand a little on the difference between how documents are scored/ranked on their representativeness vs. the probs they're assigned as a result of .fit_transform()? (Your 2020 Medium essay mentions using c-TF-IDF scores to get the most important/representative words per topic ... am I way off to assume that scoring the 'representativeness' of a given document could be done simply by summing up the c-TF-IDF scores of each word in said document?)
**v0.13**

The documents extracted in v0.13 can be adjusted with the following subclass:

```python
import hdbscan
import numpy as np
import pandas as pd
from bertopic import BERTopic


class BERTopicAdjusted(BERTopic):
    def _save_representative_docs(self, documents: pd.DataFrame):
        """Save the most representative docs (3) per topic.

        The most representative docs are extracted by taking
        the exemplars from the HDBSCAN-generated clusters.

        Full instructions can be found here:
        https://hdbscan.readthedocs.io/en/latest/soft_clustering_explanation.html

        Arguments:
            documents: Dataframe with documents and their corresponding IDs
        """
        # Prepare the condensed tree and leaf clusters beneath a given cluster
        condensed_tree = self.hdbscan_model.condensed_tree_
        raw_tree = condensed_tree._raw_tree
        clusters = sorted(condensed_tree._select_clusters())
        cluster_tree = raw_tree[raw_tree['child_size'] > 1]

        # Find the points with maximum lambda value in each leaf
        representative_docs = {}
        for topic in documents['Topic'].unique():
            if topic != -1:
                leaves = hdbscan.plots._recurse_leaf_dfs(cluster_tree, clusters[topic])
                result = np.array([])
                for leaf in leaves:
                    max_lambda = raw_tree['lambda_val'][raw_tree['parent'] == leaf].max()
                    points = raw_tree['child'][(raw_tree['parent'] == leaf) & (raw_tree['lambda_val'] == max_lambda)]
                    result = np.hstack((result, points))

                # Below, we get the top 3 documents per topic. If you want more documents
                # per topic, simply increase the value `3` to whatever value you want:
                representative_docs[topic] = list(np.random.choice(result, 3, replace=False).astype(int))

        # Convert indices to documents
        self.representative_docs_ = {topic: [documents.iloc[doc_id].Document for doc_id in doc_ids]
                                     for topic, doc_ids in representative_docs.items()}
```
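The "maximum lambda per leaf" step in that subclass can be illustrated on a toy structured array shaped like HDBSCAN's raw condensed tree (the values below are invented for illustration, not real HDBSCAN output):

```python
import numpy as np

# Toy stand-in for HDBSCAN's raw condensed tree: each row links a parent
# (leaf cluster id) to a child (point id) with a lambda value. Points with
# the maximum lambda persisted longest in the leaf, i.e. its exemplars.
raw_tree = np.array(
    [(10, 0, 0.5), (10, 1, 1.2), (10, 2, 1.2), (11, 3, 0.9), (11, 4, 0.4)],
    dtype=[("parent", int), ("child", int), ("lambda_val", float)],
)

result = np.array([])
for leaf in (10, 11):
    max_lambda = raw_tree["lambda_val"][raw_tree["parent"] == leaf].max()
    points = raw_tree["child"][(raw_tree["parent"] == leaf)
                               & (raw_tree["lambda_val"] == max_lambda)]
    result = np.hstack((result, points))

print(result.astype(int))  # exemplar point ids: [1 2 3]
```

Leaf 10's maximum lambda (1.2) is shared by points 1 and 2, so both are kept; leaf 11 contributes point 3.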
Representativeness and membership to a cluster are different things, and as such they are likely to produce different results if used for the same purpose. Cluster membership is not necessarily a metric of representativeness. You can find more about how HDBSCAN handles that here.

**v0.14+**

In the v0.14 release of BERTopic, all representative documents are extracted in the same way regardless of whether you are using HDBSCAN or another clustering algorithm. A random subset of 500 documents is sampled for each cluster, after which we use c-TF-IDF to score those documents. The resulting values are compared with their topic's c-TF-IDF values to rank the documents based on their closeness to a topic. In v0.14, you can do the following to extract a specific number of representative documents:

```python
import pandas as pd

# Prepare your documents to be used in a dataframe
documents = pd.DataFrame({"Document": docs,
                          "ID": range(len(docs)),
                          "Topic": topic_model.topics_})

# Extract the top 50 representative documents
repr_docs, _, _ = topic_model._extract_representative_docs(c_tf_idf=topic_model.c_tf_idf_,
                                                           documents=documents,
                                                           topics=topic_model.topic_representations_,
                                                           nr_repr_docs=50)
```

Do note, though, that accessing private functions will always be at risk of changes in future releases and is typically only advised if you do version control of BERTopic and its dependencies.
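To illustrate the underlying idea (this is a simplified sketch, not BERTopic's actual implementation: plain TF-IDF stands in for c-TF-IDF, and the mean document vector stands in for the topic vector), ranking documents by their similarity to a topic representation looks like this:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy documents belonging to one cluster (stand-in for the random
# subset of up to 500 documents sampled per cluster).
docs = ["apple banana fruit", "banana smoothie fruit", "car engine oil"]

# Plain TF-IDF as a stand-in for BERTopic's c-TF-IDF representation.
vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(docs)

# Use the mean of the document vectors as a stand-in topic vector.
topic_vector = np.asarray(doc_vectors.mean(axis=0))

# Rank documents by similarity to the topic; keep the top n as "representative".
sims = cosine_similarity(doc_vectors, topic_vector).ravel()
nr_repr_docs = 2
top_docs = [docs[i] for i in np.argsort(sims)[::-1][:nr_repr_docs]]
print(top_docs)
```

The fruit-related documents dominate the topic vector, so they rank above the outlier document.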
There is currently a method for doing something similar in the PR here, but what is done there is calculating the c-TF-IDF representation for each document in a topic and comparing it, through cosine similarity, with the topic's c-TF-IDF representation. This is what was implemented in v0.13 and has been further pursued in v0.14.
Thank you so much for this comprehensive reply! Will dig into the code & links you've provided :)
Hello Maarten! Just a quick note that running the `BERTopicAdjusted(BERTopic)` class listed above with the latest version of bertopic (v0.14) appears to create mismatching topic numbers — i.e., the topic numbers from .get_representative_docs() end up being different from those in .get_topic_info(). Switching back to plain BERTopic() without the adjusted class eliminates this mismatch.
@MogwaiMomo That is correct! In the v0.14 release, the way representative documents are generated was updated. It now produces a fixed number of representative documents regardless of whether topics were merged/combined. Instead, you can now use the following:

```python
import pandas as pd

# Prepare your documents to be used in a dataframe
documents = pd.DataFrame({"Document": docs,
                          "ID": range(len(docs)),
                          "Topic": topic_model.topics_})

# Extract the top 50 representative documents
repr_docs, _, _ = topic_model._extract_representative_docs(c_tf_idf=topic_model.c_tf_idf_,
                                                           documents=documents,
                                                           topics=topic_model.topic_representations_,
                                                           nr_repr_docs=50)
```

Do note, though, that accessing private functions will always be at risk of changes in future releases and is typically only advised if you do version control of BERTopic and its dependencies. I'll make sure to update the code above to indicate for which versions it works.
Discussed in #847
Originally posted by MogwaiMomo November 18, 2022
Hello!
Is there currently any way to adjust the number of documents returned by the .get_representative_docs() method? As far as I can tell, the default is to return 3 documents per topic. For me it would be incredibly valuable to be able to customize this number, so that you could pull up 5 or 10 representative documents per topic.
Thank you in advance and just a note that I cannot believe BERTopic exists as it is SO awesome and so thoughtfully designed :)