IndexError: list index out of range when using zeroshot_topic_list in 0.16.1 #1946

Open
andiwinata opened this issue Apr 26, 2024 · 19 comments


@andiwinata

Hi, I recently re-ran a notebook that uses zeroshot_topic_list and got IndexError: list index out of range.
I fixed this by downgrading to 0.16.0.

Full stacktrace:

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
Cell In[2], line 18
      9 vectorizer_model = CountVectorizer(stop_words="english")
     11 topic_model = BERTopic(
     12     min_topic_size=20,
     13     zeroshot_topic_list=zeroshot_topic_list,
     14     zeroshot_min_similarity=.25,
     15     vectorizer_model=vectorizer_model
     16 )
---> 18 topics, probs = topic_model.fit_transform(docs)
     19 topic_model.get_topic_info()

File /opt/conda/lib/python3.10/site-packages/bertopic/_bertopic.py:448, in BERTopic.fit_transform(self, documents, embeddings, images, y)
    446 # Combine Zero-shot with outliers
    447 if self._is_zeroshot() and len(documents) != len(doc_ids):
--> 448     predictions = self._combine_zeroshot_topics(documents, assigned_documents, assigned_embeddings)
    450 return predictions, self.probabilities_

File /opt/conda/lib/python3.10/site-packages/bertopic/_bertopic.py:3682, in BERTopic._combine_zeroshot_topics(self, documents, assigned_documents, embeddings)
   3680 cluster_indices = list(documents.Old_ID.values)
   3681 cluster_names = list(merged_model.topic_labels_.values())[len(set(y)):]
-> 3682 cluster_topics = [cluster_names[topic + self._outliers] for topic in documents.Topic.values]
   3684 df = pd.DataFrame({
   3685     "Indices": zeroshot_indices + cluster_indices,
   3686     "Label": zeroshot_topics + cluster_topics}
   3687 ).sort_values("Indices")
   3688 reverse_topic_labels = dict((v, k) for k, v in merged_model.topic_labels_.items())

File /opt/conda/lib/python3.10/site-packages/bertopic/_bertopic.py:3682, in <listcomp>(.0)
   3680 cluster_indices = list(documents.Old_ID.values)
   3681 cluster_names = list(merged_model.topic_labels_.values())[len(set(y)):]
-> 3682 cluster_topics = [cluster_names[topic + self._outliers] for topic in documents.Topic.values]
   3684 df = pd.DataFrame({
   3685     "Indices": zeroshot_indices + cluster_indices,
   3686     "Label": zeroshot_topics + cluster_topics}
   3687 ).sort_values("Indices")
   3688 reverse_topic_labels = dict((v, k) for k, v in merged_model.topic_labels_.items())
@MaartenGr
Owner

Hmmm, this is surprising. Could you share your full code? That will make it easier to understand what is happening here. Also, I'm not seeing the actual error in your log. Does that mean that the error indeed happens at this line?

-> 3682 cluster_topics = [cluster_names[topic + self._outliers] for topic in documents.Topic.values]

@Bougeant

I have the same error.

@MaartenGr
Owner

@Bougeant Could you also share your code and error log? That would help me understand what is happening here.

@Bougeant

Bougeant commented Apr 26, 2024

Sure! Here goes:

pip install bertopic==0.16.1 datasets

import logging
import pandas as pd
import spacy
from sklearn.datasets import fetch_20newsgroups
from bertopic import BERTopic
from bertopic.representation import PartOfSpeech
from sentence_transformers import SentenceTransformer
from umap import UMAP
from hdbscan import HDBSCAN

spacy.cli.download("en_core_web_md")

data = fetch_20newsgroups(subset='all',  remove=('headers', 'footers', 'quotes'))
df = pd.DataFrame({"text": data['data'], "target": data['target']})
df = df.drop_duplicates(subset=["text"]).reset_index(drop=True)
classes = {i: data["target_names"][i] for i in range(len(data["target_names"]))}
df["target"] = df["target"].map(classes)

model_params = {
    "embedding_model": SentenceTransformer("all-MiniLM-L6-v2"),
    "calculate_probabilities": True,
    "representation_model": PartOfSpeech(model="en_core_web_md", top_n_words=20, pos_patterns=[[{"POS": "NOUN"}]]),
    "min_topic_size": 100,
    "nr_topics": 20,
    "zeroshot_topic_list": ["baseball", "hockey", "space", "medecine", "encryption", "middle-east politics", "cars", "motorcycle", "electronics", "computers", "religion"],
    "zeroshot_min_similarity": 0.5
}

topic_model = BERTopic(**model_params)
embeddings = topic_model.embedding_model.encode(df["text"], show_progress_bar=True)
topic_model.fit(df["text"].to_list(), embeddings)

This is the error I get:

cluster_topics = [cluster_names[topic + self._outliers] for topic in documents.Topic.values]
--> IndexError: list index out of range

It seems the error arises because cluster_names does not include the outlier cluster, so adding self._outliers pushes the last index out of range (we try to access index 14, i.e. the 15th element, of a 14-element list):

cluster_names = ['0_game_team_year_games', '1_health_patients_doctor_treatment', '2_car_bike_one_engine', '3_use_windows_one_system', '4_people_one_children_up', '5_people_arabs_one_peace', '6_health_mail_list_newsgroup', '7_space_launch_earth_orbit', '8_key_clipper_chip_encryption', '9_gay_people_sex_men', '10_post_people_one_flame', '11_one_will_people_christian', '12_fire_compound_children_people', '13_gun_guns_firearms_people']
topic = 13
self._outliers = 1
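
The failure mode described above can be reproduced without BERTopic. The following is a minimal sketch, assuming (as the values above suggest) that cluster_names holds labels only for topics 0 to 13 and has no entry for the outlier topic. The shortened labels and the variable `outliers` (a stand-in for `self._outliers`) are illustrative and are not the library's code.

```python
# Illustrative stand-in for the values reported above; not BERTopic's code.
cluster_names = [f"{i}_label" for i in range(14)]  # labels for topics 0..13, no outlier entry
outliers = 1   # stand-in for self._outliers
topic = 13     # highest topic id assigned to a document

index = topic + outliers  # 14, one past the last valid index (13)

label = None
error_message = None
try:
    label = cluster_names[index]
except IndexError as err:
    # The lookup fails exactly as in the reported traceback.
    error_message = str(err)

print(len(cluster_names), index, error_message)
# prints: 14 14 list index out of range
```

Any lookup of the form `names[topic + offset]` only works if the list has one entry per possible value of `topic + offset`, which is not the case here.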

@lucasgautheron

lucasgautheron commented Apr 28, 2024

Hi,

I am having the same issue (zero shot topic modelling crashes at the exact same line).

The code:

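from bertopic import BERTopic
from bertopic.representation import KeyBERTInspired
from sklearn.feature_extraction.text import CountVectorizer

# `labels` (the zero-shot topic list) and `articles` (a DataFrame with an
# "abstract" column) come from the poster's own data and are not shown here.
representation_model = KeyBERTInspired()
vectorizer_model = CountVectorizer(
    ngram_range=(1, 2), stop_words="english", min_df=30
)
embedding_model = "all-MiniLM-L6-v2"
topic_model = BERTopic(
    verbose=True,
    embedding_model=embedding_model,
    min_topic_size=50,
    calculate_probabilities=True,
    low_memory=True,
    representation_model=representation_model,
    zeroshot_topic_list=labels,
    zeroshot_min_similarity=0.5,
    language="english",
    n_gram_range=(1, 2),
)
topics, probs = topic_model.fit_transform(articles["abstract"].tolist())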

I have printed out the following variables before the crash:

len(cluster_names): 78
np.max(documents.Topic.values): 77
np.min(documents.Topic.values): -1
self._outliers: 1
len(set(y)): 13 (which is also equal to len(labels), the number of input zero-shot labels)

In other words, the issue is the same as that reported by @Bougeant.
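
Those printed values are enough to see the out-of-range lookup. A small sketch, where `lookup_in_range` is a hypothetical helper (not part of BERTopic) that checks whether every `topic + outliers` index fits inside a list of `n_names` labels:

```python
def lookup_in_range(n_names, min_topic, max_topic, outliers):
    """Return True if every index topic + outliers falls inside range(n_names)."""
    lowest = min_topic + outliers
    highest = max_topic + outliers
    # Negative indices would silently wrap around in Python, so reject them too.
    return lowest >= 0 and highest < n_names


# Values printed before the crash: the highest index is 77 + 1 = 78,
# but a list of 78 names only has indices 0..77.
print(lookup_in_range(n_names=78, min_topic=-1, max_topic=77, outliers=1))  # False

# With one more name in the list, the same lookup would fit.
print(lookup_in_range(n_names=79, min_topic=-1, max_topic=77, outliers=1))  # True
```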

@andiwinata
Author

Sorry, a bit late, but this is my code:

from bertopic import BERTopic
from datasets import load_dataset
from sklearn.feature_extraction.text import CountVectorizer

data = load_dataset("HuggingFaceH4/h4_10k_prompts_ranked_gen")
docs = data["train_gen"]["prompt"]

zeroshot_topic_list = ['searching knowledge', 'answer coding problem', 'summarizing', 'rephrasing', 'roleplay', 'translate', 'generate content']
vectorizer_model = CountVectorizer(stop_words="english")

topic_model = BERTopic(
    min_topic_size=20,
    zeroshot_topic_list=zeroshot_topic_list,
    zeroshot_min_similarity=.25,
    vectorizer_model=vectorizer_model
)

topics, probs = topic_model.fit_transform(docs)
topic_model.get_topic_info()

I'm running this in a Kaggle notebook, and I think I missed the last line of the error. This is the full screenshot:

[image: screenshot of the full error]

@andiwinata
Author

accidentally closed the issue, sorry

lucasgautheron added a commit to lucasgautheron/BERTopic that referenced this issue Apr 29, 2024
@lucasgautheron

I've gotten around the problem with the following patch: master...lucasgautheron:BERTopic:patch-1

This is probably not the way you'd want to actually fix it, but I thought I should share it.

@MaartenGr
Owner

Thank you all for sharing the code! In all honesty, I'm not entirely sure why it suddenly seems to ignore outliers as the topic label should exist...

Either way, I think I managed to create a fix, but it still has to pass all the tests. Also, seeing as the tests didn't cover this specific issue, could anyone facing this issue also test whether the fix works for them? I would feel a lot more confident about having addressed this issue if it resolves it for more people than just on my machine.

Here's the PR: #1957

@MaartenGr
Owner

@lucasgautheron @andiwinata @Bougeant If you have the time, could you check whether #1957 works?

@mzhadigerov

Hi! Any updates on that? This is a big blocker in my project right now.

@MaartenGr
Owner

@mzhadigerov Have you tested the PR I linked in my comment above? If that works for you and also for others, then I can go ahead and create a new release. Until then, please check out the PR.

@mzhadigerov

@MaartenGr Thanks! It is working on my side. I cloned from the fix_1946 branch.


@mzhadigerov

@MaartenGr but my Representative_Docs of topic -1 are NaN for some reason, even though Count shows 424

@MaartenGr
Owner

@mzhadigerov The representative documents are not merged since they are essentially random documents when it concerns topic -1. Topic -1 consists of outliers that do not fall into a single group so the resulting documents are not actually related to one another.

I think representative documents could be added there, but in all honesty, I'm not sure it is worth the effort.

@mzhadigerov

mzhadigerov commented May 7, 2024

@MaartenGr Alright, if it is supposed to work like that, that's fine (I don't use the representative docs of topic -1 anyway).

I made the comment because the representative docs of topic -1 are not NaN in v0.16.0.

@MaartenGr
Owner

@mzhadigerov Thanks for sharing. It is currently low priority but I might bump it if it's important to many users.

@MaartenGr
Owner

For everyone facing this issue in 0.16.1, I just pushed an official 0.16.2 release which includes the PR I mentioned earlier. There are a bunch of open PRs with a number of interesting things that I will look through in the upcoming weeks. For now, this issue should be resolved.
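
To tell whether an environment is running the release this issue was reported against, the installed version can be checked with the standard library. This is a rough sketch: `is_affected` and `installed_bertopic_version` are hypothetical helper names, and the check only covers what this thread reports (0.16.1 fails, 0.16.0 and 0.16.2 do not); it says nothing about other releases.

```python
from importlib.metadata import PackageNotFoundError, version

AFFECTED_VERSION = "0.16.1"  # the release this issue was reported against


def is_affected(installed_version):
    """Return True if the version string is the release reported in this issue."""
    return installed_version.strip() == AFFECTED_VERSION


def installed_bertopic_version():
    """Return the installed BERTopic version, or None if it is not installed."""
    try:
        return version("bertopic")
    except PackageNotFoundError:
        return None


current = installed_bertopic_version()
if current is None:
    print("bertopic is not installed")
elif is_affected(current):
    print(f"bertopic {current} is affected; upgrade to 0.16.2 or downgrade to 0.16.0")
else:
    print(f"bertopic {current} is not the release reported in this issue")
```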

@lucasgautheron

Thank you for the super quick patch; I could not try it yet, but it looks equivalent to my quickfix so I assume it works.
