`IndexError: list index out of range` when using zeroshot_topic_list in 0.16.1 #1946

andiwinata · 2024-04-26T08:11:06Z

Hi, I recently re-ran a notebook that uses zeroshot_topic_list and got `IndexError: list index out of range`.
Downgrading to 0.16.0 fixed it for me.

Full stacktrace:

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
Cell In[2], line 18
      9 vectorizer_model = CountVectorizer(stop_words="english")
     11 topic_model = BERTopic(
     12     min_topic_size=20,
     13     zeroshot_topic_list=zeroshot_topic_list,
     14     zeroshot_min_similarity=.25,
     15     vectorizer_model=vectorizer_model
     16 )
---> 18 topics, probs = topic_model.fit_transform(docs)
     19 topic_model.get_topic_info()

File /opt/conda/lib/python3.10/site-packages/bertopic/_bertopic.py:448, in BERTopic.fit_transform(self, documents, embeddings, images, y)
    446 # Combine Zero-shot with outliers
    447 if self._is_zeroshot() and len(documents) != len(doc_ids):
--> 448     predictions = self._combine_zeroshot_topics(documents, assigned_documents, assigned_embeddings)
    450 return predictions, self.probabilities_

File /opt/conda/lib/python3.10/site-packages/bertopic/_bertopic.py:3682, in BERTopic._combine_zeroshot_topics(self, documents, assigned_documents, embeddings)
   3680 cluster_indices = list(documents.Old_ID.values)
   3681 cluster_names = list(merged_model.topic_labels_.values())[len(set(y)):]
-> 3682 cluster_topics = [cluster_names[topic + self._outliers] for topic in documents.Topic.values]
   3684 df = pd.DataFrame({
   3685     "Indices": zeroshot_indices + cluster_indices,
   3686     "Label": zeroshot_topics + cluster_topics}
   3687 ).sort_values("Indices")
   3688 reverse_topic_labels = dict((v, k) for k, v in merged_model.topic_labels_.items())

File /opt/conda/lib/python3.10/site-packages/bertopic/_bertopic.py:3682, in <listcomp>(.0)
   3680 cluster_indices = list(documents.Old_ID.values)
   3681 cluster_names = list(merged_model.topic_labels_.values())[len(set(y)):]
-> 3682 cluster_topics = [cluster_names[topic + self._outliers] for topic in documents.Topic.values]
   3684 df = pd.DataFrame({
   3685     "Indices": zeroshot_indices + cluster_indices,
   3686     "Label": zeroshot_topics + cluster_topics}
   3687 ).sort_values("Indices")
   3688 reverse_topic_labels = dict((v, k) for k, v in merged_model.topic_labels_.items())


MaartenGr · 2024-04-26T08:35:10Z

Hmmm, this is surprising. Could you share your full code? That will make it easier to understand what is happening here. Also, I'm not seeing the actual error in your log. Does that mean that the error indeed happens at this line?

-> 3682 cluster_topics = [cluster_names[topic + self._outliers] for topic in documents.Topic.values]

Bougeant · 2024-04-26T13:49:23Z

I have the same error.

MaartenGr · 2024-04-26T13:54:46Z

@Bougeant Could you also share your code and error log? That would help me understand what is happening here.

Bougeant · 2024-04-26T15:02:09Z

Sure! Here goes:

pip install bertopic==0.16.1 datasets

import logging
import pandas as pd
import spacy
from sklearn.datasets import fetch_20newsgroups
from bertopic import BERTopic
from bertopic.representation import PartOfSpeech
from sentence_transformers import SentenceTransformer
from umap import UMAP
from hdbscan import HDBSCAN

spacy.cli.download("en_core_web_md")

data = fetch_20newsgroups(subset='all',  remove=('headers', 'footers', 'quotes'))
df = pd.DataFrame({"text": data['data'], "target": data['target']})
df = df.drop_duplicates(subset=["text"]).reset_index(drop=True)
classes = {i: data["target_names"][i] for i in range(len(data["target_names"]))}
df["target"] = df["target"].map(classes)

model_params = {
    "embedding_model": SentenceTransformer("all-MiniLM-L6-v2"),
    "calculate_probabilities": True,
    "representation_model": PartOfSpeech(model="en_core_web_md", top_n_words=20, pos_patterns=[[{"POS": "NOUN"}]]),
    "min_topic_size": 100,
    "nr_topics": 20,
    "zeroshot_topic_list": ["baseball", "hockey", "space", "medecine", "encryption", "middle-east politics", "cars", "motorcycle", "electronics", "computers", "religion"],
    "zeroshot_min_similarity": 0.5
}

topic_model = BERTopic(**model_params)
embeddings = topic_model.embedding_model.encode(df["text"], show_progress_bar=True)
topic_model.fit(df["text"].to_list(), embeddings)

This is the error I get:

cluster_topics = [cluster_names[topic + self._outliers] for topic in documents.Topic.values]
--> IndexError: list index out of range

It seems that the error arises because cluster_names does not include the outlier cluster while the lookup is still offset by self._outliers, so the last index is out of range (we try to get the 15th element of a 14-element list):

cluster_names = ['0_game_team_year_games', '1_health_patients_doctor_treatment', '2_car_bike_one_engine', '3_use_windows_one_system', '4_people_one_children_up', '5_people_arabs_one_peace', '6_health_mail_list_newsgroup', '7_space_launch_earth_orbit', '8_key_clipper_chip_encryption', '9_gay_people_sex_men', '10_post_people_one_flame', '11_one_will_people_christian', '12_fire_compound_children_people', '13_gun_guns_firearms_people']
topic = 13
self._outliers = 1
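
To make the arithmetic concrete, here is a minimal sketch of the failing lookup outside of BERTopic. It assumes, as the values above suggest, that the label list has no entry for the outlier topic -1. The placeholder labels and the `label_for` helper are made up for illustration; only the `names[topic + outliers]` expression mirrors the line in the traceback.

```python
# Placeholder labels for topics 0..13 (14 entries, no label for outlier topic -1).
cluster_names = [f"{i}_placeholder" for i in range(14)]
outliers = 1  # stands in for self._outliers


def label_for(topic, names, outliers):
    """Mirror the lookup from the traceback: names[topic + outliers]."""
    return names[topic + outliers]


for topic in (-1, 12, 13):
    try:
        print(topic, "->", label_for(topic, cluster_names, outliers))
    except IndexError as exc:
        # topic 13 asks for index 14, but valid indices are 0..13
        print(topic, "-> IndexError:", exc)
```

Under this assumption the offset is also wrong for every other topic: topic -1 picks up the label of topic 0, and topic 12 picks up the label of topic 13. Only the highest topic actually raises.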

lucasgautheron · 2024-04-28T09:32:16Z

Hi,

I am having the same issue (zero shot topic modelling crashes at the exact same line).

The code:

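from bertopic import BERTopic
from bertopic.representation import KeyBERTInspired
from sklearn.feature_extraction.text import CountVectorizer

# `labels` (the zero-shot topic list) and `articles` (a DataFrame with an
# "abstract" column) are my own data and are not shown here.
representation_model = KeyBERTInspired()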
vectorizer_model = CountVectorizer(
    ngram_range=(1, 2), stop_words="english", min_df=30
)
embedding_model = "all-MiniLM-L6-v2"
topic_model = BERTopic(
    verbose=True,
    embedding_model=embedding_model,
    min_topic_size=50,
    calculate_probabilities=True,
    low_memory=True,
    representation_model=representation_model,
    zeroshot_topic_list=labels,
    zeroshot_min_similarity=0.5,
    language="english",
    n_gram_range=(1, 2),
)
topics, probs = topic_model.fit_transform(articles["abstract"].tolist())

I have printed out the following variables before the crash:

len(cluster_names): 78
np.max(documents.Topic.values): 77
np.min(documents.Topic.values): -1
self._outliers: 1
len(set(y)): 13 (which is also equal to len(labels), the number of input zero-shot labels)

In other words, the issue is the same as that reported by @Bougeant.

andiwinata · 2024-04-29T05:42:01Z

Sorry, a bit late, but this is my code:

from bertopic import BERTopic
from datasets import load_dataset
from sklearn.feature_extraction.text import CountVectorizer

data = load_dataset("HuggingFaceH4/h4_10k_prompts_ranked_gen")
docs = data["train_gen"]["prompt"]

zeroshot_topic_list = ['searching knowledge', 'answer coding problem', 'summarizing', 'rephrasing', 'roleplay', 'translate', 'generate content']
vectorizer_model = CountVectorizer(stop_words="english")

topic_model = BERTopic(
    min_topic_size=20,
    zeroshot_topic_list=zeroshot_topic_list,
    zeroshot_min_similarity=.25,
    vectorizer_model=vectorizer_model
)

topics, probs = topic_model.fit_transform(docs)
topic_model.get_topic_info()

I'm running this in a Kaggle notebook, and I think I left out the last line of the error. This is the full screenshot:

andiwinata · 2024-04-29T05:43:01Z

accidentally closed the issue, sorry

lucasgautheron · 2024-04-29T09:30:00Z

I've gotten around the problem with the following patch: master...lucasgautheron:BERTopic:patch-1

This is probably not the way you want to actually fix it, but I thought I should share

MaartenGr · 2024-04-29T13:00:14Z

Thank you all for sharing the code! In all honesty, I'm not entirely sure why it suddenly seems to ignore outliers as the topic label should exist...

Either way, I think I managed to create a fix, but it still has to pass all the tests. Also, since the tests didn't cover this specific issue, could anyone facing it also test whether this fix works for them? I would feel a lot more confident that the issue is addressed if the fix works for more people than just me on my machine.

Here's the PR: #1957

MaartenGr · 2024-05-04T08:58:47Z

@lucasgautheron @andiwinata @Bougeant If you have the time, could you check whether #1957 works?

mzhadigerov · 2024-05-06T18:32:55Z

Hi! Any updates on that? This is a big blocker in my project right now.

MaartenGr · 2024-05-06T18:38:50Z

@mzhadigerov Have you tested the PR I linked in my comment above? If that works for you and also for others, then I can go ahead and create a new release. Until then, please check out the PR.

mzhadigerov · 2024-05-06T18:54:05Z

@MaartenGr Thanks! It is working on my side. I cloned from fix_1946 branch.

mzhadigerov · 2024-05-06T18:55:27Z

@MaartenGr but my Representative_Docs of topic -1 are NaN for some reason, even though Count shows 424

MaartenGr · 2024-05-07T13:54:23Z

@mzhadigerov The representative documents are not merged since they are essentially random documents when it concerns topic -1. Topic -1 consists of outliers that do not fall into a single group so the resulting documents are not actually related to one another.

Representative documents could be added there, but in all honesty, I'm not sure it is worth the effort.
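
For anyone who still wants to look at a few outlier documents, they can be pulled out by hand from the fitted topic assignments. This is only a sketch, assuming `topics` is the list returned by `fit_transform` and `docs` is the list passed to it. The `sample_outlier_docs` helper is hypothetical and not part of BERTopic, and the result is a random sample, not a set of representative documents.

```python
import random


def sample_outlier_docs(docs, topics, n=3, seed=0):
    """Return up to n randomly chosen documents that were assigned to topic -1."""
    outlier_docs = [doc for doc, topic in zip(docs, topics) if topic == -1]
    rng = random.Random(seed)  # fixed seed so repeated calls give the same sample
    return rng.sample(outlier_docs, k=min(n, len(outlier_docs)))


# Toy data standing in for real documents and their assigned topics.
docs = ["doc a", "doc b", "doc c", "doc d"]
topics = [-1, 0, -1, 1]
print(sample_outlier_docs(docs, topics, n=5))
```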

mzhadigerov · 2024-05-07T19:16:57Z

@MaartenGr Alright, if it is supposed to work like that, that's fine (I don't use the representative docs of topic -1 anyway).

I made the comment because the Rep.Docs of -1 are not NaN in v0.16.0

MaartenGr · 2024-05-07T20:19:34Z

@mzhadigerov Thanks for sharing. It is currently low priority but I might bump it if it's important to many users.

MaartenGr · 2024-05-12T09:41:56Z

For everyone facing this issue in 0.16.1, I just pushed an official 0.16.2 release, which includes the PR I mentioned earlier. There are a bunch of open PRs with a number of interesting changes that I will look through in the upcoming weeks. For now, this issue should be resolved.
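
A quick way to check whether an environment is still on the affected release is to read the installed version with the standard library. This is a rough sketch. The `is_affected` and `installed_bertopic_version` helpers are made up for illustration, and the only version treated as affected is 0.16.1, since that is the one reported in this thread.

```python
from importlib.metadata import PackageNotFoundError, version

# Only the release reported in this thread; other versions are not checked.
AFFECTED = {"0.16.1"}


def is_affected(installed_version: str) -> bool:
    """True if the given version string is the release reported here."""
    return installed_version.strip() in AFFECTED


def installed_bertopic_version():
    """Return the installed bertopic version, or None if it is not installed."""
    try:
        return version("bertopic")
    except PackageNotFoundError:
        return None


current = installed_bertopic_version()
if current is None:
    print("bertopic is not installed in this environment")
elif is_affected(current):
    print(f"bertopic {current} is the release reported in this issue")
else:
    print(f"bertopic {current} is not the release reported in this issue")
```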

lucasgautheron · 2024-05-12T09:51:59Z

Thank you for the super quick patch; I could not try it yet, but it looks equivalent to my quickfix so I assume it works.

andiwinata closed this as completed Apr 29, 2024

andiwinata reopened this Apr 29, 2024

lucasgautheron added a commit to lucasgautheron/BERTopic that referenced this issue Apr 29, 2024

MaartenGr#1946 hotfix

5a910fb

MaartenGr mentioned this issue Apr 29, 2024

Fix issue with zeroshot topic modeling missing outlier #1957

Merged

mzhadigerov mentioned this issue May 6, 2024

ModuleNotFoundError: Can't use LangChain with version 0.16.0 #1976

Closed
