IndexError: list index out of range
when using zeroshot_topic_list in 0.16.1
#1946
Comments
Hmmm, this is surprising. Could you share your full code? That will make it easier to understand what is happening here. Also, I'm not seeing the actual error in your log. Does that mean that the error indeed happens at this line?
|
I have the same error. |
@Bougeant Could you also share your code and error log? That would help me understand what is happening here. |
Sure! Here goes:
This is the error I get:
It seems that the error comes from the fact that cluster_names should not include the outlier cluster, so the last index is out of range (we try to get the 14th element of a 13-element list):
|
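A minimal sketch of the off-by-one described above, using hypothetical stand-in data rather than BERTopic's internals: `cluster_names`, `topic_ids`, and both helper functions are assumptions made for illustration only. It shows how 13 names indexed against 14 topic ids (the 13 clusters plus the outlier topic -1) raises the IndexError, and one way a lookup could treat -1 separately.

```python
# Hypothetical stand-ins for illustration; these are NOT BERTopic internals.
cluster_names = [f"topic_{i}" for i in range(13)]  # 13 names, outliers excluded
topic_ids = [-1] + list(range(13))                 # 14 ids, outlier topic -1 included


def names_by_position(ids, names):
    """Failing pattern: assumes there is exactly one name per topic id."""
    return [names[pos] for pos, _ in enumerate(ids)]


def names_by_id(ids, names, outlier_label="outliers"):
    """Look names up by topic id and handle the outlier topic (-1) explicitly."""
    return {tid: outlier_label if tid == -1 else names[tid] for tid in ids}


try:
    names_by_position(topic_ids, cluster_names)
    raised = False
except IndexError:
    # Position 13 is requested from a list whose last valid index is 12.
    raised = True

safe = names_by_id(topic_ids, cluster_names)
print(raised, len(safe))
```

Whether the actual fix works this way is not stated in the thread; the sketch only reproduces the index arithmetic.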
Hi, I am having the same issue (zero-shot topic modelling crashes at the exact same line). The code:
representation_model = KeyBERTInspired()
vectorizer_model = CountVectorizer(
ngram_range=(1, 2), stop_words="english", min_df=30
)
embedding_model = "all-MiniLM-L6-v2"
topic_model = BERTopic(
verbose=True,
embedding_model=embedding_model,
min_topic_size=50,
calculate_probabilities=True,
low_memory=True,
representation_model=representation_model,
zeroshot_topic_list=labels,
zeroshot_min_similarity=0.5,
language="english",
n_gram_range=(1, 2),
)
topics, probs = topic_model.fit_transform(articles["abstract"].tolist())
I have printed out the following variables before the crash:
In other words, the issue is the same as that reported by @Bougeant. |
Sorry, a bit late, but this is my code:
from bertopic import BERTopic
from datasets import load_dataset
from sklearn.feature_extraction.text import CountVectorizer
data = load_dataset("HuggingFaceH4/h4_10k_prompts_ranked_gen")
docs = data["train_gen"]["prompt"]
zeroshot_topic_list = ['searching knowledge', 'answer coding problem', 'summarizing', 'rephrasing', 'roleplay', 'translate', 'generate content']
vectorizer_model = CountVectorizer(stop_words="english")
topic_model = BERTopic(
min_topic_size=20,
zeroshot_topic_list=zeroshot_topic_list,
zeroshot_min_similarity=.25,
vectorizer_model=vectorizer_model
)
topics, probs = topic_model.fit_transform(docs)
topic_model.get_topic_info()
I'm running this in a Kaggle notebook, and I think I missed adding the last line of the error. This is the full screenshot: |
accidentally closed the issue, sorry |
I've gotten around the problem with the following patch: master...lucasgautheron:BERTopic:patch-1 This is probably not the way you want to actually fix it, but I thought I should share |
Thank you all for sharing the code! In all honesty, I'm not entirely sure why it suddenly seems to ignore outliers, as the topic label should exist... Either way, I think I managed to create a fix, but it still has to pass all the tests. Also, seeing as the tests didn't cover this specific issue, could anyone facing this issue test whether this fix works for them? I would feel a lot more confident that I have addressed this issue if it resolves it for more people than just on my machine. Here's the PR: #1957 |
@lucasgautheron @andiwinata @Bougeant If you have the time, could you check whether #1957 works? |
Hi! Any updates on that? This is a big blocker in my project right now. |
@mzhadigerov Have you tested the PR I linked in my comment above? If that works for you and also for others, then I can go ahead and create a new release. Until then, please check out the PR. |
@MaartenGr Thanks! It is working on my side. I cloned from |
@MaartenGr but my |
@mzhadigerov The representative documents are not merged since they are essentially random documents when it concerns topic -1. Topic -1 consists of outliers that do not fall into a single group so the resulting documents are not actually related to one another. I think it could be done to add representative documents there but in all honesty, I'm not sure it is worth the effort. |
@MaartenGr Alright, if it is supposed to work like that (I don't use the representative docs of topic -1 anyway). I made the comment because the representative docs of topic -1 are not NaN in v0.16.0. |
@mzhadigerov Thanks for sharing. It is currently low priority but I might bump it if it's important to many users. |
For everyone facing this issue in 0.16.1, I just pushed an official 0.16.2 release which has the PR I mentioned earlier implemented. There are a bunch of open PRs with a lot of interesting changes that I will look through in the upcoming weeks. For now, this issue should be resolved. |
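For anyone unsure which release they are running, here is a small sketch that checks the installed version against the one reported as affected in this thread (0.16.1; both 0.16.0 and 0.16.2 are reported as working). The helper names `is_affected` and `installed_bertopic_version` are hypothetical, and the check assumes the package is installed under the distribution name `bertopic`.

```python
from importlib.metadata import PackageNotFoundError, version

# Per this thread: 0.16.1 shows the IndexError; 0.16.0 and 0.16.2 do not.
AFFECTED_VERSIONS = {"0.16.1"}


def is_affected(installed: str) -> bool:
    """Return True if the given version string is the release reported as broken."""
    return installed.strip() in AFFECTED_VERSIONS


def installed_bertopic_version():
    """Return the installed bertopic version string, or None if it is not installed."""
    try:
        return version("bertopic")
    except PackageNotFoundError:
        return None


found = installed_bertopic_version()
if found is None:
    print("bertopic is not installed in this environment")
elif is_affected(found):
    print(f"bertopic {found} is affected; consider upgrading to 0.16.2")
else:
    print(f"bertopic {found} is not the release reported as affected")
```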
Thank you for the super quick patch; I could not try it yet, but it looks equivalent to my quickfix so I assume it works. |
Hi, I recently re-ran a notebook for zeroshot_topic_list and got the IndexError: list index out of range
I fixed this by downgrading to 0.16.0
Full stacktrace: