representation_model: 'NoneType' object is not iterable #1755

Open
muehlhausen opened this issue Jan 16, 2024 · 4 comments

@muehlhausen

Hey there!

First of all: thank you for developing BERTopic, it's neat! However, I am encountering an issue with representation_model when trying to rename my cluster representations. Everything works fine as long as I use only an embedding_model. As soon as I add a representation_model, I consistently get the same error.

Here is some sample code, inspired by this documentation.

# Import the necessary libraries
from bertopic import BERTopic
import pandas as pd
from transformers import pipeline
from bertopic.representation import TextGeneration

# prompt = f"I have a topic described by the following keywords: [KEYWORDS]. Based on the previous keywords, what is this topic about?"

# Create your representation model
generator = pipeline('text2text-generation', model='google/flan-t5-base')
representation_model = TextGeneration(generator)

# 4. Get some sample data
data = pd.read_excel('testdata.xlsx')

# 5. Initialize BERTopic with the representation model
topic_model = BERTopic(
    embedding_model= 'paraphrase-multilingual-mpnet-base-v2',
    representation_model = representation_model # if commented, code works
)

# 6. Fit BERTopic to the sample texts
topics, _ = topic_model.fit_transform(data['text'])

# 6. Get the topic information
topic_info = topic_model.get_topic_info()

# 7. Print the topic information
print(topic_info)

The error I get is:

TypeError                                 Traceback (most recent call last)
Cell In[3], line 26
     20 topic_model = BERTopic(
     21     embedding_model= 'paraphrase-multilingual-mpnet-base-v2',
     22     representation_model = representation_model
     23 )
     25 # 6. Fit BERTopic to the sample texts
---> 26 topics, _ = topic_model.fit_transform(data['Absatz'])
     28 # 6. Get the topic information
     29 topic_info = topic_model.get_topic_info()

File ~/Code/NDR/.venv/lib/python3.11/site-packages/bertopic/_bertopic.py:433, in BERTopic.fit_transform(self, documents, embeddings, images, y)
    430     self._save_representative_docs(custom_documents)
    431 else:
    432     # Extract topics by calculating c-TF-IDF
--> 433     self._extract_topics(documents, embeddings=embeddings, verbose=self.verbose)
    435     # Reduce topics
    436     if self.nr_topics:

File ~/Code/NDR/.venv/lib/python3.11/site-packages/bertopic/_bertopic.py:3637, in BERTopic._extract_topics(self, documents, embeddings, mappings, verbose)
   3635 documents_per_topic = documents.groupby(['Topic'], as_index=False).agg({'Document': ' '.join})
   3636 self.c_tf_idf_, words = self._c_tf_idf(documents_per_topic)
-> 3637 self.topic_representations_ = self._extract_words_per_topic(words, documents)
   3638 self._create_topic_vectors(documents=documents, embeddings=embeddings, mappings=mappings)
   3639 self.topic_labels_ = {key: f"{key}_" + "_".join([word[0] for word in values[:4]])
   3640                       for key, values in
   3641                       self.topic_representations_.items()}

File ~/Code/NDR/.venv/lib/python3.11/site-packages/bertopic/_bertopic.py:3922, in BERTopic._extract_words_per_topic(self, words, documents, c_tf_idf, calculate_aspects)
   3920         topics = tuner.extract_topics(self, documents, c_tf_idf, topics)
   3921 elif isinstance(self.representation_model, BaseRepresentation):
-> 3922     topics = self.representation_model.extract_topics(self, documents, c_tf_idf, topics)
   3923 elif isinstance(self.representation_model, dict):
   3924     if self.representation_model.get("Main"):

File ~/Code/NDR/.venv/lib/python3.11/site-packages/bertopic/representation/_textgeneration.py:147, in TextGeneration.extract_topics(self, topic_model, documents, c_tf_idf, topics)
    143 updated_topics = {}
    144 for topic, docs in tqdm(repr_docs_mappings.items(), disable=not topic_model.verbose):
    145 
    146     # Prepare prompt
--> 147     truncated_docs = [truncate_document(topic_model, self.doc_length, self.tokenizer, doc) for doc in docs]
    148     prompt = self._create_prompt(truncated_docs, topic, topics)
    149     self.prompts_.append(prompt)

TypeError: 'NoneType' object is not iterable

I'm running it on an M1 Mac, if that helps. I also tried copying all the code from the best practices guide and got the same error. Any help is appreciated.

Best regards!
Alex Mühlhausen

@MaartenGr
Owner

In all honesty, I'm not sure what is happening here. I believe there is another open issue with the same problem, but it might just be related to the underlying T5 model. Also, have you tried passing the documents as a list of strings instead of a pandas Series?
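
A minimal sketch of that suggestion, assuming the texts live in a column such as `data['text']` from the snippet above. The helper name `to_document_list` is hypothetical and not part of BERTopic; it only turns any iterable (a pandas Series included) into a plain list of strings and skips `None` and float NaN entries.

```python
import math
from typing import Any, Iterable, List


def to_document_list(values: Iterable[Any]) -> List[str]:
    """Return a plain list of strings, skipping None and float NaN entries."""
    documents = []
    for value in values:
        if value is None:
            continue
        if isinstance(value, float) and math.isnan(value):
            continue
        documents.append(str(value))
    return documents


# Hypothetical usage with the DataFrame from the original snippet:
# docs = to_document_list(data["text"])
# topics, _ = topic_model.fit_transform(docs)
```

Dropping missing rows changes the number of documents, so the returned topics would line up with `docs` rather than with the original DataFrame.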

@leoschet

leoschet commented Jan 16, 2024

I'm facing the same issue, but only with the TextGeneration representation model. Other representation models work without a problem. I did try passing the documents as a list of strings, but the error persists.

I have the same code running successfully on v0.15.0
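
Since the behaviour differs between releases, it can help to report the exact installed versions. A small sketch using only the standard library; the helper name `installed_version` is hypothetical, and the package names are the ones assumed to be relevant here.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


print("bertopic:", installed_version("bertopic"))
print("transformers:", installed_version("transformers"))
```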

Edit: I did some digging and found that the problem is in this line. It seems that whenever the default prompt is used, the top representative documents are None.

A simple fix would be to have the else branch on line 141 assign an empty list as the default value. I opened a PR with this change.
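
A minimal sketch of the failure and the proposed guard, using stand-in functions rather than BERTopic's actual code. It assumes, as described above, that the mapping from topics to representative documents falls back to None when the default prompt is used. The names `build_mapping` and `truncate_all` are hypothetical.

```python
from typing import Dict, List, Optional


def build_mapping(topics: List[int], default: Optional[list]) -> Dict[int, Optional[list]]:
    """Stand-in for the fallback branch: every topic gets the same default value."""
    return {topic: (None if default is None else list(default)) for topic in topics}


def truncate_all(mapping: Dict[int, Optional[list]], max_chars: int = 20) -> Dict[int, List[str]]:
    """Stand-in for the loop in the traceback: iterates over each topic's documents."""
    return {topic: [doc[:max_chars] for doc in docs] for topic, docs in mapping.items()}


topics = [-1, 0, 1]

# Fallback of None: iterating over the documents raises a TypeError.
try:
    truncate_all(build_mapping(topics, default=None))
except TypeError as error:
    print(f"TypeError: {error}")

# Fallback of an empty list: the loop runs and yields no documents per topic.
print(truncate_all(build_mapping(topics, default=[])))
```

With an empty list as the default, the comprehension over the documents simply produces nothing for each topic, so the prompt can be built from the keywords alone.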

@MaartenGr
Owner

Thanks for the PR. I just merged #1726 which should fix the issue. Could one of you test it out so I know it also works for others?

@leoschet

Thanks for the update! I tested it, and it runs without any errors on my end.
