Cannot use [DOCUMENTS] in prompt for representation_model #1004

Closed
ohmeow opened this issue Feb 15, 2023 · 5 comments

Comments

ohmeow commented Feb 15, 2023

Here is the code:

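from bertopic import BERTopic
from bertopic.representation import TextGeneration
from hdbscan import HDBSCAN
from sentence_transformers import SentenceTransformer
from sklearn.feature_extraction.text import CountVectorizer
from transformers import pipeline
from umap import UMAP

# embeddings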
sentence_model = SentenceTransformer("all-MiniLM-L6-v2")
# dimensionality reduction
umap_model = UMAP(n_neighbors=5, n_components=2, min_dist=0.0, metric="cosine")
# clustering
hdbscan_model = HDBSCAN(min_cluster_size=3, metric="euclidean", cluster_selection_method="eom", prediction_data=True)
# vectorizer
vectorizer_model = CountVectorizer()
# representation
prompt = """
I have topic that contains the following documents: [DOCUMENTS]. 
The topic is described by the following keywords: [KEYWORDS].
Based on the above information, can you give a short label of the topic?
"""
generator = pipeline('text2text-generation', model='google/flan-t5-base')
representation_model = TextGeneration(generator, prompt=prompt)

# build topic model and get predictions
topic_model = BERTopic(
    embedding_model=sentence_model,
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    vectorizer_model=vectorizer_model,
    representation_model=representation_model,
    min_topic_size=3,
)

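# `df` is not defined in the snippet; presumably a pandas DataFrame whose "_seq" column holds the raw documents
docs = df["_seq"].values.tolist()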
topics, probs = topic_model.fit_transform(documents=docs)

The error:

File ~/mambaforge/envs/myenv/lib/python3.10/site-packages/bertopic/_bertopic.py:2950, in BERTopic._extract_topics(self, documents)
   2948 documents_per_topic = documents.groupby(['Topic'], as_index=False).agg({'Document': ' '.join})
   2949 self.c_tf_idf_, words = self._c_tf_idf(documents_per_topic)
-> 2950 self.topic_representations_ = self._extract_words_per_topic(words, documents)
   2951 self._create_topic_vectors()
   2952 self.topic_labels_ = {key: f"{key}_" + "_".join([word[0] for word in values[:4]])
...
--> 135     for doc in docs:
    136         to_replace += f"- {doc[:255]}\n"
    137     prompt += self.prompt.replace("[DOCUMENTS]", to_replace)

TypeError: 'NoneType' object is not iterable

Thanks much - wg

@MaartenGr
Owner

Hmmm, strange. Do you perhaps have a reproducible example of this? To me, it is not immediately clear why it is doing this but I'll make sure to test some things out!

@MaartenGr
Owner

There was a small typo in the code of TextGeneration that caused it to pass None instead of the documents. I believe this is fixed in the latest commit on the main branch, so installing BERTopic from there should resolve your issue. Major releases often ship with bugs that were overlooked, so I typically wait a couple of weeks before releasing a quick fix in order to gather any other issues that come up.

For now, you can install BERTopic either from its latest commit:

pip install git+https://github.com/MaartenGr/BERTopic.git@1ee8141d65063a37f6ee3fd56b30e3f9e2f43d6e

or you can adjust the code yourself as was done here.
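
For readers who want to see the failure mode in isolation, here is a minimal sketch of the prompt-filling step, modeled on the three lines visible in the traceback. It is not BERTopic's actual implementation: `build_prompt`, `PROMPT`, and the `ValueError` guard are hypothetical names and choices made for illustration only. It shows why a `None` value for the documents breaks a prompt containing [DOCUMENTS], while a prompt that only uses [KEYWORDS] is unaffected.

```python
from typing import Optional, Sequence

# Hypothetical template using the same placeholders as the prompt in this issue.
PROMPT = (
    "I have a topic that contains the following documents: [DOCUMENTS]\n"
    "The topic is described by the following keywords: [KEYWORDS]"
)


def build_prompt(
    template: str,
    keywords: Sequence[str],
    docs: Optional[Sequence[str]],
) -> str:
    """Fill [KEYWORDS] and [DOCUMENTS]. Illustrative only, not BERTopic's code."""
    prompt = template.replace("[KEYWORDS]", ", ".join(keywords))
    if "[DOCUMENTS]" in prompt:
        if docs is None:
            # Fail early with a clear message instead of
            # "TypeError: 'NoneType' object is not iterable".
            raise ValueError("The prompt uses [DOCUMENTS] but no documents were passed in")
        # Same shape as the loop in the traceback: one bullet per document,
        # each truncated to 255 characters.
        to_replace = "\n" + "".join(f"- {doc[:255]}\n" for doc in docs)
        prompt = prompt.replace("[DOCUMENTS]", to_replace)
    return prompt


filled = build_prompt(PROMPT, ["cat", "dog"], ["Cats purr.", "Dogs bark."])
print(filled)
```

Calling `build_prompt(PROMPT, ["cat"], None)` raises the explicit `ValueError`, which is roughly the situation the typo created, except that the library code iterated over the `None` directly.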

@ohmeow
Author

ohmeow commented Feb 16, 2023 via email

@MaartenGr
Owner

That is correct. In the link I posted above, you will find the relevant PR. You can change it yourself or simply install from the most recent commit.

@ohmeow
Author

ohmeow commented Feb 16, 2023 via email
