OpenAI incompatibility issues with BERTopic #1629
Comments
Thanks for sharing! It seems that openai 1.0.0 introduced breaking changes to the API which need to be handled in BERTopic. I'll make sure to fix it in this PR, since there were more OpenAI updates there.
@jamesleverage If I'm not mistaken, you can use openai 0.28 instead of 0.38 and I believe it should work. However, I just pushed a fix to the PR mentioned above that should make it work with openai >= 1.0. In the upcoming release of BERTopic, openai < 1.0 will not be supported anymore.
I used openai==0.28 and am getting this error at this line after running for 10+ minutes:

representation_model = OpenAI(model="gpt-3.5-turbo", chat=True)
model = BERTopic(representation_model=representation_model)

Error message:

===========================================================================
The above exception was the direct cause of the following exception:
ReadTimeoutError                          Traceback (most recent call last)
During handling of the above exception, another exception occurred:
===========================================================================

Is there something I can add to my code to make this work? For simple prompting to OpenAI, this definition worked with the latest openai:

messages = [{"role": "user", "content": prompt}]
client = OpenAI(api_key=OPENAI_API_KEY)
response = client.chat.completions.create(
response_message = response.choices[0].message.content
return response_message
If 0.28 is currently not working, then I would wait for the PR fix. You can already install it if you want, like this:

pip install git+https://github.com/MaartenGr/BERTopic.git@refs/pull/1572/head
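For reference, once a BERTopic version that supports openai >= 1.0 is installed, the representation model is built around an explicit client object. This is a sketch that mirrors the usage shown in later comments of this thread; the API key is a placeholder:

import openai
from bertopic import BERTopic
from bertopic.representation import OpenAI

# With openai >= 1.0, a client object is created explicitly and passed to
# BERTopic's OpenAI representation instead of relying on module-level state.
client = openai.OpenAI(api_key="sk-...")  # placeholder key
representation_model = OpenAI(client, model="gpt-3.5-turbo", chat=True)
topic_model = BERTopic(representation_model=representation_model)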
Adding to the discussion: while we wait for the PR fix, I'm trying to do the labeling after training BERTopic models. Would it be possible to adjust the number of representative documents when we use […]?
@linxule I believe you can use it as follows:

import pandas as pd
documents = pd.DataFrame(
{
"Document": docs,
"ID": range(len(docs)),
"Topic": None,
"Image": None
}
)
repr_docs, _, _, _ = topic_model._extract_representative_docs(
topic_model.c_tf_idf_,
documents,
topic_model.topics_,
nr_repr_docs=10
)

where `docs` are the documents used to train the topic model.
Hi @MaartenGr, I tried the solution you suggested but encountered some issues. I ran:

import pandas as pd
# Assuming df_dict and models are defined in an accessible scope
# df_dict: Dictionary of DataFrames
# models: Dictionary of BERTopic models
def extract_representative_documents(df_name, nr_repr_docs=5):
"""
Extracts representative documents for each topic from a DataFrame specified by df_name.
Parameters:
- df_name: The name of the DataFrame within df_dict.
- nr_repr_docs: Number of representative documents to extract for each topic (default is 5).
Returns:
- A DataFrame with the representative documents and their associated topics.
"""
if df_name not in df_dict:
raise ValueError(f"DataFrame with name '{df_name}' not found in df_dict")
if df_name not in models:
raise ValueError(f"BERTopic model with name '{df_name}' not found in models")
# Access the documents from the specified DataFrame
docs = df_dict[df_name]['Post_Content']
# Create a DataFrame for the documents
documents = pd.DataFrame(
{
"Document": docs,
"ID": range(len(docs)),
"Topic": None,
"Image": None
}
)
# Extract representative documents using the BERTopic model
repr_docs, _, _, _ = models[df_name]._extract_representative_docs(
models[df_name].c_tf_idf_,
documents,
models[df_name].topics_,
nr_repr_docs=nr_repr_docs
)
return repr_docs
# Example usage
representative_docs = extract_representative_documents(df_name)

I got:

---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
[<ipython-input-40-80fe2672c770>](https://localhost:8080/#) in <cell line: 1>()
----> 1 representative_docs = extract_representative_documents(df_name)
2 representative_docs
[<ipython-input-37-023df3f7fa07>](https://localhost:8080/#) in extract_representative_documents(df_name, nr_repr_docs)
35
36 # Extract representative documents using the BERTopic model
---> 37 repr_docs, _, _, _ = models[df_name]._extract_representative_docs(
38 models[df_name].c_tf_idf_,
39 documents,
[/usr/local/lib/python3.10/dist-packages/bertopic/_bertopic.py](https://localhost:8080/#) in _extract_representative_docs(self, c_tf_idf, documents, topics, nr_samples, nr_repr_docs, diversity)
3689 # Sample documents per topic
3690 documents_per_topic = (
-> 3691 documents.drop("Image", axis=1, errors="ignore")
3692 .groupby('Topic')
3693 .sample(n=nr_samples, replace=True, random_state=42)
[/usr/local/lib/python3.10/dist-packages/pandas/core/groupby/groupby.py](https://localhost:8080/#) in sample(self, n, frac, replace, weights, random_state)
4334 sampled_indices.append(grp_indices[grp_sample])
4335
-> 4336 sampled_indices = np.concatenate(sampled_indices)
4337 return self._selected_obj.take(sampled_indices, axis=self.axis)
4338
/usr/local/lib/python3.10/dist-packages/numpy/core/overrides.py in concatenate(*args, **kwargs)
ValueError: need at least one array to concatenate

Besides the proposed solution, is there any way to use […]?
As mentioned above, […]. Instead, you can fix the error you ran into with the following code. I just tested it and it should work to extract, for example, the top 10 representative documents per topic:

import pandas as pd
documents = pd.DataFrame(
{
"Document": docs,
"ID": range(len(docs)),
"Topic": topic_model.topics_,
"Image": None
}
)
repr_docs, _, _, _ = topic_model._extract_representative_docs(
topic_model.c_tf_idf_,
documents,
topic_model.topic_labels_,
nr_repr_docs=10
)

Note that what is happening in the above code is that the documents are passed to the function, which is not the case with […].
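For completeness, the representative documents stored during fitting can also be retrieved through the public helper; a small sketch, assuming a fitted topic_model:

# Representative documents saved during fit; the number per topic is fixed at
# fit time, which is why the private method above is used to extract more.
all_repr_docs = topic_model.get_representative_docs()        # dict: topic id -> documents
topic_0_docs = topic_model.get_representative_docs(topic=0)  # documents for topic 0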
@MaartenGr Thank you for your quick response. I ran:

# Access the model and documents
docs = df_dict[df_name]['Post_Content']
topic_model = models[df_name]
# Create a DataFrame for the documents
documents = pd.DataFrame(
{
"Document": docs,
"ID": range(len(docs)),
"Topic": topic_model.topics_,
"Image": None
}
)
# Extract representative documents using the BERTopic model
repr_docs, _, _, _ = topic_model._extract_representative_docs(
topic_model.c_tf_idf_,
documents,
topic_model.topics_,
nr_repr_docs=10
)

And got:

---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-98-cefb73aba3fa> in <cell line: 16>()
14
15 # Extract representative documents using the BERTopic model
---> 16 repr_docs, _, _, _ = topic_model._extract_representative_docs(
17 topic_model.c_tf_idf_,
18 documents,
/usr/local/lib/python3.10/dist-packages/bertopic/_bertopic.py in _extract_representative_docs(self, c_tf_idf, documents, topics, nr_samples, nr_repr_docs, diversity)
3700 repr_docs_mappings = {}
3701 repr_docs_ids = []
-> 3702 labels = sorted(list(topics.keys()))
3703 for index, topic in enumerate(labels):
3704
AttributeError: 'list' object has no attribute 'keys'

So I adjusted it to:

# Access the model and documents
docs = df_dict[df_name]['Post_Content']
topic_model = models[df_name]
# Retrieve the topics as a dictionary (replace get_topics() with the correct method)
topics_dict = topic_model.get_topics() # This should be a dictionary
# Create a DataFrame for the documents
documents = pd.DataFrame(
{
"Document": docs,
"ID": range(len(docs)),
"Topic": topic_model.topics_,
"Image": None
}
)
# Extract representative documents using the BERTopic model
repr_docs, _, _, _ = topic_model._extract_representative_docs(
topic_model.c_tf_idf_,
documents,
topics_dict, # Use the topics dictionary
nr_repr_docs=10
)

This worked. Do you have any comments on this approach? Am I missing anything? Thank you again for your help!
The reason for your error is that you did not copy my example as shown. In my example, `topic_model.topic_labels_` is passed rather than `topic_model.topics_`.

Don't do this:

# Extract representative documents using the BERTopic model
repr_docs, _, _, _ = topic_model._extract_representative_docs(
topic_model.c_tf_idf_,
documents,
topic_model.topics_,
nr_repr_docs=10
)

Do this:

repr_docs, _, _, _ = topic_model._extract_representative_docs(
topic_model.c_tf_idf_,
documents,
topic_model.topic_labels_,
nr_repr_docs=10
)
Thank you so much for spotting the error! It works now!
There is a warning message:

Collecting git+https://github.com/MaartenGr/BERTopic.git@refs/pull/1572/head
@jamesleverage You can try this:

!pip install git+https://github.com/MaartenGr/BERTopic.git@6fd3e14fa0867d5d68c580b75a0b40151626e80b

This will install the commit from Nov 17, 2023 (6fd3e14).
Hello! Working with this Colab of BERTopic - Best Practices.ipynb, I am getting this error even with openai==0.28 (it was working before). Sorry, I saw the comments but I am confused.
@giannisni Thanks for sharing! I just updated the notebook; can you check whether it works?
Hi @MaartenGr, it works now, thanks. Though in my copy of the notebook, using the same documents (mine) as I did before, I am now getting this error, which was not happening previously. Also, can you please explain what exactly is passed into [DOCUMENTS] in the prompt?

prompt = """ Based on the information above, extract a short but highly descriptive topic label of at most 5 words. Make sure it is in the following format:

The error:

BadRequestError: Error code: 400 - {'error': {'message': "This model's maximum context length is 16385 tokens. However, your messages resulted in 40728 tokens. Please reduce the length of the messages.", 'type': 'invalid_request_error', 'param': 'messages', 'code': 'context_length_exceeded'}}
@giannisni Most likely, your documents are simply too big. I would advise applying document truncation using this guide.
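A minimal sketch of what such truncation can look like, assuming a BERTopic version whose OpenAI representation exposes the doc_length and tokenizer parameters; the key and values are illustrative:

import openai
from bertopic.representation import OpenAI

client = openai.OpenAI(api_key="sk-...")  # placeholder key
# Truncate every document to at most 100 whitespace-separated tokens before it
# is inserted into the [DOCUMENTS] part of the prompt, keeping the request
# under the model's context limit.
representation_model = OpenAI(
    client,
    model="gpt-3.5-turbo",
    chat=True,
    doc_length=100,
    tokenizer="whitespace",
)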
I'm experiencing a similar incompatibility issue. I am running bertopic version 0.16.0 and it was running fine until I updated openai. Now I'm getting import errors no matter which version of openai I try. I've tried 0.28, 1.10, and the latest version. Whenever I import bertopic I'm getting the following error:

File ~/.conda/envs/py310/lib/python3.9/site-packages/bertopic/__init__.py:1, in
File ~/.conda/envs/py310/lib/python3.9/site-packages/bertopic/_bertopic.py:49, in
File ~/.conda/envs/py310/lib/python3.9/site-packages/bertopic/representation/__init__.py:30, in
File ~/.conda/envs/py310/lib/python3.9/site-packages/bertopic/representation/_langchain.py:4, in
File ~/.conda/envs/py310/lib/python3.9/site-packages/langchain/__init__.py:6, in
File ~/.conda/envs/py310/lib/python3.9/site-packages/langchain/agents/__init__.py:2, in
File ~/.conda/envs/py310/lib/python3.9/site-packages/langchain/agents/agent.py:16, in
File ~/.conda/envs/py310/lib/python3.9/site-packages/langchain/agents/tools.py:8, in
File ~/.conda/envs/py310/lib/python3.9/site-packages/langchain/tools/__init__.py:42, in
File ~/.conda/envs/py310/lib/python3.9/site-packages/langchain/tools/vectorstore/tool.py:13, in
File ~/.conda/envs/py310/lib/python3.9/site-packages/langchain/chains/__init__.py:2, in
File ~/.conda/envs/py310/lib/python3.9/site-packages/langchain/chains/api/base.py:13, in
File ~/.conda/envs/py310/lib/python3.9/site-packages/langchain/chains/api/prompt.py:2, in
File ~/.conda/envs/py310/lib/python3.9/site-packages/langchain/prompts/__init__.py:3, in
File ~/.conda/envs/py310/lib/python3.9/site-packages/langchain/prompts/chat.py:10, in
File ~/.conda/envs/py310/lib/python3.9/site-packages/langchain/memory/__init__.py:28, in
File ~/.conda/envs/py310/lib/python3.9/site-packages/langchain/memory/vectorstore.py:10, in
File ~/.conda/envs/py310/lib/python3.9/site-packages/langchain/vectorstores/__init__.py:2, in
File ~/.conda/envs/py310/lib/python3.9/site-packages/langchain/vectorstores/analyticdb.py:15, in
File ~/.conda/envs/py310/lib/python3.9/site-packages/langchain/embeddings/__init__.py:19, in
File ~/.conda/envs/py310/lib/python3.9/site-packages/langchain/embeddings/openai.py:66, in
File ~/.conda/envs/py310/lib/python3.9/site-packages/pydantic/main.py:197, in pydantic.main.ModelMetaclass.__new__()
File ~/.conda/envs/py310/lib/python3.9/site-packages/pydantic/fields.py:506, in pydantic.fields.ModelField.infer()
File ~/.conda/envs/py310/lib/python3.9/site-packages/pydantic/fields.py:436, in pydantic.fields.ModelField.__init__()
File ~/.conda/envs/py310/lib/python3.9/site-packages/pydantic/fields.py:552, in pydantic.fields.ModelField.prepare()
File ~/.conda/envs/py310/lib/python3.9/site-packages/pydantic/fields.py:663, in pydantic.fields.ModelField._type_analysis()
File ~/.conda/envs/py310/lib/python3.9/site-packages/pydantic/fields.py:808, in pydantic.fields.ModelField._create_sub_type()
File ~/.conda/envs/py310/lib/python3.9/site-packages/pydantic/fields.py:436, in pydantic.fields.ModelField.__init__()
File ~/.conda/envs/py310/lib/python3.9/site-packages/pydantic/fields.py:552, in pydantic.fields.ModelField.prepare()
File ~/.conda/envs/py310/lib/python3.9/site-packages/pydantic/fields.py:668, in pydantic.fields.ModelField._type_analysis()
File ~/.conda/envs/py310/lib/python3.9/typing.py:852, in _SpecialGenericAlias.__subclasscheck__(self, cls)
TypeError: issubclass() arg 1 must be a class

I've also created a new conda environment and retried it, and it still does not work. It only runs when using version 14 or smaller. Do you know what might be going on?
Based on this line in your error message, you are not using v0.16.0 but v0.15.0:
Please make sure that you are using the newest version of BERTopic.
A related question: can I somehow set the temperature argument when using OpenAI() to refine topic representations? Now I am using:

[…]

and the topics I got changed slightly every time I ran the code.
You can use the […].

I would advise checking out the FAQ. It might be that you need to install UMAP from the main branch (I believe a PR updated some things), but I'm not sure; you will have to test.
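A sketch of setting the temperature, assuming a BERTopic version whose OpenAI representation accepts a generator_kwargs dictionary that is forwarded to the underlying completion call; the key is a placeholder:

import openai
from bertopic.representation import OpenAI

client = openai.OpenAI(api_key="sk-...")  # placeholder key
# generator_kwargs is passed through to the chat completion call, so
# temperature=0 makes the generated topic labels (more) deterministic.
representation_model = OpenAI(
    client,
    model="gpt-3.5-turbo",
    chat=True,
    generator_kwargs={"temperature": 0},
)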
Thanks! I will try out the […].
@MaartenGr I am running into a similar issue, displayed below:
I encounter the error while trying to generate the embeddings. Below is the code snippet I am using.
Could you please guide me to a solution, or maybe help me out with how to run it in batches rather than on the complete corpus? Thanks.
@SamArora18 […]
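As a general sketch of one common way to handle this (not necessarily the fix suggested in the reply above), embeddings can be precomputed in batches with sentence-transformers and handed to fit_transform, so the whole corpus is never encoded in one call. The dataset, model name, and batch size below are illustrative:

from bertopic import BERTopic
from sentence_transformers import SentenceTransformer
from sklearn.datasets import fetch_20newsgroups

# Example corpus as a stand-in for the user's data.
docs = fetch_20newsgroups(subset="train", remove=("headers", "footers", "quotes"))["data"]

# Encode in small batches so memory stays bounded, then pass the precomputed
# embeddings to fit_transform so BERTopic does not re-embed the corpus itself.
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = embedding_model.encode(docs, batch_size=32, show_progress_bar=True)

topic_model = BERTopic(embedding_model=embedding_model)
topics, probs = topic_model.fit_transform(docs, embeddings)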
Hi @MaartenGr, I'm experiencing a similar issue, but it's more "hidden"/"silent" and it took me some time to isolate the cause. When I use representation models without OpenAI, everything works smoothly. However, when OpenAI is included, the process gets stuck indefinitely at the step where representation models are applied to extract topics from clusters, without any error messages or progress beyond that point.

Note: I also checked OpenAI API usage and confirmed that requests are being made via BERTopic (I created a new API key to test this).

My BERTopic version: v0.16.4

I've attached the code and some output screenshots for your reference below.

Code:

# Imports needed by the snippet below
from typing import Dict
import openai
from bertopic import BERTopic
from bertopic.representation import OpenAI, KeyBERTInspired, MaximalMarginalRelevance, PartOfSpeech
from sklearn.feature_extraction.text import CountVectorizer

def create_representation_model() -> Dict:
"""Create and return a dictionary of representation models."""
client = openai.OpenAI(api_key=OPENAI_API_KEY)
prompt = """
I have a topic that contains the following documents:
[DOCUMENTS]
The topic is described by the following keywords: [KEYWORDS]
Based on the information above, extract a short but highly descriptive topic label of at most 5 words. Make sure it is in the following format:
topic: <topic label>
"""
openai_model = OpenAI(client, model="gpt-3.5-turbo", exponential_backoff=True, chat=True, prompt=prompt)
return {
"KeyBERT": KeyBERTInspired(),
# "OpenAI": openai_model,
"MMR": MaximalMarginalRelevance(diversity=0.4),
"POS": PartOfSpeech("en_core_web_sm")
}
def create_bertopic_model(embedding_model, umap_model, hdbscan_model, representation_model) -> BERTopic:
"""Create and return a BERTopic model."""
return BERTopic(
embedding_model=embedding_model,
umap_model=umap_model,
hdbscan_model=hdbscan_model,
vectorizer_model=CountVectorizer(stop_words="english", ngram_range=(1, 3)),
representation_model=representation_model,
top_n_words=15,
calculate_probabilities=False,
verbose=True
)

Output screenshot:

Thank you.
@ashoknimiwal Hmmm, I'm not sure what exactly is happening there. Based on your tqdm bar, it seems that the representation is not progressing at all. You mentioned that you do see something happening at the OpenAI API, right? Does that mean that the tqdm bar is not progressing and is stuck at 0? Also, which version of openai are you using? The last thing I can think of is not using […].
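A minimal sketch of one such debugging setup, assuming the truncated suggestion refers to the exponential_backoff flag (placeholder key below):

import openai
from bertopic.representation import OpenAI

client = openai.OpenAI(api_key="sk-...")  # placeholder key
# With exponential_backoff left at its default (False), a RateLimitError from
# an exhausted quota is raised immediately instead of being retried silently,
# which makes a stuck tqdm bar much easier to diagnose.
representation_model = OpenAI(client, model="gpt-3.5-turbo", chat=True)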
Thank you for your input! It helped me figure out the actual problem. The issue lies with my OpenAI API limits: I have exceeded the approved quota for requests. And having […].

One question: is it possible to use local LLMs or other LLM APIs as a proxy for the OpenAI API here?
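On that last question, one common approach (a sketch, not an official recommendation from this thread): BERTopic's OpenAI representation only needs a client object, so any server exposing an OpenAI-compatible API (a local llama.cpp, vLLM, or Ollama endpoint, for example) can be used by pointing the client at it. The URL and model name below are placeholders:

import openai
from bertopic.representation import OpenAI

# Assumption: an OpenAI-compatible server is listening locally and serves a
# model registered under the name "local-model".
client = openai.OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
representation_model = OpenAI(client, model="local-model", chat=True)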
prompt = """
I have a topic that contains the following documents:
[DOCUMENTS]
The topic is described by the following keywords: [KEYWORDS]
Based on the information above, extract a short but highly descriptive topic label of at most 5 words. Make sure it is in the following format:
topic: <topic label>
"""
openai_model = OpenAI(model="gpt-3.5-turbo", exponential_backoff=True, chat=True, prompt=prompt)
# All representation models
representation_model = {
"OpenAI": openai_model, # Uncomment if you will use OpenAI
}
topics, probs = model.fit_transform(smaller_docs_list)
Getting the following error message:
You tried to access openai.ChatCompletion,
but this is no longer supported in openai>=1.0.0 - see the
When I downgrade openai to 0.38, this error went away.
However, the execution timed out after 600 seconds.
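For context, the error above comes from the openai >= 1.0 rewrite, which replaced the module-level openai.ChatCompletion.create call with a method on a client object. A minimal before/after sketch with illustrative parameters and a placeholder key:

import openai

# openai < 1.0 (the call style BERTopic used before the fix referenced above):
# response = openai.ChatCompletion.create(
#     model="gpt-3.5-turbo",
#     messages=[{"role": "user", "content": "Hello"}],
# )

# openai >= 1.0:
client = openai.OpenAI(api_key="sk-...")  # placeholder key
response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Hello"}],
)
print(response.choices[0].message.content)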