OpenAI incompatibility issues with BERTopic #1629

Open
jamesleverage opened this issue Nov 14, 2023 · 27 comments
jamesleverage commented Nov 14, 2023

prompt = """
I have a topic that contains the following documents:
[DOCUMENTS]
The topic is described by the following keywords: [KEYWORDS]

Based on the information above, extract a short but highly descriptive topic label of at most 5 words. Make sure it is in the following format:
topic:
"""
openai_model = OpenAI(model="gpt-3.5-turbo", exponential_backoff=True, chat=True, prompt=prompt)

# All representation models
representation_model = {
    "OpenAI": openai_model,  # Uncomment if you will use OpenAI
}

topics, probs = model.fit_transform(smaller_docs_list)

I'm getting the following error message:

You tried to access openai.ChatCompletion,
but this is no longer supported in openai>=1.0.0 - see the

When I downgraded openai to 0.38, this error went away.
However, the execution then timed out after 600 seconds.

@MaartenGr
Owner

Thanks for sharing! It seems that openai 1.0.0 introduced breaking changes to the API that need to be addressed in BERTopic. I'll make sure to fix it in this PR since there were more OpenAI updates there.

MaartenGr added a commit that referenced this issue Nov 15, 2023
@MaartenGr
Owner

@jamesleverage If I'm not mistaken, you can use openai 0.28 instead of 0.38 and I believe it should be working. However, I just pushed a fix to the PR mentioned above that should make it work with openai >= 1.0. In the upcoming release of BERTopic, openai < 1.0 will not be supported anymore.
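For reference, once that fix is released, usage should look roughly like this (a sketch based on the new client-based API; the client is now passed in explicitly):

import openai
from bertopic.representation import OpenAI

client = openai.OpenAI(api_key="sk-...")
representation_model = OpenAI(client, model="gpt-3.5-turbo", chat=True, prompt=prompt)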

@jamesleverage
Author

jamesleverage commented Nov 16, 2023

I used openai==0.28 and I'm getting this error at this line after running for 10+ minutes:

representation_model = OpenAI(model="gpt-3.5-turbo", chat=True)

model = BERTopic(representation_model=representation_model)
topics, probs = model.fit_transform(smaller_docs_list)  # <==== This is where the execution hangs

Error message:

===========================================================================
TimeoutError: The read operation timed out

The above exception was the direct cause of the following exception:

ReadTimeoutError Traceback (most recent call last)
ReadTimeoutError: HTTPSConnectionPool(host='api.openai.com', port=443): Read timed out. (read timeout=600)

During handling of the above exception, another exception occurred:

===========================================================================
The old release is causing a problem.

Is there something I can add to my code to make this work?
Or should I just wait for the latest PR fix?

For simple prompting to OpenAI, this definition worked with the latest openai:

from openai import OpenAI

def get_completion(prompt, model="gpt-3.5-turbo", temperature=0):
    messages = [{"role": "user", "content": prompt}]
    client = OpenAI(api_key=OPENAI_API_KEY)
    response = client.chat.completions.create(
        model=model,
        messages=messages,
        temperature=temperature  # pass the temperature argument through
    )
    return response.choices[0].message.content

@MaartenGr
Owner

Is there something I can add to my code to make this work?
Or should I just wait for the latest PR fix?

If 0.28 is currently not working, then I would wait for the PR fix. You can already install it if you want like this:

pip install git+https://github.com/MaartenGr/BERTopic.git@refs/pull/1572/head

@linxule

linxule commented Nov 16, 2023

Adding to the discussion: while we wait for the PR fix, I'm trying to do the labeling after training BERTopic models. Would it be possible to adjust the number of representative documents when using get_representative_docs?

@MaartenGr
Owner

@linxule .get_representative_docs is a function that performs no calculations with respect to extracting the most representative documents. For that, you would have to use the internal ._extract_representative_docs, which is used to calculate which documents are most representative of a given topic. Do note that since this is a private function, breaking changes might appear in future releases and no additional official support can be given.

I believe you can use it as follows:

import pandas as pd
documents = pd.DataFrame(
  {
    "Document": docs,
    "ID": range(len(docs)),
    "Topic": None,
    "Image": None
  }
)

repr_docs, _, _, _ = topic_model._extract_representative_docs(
  topic_model.c_tf_idf_, 
  documents, 
  topic_model.topics_, 
  nr_repr_docs=10
)

Where docs are your input documents. I have not tested this, so there might be a few mistakes, but the general principle should be solid.

@linxule

linxule commented Nov 21, 2023

Hi @MaartenGr ,

I tried the solution you suggested but encountered some issues.

I ran

import pandas as pd

# Assuming df_dict and models are defined in an accessible scope
# df_dict: Dictionary of DataFrames
# models: Dictionary of BERTopic models

def extract_representative_documents(df_name, nr_repr_docs=5):
    """
    Extracts representative documents for each topic from a DataFrame specified by df_name.

    Parameters:
    - df_name: The name of the DataFrame within df_dict.
    - nr_repr_docs: Number of representative documents to extract for each topic (default is 5).

    Returns:
    - A DataFrame with the representative documents and their associated topics.
    """
    if df_name not in df_dict:
        raise ValueError(f"DataFrame with name '{df_name}' not found in df_dict")
    if df_name not in models:
        raise ValueError(f"BERTopic model with name '{df_name}' not found in models")

    # Access the documents from the specified DataFrame
    docs = df_dict[df_name]['Post_Content']

    # Create a DataFrame for the documents
    documents = pd.DataFrame(
        {
            "Document": docs,
            "ID": range(len(docs)),
            "Topic": None,
            "Image": None
        }
    )

    # Extract representative documents using the BERTopic model
    repr_docs, _, _, _ = models[df_name]._extract_representative_docs(
        models[df_name].c_tf_idf_, 
        documents, 
        models[df_name].topics_, 
        nr_repr_docs=nr_repr_docs
    )

    return repr_docs

# Example usage
representative_docs = extract_representative_documents(df_name)

I got

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-40-80fe2672c770> in <cell line: 1>()
----> 1 representative_docs = extract_representative_documents(df_name)
      2 representative_docs

<ipython-input-37-023df3f7fa07> in extract_representative_documents(df_name, nr_repr_docs)
     35 
     36     # Extract representative documents using the BERTopic model
---> 37     repr_docs, _, _, _ = models[df_name]._extract_representative_docs(
     38         models[df_name].c_tf_idf_,
     39         documents,

/usr/local/lib/python3.10/dist-packages/bertopic/_bertopic.py in _extract_representative_docs(self, c_tf_idf, documents, topics, nr_samples, nr_repr_docs, diversity)
   3689         # Sample documents per topic
   3690         documents_per_topic = (
-> 3691             documents.drop("Image", axis=1, errors="ignore")
   3692                      .groupby('Topic')
   3693                      .sample(n=nr_samples, replace=True, random_state=42)

/usr/local/lib/python3.10/dist-packages/pandas/core/groupby/groupby.py in sample(self, n, frac, replace, weights, random_state)
   4334             sampled_indices.append(grp_indices[grp_sample])
   4335 
-> 4336         sampled_indices = np.concatenate(sampled_indices)
   4337         return self._selected_obj.take(sampled_indices, axis=self.axis)
   4338 

/usr/local/lib/python3.10/dist-packages/numpy/core/overrides.py in concatenate(*args, **kwargs)

ValueError: need at least one array to concatenate

Besides the proposed solution, is there any way to use topic_model.get_representative_docs() and specify the number of representative documents? This approach seems to default to 3 representative documents.

@MaartenGr
Owner

@linxule

Besides the proposed solution, is there any way to use topic_model.get_representative_docs() and specify the number of representative documents? This approach seems to default to 3 representative documents.

As mentioned above, topic_model.get_representative_docs() does not actually calculate which documents are most representative, as that is done during topic_model.fit(docs). Instead, topic_model.get_representative_docs() simply retrieves the previously calculated representative documents in a nice format. As a result, it is simply not possible to get more than 3 representative documents that way, since the trained documents are not saved within the topic model. The reason for this is that saving training data within a model is something we should generally avoid, especially if the data is large.

Instead, you can fix the error you ran into with the following code. I just tested it and it should work to extract, for example, the top 10 representative documents per topic:

import pandas as pd
documents = pd.DataFrame(
  {
    "Document": docs,
    "ID": range(len(docs)),
    "Topic": topic_model.topics_,
    "Image": None
  }
)

repr_docs, _, _, _ = topic_model._extract_representative_docs(
  topic_model.c_tf_idf_, 
  documents, 
  topic_model.topic_labels_, 
  nr_repr_docs=10
)

Note that in the above code the documents are passed to the function, which is not the case with topic_model.get_representative_docs.

@linxule

linxule commented Nov 21, 2023

@MaartenGr Thank you for your quick response.

I ran

# Access the model and documents
docs = df_dict[df_name]['Post_Content']
topic_model = models[df_name]

# Create a DataFrame for the documents
documents = pd.DataFrame(
  {
      "Document": docs,
      "ID": range(len(docs)),
      "Topic": topic_model.topics_,
      "Image": None
  }
)

# Extract representative documents using the BERTopic model
repr_docs, _, _, _ = topic_model._extract_representative_docs(
  topic_model.c_tf_idf_,
  documents,
  topic_model.topics_,
  nr_repr_docs=10
)

And got

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-98-cefb73aba3fa> in <cell line: 16>()
     14 
     15 # Extract representative documents using the BERTopic model
---> 16 repr_docs, _, _, _ = topic_model._extract_representative_docs(
     17   topic_model.c_tf_idf_,
     18   documents,

/usr/local/lib/python3.10/dist-packages/bertopic/_bertopic.py in _extract_representative_docs(self, c_tf_idf, documents, topics, nr_samples, nr_repr_docs, diversity)
   3700         repr_docs_mappings = {}
   3701         repr_docs_ids = []
-> 3702         labels = sorted(list(topics.keys()))
   3703         for index, topic in enumerate(labels):
   3704 

AttributeError: 'list' object has no attribute 'keys'

So I adjusted it to

# Access the model and documents
docs = df_dict[df_name]['Post_Content']
topic_model = models[df_name]

# Retrieve the topics as a dictionary
topics_dict = topic_model.get_topics()

# Create a DataFrame for the documents
documents = pd.DataFrame(
    {
        "Document": docs,
        "ID": range(len(docs)),
        "Topic": topic_model.topics_, 
        "Image": None
    }
)

# Extract representative documents using the BERTopic model
repr_docs, _, _, _ = topic_model._extract_representative_docs(
    topic_model.c_tf_idf_,
    documents,
    topics_dict,  # Use the topics dictionary
    nr_repr_docs=10
)

This worked. Do you have any comments on this approach? Am I missing anything?

Thank you again for your help!

@MaartenGr
Owner

The reason for your error is that you did not copy my example as shown. In ._extract_representative_docs you should use topic_model.topic_labels_ instead of topic_model.topics_.

Don't do this:

# Extract representative documents using the BERTopic model
repr_docs, _, _, _ = topic_model._extract_representative_docs(
  topic_model.c_tf_idf_,
  documents,
  topic_model.topics_,
  nr_repr_docs=10
)

Do this:

repr_docs, _, _, _ = topic_model._extract_representative_docs(
  topic_model.c_tf_idf_, 
  documents, 
  topic_model.topic_labels_, 
  nr_repr_docs=10
)

@linxule

linxule commented Nov 21, 2023

Thank you so much for spotting the error! It works now!

@jamesleverage
Author

Is there something I can add to my code to make this work?
Or should I just wait for the latest PR fix?

If 0.28 is currently not working, then I would wait for the PR fix. You can already install it if you want like this:

pip install git+https://github.com/MaartenGr/BERTopic.git@refs/pull/1572/head

There is a warning message.
!pip install git+https://github.com/MaartenGr/BERTopic.git@refs/pull/1572/head

Collecting git+https://github.com/MaartenGr/BERTopic.git@refs/pull/1572/head
Cloning https://github.com/MaartenGr/BERTopic.git (to revision refs/pull/1572/head) to /tmp/pip-req-build-9vcn0uye
Running command git clone --filter=blob:none --quiet https://github.com/MaartenGr/BERTopic.git /tmp/pip-req-build-9vcn0uye
WARNING: Did not find branch or tag 'refs/pull/1572/head', assuming revision or ref.

@linxule

linxule commented Nov 22, 2023

@jamesleverage You can try this

!pip install git+https://github.com/MaartenGr/BERTopic.git@6fd3e14fa0867d5d68c580b75a0b40151626e80b

This will install the commit from Nov 17, 2023 (6fd3e14).

@giannisni

giannisni commented Nov 30, 2023

Hello! I'm working with this Colab of BERTopic - Best Practices.ipynb:

https://colab.research.google.com/drive/1BoQ_vakEVtojsd2x_U6-_x52OOuqruj2?usp=sharing#scrollTo=Fo-Oig4Yib5K

I am getting this error even when working with openai==0.28 (it was working before). Sorry, I saw the comments above but I am confused.

<ipython-input-23-6a1f85215426> in <cell line: 27>()
     25 
     26 # Initialize the OpenAI model for BERTopic
---> 27 openai_model = OpenAI(model="gpt-3.5-turbo", prompt=prompt, chat=True, exponential_backoff=True)
     28 
     29 # All representation models

TypeError: OpenAI.__init__() missing 1 required positional argument: 'client'

@MaartenGr
Owner

@giannisni Thanks for sharing! I just updated the notebook, can you check whether it works?

@giannisni

Hi @MaartenGr, it works now, thanks. Though in my copy of the notebook, using the same documents as I did before, I am now getting the error below, which was not happening previously. Also, can you please explain what exactly is passed in place of [DOCUMENTS] in the prompt?

prompt = """
I have a topic that contains the following documents:
[DOCUMENTS]
The topic is described by the following keywords: [KEYWORDS]

Based on the information above, extract a short but highly descriptive topic label of at most 5 words. Make sure it is in the following format:
topic:
"""

The error:

BadRequestError: Error code: 400 - {'error': {'message': "This model's maximum context length is 16385 tokens. However, your messages resulted in 40728 tokens. Please reduce the length of the messages.", 'type': 'invalid_request_error', 'param': 'messages', 'code': 'context_length_exceeded'}}

@MaartenGr
Owner

@giannisni Most likely, your documents are simply too big. I would advise applying document truncation using this guide.
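For example, something along these lines (a sketch, not tested here; doc_length truncates each document to that many tokens, where the tokenizer defines what counts as a token):

representation_model = OpenAI(
    client,
    model="gpt-3.5-turbo",
    chat=True,
    prompt=prompt,
    nr_docs=4,               # use fewer representative documents per topic
    doc_length=100,          # truncate each document to 100 tokens
    tokenizer="whitespace"   # here, a token is a whitespace-split word
)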

@SebastianSpeer

SebastianSpeer commented Apr 2, 2024

@MaartenGr,

I'm experiencing a similar incompatibility issue. I am running bertopic version 0.16.0 and it was running fine until I updated openai. Now I'm getting import errors no matter which version of openai I try; I've tried 0.28, 1.10, and the latest version.

Whenever I import bertopic I'm getting the following error:

Input In [1], in <cell line: 1>()
----> 1 import bertopic

File ~/.conda/envs/py310/lib/python3.9/site-packages/bertopic/__init__.py:1, in
----> 1 from bertopic._bertopic import BERTopic
3 __version__ = "0.15.0"
5 __all__ = [
6 "BERTopic",
7 ]

File ~/.conda/envs/py310/lib/python3.9/site-packages/bertopic/_bertopic.py:49, in
47 from bertopic.cluster import BaseCluster
48 from bertopic.backend import BaseEmbedder
---> 49 from bertopic.representation._mmr import mmr
50 from bertopic.backend._utils import select_backend
51 from bertopic.vectorizers import ClassTfidfTransformer

File ~/.conda/envs/py310/lib/python3.9/site-packages/bertopic/representation/__init__.py:30, in
28 # OpenAI Generator
29 try:
---> 30 from bertopic.representation._langchain import LangChain
31 except ModuleNotFoundError:
32 msg = "pip install langchain \n\n"

File ~/.conda/envs/py310/lib/python3.9/site-packages/bertopic/representation/_langchain.py:4, in
2 from scipy.sparse import csr_matrix
3 from typing import Mapping, List, Tuple
----> 4 from langchain.docstore.document import Document
5 from bertopic.representation._base import BaseRepresentation
8 DEFAULT_PROMPT = "What are these documents about? Please give a single label."

File ~/.conda/envs/py310/lib/python3.9/site-packages/langchain/__init__.py:6, in
3 from importlib import metadata
4 from typing import Optional
----> 6 from langchain.agents import MRKLChain, ReActChain, SelfAskWithSearchChain
7 from langchain.cache import BaseCache
8 from langchain.chains import (
9 ConversationChain,
10 LLMBashChain,
(...)
18 VectorDBQAWithSourcesChain,
19 )

File ~/.conda/envs/py310/lib/python3.9/site-packages/langchain/agents/__init__.py:2, in
1 """Interface for agents."""
----> 2 from langchain.agents.agent import (
3 Agent,
4 AgentExecutor,
5 AgentOutputParser,
6 BaseMultiActionAgent,
7 BaseSingleActionAgent,
8 LLMSingleActionAgent,
9 )
10 from langchain.agents.agent_toolkits import (
11 create_csv_agent,
12 create_json_agent,
(...)
20 create_vectorstore_router_agent,
21 )
22 from langchain.agents.agent_types import AgentType

File ~/.conda/envs/py310/lib/python3.9/site-packages/langchain/agents/agent.py:16, in
13 from pydantic import BaseModel, root_validator
15 from langchain.agents.agent_types import AgentType
---> 16 from langchain.agents.tools import InvalidTool
17 from langchain.base_language import BaseLanguageModel
18 from langchain.callbacks.base import BaseCallbackManager

File ~/.conda/envs/py310/lib/python3.9/site-packages/langchain/agents/tools.py:8, in
2 from typing import Optional
4 from langchain.callbacks.manager import (
5 AsyncCallbackManagerForToolRun,
6 CallbackManagerForToolRun,
7 )
----> 8 from langchain.tools.base import BaseTool, Tool, tool
11 class InvalidTool(BaseTool):
12 """Tool that is run when invalid tool name is encountered by agent."""

File ~/.conda/envs/py310/lib/python3.9/site-packages/langchain/tools/__init__.py:42, in
40 from langchain.tools.shell.tool import ShellTool
41 from langchain.tools.steamship_image_generation import SteamshipImageGenerationTool
---> 42 from langchain.tools.vectorstore.tool import (
43 VectorStoreQATool,
44 VectorStoreQAWithSourcesTool,
45 )
46 from langchain.tools.wikipedia.tool import WikipediaQueryRun
47 from langchain.tools.wolfram_alpha.tool import WolframAlphaQueryRun

File ~/.conda/envs/py310/lib/python3.9/site-packages/langchain/tools/vectorstore/tool.py:13, in
8 from langchain.base_language import BaseLanguageModel
9 from langchain.callbacks.manager import (
10 AsyncCallbackManagerForToolRun,
11 CallbackManagerForToolRun,
12 )
---> 13 from langchain.chains import RetrievalQA, RetrievalQAWithSourcesChain
14 from langchain.llms.openai import OpenAI
15 from langchain.tools.base import BaseTool

File ~/.conda/envs/py310/lib/python3.9/site-packages/langchain/chains/__init__.py:2, in
1 """Chains are easily reusable components which can be linked together."""
----> 2 from langchain.chains.api.base import APIChain
3 from langchain.chains.api.openapi.chain import OpenAPIEndpointChain
4 from langchain.chains.combine_documents.base import AnalyzeDocumentChain

File ~/.conda/envs/py310/lib/python3.9/site-packages/langchain/chains/api/base.py:13, in
8 from langchain.base_language import BaseLanguageModel
9 from langchain.callbacks.manager import (
10 AsyncCallbackManagerForChainRun,
11 CallbackManagerForChainRun,
12 )
---> 13 from langchain.chains.api.prompt import API_RESPONSE_PROMPT, API_URL_PROMPT
14 from langchain.chains.base import Chain
15 from langchain.chains.llm import LLMChain

File ~/.conda/envs/py310/lib/python3.9/site-packages/langchain/chains/api/prompt.py:2, in
1 # flake8: noqa
----> 2 from langchain.prompts.prompt import PromptTemplate
4 API_URL_PROMPT_TEMPLATE = """You are given the below API Documentation:
5 {api_docs}
6 Using this documentation, generate the full API url to call for answering the user question.
(...)
9 Question:{question}
10 API url:"""
12 API_URL_PROMPT = PromptTemplate(
13 input_variables=[
14 "api_docs",
(...)
17 template=API_URL_PROMPT_TEMPLATE,
18 )

File ~/.conda/envs/py310/lib/python3.9/site-packages/langchain/prompts/__init__.py:3, in
1 """Prompt template classes."""
2 from langchain.prompts.base import BasePromptTemplate, StringPromptTemplate
----> 3 from langchain.prompts.chat import (
4 AIMessagePromptTemplate,
5 BaseChatPromptTemplate,
6 ChatMessagePromptTemplate,
7 ChatPromptTemplate,
8 HumanMessagePromptTemplate,
9 MessagesPlaceholder,
10 SystemMessagePromptTemplate,
11 )
12 from langchain.prompts.few_shot import FewShotPromptTemplate
13 from langchain.prompts.few_shot_with_templates import FewShotPromptWithTemplates

File ~/.conda/envs/py310/lib/python3.9/site-packages/langchain/prompts/chat.py:10, in
6 from typing import Any, Callable, List, Sequence, Tuple, Type, TypeVar, Union
8 from pydantic import BaseModel, Field
---> 10 from langchain.memory.buffer import get_buffer_string
11 from langchain.prompts.base import BasePromptTemplate, StringPromptTemplate
12 from langchain.prompts.prompt import PromptTemplate

File ~/.conda/envs/py310/lib/python3.9/site-packages/langchain/memory/__init__.py:28, in
26 from langchain.memory.summary_buffer import ConversationSummaryBufferMemory
27 from langchain.memory.token_buffer import ConversationTokenBufferMemory
---> 28 from langchain.memory.vectorstore import VectorStoreRetrieverMemory
30 __all__ = [
31 "CombinedMemory",
32 "ConversationBufferWindowMemory",
(...)
52 "CassandraChatMessageHistory",
53 ]

File ~/.conda/envs/py310/lib/python3.9/site-packages/langchain/memory/vectorstore.py:10, in
8 from langchain.memory.utils import get_prompt_input_key
9 from langchain.schema import Document
---> 10 from langchain.vectorstores.base import VectorStoreRetriever
13 class VectorStoreRetrieverMemory(BaseMemory):
14 """Class for a VectorStore-backed memory object."""

File ~/.conda/envs/py310/lib/python3.9/site-packages/langchain/vectorstores/__init__.py:2, in
1 """Wrappers on top of vector stores."""
----> 2 from langchain.vectorstores.analyticdb import AnalyticDB
3 from langchain.vectorstores.annoy import Annoy
4 from langchain.vectorstores.atlas import AtlasDB

File ~/.conda/envs/py310/lib/python3.9/site-packages/langchain/vectorstores/analyticdb.py:15, in
12 from sqlalchemy.sql.expression import func
14 from langchain.docstore.document import Document
---> 15 from langchain.embeddings.base import Embeddings
16 from langchain.utils import get_from_dict_or_env
17 from langchain.vectorstores.base import VectorStore

File ~/.conda/envs/py310/lib/python3.9/site-packages/langchain/embeddings/__init__.py:19, in
17 from langchain.embeddings.jina import JinaEmbeddings
18 from langchain.embeddings.llamacpp import LlamaCppEmbeddings
---> 19 from langchain.embeddings.openai import OpenAIEmbeddings
20 from langchain.embeddings.sagemaker_endpoint import SagemakerEndpointEmbeddings
21 from langchain.embeddings.self_hosted import SelfHostedEmbeddings

File ~/.conda/envs/py310/lib/python3.9/site-packages/langchain/embeddings/openai.py:66, in
61 return embeddings.client.create(**kwargs)
63 return _embed_with_retry(**kwargs)
---> 66 class OpenAIEmbeddings(BaseModel, Embeddings):
67 """Wrapper around OpenAI embedding models.
68
69 To use, you should have the openai python package installed, and the
(...)
103
104 """
106 client: Any #: :meta private:

File ~/.conda/envs/py310/lib/python3.9/site-packages/pydantic/main.py:197, in pydantic.main.ModelMetaclass.__new__()

File ~/.conda/envs/py310/lib/python3.9/site-packages/pydantic/fields.py:506, in pydantic.fields.ModelField.infer()

File ~/.conda/envs/py310/lib/python3.9/site-packages/pydantic/fields.py:436, in pydantic.fields.ModelField.__init__()

File ~/.conda/envs/py310/lib/python3.9/site-packages/pydantic/fields.py:552, in pydantic.fields.ModelField.prepare()

File ~/.conda/envs/py310/lib/python3.9/site-packages/pydantic/fields.py:663, in pydantic.fields.ModelField._type_analysis()

File ~/.conda/envs/py310/lib/python3.9/site-packages/pydantic/fields.py:808, in pydantic.fields.ModelField._create_sub_type()

File ~/.conda/envs/py310/lib/python3.9/site-packages/pydantic/fields.py:436, in pydantic.fields.ModelField.__init__()

File ~/.conda/envs/py310/lib/python3.9/site-packages/pydantic/fields.py:552, in pydantic.fields.ModelField.prepare()

File ~/.conda/envs/py310/lib/python3.9/site-packages/pydantic/fields.py:668, in pydantic.fields.ModelField._type_analysis()

File ~/.conda/envs/py310/lib/python3.9/typing.py:852, in _SpecialGenericAlias.__subclasscheck__(self, cls)
850 return issubclass(cls.__origin__, self.__origin__)
851 if not isinstance(cls, _GenericAlias):
--> 852 return issubclass(cls, self.__origin__)
853 return super().__subclasscheck__(cls)

TypeError: issubclass() arg 1 must be a class

I've also created a new conda environment and retried it, and it still does not work. It only runs when using version 0.14 or smaller. Do you know what might be going on?

@MaartenGr
Owner

@SebastianSpeer

Based on this line in your error message, you are not using v0.16.0 but v0.15.0:

3 __version__ = "0.15.0"

Please make sure that you are using the newest version of BERTopic.
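For example:

pip install --upgrade bertopic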

@YooWonTaek

A related question: can I somehow set the temperature argument when using OpenAI() to refine topic representations? Now I am using:

representation_model_openai = OpenAI(client, model="gpt-4-turbo-preview", chat=True)
topic_model.update_topics(texts, topics, representation_model=representation_model_openai)

and the topics I got changed slightly every time I ran the code.

@MaartenGr
Owner

A related question: can I somehow set the temperature argument when using OpenAI() to refine topic representations? Now I am using:

You can use the generator_kwargs for that (see the docstrings).
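For example (a sketch; the kwargs are passed through to the underlying completion call):

representation_model_openai = OpenAI(
    client,
    model="gpt-4-turbo-preview",
    chat=True,
    generator_kwargs={"temperature": 0}  # lower temperature for more deterministic output
)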

and the topics I got changed slightly every time I ran the code.

I would advise checking out the FAQ. It might be that you need to install UMAP from the main branch (I believe a PR updated some things) but I'm not sure, you will have to test.

@YooWonTaek

A related question: can I somehow set the temperature argument when using OpenAI() to refine topic representations? Now I am using:

You can use the generator_kwargs for that (see the docstrings).

and the topics I got changed slightly every time I ran the code.

I would advise checking out the FAQ. It might be that you need to install UMAP from the main branch (I believe a PR updated some things) but I'm not sure, you will have to test.

Thanks! I will try out the generator_kwargs. I have already set a seed for UMAP, and my topics are the same every time; the only difference is in the topic representations I get.

@SamArora18

@MaartenGr I am receiving a similar issue, displayed below:

BadRequestError: Error code: 400 - {'error': {'message': "'$.input' is invalid. Please check the API reference: https://platform.openai.com/docs/api-reference.", 'type': 'invalid_request_error', 'param': None, 'code': None}}

I encounter the error while trying to generate the embeddings.
It occurs when I run the embedding model on the complete corpus of 5,000 sentences, but when I test it on 100 sentences, it works fine.

Below is the code snippet I am using.

import openai
from bertopic import BERTopic
from bertopic.backend import OpenAIBackend
from bertopic.representation import OpenAI
from sklearn.feature_extraction.text import CountVectorizer

client = openai.OpenAI(api_key=my_key)

embedding_model = OpenAIBackend(client, "text-embedding-ada-002")


summarization_prompt = """
I have a topic that is described by the following keywords: [KEYWORDS]
In this topic, the following documents are a small but representative subset of all documents in the topic:
[DOCUMENTS]

Based on the information above, please give a description of this topic in a one statement in the following format:
topic: <description>
"""


representation_model = OpenAI(client = client, model="gpt-4o", chat=True, prompt=summarization_prompt, 
                              nr_docs=5, delay_in_seconds=3)

vectorizer_model = CountVectorizer(min_df=1)
topic_model = BERTopic(
    embedding_model=embedding_model, 
    min_topic_size=25,
    zeroshot_topic_list=topics,
    zeroshot_min_similarity=0,
    representation_model=representation_model
)

Could you please guide me to a solution, or maybe help me out with how to run batches rather than the complete corpus? Thanks.

@MaartenGr
Owner

@SamArora18
Hmmm, this could be an issue with empty documents (which I doubt), but it is more likely related to OpenAI's content filter. It might be that you are sending over documents or asking questions that go against their terms of service. I believe this was fixed in the latest release of BERTopic. Have you tried v0.16.4?
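If you want more control over the request sizes in the meantime, one workaround is to precompute the embeddings in smaller batches yourself and pass them to fit_transform. An untested sketch, assuming docs is your list of input sentences; embed_in_batches is a hypothetical helper, not part of BERTopic:

import numpy as np

def embed_in_batches(client, docs, model="text-embedding-ada-002", batch_size=500):
    # Hypothetical helper: send the corpus to the embeddings endpoint in chunks
    embeddings = []
    for i in range(0, len(docs), batch_size):
        response = client.embeddings.create(input=docs[i:i + batch_size], model=model)
        embeddings.extend(item.embedding for item in response.data)
    return np.array(embeddings)

embeddings = embed_in_batches(client, docs)
topics, probs = topic_model.fit_transform(docs, embeddings=embeddings)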

@ashoknimiwal

ashoknimiwal commented Oct 22, 2024

Hi @MaartenGr,

I'm experiencing a similar issue, but it's more "hidden"/"silent" and it took me some time to isolate the cause. When I use representation models without OpenAI, everything works smoothly. However, when OpenAI is included, the process gets stuck indefinitely at the step where representation models are applied to extract topics from clusters, with no error messages or progress beyond that point.

Note: I also checked my OpenAI API usage and confirmed that requests are being made via BERTopic. (I created a new API key to test this.)

My BERTopic version: v0.16.4

I've attached the code and some output screenshots for your reference below.

Code:

def create_representation_model() -> Dict:
    """Create and return a dictionary of representation models."""
    client = openai.OpenAI(api_key=OPENAI_API_KEY)
    prompt = """
    I have a topic that contains the following documents:
    [DOCUMENTS]
    The topic is described by the following keywords: [KEYWORDS]

    Based on the information above, extract a short but highly descriptive topic label of at most 5 words. Make sure it is in the following format:
    topic: <topic label>
    """
    openai_model = OpenAI(client, model="gpt-3.5-turbo", exponential_backoff=True, chat=True, prompt=prompt)
    
    return {
        "KeyBERT": KeyBERTInspired(),
        # "OpenAI": openai_model,
        "MMR": MaximalMarginalRelevance(diversity=0.4),
        "POS": PartOfSpeech("en_core_web_sm")
    }

def create_bertopic_model(embedding_model, umap_model, hdbscan_model, representation_model) -> BERTopic:
    """Create and return a BERTopic model."""
    return BERTopic(
        embedding_model=embedding_model,
        umap_model=umap_model,
        hdbscan_model=hdbscan_model,
        vectorizer_model=CountVectorizer(stop_words="english", ngram_range=(1, 3)),
        representation_model=representation_model,
        top_n_words=15,
        calculate_probabilities=False,
        verbose=True
    )

Output screenshots:

  1. With OpenAI in the representation models:
  • I noticed that the progress bar only appears during the representation step when OpenAI is used as one of the representation models.

[screenshot]

  2. Without OpenAI in the representation models:

[screenshot]

Thank you.

@MaartenGr
Owner

@ashoknimiwal Hmmm, I'm not sure what exactly is happening there. Based on your tqdm bar, it seems that the representation step is not progressing at all. You mentioned that you do see activity on the OpenAI API side, right? Does that mean the tqdm bar is not progressing and stuck at 0?

Also, which version of openai are you using?

The last thing I can think of is to not use exponential_backoff=True and see whether that resolves anything. Perhaps use a set number of seconds between requests instead.
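Something like this, for example (a sketch):

openai_model = OpenAI(client, model="gpt-3.5-turbo", chat=True, prompt=prompt,
                      exponential_backoff=False, delay_in_seconds=2)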

@ashoknimiwal

ashoknimiwal commented Oct 23, 2024

Thank you for your input! It helped me figure out the actual problem.

The issue lies with my OpenAI API limits: my requests had exceeded the approved quota. And having exponential_backoff set to True was preventing any RateLimitError messages from appearing!

One question: Is it possible to use local LLMs or other LLM APIs as a proxy for the OpenAI API here?
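For what it's worth, since the openai client accepts a custom base_url, I assume pointing it at an OpenAI-compatible local server such as Ollama would work here too. An untested sketch, with a hypothetical local model name:

import openai
from bertopic.representation import OpenAI

# Untested: Ollama exposes an OpenAI-compatible endpoint at /v1
client = openai.OpenAI(base_url="http://localhost:11434/v1", api_key="not-needed")
openai_model = OpenAI(client, model="llama3", chat=True, prompt=prompt)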
