n-gram Keywords need delimiting in OpenAI() #1546
Thanks for the extensive description! I'll make sure to change it in #1539
That indeed would be helpful. I can enable verbosity to print out the prompts that are given for each call, but that might prove to be too much logging if you have a very large dataset.
What I've set up in my custom class just prints the prompt for topic 0 (or the outlier topic if there are no topics), so that might be a good way to go if you want to do it with verbosity, rather than making a function to generate just a prompt.
The thing is, when just one topic is logged, users might want to log every one of them, and vice versa. I might add it to the LLMs themselves as additional verbosity levels, but that feels a bit less intuitive with respect to user experience, since verbosity is handled differently throughout BERTopic.
Yeah, it might be nice to have access to all the prompts in an easier way than extracting them from logs. Is the selection and diversification of representative documents deterministic? If so, rather than looping through the topics, generating a prompt and getting the description back one by one, you could generate all of the prompts at once, then loop through the prompts to get each representation. Then you could abstract the prompt generation to a function or method exposed to the user, so they could just call it to get all the prompts using the same arguments they sent to the LLM initially, maybe bound to the topics. Pseudo-code might be something like:
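Roughly something like this (a sketch only; the `get_prompts` name, the `_representative_docs` and `_create_prompt` helpers, and the exact argument list are assumptions for illustration, not existing BERTopic API):

```python
# Sketch of a user-facing helper that builds every prompt without calling the LLM.
# All names below (get_prompts, _representative_docs, _create_prompt) are
# illustrative assumptions about how it could hang together.
def get_prompts(self, topic_model, documents, c_tf_idf, topics):
    """Return {topic_id: prompt} using the same arguments as extract_topics()."""
    # Select representative documents exactly as extract_topics() would,
    # which only works if that selection is deterministic.
    repr_docs_mappings = self._representative_docs(topic_model, documents, c_tf_idf, topics)

    return {
        topic: self._create_prompt(docs, topic, topics)
        for topic, docs in repr_docs_mappings.items()
    }
```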
Then, rather than the current loop in BERTopic/bertopic/representation/_openai.py, lines 192 to 220 in 817ad86:
You might have something like:
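Something along these lines (a sketch only: it assumes the pre-1.0 `openai.ChatCompletion` client and the `model`/`generator_kwargs` attributes of the OpenAI representation class, and skips the chat/completion switch and retry handling):

```python
import openai  # pre-1.0 client assumed

def extract_topics(self, topic_model, documents, c_tf_idf, topics):
    """Sketch: build all prompts first, then loop over them for the LLM calls."""
    prompts = self.get_prompts(topic_model, documents, c_tf_idf, topics)

    updated_topics = {}
    for topic, prompt in prompts.items():
        response = openai.ChatCompletion.create(
            model=self.model,
            messages=[{"role": "user", "content": prompt}],
            **self.generator_kwargs,
        )
        label = response["choices"][0]["message"]["content"].strip()
        # BERTopic-style representation: one label plus padding to ten entries
        updated_topics[topic] = [(label, 1)] + [("", 0) for _ in range(9)]
    return updated_topics
```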
That code is based on #1539 and still needs some work... It works for generating the representations, but ...
Good idea! It is possible to generate the prompts before passing them to the LLM; they are currently not dependent on previous prompts. This might change in the future, however, so I think I would prefer to simply save the prompts after generating them iteratively. Then, you could save the prompts to the representation model and access them there. Since the prompts are also dependent on the order of representation models (KeyBERT -> OpenAI), I think ...

Also, in your example, you would essentially create the prompts twice: once when running ...

Based on that, I would suggest the following. During any LLM representation model, save the prompts in the representation model whilst they are being created, with the option of logging each of them or just the first. This would mean that the prompts are created once during ...
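Something like this, for instance (the `prompts_` attribute, the `log_prompts` flag, and the `_representative_docs`/`_label_from_llm` helpers are placeholder names, not the final implementation):

```python
import logging

logger = logging.getLogger(__name__)

def extract_topics(self, topic_model, documents, c_tf_idf, topics):
    """Sketch: the same iterative loop as today, but every prompt is kept on the model."""
    repr_docs_mappings = self._representative_docs(topic_model, documents, c_tf_idf, topics)

    self.prompts_ = []  # afterwards accessible on the representation model instance
    updated_topics = {}
    for topic, docs in repr_docs_mappings.items():
        prompt = self._create_prompt(docs, topic, topics)
        self.prompts_.append(prompt)

        # Log either every prompt or only the first one, depending on a flag
        if self.log_prompts == "all" or (self.log_prompts == "first" and len(self.prompts_) == 1):
            logger.info(prompt)

        updated_topics[topic] = self._label_from_llm(prompt)  # existing LLM call
    return updated_topics
```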
Yes, very good points. I forget that data can be saved in objects in Python (I think I still approach Python with a bit of an R mindset). That sounds like a great solution.
Hi Maarten,
I think there is a bug in the OpenAI representation model in the way the prompt is generated. The keywords are only separated by a space, not a comma, which is problematic for n-grams > 1.
BERTopic/bertopic/representation/_openai.py, lines 203 to 209 in 244215a:
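The pattern in question looks roughly like this (paraphrased, with dummy data standing in for the real inputs, so not an exact copy of those lines); joining with `", "` instead of `" "` would fix it:

```python
# Paraphrase of the keyword substitution in the OpenAI representation model.
# Dummy data stands in for the real topic representation:
topic = 0
topics = {0: [("topic model", 0.05), ("document embedding", 0.04), ("neural network", 0.03)]}
prompt_template = "The topic is described by the following keywords: [KEYWORDS]"

keywords = list(zip(*topics[topic]))[0]

# Current behaviour: keywords joined with a bare space, so n-gram boundaries disappear
prompt = prompt_template.replace("[KEYWORDS]", " ".join(keywords))

# Proposed fix: join with ", ", as the TextGeneration and Cohere models already do
prompt = prompt_template.replace("[KEYWORDS]", ", ".join(keywords))
```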
Without proper delimiting I end up with a prompt like this:
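(The keywords and surrounding wording below are made up for illustration, but the effect is the same:)

```
The topic is described by the following keywords: topic model document embedding neural network dimensionality reduction
```

There is no way to tell that "topic model" and "document embedding" were separate keywords.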
TextGeneration and Cohere look to be okay (BERTopic/bertopic/representation/_textgeneration.py, lines 130 to 136 in 244215a).
It would also be helpful to have some way to generate an example prompt with [DOCUMENTS] and [KEYWORDS] applied, to help with testing, so the user can actually see what's being sent. I've got a custom class because I'm using ChatGPT on AWS, so I've got extra loggers in there, but it's difficult to actually see the prompt in context with standard BERTopic.
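In the meantime, a small standalone helper (not part of BERTopic; the template, documents, and keywords below are placeholders) can at least render a filled-in prompt for inspection:

```python
def render_prompt(prompt_template, docs, keywords):
    """Fill [DOCUMENTS] and [KEYWORDS] so the final prompt can be inspected
    before anything is sent to the LLM. Debugging helper only."""
    documents_block = "".join(f"- {doc}\n" for doc in docs)
    prompt = prompt_template.replace("[DOCUMENTS]", documents_block)
    return prompt.replace("[KEYWORDS]", ", ".join(keywords))


example = render_prompt(
    prompt_template=(
        "I have a topic that contains the following documents:\n[DOCUMENTS]"
        "The topic is described by the following keywords: [KEYWORDS]\n"
        "Based on the above, give a short topic label."
    ),
    docs=["A document about topic modeling ...", "Another representative document ..."],
    keywords=["topic model", "document embedding"],
)
print(example)
```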