Optimise extraction prompts via DSPy #154
@drAbreu could you update briefly with your recent experiences?
A series of experiments was performed to investigate whether DSPy can improve the benchmarking results, specifically for the Llama family of models. Unfortunately, it seems that it is currently not possible to use system prompts with the Llama models in DSPy. Some extra research has also shown that the template we use in the system prompt for information extraction, while understood by OpenAI models, is not understood by other models. One example is Claude, where including the template makes the model more or less fail, while removing it leads to good results. This is likely because Claude uses XML-like tags for prompt templating, as opposed to GPT. This points clearly to prompt-engineering issues that are model dependent, and it raises the question of whether our current benchmark prompts are equally suited to all model families.

The idea of DSPy was to improve the prompt or the system prompt and thereby increase the quality of the LLM inferences. However, I do not see this happening in our information extraction. I have been comparing GPT-3.5, GPT-4 and Claude 3.5 using the baseline API results, and then some of the different DSPy variants. As shown below, Claude 3.5 works better than any of the GPT models, with the surprise that gpt-4o is outperformed by GPT-3.5 :hug: Also interesting is that the most basic DSPy uses (Signature and ChainOfThought) just make the models worse; few-shot learning is what indeed provides the best results.
Introducing the system prompt as a learnable parameter does not actually improve anything. During this few-shot learning process the system prompt is in fact not modified at all by the DSPy compiler, and the results do not change either.
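For reference, a minimal sketch of the kind of DSPy setup compared above (plain Signature, ChainOfThought, and BootstrapFewShot compilation). This is not the exact benchmark code; the field names, metric and training examples are illustrative assumptions:

```python
import dspy
from dspy.teleprompt import BootstrapFewShot

# Configure an LM backend; depending on the DSPy version this is
# dspy.LM("openai/gpt-3.5-turbo") or the older dspy.OpenAI(model="gpt-3.5-turbo").
dspy.settings.configure(lm=dspy.LM("openai/gpt-3.5-turbo"))

class ExtractEntities(dspy.Signature):
    """Extract biomedical entities and their relationships from the text."""
    text = dspy.InputField(desc="source passage")
    entities = dspy.OutputField(desc="extracted entities and relations")

# 1) Plain signature
plain = dspy.Predict(ExtractEntities)

# 2) Chain of thought over the same signature
cot = dspy.ChainOfThought(ExtractEntities)

# 3) Few-shot compilation (the variant that performed best here)
def extraction_metric(example, prediction, trace=None):
    # Illustrative metric; the benchmark uses its own scoring.
    return example.entities == prediction.entities

trainset = [
    dspy.Example(text="...", entities="...").with_inputs("text"),
    # ... more labelled examples
]

fewshot = BootstrapFewShot(metric=extraction_metric).compile(
    dspy.ChainOfThought(ExtractEntities), trainset=trainset
)
```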
This experiment might suggest that keeping track of the prompt engineering of different model families might be important to make the framework as universal as possible.
Very nice analysis, thanks! Aligns with my intuition that the model creators are doing many individualistic things, and it would thus be valuable to know the peculiarities of each model family and account for them in the backend to get comparable results between models. I'll be off next week but let's catch up in September. :)
In fact, I did suspect that, but I think it is still valid to test, because this is the application we use. The next step would be the extraction module I suggested, where we look at each model family and create family-specific prompts to improve their performance. This would go into a new BioChatter version, and we would hopefully see a positive trend in extraction performance for some of the models.
That is very interesting and counterintuitive, although I am not surprised.
We see this in many instances. My guess is that it has to do with the internal system instructions.
There remain some questions about the right prompts for the behaviour of the different models; the llama series models seem to handle prompts differently than GPT. As an initial experiment, DSPy will be used to generate optimal text extraction prompts for a selection of models (GPT, llama, mi(s/x)tral), which will then be examined for their differences.
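A rough sketch of how such a comparison could look: compile the same extraction signature against several backends and compare the resulting few-shot demos and prompts. The model identifiers and provider strings below are assumptions and would need to be replaced with whatever is actually configured in the benchmark:

```python
import dspy
from dspy.teleprompt import BootstrapFewShot

# Illustrative backends; names and provider strings are assumptions.
backends = {
    "gpt-4": dspy.LM("openai/gpt-4"),
    "llama-3-70b": dspy.LM("ollama_chat/llama3:70b"),
    "mixtral": dspy.LM("ollama_chat/mixtral"),
}

compiled = {}
for name, lm in backends.items():
    dspy.settings.configure(lm=lm)
    student = dspy.ChainOfThought("text -> entities")  # shorthand signature
    compiled[name] = BootstrapFewShot(metric=extraction_metric).compile(
        student, trainset=trainset  # metric and trainset as in the sketch above
    )
    # The few-shot demos (and hence the final prompt) selected for each backend
    # can then be compared across model families.
    print(name, compiled[name].predictors()[0].demos)
```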