Optimise extraction prompts via DSPy #154
@drAbreu could you update briefly with your recent experiences?
A series of experiments was performed to investigate whether DSPy can improve the benchmarking results, specifically for the Llama family of models. Unfortunately, it seems that it is currently not possible to use system prompts with the Llama models in DSPy. Some extra research has also shown that the template we use in the system prompt for information extraction, while understood by OpenAI models, is not understood by other models. One example is Claude, where including the template makes the model more or less fail, while removing it leads to good results. This is likely because Claude uses XML-like tags for prompt templating, as opposed to GPT. This points clearly to prompt-engineering issues that are model dependent, and it raises the question of whether our current benchmark prompts are equally suited to all model families.

The idea of DSPy was to improve the prompt or the system prompt and thereby increase the quality of the LLM inferences. However, I do not see this happening in our information extraction. I have been comparing GPT-3.5, GPT-4 and Claude 3.5 using the baseline API results, and then some of the different DSPy variants. As shown below, Claude 3.5 works better than any of the GPT models, with the surprise that gpt-4o is outperformed by GPT-3.5 :hug: Also interesting is that the most basic DSPy uses (Signature and ChainOfThought) just make the models worse; few-shot learning is what indeed provides the best results.
Introducing the system prompt as a learnable parameter does not actually improve anything. During this few-shot learning process the system prompt is in fact not modified at all by the DSPy compiler, and the results do not change either.
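For reference, a minimal sketch of the kind of DSPy setup compared above (plain Signature, ChainOfThought, and BootstrapFewShot compilation). This is not the exact benchmark code; the field names, metric and training examples are illustrative assumptions:

```python
import dspy
from dspy.teleprompt import BootstrapFewShot

# Configure an LM backend; depending on the DSPy version this is
# dspy.LM("openai/gpt-3.5-turbo") or the older dspy.OpenAI(model="gpt-3.5-turbo").
dspy.settings.configure(lm=dspy.LM("openai/gpt-3.5-turbo"))

class ExtractEntities(dspy.Signature):
    """Extract biomedical entities and their relationships from the text."""
    text = dspy.InputField(desc="source passage")
    entities = dspy.OutputField(desc="extracted entities and relations")

# 1) Plain signature
plain = dspy.Predict(ExtractEntities)

# 2) Chain of thought over the same signature
cot = dspy.ChainOfThought(ExtractEntities)

# 3) Few-shot compilation (the variant that performed best here)
def extraction_metric(example, prediction, trace=None):
    # Illustrative metric; the benchmark uses its own scoring.
    return example.entities == prediction.entities

trainset = [
    dspy.Example(text="...", entities="...").with_inputs("text"),
    # ... more labelled examples
]

fewshot = BootstrapFewShot(metric=extraction_metric).compile(
    dspy.ChainOfThought(ExtractEntities), trainset=trainset
)
```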
This experiment might suggest that keeping track of the prompt engineering of different model families might be important to make the framework as universal as possible.
Very nice analysis, thanks! Aligns with my intuition that the model creators are doing many individualistic things, and it would thus be valuable to know the peculiarities of each model family and account for them in the backend to get comparable results between models. I'll be off next week but let's catch up in September. :)
In fact, I did suspect that, but I think it is still valid to test, because this is the application we use. The next step would be the extraction module I suggested, where we look at each model family and create family-specific prompts to improve their performance. This would go into a new BioChatter version, and we would hopefully see a positive trend in extraction performance for some of the models.
That is very interesting and counterintuitive, although I am not surprised.
We see this in many instances. My guess is that it has to do with the internal system instructions.
There remain some questions about the right prompts for the behaviour of the different models; the llama series models seem to handle prompts differently than GPT. As an initial experiment, DSPy will be used to generate optimal text extraction prompts for a selection of models (GPT, llama, mi(s/x)tral), which will then be examined for their differences.
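A rough sketch of how such a comparison could look: compile the same extraction signature against several backends and compare the resulting few-shot demos and prompts. The model identifiers and provider strings below are assumptions and would need to be replaced with whatever is actually configured in the benchmark:

```python
import dspy
from dspy.teleprompt import BootstrapFewShot

# Illustrative backends; names and provider strings are assumptions.
backends = {
    "gpt-4": dspy.LM("openai/gpt-4"),
    "llama-3-70b": dspy.LM("ollama_chat/llama3:70b"),
    "mixtral": dspy.LM("ollama_chat/mixtral"),
}

compiled = {}
for name, lm in backends.items():
    dspy.settings.configure(lm=lm)
    student = dspy.ChainOfThought("text -> entities")  # shorthand signature
    compiled[name] = BootstrapFewShot(metric=extraction_metric).compile(
        student, trainset=trainset  # metric and trainset as in the sketch above
    )
    # The few-shot demos (and hence the final prompt) selected for each backend
    # can then be compared across model families.
    print(name, compiled[name].predictors()[0].demos)
```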