
Prompting strategies #110

Closed
tmm1 opened this issue Jul 16, 2023 · 22 comments
Labels
question Further information is requested

Comments


tmm1 commented Jul 16, 2023

Thanks for all your work investigating and benchmarking various prompting strategies.

#20 (comment)

I was curious if there are other strategies you tried and how well they worked. Ideally an LLM could generate unified diffs directly, but that seems quite challenging, even for GPT-4.

I plan to experiment with a simple line-based strategy, but wanted to run it by you in case you had thoughts. If all file listings had <line no>: prefixes on each line, I think a model could be asked to return only the lines it wanted to change. This would reduce a lot of the repetition still seen with the block-diff strategy and GPT-4.
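
For illustration, a minimal sketch of how such a numbered listing could be produced (a hypothetical helper, not aider's actual format):

from pathlib import Path

def numbered_listing(path: str) -> str:
    # Prefix each line with its 1-based number so the model can
    # reference exactly the lines it wants to change.
    lines = Path(path).read_text().splitlines()
    return "\n".join(f"{i}: {line}" for i, line in enumerate(lines, start=1))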


tmm1 commented Jul 16, 2023

I played a bit with this today and had some success with GPT-4. I discovered that if I ask for a range of line numbers at the top of an edit block, they are usually off by one. But asking for the range at the end of the block, after the changed lines, works more reliably.

[screenshot: IMG_0321]

EDIT: my prompt changes are here

paul-gauthier (Collaborator) commented:

Nice, that looks promising. I am exploring something similar on a branch myself right now.

I have previously experimented with line numbers (and unified diff, and a bunch of other things). Most of that work predates the benchmarking suite, so I was working off anecdotal evidence and best intuitions. I can't recall the specific observations that moved me away from line numbers and towards diffs at that time.

I'll share some results/learnings once I find the time to make progress on my line number branch.


tmm1 commented Jul 17, 2023

To run the benchmarks, do I need to clone https://github.com/exercism/python into tmp.benchmarks/exercism-python?

I built the Docker image and then ran rungrid inside it, but I see this error:

FileNotFoundError: [Errno 2] No such file or directory: '/benchmarks/2023-07-17-19-59-35--rungrid-gpt-3.5-turbo-0613-whole-repeat-1/exercises/.docs/instructions.md'

EDIT: Figured out I need to copy the practice folders in: cp -a ~/code/exercism-python/exercises/practice tmp.benchmarks/exercism-python


tmm1 commented Jul 18, 2023

I experimented a bit more with line-based editing today. Asking GPT-4 to emit simple code blocks, with instructions immediately after to replace/insert-above/insert-below/delete, seems to work well and allows the model to make changes in multiple places very efficiently:

[screenshot: IMG_0323]

However, it's challenging to apply these changes, because as soon as you apply the first one, the line numbers that the subsequent edits refer to have shifted.

paul-gauthier (Collaborator) commented:

However, it's challenging to apply these changes, because as soon as you apply the first one, the line numbers that the subsequent edits refer to have shifted.

Interesting progress! Can you apply them in highest-to-lowest line-number order?
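
A minimal sketch of that idea, assuming the edits come back as (start, end, replacement) tuples against the original numbering (a hypothetical structure, not any format aider actually uses). Applying from the bottom up means no edit can shift the line numbers of the edits still to come:

def apply_line_edits(lines, edits):
    # edits: (start, end, replacement) with 1-based inclusive ranges
    # against the ORIGINAL file; an empty replacement deletes the range,
    # and end == start - 1 models a pure insertion above line `start`.
    for start, end, replacement in sorted(edits, key=lambda e: e[0], reverse=True):
        lines[start - 1:end] = replacement
    return lines

original = ["a", "b", "c", "d"]
print(apply_line_edits(original, [(1, 1, ["A"]), (3, 4, ["C"])]))
# ['A', 'b', 'C']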

paul-gauthier (Collaborator) commented:

EDIT: Figured out I need to copy the practice folders in: cp -a ~/code/exercism-python/exercises/practice tmp.benchmarks/exercism-python

Yes, that's correct.

Sorry the benchmarking tools are not packaged up very well. I wasn't expecting anyone other than me to use them. I'll try to put some effort into polishing them up, adding a README, etc.

Be warned that it costs about $15 to process all 133 exercises with aider when using the gpt-4 model.


tmm1 commented Jul 19, 2023

Sorry the benchmarking tools are not packaged up very well. I wasn't expecting anyone other than me to use them. I'll try to put some effort into polishing them up, adding a README, etc.

Be warned that it costs about $15 to process all 133 exercises with aider when using the gpt-4 model.

No worries, it wasn't hard to figure out how to run the various scripts, but some docs would be great.

I was able to run the "whole" tests for one iteration using gpt-3.5, for $1.72. I got a similar result to you, ~54% IIRC.

I then tried to run the "whole" benchmarks against LocalAI + WizardCoder and StarCoderPlus, but neither was able to follow aider's instructions well enough to score above 0%. I will try again with Llama 2 chat models, and I also want to try fine-tuning some open-source models on aider's prompt syntax (hence the investigation into prompting strategies before I invest in building a training dataset).

Interesting progress! Can you apply them in highest-to-lowest line-number order?

That's clever!

Unfortunately, I'm souring on the line-number approach. It seems very finicky and hard to replicate reliably. I will experiment with prompts more; maybe asking GPT-4 to reason further step by step will help. Instead of just asking for a list of files first, deciding on a list of line ranges and operations upfront might also lead to better results, as sketched below.
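
To make that concrete, one possible (entirely hypothetical) shape for such an upfront plan:

from dataclasses import dataclass
from typing import Literal

@dataclass
class PlannedEdit:
    # One step the model commits to before writing any code.
    path: str
    start: int   # 1-based, inclusive
    end: int     # 1-based, inclusive
    op: Literal["replace", "insert-above", "insert-below", "delete"]
    reason: str  # short justification, to encourage step-by-step reasoning

# The model would first emit a list of these entries, then fill in
# the code for each one in a second pass.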

I'm also considering using something like gpt-prompt-engineer

paul-gauthier (Collaborator) commented:

That's really interesting that you were able to benchmark against LocalAI. Thanks for doing that and sharing the (disappointing) outcome.

I can't say I am super surprised. The Claude models are the only ones I've seen discussed as potentially rivaling OpenAI's with respect to coding tasks.

I've had similar results from experiments with line number based edit formats. It's surprisingly difficult to get reliable edits from even GPT-4.


tmm1 commented Jul 19, 2023

My hunch is that once GPT-4 allows fine-tuning, reliability could be improved drastically with a training set of examples using line-number edits.

I played a bit with Llama 2 13B chat (llama-2-13b-chat.ggmlv3.q4_K_M.bin) and the results are very encouraging. It manages to output edit blocks sometimes, and it can handle whole-file listings with relative ease. Fine-tuning may push it over the edge enough to be useful. I will try to run the benchmarks and report the results.

paul-gauthier added the question label Jul 19, 2023

tmm1 commented Jul 20, 2023

I have been experimenting with Llama 2 and managed to wire aider up to it.

If you have an NVIDIA GPU and a CUDA environment, you can do this very simply:

pip3 install vllm
pip3 install git+https://github.com/lm-sys/FastChat@main
python3 -m vllm.entrypoints.openai.api_server --model meta-llama/Llama-2-7b-chat-hf

then use aider --openai-api-base http://localhost:8000/v1
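
As a quick sanity check that the local server really speaks the OpenAI wire format, something like this should work (a sketch; it assumes the default port from the command above and that the server implements the standard /v1/models and /v1/chat/completions routes):

import requests

base = "http://localhost:8000/v1"

# List the models the server is serving.
print(requests.get(f"{base}/models").json())

# Minimal chat completion in the OpenAI request format.
resp = requests.post(f"{base}/chat/completions", json={
    "model": "meta-llama/Llama-2-7b-chat-hf",
    "messages": [{"role": "user", "content": "Say hello in one word."}],
    "max_tokens": 16,
})
print(resp.json()["choices"][0]["message"]["content"])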

funkytaco commented:

@tmm1 if you use Llama 2, does that mean your API calls would be free? I see mention of openai.api_server, which is what has me wondering (that's probably just a misnamed setting?)


tmm1 commented Jul 20, 2023

Yes, all local and free. The vllm project has an OpenAI-compatible API module that I'm using here.

I've also been able to use text-generation-webui's openai extension and will document those commands later today.

I think GGML will be the most portable way to integrate, so I'm working through that now. I'm able to simulate an aider session prompt using the llama.cpp chat CLI and have been using that to test responses.


tmm1 commented Jul 20, 2023

I've also been able to use text-generation-webui's openai extension and will document those commands later today.

text-generation-webui

git clone https://github.com/oobabooga/text-generation-webui
cd text-generation-webui
pip install -r requirements.txt
pip install -r extensions/openai/requirements.txt

then, if using CUDA:

python3 download-model.py TheBloke/Llama-2-13B-chat-GPTQ
python3 server.py --listen --extensions openai --loader exllama --model TheBloke_Llama-2-13B-chat-GPTQ

otherwise to run on CPU:

python3 download-model.py TheBloke/Llama-2-13B-chat-GGML
python3 server.py --listen --extensions openai --loader llamacpp --model TheBloke_Llama-2-13B-chat-GGML

the OpenAI-compatible API will be listening on http://0.0.0.0:5001/v1

Some small changes (de1616e) are required to get aider working. I will try to clean them up and send some PRs.

funkytaco commented:

python3 server.py --listen --trust-remote-code --extensions openai --loader llamacpp --model TheBloke_Llama-2-13B-chat-GGML

I thought it didn't need OpenAI? This doesn't seem to be working on my Mac M1, either.

funkytaco commented:

I think maybe the model was too much for my 2020 M1. I moved out a lot of the .bin files and used TheBloke_Llama-2-13B-chat-GGML/llama-2-13b-chat.ggmlv3.q4_K_S.bin.

It's really slow, but it works, so I guess I need to get a GPU.


Ichigo3766 commented Jul 22, 2023

@tmm1 Would you try doing this with the Hugging Face Text-Generation-Inference (TGI) server? They have a LangChain wrapper which can be used to connect to the LLM. Unfortunately there's no OpenAI-style API for them, but more of this:

from langchain.callbacks import streaming_stdout
from langchain.llms.huggingface_text_gen_inference import HuggingFaceTextGenInference

llm = HuggingFaceTextGenInference(
    inference_server_url='',
    top_k=10,
    top_p=0.7,
    temperature=0.01,
    stream=True,
    repetition_penalty=1.1,
    max_new_tokens=6000,
    # callbacks=[MyCustomHandler()]
    callbacks=[streaming_stdout.StreamingStdOutCallbackHandler()],
)

This would require some work, but it would be sick, as TGI is a production-level API and is very fast! I have been running it, and using a coding model with aider would be a huge deal!


tmm1 commented Jul 22, 2023

How fast is TGI for you vs other methods? In my testing, exllama and vllm were the fastest, but I didn't try TGI. Unfortunately, without an OpenAI-compatible API, it will not be straightforward to integrate.

I had some issues with text-generation-webui when used with GGML (with GPTQ + exllama it works really well). I would recommend using LocalAI instead if you need GGML:

git clone https://github.com/go-skynet/LocalAI
cd LocalAI
make build
cd models
wget --continue https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGML/resolve/main/llama-2-7b-chat.ggmlv3.q4_K_M.bin
cat > llama2.yaml <<EOF
name: llama2
backend: llama
context_size: 2048
max_tokens: 512
parameters:
  model: llama-2-7b-chat.ggmlv3.q4_K_M.bin
  temperature: 0.6
  top_k: 80
  top_p: 0.7
template:
  chat_message: llama2-chat
EOF
cat > llama2-chat.tmpl <<EOF
{{if eq .RoleName "assistant"}}{{.Content}}{{else}}
[INST]
{{if .SystemPrompt}}{{.SystemPrompt}}{{else if eq .RoleName "system"}}<<SYS>>{{.Content}}<</SYS>>

{{else if .Content}}{{.Content}}{{end}}
[/INST] 
{{end}}
EOF
cd ..
./local-ai --debug
aider --model llama2 --openai-api-base http://localhost:8080/v1 --edit-format whole

Ichigo3766 commented:

TGI is crazy fast. Now they support GPTQ as well, using the exllama kernel. vLLM is not as fast, but it has an OpenAI-compatible API.


tmm1 commented Jul 23, 2023

I tested TGI and it was much slower than using exllama directly. Not sure why; maybe it's more tuned for the A100 than the 3090 that I'm using.

That said, the API for TGI is very simple (https://huggingface.github.io/text-generation-inference/) and there's a Python wrapper too (pip install text_generation), so perhaps it would be worth adding support for it as a backend to aider.
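
For reference, a minimal sketch of what using that wrapper might look like (assuming a TGI server running on localhost:8080):

from text_generation import Client

client = Client("http://localhost:8080")

# One-shot generation.
print(client.generate("def fib(n):", max_new_tokens=64).generated_text)

# Token-by-token streaming, similar to how aider streams responses.
for response in client.generate_stream("def fib(n):", max_new_tokens=64):
    if not response.token.special:
        print(response.token.text, end="", flush=True)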


Ichigo3766 commented Jul 24, 2023

Interesting. But exllama is not a production-ready server; it may be fast for a single user, but in a multi-user scenario TGI is way ahead due to its sharding mechanism. Overall, it would be nice to have this. Many people are using TGI, so it would be cool to use WizardCoder with aider through it. @paul-gauthier


paul-gauthier commented Jul 24, 2023

FYI, we just added an #llm-integrations channel on the Discord, as a place to discuss using aider with alternative or local LLMs.

https://discord.gg/X9Sq56tsaR

paul-gauthier (Collaborator) commented:

Closing this for now. See #172 for more info.
