## Introduction
Petals is a library for running large models in a distributed manner, with inference split across multiple servers. This PR adds support for using text-generation-webui as a petals client. I have tested it in a colab notebook both with and without a GPU (petals doesn't currently support TPUs).
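For context, here is a minimal sketch of what a petals client looks like, assuming the `AutoDistributedModelForCausalLM` entry point from recent petals releases; the model name is just an illustrative placeholder.

```python
from transformers import AutoTokenizer
from petals import AutoDistributedModelForCausalLM

# Illustrative model name; any model served by a petals swarm works the same way.
model_name = "bigscience/bloom-560m"

tokenizer = AutoTokenizer.from_pretrained(model_name)
# Only the embeddings/LM head load locally; the transformer blocks run on
# remote petals servers discovered over the network.
model = AutoDistributedModelForCausalLM.from_pretrained(model_name)

inputs = tokenizer("Hello, world!", return_tensors="pt")["input_ids"]
outputs = model.generate(inputs, max_new_tokens=8)
print(tokenizer.decode(outputs[0]))
```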
## Code comments
As of the most recent release, the petals API is essentially identical to huggingface's transformers, so I used the same loading codepath (see the sketch below). One complication is that petals usually loads models from the network using the transformers download API and stores them in `~/.cache/huggingface/hub`. I didn't investigate pre-downloading the model manually too deeply; I just patched the checks so that the download can proceed.
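A hedged sketch of how sharing the loading codepath can look; the function and flag names here are illustrative, not the PR's exact code:

```python
from transformers import AutoModelForCausalLM

def load_model(model_name: str, use_petals: bool = False):
    """Load a model through the shared transformers-style codepath."""
    if use_petals:
        # Same from_pretrained() interface as transformers, so the
        # surrounding loader code doesn't need to change.
        from petals import AutoDistributedModelForCausalLM as LoaderClass
    else:
        LoaderClass = AutoModelForCausalLM
    # Either way, weights land in ~/.cache/huggingface/hub via the usual
    # transformers download machinery.
    return LoaderClass.from_pretrained(model_name)
```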
I was getting errors when trying to run webui+petals without `--cpu` on a colab instance without a GPU, so I also moved the CPU check earlier. Another note is that every HTTP request petals made was getting logged, which ended up being a significant amount of output, so I suppressed that.
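The suppression amounts to raising the log level on the offending logger; the logger name below is an assumption, since the actual name depends on which HTTP client petals uses internally:

```python
import logging

# "urllib3" is an assumed logger name; substitute whichever logger the
# per-request messages actually come from.
logging.getLogger("urllib3").setLevel(logging.WARNING)
```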
You'll also notice that my colab notebook uses `%run server.py`, which makes the model-download output much nicer, but it surfaced a warning about `gradio.launch(debug=)`. Researching this, it seems `debug=True` is appropriate in this case.
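For illustration, this is the pattern in question (a toy interface, not the webui's actual launch code):

```python
import gradio as gr

def echo(text):
    return text

demo = gr.Interface(fn=echo, inputs="text", outputs="text")
# debug=True keeps the launch blocking, so logs and errors from the
# running app stream into the %run cell instead of being swallowed.
demo.launch(debug=True)
```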
There is a "session" API in petals, but I investigated it and it is not currently flexible enough to support most of the webui's commands, so I left the session unspecified, meaning a new inference session is set up with the servers every time you hit "generate". This adds a delay of about a second before generation starts on every request, sometimes much longer if petals can't find sufficient servers or a server times out. Hitting the "stop" button does not interrupt the route-finding; I am not familiar enough with gradio to fix this.
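For reference, the session API in question looks roughly like this, assuming petals' documented `inference_session()` context manager (`tokenizer` and `model` as in the loading sketch above); holding one session open across "generate" presses is what would amortize the route-finding delay, but it isn't flexible enough for the webui's commands yet:

```python
# Routes to servers are found once when the session opens, then reused
# for generate() calls inside the block.
with model.inference_session(max_length=512) as sess:
    inputs = tokenizer("Hello, world!", return_tensors="pt")["input_ids"]
    outputs = model.generate(inputs, session=sess, max_new_tokens=8)
    print(tokenizer.decode(outputs[0]))
```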