
Petals support #3784

Closed
wants to merge 3 commits into from

Conversation

@Mathnerd314

Introduction

Petals is a library for running models in a distributed manner, with inference split up among different servers. This PR adds support for using text-generation-webui as a Petals client. I have tested it in a Colab notebook both with and without a GPU (Petals doesn't currently support TPUs).
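For context, using the webui as a Petals client boils down to something like the following at the library level. This is a minimal sketch along the lines of Petals' own examples, not code from this PR; the model name is just an illustration.

```python
from transformers import AutoTokenizer
from petals import AutoDistributedModelForCausalLM

model_name = "bigscience/bloom-7b1-petals"  # illustrative; any Petals-served model works

tokenizer = AutoTokenizer.from_pretrained(model_name)
# Loading looks just like transformers; the transformer blocks run on remote swarm servers.
model = AutoDistributedModelForCausalLM.from_pretrained(model_name)

inputs = tokenizer("A cat sat on", return_tensors="pt")["input_ids"]
outputs = model.generate(inputs, max_new_tokens=5)
print(tokenizer.decode(outputs[0]))
```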

Code comments

As of the most recent release, the Petals API is essentially identical to Hugging Face's transformers, so I used the same loading codepath. One complication is that Petals usually loads models from the network using the transformers download API and stores them in ~/.cache/huggingface/hub. I didn't investigate pre-downloading the model manually too deeply; I just patched the checks so that the downloading works. I was getting errors when trying to run webui+petals without --cpu and without a GPU in Colab, so I also moved the CPU check earlier. Another note: every HTTP request that Petals made was getting logged, which added up to a significant amount of output, so I suppressed that. You'll also notice that my Colab notebook uses %run server.py; this makes the model-downloading output much nicer, but then I noticed a warning about gradio's launch(debug=). Researching this, it seems debug=True is appropriate in this case.
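To illustrate the kind of log suppression described above (a sketch, not the actual patch in this PR; the logger names here are assumptions, since the PR only says the per-request logging was silenced):

```python
import logging

# Hypothetical logger names -- the PR doesn't say which logger emits the
# per-request HTTP messages, only that they were suppressed.
for name in ("urllib3", "hivemind"):
    logging.getLogger(name).setLevel(logging.WARNING)
```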

Petals has a "session" API, but I investigated it and it is not currently flexible enough to support most of the webui's commands, so I left the session unspecified, meaning a new inference session is set up with the servers every time you hit "generate". This adds a delay of a second or so before each request starts generating, sometimes much longer if it can't find enough servers or a server times out. Hitting the "stop" button does not interrupt the route-finding; I am not familiar enough with gradio to fix this.
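For reference, the session API mentioned above looks roughly like this, per Petals' published examples. Treat the details as an assumption; it is not what this PR does, since the PR leaves the session unspecified. Reusing one session is what would avoid paying the route-finding delay on every "generate".

```python
from transformers import AutoTokenizer
from petals import AutoDistributedModelForCausalLM

model_name = "bigscience/bloom-7b1-petals"  # illustrative model name
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoDistributedModelForCausalLM.from_pretrained(model_name)

# One session is negotiated with the swarm up front and reused across calls,
# instead of setting up a fresh inference route for every request.
with model.inference_session(max_length=512) as session:
    inputs = tokenizer("Hello,", return_tensors="pt")["input_ids"]
    outputs = model.generate(inputs, max_new_tokens=16, session=session)
    print(tokenizer.decode(outputs[0]))
```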

@shohamjac

The implementation looks great.
I was thinking of adding something like this below the loader option, so that it's a bit more GUI-based:
[screenshot of the proposed loader UI omitted]

I can add it once the PR is merged.

@Mathnerd314 (Author)

Updated for latest commit.

@gaborkukucska

Would be great to get this implemented in tgwui.

@oobabooga (Owner)

As much as this PR is perfect, I prefer to focus this repository on experimenting with local inference.

@oobabooga closed this Dec 4, 2023
@gaborkukucska

> As much as this PR is perfect, I prefer to focus this repository on experimenting with local inference.

Petals allows the creation of local swarms to run LLMs across multiple GPUs. I think Petals is very much worth it!
