[Feature request] Add a simple HTTP API server like in llama.cpp, with an OpenAI-like API #1

Closed
pythops opened this issue Feb 21, 2024 · 11 comments
Labels: Feature (New feature or request) · type:support (Support issues)

@pythops

pythops commented Feb 21, 2024

For more info, see:
https://github.com/ggerganov/llama.cpp/blob/master/examples/server/README.md
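
For concreteness, here is a minimal sketch of what such an endpoint might look like, mirroring llama.cpp's POST /completion route. cpp-httplib and nlohmann/json are assumed third-party dependencies, and RunGemma() is a hypothetical placeholder rather than gemma.cpp's actual generation API; an OpenAI-compatible /v1/chat/completions route would follow the same pattern with the chat-message schema.

```cpp
// Minimal sketch of a llama.cpp-style POST /completion endpoint around gemma.cpp.
// Assumptions: cpp-httplib (httplib.h) and nlohmann/json as third-party headers;
// RunGemma() is a hypothetical stand-in for gemma.cpp's real generation call.
#include <string>

#include "httplib.h"           // yhirose/cpp-httplib, single header
#include <nlohmann/json.hpp>   // nlohmann/json, single header

// Hypothetical wrapper: run the loaded Gemma model on a prompt and return the
// full generated text. The real model loading/tokenization lives in gemma.cpp.
std::string RunGemma(const std::string& prompt) {
  return "(generated text for: " + prompt + ")";  // mock output for the sketch
}

int main() {
  httplib::Server svr;

  // Request: {"prompt": "..."}  ->  Response: {"content": "..."}
  svr.Post("/completion", [](const httplib::Request& req, httplib::Response& res) {
    const auto body = nlohmann::json::parse(req.body, /*cb=*/nullptr,
                                            /*allow_exceptions=*/false);
    if (body.is_discarded() || !body.contains("prompt")) {
      res.status = 400;
      res.set_content(R"({"error":"missing prompt"})", "application/json");
      return;
    }
    const nlohmann::json reply = {
        {"content", RunGemma(body["prompt"].get<std::string>())}};
    res.set_content(reply.dump(), "application/json");
  });

  svr.listen("0.0.0.0", 8080);
  return 0;
}
```

A client would then POST {"prompt": "..."} to /completion and read back {"content": "..."}, matching the basic shape of the llama.cpp server linked above.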

@austinvhuang
Collaborator

Great suggestion! If others are interested, please add a +1 emoji reaction above and we'll prioritize this :)

@pythops
Author

pythops commented Feb 21, 2024

Just an update: llama.cpp added support for Gemma models:
ggerganov/llama.cpp#5631

@loretoparisi

loretoparisi commented Feb 21, 2024

Just an update: llama.cpp added support for Gemma models:

ggerganov/llama.cpp#5631

Also, with 💎Gemma in 🦙llama.cpp you get CUDA, NEON, and AMD GPU support!
And, in theory, it can run in the browser if you can compile to WASM.

@austinvhuang added the Feature (New feature or request) label on Feb 24, 2024
@omkar806

Adding API support like this would be great; these models can be used on CPU for smaller tasks.
+1 for this.

@zeerd
Contributor

zeerd commented Apr 24, 2024

I have a question: why use HTTP rather than WebSocket?

As far as I know, the answer tokens are generated one by one, and it seems HTTP has no way to send multiple responses for a single call. That means an HTTP server would need to gather the whole answer before sending it back.
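
One way plain HTTP can deliver multiple partial responses for a single call is chunked transfer encoding with server-sent events, which is also how the OpenAI API streams tokens. Below is a minimal sketch assuming cpp-httplib's chunked content provider; NextToken() is a hypothetical stand-in for pulling tokens out of gemma.cpp one at a time, replaying a canned sequence here so the sketch runs on its own.

```cpp
// Sketch: streaming tokens over plain HTTP with chunked transfer encoding,
// formatted as server-sent events. cpp-httplib is an assumed dependency;
// NextToken() is hypothetical and just replays a canned token sequence.
#include <string>
#include <vector>

#include "httplib.h"  // yhirose/cpp-httplib

// Hypothetical token source: returns false once generation is finished.
bool NextToken(std::string* token) {
  static const std::vector<std::string> demo = {"Hello", ", ", "world", "!"};
  static size_t i = 0;
  if (i >= demo.size()) return false;
  *token = demo[i++];
  return true;
}

int main() {
  httplib::Server svr;

  svr.Post("/completion", [](const httplib::Request&, httplib::Response& res) {
    // Each token is flushed to the client as its own SSE "data:" event, so the
    // caller sees partial output without waiting for the whole answer.
    res.set_chunked_content_provider(
        "text/event-stream",
        [](size_t /*offset*/, httplib::DataSink& sink) {
          std::string token;
          if (NextToken(&token)) {
            const std::string event = "data: " + token + "\n\n";
            sink.write(event.data(), event.size());
            return true;  // provider is called again for the next chunk
          }
          sink.done();    // close the stream
          return true;
        });
  });

  svr.listen("0.0.0.0", 8080);
  return 0;
}
```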

@ufownl
Contributor

ufownl commented Apr 26, 2024

I have a question: why use HTTP rather than WebSocket?

As far as I know, the answer tokens are generated one by one, and it seems HTTP has no way to send multiple responses for a single call. That means an HTTP server would need to gather the whole answer before sending it back.

WebSocket is more suitable for an instant-messenger-style UI but may not be ideal for other UI types. I also think it is better to integrate gemma.cpp as a module into a web backend framework than to implement the HTTP/WebSocket API directly.

Here is my WebSocket online demo solution; you can try it here or via this Kaggle notebook. In this solution, gemma.cpp is a module of OpenResty, which makes it easy to implement a WebSocket or HTTP API.

@Gopi-Uppari
Collaborator

Could you please confirm whether this issue is resolved for you by the above comment? If it is, please feel free to close the issue.

Thank you.

@leszko7

leszko7 commented Oct 16, 2024

ok app

@Zeenat30

Zeenat30 commented Oct 17, 2024 via email

@Zeenat30

Zeenat30 commented Oct 17, 2024 via email

@Gopi-Uppari
Collaborator

Closing this issue; please feel free to reopen if this is still a valid request. Thank you!
