# guidance-rpc

A very simple server to run guidance programs over http. Supports health checking and reflection.
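Because the server exposes the standard gRPC health and reflection services, you can poke at a running instance with a generic client such as grpcurl. A sketch, assuming the defaults used in the examples below (plaintext gRPC on port 50051):

```shell
# List services exposed via server reflection
grpcurl -plaintext localhost:50051 list

# Query the standard gRPC health checking service
grpcurl -plaintext localhost:50051 grpc.health.v1.Health/Check
```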
## Goals

Run guidance programs over http in a reliable and performant way.

- Run simple programs consisting of `gen` + prompt text
- Streaming
- Logging (no idea why this is not working.)
- Error handling
- Guidance programs with async steps
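To make the "`gen` + prompt text" shape concrete, here is a toy stand-in in plain Python (not the actual guidance API): a program is literal prompt text with named `gen` slots, and each slot is filled by calling the model on the prompt accumulated so far. The `fake_model` is a placeholder for illustration only.

```python
import re

def run_program(template: str, model):
    """Toy stand-in for a guidance-style program: literal prompt text
    with {{gen 'name'}} slots that the model fills in."""
    results = {}

    def fill(match: re.Match) -> str:
        name = match.group(1)
        # The model sees the prompt text that precedes this slot.
        results[name] = model(template[: match.start()])
        return results[name]

    completed = re.sub(r"\{\{gen '(\w+)'\}\}", fill, template)
    return completed, results

def fake_model(prompt: str) -> str:
    # A real server would run an LLM here; this just returns a canned answer.
    return "4"

text, captures = run_program("Q: What is 2+2?\nA: {{gen 'answer'}}", fake_model)
print(captures)  # {'answer': '4'}
```

The real guidance library does considerably more (streaming, async steps, constrained decoding); this only illustrates the program shape the server runs.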
## Non-Goals

- Support non-hugging-face models (including openai)
- Support windows (use wsl/docker/podman)
- CPU support (fixes going this direction are fine, it should not add complexity)
## Acceptable Contributions

- Improving my awful python
- Improve Dockerfile
- Add docker examples
- Bug fixes
- Documentation
- Tests
- Performance improvements (startup speed on larger models is a big one)
- Increasing the number of guidance programs that can be run
## Usage

```shell
podman run -e MODEL_NAME=gpt2 -p 50051:50051 --init --device=nvidia.com/gpu=all ghcr.io/utilityai/guidance-rpc:latest
```

## Development

Requires poetry to be installed.

```shell
poetry install
poetry run python src/main.py
```

### Docker

This should work almost 1-1 with docker:
- the `device` flag in `run` may be different
- the suffix `,z` on the `--mount` will not be required
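A docker equivalent of the podman commands below might look like the following sketch (`--gpus all` is the usual docker spelling for GPU access via the NVIDIA container toolkit, and the `,z` SELinux suffix is dropped):

```shell
docker run \
    -p 50051:50051 \
    -e MODEL_NAME=meta-llama/Llama-2-7b-hf \
    -e HF_TOKEN=hf_aaaaaaaaaaaaaaaaaaaaaaaaaa \
    --mount type=bind,src=$HOME/.cache/huggingface,dst=/root/.cache/huggingface \
    --init \
    --gpus all \
    ghcr.io/utilityai/guidance-rpc:latest
```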
```shell
podman run \
    -p 50051:50051 \
    -e MODEL_NAME=meta-llama/Llama-2-7b-hf \
    -e HF_TOKEN=hf_aaaaaaaaaaaaaaaaaaaaaaaaaa \
    --mount type=bind,src=$XDG_CONFIG_HOME/.cache/huggingface,dst=/root/.cache/huggingface,z \
    --init \
    --device=nvidia.com/gpu=all \
    ghcr.io/utilityai/guidance-rpc:latest
```

### Building

```shell
podman build -t guidance-rpc .
```

Run the locally built image:

```shell
podman run \
    -p 50051:50051 \
    -e MODEL_NAME=TheBloke/Llama-2-7b-Chat-GPTQ \
    -e CACHE=False \
    --mount type=bind,src=$HOME/.cache/huggingface,dst=/root/.cache/huggingface,z \
    --init \
    --device=nvidia.com/gpu=all \
    guidance-rpc
```

## Contributing

See Acceptable Contributions and Non-Goals above.
Generate grpc files with:

```shell
python -m grpc_tools.protoc -I protos --python_out=src --pyi_out=src --grpc_python_out=src protos/guidance.proto
```

If you update dependencies, run:

```shell
poetry update
```

and then:

```shell
poetry export -f requirements.txt --output requirements.txt
```