Merge pull request #126 from rubra-ai/readme-getstart
update readme for the run-model-locally section
sanjay920 authored Jul 6, 2024
2 parents d47e062 + 992dee7 commit 88ce92a
Showing 2 changed files with 9 additions and 4 deletions.
README.md: 7 changes (5 additions, 2 deletions)
@@ -29,10 +29,13 @@ Try out the models immediately without downloading anything in Our [Huggingface

## Run Rubra Models Locally

Check out our [documentation](https://docs.rubra.ai/category/serving--inferencing) to learn how to run Rubra models locally.
We extend the following inferencing tools to run Rubra models in an OpenAI-compatible tool-calling format for local use:

- [llama.cpp](https://github.com/ggerganov/llama.cpp)
- [vllm](https://github.com/vllm-project/vllm)
- [llama.cpp](https://github.com/rubra-ai/tools.cpp)
- [vLLM](https://github.com/rubra-ai/vllm)

**Note**: Llama3 models, including the 8B and 70B variants, are known to suffer increased perplexity and degraded function-calling performance when quantized. We recommend either serving them with vLLM or using the fp16 quantization.
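
Since the extended servers speak the OpenAI tool-calling format, a stock OpenAI client should be able to exercise function calling against a locally served Rubra model. The sketch below is illustrative only: the base URL, port, API key, model name, and `get_current_weather` schema are assumptions rather than anything defined in this repository, so substitute whatever your local tools.cpp or vLLM server actually exposes.

```python
# Illustrative sketch only: endpoint, model name, and tool schema are placeholders.
from openai import OpenAI

# Point the client at the local OpenAI-compatible server (assumed to listen on port 8000).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

# A hypothetical tool definition in the standard OpenAI function-calling schema.
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_current_weather",
            "description": "Get the current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {"type": "string", "description": "City name"}
                },
                "required": ["city"],
            },
        },
    }
]

response = client.chat.completions.create(
    model="rubra-model",  # placeholder; use the model id your server reports
    messages=[{"role": "user", "content": "What's the weather in Paris right now?"}],
    tools=tools,
)

# If the model decides to call the tool, the call arrives here in OpenAI format.
print(response.choices[0].message.tool_calls)
```

Any client library that understands the OpenAI chat-completions API should work the same way, since only the base URL differs from the hosted service.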

## Benchmark

docs/docs/README.md: 6 changes (4 additions, 2 deletions)
@@ -38,8 +38,10 @@ Try out the models immediately without downloading anything in [Huggingface Spac

We extend the following inferencing tools to run Rubra models in an OpenAI-compatible tool-calling format for local use:

- [llama.cpp](https://github.com/ggerganov/llama.cpp)
- [vllm](https://github.com/vllm-project/vllm)
- [llama.cpp](https://github.com/rubra-ai/tools.cpp)
- [vLLM](https://github.com/rubra-ai/vllm)

**Note**: Llama3 models, including the 8B and 70B variants, are known to suffer increased perplexity and degraded function-calling performance when quantized. We recommend either serving them with vLLM or using the fp16 quantization.
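
To round out the picture, here is a hedged sketch of the rest of the tool-calling loop: running the requested function yourself and returning its output as a `tool` message so the model can compose a final answer. As above, the endpoint, model id, tool schema, and weather stub are assumptions for illustration, not part of this repository.

```python
# Illustrative sketch only: endpoint, model name, tool schema, and weather stub are placeholders.
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
model = "rubra-model"  # placeholder model id

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_current_weather",
            "description": "Get the current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
]

messages = [{"role": "user", "content": "What's the weather in Paris right now?"}]
first = client.chat.completions.create(model=model, messages=messages, tools=tools)
assistant_msg = first.choices[0].message

if assistant_msg.tool_calls:
    call = assistant_msg.tool_calls[0]
    args = json.loads(call.function.arguments)
    # Stand-in for a real weather lookup.
    result = {"city": args.get("city", "unknown"), "temperature_c": 21}

    # Feed the assistant's tool call and our tool output back to the model.
    messages.append(assistant_msg)
    messages.append(
        {"role": "tool", "tool_call_id": call.id, "content": json.dumps(result)}
    )
    final = client.chat.completions.create(model=model, messages=messages, tools=tools)
    print(final.choices[0].message.content)
else:
    print(assistant_msg.content)
```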

## Contributing

