
Bloomz Inference Server

This project implements an inference server for Bloomz Large Language Models (LLMs). The server allows for efficient and scalable inference of Bloomz models, making it suitable for various NLP applications.

Features

  • Flexible: Easily configurable to support different model sizes and configurations.
  • REST API: Provides a RESTful API for easy integration with other applications.

Installation

Prerequisites

  • Python 3.10 or higher
  • poetry package manager

Clone the Repository

git clone https://github.com/CreditMutuelArkea/llm-inference.git
cd llm-inference

Install Dependencies

poetry install

Usage

Configuration

Before running the server, you need to configure it and set some environment variables.

HUGGING_FACE_HUB_TOKEN="<YOUR HF HUB TOKEN>"

Running the Server

Each task uses a dedicated model. Several open-source models based on the Bloomz architecture are available; you will find the latest ones in Crédit Mutuel Arkéa's Hugging Face ODQA collection.

If you are using PyCharm, run configurations are provided in .run/**.yaml and should appear directly in the IDE; they use the smallest models.

Embedding server

Used to vectorize documents. Look for *-retriever models, then start the inference server as follows (smallest model):

python -m llm_inference --task EMBEDDING --port 8081 --model cmarkea/bloomz-560m-retriever-v2

Then go to http://localhost:8081/docs.
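
As a rough sketch, an embedding request might look like the Python snippet below. The /embed route and the inputs field are assumptions for illustration only; the authoritative request and response schemas are published at http://localhost:8081/docs.

import requests

# NOTE: the route and payload shape below are assumptions; check
# http://localhost:8081/docs for the actual schema of your server version.
response = requests.post(
    "http://localhost:8081/embed",
    json={"inputs": ["What is the capital of France?"]},
)
response.raise_for_status()
print(response.json())  # one embedding vector per input text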

Reranking / Scoring server

Used to rank several contexts against a given query. Look for *-reranking models, then start the inference server as follows (smallest model):

python -m llm_inference --task SCORING --port 8082 --model cmarkea/bloomz-560m-reranking

Then go to http://localhost:8082/docs.

Be sure to check the examples in the model card of the model you use to understand the meaning of the output labels. For instance, for cmarkea/bloomz-560m-reranking, a LABEL1 score close to 1 means that the context is highly similar to the query, as described in the model card.
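
As an illustrative sketch, a reranking request could look like the snippet below; the /score route and the field names are assumptions, so check http://localhost:8082/docs for the actual schema.

import requests

# NOTE: route and payload are placeholders; the real contract is documented
# at http://localhost:8082/docs.
response = requests.post(
    "http://localhost:8082/score",
    json={
        "query": "What are the branch opening hours?",
        "contexts": [
            "Our branches are open Monday to Friday, 9am to 5pm.",
            "The annual report is available as a PDF download.",
        ],
    },
)
response.raise_for_status()
print(response.json())  # per-context scores/labels; see the model card for their meaning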

Guardrail

Used to detect toxic responses, for instance: insult, obscene, sexual_explicit, identity_attack... Our guardrail models are published as *-guardrail models; start the inference server as follows (smallest model):

python -m llm_inference --task GUARDRAIL --port 8083 --model cmarkea/bloomz-560m-guardrail

Then go to http://localhost:8083/docs.
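
Again as a hedged sketch only, a guardrail request might resemble the snippet below; the /guardrail route and the inputs field are assumed, and the real endpoint is described at http://localhost:8083/docs.

import requests

# NOTE: route and payload are assumptions; consult http://localhost:8083/docs
# for the endpoints actually exposed by the GUARDRAIL task.
response = requests.post(
    "http://localhost:8083/guardrail",
    json={"inputs": ["An obviously harmless test sentence."]},
)
response.raise_for_status()
print(response.json())  # toxicity scores per category (insult, obscene, ...)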

API Endpoints

You can access the server's API documentation through the /docs endpoint.

Testing

To run the tests, use the following command:

pytest tests/

Contributing

We welcome contributions to this project! Please open an issue or submit a pull request on GitHub.

Guidelines

  • Fork the repository.
  • Create a new branch for your feature or bugfix.
  • Make your changes.
  • Ensure all tests pass.
  • Submit a pull request.

License

This project is licensed under the MIT License. See the LICENSE file for more details.

Acknowledgments

Contact

For any inquiries or support, open an issue on this repository.
