Update README.md #89

- [Architecture](#architecture)
- [Quick start](#quick-start)
- [Dependencies](#dependencies)
- [Running the service](#running-the-service)
- [Demo service](#demo-service)
- [Standard service](#standard-service)
- [Using the service](#using-the-service)
- [Write cache](#write-cache)
- [Query cache](#query-cache)
- [Clear cache](#clear-cache)
- [Contributing](#contributing)

## News

- 🔥🔥[2025.06.28] Added a WebSocket-based API, an in-memory cache, multiprocessing-based embedding with a configurable number of workers, bulk-insert support in the backend, Python 3.12 support, and major performance improvements.
- 🔥🔥[2024.10.22] Added tasks for 1024 developer day.
- 🔥🔥[2024.04.09] Added Redis Search to store and retrieve embeddings in multi-tenant. This can reduce the interaction time between Cache and vector databases to 10ms.
- 🔥🔥[2023.12.10] Integrated LLM embedding frameworks such as 'llmEmb', 'ONNX', 'PaddleNLP', 'FastText', and the image embedding framework 'timm' to bolster embedding functionality.

### Introduction

Codefuse-ModelCache is a standalone semantic cache for large language models (LLMs).\
By caching pre-generated model results, it reduces response time for similar requests and improves user experience. <br />This project aims to optimize services by introducing a caching mechanism. It helps businesses and research institutions reduce the cost of inference deployment, improve model performance and efficiency, and provide scalable services for large models. Through open-source, we aim to share and exchange technologies related to large model semantic cache.

## Architecture

![modelcache modules](docs/modelcache_modules_20240409.png)

## Quick start

You can find the start scripts at the root of the repository.\
The standard services require MySQL and Milvus configuration; the quick-test services use SQLite and FAISS and need no database configuration.\
The quick-test services have `_demo` at the end of the file name.

### Dependencies

```shell
pip install -r requirements.txt
```

## Running the service

### Demo service

Navigate to the root of the repository and run one of the following:
- `python flask4modelcache_demo.py`
- `python fastapi4modelcache_demo.py`
- `python websocket4modelcache_demo.py`

### Standard service

The standard services need MySQL and Milvus. You can run the databases with docker-compose or install them manually on your machine.

1. Download the embedding model bin file from [Hugging Face](https://huggingface.co/shibing624/text2vec-base-chinese/tree/main). Place it in the `model/text2vec-base-chinese` folder.
2. Set up the databases using one of the two options below.

#### Starting databases using docker-compose

Navigate to the root of the repository and run:
```shell
docker network create modelcache
docker-compose up -d
```

#### Manual database install
1. Install MySQL and import the SQL file from `reference_doc/create_table.sql`.
2. Install the vector database Milvus.
3. Configure database access in:
- `modelcache/config/milvus_config.ini`
- `modelcache/config/mysql_config.ini`


---

After installing and running the databases, start a backend service of your choice:
- `python flask4modelcache.py`
- `python fastapi4modelcache.py`
- `python websocket4modelcache.py`

## Using the service

The service provides three core functionalities: Cache-Writing, Cache-Querying, and Cache-Clearing.\
The service supports both a RESTful API and a WebSocket API.

- RESTful API: `flask4modelcache.py` and `fastapi4modelcache.py`
- WebSocket API: `websocket4modelcache.py`

### RESTful API
#### Write cache

```json
{
  "type": "insert",
  "scope": {
    "model": "CODEGPT-1008"
  },
  "chat_info": [
    {
      "query": [
        {
          "role": "user",
          "content": "Who are you?"
        },
        {
          "role": "system",
          "content": "You are an AI code assistant and you must provide neutral and harmless answers to help users solve code-related problems."
        }
      ],
      "answer": "Hello, I am an intelligent assistant. How can I assist you?"
    }
  ]
}
```
Code example
```python
import json
import requests
url = 'http://127.0.0.1:5000/modelcache'
type = 'insert'
scope = {"model": "CODEGPT-1008"}
chat_info = [{"query": [{"role": "system", "content": "You are an AI code assistant and you must provide neutral and harmless answers to help users solve code-related problems."}, {"role": "user", "content": "你是谁?"}],
"answer": "Hello, I am an intelligent assistant. How can I assist you?"}]
chat_info = [{"query": [{"role": "system", "content": "You are an AI code assistant and you must provide neutral and harmless answers to help users solve code-related problems."}, {"role": "user", "content": "Who are you?"}],"answer": "Hello, I am an intelligent assistant. How can I assist you?"}]
data = {'type': type, 'scope': scope, 'chat_info': chat_info}

headers = {"Content-Type": "application/json"}
res = requests.post(url, headers=headers, json=json.dumps(data))
```


---

#### Query cache
```json
{
  "type": "query",
  "scope": {
    "model": "CODEGPT-1008"
  },
  "query": [
    {
      "role": "user",
      "content": "Who are you?"
    },
    {
      "role": "system",
      "content": "You are an AI code assistant and you must provide neutral and harmless answers to help users solve code-related problems."
    }
  ]
}
```
Code example
```python
import json
import requests
url = 'http://127.0.0.1:5000/modelcache'
type = 'query'
scope = {"model": "CODEGPT-1008"}
query = [{"role": "system", "content": "You are an AI code assistant and you must provide neutral and harmless answers to help users solve code-related problems."}, {"role": "user", "content": "Who are you?"}]
data = {'type': type, 'scope': scope, 'query': query}

headers = {"Content-Type": "application/json"}
res = requests.post(url, headers=headers, json=json.dumps(data))
```


---

#### Clear cache
```json
{
  "type": "remove",
  "scope": {
    "model": "CODEGPT-1008"
  },
  "remove_type": "truncate_by_model"
}
```
Code example
```python
import json
import requests
url = 'http://127.0.0.1:5000/modelcache'
type = 'remove'
scope = {"model": "CODEGPT-1008"}
remove_type = 'truncate_by_model'
data = {'type': type, 'scope': scope, 'remove_type': remove_type}

headers = {"Content-Type": "application/json"}
res = requests.post(url, headers=headers, json=json.dumps(data))
```

### WebSocket API

The WebSocket API is inherently asynchronous, so each request is wrapped in an envelope that carries a request ID, which lets the client track it.\
The service returns a response with the same request ID that was sent in the request. A minimal client sketch is shown after the payload examples below.

#### Write cache
```json
{
  "requestId": "943e9450-3467-4d73-9b32-68a337691f6d",
  "payload": {
    "type": "insert",
    "scope": {
      "model": "CODEGPT-1008"
    },
    "chat_info": [
      {
        "query": [
          {
            "role": "user",
            "content": "Who are you?"
          },
          {
            "role": "system",
            "content": "You are an AI code assistant and you must provide neutral and harmless answers to help users solve code-related problems."
          }
        ],
        "answer": "Hello, I am an intelligent assistant. How can I assist you?"
      }
    ]
  }
}
```

#### Query cache
```json
{
  "requestId": "51f00484-acc9-406f-807d-29fba672473e",
  "payload": {
    "type": "query",
    "scope": {
      "model": "CODEGPT-1008"
    },
    "query": [
      {
        "role": "user",
        "content": "Who are you?"
      },
      {
        "role": "system",
        "content": "You are an AI code assistant and you must provide neutral and harmless answers to help users solve code-related problems."
      }
    ]
  }
}
```

#### Clear cache
```json
{
  "requestId": "f96bbc87-5ef9-4161-9e96-3076ca97b4b9",
  "payload": {
    "type": "remove",
    "scope": {
      "model": "CODEGPT-1008"
    },
    "remove_type": "truncate_by_model"
  }
}
```
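
For illustration, here is a minimal asynchronous client sketch that sends the query payload above and waits for the matching response. It uses the third-party `websockets` package, and the endpoint URI is an assumption — check `websocket4modelcache.py` (or its `_demo` variant) for the actual host, port, and path.

```python
import asyncio
import json
import uuid

import websockets  # pip install websockets


async def query_cache() -> dict:
    # Assumed endpoint; adjust host, port, and path to match websocket4modelcache.py
    uri = "ws://127.0.0.1:5000/modelcache"
    request_id = str(uuid.uuid4())
    message = {
        "requestId": request_id,
        "payload": {
            "type": "query",
            "scope": {"model": "CODEGPT-1008"},
            "query": [
                {"role": "user", "content": "Who are you?"},
                {"role": "system", "content": "You are an AI code assistant and you must provide neutral and harmless answers to help users solve code-related problems."}
            ]
        }
    }
    async with websockets.connect(uri) as ws:
        await ws.send(json.dumps(message))
        # Responses are asynchronous, so keep reading until our requestId comes back
        while True:
            response = json.loads(await ws.recv())
            if response.get("requestId") == request_id:
                return response


if __name__ == "__main__":
    print(asyncio.run(query_cache()))
```

Matching on `requestId` rather than taking the first incoming message keeps the client correct when several requests are in flight.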

## Function comparison

We've implemented several key updates to our repository. We've resolved network issues with Hugging Face and improved inference speed by introducing local embedding capabilities. Due to limitations in SqlAlchemy, we've redesigned our relational database interaction module for more flexible operations. We've added multi-tenancy support to ModelCache, recognizing the need for multiple users and models in LLM products. Lastly, we've made initial adjustments for better compatibility with system commands and multi-turn dialogues.

## Features

In ModelCache, we incorporated the core principles of GPTCache.\
ModelCache has four modules: adapter, embedding, similarity, and data_manager.

- The adapter module orchestrates the business logic for various tasks, integrating the embedding, similarity, and data_manager modules.
- The embedding module converts text into semantic vector representations, and transforms user queries into vectors.

To make ModelCache more suitable for industrial use, we made several improvements:
- Embedded into LLM products using a Redis-like caching mode
- Provided semantic caching without interfering with LLM calls, security audits, and other functions
- Compatible with all LLM services
- [x] Multiprocessing-based embedding:
- True parallel embedding, serving multiple requests at once
- Highly scalable, supports configuring the number of embedding workers.
- Enables efficient use of available computing resources
- [x] Multiple model loading:
- Supported local embedding model loading, and resolved Hugging Face network connectivity issues
- Supported loading embedding layers from various pre-trained models

### Adapter

- [x] Register adapter for Milvus: Based on the "model" parameter in the scope, initialize the corresponding Collection and perform the load operation.

### Embedding model & inference


### Service

- [x] Supports FastAPI.
- [ ] Add visual interface to offer a more direct user experience.

## Acknowledgements