Update LLM readme #172

Merged 6 commits on Jun 13, 2024
66 changes: 66 additions & 0 deletions comps/llms/text-generation/ollama/README.md
@@ -0,0 +1,66 @@
# Introduction

[Ollama](https://github.com/ollama/ollama) allows you to run open-source large language models, such as Llama 3, locally. Ollama bundles model weights, configuration, and data into a single package defined by a Modelfile. It is a lightweight, extensible framework for building and running language models on a local machine, providing a simple API for creating, running, and managing models, as well as a library of pre-built models that can easily be used in a variety of applications. This makes it a good choice for deploying large language models locally on an AI PC.

# Get Started

## Setup

Follow [these instructions](https://github.com/ollama/ollama) to set up and run a local Ollama instance.

- Download and install Ollama on one of the supported platforms (including Windows).
- Fetch an LLM model via `ollama pull <name-of-model>`. View the list of available models via the model library and pull one to use locally, e.g. `ollama pull llama3` (see the example below).
- This will download the default tagged version of the model. Typically, the default points to the latest model with the smallest parameter size.
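
For example, to pull Llama 3 and confirm it is available locally (standard Ollama CLI commands):

```bash
# Pull the default llama3 tag, then list the models available locally
ollama pull llama3
ollama list
```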

Note: Special settings are necessary to pull models from behind a proxy.

```bash
sudo vim /etc/systemd/system/ollama.service
```

Add your proxy settings to the configuration file opened above.

```ini
[Service]
Environment="http_proxy=${your_proxy}"
Environment="https_proxy=${your_proxy}"
```
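
After saving the file, reload systemd and restart the Ollama service so the new proxy settings take effect (standard systemd commands, assuming Ollama was installed as a systemd service as above):

```bash
# Reload systemd unit files and restart Ollama to pick up the proxy settings
sudo systemctl daemon-reload
sudo systemctl restart ollama
```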

## Usage

Here are a few ways to interact with pulled local models:

### In the terminal

All of your local models are automatically served on `localhost:11434`. Run `ollama run <name-of-model>` to start interacting via the command line directly.
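
For example, to start an interactive session with the Llama 3 model pulled earlier:

```bash
# Open an interactive chat session with the llama3 model in the terminal
ollama run llama3
```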

### API access

Send an `application/json` request to Ollama's API endpoint to interact with a model.

```bash
curl http://localhost:11434/api/generate -d '{
"model": "llama3",
"prompt":"Why is the sky blue?"
}'
```
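
By default, `/api/generate` streams the response as a series of JSON objects. The Ollama API also accepts a `stream` flag if you prefer a single JSON response, for example:

```bash
# Ask for a single, non-streamed JSON response instead of a stream of objects
curl http://localhost:11434/api/generate -d '{
  "model": "llama3",
  "prompt": "Why is the sky blue?",
  "stream": false
}'
```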

# Build Docker Image

```bash
cd GenAIComps/
docker build --no-cache -t opea/llm-ollama:latest --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy -f comps/llms/text-generation/ollama/Dockerfile .
```

# Run the Ollama Microservice

```bash
docker run --network host opea/llm-ollama:latest
```
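
If the microservice follows the same conventions as the other LLM microservices in this repository, you can verify it is up before sending requests (a sketch, assuming the standard `/v1/health_check` endpoint on port 9000):

```bash
# Assumed health-check endpoint, mirroring the other LLM microservices
curl http://127.0.0.1:9000/v1/health_check \
  -X GET \
  -H 'Content-Type: application/json'
```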

# Consume the Ollama Microservice

```bash
curl http://127.0.0.1:9000/v1/chat/completions -X POST -d '{"query":"What is Deep Learning?","max_new_tokens":32,"top_k":10,"top_p":0.95,"typical_p":0.95,"temperature":0.01,"repetition_penalty":1.03,"streaming":true}' -H 'Content-Type: application/json'
```
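
For reference, a non-streaming variant of the same request, assuming the microservice accepts the same `streaming` flag used above:

```bash
# Same request, but ask for a single non-streamed response
curl http://127.0.0.1:9000/v1/chat/completions \
  -X POST \
  -d '{"query":"What is Deep Learning?","max_new_tokens":32,"top_k":10,"top_p":0.95,"typical_p":0.95,"temperature":0.01,"repetition_penalty":1.03,"streaming":false}' \
  -H 'Content-Type: application/json'
```
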
124 changes: 91 additions & 33 deletions comps/llms/text-generation/tgi/README.md
@@ -1,66 +1,124 @@
# TGI LLM Microservice

[Text Generation Inference](https://github.com/huggingface/text-generation-inference) (TGI) is a toolkit for deploying and serving Large Language Models (LLMs). TGI enables high-performance text generation for the most popular open-source LLMs, including Llama, Falcon, StarCoder, BLOOM, GPT-NeoX, and more.

# 🚀1. Start Microservice with Python (Option 1)

To start the LLM microservice, you need to install the required Python packages first.

## 1.1 Install Requirements

```bash
pip install -r requirements.txt
```

## 1.2 Start LLM Service

```bash
export HF_TOKEN=${your_hf_api_token}
export LANGCHAIN_TRACING_V2=true
export LANGCHAIN_API_KEY=${your_langchain_api_key}
export LANGCHAIN_PROJECT="opea/gen-ai-comps:llms"
docker run -p 8008:80 -v ./data:/data --name tgi_service --shm-size 1g ghcr.io/huggingface/text-generation-inference:1.4 --model-id ${your_hf_llm_model}
```
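
The first start downloads the model, which can take a while. You can follow the container logs to see when TGI is ready to serve requests (standard Docker command, using the container name from the command above):

```bash
# Follow the TGI container logs until the server reports it is listening
docker logs -f tgi_service
```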

## 1.3 Verify the TGI Service

```bash
curl http://${your_ip}:8008/generate \
-X POST \
-d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":17, "do_sample": true}}' \
-H 'Content-Type: application/json'
```

## 1.4 Start LLM Service with Python Script

```bash
export TGI_LLM_ENDPOINT="http://${your_ip}:8008"
python text-generation/tgi/llm.py
```

# 🚀2. Start Microservice with Docker (Option 2)

If you start the LLM microservice with Docker, the `docker_compose_llm.yaml` file will automatically start a TGI/vLLM service in Docker as well.

## 2.1 Setup Environment Variables

To start the TGI and LLM services, you need to set up the following environment variables first.

```bash
export HF_TOKEN=${your_hf_api_token}
export TGI_LLM_ENDPOINT="http://${your_ip}:8008"
export LLM_MODEL_ID=${your_hf_llm_model}
export LANGCHAIN_TRACING_V2=true
export LANGCHAIN_API_KEY=${your_langchain_api_key}
export LANGCHAIN_PROJECT="opea/llms"
```

## 2.2 Build Docker Image

```bash
cd ../../
docker build -t opea/llm-tgi:latest --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy -f comps/llms/text-generation/tgi/Dockerfile .
```

To start a Docker container, you have two options:

- A. Run Docker with CLI
- B. Run Docker with Docker Compose

You can choose one as needed.

## 2.3 Run Docker with CLI (Option A)

```bash
docker run -d --name="llm-tgi-server" -p 9000:9000 --ipc=host -e http_proxy=$http_proxy -e https_proxy=$https_proxy -e TGI_LLM_ENDPOINT=$TGI_LLM_ENDPOINT -e HF_TOKEN=$HF_TOKEN opea/llm-tgi:latest
```

## 2.4 Run Docker with Docker Compose (Option B)

```bash
cd text-generation/tgi
docker compose -f docker_compose_llm.yaml up -d
```
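
To confirm the containers started correctly, you can use the standard Docker Compose status and log commands, for example:

```bash
# Show the state of the services defined in the compose file
docker compose -f docker_compose_llm.yaml ps

# Follow the microservice logs
docker compose -f docker_compose_llm.yaml logs -f
```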

# 🚀3. Consume LLM Service

## 3.1 Check Service Status

```bash
curl http://${your_ip}:9000/v1/health_check \
-X GET \
-H 'Content-Type: application/json'
```

## 3.2 Consume LLM Service

You can set model parameters such as `max_new_tokens` and `streaming` according to your needs.

The `streaming` parameter determines the format of the data returned by the API: with `streaming=false` the API returns the complete text as a single string, while with `streaming=true` it returns the text as a streaming flow.

```bash
# non-streaming mode
curl http://${your_ip}:9000/v1/chat/completions \
-X POST \
-d '{"query":"What is Deep Learning?","max_new_tokens":17,"top_k":10,"top_p":0.95,"typical_p":0.95,"temperature":0.01,"repetition_penalty":1.03,"streaming":false}' \
-H 'Content-Type: application/json'

# streaming mode
curl http://${your_ip}:9000/v1/chat/completions \
-X POST \
-d '{"query":"What is Deep Learning?","max_new_tokens":17,"top_k":10,"top_p":0.95,"typical_p":0.95,"temperature":0.01,"repetition_penalty":1.03,"streaming":true}' \
-H 'Content-Type: application/json'
```

# 🚀4. Validated Models

| Model | TGI-Gaudi |
| ------------------------- | --------- |
| Intel/neural-chat-7b-v3-3 | ✓ |
| Llama-2-7b-chat-hf | ✓ |
| Llama-2-70b-chat-hf | ✓ |
| Meta-Llama-3-8B-Instruct | ✓ |
| Meta-Llama-3-70B-Instruct | ✓ |
| Phi-3 | x |