Commit

Use TheBloke quantizations instead of the base llama model or GPT4All by default. Still need to deal with Issue #192: output is lost in the wizard case if the prompting is not right.

pseudotensor committed May 30, 2023
1 parent 7cd071f commit e3307e7
Showing 9 changed files with 240 additions and 107 deletions.
3 changes: 2 additions & 1 deletion .env_gpt4all
@@ -4,7 +4,8 @@
model_name_gptj=ggml-gpt4all-j-v1.3-groovy.bin

# llama-cpp-python type, supporting version 3 quantization, here from locally built llama.cpp q4 v3 quantization
model_path_llama=./models/7B/ggml-model-q4_0.bin
# below uses prompt_type=wizard2
model_path_llama=WizardLM-7B-uncensored.ggmlv3.q8_0.bin
# below assumes max_new_tokens=256
n_ctx=1792
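
The `n_ctx` value appears to be sized so that the truncated prompt plus `max_new_tokens` generated tokens fits within LLaMa's 2048-token context window (an assumption; the commit does not spell this out):
```bash
# Assumed sizing: prompt budget (n_ctx) + max_new_tokens <= 2048 (LLaMa context length).
LLAMA_CONTEXT=2048
MAX_NEW_TOKENS=256
echo "n_ctx=$((LLAMA_CONTEXT - MAX_NEW_TOKENS))"   # prints n_ctx=1792
```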

11 changes: 8 additions & 3 deletions FAQ.md
@@ -258,8 +258,13 @@ etc.

### CPU with no AVX2 or using LLaMa.cpp

GPT4All-based models require AVX2, unless one recompiles that project on your system. Until then, use llama.cpp models instead,
e.g. by compiling llama.cpp on your system following the [instructions](https://github.com/ggerganov/llama.cpp#build) and installing [llama-cpp-python](https://github.com/abetlen/llama-cpp-python), for example on Linux:
GPT4All-based models require AVX2, unless one recompiles that project on your system. Until then, use llama.cpp models instead.

We therefore recommend downloading version 3 quantized ggml files from [TheBloke](https://huggingface.co/TheBloke), which work with the latest llama.cpp. See the main [README.md](README.md#cpu).
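
For example, one possible way to fetch such a file (assuming the `WizardLM-7B-uncensored.ggmlv3.q8_0.bin` file name used in `.env_gpt4all`):
```bash
# Download a version 3 quantized ggml model from TheBloke on Hugging Face
# (file name assumed to match the model_path_llama entry in .env_gpt4all):
wget https://huggingface.co/TheBloke/WizardLM-7B-uncensored-GGML/resolve/main/WizardLM-7B-uncensored.ggmlv3.q8_0.bin
```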

The example below uses the base LLaMa model, which is not instruct-tuned and so is not recommended for chatting; it only illustrates how to quantize a model yourself if you are an expert.

Compile llama.cpp on your system by following the [instructions](https://github.com/ggerganov/llama.cpp#build) and install [llama-cpp-python](https://github.com/abetlen/llama-cpp-python), e.g. for Linux:
```bash
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
@@ -295,7 +300,7 @@ python convert.py models/7B/
# test by running the inference
./main -m ./models/7B/ggml-model-q4_0.bin -n 128
```
then adding an entry in the .env file like (assumes version 3 quantization)
then adding an entry in the `.env_gpt4all` file like (assumes version 3 quantization)
```.env_gpt4all
# model path and model_kwargs
model_path_llama=./models/7B/ggml-model-q4_0.bin
31 changes: 22 additions & 9 deletions README.md
@@ -78,7 +78,7 @@ Also check out [H2O LLM Studio](https://github.com/h2oai/h2o-llmstudio) for our

### Getting Started

For help installing a Python 3.10 environment, see [Install Python 3.10 Environment](INSTALL.md#install-python-environment)
First, one needs a Python 3.10 environment; for help installing one, see [Install Python 3.10 Environment](INSTALL.md#install-python-environment)

#### GPU (CUDA)

@@ -92,6 +92,12 @@ python generate.py --base_model=h2oai/h2ogpt-oig-oasst1-512-6_9b --load_8bit=Tru
```
Then point your browser at http://0.0.0.0:7860 (Linux) or http://localhost:7860 (Windows/Mac), or at the public live URL printed by the server (disable the shared link with `--share=False`). For 4-bit or 8-bit support, older GPUs may require older bitsandbytes installed as `pip uninstall bitsandbytes -y ; pip install bitsandbytes==0.38.1`.

Note that if you download the model yourself and point `--base_model` to that location, you'll need to specify the `prompt_type` as well by running:
```bash
python generate.py --base_model=<user path> --load_8bit=True --prompt_type=human_bot
```
for some user path `<user path>`.
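
As a concrete sketch (the local directory name is just an illustration; any instruct-tuned model with its matching `prompt_type` works the same way):
```bash
# Fetch the weights locally with git-lfs, then point --base_model at that directory
# and pass the prompt_type that matches the model (human_bot for h2oGPT models).
git lfs install
git clone https://huggingface.co/h2oai/h2ogpt-oig-oasst1-512-6_9b
python generate.py --base_model=./h2ogpt-oig-oasst1-512-6_9b --load_8bit=True --prompt_type=human_bot
```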

For quickly using a private document collection for Q/A, place documents (PDFs, text, etc.) into a folder called `user_path` and run
```bash
pip install -r requirements_optional_langchain.txt
@@ -112,7 +118,7 @@ Any other instruct-tuned base models can be used, including non-h2oGPT ones. [L

CPU support is obtained after installing two optional requirements.txt files. GPU support is also present if one has GPUs.

1) Install base, langchain, and GPT4All dependencies:
1) Install base, langchain, GPT4All, and Python LLaMa dependencies:
```bash
git clone https://github.com/h2oai/h2ogpt.git
cd h2ogpt
@@ -125,25 +131,32 @@ One can run `make req_constraints.txt` to ensure that the constraints file is co

2. Change `.env_gpt4all` model name if desired.
```.env_gpt4all
# model path and model_kwargs
model_path_llama=WizardLM-7B-uncensored.ggmlv3.q8_0.bin
model_path_gptj=ggml-gpt4all-j-v1.3-groovy.bin
model_name_gpt4all_llama=ggml-wizardLM-7B.q4_2.bin
```
For `gptj` and `gpt4all_llama`, you can choose a model other than our default by going to the GPT4All Model explorer ([GPT4All-J compatible models](https://gpt4all.io/index.html)). One does not need to download manually; the gpt4all package will download the model at runtime and put it into `.cache`, much as huggingface does. However, the `gptj` model often gives [no output](FAQ.md#gpt4all-not-producing-output), even outside h2oGPT.

So, for chatting, a better instruct fine-tuned LLaMa-based model for llama.cpp can be downloaded from [TheBloke](https://huggingface.co/TheBloke), for example [13B WizardLM Quantized](https://huggingface.co/TheBloke/wizardLM-13B-1.0-GGML) or [7B WizardLM Quantized](https://huggingface.co/TheBloke/WizardLM-7B-uncensored-GGML). TheBloke offers a variety of model types, quantization bit depths, and memory footprints; choose what is best for your system's specs.

Then one sets `model_path_llama` in the `.env_gpt4all` file and runs; for example, for `WizardLM-7B-uncensored.ggmlv3.q8_0.bin`:
```bash
python generate.py --base_model='llama' --prompt_type=wizard2 --langchain_mode=UserData --user_path=user_path
```
You can choose a different model than our default choice by going to the GPT4All Model explorer ([GPT4All-J compatible models](https://gpt4all.io/index.html)). One does not need to download manually; the gpt4all package will download at runtime and put it into `.cache`, much as huggingface does.
See [llama.cpp](https://github.com/ggerganov/llama.cpp) for instructions on getting a model for the `--base_model=llama` case.
This is the recommended way to run, because currently some GPT-J models have issues with [no output](FAQ.md#gpt4all-not-producing-output).
See [llama.cpp](https://github.com/ggerganov/llama.cpp) for instructions on quantizing your own models for the `--base_model=llama` case, but for non-expert users we recommend downloading models only from TheBloke.
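
For reference, the expert route of quantizing a base model with llama.cpp looks roughly like this (a sketch only; exact scripts and flags depend on the llama.cpp version, so follow its README):
```bash
# Rough sketch (not the recommended path for most users): convert original weights
# to a ggml f16 file, then quantize it; see the llama.cpp README for exact steps.
python convert.py models/7B/                                               # writes models/7B/ggml-model-f16.bin
./quantize ./models/7B/ggml-model-f16.bin ./models/7B/ggml-model-q4_0.bin q4_0
```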

3. Run generate.py

For LangChain support using documents in the `user_path` folder, run h2oGPT like:
```bash
python generate.py --base_model=gptj --score_model=None --langchain_mode='UserData' --user_path=user_path
python generate.py --base_model='llama' --prompt_type=wizard2 --score_model=None --langchain_mode='UserData' --user_path=user_path
```
See [LangChain Readme](README_LangChain.md) for more details.
For no langchain support (the LangChain package is still used as a model wrapper), run as:
```bash
python generate.py --base_model=gptj --score_model=None
python generate.py --base_model='llama' --prompt_type=wizard2 --score_model=None
```
However, the `gptj` model often gives [no output](FAQ.md#gpt4all-not-producing-output), even outside h2oGPT, so we recommend using a [llama.cpp](FAQ.md#cpu-with-no-avx2-or-using-llamacpp)-based model,
although such models perform much worse than standard non-quantized models.

#### MACOS
