Commit

Use TheBloke quantizations instead of the base llama model or GPT4All by default. Still need to deal with Issue #192: output is lost in the wizard case if the prompting is not right.

pseudotensor committed May 30, 2023
1 parent 7cd071f commit e3307e7
Showing 9 changed files with 240 additions and 107 deletions.
3 changes: 2 additions & 1 deletion .env_gpt4all
@@ -4,7 +4,8 @@
model_name_gptj=ggml-gpt4all-j-v1.3-groovy.bin

# llama-cpp-python type, supporting version 3 quantization, here from locally built llama.cpp q4 v3 quantization
model_path_llama=./models/7B/ggml-model-q4_0.bin
# below uses prompt_type=wizard2
model_path_llama=WizardLM-7B-uncensored.ggmlv3.q8_0.bin
# below assumes max_new_tokens=256
n_ctx=1792
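
The `n_ctx` value appears to be sized so that the truncated prompt plus `max_new_tokens` generated tokens fits within LLaMa's 2048-token context window (an assumption; the commit does not spell this out):
```bash
# Assumed sizing: prompt budget (n_ctx) + max_new_tokens <= 2048 (LLaMa context length).
LLAMA_CONTEXT=2048
MAX_NEW_TOKENS=256
echo "n_ctx=$((LLAMA_CONTEXT - MAX_NEW_TOKENS))"   # prints n_ctx=1792
```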

11 changes: 8 additions & 3 deletions FAQ.md
@@ -258,8 +258,13 @@ etc.

### CPU with no AVX2 or using LLaMa.cpp

GPT4All-based models require AVX2, unless one recompiles that project on your system. Until then, use llama.cpp models instead,
e.g. by compiling llama.cpp on your system following the [instructions](https://github.com/ggerganov/llama.cpp#build) and installing [llama-cpp-python](https://github.com/abetlen/llama-cpp-python), for example on Linux:
GPT4All-based models require AVX2, unless one recompiles that project on your system. Until then, use llama.cpp models instead.

We therefore recommend downloading version 3 quantized ggml files from [TheBloke](https://huggingface.co/TheBloke), which work with the latest llama.cpp. See the main [README.md](README.md#cpu).
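
For example, one possible way to fetch such a file (assuming the `WizardLM-7B-uncensored.ggmlv3.q8_0.bin` file name used in `.env_gpt4all`):
```bash
# Download a version 3 quantized ggml model from TheBloke on Hugging Face
# (file name assumed to match the model_path_llama entry in .env_gpt4all):
wget https://huggingface.co/TheBloke/WizardLM-7B-uncensored-GGML/resolve/main/WizardLM-7B-uncensored.ggmlv3.q8_0.bin
```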

The example below uses the base LLaMa model, which is not instruct-tuned and so is not recommended for chatting; it only illustrates how to quantize a model yourself if you are an expert.

Compile llama.cpp on your system by following the [instructions](https://github.com/ggerganov/llama.cpp#build) and install [llama-cpp-python](https://github.com/abetlen/llama-cpp-python), e.g. for Linux:
```bash
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
@@ -295,7 +300,7 @@ python convert.py models/7B/
# test by running the inference
./main -m ./models/7B/ggml-model-q4_0.bin -n 128
```
then adding an entry in the .env file like (assumes version 3 quantization)
then adding an entry in the `.env_gpt4all` file like (assumes version 3 quantization)
```.env_gpt4all
# model path and model_kwargs
model_path_llama=./models/7B/ggml-model-q4_0.bin
31 changes: 22 additions & 9 deletions README.md
@@ -78,7 +78,7 @@ Also check out [H2O LLM Studio](https://github.com/h2oai/h2o-llmstudio) for our

### Getting Started

For help installing a Python 3.10 environment, see [Install Python 3.10 Environment](INSTALL.md#install-python-environment)
First, one needs a Python 3.10 environment; for help installing one, see [Install Python 3.10 Environment](INSTALL.md#install-python-environment)

#### GPU (CUDA)

@@ -92,6 +92,12 @@ python generate.py --base_model=h2oai/h2ogpt-oig-oasst1-512-6_9b --load_8bit=Tru
```
Then point your browser at http://0.0.0.0:7860 (Linux) or http://localhost:7860 (Windows/Mac), or at the public live URL printed by the server (disable the shared link with `--share=False`). For 4-bit or 8-bit support, older GPUs may require older bitsandbytes installed as `pip uninstall bitsandbytes -y ; pip install bitsandbytes==0.38.1`.

Note that if you download the model yourself and point `--base_model` to that location, you'll need to specify the `prompt_type` as well by running:
```bash
python generate.py --base_model=<user path> --load_8bit=True --prompt_type=human_bot
```
for some user path `<user path>`.
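
As a concrete sketch (the local directory name is just an illustration; any instruct-tuned model with its matching `prompt_type` works the same way):
```bash
# Fetch the weights locally with git-lfs, then point --base_model at that directory
# and pass the prompt_type that matches the model (human_bot for h2oGPT models).
git lfs install
git clone https://huggingface.co/h2oai/h2ogpt-oig-oasst1-512-6_9b
python generate.py --base_model=./h2ogpt-oig-oasst1-512-6_9b --load_8bit=True --prompt_type=human_bot
```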

For quickly using a private document collection for Q/A, place documents (PDFs, text, etc.) into a folder called `user_path` and run
```bash
pip install -r requirements_optional_langchain.txt
@@ -112,7 +118,7 @@ Any other instruct-tuned base models can be used, including non-h2oGPT ones. [L

CPU support is obtained after installing two optional requirements.txt files. GPU support is also present if one has GPUs.

1) Install base, langchain, and GPT4All dependencies:
1) Install base, langchain, GPT4All, and Python LLaMa dependencies:
```bash
git clone https://github.com/h2oai/h2ogpt.git
cd h2ogpt
@@ -125,25 +131,32 @@ One can run `make req_constraints.txt` to ensure that the constraints file is co

2. Change `.env_gpt4all` model name if desired.
```.env_gpt4all
# model path and model_kwargs
model_path_llama=WizardLM-7B-uncensored.ggmlv3.q8_0.bin
model_path_gptj=ggml-gpt4all-j-v1.3-groovy.bin
model_name_gpt4all_llama=ggml-wizardLM-7B.q4_2.bin
```
For `gptj` and `gpt4all_llama`, you can choose a model other than our default by going to the GPT4All Model explorer ([GPT4All-J compatible models](https://gpt4all.io/index.html)). One does not need to download manually; the gpt4all package will download the model at runtime and put it into `.cache`, much as huggingface does. However, the `gptj` model often gives [no output](FAQ.md#gpt4all-not-producing-output), even outside h2oGPT.

So, for chatting, a better instruct fine-tuned LLaMa-based model for llama.cpp can be downloaded from [TheBloke](https://huggingface.co/TheBloke), for example [13B WizardLM Quantized](https://huggingface.co/TheBloke/wizardLM-13B-1.0-GGML) or [7B WizardLM Quantized](https://huggingface.co/TheBloke/WizardLM-7B-uncensored-GGML). TheBloke offers a variety of model types, quantization bit depths, and memory footprints; choose what is best for your system's specs.

Then one sets `model_path_llama` in the `.env_gpt4all` file and runs; for example, for `WizardLM-7B-uncensored.ggmlv3.q8_0.bin`:
```bash
python generate.py --base_model='llama' --prompt_type=wizard2 --langchain_mode=UserData --user_path=user_path
```
You can choose a different model than our default choice by going to the GPT4All Model explorer ([GPT4All-J compatible models](https://gpt4all.io/index.html)). One does not need to download manually; the gpt4all package will download at runtime and put it into `.cache`, much as huggingface does.
See [llama.cpp](https://github.com/ggerganov/llama.cpp) for instructions on getting a model for the `--base_model=llama` case.
This is the recommended way to run, because currently some GPT-J models have issues with [no output](FAQ.md#gpt4all-not-producing-output).
See [llama.cpp](https://github.com/ggerganov/llama.cpp) for instructions on quantizing your own models for the `--base_model=llama` case, but for non-expert users we recommend downloading models only from TheBloke.
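
For reference, the expert route of quantizing a base model with llama.cpp looks roughly like this (a sketch only; exact scripts and flags depend on the llama.cpp version, so follow its README):
```bash
# Rough sketch (not the recommended path for most users): convert original weights
# to a ggml f16 file, then quantize it; see the llama.cpp README for exact steps.
python convert.py models/7B/                                               # writes models/7B/ggml-model-f16.bin
./quantize ./models/7B/ggml-model-f16.bin ./models/7B/ggml-model-q4_0.bin q4_0
```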

3. Run generate.py

For LangChain support using documents in the `user_path` folder, run h2oGPT like:
```bash
python generate.py --base_model=gptj --score_model=None --langchain_mode='UserData' --user_path=user_path
python generate.py --base_model='llama' --prompt_type=wizard2 --score_model=None --langchain_mode='UserData' --user_path=user_path
```
See [LangChain Readme](README_LangChain.md) for more details.
For no langchain support (the LangChain package is still used as a model wrapper), run as:
```bash
python generate.py --base_model=gptj --score_model=None
python generate.py --base_model='llama' --prompt_type=wizard2 --score_model=None
```
However, the `gptj` model often gives [no output](FAQ.md#gpt4all-not-producing-output), even outside h2oGPT, so we recommend using a [llama.cpp](FAQ.md#cpu-with-no-avx2-or-using-llamacpp)-based model,
although such models perform much worse than standard non-quantized models.

#### MACOS
