diff --git a/docs/source/run_locally/llama.cpp.rst b/docs/source/run_locally/llama.cpp.rst
index e9591c4..4d76acf 100644
--- a/docs/source/run_locally/llama.cpp.rst
+++ b/docs/source/run_locally/llama.cpp.rst
@@ -55,14 +55,18 @@ Then you can run the model with the following command:
 
 .. code:: bash
 
-   ./main -m qwen2-7b-instruct-q5_k_m.gguf -n 512 --color -i -cml -f prompts/chat-with-qwen.txt
+   ./llama-cli -m qwen2-7b-instruct-q5_k_m.gguf \
+   -n 512 -co -i -if -f prompts/chat-with-qwen.txt \
+   --in-prefix "<|im_start|>user\n" \
+   --in-suffix "<|im_end|>\n<|im_start|>assistant\n" \
+   -ngl 80 -fa
 
 where ``-n`` refers to the maximum number of tokens to generate. There
 are other hyperparameters for you to choose and you can run
 
 .. code:: bash
 
-   ./main -h
+   ./llama-cli -h
 
 to figure them out.
 
@@ -92,7 +96,7 @@ Then you can run the test with the following command:
 
 .. code:: bash
 
-   ./perplexity -m models/7B/ggml-model-q4_0.gguf -f wiki.test.raw
+   ./llama-perplexity -m models/7B/ggml-model-q4_0.gguf -f wiki.test.raw
 
 where the output is like
 
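
As a note alongside the patch above: ``./llama-cli -h`` lists many more generation hyperparameters than the ones shown in the updated command. Below is a minimal sketch of the same chat invocation with a few common sampling options added; the extra flags (``-c``, ``--temp``, ``--top-k``, ``--top-p``) and their values are illustrative choices, not part of the patched documentation.

.. code:: bash

   # Same invocation as in the patch, plus illustrative sampling settings:
   # -c sets the context length, --temp/--top-k/--top-p control sampling.
   ./llama-cli -m qwen2-7b-instruct-q5_k_m.gguf \
       -n 512 -co -i -if -f prompts/chat-with-qwen.txt \
       --in-prefix "<|im_start|>user\n" \
       --in-suffix "<|im_end|>\n<|im_start|>assistant\n" \
       -ngl 80 -fa \
       -c 4096 --temp 0.7 --top-k 40 --top-p 0.9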