From 2d1fa3b33fc3308f4cce9917829ad24346cc0901 Mon Sep 17 00:00:00 2001
From: Zlobin Vladimir
Date: Wed, 17 Jul 2024 15:51:54 +0400
Subject: [PATCH 1/9] Add Llama3 (#620)

Co-authored-by: Yaroslav Tarkan
---
 samples/cpp/beam_search_causal_lm/README.md     |  2 +-
 samples/cpp/chat_sample/README.md               |  2 +-
 samples/cpp/greedy_causal_lm/README.md          |  2 +-
 samples/cpp/multinomial_causal_lm/README.md     |  2 +-
 samples/cpp/prompt_lookup_decoding_lm/README.md |  2 +-
 samples/cpp/speculative_decoding_lm/README.md   |  2 +-
 samples/python/beam_search_causal_lm/README.md  |  2 +-
 samples/python/chat_sample/README.md            |  2 +-
 samples/python/greedy_causal_lm/README.md       |  2 +-
 samples/python/multinomial_causal_lm/README.md  |  2 +-
 src/docs/SUPPORTED_MODELS.md                    | 14 +++++++++++++-
 11 files changed, 23 insertions(+), 11 deletions(-)

diff --git a/samples/cpp/beam_search_causal_lm/README.md b/samples/cpp/beam_search_causal_lm/README.md
index a10428891..82232c42f 100644
--- a/samples/cpp/beam_search_causal_lm/README.md
+++ b/samples/cpp/beam_search_causal_lm/README.md
@@ -1,4 +1,4 @@
-# Text generation C++ sample that supports most popular models like LLaMA 2
+# Text generation C++ sample that supports most popular models like LLaMA 3
 
 This example showcases inference of text-generation Large Language Models (LLMs): `chatglm`, `LLaMA`, `Qwen` and other models with the same signature. The application doesn't have many configuration options to encourage the reader to explore and modify the source code. It's only possible to change the device for inference to a different one, GPU for example, from the command line interface. The sample features `ov::genai::LLMPipeline` and configures it to use multiple beam groups. There is also a Jupyter [notebook](https://github.com/openvinotoolkit/openvino_notebooks/tree/latest/notebooks/llm-chatbot) which provides an example of LLM-powered Chatbot in Python.

diff --git a/samples/cpp/chat_sample/README.md b/samples/cpp/chat_sample/README.md
index 4baa8385e..8a24b2000 100644
--- a/samples/cpp/chat_sample/README.md
+++ b/samples/cpp/chat_sample/README.md
@@ -1,4 +1,4 @@
-# C++ chat_sample that supports most popular models like LLaMA 2
+# C++ chat_sample that supports most popular models like LLaMA 3
 
 This example showcases inference of text-generation Large Language Models (LLMs): `chatglm`, `LLaMA`, `Qwen` and other models with the same signature. The application doesn't have many configuration options to encourage the reader to explore and modify the source code. For example, change the device for inference to GPU. The sample features `ov::genai::LLMPipeline` and configures it for the chat scenario. There is also a Jupyter [notebook](https://github.com/openvinotoolkit/openvino_notebooks/tree/latest/notebooks/llm-chatbot) which provides an example of LLM-powered Chatbot in Python.

diff --git a/samples/cpp/greedy_causal_lm/README.md b/samples/cpp/greedy_causal_lm/README.md
index 3c0758ee6..c0a7d5f3c 100644
--- a/samples/cpp/greedy_causal_lm/README.md
+++ b/samples/cpp/greedy_causal_lm/README.md
@@ -1,4 +1,4 @@
-# Text generation C++ greedy_causal_lm that supports most popular models like LLaMA 2
+# Text generation C++ greedy_causal_lm that supports most popular models like LLaMA 3
 
 This example showcases inference of text-generation Large Language Models (LLMs): `chatglm`, `LLaMA`, `Qwen` and other models with the same signature. The application doesn't have many configuration options to encourage the reader to explore and modify the source code. For example, change the device for inference to GPU. The sample features `ov::genai::LLMPipeline` and configures it to run the simplest deterministic greedy sampling algorithm. There is also a Jupyter [notebook](https://github.com/openvinotoolkit/openvino_notebooks/tree/latest/notebooks/llm-chatbot) which provides an example of LLM-powered Chatbot in Python.
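For orientation on the samples retitled above: "multiple beam groups" refers to the `LLMPipeline` generation configuration. The sketch below is not code from this patch; it shows the idea through the Python `openvino_genai` binding, with an illustrative model directory, and the exact `GenerationConfig` fields should be checked against the installed release.

```python
import openvino_genai

# Directory with a model already exported to OpenVINO IR; the name is illustrative.
pipe = openvino_genai.LLMPipeline("TinyLlama-1.1B-Chat-v1.0", "CPU")

config = openvino_genai.GenerationConfig()
config.max_new_tokens = 100
config.num_beams = 15           # total beams across all groups
config.num_beam_groups = 3      # beams are split into groups to diversify results
config.diversity_penalty = 1.0  # discourages groups from repeating each other's tokens
config.num_return_sequences = config.num_beams

print(pipe.generate("What is OpenVINO?", config))
```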
diff --git a/samples/cpp/multinomial_causal_lm/README.md b/samples/cpp/multinomial_causal_lm/README.md
index 731d03e3c..447857991 100644
--- a/samples/cpp/multinomial_causal_lm/README.md
+++ b/samples/cpp/multinomial_causal_lm/README.md
@@ -1,4 +1,4 @@
-# Text generation C++ multinomial_causal_lm that supports most popular models like LLaMA 2
+# Text generation C++ multinomial_causal_lm that supports most popular models like LLaMA 3
 
 This example showcases inference of text-generation Large Language Models (LLMs): `chatglm`, `LLaMA`, `Qwen` and other models with the same signature. The application doesn't have many configuration options to encourage the reader to explore and modify the source code. For example, change the device for inference to GPU. The sample features `ov::genai::LLMPipeline` and configures it to run the random sampling algorithm. There is also a Jupyter [notebook](https://github.com/openvinotoolkit/openvino_notebooks/tree/latest/notebooks/llm-chatbot) which provides an example of LLM-powered Chatbot in Python.

diff --git a/samples/cpp/prompt_lookup_decoding_lm/README.md b/samples/cpp/prompt_lookup_decoding_lm/README.md
index 980c0cd19..89a5e2c58 100644
--- a/samples/cpp/prompt_lookup_decoding_lm/README.md
+++ b/samples/cpp/prompt_lookup_decoding_lm/README.md
@@ -1,4 +1,4 @@
-# prompt_lookup_decoding_lm C++ sample that supports most popular models like LLaMA 2
+# prompt_lookup_decoding_lm C++ sample that supports most popular models like LLaMA 3
 
 [Prompt Lookup decoding](https://github.com/apoorvumang/prompt-lookup-decoding) is an [assisted-generation](https://huggingface.co/blog/assisted-generation#understanding-text-generation-latency) technique where the draft model is replaced with simple string matching against the prompt to generate candidate token sequences. This method is highly effective for input-grounded generation (summarization, document QA, multi-turn chat, code editing), where there is high n-gram overlap between the LLM input (prompt) and the LLM output. This could be entity names, phrases, or code chunks that the LLM directly copies from the input while generating the output. Prompt lookup exploits this pattern to speed up autoregressive decoding in LLMs. This results in significant speedups with no effect on output quality.

diff --git a/samples/cpp/speculative_decoding_lm/README.md b/samples/cpp/speculative_decoding_lm/README.md
index 7abcb6782..c86bd8b61 100644
--- a/samples/cpp/speculative_decoding_lm/README.md
+++ b/samples/cpp/speculative_decoding_lm/README.md
@@ -1,4 +1,4 @@
-# speculative_decoding_lm C++ sample that supports most popular models like LLaMA 2
+# speculative_decoding_lm C++ sample that supports most popular models like LLaMA 3
 
 Speculative decoding (or [assisted-generation](https://huggingface.co/blog/assisted-generation#understanding-text-generation-latency) in HF terminology) is a recent technique that speeds up token generation by running an additional, smaller draft model alongside the main model.
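To make the two techniques above concrete: speculative decoding lets a small draft model propose a few tokens that the main model then verifies in one batched pass, and prompt lookup is the same loop with the draft model replaced by n-gram matching against the prompt. The following is a toy sketch of the greedy accept/verify logic in plain Python with stub models; it is not code from these samples, which implement this in C++ on top of OpenVINO inference requests.

```python
from typing import Callable, List

def speculative_step(
    tokens: List[int],
    draft_next: Callable[[List[int]], int],  # greedy next token from the draft model
    main_next: Callable[[List[int]], int],   # greedy next token from the main model
    k: int = 4,
) -> List[int]:
    # 1. The draft model cheaply proposes k candidate tokens, one at a time.
    ctx = list(tokens)
    candidates: List[int] = []
    for _ in range(k):
        t = draft_next(ctx)
        candidates.append(t)
        ctx.append(t)

    # 2. The main model verifies the candidates. In a real implementation all
    #    k positions are scored in one batched forward pass; that batching,
    #    not this loop, is where the speedup comes from.
    ctx = list(tokens)
    accepted: List[int] = []
    for t in candidates:
        expected = main_next(ctx)
        if expected != t:
            accepted.append(expected)  # first mismatch: keep the main model's token
            break
        accepted.append(t)             # match: the draft token is accepted for free
        ctx.append(t)
    else:
        accepted.append(main_next(ctx))  # every candidate matched: one bonus token

    return tokens + accepted

# Tiny smoke test with stub "models" that count upwards until they disagree.
if __name__ == "__main__":
    draft = lambda ctx: ctx[-1] + 1
    main = lambda ctx: ctx[-1] + 1 if len(ctx) < 6 else 0
    print(speculative_step([1, 2, 3], draft, main))  # [1, 2, 3, 4, 5, 6, 0]
```

Because every kept token is one the main model would have produced anyway, the output is unchanged; only the number of expensive main-model passes drops.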
diff --git a/samples/python/beam_search_causal_lm/README.md b/samples/python/beam_search_causal_lm/README.md
index ff5286d01..5e80aa69d 100644
--- a/samples/python/beam_search_causal_lm/README.md
+++ b/samples/python/beam_search_causal_lm/README.md
@@ -1,4 +1,4 @@
-# Text generation Python sample that supports most popular models like LLaMA 2
+# Text generation Python sample that supports most popular models like LLaMA 3
 
 This example showcases inference of text-generation Large Language Models (LLMs): `chatglm`, `LLaMA`, `Qwen` and other models with the same signature. The application doesn't have many configuration options to encourage the reader to explore and modify the source code. It's only possible to change the device for inference to a different one, GPU for example, from the command line interface. The sample features `openvino_genai.LLMPipeline` and configures it to use multiple beam groups. There is also a Jupyter [notebook](https://github.com/openvinotoolkit/openvino_notebooks/tree/latest/notebooks/llm-chatbot) which provides an example of LLM-powered Chatbot in Python.

diff --git a/samples/python/chat_sample/README.md b/samples/python/chat_sample/README.md
index 34d71fab8..983789d0e 100644
--- a/samples/python/chat_sample/README.md
+++ b/samples/python/chat_sample/README.md
@@ -1,4 +1,4 @@
-# Python chat_sample that supports most popular models like LLaMA 2
+# Python chat_sample that supports most popular models like LLaMA 3
 
 This example showcases inference of text-generation Large Language Models (LLMs): `chatglm`, `LLaMA`, `Qwen` and other models with the same signature. The application doesn't have many configuration options to encourage the reader to explore and modify the source code. For example, change the device for inference to GPU. The sample features `openvino_genai.LLMPipeline` and configures it for the chat scenario. There is also a Jupyter [notebook](https://github.com/openvinotoolkit/openvino_notebooks/tree/latest/notebooks/llm-chatbot) which provides an example of LLM-powered Chatbot in Python.

diff --git a/samples/python/greedy_causal_lm/README.md b/samples/python/greedy_causal_lm/README.md
index 7c87b04aa..97b044eb5 100644
--- a/samples/python/greedy_causal_lm/README.md
+++ b/samples/python/greedy_causal_lm/README.md
@@ -1,4 +1,4 @@
-# Text generation Python greedy_causal_lm that supports most popular models like LLaMA 2
+# Text generation Python greedy_causal_lm that supports most popular models like LLaMA 3
 
 This example showcases inference of text-generation Large Language Models (LLMs): `chatglm`, `LLaMA`, `Qwen` and other models with the same signature. The application doesn't have many configuration options to encourage the reader to explore and modify the source code. For example, change the device for inference to GPU. The sample features `openvino_genai.LLMPipeline` and configures it to run the simplest deterministic greedy sampling algorithm. There is also a Jupyter [notebook](https://github.com/openvinotoolkit/openvino_notebooks/tree/latest/notebooks/llm-chatbot) which provides an example of LLM-powered Chatbot in Python.
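The chat scenario mentioned in these Python READMEs keeps conversation state across turns. A minimal sketch of that usage, assuming the `openvino_genai` package and an illustrative model directory, might look like this:

```python
import openvino_genai

# Model directory is illustrative; any chat-tuned model exported to OpenVINO IR works.
pipe = openvino_genai.LLMPipeline("TinyLlama-1.1B-Chat-v1.0", "CPU")

# start_chat()/finish_chat() preserve the conversation state between turns,
# so each generate() call only needs to pass the new user message.
pipe.start_chat()
for prompt in ["What is OpenVINO?", "Does it support GPUs?"]:
    print(pipe.generate(prompt, max_new_tokens=100))
pipe.finish_chat()
```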
diff --git a/samples/python/multinomial_causal_lm/README.md b/samples/python/multinomial_causal_lm/README.md
index d76b93366..d39142f3d 100644
--- a/samples/python/multinomial_causal_lm/README.md
+++ b/samples/python/multinomial_causal_lm/README.md
@@ -1,4 +1,4 @@
-# Text generation Python multinomial_causal_lm that supports most popular models like LLaMA 2
+# Text generation Python multinomial_causal_lm that supports most popular models like LLaMA 3
 
 This example showcases inference of text-generation Large Language Models (LLMs): `chatglm`, `LLaMA`, `Qwen` and other models with the same signature. The application doesn't have many configuration options to encourage the reader to explore and modify the source code. For example, change the device for inference to GPU. The sample features `openvino_genai.LLMPipeline` and configures it to run the random sampling algorithm. There is also a Jupyter [notebook](https://github.com/openvinotoolkit/openvino_notebooks/tree/latest/notebooks/llm-chatbot) which provides an example of LLM-powered Chatbot in Python.

diff --git a/src/docs/SUPPORTED_MODELS.md b/src/docs/SUPPORTED_MODELS.md
index 0e6099db0..3eb2af17b 100644
--- a/src/docs/SUPPORTED_MODELS.md
+++ b/src/docs/SUPPORTED_MODELS.md
@@ -45,7 +45,19 @@
-      LlamaForCausalLM
+      LlamaForCausalLM
+      Llama 3
+
+
+
+
+
+
       Llama 2
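As a usage note for the Llama 3 row added above: the model is expected to flow through the same export-then-run path as the other `LlamaForCausalLM` models. A hedged sketch follows, with an illustrative model ID and output directory; the Hugging Face repository is gated, so Meta's license must be accepted first.

```python
# Hypothetical end-to-end flow for the newly listed Llama 3 entry. The export
# step is typically done once from the command line (paths are illustrative):
#
#   optimum-cli export openvino --model meta-llama/Meta-Llama-3-8B-Instruct llama3_ov
#
import openvino_genai

# Run the exported model with plain greedy decoding on CPU.
pipe = openvino_genai.LLMPipeline("llama3_ov", "CPU")
print(pipe.generate("Why is the sky blue?", max_new_tokens=120))
```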