Merge pull request langchain-ai#12 from shane-huang/llm-gpu
add support for more datatypes
shane-huang authored Apr 24, 2024
2 parents 9111d3a + 8fc5850 commit 60e6ff6
Showing 5 changed files with 338 additions and 84 deletions.
129 changes: 108 additions & 21 deletions docs/docs/integrations/llms/ipex_llm.ipynb
@@ -6,9 +6,9 @@
"source": [
"# IPEX-LLM\n",
"\n",
"> [IPEX-LLM](https://github.com/intel-analytics/ipex-llm/) is a low-bit LLM optimization library on Intel XPU (Xeon/Core/Flex/Arc/Max). It can make LLMs run extremely fast and consume much less memory on Intel platforms. It is open sourced under Apache 2.0 License.\n",
"> [IPEX-LLM](https://github.com/intel-analytics/ipex-llm/) is a PyTorch library for running LLM on Intel CPU and GPU (e.g., local PC with iGPU, discrete GPU such as Arc, Flex and Max) with very low latency. \n",
"\n",
"This example goes over how to use LangChain to interact with IPEX-LLM for text generation. \n"
"This example goes over how to use LangChain to interact with `ipex-llm` for text generation. \n"
]
},
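The setup cell for this notebook sits in the collapsed hunk below. As a rough sketch (the exact package names, extras, and index URL are assumptions; check the IPEX-LLM documentation), installation would look something like:

```python
# Assumed setup commands; the exact extras/index URL may differ per the IPEX-LLM docs.
%pip install -qU langchain langchain-community
%pip install --pre --upgrade ipex-llm[all]
```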
{
@@ -49,7 +49,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## Usage"
"## Basic Usage"
]
},
{
@@ -58,9 +58,20 @@
"metadata": {},
"outputs": [],
"source": [
"import warnings\n",
"\n",
"from langchain.chains import LLMChain\n",
"from langchain_community.llms import IpexLLM\n",
"from langchain_core.prompts import PromptTemplate"
"from langchain_core.prompts import PromptTemplate\n",
"\n",
"warnings.filterwarnings(\"ignore\", category=UserWarning, message=\".*padding_mask.*\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Specify the prompt template for your model. In this example, we use the [vicuna-1.5](https://huggingface.co/lmsys/vicuna-7b-v1.5) model. If you're working with a different model, choose a proper template accordingly."
]
},
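The prompt-definition cell is collapsed in the hunk that follows; for vicuna-1.5 the template is conventionally along these lines (a sketch under that assumption, not necessarily the exact cell contents):

```python
# Assumed vicuna-1.5 style prompt template; adapt the wording for other models.
template = "USER: {question}\nASSISTANT:"
prompt = PromptTemplate(template=template, input_variables=["question"])
```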
{
@@ -77,7 +88,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Load Model: "
"Load the model locally using IpexLLM using `IpexLLM.from_model_id`. It will load the model directly in its Huggingface format and convert it automatically to low-bit format for inference."
]
},
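The loading cell itself is collapsed in the next hunk. A minimal sketch of the call, reusing the `model_kwargs` that appear in the low-bit reload cell later in this diff:

```python
# Sketch: load vicuna-7b-v1.5 and convert it to low-bit (sym_int4) format for inference.
llm = IpexLLM.from_model_id(
    model_id="lmsys/vicuna-7b-v1.5",
    model_kwargs={"temperature": 0, "max_length": 64, "trust_remote_code": True},
)
```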
{
@@ -88,7 +99,7 @@
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "27c08180714a44c7ab766624d5054163",
"model_id": "897501860fe4452b836f816c72d955dd",
"version_major": 2,
"version_minor": 0
},
@@ -103,7 +114,7 @@
"name": "stderr",
"output_type": "stream",
"text": [
"2024-03-27 00:58:43,670 - INFO - Converting the current model to sym_int4 format......\n"
"2024-04-24 21:20:12,461 - INFO - Converting the current model to sym_int4 format......\n"
]
}
],
@@ -130,24 +141,16 @@
"name": "stderr",
"output_type": "stream",
"text": [
"/opt/anaconda3/envs/shane-langchain2/lib/python3.9/site-packages/langchain_core/_api/deprecation.py:117: LangChainDeprecationWarning: The function `run` was deprecated in LangChain 0.1.0 and will be removed in 0.2.0. Use invoke instead.\n",
"/opt/anaconda3/envs/shane-langchain-3.11/lib/python3.11/site-packages/langchain_core/_api/deprecation.py:119: LangChainDeprecationWarning: The class `LLMChain` was deprecated in LangChain 0.1.17 and will be removed in 0.3.0. Use RunnableSequence, e.g., `prompt | llm` instead.\n",
" warn_deprecated(\n",
"/opt/anaconda3/envs/shane-langchain2/lib/python3.9/site-packages/transformers/generation/utils.py:1369: UserWarning: Using `max_length`'s default (4096) to control the generation length. This behaviour is deprecated and will be removed from the config in v5 of Transformers -- we recommend using `max_new_tokens` to control the maximum length of the generation.\n",
" warnings.warn(\n",
"/opt/anaconda3/envs/shane-langchain2/lib/python3.9/site-packages/ipex_llm/transformers/models/llama.py:218: UserWarning: Passing `padding_mask` is deprecated and will be removed in v4.37.Please make sure use `attention_mask` instead.`\n",
" warnings.warn(\n",
"/opt/anaconda3/envs/shane-langchain2/lib/python3.9/site-packages/ipex_llm/transformers/models/llama.py:218: UserWarning: Passing `padding_mask` is deprecated and will be removed in v4.37.Please make sure use `attention_mask` instead.`\n",
"/opt/anaconda3/envs/shane-langchain-3.11/lib/python3.11/site-packages/transformers/generation/utils.py:1369: UserWarning: Using `max_length`'s default (4096) to control the generation length. This behaviour is deprecated and will be removed from the config in v5 of Transformers -- we recommend using `max_new_tokens` to control the maximum length of the generation.\n",
" warnings.warn(\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...\n",
"To disable this warning, you can either:\n",
"\t- Avoid using `tokenizers` before the fork if possible\n",
"\t- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)\n",
"AI stands for \"Artificial Intelligence.\" It refers to the development of computer systems that can perform tasks that typically require human intelligence, such as visual perception, speech recognition, decision-making, and language translation. AI can be achieved through a combination of techniques such as machine learning, natural language processing, computer vision, and robotics. The ultimate goal of AI research is to create machines that can think and learn like humans, and can even exceed human capabilities in certain areas.\n"
]
}
@@ -156,15 +159,99 @@
"llm_chain = LLMChain(prompt=prompt, llm=llm)\n",
"\n",
"question = \"What is AI?\"\n",
"output = llm_chain.run(question)"
"output = llm_chain.invoke(question)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Save/Load Low-bit Model\n",
"Alternatively, you might save the low-bit model to disk once and use `from_model_id_low_bit` instead of `from_model_id` to reload it for later use - even across different machines. It is space-efficient, as the low-bit model demands significantly less disk space than the original model. And `from_model_id_low_bit` is also more efficient than `from_model_id` in terms of speed and memory usage, as it skips the model conversion step."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To save the low-bit model, use `save_low_bit` as follows."
]
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": []
"source": [
"saved_lowbit_model_path = \"./vicuna-7b-1.5-low-bit\" # path to save low-bit model\n",
"llm.model.save_low_bit(saved_lowbit_model_path)\n",
"del llm"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Load the model from saved lowbit model path as follows. \n",
"> Note that the saved path for the low-bit model only includes the model itself but not the tokenizers. If you wish to have everything in one place, you will need to manually download or copy the tokenizer files from the original model's directory to the location where the low-bit model is saved."
]
},
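If you do want everything in one place, one way to do it (an illustrative sketch, not part of this notebook) is to save the original tokenizer into the same directory as the low-bit weights:

```python
# Illustrative only: copy the tokenizer files next to the saved low-bit model.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("lmsys/vicuna-7b-v1.5")
tokenizer.save_pretrained(saved_lowbit_model_path)
```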
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"2024-04-24 21:20:35,874 - INFO - Converting the current model to sym_int4 format......\n"
]
}
],
"source": [
"llm_lowbit = IpexLLM.from_model_id_low_bit(\n",
" model_id=saved_lowbit_model_path,\n",
" tokenizer_id=\"lmsys/vicuna-7b-v1.5\",\n",
" # tokenizer_name=saved_lowbit_model_path, # copy the tokenizers to saved path if you want to use it this way\n",
" model_kwargs={\"temperature\": 0, \"max_length\": 64, \"trust_remote_code\": True},\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Use the loaded model in Chains:"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/opt/anaconda3/envs/shane-langchain-3.11/lib/python3.11/site-packages/transformers/generation/utils.py:1369: UserWarning: Using `max_length`'s default (4096) to control the generation length. This behaviour is deprecated and will be removed from the config in v5 of Transformers -- we recommend using `max_new_tokens` to control the maximum length of the generation.\n",
" warnings.warn(\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"AI stands for \"Artificial Intelligence.\" It refers to the development of computer systems that can perform tasks that typically require human intelligence, such as visual perception, speech recognition, decision-making, and language translation. AI can be achieved through a combination of techniques such as machine learning, natural language processing, computer vision, and robotics. The ultimate goal of AI research is to create machines that can think and learn like humans, and can even exceed human capabilities in certain areas.\n"
]
}
],
"source": [
"llm_chain = LLMChain(prompt=prompt, llm=llm_lowbit)\n",
"\n",
"question = \"What is AI?\"\n",
"output = llm_chain.invoke(question)"
]
}
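The deprecation warning in the output above recommends composing `prompt | llm` instead of `LLMChain`; an equivalent sketch using that runnable style would be:

```python
# Sketch of the RunnableSequence style suggested by the LangChain deprecation warning.
chain = prompt | llm_lowbit
output = chain.invoke({"question": "What is AI?"})
```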
],
"metadata": {
@@ -183,7 +270,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.18"
"version": "3.11.9"
}
},
"nbformat": 4,
33 changes: 29 additions & 4 deletions libs/community/langchain_community/llms/bigdl_llm.py
@@ -22,6 +22,9 @@ class BigdlLLM(IpexLLM):
def from_model_id(
cls,
model_id: str,
tokenizer_id: Optional[str] = None,
load_in_4bit: bool = True,
load_in_low_bit: Optional[str] = None,
model_kwargs: Optional[dict] = None,
**kwargs: Any,
) -> LLM:
@@ -31,6 +34,8 @@ def from_model_id(
Args:
model_id: Path for the huggingface repo id to be downloaded or
the huggingface checkpoint folder.
tokenizer_id: Path for the huggingface repo id to be downloaded or
the huggingface checkpoint folder which contains the tokenizer.
model_kwargs: Keyword arguments to pass to the model and tokenizer.
kwargs: Extra arguments to pass to the model and tokenizer.
@@ -52,12 +57,27 @@
"Please install it with `pip install --pre --upgrade bigdl-llm[all]`."
)

if load_in_low_bit is not None:
logger.warning(
"""`load_in_low_bit` option is not supported in BigdlLLM and
is ignored. To use other data types with `load_in_low_bit`,
use IpexLLM instead."""
)

if not load_in_4bit:
raise ValueError(
"BigdlLLM only supports loading in 4-bit mode, "
"i.e. load_in_4bit = True. "
"Please install it with `pip install --pre --upgrade bigdl-llm[all]`."
)

_model_kwargs = model_kwargs or {}
_tokenizer_id = tokenizer_id or model_id

try:
tokenizer = AutoTokenizer.from_pretrained(model_id, **_model_kwargs)
tokenizer = AutoTokenizer.from_pretrained(_tokenizer_id, **_model_kwargs)
except Exception:
tokenizer = LlamaTokenizer.from_pretrained(model_id, **_model_kwargs)
tokenizer = LlamaTokenizer.from_pretrained(_tokenizer_id, **_model_kwargs)

try:
model = AutoModelForCausalLM.from_pretrained(
@@ -85,6 +105,7 @@ def from_model_id_low_bit(
def from_model_id_low_bit(
cls,
model_id: str,
tokenizer_id: Optional[str] = None,
model_kwargs: Optional[dict] = None,
**kwargs: Any,
) -> LLM:
@@ -94,6 +115,8 @@ def from_model_id_low_bit(
Args:
model_id: Path for the bigdl-llm transformers low-bit model folder.
tokenizer_id: Path for the huggingface repo id or local model folder
which contains the tokenizer.
model_kwargs: Keyword arguments to pass to the model and tokenizer.
kwargs: Extra arguments to pass to the model and tokenizer.
@@ -117,10 +140,12 @@
)

_model_kwargs = model_kwargs or {}
_tokenizer_id = tokenizer_id or model_id

try:
tokenizer = AutoTokenizer.from_pretrained(model_id, **_model_kwargs)
tokenizer = AutoTokenizer.from_pretrained(_tokenizer_id, **_model_kwargs)
except Exception:
tokenizer = LlamaTokenizer.from_pretrained(model_id, **_model_kwargs)
tokenizer = LlamaTokenizer.from_pretrained(_tokenizer_id, **_model_kwargs)

try:
model = AutoModelForCausalLM.load_low_bit(model_id, **_model_kwargs)
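The `tokenizer_id` fallback added above (`_tokenizer_id = tokenizer_id or model_id`) lets callers keep the low-bit weights and the tokenizer in different locations. A hedged usage sketch (the argument values are illustrative assumptions):

```python
# Sketch: load a saved BigDL low-bit model while fetching the tokenizer from the original repo.
from langchain_community.llms.bigdl_llm import BigdlLLM

llm = BigdlLLM.from_model_id_low_bit(
    model_id="./vicuna-7b-1.5-low-bit",   # folder containing the low-bit weights
    tokenizer_id="lmsys/vicuna-7b-v1.5",  # tokenizer pulled separately
    model_kwargs={"trust_remote_code": True},
)
```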