LLaMA #1473
@conceptofmind I believe you said you were working on this?
Yes, actively working on this with a group of peers. We have successfully deployed inference with the 65B models. Working on a LangChain wrapper now.
Would have to think about how to handle the sizes of the different models, though. I could see this becoming an issue for the end user…
There is some ongoing work to use GPTQ to compress the models to 3 or 4 bits in this repo. There is also a discussion going on over at the oobabooga repo. Not sure if this is going to work, but it might be something to keep an eye on. If it works out, it could be possible to run the larger models on a single consumer-grade GPU. The original paper is available here on arXiv.
4-bit may be plausible; 8-bit should be fine. The weights are already in fp16, from my understanding. I would have to evaluate this further.
Yes, the weights are fp16. You can convert and run 4-bit using https://github.com/ggerganov/llama.cpp. I think 30B at full precision might be at least on par with 65B at 4-bit in terms of results. llama.cpp runs on CPU, including Apple Silicon, which might be a good choice for developers with recent MacBooks: they could develop and run experiments locally with LangChain without needing GPUs.
Confirmed working on a single consumer-grade 4090 here with 13B. Waiting on the 30B 4-bit weights; trying to run them at fp16 failed. :)
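To put rough numbers on the sizing discussion above, here is a back-of-the-envelope sketch (weights only; it ignores activations and the KV cache, so real requirements are higher):

```python
# Approximate memory needed just to hold LLaMA weights at different precisions.

def weight_gib(params_billions: float, bits_per_weight: int) -> float:
    """Size of the weights alone, in GiB."""
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30

for size in (7, 13, 30, 65):
    print(f"{size}B: ~{weight_gib(size, 16):.0f} GiB fp16, ~{weight_gib(size, 4):.0f} GiB 4-bit")

# 7B:  ~13 GiB fp16, ~3 GiB 4-bit
# 13B: ~24 GiB fp16, ~6 GiB 4-bit
# 30B: ~56 GiB fp16, ~14 GiB 4-bit
# 65B: ~121 GiB fp16, ~30 GiB 4-bit
```

That lines up with the reports above: 30B at fp16 doesn't fit on a 24 GB card, while 30B at 4-bit does.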
I am aware of all these alternatives. We are waiting to hear back from Hugging Face before the decision is made. Once we have a concrete answer from them we will proceed from there. I have some concerns about llama.cpp, since the author seems to have noted he has no interest in maintaining it. And there are other things to factor in when adding dependencies that cannot be easily installed. It needs to be a relatively effortless setup for the best user experience.
Using the GPTQ 4-bit quantized 30B model, outputs are (as far as I can tell) very good. Hope to see GPTQ 4-bit support in LangChain. GPTQ quantization appears to be better than the 4-bit RTN quantization currently used in llama.cpp. The 4-bit 30B model is confirmed working on an old Tesla P40 GPU (24 GB).
Any info on running the 7B model with LangChain?
It'd be really neat if that's going to be an option 😄
LLaMA has been added to Hugging Face: huggingface/transformers#21955. The only reason to add a specific wrapper would be to include the perf improvements from llama.cpp or GPTQ.
I think you are talking about a Python wrapper. So I'm going to write a TS wrapper for llama.cpp and alpaca.cpp for localhost private usage, if no one is working on this yet. I will try to extend the
Here you are: https://github.com/linonetwo/langchain-alpaca (https://www.npmjs.com/package/langchain-alpaca). It works on all platforms and runs fully locally. For now, I will try to make a langchain-llama package.
I'm eagerly waiting to try it for a project :D !!!
If anyone's interested, I've made a pass at wrapping the llama.cpp shared library using ctypes and deriving a custom LLM class for it.
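For anyone curious what that pattern looks like, here is a minimal sketch. The C entry point (`generate`) and library path are hypothetical placeholders, not llama.cpp's real API (which is lower level); the LangChain side is just a subclass of `LLM` that implements `_call`:

```python
import ctypes
from typing import List, Optional

from langchain.llms.base import LLM


class LlamaCtypesLLM(LLM):
    """Toy LLM backed by a ctypes-loaded shared library (illustrative only)."""

    lib_path: str = "./libllama.so"  # hypothetical build artifact
    max_tokens: int = 256

    @property
    def _llm_type(self) -> str:
        return "llama-ctypes"

    def _call(self, prompt: str, stop: Optional[List[str]] = None) -> str:
        lib = ctypes.CDLL(self.lib_path)
        # Hypothetical C signature: char *generate(const char *prompt, int max_tokens)
        lib.generate.argtypes = [ctypes.c_char_p, ctypes.c_int]
        lib.generate.restype = ctypes.c_char_p
        text = lib.generate(prompt.encode("utf-8"), self.max_tokens).decode("utf-8")
        if stop:  # crude client-side stop-sequence handling
            for token in stop:
                text = text.split(token)[0]
        return text
```

The real wrappers linked in this thread do considerably more (context management, sampling parameters, streaming), but the LLM-subclass shape is the same.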
FYI: I just submitted this pull request to integrate llama.cpp into LangChain.
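If that integration lands as a `LlamaCpp` class backed by the `llama-cpp-python` bindings (my assumption here), usage from LangChain might look roughly like this, given a locally converted and quantized ggml model file:

```python
# pip install llama-cpp-python  (plus a ggml-format model produced with llama.cpp's convert/quantize tools)
from langchain import LLMChain, PromptTemplate
from langchain.llms import LlamaCpp

llm = LlamaCpp(
    model_path="./models/ggml-model-q4_0.bin",  # placeholder path to local quantized weights
    max_tokens=256,
    temperature=0.7,
)

prompt = PromptTemplate(
    input_variables=["question"],
    template="Answer concisely.\nQuestion: {question}\nAnswer:",
)
chain = LLMChain(llm=llm, prompt=prompt)
print(chain.run("What is the capital of France?"))
```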
Thank you very much!! Do you think it would be possible to run LLaMA on GPU as well somehow?
You are able to load LLaMA through Hugging Face Transformers and use it in a GPU-accelerated environment: https://huggingface.co/docs/transformers/main/en/model_doc/llama
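A rough sketch of that route, assuming you already have converted LLaMA weights in the Hugging Face format at a local path (placeholder below) and `accelerate` installed so `device_map="auto"` can place the model on your GPU(s):

```python
# pip install transformers accelerate
import torch
from transformers import LlamaForCausalLM, LlamaTokenizer, pipeline

from langchain.llms import HuggingFacePipeline

model_dir = "./llama-7b-hf"  # placeholder: locally converted LLaMA checkpoint

tokenizer = LlamaTokenizer.from_pretrained(model_dir)
model = LlamaForCausalLM.from_pretrained(
    model_dir,
    torch_dtype=torch.float16,  # the weights ship in fp16, as noted above
    device_map="auto",          # spread layers across available GPU(s)
)

generate = pipeline("text-generation", model=model, tokenizer=tokenizer, max_new_tokens=128)
llm = HuggingFacePipeline(pipeline=generate)

print(llm("Explain what LangChain does in one sentence."))
```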
I also added Kobold/text-generation-webui support, so you can run LLaMA or whatever you want locally.
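The general pattern for talking to any of those local servers from LangChain is the same custom-LLM shape as the ctypes sketch above, just over HTTP. The endpoint path and request/response shapes below are assumptions for illustration; Kobold and text-generation-webui each have their own API, so check the one you're running:

```python
from typing import List, Optional

import requests
from langchain.llms.base import LLM


class LocalServerLLM(LLM):
    """Sends prompts to a locally hosted generation server (endpoint shape assumed)."""

    endpoint: str = "http://localhost:5000/api/v1/generate"  # hypothetical route
    max_new_tokens: int = 200

    @property
    def _llm_type(self) -> str:
        return "local-http"

    def _call(self, prompt: str, stop: Optional[List[str]] = None) -> str:
        resp = requests.post(
            self.endpoint,
            json={"prompt": prompt, "max_new_tokens": self.max_new_tokens},
            timeout=600,
        )
        resp.raise_for_status()
        # Assumed response shape; adapt to whatever your server actually returns.
        text = resp.json()["results"][0]["text"]
        if stop:
            for token in stop:
                text = text.split(token)[0]
        return text
```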
I've written an app to run LLaMA-based models using Docker here: https://github.com/1b5d/llm-api, thanks to llama-cpp-python and llama.cpp. To run it:
Did you happen to test this with https://github.com/oobabooga/text-generation-webui? I haven't dug into Kobold enough to know if the APIs are similar enough.
Hi, @slavakurilyak! I'm Dosu, and I'm here to help the LangChain team manage their backlog. I wanted to let you know that we are marking this issue as stale.

From what I understand, this issue is a request for LangChain to integrate with LLaMA, a more powerful and efficient language model developed by Facebook Research. There has been ongoing work to use GPTQ to compress the models to 3 or 4 bits, and there has been a discussion about running LLaMA on GPUs. Additionally, a Python wrapper for llama.cpp has been created, and there are plans to create a TS wrapper as well. It's worth mentioning that LLaMA has been added to Hugging Face, and there are other alternatives like Kobold/text-generation-webui and langchain-llm-api.

Before we close this issue, we wanted to check with you if it is still relevant to the latest version of the LangChain repository. If it is, please let us know by commenting on the issue. Otherwise, feel free to close the issue yourself, or it will be automatically closed in 7 days.

Thank you for your contribution to the LangChain repository, and please don't hesitate to reach out if you have any further questions or concerns!

Best regards,
Dosu
It would be great to see LangChain integrate with LLaMA, a collection of foundation language models ranging from 7B to 65B parameters.
LLaMA was developed by Meta AI as an open, efficient alternative to much larger proprietary models such as GPT-3. Its main advantage is efficiency: according to the paper, LLaMA-13B outperforms GPT-3 (175B) on most benchmarks despite being more than an order of magnitude smaller, and LLaMA-65B is competitive with Chinchilla-70B and PaLM-540B. The models are trained exclusively on publicly available data, and the smaller variants are light enough to run on a single GPU, which makes them well suited to local experimentation. Overall, LLaMA offers quality comparable to far larger models at a fraction of the inference cost.
Here's the official repo by @facebookresearch: https://github.com/facebookresearch/llama. The research abstract and PDF are available at https://arxiv.org/abs/2302.13971.
Note, this project is not to be confused with LlamaIndex (previously GPT Index) by @jerryjliu.