Implement BigCode models (StarCoder etc.) #5
Hi @casper-hansen, thanks for your great work. Here is my code:
I am not sure whether I need to do something special?
It seems something must have gone wrong here during conversion. I will look into the specification of the layers. @curname Can you paste the code you used to measure the accuracy?
This could be reasonable depending on hardware and model size, but it seems there is room for improvement here.
Hi, @casper-hansen
The code for measuring the HumanEval accuracy comes from the OpenAI human-eval project: https://github.com/openai/human-eval/tree/master/human_eval
I did the above implementation on an A100 80G, and the speed of AWQ and GPTQ is almost the same. The experiments in the paper show that AWQ is better than GPTQ, although the experimental models are mainly LLaMA, not StarCoder. If I want to further improve the inference speed to 30 ms/token, or even 20 ms/token, I would appreciate any suggestions you could give.
And the code looks like this:
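For reference, here is a minimal sketch of how HumanEval accuracy is typically measured against an AWQ-quantized checkpoint. It assumes the AutoAWQ from_quantized/generate API and a hypothetical local path "starcoder-awq"; it is an illustration, not the exact code used above.

# Rough sketch: generate one greedy completion per HumanEval problem and
# score it with OpenAI's evaluator. The path and generation settings are
# assumptions, not the poster's exact setup.
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
from human_eval.data import read_problems, write_jsonl

quant_path = "starcoder-awq"  # hypothetical local path to a quantized model
model = AutoAWQForCausalLM.from_quantized(quant_path)
tokenizer = AutoTokenizer.from_pretrained(quant_path, trust_remote_code=True)

samples = []
for task_id, problem in read_problems().items():
    inputs = tokenizer(problem["prompt"], return_tensors="pt").to("cuda")
    out = model.generate(**inputs, max_new_tokens=256, do_sample=False)
    # Keep only the newly generated tokens as the completion
    completion = tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
    samples.append({"task_id": task_id, "completion": completion})

write_jsonl("samples.jsonl", samples)
# Score afterwards with: evaluate_functional_correctness samples.jsonl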
@curname Did you get it working with better accuracy yet? Also, did you test perplexity on wikitext before and after quantization? A normal increase in wikitext perplexity is between 2-5% (LLaMA 7B is around 2%). I wish I could test these models for you, but unfortunately I do not have many GPU resources available to me because of the cost associated with them. The code for testing can be found here: Lines 134 to 138 in 783afe5
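A rough sketch of such a wikitext perplexity check is shown below. It is a generic chunked evaluation, not the exact code at the lines referenced above, and it assumes a Hugging Face causal LM already on GPU (for an AWQ wrapper, you would pass its underlying HF module).

import torch
from datasets import load_dataset

def wikitext_perplexity(model, tokenizer, seqlen=2048):
    # Concatenate the wikitext-2 test split and evaluate in fixed-size chunks
    data = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
    input_ids = tokenizer("\n\n".join(data["text"]), return_tensors="pt").input_ids.to("cuda")

    nlls = []
    n_chunks = input_ids.shape[1] // seqlen
    for i in range(n_chunks):
        chunk = input_ids[:, i * seqlen:(i + 1) * seqlen]
        with torch.no_grad():
            # labels=chunk returns the mean next-token cross-entropy for the chunk
            loss = model(chunk, labels=chunk).loss
        nlls.append(loss.float() * seqlen)
    return torch.exp(torch.stack(nlls).sum() / (n_chunks * seqlen))

# Compare the value before and after quantization; a 2-5% increase is typical.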
Hi, I tried this code and it seems to work in terms of the model generating correctly. But the output is strangely very slow compared to fp16 (the model I tried was 1B). Using a 3090, GPU utilization is very low (< 10%) when generating:

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = 'abacaj/starcoderbase-1b-sft'
quant_path = 'starcoder-1b-awq'
quant_config = { "zero_point": True, "q_group_size": 128, "w_bit": 4 }

# Load model
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Quantize
model.quantize(tokenizer, quant_config=quant_config)

# Save quantized model
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
The process to support models is to first support quantization, and then we move on to optimizing inference by fusing layers. Additionally, we have an upcoming PR that will make speeding up inference much easier.
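For context, generation speed is measured on the quantized checkpoint loaded back through AutoAWQ. A minimal sketch, assuming the from_quantized API; the fuse_layers flag, where available, enables the fused kernels mentioned above:

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

quant_path = 'starcoder-1b-awq'  # path produced by the quantization snippet above

# fuse_layers is an assumption about the loading API; drop it if unsupported
model = AutoAWQForCausalLM.from_quantized(quant_path, fuse_layers=True)
tokenizer = AutoTokenizer.from_pretrained(quant_path, trust_remote_code=True)

prompt = "def fibonacci(n):"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
out = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(out[0], skip_special_tokens=True))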
https://huggingface.co/bigcode/starcoder