Windows 64-bit, Microsoft Visual Studio - it works like a charm after those fixes! #22
Interesting, doing these changes (and a couple more hacks) I was able to run the 13B model on my HW (AMD Ryzen 7 3700X 8-Core Processor, 3593 MHz, 8 cores, 16 logical processors, 32 GB RAM) and get 268 ms per token, with around 8 GB of RAM usage! I forced the usage of AVX2 and that gave a huge speed up. |
@etra0 here are my 13B model tests, based on the number of threads & AVX2 (thanks!):
4 threads: 3809.57 ms per token (default settings)
4 threads: 495.08 ms per token (with AVX2)
Clearly AVX2 gives a huge boost. I see however that you are still way ahead with your 268 ms. What other optimizations do you have? |
Yes, AVX2 flags are very important for high performance.
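For context, ggml selects its SIMD code paths at compile time via the compiler's predefined macros, so the build flags decide whether the AVX2 routines are compiled at all. A minimal illustrative check (not part of the repo; the flags shown are the standard MSVC and GCC/Clang ones):
// avx2_check.cpp - minimal sketch, not part of llama.cpp.
// __AVX2__ is only predefined when the compiler is told to target AVX2:
//   MSVC:      cl /O2 /arch:AVX2 avx2_check.cpp
//   GCC/Clang: g++ -O3 -mavx2 -mfma avx2_check.cpp
// ggml gates its vectorized kernels behind this macro, which is why the flag
// makes such a large difference in ms/token.
#include <cstdio>

int main() {
#if defined(__AVX2__)
    std::puts("__AVX2__ defined: AVX2 code paths will be compiled in");
#else
    std::puts("__AVX2__ not defined: scalar fallback only (much slower)");
#endif
    return 0;
}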
This is not very desirable - I don't want an extra file added. Although the QK constants everywhere are indeed problematic. |
I could do that, but I'm unsure whether to create a Solution, or move the project to CMake, because Windows doesn't support Make by default, sadly. I always try to avoid Solutions because they're not multiplatform, but from looking at the makefile, rewriting it to CMake would take a bit more time. In the meantime I could do a PR to fix the things that won't compile. |
CMake is better than Solutions. The https://github.com/ggerganov/whisper.cpp project has a CMake build system that is compatible with Windows, and the project is very similar. It should be easy to adapt |
Great! These changes finally fixed compilation for me using the VS cl command (#2) and also CMake with @etra0's repo. I get 140 ms per token on an i9-9900K and about 5 GB RAM usage with 7B. Unfortunately bigger prompts are kind of unusable; I don't know if it's a Windows issue or if this library just isn't optimized for that case yet. Making the hardcoded 512 token limit a parameter was easy to change, but it's too slow because it re-processes all the prompt tokens. |
Maybe the context size has to be increased - it's currently hardcoded to 512: Line 768 in da1a4ff
Haven't tested if it works with other values |
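For anyone experimenting with this, here is a hypothetical sketch of turning the hardcoded limit into a command-line option; the names gpt_params, n_ctx and --ctx_size are illustrative, not necessarily what the repo uses:
// Hypothetical sketch - parameter and flag names are illustrative.
#include <cstdio>
#include <cstdlib>
#include <cstring>

struct gpt_params {
    int n_ctx = 512;   // was hardcoded to 512 (see the comment above)
};

static void parse_args(int argc, char **argv, gpt_params &params) {
    for (int i = 1; i < argc; ++i) {
        if (std::strcmp(argv[i], "--ctx_size") == 0 && i + 1 < argc) {
            params.n_ctx = std::atoi(argv[++i]);   // e.g. --ctx_size 1024
        }
    }
}

int main(int argc, char **argv) {
    gpt_params params;
    parse_args(argc, argv, params);
    std::printf("context size: %d tokens\n", params.n_ctx);
    return 0;
}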
I didn't see your PR when I read the issue so went ahead and made one, very similar. |
Using the fix in #31, however, the results from 4-bit models are still repetitive nonsense. FP16 works, but the results are also very bad. Relevant spec: Intel 13700K, 240 ms/token |
As I said, I made a parameter out of it and it fixes longer prompts, but they are still slow. What I'm saying is that without some kind of quicker prompt loading/caching this is very far from ChatGPT. How does, let's say, a 300-token prompt work for you? |
Any chance we could publish binaries for windows? |
Here https://github.com/jaykrell/llama.cpp/releases/tag/1 |
Here is an updated fork based on the initial adjustments done by @etra0. @etra0, I kindly ask you to merge my pull request and push it to the @ggerganov repo. |
I don't think I'll merge this, sadly. I don't want to add solutions to the project, I'd rather go with the nmake solution or finish writing the CMake. |
@jaykrell thank you for your work, I've tried it and it worked! However, the quantizer seemed to run but didn't produce any bin files (tried with 7B and 13B). I could still run with the original model on an i5-9600K, about 10 times slower though. :D |
* Apply fixes suggested to build on Windows. Issue: #22
* Remove unsupported VLAs
* MSVC: Remove features that are only available on MSVC C++20.
* Fix zero initialization of the other fields.
* Change the use of vector for stack allocations.
@ggerganov would you support merging this to master? |
Successfully compiled this on MSYS2 (UCRT). |
I did the initial draft for CMake support which allows this to be built for Windows as well. You can check the PR at #75. If you pull my changes, you can build the project with the following instructions:
# Assuming you're using PowerShell
mkdir build
cd build
cmake ..
cmake --build . --config Release
That will build the two executables, llama.exe and quantize.exe.
The PR is a draft because I also need to update the instructions, I guess, but it's pretty much usable right now. EDIT: You can also open the llama.cpp folder in Visual Studio 2019 or newer and it should detect the CMake settings automatically, then just build it. cc @jinfagang. |
Small feedback: llama.exe should be renamed to main.exe somewhere to be consistent with the README commands. |
I was able to build with clang (from a VS2022 prompt), without any changes:
clang -march=native -O3 -fuse-ld=lld-link -flto main.cpp ggml.c utils.cpp
clang -march=native -O3 -fuse-ld=lld-link -flto quantize.cpp ggml.c utils.cpp
Seems to be 10% faster (than the timings in #39), YMMV. |
I installed VS 2022 build tools, installed MSVC and CMake, but I get this error:
What am I doing wrong? |
@Zerogoki00 From the looks of it, it seems that you have no C/C++ compiler. Did you make sure to select C++ development when installing the build tools? |
Builds fine for me. Interactive mode doesn't work correctly, program ends after first generation. |
Help me please. main: seed = 1678814584 |
@1octopus1 those warnings are 'normal', as in they don't have anything to do with your errors. Did you do all the rest of the steps? Quantize the model and so on? The fixes mentioned here are just to build main (llama.exe) and quantize (quantize.exe); you still need to follow the rest of the README. I know we still need to update the instructions for Windows, but I just haven't found the time yet. |
"I know we still need to update the instructions for Windows, but I just haven't found the time yet." Yes, I did everything according to the instructions. Okay, I'll wait for the updated instructions. I then spent several hours trying to get it to start =) Just write it in detail, please, with each step =) Thank you very much. |
Interactive mode is not working right. It returns to the Bash command prompt after the first message: == Running in interactive mode. ==
main: mem per token = 14434244 bytes |
Hello everyone :) How do I install it, and how do I turn it on and off on my PC? Who can explain? I have an Intel(R) Core(TM) i7-3630QM CPU @ 2.40GHz, 12.0 GB RAM (available: 11.9 GB), Windows 11 Pro. I hope it will work fine. |
Assuming you are at a VS2022 command prompt and you've installed the Git that ships with Visual Studio:
set PATH=%DevEnvDir%CommonExtensions\Microsoft\TeamFoundation\Team Explorer\Git\cmd;%PATH%
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
mkdir build
cd build
cmake ..
cmake --build . --config Release
If you installed another Git, adjust the PATH line accordingly. Building the repo gives you the executables under build\bin\Release. I can't really help beyond that because I have a different build environment I'm using.
@eldash666 12GB might be tight.
@RedLeader721, interactive mode has several issues. First, #120 is needed for Windows support of the Ctrl-C handler. Second, it's possible for the reverse prompt to appear as different tokens and be ignored. Also, I'd try a better prompt (#199): give an example or two, lead the model as to what you want, and it will follow. |
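Since the Ctrl-C handling keeps coming up for interactive mode, here is a rough sketch of the usual Windows-vs-POSIX approach (SetConsoleCtrlHandler instead of a SIGINT handler). It only illustrates the general technique, not the actual contents of #120, and the flag name g_interrupted is made up:
// Sketch of a portable Ctrl-C handler; illustrative only.
#include <csignal>
#ifdef _WIN32
#include <windows.h>
#endif

// Flag polled by the generation loop (hypothetical name).
static volatile std::sig_atomic_t g_interrupted = 0;

#ifdef _WIN32
static BOOL WINAPI console_ctrl_handler(DWORD ctrl_type) {
    if (ctrl_type == CTRL_C_EVENT) {
        g_interrupted = 1;   // switch back to interactive input instead of exiting
        return TRUE;         // handled - do not terminate the process
    }
    return FALSE;
}
#endif

void install_ctrl_c_handler() {
#ifdef _WIN32
    SetConsoleCtrlHandler(console_ctrl_handler, TRUE);
#else
    std::signal(SIGINT, [](int) { g_interrupted = 1; });
#endif
}

int main() {
    install_ctrl_c_handler();
    // ... the generation loop would check g_interrupted here ...
}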
Let me describe it fully in Chinese; if you're from an English-speaking country, please translate it yourself. (I only know a little English.) Next, you will find that you end up with three files. Then, depending on your preference, either add llama.exe to your environment variables or just drag it over and run it directly; the parameters are all given, just follow them. (Remember to change GBK to UTF, damn encoding issues.) |
Mostly taken from ggerganov/llama.cpp#22. Some might be unnecessary; this is the first version I managed to run.
Does anyone have the binary quantize.exe? Mine doesn't process the FP16 files. There is no error message, but there is no output file. Please publish the file if you have a working one. Thank you. |
I wrote a Windows version of "quantize.sh":
@echo off
setlocal enabledelayedexpansion
cd /d %~dp0

set PARAM_CHECK=FALSE
set MODEL_TYPE=%1
set PARAM="%2"

rem Is there a way to use findstr?
IF "%MODEL_TYPE%"=="7B" set PARAM_CHECK=TRUE
IF "%MODEL_TYPE%"=="13B" set PARAM_CHECK=TRUE
IF "%MODEL_TYPE%"=="30B" set PARAM_CHECK=TRUE
IF "%MODEL_TYPE%"=="65B" set PARAM_CHECK=TRUE

if "%PARAM_CHECK%"=="FALSE" (
    echo;
    echo "Usage: quantize.bat 7B|13B|30B|65B [--remove-f16]"
    echo;
    exit 1
)

rem Quantize every f16 model file for the given model size.
rem %%~nxi passes only the file name (without the path), so it is not prepended twice below.
for %%i in (models\%MODEL_TYPE%\ggml-model-f16.bin*) do (
    call :Quantize %%~nxi
)
exit 0

:Quantize
rem %1 is the file name only; derive the q4_0 output name from it.
set INPUT_MODEL=%1
set OUTPUT_MODEL=!INPUT_MODEL:f16=q4_0!
call quantize.exe models\%MODEL_TYPE%\%INPUT_MODEL% models\%MODEL_TYPE%\%OUTPUT_MODEL% 2
if %PARAM%=="--remove-f16" (
    call del models\%MODEL_TYPE%\%INPUT_MODEL%
) |
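Assuming the layout the script expects (quantize.exe next to the BAT file and the models under models\7B, models\13B, etc.), running for example quantize.bat 7B should produce models\7B\ggml-model-q4_0.bin alongside the f16 file; add --remove-f16 to delete the original afterwards.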
@Beeplex64 the output file is still not created after using your BAT file. Can you please publish your quantize.exe? Not sure why the one that I compiled doesn't work. Thank you. CC @tmzncty I saw your snapshot; if you could publish the binary quantize.exe, I would appreciate it. Thank you. |
Does anyone have the binary quantize.exe and llama.exe? I just have llama.lib after the cmake build operation. How can I deal with it? |
@huangl22 Check the directory llama.cpp\build\bin\Release - assuming you saw the llama.lib in llama.cpp\build\Release |
there is quantize.exe in llama.cpp\build\bin\Release, but there isn't llama.exe in it. |
@huangl22 |
Closing this as there doesn't seem to be a concrete issue on Windows anymore, and we have CI checks now. If you still have problems, please open a new issue. |
First of all, tremendous work Georgi! I managed to run your project, with a few small adjustments, on Windows 64-bit with Microsoft Visual Studio.
Here is the list of those small fixes:
- casts of the form (uint8_t*)(y… changed to ((uint8_t*)y… (cast first, then do the pointer arithmetic)
- (const uint8_t*)(x… changed to ((const uint8_t*)x…
- (const uint8_t*)(y… changed to ((const uint8_t*)y…
- the variable-length array uint8_t pp[qk / 2]; needs to be replaced (MSVC does not support VLAs)
It would be really great if you could incorporate those small fixes.
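To make the fixes above concrete: MSVC rejects pointer arithmetic on void * (a GNU extension) and has no variable-length arrays, which is what the two kinds of changes address. A small illustrative sketch of the shape of the fix, not the exact llama.cpp lines:
// Sketch only: illustrates the MSVC portability fixes described above.
#include <cstdint>
#include <vector>

void quantize_block_sketch(const float *x, void *y, int i, int qk) {
    // MSVC rejects pointer arithmetic on void*, so cast before adding the offset:
    //   GCC-only:  uint8_t *p = (uint8_t *)(y + i*qk);
    uint8_t *p = (uint8_t *)y + i * qk;      // portable form

    // MSVC has no variable-length arrays, so a stack VLA like
    //   uint8_t pp[qk / 2];
    // becomes a std::vector (matching "Change the use of vector for stack
    // allocations" in the merged fix):
    std::vector<uint8_t> pp(qk / 2);

    (void)x; (void)p; (void)pp.size();       // silence unused warnings in this sketch
}

int main() {
    float dummy[32] = {};
    uint8_t out[64] = {};
    quantize_block_sketch(dummy, out, 0, 32);
    return 0;
}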