Windows 64-bit, Microsoft Visual Studio - it works like a charm after those fixes! #22
Interesting, doing these changes (and a couple more hacks) I was able to run the 13B model on my HW (AMD Ryzen 7 3700X 8-Core Processor, 3593 MHz, 8 cores, 16 logical processors, 32 GB RAM) and get 268 ms per token, with around 8 GB of RAM usage! I forced the usage of AVX2 and that gave a huge speed up. |
@etra0 here are my 13B model tests, based on the number of threads & AVX2 (thanks!):
4 threads: 3809.57 ms per token (default settings)
4 threads: 495.08 ms per token (with AVX2)
Clearly AVX2 gives a huge boost. I see however that you are still way ahead with your 268 ms. What other optimizations do you have? |
Yes, AVX2 flags are very important for high performance.
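For context, ggml selects its SIMD code paths at compile time via the compiler's predefined macros, so the build flags decide whether the AVX2 routines are compiled at all. A minimal illustrative check (not part of the repo; the flags shown are the standard MSVC and GCC/Clang ones):
// avx2_check.cpp - minimal sketch, not part of llama.cpp.
// __AVX2__ is only predefined when the compiler is told to target AVX2:
//   MSVC:      cl /O2 /arch:AVX2 avx2_check.cpp
//   GCC/Clang: g++ -O3 -mavx2 -mfma avx2_check.cpp
// ggml gates its vectorized kernels behind this macro, which is why the flag
// makes such a large difference in ms/token.
#include <cstdio>

int main() {
#if defined(__AVX2__)
    std::puts("__AVX2__ defined: AVX2 code paths will be compiled in");
#else
    std::puts("__AVX2__ not defined: scalar fallback only (much slower)");
#endif
    return 0;
}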
This is not very desirable - I don't want an extra file added. Although the QK constants everywhere are indeed problematic. |
I could do that, but I'm unsure whether to create a Solution, or move the project to CMake, because Windows doesn't support Make by default, sadly. I always try to avoid Solutions because they're not multiplatform, but from looking at the makefile, rewriting it to CMake would take a bit more time. In the meantime I could do a PR to fix the things that won't compile. |
CMake is better than Solutions. The https://github.com/ggerganov/whisper.cpp project has a CMake build system that is compatible with Windows, and the project is very similar. It should be easy to adapt |
Great! These changes finally fixed compilation for me using the VS cl command (#2) and also CMake with @etra0's repo. I get 140 ms per token on an i9-9900K and about 5 GB RAM usage with 7B. Unfortunately bigger prompts are kind of unusable; I don't know if it's a Windows issue or if this library just isn't optimized for that case yet. Making the hardcoded 512 token limit a parameter was easy to change, but it's too slow because it re-processes all the prompt tokens. |
Maybe the context size has to be increased - it's currently hardcoded to 512: Line 768 in da1a4ff
Haven't tested if it works with other values |
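For anyone experimenting with this, here is a hypothetical sketch of turning the hardcoded limit into a command-line option; the names gpt_params, n_ctx and --ctx_size are illustrative, not necessarily what the repo uses:
// Hypothetical sketch - parameter and flag names are illustrative.
#include <cstdio>
#include <cstdlib>
#include <cstring>

struct gpt_params {
    int n_ctx = 512;   // was hardcoded to 512 (see the comment above)
};

static void parse_args(int argc, char **argv, gpt_params &params) {
    for (int i = 1; i < argc; ++i) {
        if (std::strcmp(argv[i], "--ctx_size") == 0 && i + 1 < argc) {
            params.n_ctx = std::atoi(argv[++i]);   // e.g. --ctx_size 1024
        }
    }
}

int main(int argc, char **argv) {
    gpt_params params;
    parse_args(argc, argv, params);
    std::printf("context size: %d tokens\n", params.n_ctx);
    return 0;
}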
I didn't see your PR when I read the issue so went ahead and made one, very similar. |
Using the fix in #31, however, the results from 4-bit models are still repetitive nonsense. FP16 works, but the results are also very bad. Relevant spec: Intel 13700K, 240 ms/token |
As I said, I made a parameter out of it and it fixes longer prompts, but they are still slow. What I'm saying is that without some kind of quicker prompt loading/caching this is very far from ChatGPT. How does, let's say, a 300-token prompt work for you? |
Any chance we could publish binaries for windows? |
Here https://github.com/jaykrell/llama.cpp/releases/tag/1 |
Here is an updated fork based on the initial adjustments done by @etra0. @etra0, I kindly ask you to merge my pull request and push it to the @ggerganov repo. |
I don't think I'll merge this, sadly. I don't want to add solutions to the project, I'd rather go with the nmake solution or finish writing the CMake. |
@jaykrell thank you for your work, I've tried it and it worked! However, the quantizer seemed to run but didn't produce any bin files (tried with 7B and 13B). I could still run with the original model on an i5-9600K, about 10 times slower though. :D |
* Apply fixes suggested to build on Windows. Issue: #22
* Remove unsupported VLAs
* MSVC: Remove features that are only available on MSVC C++20.
* Fix zero initialization of the other fields.
* Change the use of vector for stack allocations.
@ggerganov would you support merging this to master? |
Successfully compiled this on MSYS2 (UCRT). |
I did the initial draft for CMake support which allows this to be built for Windows as well. You can check the PR at #75. If you pull my changes, you can build the project with the following instructions:
# Assuming you're using PowerShell
mkdir build
cd build
cmake ..
cmake --build . --config Release
That will build the two executables, llama.exe and quantize.exe.
The PR is a draft because I also need to update the instructions, I guess, but it's pretty much usable right now. EDIT: You can also open the llama.cpp folder in Visual Studio 2019 or newer and it should detect the CMake settings automatically, then just build it. cc @jinfagang. |
Small feedback: llama.exe should be renamed to main.exe somewhere to be consistent with the README commands. |
I was able to build with clang (from a VS2022 prompt), without any changes:
clang -march=native -O3 -fuse-ld=lld-link -flto main.cpp ggml.c utils.cpp
clang -march=native -O3 -fuse-ld=lld-link -flto quantize.cpp ggml.c utils.cpp
Seems to be 10% faster (than the timings in #39), YMMV. |
I installed VS 2022 build tools, installed MSVC and CMake, but I get this error:
What am I doing wrong? |
@Zerogoki00 From the looks of it, it seems that you have no C/C++ compiler. Did you make sure to select C++ development when installing the build tools? |
Builds fine for me. Interactive mode doesn't work correctly, program ends after first generation. |
Help me please. main: seed = 1678814584 |
@1octopus1 those warnings are 'normal', as in they don't have anything to do with your errors. Did you do all the rest of the steps? Quantize the model and so on? The fixes mentioned here are just to build main (llama.exe) and quantize (quantize.exe); you still need to follow the rest of the README. I know we still need to update the instructions for Windows, but I just haven't found the time yet. |
"I know we still need to update the instructions for Windows, but I just haven't found the time yet." Yes, I did everything according to the instructions. Okay, I'll wait for the updated instructions. I then spent several hours trying to get it to start =) Just write it in detail, please, with each step =) Thank you very much. |
Interactive mode is not working right. It returns to the Bash command prompt after the first message: == Running in interactive mode. ==
main: mem per token = 14434244 bytes |
Hello everyone :) How do I install it, and how do I turn it on and off on my PC? Who can explain? I have an Intel(R) Core(TM) i7-3630QM CPU @ 2.40GHz, 12.0 GB RAM (available: 11.9 GB), Windows 11 Pro. I hope it will work fine. |
Assuming you are at a VS2022 command prompt and you've installed the Git that ships with Visual Studio:
set PATH=%DevEnvDir%CommonExtensions\Microsoft\TeamFoundation\Team Explorer\Git\cmd;%PATH%
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
mkdir build
cd build
cmake ..
cmake --build . --config Release
If you installed another Git, adjust the PATH line accordingly. Building the repo gives you the executables under build\bin\Release. I can't really help beyond that because I have a different build environment I'm using.
@eldash666 12GB might be tight.
@RedLeader721, interactive mode has several issues. First, #120 is needed for Windows support of the Ctrl-C handler. Second, it's possible for the reverse prompt to appear as different tokens and be ignored. Also, I'd try a better prompt (#199): give an example or two, lead the model as to what you want, and it will follow. |
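Since the Ctrl-C handling keeps coming up for interactive mode, here is a rough sketch of the usual Windows-vs-POSIX approach (SetConsoleCtrlHandler instead of a SIGINT handler). It only illustrates the general technique, not the actual contents of #120, and the flag name g_interrupted is made up:
// Sketch of a portable Ctrl-C handler; illustrative only.
#include <csignal>
#ifdef _WIN32
#include <windows.h>
#endif

// Flag polled by the generation loop (hypothetical name).
static volatile std::sig_atomic_t g_interrupted = 0;

#ifdef _WIN32
static BOOL WINAPI console_ctrl_handler(DWORD ctrl_type) {
    if (ctrl_type == CTRL_C_EVENT) {
        g_interrupted = 1;   // switch back to interactive input instead of exiting
        return TRUE;         // handled - do not terminate the process
    }
    return FALSE;
}
#endif

void install_ctrl_c_handler() {
#ifdef _WIN32
    SetConsoleCtrlHandler(console_ctrl_handler, TRUE);
#else
    std::signal(SIGINT, [](int) { g_interrupted = 1; });
#endif
}

int main() {
    install_ctrl_c_handler();
    // ... the generation loop would check g_interrupted here ...
}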
Let me describe it fully in Chinese; if you're from an English-speaking country, please translate it yourself. (I only know a little English.) Next, you will find that you end up with three files. Then, depending on your preference, either add llama.exe to your environment variables or just drag it over and run it directly; the parameters are all given, just follow them. (Remember to change GBK to UTF, damn encoding issues.) |
Mostly taken from ggerganov/llama.cpp#22. Some might be unnecessary; this is the first version I managed to run.
Does anyone have the binary quantize.exe? Mine doesn't process the FP16 files. There is no error message, but there is no output file. Please publish the file if you have a working one. Thank you. |
I wrote a Windows version of "quantize.sh":
@echo off
setlocal enabledelayedexpansion
cd /d %~dp0

set PARAM_CHECK=FALSE
set MODEL_TYPE=%1
set PARAM="%2"

rem Is there a way to use findstr?
IF "%MODEL_TYPE%"=="7B" set PARAM_CHECK=TRUE
IF "%MODEL_TYPE%"=="13B" set PARAM_CHECK=TRUE
IF "%MODEL_TYPE%"=="30B" set PARAM_CHECK=TRUE
IF "%MODEL_TYPE%"=="65B" set PARAM_CHECK=TRUE

if "%PARAM_CHECK%"=="FALSE" (
    echo;
    echo "Usage: quantize.bat 7B|13B|30B|65B [--remove-f16]"
    echo;
    exit 1
)

rem Quantize every f16 model file for the given model size.
rem %%~nxi passes only the file name (without the path), so it is not prepended twice below.
for %%i in (models\%MODEL_TYPE%\ggml-model-f16.bin*) do (
    call :Quantize %%~nxi
)
exit 0

:Quantize
rem %1 is the file name only; derive the q4_0 output name from it.
set INPUT_MODEL=%1
set OUTPUT_MODEL=!INPUT_MODEL:f16=q4_0!
call quantize.exe models\%MODEL_TYPE%\%INPUT_MODEL% models\%MODEL_TYPE%\%OUTPUT_MODEL% 2
if %PARAM%=="--remove-f16" (
    call del models\%MODEL_TYPE%\%INPUT_MODEL%
) |
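Assuming the layout the script expects (quantize.exe next to the BAT file and the models under models\7B, models\13B, etc.), running for example quantize.bat 7B should produce models\7B\ggml-model-q4_0.bin alongside the f16 file; add --remove-f16 to delete the original afterwards.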
@Beeplex64 the output file is still not created after using your BAT file. Can you please publish your quantize.exe? Not sure why the one that I compiled doesn't work. Thank you. CC @tmzncty I saw your snapshot; if you could publish the binary quantize.exe, I would appreciate it. Thank you. |
Does anyone have the binary quantize.exe and llama.exe? I just have llama.lib after the cmake build operation. How can I deal with it? |
@huangl22 Check the directory llama.cpp\build\bin\Release - assuming you saw the llama.lib in llama.cpp\build\Release |
there is quantize.exe in llama.cpp\build\bin\Release, but there isn't llama.exe in it. |
@huangl22 |
Closing this as there doesn't seem to be a concrete issue on Windows anymore, and we have CI checks now. If you still have problems, please open a new issue. |
First of all, tremendous work Georgi! I managed to run your project, with a few small adjustments, on Windows 64-bit with Microsoft Visual Studio.
Here is the list of those small fixes:
- casts of the form (uint8_t*)(y… changed to ((uint8_t*)y… (cast first, then do the pointer arithmetic)
- (const uint8_t*)(x… changed to ((const uint8_t*)x…
- (const uint8_t*)(y… changed to ((const uint8_t*)y…
- the variable-length array uint8_t pp[qk / 2]; needs to be replaced (MSVC does not support VLAs)
It would be really great if you could incorporate those small fixes.
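To make the fixes above concrete: MSVC rejects pointer arithmetic on void * (a GNU extension) and has no variable-length arrays, which is what the two kinds of changes address. A small illustrative sketch of the shape of the fix, not the exact llama.cpp lines:
// Sketch only: illustrates the MSVC portability fixes described above.
#include <cstdint>
#include <vector>

void quantize_block_sketch(const float *x, void *y, int i, int qk) {
    // MSVC rejects pointer arithmetic on void*, so cast before adding the offset:
    //   GCC-only:  uint8_t *p = (uint8_t *)(y + i*qk);
    uint8_t *p = (uint8_t *)y + i * qk;      // portable form

    // MSVC has no variable-length arrays, so a stack VLA like
    //   uint8_t pp[qk / 2];
    // becomes a std::vector (matching "Change the use of vector for stack
    // allocations" in the merged fix):
    std::vector<uint8_t> pp(qk / 2);

    (void)x; (void)p; (void)pp.size();       // silence unused warnings in this sketch
}

int main() {
    float dummy[32] = {};
    uint8_t out[64] = {};
    quantize_block_sketch(dummy, out, 0, 32);
    return 0;
}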