Nomic Vulkan backend #4456

Merged
merged 155 commits into from
Jan 29, 2024

cebtenzzre (Collaborator) commented Dec 13, 2023

This is Nomic's Kompute-based Vulkan backend from the GPT4All project, now available under the MIT license. It can be enabled by building with cmake and passing -DLLAMA_KOMPUTE=ON (make is currently not supported).

Structure

  • The C++ code for the backend is in ggml-kompute.cpp and ggml-kompute.h. This is MIT licensed.
  • Shaders are in a folder called kompute-shaders. These are based on the Metal backend and MIT licensed.
  • Nomic's fork of Kompute is provided as a git submodule. This is Apache-2.0 licensed.

Limitations

  • There is currently no partial offload support, so it is either -ngl 1 or -ngl 0, like Metal. We plan to implement this eventually by adding a split point in the compute graph. We do not plan to implement per-op offload (what most backends do, including 0cc4m's Vulkan backend). (Update: partial offloading is now implemented.)
  • Supported model architectures are currently Falcon and Llama.
  • Supported model formats are Q4_0, Q4_1, FP16, and FP32, with or without Q6_K output tensors.
  • GPU-accelerated matmul for prompt processing is currently disabled due to a known issue with incorrect output. Token generation runs 100% on the GPU. (Update: GPU prompt processing has been fixed and re-enabled.)
  • Discrete NVIDIA and AMD GPUs have been tested successfully, but both integrated and discrete Intel GPUs are currently known not to work (garbage output).
  • This backend currently requires shaderFloat16, which is not available on, e.g., Pascal (GTX 10-series) and older NVIDIA GPUs. To check whether your GPU is supported, go to vulkan.gpuinfo.org, find your device, and look for shaderFloat16 under "Features > Core 1.2"; it must be supported.
  • Windows and Linux are fully supported. macOS has Metal so we haven't attempted to build this backend there.
  • This PR still needs to be updated for the per-layer KV cache. (Update: done a while ago.)
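
The "split point" idea from the partial-offload item can be illustrated with a minimal sketch. Everything here is hypothetical and is not taken from ggml-kompute.cpp: `Placement` and `place_layers` are invented names, and the choice to put the last `n_gpu_layers` layers on the GPU is an assumption made for illustration. The point is only that a single split index yields one contiguous CPU run and one contiguous GPU run, as opposed to deciding placement per op.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical placement tag; not a ggml type.
enum class Placement { CPU, GPU };

// Assign layers to devices around one split index:
// layers [0, split) stay on the CPU, layers [split, n_layers) go to the GPU.
// Assumption: the *last* n_gpu_layers layers are the ones offloaded.
std::vector<Placement> place_layers(int n_layers, int n_gpu_layers) {
    if (n_layers <= 0) {
        return {};
    }
    // Clamp the request into [0, n_layers] so oversized -ngl values offload everything.
    const int offloaded = std::min(std::max(n_gpu_layers, 0), n_layers);
    const int split = n_layers - offloaded;

    std::vector<Placement> placement(n_layers, Placement::CPU);
    std::fill(placement.begin() + split, placement.end(), Placement::GPU);
    return placement;
}
```

With four layers, requesting one GPU layer would leave layers 0 to 2 on the CPU and put layer 3 on the GPU; requesting more layers than exist offloads all of them.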

Contributions are welcome! This backend currently works well enough for Nomic's purposes so at the moment we are not focused on adding new features, only maintenance - our team is quite small right now. The goal of this PR is to reduce the maintenance burden for us, as llama.cpp frequently introduces changes that affect all backends.

niansa and others added 30 commits October 5, 2023 13:39
should no longer have new external deps other than libvulkan

```
ubuntu@ip-172-31-1-24:~/repo/gpt4all/gpt4all-backend/build$ ldd ./libllamamodel-mainline-avxonly.so
        linux-vdso.so.1 (0x00007ffcb53bb000)
        libvulkan.so.1 => /lib/x86_64-linux-gnu/libvulkan.so.1 (0x00007f239dab5000)
        libstdc++.so.6 => /lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007f239d800000)
        libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f239d719000)
        libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f239da95000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f239d400000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f239dd1d000)
```
… and

stop using so many static objects so we can tear down and bring up vulkan
on new devices in the same runtime.
There are some warnings in debug builds that are likely to be false
positives.

slaren (Collaborator) commented Jan 28, 2024

@ggerganov My concern is that a call to ggml_vk_init_device is required before the backend can be used. This creates two problems:

  • Adds unnecessary backend-specific code to llama.cpp
  • Forces other ggml applications that want to use this backend to add more backend-specific code

While I understand that this is not very high priority, the fix should be very simple (just move the initialization calls to the backend code), so I don't really see any reason to not fix it before merging.

cebtenzzre (Collaborator, Author) commented

> It does not use GPU for the moment.

Right now only Falcon and Llama are whitelisted. This backend is strict about the model format because it is designed to run a contiguous graph on the GPU (like Metal), instead of only offloading the ops that are implemented (like Occam's Vulkan PR).

0cc4m (Collaborator) commented Jan 28, 2024

> > It does not use GPU for the moment.
>
> Right now only Falcon and Llama are whitelisted. This backend is strict about the model format because it is designed to run a contiguous graph on the GPU (like Metal), instead of only offloading the ops that are implemented (like Occam's Vulkan PR).

At this point my backend also runs the full graph on GPU contiguously (if all layers are offloaded).

@cebtenzzre cebtenzzre requested a review from slaren January 29, 2024 17:57
The previous attempt actually broke GPU inference with the 'main'
example, which was previously working.

deviceName is a vk::ArrayWrapper1D. Be careful when we convert it to a
std::string, so we don't get null bytes at the end.
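
The pitfall described in this commit message can be sketched as follows. `vk::ArrayWrapper1D<char, N>` is a fixed-size array type, so constructing a `std::string` from its full extent copies the NUL padding as well. The sketch below uses a plain `std::array<char, N>` as a stand-in so that it compiles without the Vulkan headers, and `fixed_array_to_string` is a hypothetical helper name, not the function used in the backend.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <string>

// Hypothetical helper: convert a fixed-size, NUL-padded char array to a
// std::string that stops at the first NUL byte. If the array contains no
// NUL at all, the whole array is used and nothing is read past its end.
template <std::size_t N>
std::string fixed_array_to_string(const std::array<char, N> & buf) {
    const auto end = std::find(buf.begin(), buf.end(), '\0');
    return std::string(buf.begin(), end);
}
```

By contrast, `std::string(buf.data(), buf.size())` would produce a string whose length is always `N`, with the trailing NUL bytes included, which breaks comparisons and looks wrong when printed or logged.
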
@cebtenzzre cebtenzzre merged commit fbf1dde into master Jan 29, 2024
53 checks passed
@cebtenzzre cebtenzzre deleted the ceb/nomic-vulkan branch January 29, 2024 20:50

rajeevn1 commented

ggml_vk_graph_compute: error: unsupported op 'MUL_MAT'

I get this error using kompute backend, on Intel Corporation CoffeeLake-S GT2 [UHD Graphics 630] (rev 02)

cebtenzzre (Collaborator, Author) commented Jan 29, 2024

> ggml_vk_graph_compute: error: unsupported op 'MUL_MAT'

What model are you using? Maybe the fallback code got broken among the recent changes (edit: yes, it did) - it sounds like you are using an unsupported quantization type.

Also, Intel GPUs are not currently known to work, and I don't know how well integrated GPUs work in general at the moment.

rajeevn1 commented

I am using mistral-7b-instruct-v0.2.Q5_K_M.gguf

Titaniumtown commented

> ggml_vk_graph_compute: error: unsupported op 'MUL_MAT'
>
> I get this error using kompute backend, on Intel Corporation CoffeeLake-S GT2 [UHD Graphics 630] (rev 02)

I can reproduce on Alder Lake (Iris Xe)

rajeevn1 commented

I tried with different quantization mistral-7b-instruct-v0.2.Q4_K_M.gguf but get the same error

ggml_vk_graph_compute: error: unsupported op 'MUL_MAT'

cebtenzzre (Collaborator, Author) commented

> I tried with different quantization mistral-7b-instruct-v0.2.Q4_K_M.gguf but get the same error

This backend currently only supports Q4_0, Q4_1, and F16 quantizations. The latest master of llama.cpp will at least fall back to CPU in this case instead of failing.
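
As a rough illustration of the behavior described here, the sketch below models "use the GPU only for supported weight types, otherwise fall back to the CPU". It is not the llama.cpp implementation: `TensorType`, `Device`, `kompute_supports`, and `pick_device` are hypothetical names, the enum is a simplified stand-in for ggml's type enum, and the decision is shown per weight type only because the thread does not say at what granularity the real fallback happens. F32 is included because the PR description lists FP32 among the supported formats; the special case of Q6_K output tensors is not modeled.

```cpp
// Hypothetical stand-in for ggml's tensor type enum; only the names matter here.
enum class TensorType { F32, F16, Q4_0, Q4_1, Q4_K, Q5_K, Q6_K };

// Hypothetical device tag.
enum class Device { GPU, CPU };

// Weight types listed in this thread as supported by the Kompute backend.
bool kompute_supports(TensorType type) {
    switch (type) {
        case TensorType::F32:
        case TensorType::F16:
        case TensorType::Q4_0:
        case TensorType::Q4_1:
            return true;
        default:
            return false;
    }
}

// Fall back to the CPU instead of failing when the type is unsupported.
Device pick_device(TensorType weight_type) {
    return kompute_supports(weight_type) ? Device::GPU : Device::CPU;
}
```

Under this model, the Q4_K and Q5_K weights found in Q4_K_M and Q5_K_M files would be routed to the CPU, which matches the fallback described above.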

sorasoras commented

> > I tried with different quantization mistral-7b-instruct-v0.2.Q4_K_M.gguf but get the same error
>
> This backend currently only supports Q4_0, Q4_1, and F16 quantizations. The latest master of llama.cpp will at least fall back to CPU in this case instead of failing.

It would be nice if it could fall back to the other Vulkan backend.
Just use the two Vulkan backends together lol.

jordankanter pushed a commit to jordankanter/llama.cpp that referenced this pull request Feb 3, 2024
Signed-off-by: Jared Van Bortel <jared@nomic.ai>
Co-authored-by: niansa <anton-sa@web.de>
Co-authored-by: Adam Treat <treat.adam@gmail.com>
Co-authored-by: Aaron Miller <apage43@ninjawhale.com>
Co-authored-by: ToKiNoBug <tokinobug@163.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: slaren <slarengh@gmail.com>

niansa (Contributor) commented Feb 12, 2024

I am really happy the Kompute implementation finally made it into mainline llama.cpp!

akingoverlook commented Feb 25, 2024

> This is Nomic's Kompute-based Vulkan backend from the GPT4All project, now available under the MIT license. It can be enabled by building with cmake and passing -DLLAMA_KOMPUTE=ON (make is currently not supported).
>
> Structure

Nomic appears to claim support for Qualcomm GPUs by "Nomic Vulkan":
"September 18th, 2023: Nomic Vulkan launches supporting local LLM inference on AMD, Intel, Samsung, Qualcomm and NVIDIA GPUs."

Perhaps that is different from "Nomic's Kompute-based Vulkan backend"; that is not exactly clear. But what is clear is that this backend can't run on Qualcomm GPUs at all, because it wants uniformAndStorageBuffer access (8-bit and 16-bit), which their Vulkan driver does not report as supported (on any of their GPUs).

FYI, it also can't run on ARM GPUs, because they have a maximum subgroup size of 16 (even in the top-of-the-line Immortalis chips), and this backend wants 32.

Any chance these dependencies could be worked around, @cebtenzzre ?
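
The device requirements mentioned in this thread (shaderFloat16, 8-bit and 16-bit uniform/storage buffer access, and a subgroup size of 32) can be collected into a small check. This is a sketch only: `DeviceCaps` and `missing_requirements` are hypothetical names, the struct is filled by hand here where a real program would query the values through the Vulkan API, and treating the subgroup requirement as "at least 32" is an assumption, since the thread only says the backend "wants 32".

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical summary of the device properties discussed in this thread.
struct DeviceCaps {
    bool shader_float16;          // shaderFloat16
    bool uniform_storage_8bit;    // 8-bit uniform/storage buffer access
    bool uniform_storage_16bit;   // 16-bit uniform/storage buffer access
    std::uint32_t subgroup_size;  // subgroup size reported by the device
};

// Return a human-readable list of unmet requirements; empty means usable.
std::vector<std::string> missing_requirements(const DeviceCaps & caps) {
    std::vector<std::string> missing;
    if (!caps.shader_float16) {
        missing.push_back("shaderFloat16");
    }
    if (!caps.uniform_storage_8bit) {
        missing.push_back("8-bit uniform/storage buffer access");
    }
    if (!caps.uniform_storage_16bit) {
        missing.push_back("16-bit uniform/storage buffer access");
    }
    // Assumption: a subgroup size of at least 32 is sufficient.
    if (caps.subgroup_size < 32) {
        missing.push_back("subgroup size >= 32");
    }
    return missing;
}
```

Reporting every missing feature at once, instead of stopping at the first, makes it easier to tell whether a device is one workaround or several away from being usable.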

hodlen pushed a commit to hodlen/llama.cpp that referenced this pull request Apr 1, 2024
Labels: enhancement, high priority, need feedback