Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add llama compatibility with new ggml quantization #642

Merged
merged 8 commits into from
May 21, 2023

Conversation

niansa
Copy link
Contributor

@niansa niansa commented May 19, 2023

Describe your changes

This change introduces full compatibility with new ggml quanitzation without killing the old one (which is renamed to {llama,ggml}-old).
The API is be kept unchanged and these changes completely invisible to it.

Issue ticket number and link

Every single one that complains about new llama models not working :-)

@niansa niansa marked this pull request as ready for review May 19, 2023 16:33
@niansa niansa changed the title Add compatibility with new ggml quantization Add llama compatibility with new ggml quantization May 19, 2023
@niansa niansa requested a review from manyoso May 19, 2023 17:50
@niansa
Copy link
Contributor Author

niansa commented May 19, 2023

Closes #541 once dlopen_backend is merged into main

@kuvaus
Copy link
Contributor

kuvaus commented May 19, 2023

Closes #541 once dlopen_backend is merged into main

Nice! Thank you! :)

I looked at the code and looks like the temperature sampling is the same. You added tfs_z typical_p but hardcoded them to 1.0 = disabled. This is probably better because at some point it would be worth to add typical_p, tfs_z and other parameters to the prompt_context. But lets just get this running first

@niansa
Copy link
Contributor Author

niansa commented May 19, 2023

Closes #541 once dlopen_backend is merged into main

Nice! Thank you! :)

I looked at the code and looks like the temperature sampling is the same. You added tfs_z typical_p but hardcoded them to 1.0 = disabled. This is probably better because at some point it would be worth to add typical_p, tfs_z and other parameters to the prompt_context. But lets just get this running first

That's why I put them into the gpt_params for now! It's something for a later PR to actually expose it.

@niansa
Copy link
Contributor Author

niansa commented May 19, 2023

Looks like ggerganov made another breaking change.

@niansa
Copy link
Contributor Author

niansa commented May 19, 2023

Thanks to @imaami for helping with the commit history :-)

@imaami
Copy link
Contributor

imaami commented May 19, 2023

Closes #541 once dlopen_backend is merged into main

Nice! Thank you! :)

I looked at the code and looks like the temperature sampling is the same. You added tfs_z typical_p but hardcoded them to 1.0 = disabled. This is probably better because at some point it would be worth to add typical_p, tfs_z and other parameters to the prompt_context. But lets just get this running first

I did something like this, or at least I hope. If you're interested: imaami@fea4747

@niansa
Copy link
Contributor Author

niansa commented May 20, 2023

I did something like this, or at least I hope. If you're interested: imaami@fea4747

Please feel free to PR this once this whole thing is merged into master!

We should add as few unrelated things as possible to the dlopen_backend branch.

@kuvaus
Copy link
Contributor

kuvaus commented May 20, 2023

I did something like this, or at least I hope. If you're interested: imaami@fea4747

This is great! I like it. :)
@manyoso will likely want you to update the set(LLMODEL_VERSION_PATCH 2) in gpt4all-backend/CMakeLists.txt in imaami@fea4747 since your changes add to the llmodel_c.h interface and bindings developers now have access to extra features.

I'll close #541 as this #642 implementation is better.

@imaami
Copy link
Contributor

imaami commented May 20, 2023

I did something like this, or at least I hope. If you're interested: imaami@fea4747

Please feel free to PR this once this whole thing is merged into master!

We should add as few unrelated things as possible to the dlopen_backend branch.

Sure thing, let's do things in order.

gpt4all-backend/llamamodel.cpp Outdated Show resolved Hide resolved
Signed-off-by: niansa/tuxifan <tuxifan@posteo.de>
Copy link
Collaborator

@manyoso manyoso left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please address comments...

@@ -71,18 +73,32 @@ foreach(BUILD_VARIANT IN LISTS BUILD_VARIANTS)
PROPERTY INTERPROCEDURAL_OPTIMIZATION ${IPO_SUPPORTED})
endfunction()

# Add each individual implementation
add_library(llamamodel-${BUILD_VARIANT} SHARED
# Add each individual implementations
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nitpick, you don't want the plural here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I noticed that as well, but decided to leave it as is since it's not worth a commit. Will batch this with further things that may come up.

url = https://github.com/manyoso/llama.cpp.git
[submodule "llama.cpp-mainline"]
path = gpt4all-backend/llama.cpp-mainline
url = https://github.com/ggerganov/llama.cpp.git
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, ok, i get ya, but this isn't actually pinning them. Also, I think I still want all of them to use the 'manyoso' fork as this gives us further control,right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure what you mean, the manyoso fork hasn't been updated to latest llama.cpp, it's 132 commits behind...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also that fork only adds alibi, which is only needed for MPT

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean we should update that fork, and point to it I believe. lemme do that now.

llamamodel.cpp)
prepare_target(llamamodel llama)
target_compile_definitions(llamamodel-mainline-${BUILD_VARIANT} PRIVATE
LLAMA_VERSIONS=>=3 LLAMA_DATE=999999)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

=>= oh man cmake.. you're kiling me

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Haha, yup. Looks confusing, is confusing, but does what we need quite flexibly.

Copy link
Contributor

@imaami imaami May 21, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That conditional should probably be changed to a slightly less cursed variant:

#if LLAMA_VERSION <= 123456
// ...
#elif LLAMA_VERSION >= 654321
// ...
#endif

At least then it would be a readily recognizable pattern of tragic stylistic compromise instead of a confusing entirely new way to crush one's hopes and dreams. Would also shrink the cmake side a little.

Pardon the gallows humour, can't help it whenever pre-processor macros seem necessary. ;)


add_library(gptj-${BUILD_VARIANT} SHARED
gptj.cpp)
prepare_target(gptj ggml)
prepare_target(gptj ggml-230511)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wait, where are you tagging the actual ggml with this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

llama.cpp.cmake adds the given suffix to ggml as well.

int32_t n_keep = 0; // number of tokens to keep from initial prompt
#if LLAMA_DATE <= 230511
int32_t n_parts = -1; // amount of model parts (-1 = determine from model dimensions)
#endif
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The crux of it. We're going to use macros...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Our other option would be to have an extensive collection of almost-identical llamamodel.cpp files for different llama.cpp versions.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, I think this is the right choice of a bunch of bad choices.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's also CRTP and C++ template magic, but I agree it's not the time to go there yet.

add_library(llama${SUFFIX}
${DIRECTORY}/llama.cpp
${DIRECTORY}/llama.h
${DIRECTORY}/llama_util.h)
${DIRECTORY}/${LLAMA_UTIL_SOURCE_FILE})
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This branch doesn't actually introduce this file, right? It exists upstream in one of the pinned submodules?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The filename was changed.

llama_sample_top_p(ctx, &candidates_p, top_p, 1);
llama_sample_temperature(ctx, &candidates_p, temp);
return llama_sample_token(ctx, &candidates_p);
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Going to assume this is giving you sane results? Have you made sure to go through and test models with each of the pinned variants and file formats? Man, we almost want regression or unit tests here...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup! I did. Man was my harddrive full..

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is also how it's done in the llama.cpp main example.

gpt4all-backend/llamamodel.cpp Show resolved Hide resolved
@manyoso manyoso merged commit 1e037f8 into nomic-ai:dlopen_backend May 21, 2023
@redthing1
Copy link
Contributor

Thank god someone is handling this migration the sane way.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants