
Try to use quantized ggml_mul_mat in attention layer #1098

Closed
ggerganov opened this issue Apr 21, 2023 · 1 comment · May be fixed by #1103
Labels
good first issue Good for newcomers performance Speed related topics

Comments

@ggerganov
Member

The following 2 matrix multiplication calls still remain in FP16 precision:

Was wondering: if we quantize those on-the-fly, would there be any benefit?
The quantization can be done with an extra ggml_cpy() call before the ggml_mul_mat() call.

See if this speeds up the computation and how it affects perplexity.
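
A minimal sketch of the proposed change, assuming a hypothetical helper (`mul_mat_q` is not part of ggml) and that the row size of the F16 operand is a multiple of the Q4_0 block size:

```c
#include "ggml.h"

// Hypothetical helper: quantize the F16 operand `a` to Q4_0 via an extra
// ggml_cpy() and then run the usual ggml_mul_mat() with the quantized tensor.
static struct ggml_tensor * mul_mat_q(
        struct ggml_context * ctx,
        struct ggml_tensor  * a,    // F16 operand (e.g. a KV cache view)
        struct ggml_tensor  * b) {  // the other operand
    // allocate a Q4_0 tensor with the same 2D shape as `a`
    // (assumes a->ne[0] is a multiple of the Q4_0 block size)
    struct ggml_tensor * a_q = ggml_new_tensor_2d(ctx, GGML_TYPE_Q4_0, a->ne[0], a->ne[1]);

    // ggml_cpy() performs the type conversion, i.e. the on-the-fly quantization
    a_q = ggml_cpy(ctx, a, a_q);

    // multiply using the quantized operand in place of the F16 one
    return ggml_mul_mat(ctx, a_q, b);
}
```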

@ggerganov
Member Author

Performance stats on 7B indicate that the F16 matrix multiplications account for just 2% of the total processing time.
The PoC in #1103 does not show any significant performance gains, so closing this for now.
