Port to Google Tensor G2/G3 on Pixel Phones #829

It would be nice to use the TPU in those SoCs to improve speed.

Comments
Yes

Any progress on this? Would be really cool to take advantage of this feature!

Re-upping.

I don't think there is any API that allows us to use the TPU - is that correct?
I found this:
https://www.tensorflow.org/lite/android/delegates/gpu_native

It looks like there is an API, but it will never support llama model arches.
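For context, the linked page covers the native (C/C++) GPU delegate, which accelerates an already-converted .tflite graph rather than exposing the GPU or TPU directly. A rough sketch of how it is used (the model path is illustrative, and any ops the delegate can't handle stay on the CPU):

```cpp
// Minimal sketch of the TFLite native GPU delegate flow; "model.tflite" is an
// illustrative path, not something this project produces today.
#include <memory>

#include "tensorflow/lite/delegates/gpu/delegate.h"
#include "tensorflow/lite/interpreter.h"
#include "tensorflow/lite/kernels/register.h"
#include "tensorflow/lite/model.h"

bool RunOnGpu(const char* model_path) {
  // The delegate only runs graphs already converted to TFLite ops, which is
  // the catch for llama-style architectures.
  auto model = tflite::FlatBufferModel::BuildFromFile(model_path);
  if (!model) return false;

  tflite::ops::builtin::BuiltinOpResolver resolver;
  std::unique_ptr<tflite::Interpreter> interpreter;
  tflite::InterpreterBuilder(*model, resolver)(&interpreter);
  if (!interpreter) return false;

  // Create the GPU delegate with default options and hand it the supported
  // parts of the graph; anything it cannot handle falls back to the CPU.
  TfLiteDelegate* delegate = TfLiteGpuDelegateV2Create(/*options=*/nullptr);
  if (interpreter->ModifyGraphWithDelegate(delegate) != kTfLiteOk) {
    TfLiteGpuDelegateV2Delete(delegate);
    return false;
  }

  // Input tensors would be filled here before calling Invoke().
  bool ok = interpreter->Invoke() == kTfLiteOk;

  TfLiteGpuDelegateV2Delete(delegate);
  return ok;
}
```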
Out of curiosity, is using the Vulkan backend not enough?
Vulkan probably isn't that efficient in terms of performance per watt at int8 compared to the G2 and G3 TPUs: https://ai-benchmark.com/ranking_processors
The benchmark appears to use MobileBERT, so it's not necessarily applicable to llamas.
Not an expert here, but why is it not applicable to llama models if TF Lite supports the backend?
Here's hoping UXL provides a standardised mobile API across all Android phones:
Also, TF Lite supports conversion from TF models: https://www.tensorflow.org/lite/api_docs/python/tf/lite/TFLiteConverter
This issue was closed because it has been inactive for 14 days since being marked as stale.
I built the Vulkan backend for a Pixel 7 Pro, and it just runs much slower than a CPU build: using ngl=1 halves the speed compared to CPU inference, and each step from ngl=2 onwards keeps halving it (measuring tokens per second in the llama-cli output). I don't think performance per watt is the issue as mentioned here.

I was going to do a study to see if we can get away with using a more power-consuming inference API, but I think the Mali GPU in the Pixel 7 Pro is not efficient enough at compute and is probably struggling with such a large model. My theory is that there is actually a lot of data transfer happening at each inference layer for GPU access (I'm still not familiar with how the shared memory works compared to dedicated VRAM), and so it just slows down. I am not sure how much the TPUs would impact performance, but it seems like we would first need to get the XLA build working, unless these TPU processors are completely different from the high-end ones on Kaggle.

Originally posted by @ymwangg in #3253

I am going to eventually check whether I can get this running with a PyTorch build on the Pixel 7 Pro and see if that offers a speedup.
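As a side note on reproducing the ngl experiment above: the -ngl flag corresponds to n_gpu_layers in llama.cpp's C API, so the CPU-vs-Vulkan comparison can also be scripted. A rough sketch, assuming a Vulkan-enabled build; exact function names have shifted between llama.cpp versions, so treat it as illustrative:

```cpp
// Illustrative only: what llama-cli's -ngl flag controls under the hood.
// API names follow llama.h but have changed across llama.cpp versions.
#include <cstdio>

#include "llama.h"

int main() {
  llama_backend_init();

  llama_model_params mparams = llama_model_default_params();
  mparams.n_gpu_layers = 2;  // same effect as running llama-cli with -ngl 2

  // "model.gguf" is a placeholder path for whatever model is being tested.
  llama_model* model = llama_load_model_from_file("model.gguf", mparams);
  if (!model) {
    fprintf(stderr, "failed to load model\n");
    llama_backend_free();
    return 1;
  }

  // A context would be created here and generation timed (tokens/second),
  // comparing n_gpu_layers = 0 (CPU baseline) against 1, 2, ...
  llama_context_params cparams = llama_context_default_params();
  llama_context* ctx = llama_new_context_with_model(model, cparams);

  llama_free(ctx);
  llama_free_model(model);
  llama_backend_free();
  return 0;
}
```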