Port to Google Tensor G2/G3 on Pixel Phones #829

Closed
teaalltr opened this issue Apr 7, 2023 · 13 comments

@teaalltr commented Apr 7, 2023

It would be nice to use the TPU in those SoCs to improve speed

@Sovenok-Hacker

> It would be nice to use the TPU in those SoCs to improve speed

Yes

@Titaniumtown

Any progress on this? Would be really cool to take advantage of this feature!

@github-actions bot added the stale label Mar 25, 2024
@gjmulder (Collaborator) commented Apr 3, 2024

Re-upping.

@ggerganov (Member)

I don't think there is any API that allows us to use the TPU - is that correct?

@gjmulder (Collaborator) commented Apr 3, 2024

> I don't think there is any API that allows us to use the TPU - is that correct?

I found this:

> Using graphics processing units (GPUs) to run your machine learning (ML) models can dramatically improve the performance and the user experience of your ML-enabled applications. On Android devices, you can enable GPU-accelerated execution of your models using a delegate and one of the following APIs:
>
> - Interpreter API - guide
> - Task library API - guide
> - Native (C/C++) API - this guide
>
> This guide covers advanced uses of the GPU delegate for the C API, C++ API, and use of quantized models. For more information about using the GPU delegate for TensorFlow Lite, including best practices and advanced techniques, see the GPU delegates page.

https://www.tensorflow.org/lite/android/delegates/gpu_native
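For illustration, here is a minimal sketch of that delegate pattern through the TensorFlow Lite Python API (the quoted guide covers the native C/C++ path; the delegate library name and model path below are assumptions, not something llama.cpp ships):

```python
# Minimal sketch (assumptions: a converted model.tflite exists and a GPU
# delegate shared library is available; the library name varies per platform/build).
import numpy as np
import tensorflow as tf

gpu_delegate = tf.lite.experimental.load_delegate(
    "libtensorflowlite_gpu_delegate.so"  # assumed library name
)

interpreter = tf.lite.Interpreter(
    model_path="model.tflite",              # hypothetical model file
    experimental_delegates=[gpu_delegate],  # run supported ops through the delegate
)
interpreter.allocate_tensors()

inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]

# One inference with a dummy input of the expected shape/dtype.
interpreter.set_tensor(inp["index"], np.zeros(inp["shape"], dtype=inp["dtype"]))
interpreter.invoke()
print(interpreter.get_tensor(out["index"]).shape)
```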

@gjmulder (Collaborator) commented Apr 3, 2024

It looks like there is an API, but it will never support llama model arches.

@gjmulder closed this as not planned Apr 3, 2024
@phymbert (Collaborator) commented Apr 3, 2024

Out of curiosity, is using the Vulkan backend not enough?

@gjmulder (Collaborator) commented Apr 3, 2024

Vulkan probably isn't that efficient in terms of performance per Watt at int8 compared to the G2 and G3 TPUs:

https://ai-benchmark.com/ranking_processors

Benchmark looks to be using MobileBERT, so not necessarily applicable to llamas.

@teaalltr (Author) commented Apr 4, 2024

Not an expert here, but why is it not applicable to llama models if TF Lite supports the backend?

@gjmulder reopened this Apr 4, 2024
@gjmulder (Collaborator) commented Apr 4, 2024

Here's hoping UXL provides a standardised mobile API across all Android phones:

https://www.oneapi.io/spec/

@gjmulder changed the title from "Port to Google Tensor/T2 on Pixel phones" to "Port to Google Tensor G2/G3 on Pixel Phones" Apr 4, 2024
@teaalltr (Author) commented Apr 4, 2024

> Not an expert here, but why is it not applicable to llama models if TF Lite supports the backend?

Also, TF Lite supports conversion from TF models: https://www.tensorflow.org/lite/api_docs/python/tf/lite/TFLiteConverter
I don't know of any other technical limitation; please correct me if I'm wrong. A llama implementation in TF exists, to the best of my knowledge.
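For context, the conversion path referenced above looks roughly like this; a hedged sketch assuming a hypothetical TensorFlow SavedModel export at ./llama_saved_model (the paths and the quantization choice are illustrative only):

```python
# Hedged sketch: converting a TensorFlow SavedModel to a .tflite flatbuffer.
# "./llama_saved_model" is a hypothetical export path, not something llama.cpp provides.
import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_saved_model("./llama_saved_model")
# Optional: request default optimizations (weight quantization). Full int8
# execution on an accelerator would also need a representative_dataset for calibration.
converter.optimizations = [tf.lite.Optimize.DEFAULT]

tflite_model = converter.convert()
with open("llama.tflite", "wb") as f:
    f.write(tflite_model)
```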

@github-actions bot removed the stale label Apr 5, 2024
@github-actions bot added the stale label May 6, 2024
@github-actions bot

This issue was closed because it has been inactive for 14 days since being marked as stale.

@Ananay-22 commented Jan 27, 2025

I built the Vulkan backend for a Pixel 7 Pro, and it runs much slower than a CPU build. Using -ngl 1 halves the speed relative to CPU inference, and -ngl 2 onwards keeps halving it (measuring tokens per second in the llama-cli output).

I don't think performance per Watt is the issue, as was mentioned here:

> Vulkan probably isn't that efficient in terms of performance per Watt at int8 compared to the G2 and G3 TPUs:
>
> https://ai-benchmark.com/ranking_processors
>
> Benchmark looks to be using MobileBERT, so not necessarily applicable to llamas.

I was going to do a study to see if we can get away with using a more power-consuming inference API.

But I think the Mali GPU in the Pixel 7 Pro is not efficient enough at compute and is probably struggling with such a large model. My theory is that there is actually a lot of data transfer happening at each inference layer for GPU access (I'm still not familiar with how the shared memory works compared to dedicated VRAM), and so it just slows down.

I am not sure how much the TPUs would impact performance, but it seems like we would first need to get the XLA build working, unless these TPU processors are completely different from the high-end ones on Kaggle.

> It's possible, but I guess it requires substantial amount of engineering efforts. I've seen some Google folks make llama work through pytorch/xla https://pytorch.org/blog/path-achieve-low-inference-latency/.

Originally posted by @ymwangg in #3253

I am eventually going to check whether I can get this running with a PyTorch build on the Pixel 7 Pro and see if that offers a speedup.
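For reference, a minimal sketch of the pytorch/xla pattern mentioned in the quote above, assuming torch and torch_xla are installed and an XLA backend is available (the tiny model here is a placeholder, not llama):

```python
# Minimal torch_xla sketch (assumptions: torch_xla is installed and an XLA
# device/backend is available; TinyModel is a placeholder, not a llama model).
import torch
import torch_xla.core.xla_model as xm


class TinyModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(16, 16)

    def forward(self, x):
        return self.linear(x)


device = xm.xla_device()        # pick the XLA device exposed by the backend
model = TinyModel().to(device)  # move weights onto the XLA device

x = torch.randn(1, 16, device=device)
with torch.no_grad():
    y = model(x)
xm.mark_step()                  # flush the lazily traced graph for compilation/execution

print(y.cpu())
```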
