Port to Google Tensor G2/G3 on Pixel Phones #829

Closed
teaalltr opened this issue Apr 7, 2023 · 13 comments

@teaalltr commented Apr 7, 2023

It would be nice to use the TPU in those SoCs to improve speed

@Sovenok-Hacker

> It would be nice to use the TPU in those SoCs to improve speed

Yes

@Titaniumtown

Any progress on this? Would be really cool to take advantage of this feature!

@github-actions bot added the stale label Mar 25, 2024
@gjmulder (Collaborator) commented Apr 3, 2024

Re-upping.

@ggerganov (Member)

I don't think there is any API that allows us to use the TPU - is that correct?

@gjmulder (Collaborator) commented Apr 3, 2024

> I don't think there is any API that allows us to use the TPU - is that correct?

I found this:

> Using graphics processing units (GPUs) to run your machine learning (ML) models can dramatically improve the performance and the user experience of your ML-enabled applications. On Android devices, you can enable GPU-accelerated execution of your models using a delegate and one of the following APIs:
>
> - Interpreter API - guide
> - Task library API - guide
> - Native (C/C++) API - this guide
>
> This guide covers advanced uses of the GPU delegate for the C API, C++ API, and use of quantized models. For more information about using the GPU delegate for TensorFlow Lite, including best practices and advanced techniques, see the GPU delegates page.

https://www.tensorflow.org/lite/android/delegates/gpu_native
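For illustration, here is a minimal sketch of that delegate pattern through the TensorFlow Lite Python API (the quoted guide covers the native C/C++ path; the delegate library name and model path below are assumptions, not something llama.cpp ships):

```python
# Minimal sketch (assumptions: a converted model.tflite exists and a GPU
# delegate shared library is available; the library name varies per platform/build).
import numpy as np
import tensorflow as tf

gpu_delegate = tf.lite.experimental.load_delegate(
    "libtensorflowlite_gpu_delegate.so"  # assumed library name
)

interpreter = tf.lite.Interpreter(
    model_path="model.tflite",              # hypothetical model file
    experimental_delegates=[gpu_delegate],  # run supported ops through the delegate
)
interpreter.allocate_tensors()

inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]

# One inference with a dummy input of the expected shape/dtype.
interpreter.set_tensor(inp["index"], np.zeros(inp["shape"], dtype=inp["dtype"]))
interpreter.invoke()
print(interpreter.get_tensor(out["index"]).shape)
```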

@gjmulder (Collaborator) commented Apr 3, 2024

It looks like there is an API, but it will never support llama model arches.

@gjmulder closed this as not planned Apr 3, 2024
@phymbert (Collaborator) commented Apr 3, 2024

Out of curiosity, is using the Vulkan backend not enough?

@gjmulder (Collaborator) commented Apr 3, 2024

Vulkan probably isn't that efficient in terms of performance per Watt at int8 compared to the G2 and G3 TPUs:

https://ai-benchmark.com/ranking_processors

Benchmark looks to be using MobileBERT, so not necessarily applicable to llamas.

@teaalltr (Author) commented Apr 4, 2024

Not an expert here, but why is it not applicable to llama models if TF Lite supports the backend?

@gjmulder reopened this Apr 4, 2024
@gjmulder (Collaborator) commented Apr 4, 2024

Here's hoping UXL provides a standardised mobile API across all Android phones:

https://www.oneapi.io/spec/

@gjmulder changed the title from "Port to Google Tensor/T2 on Pixel phones" to "Port to Google Tensor G2/G3 on Pixel Phones" Apr 4, 2024
@teaalltr (Author) commented Apr 4, 2024

> Not an expert here, but why is it not applicable to llama models if TF Lite supports the backend?

Also, TF Lite supports conversion from TF models: https://www.tensorflow.org/lite/api_docs/python/tf/lite/TFLiteConverter
I don't know of any other technical limitation; please correct me if I'm wrong. A llama implementation in TF exists, to the best of my knowledge.
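For context, the conversion path referenced above looks roughly like this; a hedged sketch assuming a hypothetical TensorFlow SavedModel export at ./llama_saved_model (the paths and the quantization choice are illustrative only):

```python
# Hedged sketch: converting a TensorFlow SavedModel to a .tflite flatbuffer.
# "./llama_saved_model" is a hypothetical export path, not something llama.cpp provides.
import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_saved_model("./llama_saved_model")
# Optional: request default optimizations (weight quantization). Full int8
# execution on an accelerator would also need a representative_dataset for calibration.
converter.optimizations = [tf.lite.Optimize.DEFAULT]

tflite_model = converter.convert()
with open("llama.tflite", "wb") as f:
    f.write(tflite_model)
```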

@github-actions bot removed the stale label Apr 5, 2024
@github-actions bot added the stale label May 6, 2024
@github-actions bot

This issue was closed because it has been inactive for 14 days since being marked as stale.

@Ananay-22 commented Jan 27, 2025

I built the Vulkan backend for a Pixel 7 Pro, and it runs much slower than a CPU build. Using -ngl 1 halves the speed relative to CPU inference, and -ngl 2 onwards keeps halving it (measuring tokens per second in the llama-cli output).

I don't think performance per Watt is the issue, as was mentioned here:

> Vulkan probably isn't that efficient in terms of performance per Watt at int8 compared to the G2 and G3 TPUs:
>
> https://ai-benchmark.com/ranking_processors
>
> Benchmark looks to be using MobileBERT, so not necessarily applicable to llamas.

I was going to do a study to see if we can get away with using a more power-consuming inference API.

But I think the Mali GPU in the Pixel 7 Pro is not efficient enough at compute and is probably struggling with such a large model. My theory is that there is actually a lot of data transfer happening at each inference layer for GPU access (I'm still not familiar with how the shared memory works compared to dedicated VRAM), and so it just slows down.

I am not sure how much the TPUs would impact performance, but it seems like we would first need to get the XLA build working, unless these TPU processors are completely different from the high-end ones on Kaggle.

> It's possible, but I guess it requires substantial amount of engineering efforts. I've seen some Google folks make llama work through pytorch/xla https://pytorch.org/blog/path-achieve-low-inference-latency/.

Originally posted by @ymwangg in #3253

I am eventually going to check whether I can get this running with a PyTorch build on the Pixel 7 Pro and see if that offers a speedup.
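For reference, a minimal sketch of the pytorch/xla pattern mentioned in the quote above, assuming torch and torch_xla are installed and an XLA backend is available (the tiny model here is a placeholder, not llama):

```python
# Minimal torch_xla sketch (assumptions: torch_xla is installed and an XLA
# device/backend is available; TinyModel is a placeholder, not a llama model).
import torch
import torch_xla.core.xla_model as xm


class TinyModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(16, 16)

    def forward(self, x):
        return self.linear(x)


device = xm.xla_device()        # pick the XLA device exposed by the backend
model = TinyModel().to(device)  # move weights onto the XLA device

x = torch.randn(1, 16, device=device)
with torch.no_grad():
    y = model(x)
xm.mark_step()                  # flush the lazily traced graph for compilation/execution

print(y.cpu())
```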
