Replies: 3 comments
-
I thought about EdgeTPU for a while now, but I can't support HW that I don't have available for testing.

Regarding SOEdge, it sounds easy enough given that their Acuity toolkit is built on top of TF to start with, so conversion should be straightforward. But looking at the HW specs, I doubt the performance improvement would be 100x; more likely 5-10x at best.

Also, when combining it with an RPi, the issue can easily become an I/O bottleneck (IMO, this is due to the fact that the RPi platform does not have a dedicated I/O controller and instead uses the CPU for data management). The RPi platform is pretty bad on that side, so how do you keep the SOEdge saturated? E.g., if you need to pass off a tensor equivalent of a 640x480 image at 25 FPS to the SOEdge, that is ~25MB/sec, which is already enough to cause issues on an RPi3 (OK on an RPi4). And that is low resolution and far from high FPS.

For that reason, I'm more interested in tightly coupled accelerators like the nVidia Jetson Nano (although that one is pretty old nowadays and there hasn't been a refresh in a while). But then there is the question of which exact ML kernel ops are supported by the backend. Another issue is that a lot of low-end accelerators are notoriously bad with FP32 precision; their best acceleration comes in INT32 land.

All-in-all, I'm playing a wait-and-see game when it comes to SOC solutions - waiting for them to mature a bit.
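For reference, a quick back-of-the-envelope sketch of the raw frame bandwidth involved (assuming uncompressed 3-channel uint8 frames, no batching or compression):

```typescript
// Rough bandwidth estimate for streaming raw frames to an external accelerator.
// Assumes uncompressed 3-channel uint8 frames (1 byte per channel).
function frameBandwidthMBps(width: number, height: number, fps: number, channels = 3): number {
  const bytesPerFrame = width * height * channels;
  return (bytesPerFrame * fps) / (1024 * 1024);
}

console.log(frameBandwidthMBps(640, 480, 25).toFixed(1));   // ~22 MB/s, roughly the ~25MB/sec figure above
console.log(frameBandwidthMBps(1280, 720, 30).toFixed(1));  // ~79 MB/s at 720p30
console.log(frameBandwidthMBps(1920, 1080, 30).toFixed(1)); // ~178 MB/s at 1080p30
```

So anything above low resolution and modest FPS quickly outgrows what an RPi can comfortably shuttle over USB or SPI to a loosely coupled accelerator.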
-
Yes, I'm in the same situation, namely wait-and-see, but I admit I'm growing increasingly curious. For prototyping I can use an RPi3, RPi4, PinePhone or my desktop, but having a dedicated NPU in order to better understand the workflow could be nice. I keep seeing new setups arriving, like the SOEdge, so I'm wondering where the opportunity could be. I imagine mostly for privacy-sensitive use cases (so no iOS/Android mobile setups) with real-time requirements and a small footprint. I was also considering the Jetson Nano, so that would indeed be an easier start. I'll share some feedback then; hopefully I won't have to resort to shenanigans like https://enricopiccini.com/en/kb/HOW_TO_RUN_TensorflowJs_in_NodeJs_on_NVidia_Jetson_Nano_arm64_-658
-
FYI, building TF from scratch using Bazel is not that complicated, but it's extremely CPU and memory intensive and will run into many issues unless you edit the Bazel configuration to work in your environment. And then it's going to take a while (on an RPi, more than a day).

Btw, I used to use an RPi4 as my home server, but its slow I/O was a constant issue for me, plus the constant need to rebuild stuff due to lack of ARM64 support. In the end, I switched to the x86 architecture.

Which brings me to another topic: my nVidia GPU with 6GB VRAM frequently runs into OOM issues, and there is no chance I can run with batch sizes higher than 1 or with concurrent models (I must serialize everything). Which means that for my purposes, the smallest edge accelerator that would fit my needs is the nVidia Jetson TX2, and it's just easier to stay with my desktop.
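A minimal sketch of what I mean by serializing everything: a simple promise queue so only one inference runs on the GPU at a time (`runModelA` / `runModelB` are hypothetical async inference functions, not real library calls):

```typescript
// Serialize GPU work: each task starts only after the previous one has settled,
// which avoids two models competing for VRAM at the same time.
let queue: Promise<unknown> = Promise.resolve();

function serialize<T>(task: () => Promise<T>): Promise<T> {
  const result = queue.then(task);        // chain onto whatever ran before
  queue = result.catch(() => undefined);  // swallow errors so the queue keeps moving
  return result;
}

// Usage (hypothetical model runners): both calls execute strictly one after the other.
// serialize(() => runModelA(inputTensor));
// serialize(() => runModelB(inputTensor));
```

It doesn't make anything faster, but it keeps peak VRAM usage at a single model instead of the sum of all of them.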
-
I'm getting some limited performance on limited hardware (e.g. 1 FPS on an RPi3 for BlazePose), which is expected.
I'm starting to see more and more boards dedicated to AI/ML with not just a CPU or GPU but an NPU. For example, I'm considering the SOEdge by Pine64 https://wiki.pine64.org/wiki/SOEdge and noticed that specific models https://verisilicon.github.io/acuity-models/ are linked for this architecture.
Assuming that it would result in radically better performance, e.g. 100x, could it be interesting to support this workflow?
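For context, here is a rough sketch of how I time inference with @tensorflow/tfjs-node to get numbers like the 1 FPS above (the model path and input shape are placeholders, not the exact BlazePose setup):

```typescript
// Rough per-inference timing sketch using @tensorflow/tfjs-node.
import * as tf from '@tensorflow/tfjs-node';

async function benchmark(modelPath: string, inputShape: number[], runs = 50): Promise<void> {
  const model = await tf.loadGraphModel(`file://${modelPath}`);
  const input = tf.zeros(inputShape);

  // Warm-up run so one-time initialization is not counted.
  const warm = model.predict(input) as tf.Tensor;
  await warm.data();
  warm.dispose();

  const start = Date.now();
  for (let i = 0; i < runs; i++) {
    const out = model.predict(input) as tf.Tensor;
    await out.data(); // force execution to complete before timing the next run
    out.dispose();
  }
  const ms = (Date.now() - start) / runs;
  console.log(`avg ${ms.toFixed(1)} ms/inference (~${(1000 / ms).toFixed(1)} FPS)`);

  input.dispose();
}

// Hypothetical invocation: adjust the path and input shape to the model under test.
// benchmark('models/blazepose/model.json', [1, 256, 256, 3]);
```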