[Suggestion] Create some tutorials on how to use py-videocore6 #43

filiphhh · 2020-09-23T10:42:18Z

Hi!

I have been following py-videocore6 since you first made it public, and py-videocore before that.
I'd like to thank you for simplifying QPU programming on the Raspberry Pi.
I have been trying to learn how to make better use of the QPUs, but I find it difficult to find resources on the topic. I would love to see some tutorials on how to write applications with py-videocore6 and on how to parallelize programs to fully utilize the VideoCore's potential.

Thank you @Terminus-IMRC, @notogawa and Idein for all your hard work!

Terminus-IMRC · 2020-09-23T13:05:02Z

Thank you for supporting us!
We also think we need some tutorials for beginners, but currently, no tutorials exist except for the Japanese one: https://qiita.com/9_ties/items/15ab7fa198991a61a3a9

Because the instruction set of the VC6 QPU is very similar to that of the VC4 QPU, you can learn the basics of how a QPU works (add/mul ALU dual-issue, three branch delay slots, the TMU, etc.) from the VideoCore IV 3D Architecture Reference Guide.

Though there is no publicly available documentation for the VC6 QPU, you can gain an understanding of it from working examples.
I added some simple example programs to this repository, which may help you when you write VC6 QPU code:

examples/summation.py adds up 32-bit integers in an array.
With this QPU code, you can learn how to fill the eight-depth TMU read request queue to hide cache/memory latency.
examples/memset.py fills an array with a single 32-bit value.
You will find out that there is no request queue for TMU write.
examples/scopy.py copies an array, which combines the above two facts.
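
To see why keeping the TMU read queue full helps, here is a toy timing model in plain Python. It is only a sketch and not QPU code: the per-request latency of 20 cycles, the one-cycle cost for issuing a request, and the one-cycle cost for receiving a result are all made-up assumptions, and the function name `simulate_reads` is hypothetical. Only the queue depth of eight comes from the description above.

```python
from collections import deque


def simulate_reads(n, latency, depth):
    """Return the cycles needed to read n values in a toy model.

    Assumptions (illustrative only): issuing a request costs 1 cycle,
    its data is ready `latency` cycles later, receiving a result costs
    1 cycle, and at most `depth` requests may be in flight at once.
    """
    t = 0
    in_flight = deque()  # ready times of outstanding requests, oldest first
    issued = received = 0
    while received < n:
        # Keep the request queue as full as the depth limit allows.
        while issued < n and len(in_flight) < depth:
            t += 1
            in_flight.append(t + latency)
            issued += 1
        # Receive the oldest result, stalling until it is ready.
        ready = in_flight.popleft()
        t = max(t, ready) + 1
        received += 1
    return t


n, latency = 64, 20  # hypothetical workload and latency
one_at_a_time = simulate_reads(n, latency, depth=1)
queue_of_eight = simulate_reads(n, latency, depth=8)
print(one_at_a_time, queue_of_eight)
```

With a depth of one, every read pays the full latency; with eight requests in flight, most of that latency overlaps with issuing and receiving other requests.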

All three examples can run on multiple QPUs (up to eight), and they show how to assign a region of the input/output memory to each QPU.
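
The idea of giving each QPU its own memory region can be sketched on the host side in plain Python. This is an illustration under assumptions, not the code from the examples: the helper name `partition` is hypothetical, and requiring the length to be a multiple of 8 QPUs times 16 SIMD lanes is a simplification chosen here so that every chunk has the same size.

```python
def partition(n_elements, n_qpus=8, lanes=16):
    """Split [0, n_elements) into one contiguous (start, end) range per QPU.

    Assumption for simplicity: n_elements is a multiple of n_qpus * lanes,
    so every QPU gets the same number of whole 16-element vectors.
    """
    if n_elements % (n_qpus * lanes) != 0:
        raise ValueError("n_elements must be a multiple of n_qpus * lanes")
    chunk = n_elements // n_qpus
    return [(q * chunk, (q + 1) * chunk) for q in range(n_qpus)]


# Emulate a multi-QPU summation: each "QPU" sums its own range,
# and the host adds up the partial results afterwards.
data = list(range(1024))
ranges = partition(len(data))
partial_sums = [sum(data[start:end]) for start, end in ranges]
total = sum(partial_sums)
```

In a real kernel, each QPU would derive its range from its own index and a base address passed in as a uniform; the examples in the repository show how that is done.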

Also, notogawa added a matrix-matrix multiplication example, examples/sgemm.py.
The innermost loop of this code uses in-QPU vector rotation to reduce the number of memory loads/stores.
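
As a rough illustration of why rotation saves memory traffic, the plain-Python sketch below builds a full 16x16 outer product from two 16-element vectors that are each "loaded" only once. It is not taken from sgemm.py, whose actual scheme may differ; the helper names `rotate` and `outer_product_by_rotation` are hypothetical, and the rotation direction is chosen arbitrarily.

```python
LANES = 16  # a QPU operates on 16-element vectors


def rotate(vec, r):
    """Rotate a vector by r lanes, so that result[i] == vec[(i + r) % len(vec)]."""
    return vec[r:] + vec[:r]


def outer_product_by_rotation(a, b):
    """Compute c[i][j] = a[i] * b[j] using only rotations of b.

    Each rotation step stands for one 16-lane multiply; no element of
    a or b is fetched from memory again after the initial load.
    """
    c = [[0.0] * LANES for _ in range(LANES)]
    for r in range(LANES):
        b_rot = rotate(b, r)
        for i in range(LANES):  # emulates the 16 SIMD lanes
            c[i][(i + r) % LANES] = a[i] * b_rot[i]
    return c


a = [float(i + 1) for i in range(LANES)]
b = [float(2 * i) for i in range(LANES)]
c = outer_product_by_rotation(a, b)
```

Here 2 vector loads feed 16 vector multiplies, whereas a naive version would reload or broadcast an element of b for every row.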

In conclusion, I recommend that you start by writing a simple program (plain memory read/write, or array addition/subtraction/multiplication) with the examples as a reference.
Then you can consider how to approach the theoretical peak performance of 32 Gflop/s (by making good use of the register files and the TMU/L2 caches).
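
For reference, the 32 Gflop/s figure can be reproduced with a short calculation. The breakdown below assumes a Raspberry Pi 4 with the QPUs clocked at 500 MHz, eight QPUs arranged as 2 slices of 4, 4 physical lanes per QPU, and 2 floating-point operations per lane per cycle from the add/mul dual issue; treat these factors as assumptions to check against your own board's configuration.

```python
# Assumed hardware parameters (Raspberry Pi 4, default QPU clock).
clock_hz = 500e6            # QPU clock frequency
slices = 2                  # slices in the GPU
qpus_per_slice = 4          # 2 * 4 = 8 QPUs in total
lanes_per_qpu = 4           # physical lanes (16-way SIMD issued over 4 cycles)
ops_per_lane_per_cycle = 2  # add ALU + mul ALU issue together

peak_flops = (clock_hz * slices * qpus_per_slice
              * lanes_per_qpu * ops_per_lane_per_cycle)
peak_gflops = peak_flops / 1e9
print(f"{peak_gflops:.0f} Gflop/s")
```

Reaching this rate would require every instruction on every QPU to perform both an add and a multiply, so real kernels will land below it.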

Terminus-IMRC · 2020-09-24T09:03:57Z

We've just released other VC6 QPU examples: https://github.com/Idein/qmkl6

filiphhh · 2020-09-24T10:13:36Z

Thank you for your quick replies!
I would never have found the Japanese tutorial if you hadn't shared the link to it. Google Translate seems to do a pretty good job of translating it.
Great to see QMKL for rpi4 too!

I hope that your libraries will get some more traction in the community, as they unlock a lot more power in these little devices.
Thank you for all the low level rpi resources you have produced!
