[Suggestion] Create some tutorials on how to use py-videocore6 #43

Open
filiphhh opened this issue Sep 23, 2020 · 3 comments

@filiphhh

Hi!

I have been following py-videocore6 since you first made it public and py-videocore before that.
I'd like to thank you for simplifying QPU programming on the Raspberry Pi.
I have been trying to learn how to make better use of the QPUs, but resources for it are hard to find. I would love to see some tutorials on how to write applications with py-videocore6 and on how to parallelize programs to fully utilize VideoCore's potential.

Thank you @Terminus-IMRC, @notogawa and Idein for all your hard work!

@Terminus-IMRC
Contributor

Thank you for supporting us!
We also think we need some tutorials for beginners, but currently, no tutorials exist except for the Japanese one: https://qiita.com/9_ties/items/15ab7fa198991a61a3a9

Because the VC6 QPU instruction set is very similar to that of the VC4 QPU, you can learn the basics of how the QPU works (add/mul ALU dual-issue, three branch delay slots, the TMU, etc.) from the VideoCore IV 3D Architecture Reference Guide.

Though there is no publicly available documentation for the VC6 QPU, you can gain an understanding of it from working examples.
I added some simple example programs to this repository, which may help when you write VC6 QPU code:

  • examples/summation.py adds up 32-bit integers in an array.
    This QPU code shows how to fill the eight-deep TMU read request queue to hide cache/memory latency.
  • examples/memset.py fills an array with a single 32-bit value.
    It shows that there is no request queue for TMU writes.
  • examples/scopy.py copies an array, combining the two techniques above.
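
To see why keeping the read request queue full matters, here is a toy host-side model in plain Python. It is not QPU code and does not use py-videocore6; the function name, the one-issue-per-cycle rule, and the 100-cycle latency are all assumptions chosen for illustration, not measured hardware figures.

```python
def cycles_to_load(n_loads, latency, queue_depth):
    """Toy cycle count for n_loads reads with at most queue_depth in flight.

    Assumptions (illustrative, not hardware facts): one request can be
    issued per cycle, and every request completes exactly `latency`
    cycles after it was issued.
    """
    done_at = []  # completion cycle of each request, in issue order
    cycle = 0     # earliest cycle at which the next request may be issued
    for i in range(n_loads):
        issue = cycle
        if i >= queue_depth:
            # Queue is full: wait for the oldest outstanding request.
            issue = max(issue, done_at[i - queue_depth])
        done_at.append(issue + latency)
        cycle = issue + 1
    return done_at[-1] if done_at else 0


# Hypothetical 100-cycle latency, 64 loads.
print(cycles_to_load(64, 100, 1))  # 6400: every load waits for the previous one
print(cycles_to_load(64, 100, 8))  # 807: eight loads overlap their latency
```

Under these assumptions, issuing eight requests before waiting for the first result cuts the total time by roughly a factor of eight, which is the effect summation.py is meant to demonstrate on real hardware.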

These examples support execution on up to eight QPUs, so you can see how to assign the input/output memory areas to each QPU.
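
One simple way to assign memory areas is to give each QPU a contiguous, equally sized slice of the array. The sketch below only does the host-side arithmetic in plain Python; `split_among_qpus`, `base_addr`, and `elem_size` are hypothetical names, and the repository's examples may partition their buffers differently.

```python
def split_among_qpus(n_elements, num_qpus, base_addr=0, elem_size=4):
    """Return one (start_address, element_count) pair per QPU.

    Assumes contiguous, equally sized slices and that n_elements is a
    multiple of num_qpus; base_addr and elem_size (bytes) are illustrative.
    """
    if not 1 <= num_qpus <= 8:
        raise ValueError("this sketch models 1 to 8 QPUs")
    if n_elements % num_qpus != 0:
        raise ValueError("n_elements must be a multiple of num_qpus")
    per_qpu = n_elements // num_qpus
    stride = per_qpu * elem_size  # bytes between consecutive slices
    return [(base_addr + q * stride, per_qpu) for q in range(num_qpus)]


# 1024 32-bit elements starting at a made-up address, split over 8 QPUs.
for q, (addr, count) in enumerate(split_among_qpus(1024, 8, base_addr=0x1000)):
    print(f"QPU {q}: start=0x{addr:x}, elements={count}")
```

Each QPU would then receive its own start address and element count, for example through its uniforms, and loop over only that slice.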

Also, notogawa added a matrix-matrix multiplication example, examples/sgemm.py.
The innermost loop of this code uses in-QPU vector rotation to reduce the number of memory loads/stores.
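
The idea behind rotation can be modelled in plain Python: once two 16-lane vectors are loaded, rotating one of them lets every lane meet every element of the other without further loads. This is only a conceptual sketch, not the actual sgemm.py kernel; the rotation direction and the helper names are assumptions made for the model.

```python
LANES = 16  # a QPU operates on 16-element vectors


def rotate(vec, n):
    """Model of a lane rotation: lane i receives the value from lane (i + n) % 16.

    The direction is a choice made for this sketch, not a statement
    about the hardware instruction.
    """
    return [vec[(i + n) % LANES] for i in range(LANES)]


def outer_product_by_rotation(a, b):
    """Compute all 256 products a[i] * b[j] from two vectors loaded once."""
    out = [[0] * LANES for _ in range(LANES)]
    for r in range(LANES):
        b_rot = rotate(b, r)
        # The inner loop stands in for a single 16-lane multiply.
        for i in range(LANES):
            out[i][(i + r) % LANES] = a[i] * b_rot[i]
    return out


a = list(range(16))
b = list(range(100, 116))
result = outer_product_by_rotation(a, b)
print(result == [[x * y for y in b] for x in a])  # True
```

In this model, 16 rotations replace what would otherwise be 16 separate loads of shifted data, which is the kind of saving the sgemm inner loop is after.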

In conclusion, I recommend that you start by writing a primitive program (simple memory read/write, or array addition/subtraction/multiplication) with reference to the examples.
Then you can consider how to approach the theoretical peak performance of 32 Gflop/s (by making good use of the register files and the TMU/L2 caches).
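
For reference, the 32 Gflop/s figure can be reproduced with simple arithmetic. The eight QPUs and the add/mul dual-issue come from the discussion above; the 500 MHz clock and the four physical lanes per QPU are assumed values for the Raspberry Pi 4 and are not stated in this thread.

```python
def peak_gflops(clock_hz=500_000_000, num_qpus=8, physical_lanes=4, ops_per_cycle=2):
    """Theoretical peak in Gflop/s.

    clock_hz and physical_lanes are assumed values; ops_per_cycle=2
    reflects the add/mul dual-issue, num_qpus=8 the eight QPUs.
    """
    return clock_hz * num_qpus * physical_lanes * ops_per_cycle / 1e9


print(peak_gflops())  # 32.0
```

Reaching this number would require issuing a useful add and a useful multiply on every cycle on every QPU, so real kernels land below it.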

@Terminus-IMRC
Contributor

We've just released more VC6 QPU examples: https://github.com/Idein/qmkl6

@filiphhh
Author

Thank you for your quick replies!
I would never have found the Japanese tutorial if you hadn't shared the link. Google Translate seems to do a pretty good job of translating it.
Great to see QMKL for rpi4 too!

I hope that your libraries get more traction in the community, as they unlock a lot more power in these little devices.
Thank you for all the low level rpi resources you have produced!
