Task04 Данил Конев SPbU #139

Open
wants to merge 3 commits into base: task04

Conversation

@kondaevnil kondaevnil commented Oct 3, 2024

Local output

$ ./matrix_transpose
OpenCL devices:
  Device #0: GPU. Apple M2. Total memory: 5461 Mb
Using device #0: GPU. Apple M2. Total memory: 5461 Mb
Data generated for M=4096, K=4096
[matrix_transpose_naive]
    GPU: 0.00183733+-3.80516e-05 s
    GPU: 9131.29 millions/s
[matrix_transpose_local_bad_banks]
    GPU: 0.00182022+-2.45853e-05 s
    GPU: 9217.15 millions/s
[matrix_transpose_local_good_banks]
    GPU: 0.00180377+-2.44093e-05 s
    GPU: 9301.21 millions/s

$ ./matrix_multiplication
OpenCL devices:
  Device #0: GPU. Apple M2. Total memory: 5461 Mb
Using device #0: GPU. Apple M2. Total memory: 5461 Mb
Data generated for M=1024, K=1024, N=1024
CPU: 3.79174+-1.41629e-08 s
CPU: 0.527462 GFlops
[naive, ts=4]
    GPU: 0.0122115+-0.000476769 s
    GPU: 163.78 GFlops
    Average difference: 0%
[naive, ts=8]
    GPU: 0.0161957+-0.000438779 s
    GPU: 123.49 GFlops
    Average difference: 0%
[naive, ts=16]
    GPU: 0.0118767+-0.00039979 s
    GPU: 168.397 GFlops
    Average difference: 0%
[local, ts=4]
    GPU: 0.016141+-0.000714008 s
    GPU: 123.908 GFlops
    Average difference: 0%
[local, ts=8]
    GPU: 0.00592617+-0.000109758 s
    GPU: 337.486 GFlops
    Average difference: 0%
[local, ts=16]
    GPU: 0.00522333+-0.000135656 s
    GPU: 382.897 GFlops
    Average difference: 0%
[local wpt, ts=4, wpt=2]
    GPU: 0.0223027+-0.000515098 s
    GPU: 89.6754 GFlops
    Average difference: 0%
[local wpt, ts=4, wpt=4]
    GPU: 0.0370613+-0.000280732 s
    GPU: 53.9646 GFlops
    Average difference: 0%
[local wpt, ts=8, wpt=2]
    GPU: 0.00453133+-2.25807e-05 s
    GPU: 441.371 GFlops
    Average difference: 0%
[local wpt, ts=8, wpt=4]
    GPU: 0.00747317+-2.97681e-05 s
    GPU: 267.624 GFlops
    Average difference: 0%
[local wpt, ts=8, wpt=8]
    GPU: 0.0173278+-0.000459474 s
    GPU: 115.421 GFlops
    Average difference: 0%
[local wpt, ts=16, wpt=2]
    GPU: 0.00449667+-3.01367e-05 s
    GPU: 444.774 GFlops
    Average difference: 0%
[local wpt, ts=16, wpt=4]
    GPU: 0.00366533+-2.25955e-05 s
    GPU: 545.653 GFlops
    Average difference: 0%
[local wpt, ts=16, wpt=8]
    GPU: 0.0040725+-2.22017e-05 s
    GPU: 491.099 GFlops
    Average difference: 0%
[local wpt, ts=16, wpt=16]
    GPU: 0.0114167+-9.26942e-05 s
    GPU: 175.182 GFlops
    Average difference: 0%

GitHub CI output

$ ./matrix_transpose
OpenCL devices:
  Device #0: CPU. AMD EPYC 7763 64-Core Processor                . Intel(R) Corporation. Total memory: 15991 Mb
Using device #0: CPU. AMD EPYC 7763 64-Core Processor                . Intel(R) Corporation. Total memory: 15991 Mb
Data generated for M=4096, K=4096
[matrix_transpose_naive]
    GPU: 0.0161321+-0.000144695 s
    GPU: 1039.99 millions/s
[matrix_transpose_local_bad_banks]
    GPU: 0.0133337+-5.87615e-05 s
    GPU: 1258.26 millions/s
[matrix_transpose_local_good_banks]
    GPU: 0.0142759+-5.10294e-05 s
    GPU: 1175.21 millions/s

$ ./matrix_multiplication
OpenCL devices:
  Device #0: CPU. AMD EPYC 7763 64-Core Processor                . Intel(R) Corporation. Total memory: 15991 Mb
Using device #0: CPU. AMD EPYC 7763 64-Core Processor                . Intel(R) Corporation. Total memory: 15991 Mb
Data generated for M=1024, K=1024, N=1024
CPU: 6.08544+-0 s
CPU: 0.328653 GFlops
[naive, ts=4]
    GPU: 0.278694+-0.00796799 s
    GPU: 7.17632 GFlops
    Average difference: 0.000149043%
[naive, ts=8]
    GPU: 0.271026+-0.00315905 s
    GPU: 7.37937 GFlops
    Average difference: 0.000149043%
[naive, ts=16]
    GPU: 0.31961+-0.0282449 s
    GPU: 6.25762 GFlops
    Average difference: 0.000149043%
[local, ts=4]
    GPU: 0.594748+-0.000545432 s
    GPU: 3.36277 GFlops
    Average difference: 0.000149043%
[local, ts=8]
    GPU: 0.109308+-0.000198738 s
    GPU: 18.2969 GFlops
    Average difference: 0.000149043%
[local, ts=16]
    GPU: 0.0699212+-0.000325857 s
    GPU: 28.6036 GFlops
    Average difference: 0.000149043%
[local wpt, ts=4, wpt=2]
    GPU: 0.539262+-0.00272789 s
    GPU: 3.70877 GFlops
    Average difference: 0.000149043%
[local wpt, ts=4, wpt=4]
    GPU: 0.475112+-0.00168706 s
    GPU: 4.20954 GFlops
    Average difference: 0.000149043%
[local wpt, ts=8, wpt=2]
    GPU: 0.135911+-0.000502784 s
    GPU: 14.7155 GFlops
    Average difference: 0.000149043%
[local wpt, ts=8, wpt=4]
    GPU: 0.151507+-0.000329341 s
    GPU: 13.2007 GFlops
    Average difference: 0.000149043%
[local wpt, ts=8, wpt=8]
    GPU: 0.143789+-0.000426475 s
    GPU: 13.9093 GFlops
    Average difference: 0.000149043%
[local wpt, ts=16, wpt=2]
    GPU: 0.0847648+-0.000240268 s
    GPU: 23.5947 GFlops
    Average difference: 0.000149043%
[local wpt, ts=16, wpt=4]
    GPU: 0.0771217+-0.000196725 s
    GPU: 25.933 GFlops
    Average difference: 0.000149043%
[local wpt, ts=16, wpt=8]
    GPU: 0.0788197+-0.000917917 s
    GPU: 25.3744 GFlops
    Average difference: 0.000149043%
[local wpt, ts=16, wpt=16]
    GPU: 0.0921885+-0.000260632 s
    GPU: 21.6947 GFlops
    Average difference: 0.000149043%

@kondaevnil (Author)

1. Transpose:
  • On the Apple M2 GPU there is no significant performance difference between the implementations, but on an NVIDIA GTX 1650 the local-memory version with conflict-free bank access is about 30% faster than the naive one (see the sketch after this list).
2. Multiplication:
  • The best-performing variant is [local wpt, ts=16, wpt=4], roughly a 3x speedup over the naive kernel (168 → 546 GFlops on the M2); the worst is also a wpt-based variant, [local wpt, ts=4, wpt=4]. On the second GPU (the 1650) the numbers are broadly similar, but the optimum there is [local wpt, ts=16, wpt=8]. We can conclude that local memory substantially improves performance when a kernel would otherwise make many accesses to global memory (a sketch of the wpt kernel follows below).
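
A minimal sketch of the bank-conflict-free local-memory transpose referenced in point 1, assuming TILE = 16, a row-major M×K source, and sizes that divide evenly by the tile; the kernel name follows the benchmark label above, but the body is illustrative rather than the PR's exact code:

```c
#define TILE 16

__kernel void matrix_transpose_local_good_banks(__global const float *src,
                                                __global float *dst,
                                                unsigned int M,
                                                unsigned int K)
{
    // The +1 padding shifts each tile row to a different local-memory bank,
    // so the column-wise reads at the end do not serialize; dropping the
    // padding gives the "bad_banks" variant.
    __local float tile[TILE][TILE + 1];

    const int gx = get_global_id(0);  // column in src, 0..K-1
    const int gy = get_global_id(1);  // row in src,    0..M-1
    const int lx = get_local_id(0);
    const int ly = get_local_id(1);

    // Coalesced read: consecutive work-items read consecutive src elements.
    tile[ly][lx] = src[gy * K + gx];

    barrier(CLK_LOCAL_MEM_FENCE);

    // Origin of this work-item's element in dst (dst is K x M).
    const int tx = get_group_id(1) * TILE + lx;  // column in dst
    const int ty = get_group_id(0) * TILE + ly;  // row in dst

    // Coalesced write; the actual transposition happens inside local memory.
    dst[ty * M + tx] = tile[lx][ly];
}
```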

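A minimal sketch of the local-memory + work-per-thread multiplication from point 2, assuming row-major M×K and K×N inputs, TS = 16, WPT = 4, dimensions divisible by TS, and a launch with global size (N, M / WPT) and local size (TS, TS / WPT); kernel and argument names are again illustrative:

```c
#define TS  16           // tile size (the "ts" in the benchmark labels)
#define WPT 4            // rows of C computed per work-item ("wpt")
#define RTS (TS / WPT)   // work-group height: RTS work-item rows cover TS matrix rows

__kernel void matrix_multiplication_local_wpt(__global const float *a,  // M x K
                                              __global const float *b,  // K x N
                                              __global float *c,        // M x N
                                              unsigned int M,
                                              unsigned int K,
                                              unsigned int N)
{
    const int lx   = get_local_id(0);             // column inside the C tile
    const int ly   = get_local_id(1);             // 0..RTS-1
    const int col  = get_group_id(0) * TS + lx;   // column of C
    const int row0 = get_group_id(1) * TS + ly;   // first of the WPT rows

    __local float tile_a[TS][TS];
    __local float tile_b[TS][TS];

    float acc[WPT];
    for (int w = 0; w < WPT; ++w)
        acc[w] = 0.0f;

    for (int t = 0; t < K; t += TS) {
        // Each work-item stages WPT elements of both input tiles.
        for (int w = 0; w < WPT; ++w) {
            tile_a[ly + w * RTS][lx] = a[(row0 + w * RTS) * K + (t + lx)];
            tile_b[ly + w * RTS][lx] = b[(t + ly + w * RTS) * N + col];
        }
        barrier(CLK_LOCAL_MEM_FENCE);

        // One pass over the staged tiles updates all WPT accumulators; the
        // tile_b[k][lx] value is the same for every w, so it can stay in a
        // register and be reused WPT times.
        for (int k = 0; k < TS; ++k)
            for (int w = 0; w < WPT; ++w)
                acc[w] += tile_a[ly + w * RTS][k] * tile_b[k][lx];

        barrier(CLK_LOCAL_MEM_FENCE);
    }

    for (int w = 0; w < WPT; ++w)
        c[(row0 + w * RTS) * N + col] = acc[w];
}
```

Under this decomposition [local wpt, ts=4, wpt=4] leaves only 4 work-items per work-group, which is consistent with it being the slowest configuration in the tables above.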