Program build log:
Compilation started
Compilation done
Linking started
Linking done
Device build started
Device build done
Kernel <sum_global_atomic> was successfully vectorized (8)
Kernel <sum_cycle> was not vectorized
Kernel <sum_cycle_coalesce> was successfully vectorized (8)
Kernel <sum_local_mem> was successfully vectorized (8)
Kernel <sum_tree> was successfully vectorized (8)
Done.
Runtime:
Baseline. Every addition is an atomic operation on the same global counter, so they all queue up one after another, and the driver makes no attempt to optimize this.
GPU sum_global_atomic: 2.36046+-0.0251082 s
GPU sum_global_atomic: 42.3647 millions/s
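The two lines reported for each kernel can be cross-checked against each other. The sketch below assumes (this is not stated in the log) that the millions/s figure is the element count divided by the mean time. Under that assumption the baseline numbers imply an input of roughly 100 million elements. `implied_elements` is a hypothetical helper name used only for illustration.

```python
def implied_elements(mean_seconds, millions_per_second):
    """Element count implied by a (time, throughput) pair from the log."""
    return mean_seconds * millions_per_second * 1e6


# Baseline values copied from the sum_global_atomic lines above.
n = implied_elements(2.36046, 42.3647)
print(f"implied input size: {n / 1e6:.1f} million elements")
```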
We split the work into several groups and cut the number of atomic additions by a large factor, which gave a substantial speedup. However, judging by the build log, the compiler failed to vectorize this kernel.
GPU sum_cycle: 0.0818463+-0.001403 s
GPU sum_cycle: 1221.8 millions/s
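The idea behind this variant can be sketched on the host as follows. This is a plain Python simulation, not the kernel from this PR: it assumes each work-item sums a contiguous chunk of `values_per_workitem` elements in a loop and then performs a single atomic add. The function name and the chunk size of 64 are illustrative assumptions.

```python
def simulate_sum_cycle(values, values_per_workitem=64):
    """Each simulated work-item sums its own contiguous chunk,
    then adds the partial sum to the global total once."""
    n = len(values)
    num_workitems = (n + values_per_workitem - 1) // values_per_workitem
    total = 0        # stands in for the global accumulator
    atomic_adds = 0  # how many atomic additions the kernel would issue
    for gid in range(num_workitems):
        partial = 0
        for i in range(values_per_workitem):
            idx = gid * values_per_workitem + i
            if idx < n:  # guard for the last, possibly incomplete chunk
                partial += values[idx]
        total += partial  # one atomic add per work-item
        atomic_adds += 1
    return total, atomic_adds


values = list(range(1000))
total, atomic_adds = simulate_sum_cycle(values)
print(total, atomic_adds)
```

With this layout the number of atomic additions drops from one per element to one per chunk, which matches the reduction in atomics described above.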
Here the kernel was vectorized successfully, which gave a further speedup.
GPU sum_cycle_coalesce: 0.0582893+-0.0026487 s
GPU sum_cycle_coalesce: 1715.58 millions/s
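The difference between the two loop variants comes down to how each work-item indexes the input. The sketch below shows one common way to write the two index formulas. It is an assumption about the access pattern, not the code from this PR, and both function names are hypothetical.

```python
def contiguous_index(global_id, i, values_per_workitem):
    """Work-item walks its own contiguous chunk: neighbours are far apart."""
    return global_id * values_per_workitem + i


def coalesced_index(group_id, local_id, i, group_size, values_per_workitem):
    """Neighbouring work-items read neighbouring elements on every iteration."""
    return group_id * group_size * values_per_workitem + i * group_size + local_id


group_size, values_per_workitem = 4, 3

# Addresses touched by four neighbouring work-items on iteration i = 0.
print([contiguous_index(gid, 0, values_per_workitem) for gid in range(4)])
print([coalesced_index(0, lid, 0, group_size, values_per_workitem) for lid in range(4)])
```

In the coalesced layout, adjacent work-items touch adjacent addresses on each iteration, which is the kind of pattern a vectorizer can turn into wide loads.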
Here we use local memory. I am surprised this gave a speedup, given that I am running on a CPU. Presumably global memory (VRAM) is mapped to RAM and the GPU's local memory is mapped to the CPU cache.
GPU sum_local_mem: 0.0461117+-0.00118241 s
GPU sum_local_mem: 2168.65 millions/s
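A host-side sketch of the local-memory approach, under the assumption that every work-item copies one element into a per-group buffer, the group synchronizes, and a single work-item then sums the buffer and issues one atomic add. The function name and the group size of 128 are illustrative, not taken from this PR.

```python
def simulate_sum_local_mem(values, group_size=128):
    """One atomic add per work-group after staging the data in a local buffer."""
    n = len(values)
    total = 0        # stands in for the global accumulator
    atomic_adds = 0
    for start in range(0, n, group_size):
        local_buf = [0] * group_size  # stands in for __local memory
        for lid in range(group_size):
            if start + lid < n:       # out-of-range work-items contribute 0
                local_buf[lid] = values[start + lid]
        # A real kernel needs a work-group barrier here before reading local_buf.
        total += sum(local_buf)       # done by a single work-item in the group
        atomic_adds += 1
    return total, atomic_adds


values = list(range(1000))
print(simulate_sum_local_mem(values))
```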
On the CPU, the tree reduction is faster than the baseline, but the gain is modest: it is slower than every other variant above.
GPU sum_tree: 0.162337+-0.00387873 s
GPU sum_tree: 616.001 millions/s
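For reference, a tree reduction within one work-group can be sketched as below. This is a generic Python simulation of the technique, assuming a power-of-two group size, not the kernel from this PR. The function name is hypothetical.

```python
def simulate_tree_reduce(local_buf):
    """Pairwise reduction: halve the number of active work-items each step."""
    buf = list(local_buf)
    n = len(buf)
    assert n > 0 and n & (n - 1) == 0, "group size must be a power of two"
    steps = 0
    stride = n // 2
    while stride > 0:
        for lid in range(stride):  # only the first `stride` work-items are active
            buf[lid] += buf[lid + stride]
        # A real kernel needs a work-group barrier after every step.
        stride //= 2
        steps += 1
    return buf[0], steps


print(simulate_tree_reduce(list(range(8))))
```

The reduction takes log2(group size) steps with a barrier after each one. It is possible that this synchronization cost is part of why the tree does poorly on a CPU, though that is not measured here.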