Example: generic tensor network #22
Conversation
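For context: the benchmarks below call solve from GenericTensorNetworks.jl on a prebuilt tensor network tn. A minimal sketch of how such a setup might look, assuming an independent-set problem on a random regular graph and the TreeSA contraction-order optimizer (none of which are shown in this thread):

using GenericTensorNetworks, Graphs, CUDA
using CuTropicalGEMM   # loading this package is what swaps in the tropical GEMM kernels

# Hypothetical problem instance; the actual graph and `tn` used in this thread are not shown.
graph = Graphs.random_regular_graph(200, 3)
tn = IndependentSet(graph; optimizer=TreeSA())

# Largest solution size, contracted on the GPU with Float32-based tropical numbers.
@time Array(solve(tn, SizeMax(); usecuda=true, T=Float32))

# One optimal configuration, using the bounded variant benchmarked below.
@time solve(tn, SingleConfigMax(; bounded=true); usecuda=true, T=Float32)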
The same benchmark on A800

Before CuTropicalGEMM

julia> @time Array(solve(tn, SizeMax(); usecuda=true, T=Float32));
  4.007866 seconds (1.71 M allocations: 81.452 MiB)

julia> @time solve(tn, SingleConfigMax(; bounded=true); usecuda=true, T=Float32);
 59.282821 seconds (11.29 M allocations: 427.296 MiB, 0.56% gc time, 0.17% compilation time)

After CuTropicalGEMM

julia> @time Array(solve(tn, SizeMax(); usecuda=true, T=Float32));
  1.210406 seconds (1.29 M allocations: 61.970 MiB)

julia> @time solve(tn, SingleConfigMax(; bounded=true); usecuda=true, T=Float32);
 40.313153 seconds (10.04 M allocations: 369.083 MiB, 0.59% gc time, 0.00% compilation time)

It is slower than on the A4500...

Profile

julia> CUDA.@profile Array(solve(tn, SizeMax(); usecuda=true, T=Float32))
Profiler ran for 1.24 s, capturing 157509 events.
Host-side activity: calling CUDA APIs took 264.52 ms (21.37% of the trace)
┌──────────┬───────────┬───────┬───────────┬───────────┬───────────┬─────────────────────────┐
│ Time (%) │ Time │ Calls │ Avg time │ Min time │ Max time │ Name │
├──────────┼───────────┼───────┼───────────┼───────────┼───────────┼─────────────────────────┤
│ 13.68% │ 169.35 ms │ 8289 │ 20.43 µs │ 2.38 µs │ 142.61 ms │ cudaLaunchKernel │
│ 3.53% │ 43.64 ms │ 14646 │ 2.98 µs │ 1.43 µs │ 5.76 ms │ cuMemAllocFromPoolAsync │
│ 2.02% │ 25.05 ms │ 6967 │ 3.6 µs │ 2.62 µs │ 65.09 µs │ cuLaunchKernel │
│ 1.37% │ 16.97 ms │ 4694 │ 3.62 µs │ 1.43 µs │ 6.89 ms │ cuStreamSynchronize │
│ 0.96% │ 11.89 ms │ 4692 │ 2.53 µs │ 1.91 µs │ 20.03 µs │ cuMemcpyHtoDAsync │
│ 0.00% │ 26.7 µs │ 1 │ 26.7 µs │ 26.7 µs │ 26.7 µs │ cuMemcpyDtoHAsync │
│ 0.00% │ 1.19 µs │ 6 │ 198.68 ns │ 0.0 ns │ 476.84 ns │ cuDeviceGetCount │
│ 0.00% │ 238.42 ns │ 1 │ 238.42 ns │ 238.42 ns │ 238.42 ns │ cuDriverGetVersion │
└──────────┴───────────┴───────┴───────────┴───────────┴───────────┴─────────────────────────┘
Device-side activity: GPU was busy for 298.18 ms (24.09% of the trace)
┌──────────┬───────────┬───────┬───────────┬───────────┬───────────┬────────────────────────────────────────────────
│ Time (%) │ Time │ Calls │ Avg time │ Min time │ Max time │ Name ⋯
├──────────┼───────────┼───────┼───────────┼───────────┼───────────┼────────────────────────────────────────────────
│ 20.84% │ 257.85 ms │ 8282 │ 31.13 µs │ 4.77 µs │ 167.64 ms │ _Z16FLOAT_maxplus_NNILi64ELi32ELi64ELi4ELi4EE ⋯
│ 0.66% │ 8.19 ms │ 3329 │ 2.46 µs │ 1.67 µs │ 3.34 µs │ _Z18permutedims_kernel15CuKernelContext13CuDe ⋯
│ 0.49% │ 6.07 ms │ 4692 │ 1.29 µs │ 953.67 ns │ 2.15 µs │ [copy pageable to device memory] ⋯
│ 0.42% │ 5.25 ms │ 11 │ 476.97 µs │ 2.86 µs │ 5.11 ms │ _Z18permutedims_kernel15CuKernelContext13CuDe ⋯
│ 0.42% │ 5.24 ms │ 1575 │ 3.33 µs │ 2.38 µs │ 4.05 µs │ _Z16broadcast_kernel15CuKernelContext13CuDevi ⋯
│ 0.42% │ 5.16 ms │ 33 │ 156.45 µs │ 2.15 µs │ 5.02 ms │ _Z18permutedims_kernel15CuKernelContext13CuDe ⋯
│ 0.26% │ 3.25 ms │ 4 │ 812.23 µs │ 8.82 µs │ 1.7 ms │ _Z18permutedims_kernel15CuKernelContext13CuDe ⋯
│ 0.19% │ 2.4 ms │ 1001 │ 2.39 µs │ 1.67 µs │ 4.05 µs │ _Z18permutedims_kernel15CuKernelContext13CuDe ⋯
...
julia> CUDA.@profile solve(tn, SingleConfigMax(; bounded=true); usecuda=true, T=Float32);
Profiler ran for 91.44 s, capturing 1077252 events.
Host-side activity: calling CUDA APIs took 3.42 s (3.74% of the trace)
┌──────────┬───────────┬───────┬───────────┬───────────┬───────────┬─────────────────────────┐
│ Time (%) │ Time │ Calls │ Avg time │ Min time │ Max time │ Name │
├──────────┼───────────┼───────┼───────────┼───────────┼───────────┼─────────────────────────┤
│ 3.91% │ 3.58 s │ 70375 │ 50.82 µs │ 1.43 µs │ 1.72 s │ cuStreamSynchronize │
│ 0.41% │ 372.21 ms │ 23459 │ 15.87 µs │ 12.64 µs │ 45.06 µs │ cuMemcpyDtoHAsync │
│ 0.32% │ 291.3 ms │ 63080 │ 4.62 µs │ 2.62 µs │ 7.47 ms │ cuLaunchKernel │
│ 0.29% │ 261.51 ms │ 25043 │ 10.44 µs │ 2.38 µs │ 159.61 ms │ cudaLaunchKernel │
│ 0.21% │ 189.58 ms │ 53455 │ 3.55 µs │ 1.43 µs │ 10.1 ms │ cuMemAllocFromPoolAsync │
│ 0.09% │ 81.43 ms │ 23457 │ 3.47 µs │ 2.15 µs │ 40.05 µs │ cuMemcpyHtoDAsync │
│ 0.00% │ 1.19 µs │ 4 │ 298.02 ns │ 238.42 ns │ 476.84 ns │ cuDeviceGetCount │
│ 0.00% │ 476.84 ns │ 2 │ 238.42 ns │ 238.42 ns │ 238.42 ns │ cuDeviceGet │
└──────────┴───────────┴───────┴───────────┴───────────┴───────────┴─────────────────────────┘
Device-side activity: GPU was busy for 4.22 s (4.62% of the trace)
┌──────────┬───────────┬───────┬───────────┬───────────┬───────────┬────────────────────────────────────────────────
│ Time (%) │ Time │ Calls │ Avg time │ Min time │ Max time │ Name ⋯
├──────────┼───────────┼───────┼───────────┼───────────┼───────────┼────────────────────────────────────────────────
│ 3.95% │ 3.62 s │ 24932 │ 145.03 µs │ 4.77 µs │ 186.6 ms │ _Z16FLOAT_maxplus_NNILi64ELi32ELi64ELi4ELi4EE ⋯
│ 0.11% │ 100.07 ms │ 1 │ 100.07 ms │ 100.07 ms │ 100.07 ms │ _Z16broadcast_kernel15CuKernelContext13CuDevi ⋯
│ 0.08% │ 74.37 ms │ 9382 │ 7.93 µs │ 3.34 µs │ 11.95 ms │ _Z22partial_mapreduce_grid8identity9reduction ⋯
│ 0.08% │ 73.09 ms │ 2 │ 36.55 ms │ 36.53 ms │ 36.57 ms │ _Z16broadcast_kernel15CuKernelContext13CuDevi ⋯
│ 0.06% │ 54.96 ms │ 23459 │ 2.34 µs │ 1.43 µs │ 3.58 µs │ [copy device to pageable memory] ⋯
│ 0.04% │ 32.26 ms │ 9382 │ 3.44 µs │ 1.91 µs │ 1.88 ms │ _Z16broadcast_kernel15CuKernelContext13CuDevi ⋯
│ 0.03% │ 31.29 ms │ 23457 │ 1.33 µs │ 953.67 ns │ 4.53 µs │ [copy pageable to device memory] ⋯
│ 0.03% │ 24.49 ms │ 10001 │ 2.45 µs │ 1.67 µs │ 3.34 µs │ _Z18permutedims_kernel15CuKernelContext13CuDe ⋯
│ 0.03% │ 23.08 ms │ 9382 │ 2.46 µs │ 1.43 µs │ 973.46 µs │ _Z2_615CuKernelContext13CuDeviceArrayI8Tropic ⋯
│ 0.02% │ 22.43 ms │ 1 │ 22.43 ms │ 22.43 ms │ 22.43 ms │ _Z18permutedims_kernel15CuKernelContext13CuDe ⋯
│ 0.02% │ 22.14 ms │ 9382 │ 2.36 µs │ 1.43 µs │ 397.92 µs │ _Z16broadcast_kernel15CuKernelContext13CuDevi ⋯
│ 0.02% │ 19.92 ms │ 2 │ 9.96 ms │ 11.44 µs │ 19.91 ms │ _Z18permutedims_kernel15CuKernelContext13CuDe ⋯
│ 0.02% │ 16.53 ms │ 1 │ 16.53 ms │ 16.53 ms │ 16.53 ms │ _Z16broadcast_kernel15CuKernelContext13CuDevi ⋯
│ 0.02% │ 14.25 ms │ 39 │ 365.39 µs │ 2.15 µs │ 7.24 ms │ _Z18permutedims_kernel15CuKernelContext13CuDe ⋯
│ 0.01% │ 10.38 ms │ 3152 │ 3.29 µs │ 2.15 µs │ 4.05 µs │ _Z16broadcast_kernel15CuKernelContext13CuDevi ⋯
...

From the profile, we can see that the slowdown may be related to stream synchronization.
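(A side note on the kernel names in the tables above: FLOAT_maxplus_NN is a max-plus matrix multiplication over tropical numbers, and the Tropical{Float32} element type also appears in the mangled kernel names. A tiny CPU-side sketch of the semiring it implements, using TropicalNumbers.jl:)

using TropicalNumbers

# In the tropical (max-plus) semiring, "+" is max and "*" is +, so the matrix
# product computes C[i,j] = max_k (A[i,k] + B[k,j]).
A = Tropical.(rand(Float32, 3, 3))
B = Tropical.(rand(Float32, 3, 3))
C = A * B   # the contraction that FLOAT_maxplus_NN accelerates on the GPU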
In some tasks, CuTropicalGEMM speeds up the code as expected. However, in another set of tasks, stream synchronization takes too much time. It may be related to the stream synchronization mechanism, and to #19.
That is strange; all stream synchronization should be done in the C code, not in Julia.
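For reference, this is the pattern being discussed, sketched on the Julia side with CUDA.jl; it is only an illustration of stream-scoped synchronization, not CuTropicalGEMM's actual wrapper code:

using CUDA

# Keep synchronization scoped to the task-local stream instead of blocking
# on the whole device.
A = CUDA.rand(Float32, 2048, 2048)
B = CUDA.rand(Float32, 2048, 2048)

s = CUDA.stream()      # the stream this task queues kernels on
C = A * B              # asynchronous launch on `s`
CUDA.synchronize(s)    # wait only for work submitted to this stream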
Profile the version without CuTropicalGEMM

I profiled the version without CuTropicalGEMM. It also has a lot of time spent neither on the host nor on the device. So weird.

julia> CUDA.@profile Array(solve(tn, SizeMax(); usecuda=true, T=Float32))
Profiler ran for 4.15 s, capturing 198954 events.
Host-side activity: calling CUDA APIs took 3.1 s (74.69% of the trace)
┌──────────┬───────────┬───────┬───────────┬───────────┬───────────┬────────────
│ Time (%) │ Time │ Calls │ Avg time │ Min time │ Max time │ Name ⋯
├──────────┼───────────┼───────┼───────────┼───────────┼───────────┼────────────
│ 72.43% │ 3.0 s │ 15256 │ 196.97 µs │ 2.38 µs │ 2.72 s │ cuLaunchK ⋯
│ 1.34% │ 55.78 ms │ 14646 │ 3.81 µs │ 2.86 µs │ 31.71 µs │ cuMemAllo ⋯
│ 0.28% │ 11.71 ms │ 4692 │ 2.5 µs │ 1.91 µs │ 16.45 µs │ cuMemcpyH ⋯
│ 0.24% │ 10.09 ms │ 4694 │ 2.15 µs │ 1.43 µs │ 4.53 µs │ cuStreamS ⋯
│ 0.00% │ 29.56 µs │ 1 │ 29.56 µs │ 29.56 µs │ 29.56 µs │ cuMemcpyD ⋯
│ 0.00% │ 715.26 ns │ 6 │ 119.21 ns │ 0.0 ns │ 476.84 ns │ cuDeviceG ⋯
│ 0.00% │ 238.42 ns │ 1 │ 238.42 ns │ 238.42 ns │ 238.42 ns │ cuDriverG ⋯
└──────────┴───────────┴───────┴───────────┴───────────┴───────────┴────────────
1 column omitted
Device-side activity: GPU was busy for 3.07 s (74.12% of the trace)
┌──────────┬───────────┬───────┬───────────┬───────────┬──────────┬─────────────
│ Time (%) │ Time │ Calls │ Avg time │ Min time │ Max time │ Name ⋯
├──────────┼───────────┼───────┼───────────┼───────────┼──────────┼─────────────
│ 73.24% │ 3.04 s │ 8289 │ 366.53 µs │ 1.43 µs │ 2.74 s │ matmatmul_ ⋯
│ 0.16% │ 6.77 ms │ 3329 │ 2.03 µs │ 1.19 µs │ 2.86 µs │ _Z18permut ⋯
│ 0.15% │ 6.05 ms │ 4692 │ 1.29 µs │ 953.67 ns │ 2.15 µs │ [copy page ⋯
│ 0.12% │ 5.11 ms │ 33 │ 154.94 µs │ 1.67 µs │ 4.97 ms │ _Z18permut ⋯
│ 0.12% │ 5.07 ms │ 11 │ 460.88 µs │ 2.86 µs │ 4.94 ms │ _Z18permut ⋯
│ 0.10% │ 4.23 ms │ 1575 │ 2.68 µs │ 1.91 µs │ 3.81 µs │ _Z16broadc ⋯
│ 0.08% │ 3.25 ms │ 4 │ 811.46 µs │ 8.58 µs │ 1.7 ms │ _Z18permut ⋯
...
julia> CUDA.@profile solve(tn, SingleConfigMax(; bounded=true); usecuda=true, T=Float64)
Profiler ran for 122.63 s, capturing 1202467 events.
Host-side activity: calling CUDA APIs took 16.83 s (13.73% of the trace)
┌──────────┬───────────┬───────┬───────────┬───────────┬───────────┬─────────────────────────┐
│ Time (%) │ Time │ Calls │ Avg time │ Min time │ Max time │ Name │
├──────────┼───────────┼───────┼───────────┼───────────┼───────────┼─────────────────────────┤
│ 13.83% │ 16.96 s │ 70375 │ 241.05 µs │ 476.84 ns │ 5.15 s │ cuStreamSynchronize │
│ 4.82% │ 5.91 s │ 88123 │ 67.06 µs │ 2.15 µs │ 5.09 s │ cuLaunchKernel │
│ 0.41% │ 502.22 ms │ 53455 │ 9.4 µs │ 2.86 µs │ 106.68 ms │ cuMemAllocFromPoolAsync │
│ 0.31% │ 375.17 ms │ 23459 │ 15.99 µs │ 12.4 µs │ 68.43 µs │ cuMemcpyDtoHAsync │
│ 0.07% │ 81.05 ms │ 23457 │ 3.46 µs │ 1.91 µs │ 40.77 µs │ cuMemcpyHtoDAsync │
│ 0.00% │ 1.67 µs │ 4 │ 417.23 ns │ 238.42 ns │ 715.26 ns │ cuDeviceGetCount │
│ 0.00% │ 238.42 ns │ 2 │ 119.21 ns │ 0.0 ns │ 238.42 ns │ cuDeviceGet │
└──────────┴───────────┴───────┴───────────┴───────────┴───────────┴─────────────────────────┘
Device-side activity: GPU was busy for 23.07 s (18.81% of the trace)
┌──────────┬───────────┬───────┬───────────┬───────────┬───────────┬────────────────────────────────────────────────
│ Time (%) │ Time │ Calls │ Avg time │ Min time │ Max time │ Name ⋯
├──────────┼───────────┼───────┼───────────┼───────────┼───────────┼────────────────────────────────────────────────
│ 18.35% │ 22.5 s │ 25043 │ 898.56 µs │ 1.43 µs │ 5.1 s │ matmatmul_ ⋯
│ 0.07% │ 83.97 ms │ 1 │ 83.97 ms │ 83.97 ms │ 83.97 ms │ _Z16broadcast_kernel15CuKernelContext13CuDevi ⋯
│ 0.06% │ 74.99 ms │ 2 │ 37.49 ms │ 37.49 ms │ 37.5 ms │ _Z16broadcast_kernel15CuKernelContext13CuDevi ⋯
│ 0.06% │ 68.46 ms │ 9382 │ 7.3 µs │ 2.86 µs │ 10.2 ms │ _Z22partial_mapreduce_grid8identity9reduction ⋯
│ 0.04% │ 54.84 ms │ 23459 │ 2.34 µs │ 1.67 µs │ 5.01 µs │ [copy device to pageable memory] ⋯
│ 0.03% │ 31.13 ms │ 23457 │ 1.33 µs │ 715.26 ns │ 2.62 µs │ [copy pageable to device memory] ⋯
│ 0.02% │ 30.49 ms │ 9382 │ 3.25 µs │ 1.43 µs │ 2.95 ms │ _Z16broadcast_kernel15CuKernelContext13CuDevi ⋯
│ 0.02% │ 21.08 ms │ 1 │ 21.08 ms │ 21.08 ms │ 21.08 ms │ _Z18permutedims_kernel15CuKernelContext13CuDe ⋯
│ 0.02% │ 19.82 ms │ 10001 │ 1.98 µs │ 1.19 µs │ 3.1 µs │ _Z18permutedims_kernel15CuKernelContext13CuDe ⋯
│ 0.02% │ 19.18 ms │ 9382 │ 2.04 µs │ 1.19 µs │ 1.11 ms │ _Z2_615CuKernelContext13CuDeviceArrayI8Tropic ⋯
│ 0.01% │ 17.54 ms │ 9382 │ 1.87 µs │ 1.19 µs │ 322.58 µs │ _Z16broadcast_kernel15CuKernelContext13CuDevi ⋯
│ 0.01% │ 17.24 ms │ 1 │ 17.24 ms │ 17.24 ms │ 17.24 ms │ _Z16broadcast_kernel15CuKernelContext13CuDevi ⋯
│ 0.01% │ 16.85 ms │ 2 │ 8.42 ms │ 13.35 µs │ 16.83 ms │ _Z18permutedims_kernel15CuKernelContext13CuDe ⋯
│ 0.01% │ 16.2 ms │ 39 │ 415.4 µs │ 2.62 µs │ 8.12 ms │ _Z18permutedims_kernel15CuKernelContext13CuDe ⋯
...
A related issue: under-Peter/OMEinsum.jl#133 (comment)
The performance on A800 seems to be better after I fixed the stream problem, two times faster than the previous benchmark on A4500.

julia> @time Array(solve(tn, SizeMax(); usecuda=true, T=Float32));
  0.558056 seconds (636.56 k allocations: 32.896 MiB)

julia> @time solve(tn, SingleConfigMax(; bounded=true); usecuda=true, T=Float32);
 11.641459 seconds (9.31 M allocations: 351.891 MiB, 2.37% gc time, 0.01% compilation time)

The profiler shows that the time spent neither on the CPU nor on the GPU has disappeared.

julia> CUDA.@profile Array(solve(tn, SizeMax(); usecuda=true, T=Float32))
Profiler ran for 661.94 ms, capturing 202287 events.
Host-side activity: calling CUDA APIs took 390.1 ms (58.93% of the trace)
┌──────────┬────────────┬───────┬──────────────────────────────────────────┬─────────────────────────┐
│ Time (%) │ Total time │ Calls │ Time distribution │ Name │
├──────────┼────────────┼───────┼──────────────────────────────────────────┼─────────────────────────┤
│ 23.95% │ 158.53 ms │ 8289 │ 19.13 µs ± 1353.11 ( 3.1 ‥ 123196.84) │ cudaLaunchKernel │
│ 21.22% │ 140.46 ms │ 14646 │ 9.59 µs ± 329.82 ( 1.19 ‥ 27344.47) │ cuMemAllocFromPoolAsync │
│ 6.59% │ 43.62 ms │ 6967 │ 6.26 µs ± 123.84 ( 3.58 ‥ 10338.78) │ cuLaunchKernel │
│ 2.18% │ 14.4 ms │ 4692 │ 3.07 µs ± 1.29 ( 2.38 ‥ 88.69) │ cuMemcpyHtoDAsync │
│ 1.13% │ 7.45 ms │ 4694 │ 1.59 µs ± 59.94 ( 0.48 ‥ 4107.48) │ cuStreamSynchronize │
│ 0.01% │ 35.29 µs │ 1 │ │ cuMemcpyDtoHAsync │
│ 0.00% │ 238.42 ns │ 6 │ 39.74 ns ± 97.33 ( 0.0 ‥ 238.42) │ cuDeviceGetCount │
│ 0.00% │ 238.42 ns │ 1 │ │ cuDriverGetVersion │
└──────────┴────────────┴───────┴──────────────────────────────────────────┴─────────────────────────┘
Device-side activity: GPU was busy for 281.15 ms (42.47% of the trace)
┌──────────┬────────────┬───────┬──────────────────────────────────────────┬───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
│ Time (%) │ Total time │ Calls │ Time distribution │ Name ⋯
├──────────┼────────────┼───────┼──────────────────────────────────────────┼───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
│ 37.33% │ 247.08 ms │ 8282 │ 29.83 µs ± 1869.26 ( 4.29 ‥ 169630.77) │ _Z16FLOAT_maxplus_NNILi64ELi32ELi64ELi4ELi4EEvPfS0_S0_ffiiiii ⋯
│ 0.93% │ 6.13 ms │ 33 │ 185.66 µs ± 1045.81 ( 1.67 ‥ 6011.25) │ _Z18permutedims_kernel15CuKernelContext13CuDeviceArrayI8TropicalI7Float32ELi1ELi1EES0_IS1_IS2_ELi1ELi1EE5TupleI6UInt32S4_S4_S4_S4_S4 ⋯
│ 0.89% │ 5.87 ms │ 4692 │ 1.25 µs ± 0.17 ( 0.95 ‥ 3.1) │ [copy pageable to device memory] ⋯
│ 0.86% │ 5.7 ms │ 3329 │ 1.71 µs ± 0.23 ( 1.19 ‥ 2.62) │ _Z18permutedims_kernel15CuKernelContext13CuDeviceArrayI8TropicalI7Float32ELi1ELi1EES0_IS1_IS2_ELi1ELi1EE5TupleI6UInt32S4_ES3_IS4_S4_ ⋯
│ 0.67% │ 4.46 ms │ 11 │ 405.9 µs ± 1305.73 ( 2.38 ‥ 4342.56) │ _Z18permutedims_kernel15CuKernelContext13CuDeviceArrayI8TropicalI7Float32ELi1ELi1EES0_IS1_IS2_ELi1ELi1EE5TupleI6UInt32S4_S4_S4_S4_S4 ⋯
│ 0.55% │ 3.65 ms │ 1575 │ 2.32 µs ± 0.21 ( 1.67 ‥ 3.1) │ _Z16broadcast_kernel15CuKernelContext13CuDeviceArrayI8TropicalI7Float32ELi2ELi1EE11BroadcastedI12CuArrayStyleILi2EE5TupleI5OneToI5In ⋯
│ 0.43% │ 2.85 ms │ 4 │ 713.65 µs ± 618.86 ( 10.25 ‥ 1378.3) │ _Z18permutedims_kernel15CuKernelContext13CuDeviceArrayI8TropicalI7Float32ELi1ELi1EES0_IS1_IS2_ELi1ELi1EE5TupleI6UInt32S4_S4_S4_S4_S4 ⋯
│ 0.25% │ 1.67 ms │ 1001 │ 1.67 µs ± 0.21 ( 1.19 ‥ 3.1) │ _Z18permutedims_kernel15CuKernelContext13CuDeviceArrayI8TropicalI7Float32ELi1ELi1EES0_IS1_IS2_ELi1ELi1EE5TupleI6UInt32S4_S4_ES3_IS4_ ⋯
│ 0.23% │ 1.54 ms │ 122 │ 12.6 µs ± 107.18 ( 1.43 ‥ 1185.18) │ _Z18permutedims_kernel15CuKernelContext13CuDeviceArrayI8TropicalI7Float32ELi1ELi1EES0_IS1_IS2_ELi1ELi1EE5TupleI6UInt32S4_S4_S4_S4_S4 ⋯
│ 0.11% │ 758.65 µs │ 435 │ 1.74 µs ± 0.45 ( 1.43 ‥ 6.44) │ _Z18permutedims_kernel15CuKernelContext13CuDeviceArrayI8TropicalI7Float32ELi1ELi1EES0_IS1_IS2_ELi1ELi1EE5TupleI6UInt32S4_S4_S4_ES3_I ⋯
│ 0.06% │ 429.39 µs │ 230 │ 1.87 µs ± 0.63 ( 1.43 ‥ 6.91) │ _Z18permutedims_kernel15CuKernelContext13CuDeviceArrayI8TropicalI7Float32ELi1ELi1EES0_IS1_IS2_ELi1ELi1EE5TupleI6UInt32S4_S4_S4_S4_ES ⋯
│ 0.05% │ 312.33 µs │ 75 │ 4.16 µs ± 5.81 ( 1.43 ‥ 27.89) │ _Z18permutedims_kernel15CuKernelContext13CuDeviceArrayI8TropicalI7Float32ELi1ELi1EES0_IS1_IS2_ELi1ELi1EE5TupleI6UInt32S4_S4_S4_S4_S4 ⋯
│ 0.03% │ 212.43 µs │ 89 │ 2.39 µs ± 0.4 ( 1.91 ‥ 5.01) │ _Z16broadcast_kernel15CuKernelContext13CuDeviceArrayI8TropicalI7Float32ELi2ELi1EE11BroadcastedI12CuArrayStyleILi2EE5TupleI5OneToI5In ⋯
│ 0.02% │ 153.54 µs │ 11 │ 13.96 µs ± 21.05 ( 2.38 ‥ 70.57) │ _Z18permutedims_kernel15CuKernelContext13CuDeviceArrayI8TropicalI7Float32ELi1ELi1EES0_IS1_IS2_ELi1ELi1EE5TupleI6UInt32S4_S4_S4_S4_S4 ⋯
│ 0.01% │ 82.25 µs │ 7 │ 11.75 µs ± 12.38 ( 2.86 ‥ 36.24) │ _Z18permutedims_kernel15CuKernelContext13CuDeviceArrayI8TropicalI7Float32ELi1ELi1EES0_IS1_IS2_ELi1ELi1EE5TupleI6UInt32S4_S4_S4_S4_S4 ⋯
│ 0.01% │ 79.87 µs │ 4 │ 19.97 µs ± 16.62 ( 4.77 ‥ 35.29) │ _Z18permutedims_kernel15CuKernelContext13CuDeviceArrayI8TropicalI7Float32ELi1ELi1EES0_IS1_IS2_ELi1ELi1EE5TupleI6UInt32S4_S4_S4_S4_S4 ⋯
│ 0.01% │ 78.68 µs │ 39 │ 2.02 µs ± 0.24 ( 1.43 ‥ 2.38) │ _Z16broadcast_kernel15CuKernelContext13CuDeviceArrayI8TropicalI7Float32ELi1ELi1EE11BroadcastedI12CuArrayStyleILi1EE5TupleI5OneToI5In ⋯
│ 0.01% │ 35.76 µs │ 1 │ │ _Z18permutedims_kernel15CuKernelContext13CuDeviceArrayI8TropicalI7Float32ELi1ELi1EES0_IS1_IS2_ELi1ELi1EE5TupleI6UInt32S4_S4_S4_S4_S4 ⋯
│ 0.00% │ 21.93 µs │ 4 │ 5.48 µs ± 0.28 ( 5.25 ‥ 5.72) │ _Z16FLOAT_maxplus_TTILi64ELi32ELi64ELi4ELi4EEvPfS0_S0_ffiiiii ⋯
│ 0.00% │ 16.69 µs │ 3 │ 5.56 µs ± 0.28 ( 5.25 ‥ 5.72) │ _Z16FLOAT_maxplus_NTILi64ELi32ELi64ELi4ELi4EEvPfS0_S0_ffiiiii ⋯
│ 0.00% │ 4.53 µs │ 1 │ │ _Z22partial_mapreduce_grid8identity1_8TropicalI7Float32E16CartesianIndicesILi2E5TupleI5OneToI5Int64ES5_IS6_EEES3_ILi2ES4_IS5_IS6_ES5 ⋯
│ 0.00% │ 4.05 µs │ 1 │ │ [copy device to pageable memory] ⋯
└──────────┴────────────┴───────┴──────────────────────────────────────────┴───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
But I failed to profile the second one; it seems that when I try, it uses up all the 80 GB of global memory.
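A possible workaround for the profiler's memory blow-up (an untested suggestion, not something done in this thread) is to hand the trace to an external profiler instead of the integrated one:

using CUDA

# With `external=true` the integrated profiler is bypassed and only an external
# tool (e.g. NSight Systems started via `nsys launch julia`) records the trace,
# so the full event log is not kept in host memory.
CUDA.@profile external=true solve(tn, SingleConfigMax(; bounded=true); usecuda=true, T=Float32)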
PR description outline: To run the example · Performance (Device: NVIDIA RTX A4500, CUDA 12.0) · Before CuTropicalGEMM · After CuTropicalGEMM · Profile