Example: generic tensor network #22

GiggleLiu · 2023-10-14T08:17:53Z

To run the example

cd examples/combinatorial-optimization
julia --project -e "using Pkg; Pkg.instantiate()"
julia --project main.jl

Performance

Device: NVIDIA RTX A4500, CUDA 12.0

Before CuTropicalGEMM

julia> @time Array(solve(tn, SizeMax(); usecuda=true, T=Float32));
 10.772101 seconds (1.79 M allocations: 83.646 MiB, 0.95% gc time)

@time solve(tn, SingleConfigMax(; bounded=true); usecuda=true, T=Float32);
 42.961929 seconds (11.23 M allocations: 425.264 MiB, 0.29% gc time, 0.06% compilation time)

After CuTropicalGEMM

julia> @time Array(solve(tn, SizeMax(); usecuda=true, T=Float32));
  1.109122 seconds (1.29 M allocations: 61.970 MiB)

julia> @time solve(tn, SingleConfigMax(; bounded=true); usecuda=true, T=Float32);
 20.308956 seconds (9.96 M allocations: 366.654 MiB, 0.48% gc time, 0.01% compilation time)

Profile

julia> CUDA.@profile Array(solve(tn, SizeMax(); usecuda=true, T=Float32))
Profiler ran for 1.7 s, capturing 157501 events.

Host-side activity: calling CUDA APIs took 732.81 ms (43.10% of the trace)
┌──────────┬───────────┬───────┬───────────┬───────────┬───────────┬─────────────────────────┐
│ Time (%) │      Time │ Calls │  Avg time │  Min time │  Max time │ Name                    │
├──────────┼───────────┼───────┼───────────┼───────────┼───────────┼─────────────────────────┤
│   32.07% │ 545.26 ms │  8289 │  65.78 µs │    3.1 µs │ 507.52 ms │ cudaLaunchKernel        │
│   22.21% │  377.7 ms │  4694 │  80.47 µs │   2.86 µs │   60.8 ms │ cuStreamSynchronize     │
│    4.17% │  70.98 ms │ 14646 │   4.85 µs │   2.15 µs │ 416.76 µs │ cuMemAllocFromPoolAsync │
│    2.81% │  47.79 ms │  4692 │  10.19 µs │   3.34 µs │  11.46 ms │ cuMemcpyHtoDAsync       │
│    2.26% │  38.41 ms │  6967 │   5.51 µs │   3.81 µs │  45.06 µs │ cuLaunchKernel          │
│    0.01% │ 211.95 µs │     1 │ 211.95 µs │ 211.95 µs │ 211.95 µs │ cuMemcpyDtoHAsync       │
└──────────┴───────────┴───────┴───────────┴───────────┴───────────┴─────────────────────────┘

Device-side activity: GPU was busy for 712.51 ms (41.90% of the trace)
┌──────────┬───────────┬───────┬───────────┬───────────┬───────────┬───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
│ Time (%) │      Time │ Calls │  Avg time │  Min time │  Max time │ Name                                                                                                                                                 ⋯
├──────────┼───────────┼───────┼───────────┼───────────┼───────────┼───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
│   38.98% │ 662.74 ms │  8282 │  80.02 µs │   3.58 µs │  534.3 ms │ _Z16FLOAT_maxplus_NNILi64ELi32ELi64ELi4ELi4EEvPfS0_S0_ffiiiii                                                                                        ⋯
│    0.74% │   12.6 ms │    11 │   1.15 ms │   1.43 µs │  12.45 ms │ _Z18permutedims_kernel15CuKernelContext13CuDeviceArrayI8TropicalI7Float32ELi1ELi1EES0_IS1_IS2_ELi1ELi1EE5TupleI6UInt32S4_S4_S4_S4_S4_S4_S4_S4_ES3_IS ⋯
│    0.73% │  12.47 ms │    33 │ 377.79 µs │   1.19 µs │  12.35 ms │ _Z18permutedims_kernel15CuKernelContext13CuDeviceArrayI8TropicalI7Float32ELi1ELi1EES0_IS1_IS2_ELi1ELi1EE5TupleI6UInt32S4_S4_S4_S4_S4_S4_S4_ES3_IS4_S ⋯
│    0.48% │   8.24 ms │     4 │   2.06 ms │  10.49 µs │   4.16 ms │ _Z18permutedims_kernel15CuKernelContext13CuDeviceArrayI8TropicalI7Float32ELi1ELi1EES0_IS1_IS2_ELi1ELi1EE5TupleI6UInt32S4_S4_S4_S4_S4_S4_S4_S4_S4_S4_ ⋯
│    0.26% │   4.47 ms │  3329 │   1.34 µs │ 953.67 ns │   4.05 µs │ _Z18permutedims_kernel15CuKernelContext13CuDeviceArrayI8TropicalI7Float32ELi1ELi1EES0_IS1_IS2_ELi1ELi1EE5TupleI6UInt32S4_ES3_IS4_S4_E                ⋯
│    0.20% │    3.4 ms │  4692 │ 725.62 ns │ 238.42 ns │   2.38 µs │ [copy pageable to device memory]                                                                                                             
...

julia> CUDA.@profile solve(tn, SingleConfigMax(; bounded=true); usecuda=true, T=Float32)
Profiler ran for 28.53 s, capturing 1077654 events.

Host-side activity: calling CUDA APIs took 9.62 s (33.72% of the trace)
┌──────────┬───────────┬───────┬───────────┬──────────┬───────────┬─────────────────────────┐
│ Time (%) │      Time │ Calls │  Avg time │ Min time │  Max time │ Name                    │
├──────────┼───────────┼───────┼───────────┼──────────┼───────────┼─────────────────────────┤
│   59.20% │   16.89 s │ 70375 │ 240.04 µs │  2.15 µs │    2.99 s │ cuStreamSynchronize     │
│    3.16% │ 900.96 ms │ 63114 │  14.28 µs │  3.81 µs │   7.76 ms │ cuLaunchKernel          │
│    2.87% │  818.6 ms │ 25043 │  32.69 µs │  3.34 µs │  548.1 ms │ cudaLaunchKernel        │
│    2.70% │ 770.51 ms │ 23459 │  32.84 µs │ 14.07 µs │ 534.53 µs │ cuMemcpyDtoHAsync       │
│    1.74% │ 495.97 ms │ 53489 │   9.27 µs │  2.15 µs │   1.02 ms │ cuMemAllocFromPoolAsync │
│    1.06% │ 302.92 ms │ 23457 │  12.91 µs │  3.58 µs │   3.81 ms │ cuMemcpyHtoDAsync       │
└──────────┴───────────┴───────┴───────────┴──────────┴───────────┴─────────────────────────┘

Device-side activity: GPU was busy for 7.8 s (27.34% of the trace)
┌──────────┬───────────┬───────┬───────────┬───────────┬───────────┬───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
│ Time (%) │      Time │ Calls │  Avg time │  Min time │  Max time │ Name                                                                                                                                                 ⋯
├──────────┼───────────┼───────┼───────────┼───────────┼───────────┼───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
│   22.70% │    6.48 s │ 24932 │ 259.74 µs │    3.1 µs │ 582.84 ms │ _Z16FLOAT_maxplus_NNILi64ELi32ELi64ELi4ELi4EEvPfS0_S0_ffiiiii                                                                                        ⋯
│    1.21% │  345.8 ms │     1 │  345.8 ms │  345.8 ms │  345.8 ms │ _Z16broadcast_kernel15CuKernelContext13CuDeviceArrayI8TropicalI7Float32ELi28ELi1EE11BroadcastedI12CuArrayStyleILi28EE5TupleI5OneToI5Int64ES6_IS7_ES6 ⋯
│    1.01% │ 286.84 ms │     2 │ 143.42 ms │ 140.71 ms │ 146.13 ms │ _Z16broadcast_kernel15CuKernelContext13CuDeviceArrayI8TropicalI7Float32ELi27ELi1EE11BroadcastedI12CuArrayStyleILi27EE5TupleI5OneToI5Int64ES6_IS7_ES6 ⋯
│    0.40% │ 114.32 ms │  9382 │  12.19 µs │   2.38 µs │  31.29 ms │ _Z22partial_mapreduce_grid8identity9reductionI6islessE5TupleI4Bool5Int64E16CartesianIndicesILi1ES2_I5OneToIS4_EEES5_ILi1ES2_IS6_IS4_EEE3ValILinfalse ⋯
│    0.35% │ 100.62 ms │ 23459 │   4.29 µs │ 476.84 ns │ 161.65 µs │ [copy device to pageable memory]                                                                                                                     ⋯
│    0.26% │  73.67 ms │     1 │  73.67 ms │  73.67 ms │  73.67 ms │ _Z18permutedims_kernel15CuKernelContext13CuDeviceArrayI8TropicalI7Float32ELi1ELi1EES0_IS1_IS2_ELi1ELi1EE5TupleI6UInt32S4_S4_S4_S4_S4_S4_S4_S4_S4_S4_ ⋯
│    0.23% │  65.95 ms │     1 │  65.95 ms │  65.95 ms │  65.95 ms │ _Z16broadcast_kernel15CuKernelContext13CuDeviceArrayI8TropicalI7Float32ELi26ELi1EE11BroadcastedI12CuArrayStyleILi26EE5TupleI5OneToI5Int64ES6_IS7_ES6 ⋯
│    0.21% │  60.48 ms │     2 │  30.24 ms │  11.44 µs │  60.47 ms │ _Z18permutedims_kernel15CuKernelContext13CuDeviceArrayI8TropicalI7Float32ELi1ELi1EES0_IS1_IS2_ELi1ELi1EE5TupleI6UInt32S4_S4_S4_S4_S4_S4_S4_S4_S4_S4_ ⋯
│    0.14% │  39.56 ms │  9382 │   4.22 µs │   1.19 µs │   8.41 ms │ _Z16broadcast_kernel15CuKernelContext13CuDeviceArrayI4BoolLi1ELi1EE11BroadcastedI12CuArrayStyleILi1EE5TupleI5OneToI5Int64EE8isapproxS4_I8ExtrudedIS0 ⋯
│    0.14% │  38.68 ms │    39 │  991.7 µs │   1.43 µs │  19.18 ms │ _Z18permutedims_kernel15CuKernelContext13CuDeviceArrayI8TropicalI7Float32ELi1ELi1EES0_IS1_IS2_ELi1ELi1EE5TupleI6UInt32S4_S4_S4_S4_S4_S4_S4_S4_ES3_IS ⋯
│    0.09% │  24.75 ms │     1 │  24.75 ms │  24.75 ms │  24.75 ms │ _Z16broadcast_kernel15CuKernelContext13CuDeviceArrayI8TropicalI7Float32ELi25ELi1EE11BroadcastedI12CuArrayStyleILi25EE5TupleI5OneToI5Int64ES6_IS7_ES6 ⋯
│    0.07% │  20.93 ms │    96 │ 218.02 µs │   1.19 µs │  13.96 ms │ _Z18permutedims_kernel15CuKernelContext13CuDeviceArrayI8TropicalI7Float32ELi1ELi1EES0_IS1_IS2_ELi1ELi1EE5TupleI6UInt32S4_S4_S4_S4_S4_S4_S4_ES3_IS4_S ⋯
│    0.07% │  20.69 ms │    10 │   2.07 ms │  10.49 µs │   6.73 ms │ _Z18permutedims_kernel15CuKernelContext13CuDeviceArrayI8TropicalI7Float32ELi1ELi1EES0_IS1_IS2_ELi1ELi1EE5TupleI6UInt32S4_S4_S4_S4_S4_S4_S4_S4_S4_S4_ ⋯
...

codecov · 2023-10-14T08:20:36Z

Welcome to Codecov 🎉

Once merged to your default branch, Codecov will compare your coverage reports and display the results in this comment.

Thanks for integrating Codecov - We've got you covered ☂️

GiggleLiu · 2023-10-14T11:20:57Z

The same benchmark on A800

Before CuTropicalGEMM

julia> @time Array(solve(tn, SizeMax(); usecuda=true, T=Float32));
  4.007866 seconds (1.71 M allocations: 81.452 MiB)

julia> @time solve(tn, SingleConfigMax(; bounded=true); usecuda=true, T=Float32);
 59.282821 seconds (11.29 M allocations: 427.296 MiB, 0.56% gc time, 0.17% compilation time)

After CuTropicalGEMM

julia> @time Array(solve(tn, SizeMax(); usecuda=true, T=Float32));
  1.210406 seconds (1.29 M allocations: 61.970 MiB)

julia> @time solve(tn, SingleConfigMax(; bounded=true); usecuda=true, T=Float32);
 40.313153 seconds (10.04 M allocations: 369.083 MiB, 0.59% gc time, 0.00% compilation time)

It is slower than A4500...

Profile

julia> CUDA.@profile Array(solve(tn, SizeMax(); usecuda=true, T=Float32))
Profiler ran for 1.24 s, capturing 157509 events.

Host-side activity: calling CUDA APIs took 264.52 ms (21.37% of the trace)
┌──────────┬───────────┬───────┬───────────┬───────────┬───────────┬─────────────────────────┐
│ Time (%) │      Time │ Calls │  Avg time │  Min time │  Max time │ Name                    │
├──────────┼───────────┼───────┼───────────┼───────────┼───────────┼─────────────────────────┤
│   13.68% │ 169.35 ms │  8289 │  20.43 µs │   2.38 µs │ 142.61 ms │ cudaLaunchKernel        │
│    3.53% │  43.64 ms │ 14646 │   2.98 µs │   1.43 µs │   5.76 ms │ cuMemAllocFromPoolAsync │
│    2.02% │  25.05 ms │  6967 │    3.6 µs │   2.62 µs │  65.09 µs │ cuLaunchKernel          │
│    1.37% │  16.97 ms │  4694 │   3.62 µs │   1.43 µs │   6.89 ms │ cuStreamSynchronize     │
│    0.96% │  11.89 ms │  4692 │   2.53 µs │   1.91 µs │  20.03 µs │ cuMemcpyHtoDAsync       │
│    0.00% │   26.7 µs │     1 │   26.7 µs │   26.7 µs │   26.7 µs │ cuMemcpyDtoHAsync       │
│    0.00% │   1.19 µs │     6 │ 198.68 ns │    0.0 ns │ 476.84 ns │ cuDeviceGetCount        │
│    0.00% │ 238.42 ns │     1 │ 238.42 ns │ 238.42 ns │ 238.42 ns │ cuDriverGetVersion      │
└──────────┴───────────┴───────┴───────────┴───────────┴───────────┴─────────────────────────┘

Device-side activity: GPU was busy for 298.18 ms (24.09% of the trace)
┌──────────┬───────────┬───────┬───────────┬───────────┬───────────┬────────────────────────────────────────────────
│ Time (%) │      Time │ Calls │  Avg time │  Min time │  Max time │ Name                                          ⋯
├──────────┼───────────┼───────┼───────────┼───────────┼───────────┼────────────────────────────────────────────────
│   20.84% │ 257.85 ms │  8282 │  31.13 µs │   4.77 µs │ 167.64 ms │ _Z16FLOAT_maxplus_NNILi64ELi32ELi64ELi4ELi4EE ⋯
│    0.66% │   8.19 ms │  3329 │   2.46 µs │   1.67 µs │   3.34 µs │ _Z18permutedims_kernel15CuKernelContext13CuDe ⋯
│    0.49% │   6.07 ms │  4692 │   1.29 µs │ 953.67 ns │   2.15 µs │ [copy pageable to device memory]              ⋯
│    0.42% │   5.25 ms │    11 │ 476.97 µs │   2.86 µs │   5.11 ms │ _Z18permutedims_kernel15CuKernelContext13CuDe ⋯
│    0.42% │   5.24 ms │  1575 │   3.33 µs │   2.38 µs │   4.05 µs │ _Z16broadcast_kernel15CuKernelContext13CuDevi ⋯
│    0.42% │   5.16 ms │    33 │ 156.45 µs │   2.15 µs │   5.02 ms │ _Z18permutedims_kernel15CuKernelContext13CuDe ⋯
│    0.26% │   3.25 ms │     4 │ 812.23 µs │   8.82 µs │    1.7 ms │ _Z18permutedims_kernel15CuKernelContext13CuDe ⋯
│    0.19% │    2.4 ms │  1001 │   2.39 µs │   1.67 µs │   4.05 µs │ _Z18permutedims_kernel15CuKernelContext13CuDe ⋯
...

julia> CUDA.@profile solve(tn, SingleConfigMax(; bounded=true); usecuda=true, T=Float32);
Profiler ran for 91.44 s, capturing 1077252 events.

Host-side activity: calling CUDA APIs took 3.42 s (3.74% of the trace)
┌──────────┬───────────┬───────┬───────────┬───────────┬───────────┬─────────────────────────┐
│ Time (%) │      Time │ Calls │  Avg time │  Min time │  Max time │ Name                    │
├──────────┼───────────┼───────┼───────────┼───────────┼───────────┼─────────────────────────┤
│    3.91% │    3.58 s │ 70375 │  50.82 µs │   1.43 µs │    1.72 s │ cuStreamSynchronize     │
│    0.41% │ 372.21 ms │ 23459 │  15.87 µs │  12.64 µs │  45.06 µs │ cuMemcpyDtoHAsync       │
│    0.32% │  291.3 ms │ 63080 │   4.62 µs │   2.62 µs │   7.47 ms │ cuLaunchKernel          │
│    0.29% │ 261.51 ms │ 25043 │  10.44 µs │   2.38 µs │ 159.61 ms │ cudaLaunchKernel        │
│    0.21% │ 189.58 ms │ 53455 │   3.55 µs │   1.43 µs │   10.1 ms │ cuMemAllocFromPoolAsync │
│    0.09% │  81.43 ms │ 23457 │   3.47 µs │   2.15 µs │  40.05 µs │ cuMemcpyHtoDAsync       │
│    0.00% │   1.19 µs │     4 │ 298.02 ns │ 238.42 ns │ 476.84 ns │ cuDeviceGetCount        │
│    0.00% │ 476.84 ns │     2 │ 238.42 ns │ 238.42 ns │ 238.42 ns │ cuDeviceGet             │
└──────────┴───────────┴───────┴───────────┴───────────┴───────────┴─────────────────────────┘

Device-side activity: GPU was busy for 4.22 s (4.62% of the trace)
┌──────────┬───────────┬───────┬───────────┬───────────┬───────────┬────────────────────────────────────────────────
│ Time (%) │      Time │ Calls │  Avg time │  Min time │  Max time │ Name                                          ⋯
├──────────┼───────────┼───────┼───────────┼───────────┼───────────┼────────────────────────────────────────────────
│    3.95% │    3.62 s │ 24932 │ 145.03 µs │   4.77 µs │  186.6 ms │ _Z16FLOAT_maxplus_NNILi64ELi32ELi64ELi4ELi4EE ⋯
│    0.11% │ 100.07 ms │     1 │ 100.07 ms │ 100.07 ms │ 100.07 ms │ _Z16broadcast_kernel15CuKernelContext13CuDevi ⋯
│    0.08% │  74.37 ms │  9382 │   7.93 µs │   3.34 µs │  11.95 ms │ _Z22partial_mapreduce_grid8identity9reduction ⋯
│    0.08% │  73.09 ms │     2 │  36.55 ms │  36.53 ms │  36.57 ms │ _Z16broadcast_kernel15CuKernelContext13CuDevi ⋯
│    0.06% │  54.96 ms │ 23459 │   2.34 µs │   1.43 µs │   3.58 µs │ [copy device to pageable memory]              ⋯
│    0.04% │  32.26 ms │  9382 │   3.44 µs │   1.91 µs │   1.88 ms │ _Z16broadcast_kernel15CuKernelContext13CuDevi ⋯
│    0.03% │  31.29 ms │ 23457 │   1.33 µs │ 953.67 ns │   4.53 µs │ [copy pageable to device memory]              ⋯
│    0.03% │  24.49 ms │ 10001 │   2.45 µs │   1.67 µs │   3.34 µs │ _Z18permutedims_kernel15CuKernelContext13CuDe ⋯
│    0.03% │  23.08 ms │  9382 │   2.46 µs │   1.43 µs │ 973.46 µs │ _Z2_615CuKernelContext13CuDeviceArrayI8Tropic ⋯
│    0.02% │  22.43 ms │     1 │  22.43 ms │  22.43 ms │  22.43 ms │ _Z18permutedims_kernel15CuKernelContext13CuDe ⋯
│    0.02% │  22.14 ms │  9382 │   2.36 µs │   1.43 µs │ 397.92 µs │ _Z16broadcast_kernel15CuKernelContext13CuDevi ⋯
│    0.02% │  19.92 ms │     2 │   9.96 ms │  11.44 µs │  19.91 ms │ _Z18permutedims_kernel15CuKernelContext13CuDe ⋯
│    0.02% │  16.53 ms │     1 │  16.53 ms │  16.53 ms │  16.53 ms │ _Z16broadcast_kernel15CuKernelContext13CuDevi ⋯
│    0.02% │  14.25 ms │    39 │ 365.39 µs │   2.15 µs │   7.24 ms │ _Z18permutedims_kernel15CuKernelContext13CuDe ⋯
│    0.01% │  10.38 ms │  3152 │   3.29 µs │   2.15 µs │   4.05 µs │ _Z16broadcast_kernel15CuKernelContext13CuDevi ⋯
...

From the profile we can see the slow down is maybe related to the stream synchronization.

GiggleLiu · 2023-10-14T11:34:48Z

In some tasks, the CuTropicalGEMM speed up the code as expected. However, in another set of tasks, the stream synchronization takes too muhc time.

Maybe it is related to stream synchronizing mechanism and is related to #19 .

ArrogantGao · 2023-10-14T11:59:50Z

That is strange, all stream synchronization should be done in C code but not in Julia.
I will have a check of that and try to fix #19.

GiggleLiu · 2023-10-14T12:09:23Z

Profile the version without CuTropicalGEMM

I profiled the version without CuTropicalGEMM. It also has a lot of time spent on neither host nor device. So wield.

julia> CUDA.@profile Array(solve(tn, SizeMax(); usecuda=true, T=Float32))
Profiler ran for 4.15 s, capturing 198954 events.

Host-side activity: calling CUDA APIs took 3.1 s (74.69% of the trace)
┌──────────┬───────────┬───────┬───────────┬───────────┬───────────┬────────────
│ Time (%) │      Time │ Calls │  Avg time │  Min time │  Max time │ Name      ⋯
├──────────┼───────────┼───────┼───────────┼───────────┼───────────┼────────────
│   72.43% │     3.0 s │ 15256 │ 196.97 µs │   2.38 µs │    2.72 s │ cuLaunchK ⋯
│    1.34% │  55.78 ms │ 14646 │   3.81 µs │   2.86 µs │  31.71 µs │ cuMemAllo ⋯
│    0.28% │  11.71 ms │  4692 │    2.5 µs │   1.91 µs │  16.45 µs │ cuMemcpyH ⋯
│    0.24% │  10.09 ms │  4694 │   2.15 µs │   1.43 µs │   4.53 µs │ cuStreamS ⋯
│    0.00% │  29.56 µs │     1 │  29.56 µs │  29.56 µs │  29.56 µs │ cuMemcpyD ⋯
│    0.00% │ 715.26 ns │     6 │ 119.21 ns │    0.0 ns │ 476.84 ns │ cuDeviceG ⋯
│    0.00% │ 238.42 ns │     1 │ 238.42 ns │ 238.42 ns │ 238.42 ns │ cuDriverG ⋯
└──────────┴───────────┴───────┴───────────┴───────────┴───────────┴────────────
                                                                1 column omitted

Device-side activity: GPU was busy for 3.07 s (74.12% of the trace)
┌──────────┬───────────┬───────┬───────────┬───────────┬──────────┬─────────────
│ Time (%) │      Time │ Calls │  Avg time │  Min time │ Max time │ Name       ⋯
├──────────┼───────────┼───────┼───────────┼───────────┼──────────┼─────────────
│   73.24% │    3.04 s │  8289 │ 366.53 µs │   1.43 µs │   2.74 s │ matmatmul_ ⋯
│    0.16% │   6.77 ms │  3329 │   2.03 µs │   1.19 µs │  2.86 µs │ _Z18permut ⋯
│    0.15% │   6.05 ms │  4692 │   1.29 µs │ 953.67 ns │  2.15 µs │ [copy page ⋯
│    0.12% │   5.11 ms │    33 │ 154.94 µs │   1.67 µs │  4.97 ms │ _Z18permut ⋯
│    0.12% │   5.07 ms │    11 │ 460.88 µs │   2.86 µs │  4.94 ms │ _Z18permut ⋯
│    0.10% │   4.23 ms │  1575 │   2.68 µs │   1.91 µs │  3.81 µs │ _Z16broadc ⋯
│    0.08% │   3.25 ms │     4 │ 811.46 µs │   8.58 µs │   1.7 ms │ _Z18permut ⋯
...

julia> CUDA.@profile solve(tn, SingleConfigMax(; bounded=true); usecuda=true, T=Float64)
Profiler ran for 122.63 s, capturing 1202467 events.

Host-side activity: calling CUDA APIs took 16.83 s (13.73% of the trace)
┌──────────┬───────────┬───────┬───────────┬───────────┬───────────┬─────────────────────────┐
│ Time (%) │      Time │ Calls │  Avg time │  Min time │  Max time │ Name                    │
├──────────┼───────────┼───────┼───────────┼───────────┼───────────┼─────────────────────────┤
│   13.83% │   16.96 s │ 70375 │ 241.05 µs │ 476.84 ns │    5.15 s │ cuStreamSynchronize     │
│    4.82% │    5.91 s │ 88123 │  67.06 µs │   2.15 µs │    5.09 s │ cuLaunchKernel          │
│    0.41% │ 502.22 ms │ 53455 │    9.4 µs │   2.86 µs │ 106.68 ms │ cuMemAllocFromPoolAsync │
│    0.31% │ 375.17 ms │ 23459 │  15.99 µs │   12.4 µs │  68.43 µs │ cuMemcpyDtoHAsync       │
│    0.07% │  81.05 ms │ 23457 │   3.46 µs │   1.91 µs │  40.77 µs │ cuMemcpyHtoDAsync       │
│    0.00% │   1.67 µs │     4 │ 417.23 ns │ 238.42 ns │ 715.26 ns │ cuDeviceGetCount        │
│    0.00% │ 238.42 ns │     2 │ 119.21 ns │    0.0 ns │ 238.42 ns │ cuDeviceGet             │
└──────────┴───────────┴───────┴───────────┴───────────┴───────────┴─────────────────────────┘

Device-side activity: GPU was busy for 23.07 s (18.81% of the trace)
┌──────────┬───────────┬───────┬───────────┬───────────┬───────────┬────────────────────────────────────────────────
│ Time (%) │      Time │ Calls │  Avg time │  Min time │  Max time │ Name                                          ⋯
├──────────┼───────────┼───────┼───────────┼───────────┼───────────┼────────────────────────────────────────────────
│   18.35% │    22.5 s │ 25043 │ 898.56 µs │   1.43 µs │     5.1 s │ matmatmul_                                    ⋯
│    0.07% │  83.97 ms │     1 │  83.97 ms │  83.97 ms │  83.97 ms │ _Z16broadcast_kernel15CuKernelContext13CuDevi ⋯
│    0.06% │  74.99 ms │     2 │  37.49 ms │  37.49 ms │   37.5 ms │ _Z16broadcast_kernel15CuKernelContext13CuDevi ⋯
│    0.06% │  68.46 ms │  9382 │    7.3 µs │   2.86 µs │   10.2 ms │ _Z22partial_mapreduce_grid8identity9reduction ⋯
│    0.04% │  54.84 ms │ 23459 │   2.34 µs │   1.67 µs │   5.01 µs │ [copy device to pageable memory]              ⋯
│    0.03% │  31.13 ms │ 23457 │   1.33 µs │ 715.26 ns │   2.62 µs │ [copy pageable to device memory]              ⋯
│    0.02% │  30.49 ms │  9382 │   3.25 µs │   1.43 µs │   2.95 ms │ _Z16broadcast_kernel15CuKernelContext13CuDevi ⋯
│    0.02% │  21.08 ms │     1 │  21.08 ms │  21.08 ms │  21.08 ms │ _Z18permutedims_kernel15CuKernelContext13CuDe ⋯
│    0.02% │  19.82 ms │ 10001 │   1.98 µs │   1.19 µs │    3.1 µs │ _Z18permutedims_kernel15CuKernelContext13CuDe ⋯
│    0.02% │  19.18 ms │  9382 │   2.04 µs │   1.19 µs │   1.11 ms │ _Z2_615CuKernelContext13CuDeviceArrayI8Tropic ⋯
│    0.01% │  17.54 ms │  9382 │   1.87 µs │   1.19 µs │ 322.58 µs │ _Z16broadcast_kernel15CuKernelContext13CuDevi ⋯
│    0.01% │  17.24 ms │     1 │  17.24 ms │  17.24 ms │  17.24 ms │ _Z16broadcast_kernel15CuKernelContext13CuDevi ⋯
│    0.01% │  16.85 ms │     2 │   8.42 ms │  13.35 µs │  16.83 ms │ _Z18permutedims_kernel15CuKernelContext13CuDe ⋯
│    0.01% │   16.2 ms │    39 │  415.4 µs │   2.62 µs │   8.12 ms │ _Z18permutedims_kernel15CuKernelContext13CuDe ⋯
...

GiggleLiu · 2023-10-14T12:37:29Z

A related issue: under-Peter/OMEinsum.jl#133 (comment)

ArrogantGao · 2023-12-17T12:16:34Z

The performance on A800 seems to be better after I fixed the stream problem, two times faster that the previous benchmark on A4500.

julia> @time Array(solve(tn, SizeMax(); usecuda=true, T=Float32));
  0.558056 seconds (636.56 k allocations: 32.896 MiB)

julia> @time solve(tn, SingleConfigMax(; bounded=true); usecuda=true, T=Float32);
 11.641459 seconds (9.31 M allocations: 351.891 MiB, 2.37% gc time, 0.01% compilation time)

It is shown by the profiler that the time neither on CPU nor on GPU disappear.

julia> CUDA.@profile Array(solve(tn, SizeMax(); usecuda=true, T=Float32))
Profiler ran for 661.94 ms, capturing 202287 events.

Host-side activity: calling CUDA APIs took 390.1 ms (58.93% of the trace)
┌──────────┬────────────┬───────┬──────────────────────────────────────────┬─────────────────────────┐
│ Time (%) │ Total time │ Calls │ Time distribution                        │ Name                    │
├──────────┼────────────┼───────┼──────────────────────────────────────────┼─────────────────────────┤
│   23.95% │  158.53 ms │  8289 │  19.13 µs ± 1353.11 (   3.1 ‥ 123196.84) │ cudaLaunchKernel        │
│   21.22% │  140.46 ms │ 14646 │   9.59 µs ± 329.82 (  1.19 ‥ 27344.47)   │ cuMemAllocFromPoolAsync │
│    6.59% │   43.62 ms │  6967 │   6.26 µs ± 123.84 (  3.58 ‥ 10338.78)   │ cuLaunchKernel          │
│    2.18% │    14.4 ms │  4692 │   3.07 µs ± 1.29   (  2.38 ‥ 88.69)      │ cuMemcpyHtoDAsync       │
│    1.13% │    7.45 ms │  4694 │   1.59 µs ± 59.94  (  0.48 ‥ 4107.48)    │ cuStreamSynchronize     │
│    0.01% │   35.29 µs │     1 │                                          │ cuMemcpyDtoHAsync       │
│    0.00% │  238.42 ns │     6 │  39.74 ns ± 97.33  (   0.0 ‥ 238.42)     │ cuDeviceGetCount        │
│    0.00% │  238.42 ns │     1 │                                          │ cuDriverGetVersion      │
└──────────┴────────────┴───────┴──────────────────────────────────────────┴─────────────────────────┘

Device-side activity: GPU was busy for 281.15 ms (42.47% of the trace)
┌──────────┬────────────┬───────┬──────────────────────────────────────────┬───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
│ Time (%) │ Total time │ Calls │ Time distribution                        │ Name                                                                                                                                 ⋯
├──────────┼────────────┼───────┼──────────────────────────────────────────┼───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
│   37.33% │  247.08 ms │  8282 │  29.83 µs ± 1869.26 (  4.29 ‥ 169630.77) │ _Z16FLOAT_maxplus_NNILi64ELi32ELi64ELi4ELi4EEvPfS0_S0_ffiiiii                                                                        ⋯
│    0.93% │    6.13 ms │    33 │ 185.66 µs ± 1045.81 (  1.67 ‥ 6011.25)   │ _Z18permutedims_kernel15CuKernelContext13CuDeviceArrayI8TropicalI7Float32ELi1ELi1EES0_IS1_IS2_ELi1ELi1EE5TupleI6UInt32S4_S4_S4_S4_S4 ⋯
│    0.89% │    5.87 ms │  4692 │   1.25 µs ± 0.17   (  0.95 ‥ 3.1)        │ [copy pageable to device memory]                                                                                                     ⋯
│    0.86% │     5.7 ms │  3329 │   1.71 µs ± 0.23   (  1.19 ‥ 2.62)       │ _Z18permutedims_kernel15CuKernelContext13CuDeviceArrayI8TropicalI7Float32ELi1ELi1EES0_IS1_IS2_ELi1ELi1EE5TupleI6UInt32S4_ES3_IS4_S4_ ⋯
│    0.67% │    4.46 ms │    11 │  405.9 µs ± 1305.73 (  2.38 ‥ 4342.56)   │ _Z18permutedims_kernel15CuKernelContext13CuDeviceArrayI8TropicalI7Float32ELi1ELi1EES0_IS1_IS2_ELi1ELi1EE5TupleI6UInt32S4_S4_S4_S4_S4 ⋯
│    0.55% │    3.65 ms │  1575 │   2.32 µs ± 0.21   (  1.67 ‥ 3.1)        │ _Z16broadcast_kernel15CuKernelContext13CuDeviceArrayI8TropicalI7Float32ELi2ELi1EE11BroadcastedI12CuArrayStyleILi2EE5TupleI5OneToI5In ⋯
│    0.43% │    2.85 ms │     4 │ 713.65 µs ± 618.86 ( 10.25 ‥ 1378.3)     │ _Z18permutedims_kernel15CuKernelContext13CuDeviceArrayI8TropicalI7Float32ELi1ELi1EES0_IS1_IS2_ELi1ELi1EE5TupleI6UInt32S4_S4_S4_S4_S4 ⋯
│    0.25% │    1.67 ms │  1001 │   1.67 µs ± 0.21   (  1.19 ‥ 3.1)        │ _Z18permutedims_kernel15CuKernelContext13CuDeviceArrayI8TropicalI7Float32ELi1ELi1EES0_IS1_IS2_ELi1ELi1EE5TupleI6UInt32S4_S4_ES3_IS4_ ⋯
│    0.23% │    1.54 ms │   122 │   12.6 µs ± 107.18 (  1.43 ‥ 1185.18)    │ _Z18permutedims_kernel15CuKernelContext13CuDeviceArrayI8TropicalI7Float32ELi1ELi1EES0_IS1_IS2_ELi1ELi1EE5TupleI6UInt32S4_S4_S4_S4_S4 ⋯
│    0.11% │  758.65 µs │   435 │   1.74 µs ± 0.45   (  1.43 ‥ 6.44)       │ _Z18permutedims_kernel15CuKernelContext13CuDeviceArrayI8TropicalI7Float32ELi1ELi1EES0_IS1_IS2_ELi1ELi1EE5TupleI6UInt32S4_S4_S4_ES3_I ⋯
│    0.06% │  429.39 µs │   230 │   1.87 µs ± 0.63   (  1.43 ‥ 6.91)       │ _Z18permutedims_kernel15CuKernelContext13CuDeviceArrayI8TropicalI7Float32ELi1ELi1EES0_IS1_IS2_ELi1ELi1EE5TupleI6UInt32S4_S4_S4_S4_ES ⋯
│    0.05% │  312.33 µs │    75 │   4.16 µs ± 5.81   (  1.43 ‥ 27.89)      │ _Z18permutedims_kernel15CuKernelContext13CuDeviceArrayI8TropicalI7Float32ELi1ELi1EES0_IS1_IS2_ELi1ELi1EE5TupleI6UInt32S4_S4_S4_S4_S4 ⋯
│    0.03% │  212.43 µs │    89 │   2.39 µs ± 0.4    (  1.91 ‥ 5.01)       │ _Z16broadcast_kernel15CuKernelContext13CuDeviceArrayI8TropicalI7Float32ELi2ELi1EE11BroadcastedI12CuArrayStyleILi2EE5TupleI5OneToI5In ⋯
│    0.02% │  153.54 µs │    11 │  13.96 µs ± 21.05  (  2.38 ‥ 70.57)      │ _Z18permutedims_kernel15CuKernelContext13CuDeviceArrayI8TropicalI7Float32ELi1ELi1EES0_IS1_IS2_ELi1ELi1EE5TupleI6UInt32S4_S4_S4_S4_S4 ⋯
│    0.01% │   82.25 µs │     7 │  11.75 µs ± 12.38  (  2.86 ‥ 36.24)      │ _Z18permutedims_kernel15CuKernelContext13CuDeviceArrayI8TropicalI7Float32ELi1ELi1EES0_IS1_IS2_ELi1ELi1EE5TupleI6UInt32S4_S4_S4_S4_S4 ⋯
│    0.01% │   79.87 µs │     4 │  19.97 µs ± 16.62  (  4.77 ‥ 35.29)      │ _Z18permutedims_kernel15CuKernelContext13CuDeviceArrayI8TropicalI7Float32ELi1ELi1EES0_IS1_IS2_ELi1ELi1EE5TupleI6UInt32S4_S4_S4_S4_S4 ⋯
│    0.01% │   78.68 µs │    39 │   2.02 µs ± 0.24   (  1.43 ‥ 2.38)       │ _Z16broadcast_kernel15CuKernelContext13CuDeviceArrayI8TropicalI7Float32ELi1ELi1EE11BroadcastedI12CuArrayStyleILi1EE5TupleI5OneToI5In ⋯
│    0.01% │   35.76 µs │     1 │                                          │ _Z18permutedims_kernel15CuKernelContext13CuDeviceArrayI8TropicalI7Float32ELi1ELi1EES0_IS1_IS2_ELi1ELi1EE5TupleI6UInt32S4_S4_S4_S4_S4 ⋯
│    0.00% │   21.93 µs │     4 │   5.48 µs ± 0.28   (  5.25 ‥ 5.72)       │ _Z16FLOAT_maxplus_TTILi64ELi32ELi64ELi4ELi4EEvPfS0_S0_ffiiiii                                                                        ⋯
│    0.00% │   16.69 µs │     3 │   5.56 µs ± 0.28   (  5.25 ‥ 5.72)       │ _Z16FLOAT_maxplus_NTILi64ELi32ELi64ELi4ELi4EEvPfS0_S0_ffiiiii                                                                        ⋯
│    0.00% │    4.53 µs │     1 │                                          │ _Z22partial_mapreduce_grid8identity1_8TropicalI7Float32E16CartesianIndicesILi2E5TupleI5OneToI5Int64ES5_IS6_EEES3_ILi2ES4_IS5_IS6_ES5 ⋯
│    0.00% │    4.05 µs │     1 │                                          │ [copy device to pageable memory]                                                                                                     ⋯
└──────────┴────────────┴───────┴──────────────────────────────────────────┴───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────

But I failed to profile the second one, it seems that when I try

CUDA.@profile solve(tn, SingleConfigMax(; bounded=true); usecuda=true, T=Float32);

it used up all the 80G global memory.

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A    187920      C   ....10.0-rc2+0.x64.linux.gnu/bin/julia    81084MiB |
+---------------------------------------------------------------------------------------+

example - generic tensor network

6c81193

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Example: generic tensor network #22

Example: generic tensor network #22

GiggleLiu commented Oct 14, 2023 •

edited

Loading

codecov bot commented Oct 14, 2023

GiggleLiu commented Oct 14, 2023 •

edited

Loading

GiggleLiu commented Oct 14, 2023 •

edited

Loading

ArrogantGao commented Oct 14, 2023

GiggleLiu commented Oct 14, 2023 •

edited

Loading

GiggleLiu commented Oct 14, 2023

ArrogantGao commented Dec 17, 2023 •

edited

Loading

Example: generic tensor network #22

Are you sure you want to change the base?

Example: generic tensor network #22

Conversation

GiggleLiu commented Oct 14, 2023 • edited Loading

To run the example

Performance

Before CuTropicalGEMM

After CuTropicalGEMM

Profile

codecov bot commented Oct 14, 2023

Welcome to Codecov 🎉

GiggleLiu commented Oct 14, 2023 • edited Loading

The same benchmark on A800

Before CuTropicalGEMM

After CuTropicalGEMM

Profile

GiggleLiu commented Oct 14, 2023 • edited Loading

ArrogantGao commented Oct 14, 2023

GiggleLiu commented Oct 14, 2023 • edited Loading

Profile the version without CuTropicalGEMM

GiggleLiu commented Oct 14, 2023

ArrogantGao commented Dec 17, 2023 • edited Loading

GiggleLiu commented Oct 14, 2023 •

edited

Loading

GiggleLiu commented Oct 14, 2023 •

edited

Loading

GiggleLiu commented Oct 14, 2023 •

edited

Loading

GiggleLiu commented Oct 14, 2023 •

edited

Loading

ArrogantGao commented Dec 17, 2023 •

edited

Loading