Example: generic tensor network #22

GiggleLiu commented Oct 14, 2023

To run the example

cd examples/combinatorial-optimization
julia --project -e "using Pkg; Pkg.instantiate()"
julia --project main.jl

Performance

Device: NVIDIA RTX A4500, CUDA 12.0

Before CuTropicalGEMM

julia> @time Array(solve(tn, SizeMax(); usecuda=true, T=Float32));
 10.772101 seconds (1.79 M allocations: 83.646 MiB, 0.95% gc time)

julia> @time solve(tn, SingleConfigMax(; bounded=true); usecuda=true, T=Float32);
 42.961929 seconds (11.23 M allocations: 425.264 MiB, 0.29% gc time, 0.06% compilation time)

After CuTropicalGEMM

julia> @time Array(solve(tn, SizeMax(); usecuda=true, T=Float32));
  1.109122 seconds (1.29 M allocations: 61.970 MiB)

julia> @time solve(tn, SingleConfigMax(; bounded=true); usecuda=true, T=Float32);
 20.308956 seconds (9.96 M allocations: 366.654 MiB, 0.48% gc time, 0.01% compilation time)
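
As a quick sanity check on the A4500 numbers above, the speedup is simply the ratio of wall-clock times before and after. The following is a minimal sketch; the variable names are illustrative only, and single `@time` runs are noisy, so the ratios should be read as rough estimates.

```julia
# Wall-clock times in seconds, copied from the @time output above (A4500).
before = (sizemax = 10.772101, singleconfigmax = 42.961929)
after  = (sizemax = 1.109122,  singleconfigmax = 20.308956)

# Speedup = time before / time after, computed field by field.
speedup = map(/, before, after)

for (name, s) in pairs(speedup)
    println(rpad(string(name), 16), round(s; digits = 2), "x")
end
```

This gives roughly 9.7x for `SizeMax` and roughly 2.1x for `SingleConfigMax`.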

Profile

julia> CUDA.@profile Array(solve(tn, SizeMax(); usecuda=true, T=Float32))
Profiler ran for 1.7 s, capturing 157501 events.

Host-side activity: calling CUDA APIs took 732.81 ms (43.10% of the trace)
┌──────────┬───────────┬───────┬───────────┬───────────┬───────────┬─────────────────────────┐
│ Time (%) │      Time │ Calls │  Avg time │  Min time │  Max time │ Name                    │
├──────────┼───────────┼───────┼───────────┼───────────┼───────────┼─────────────────────────┤
│   32.07% │ 545.26 ms │  8289 │  65.78 µs │    3.1 µs │ 507.52 ms │ cudaLaunchKernel        │
│   22.21% │  377.7 ms │  4694 │  80.47 µs │   2.86 µs │   60.8 ms │ cuStreamSynchronize     │
│    4.17% │  70.98 ms │ 14646 │   4.85 µs │   2.15 µs │ 416.76 µs │ cuMemAllocFromPoolAsync │
│    2.81% │  47.79 ms │  4692 │  10.19 µs │   3.34 µs │  11.46 ms │ cuMemcpyHtoDAsync       │
│    2.26% │  38.41 ms │  6967 │   5.51 µs │   3.81 µs │  45.06 µs │ cuLaunchKernel          │
│    0.01% │ 211.95 µs │     1 │ 211.95 µs │ 211.95 µs │ 211.95 µs │ cuMemcpyDtoHAsync       │
└──────────┴───────────┴───────┴───────────┴───────────┴───────────┴─────────────────────────┘

Device-side activity: GPU was busy for 712.51 ms (41.90% of the trace)
┌──────────┬───────────┬───────┬───────────┬───────────┬───────────┬───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
│ Time (%) │      Time │ Calls │  Avg time │  Min time │  Max time │ Name                                                                                                                                                 
├──────────┼───────────┼───────┼───────────┼───────────┼───────────┼───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
│   38.98% │ 662.74 ms │  8282 │  80.02 µs │   3.58 µs │  534.3 ms │ _Z16FLOAT_maxplus_NNILi64ELi32ELi64ELi4ELi4EEvPfS0_S0_ffiiiii
│    0.74% │   12.6 ms │    11 │   1.15 ms │   1.43 µs │  12.45 ms │ _Z18permutedims_kernel15CuKernelContext13CuDeviceArrayI8TropicalI7Float32ELi1ELi1EES0_IS1_IS2_ELi1ELi1EE5TupleI6UInt32S4_S4_S4_S4_S4_S4_S4_S4_ES3_IS
│    0.73% │  12.47 ms │    33 │ 377.79 µs │   1.19 µs │  12.35 ms │ _Z18permutedims_kernel15CuKernelContext13CuDeviceArrayI8TropicalI7Float32ELi1ELi1EES0_IS1_IS2_ELi1ELi1EE5TupleI6UInt32S4_S4_S4_S4_S4_S4_S4_ES3_IS4_S
│    0.48% │   8.24 ms │     4 │   2.06 ms │  10.49 µs │   4.16 ms │ _Z18permutedims_kernel15CuKernelContext13CuDeviceArrayI8TropicalI7Float32ELi1ELi1EES0_IS1_IS2_ELi1ELi1EE5TupleI6UInt32S4_S4_S4_S4_S4_S4_S4_S4_S4_S4_
│    0.26% │   4.47 ms │  3329 │   1.34 µs │ 953.67 ns │   4.05 µs │ _Z18permutedims_kernel15CuKernelContext13CuDeviceArrayI8TropicalI7Float32ELi1ELi1EES0_IS1_IS2_ELi1ELi1EE5TupleI6UInt32S4_ES3_IS4_S4_E
│    0.20% │    3.4 ms │  4692 │ 725.62 ns │ 238.42 ns │   2.38 µs │ [copy pageable to device memory]
...

julia> CUDA.@profile solve(tn, SingleConfigMax(; bounded=true); usecuda=true, T=Float32)
Profiler ran for 28.53 s, capturing 1077654 events.

Host-side activity: calling CUDA APIs took 9.62 s (33.72% of the trace)
┌──────────┬───────────┬───────┬───────────┬──────────┬───────────┬─────────────────────────┐
│ Time (%) │      Time │ Calls │  Avg time │ Min time │  Max time │ Name                    │
├──────────┼───────────┼───────┼───────────┼──────────┼───────────┼─────────────────────────┤
│   59.20% │   16.89 s │ 70375 │ 240.04 µs │  2.15 µs │    2.99 s │ cuStreamSynchronize     │
│    3.16% │ 900.96 ms │ 63114 │  14.28 µs │  3.81 µs │   7.76 ms │ cuLaunchKernel          │
│    2.87% │  818.6 ms │ 25043 │  32.69 µs │  3.34 µs │  548.1 ms │ cudaLaunchKernel        │
│    2.70% │ 770.51 ms │ 23459 │  32.84 µs │ 14.07 µs │ 534.53 µs │ cuMemcpyDtoHAsync       │
│    1.74% │ 495.97 ms │ 53489 │   9.27 µs │  2.15 µs │   1.02 ms │ cuMemAllocFromPoolAsync │
│    1.06% │ 302.92 ms │ 23457 │  12.91 µs │  3.58 µs │   3.81 ms │ cuMemcpyHtoDAsync       │
└──────────┴───────────┴───────┴───────────┴──────────┴───────────┴─────────────────────────┘

Device-side activity: GPU was busy for 7.8 s (27.34% of the trace)
┌──────────┬───────────┬───────┬───────────┬───────────┬───────────┬───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
│ Time (%) │      Time │ Calls │  Avg time │  Min time │  Max time │ Name                                                                                                                                                 
├──────────┼───────────┼───────┼───────────┼───────────┼───────────┼───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
│   22.70% │    6.48 s │ 24932 │ 259.74 µs │    3.1 µs │ 582.84 ms │ _Z16FLOAT_maxplus_NNILi64ELi32ELi64ELi4ELi4EEvPfS0_S0_ffiiiii
│    1.21% │  345.8 ms │     1 │  345.8 ms │  345.8 ms │  345.8 ms │ _Z16broadcast_kernel15CuKernelContext13CuDeviceArrayI8TropicalI7Float32ELi28ELi1EE11BroadcastedI12CuArrayStyleILi28EE5TupleI5OneToI5Int64ES6_IS7_ES6
│    1.01% │ 286.84 ms │     2 │ 143.42 ms │ 140.71 ms │ 146.13 ms │ _Z16broadcast_kernel15CuKernelContext13CuDeviceArrayI8TropicalI7Float32ELi27ELi1EE11BroadcastedI12CuArrayStyleILi27EE5TupleI5OneToI5Int64ES6_IS7_ES6
│    0.40% │ 114.32 ms │  9382 │  12.19 µs │   2.38 µs │  31.29 ms │ _Z22partial_mapreduce_grid8identity9reductionI6islessE5TupleI4Bool5Int64E16CartesianIndicesILi1ES2_I5OneToIS4_EEES5_ILi1ES2_IS6_IS4_EEE3ValILinfalse
│    0.35% │ 100.62 ms │ 23459 │   4.29 µs │ 476.84 ns │ 161.65 µs │ [copy device to pageable memory]
│    0.26% │  73.67 ms │     1 │  73.67 ms │  73.67 ms │  73.67 ms │ _Z18permutedims_kernel15CuKernelContext13CuDeviceArrayI8TropicalI7Float32ELi1ELi1EES0_IS1_IS2_ELi1ELi1EE5TupleI6UInt32S4_S4_S4_S4_S4_S4_S4_S4_S4_S4_
│    0.23% │  65.95 ms │     1 │  65.95 ms │  65.95 ms │  65.95 ms │ _Z16broadcast_kernel15CuKernelContext13CuDeviceArrayI8TropicalI7Float32ELi26ELi1EE11BroadcastedI12CuArrayStyleILi26EE5TupleI5OneToI5Int64ES6_IS7_ES6
│    0.21% │  60.48 ms │     2 │  30.24 ms │  11.44 µs │  60.47 ms │ _Z18permutedims_kernel15CuKernelContext13CuDeviceArrayI8TropicalI7Float32ELi1ELi1EES0_IS1_IS2_ELi1ELi1EE5TupleI6UInt32S4_S4_S4_S4_S4_S4_S4_S4_S4_S4_
│    0.14% │  39.56 ms │  9382 │   4.22 µs │   1.19 µs │   8.41 ms │ _Z16broadcast_kernel15CuKernelContext13CuDeviceArrayI4BoolLi1ELi1EE11BroadcastedI12CuArrayStyleILi1EE5TupleI5OneToI5Int64EE8isapproxS4_I8ExtrudedIS0
│    0.14% │  38.68 ms │    39 │  991.7 µs │   1.43 µs │  19.18 ms │ _Z18permutedims_kernel15CuKernelContext13CuDeviceArrayI8TropicalI7Float32ELi1ELi1EES0_IS1_IS2_ELi1ELi1EE5TupleI6UInt32S4_S4_S4_S4_S4_S4_S4_S4_ES3_IS
│    0.09% │  24.75 ms │     1 │  24.75 ms │  24.75 ms │  24.75 ms │ _Z16broadcast_kernel15CuKernelContext13CuDeviceArrayI8TropicalI7Float32ELi25ELi1EE11BroadcastedI12CuArrayStyleILi25EE5TupleI5OneToI5Int64ES6_IS7_ES6
│    0.07% │  20.93 ms │    96 │ 218.02 µs │   1.19 µs │  13.96 ms │ _Z18permutedims_kernel15CuKernelContext13CuDeviceArrayI8TropicalI7Float32ELi1ELi1EES0_IS1_IS2_ELi1ELi1EE5TupleI6UInt32S4_S4_S4_S4_S4_S4_S4_ES3_IS4_S
│    0.07% │  20.69 ms │    10 │   2.07 ms │  10.49 µs │   6.73 ms │ _Z18permutedims_kernel15CuKernelContext13CuDeviceArrayI8TropicalI7Float32ELi1ELi1EES0_IS1_IS2_ELi1ELi1EE5TupleI6UInt32S4_S4_S4_S4_S4_S4_S4_S4_S4_S4_
...

codecov bot commented Oct 14, 2023

Welcome to Codecov 🎉

Once merged to your default branch, Codecov will compare your coverage reports and display the results in this comment.

Thanks for integrating Codecov - We've got you covered ☂️

GiggleLiu commented Oct 14, 2023

The same benchmark on A800

Before CuTropicalGEMM

julia> @time Array(solve(tn, SizeMax(); usecuda=true, T=Float32));
  4.007866 seconds (1.71 M allocations: 81.452 MiB)

julia> @time solve(tn, SingleConfigMax(; bounded=true); usecuda=true, T=Float32);
 59.282821 seconds (11.29 M allocations: 427.296 MiB, 0.56% gc time, 0.17% compilation time)

After CuTropicalGEMM

julia> @time Array(solve(tn, SizeMax(); usecuda=true, T=Float32));
  1.210406 seconds (1.29 M allocations: 61.970 MiB)

julia> @time solve(tn, SingleConfigMax(; bounded=true); usecuda=true, T=Float32);
 40.313153 seconds (10.04 M allocations: 369.083 MiB, 0.59% gc time, 0.00% compilation time)

With CuTropicalGEMM, the A800 is slower than the A4500 on both benchmarks (1.21 s vs. 1.11 s, and 40.3 s vs. 20.3 s).

Profile

julia> CUDA.@profile Array(solve(tn, SizeMax(); usecuda=true, T=Float32))
Profiler ran for 1.24 s, capturing 157509 events.

Host-side activity: calling CUDA APIs took 264.52 ms (21.37% of the trace)
┌──────────┬───────────┬───────┬───────────┬───────────┬───────────┬─────────────────────────┐
│ Time (%) │      Time │ Calls │  Avg time │  Min time │  Max time │ Name                    │
├──────────┼───────────┼───────┼───────────┼───────────┼───────────┼─────────────────────────┤
│   13.68% │ 169.35 ms │  8289 │  20.43 µs │   2.38 µs │ 142.61 ms │ cudaLaunchKernel        │
│    3.53% │  43.64 ms │ 14646 │   2.98 µs │   1.43 µs │   5.76 ms │ cuMemAllocFromPoolAsync │
│    2.02% │  25.05 ms │  6967 │    3.6 µs │   2.62 µs │  65.09 µs │ cuLaunchKernel          │
│    1.37% │  16.97 ms │  4694 │   3.62 µs │   1.43 µs │   6.89 ms │ cuStreamSynchronize     │
│    0.96% │  11.89 ms │  4692 │   2.53 µs │   1.91 µs │  20.03 µs │ cuMemcpyHtoDAsync       │
│    0.00% │   26.7 µs │     1 │   26.7 µs │   26.7 µs │   26.7 µs │ cuMemcpyDtoHAsync       │
│    0.00% │   1.19 µs │     6 │ 198.68 ns │    0.0 ns │ 476.84 ns │ cuDeviceGetCount        │
│    0.00% │ 238.42 ns │     1 │ 238.42 ns │ 238.42 ns │ 238.42 ns │ cuDriverGetVersion      │
└──────────┴───────────┴───────┴───────────┴───────────┴───────────┴─────────────────────────┘

Device-side activity: GPU was busy for 298.18 ms (24.09% of the trace)
┌──────────┬───────────┬───────┬───────────┬───────────┬───────────┬────────────────────────────────────────────────
│ Time (%) │      Time │ Calls │  Avg time │  Min time │  Max time │ Name                                          
├──────────┼───────────┼───────┼───────────┼───────────┼───────────┼────────────────────────────────────────────────
│   20.84% │ 257.85 ms │  8282 │  31.13 µs │   4.77 µs │ 167.64 ms │ _Z16FLOAT_maxplus_NNILi64ELi32ELi64ELi4ELi4EE
│    0.66% │   8.19 ms │  3329 │   2.46 µs │   1.67 µs │   3.34 µs │ _Z18permutedims_kernel15CuKernelContext13CuDe
│    0.49% │   6.07 ms │  4692 │   1.29 µs │ 953.67 ns │   2.15 µs │ [copy pageable to device memory]
│    0.42% │   5.25 ms │    11 │ 476.97 µs │   2.86 µs │   5.11 ms │ _Z18permutedims_kernel15CuKernelContext13CuDe
│    0.42% │   5.24 ms │  1575 │   3.33 µs │   2.38 µs │   4.05 µs │ _Z16broadcast_kernel15CuKernelContext13CuDevi
│    0.42% │   5.16 ms │    33 │ 156.45 µs │   2.15 µs │   5.02 ms │ _Z18permutedims_kernel15CuKernelContext13CuDe
│    0.26% │   3.25 ms │     4 │ 812.23 µs │   8.82 µs │    1.7 ms │ _Z18permutedims_kernel15CuKernelContext13CuDe
│    0.19% │    2.4 ms │  1001 │   2.39 µs │   1.67 µs │   4.05 µs │ _Z18permutedims_kernel15CuKernelContext13CuDe
...

julia> CUDA.@profile solve(tn, SingleConfigMax(; bounded=true); usecuda=true, T=Float32);
Profiler ran for 91.44 s, capturing 1077252 events.

Host-side activity: calling CUDA APIs took 3.42 s (3.74% of the trace)
┌──────────┬───────────┬───────┬───────────┬───────────┬───────────┬─────────────────────────┐
│ Time (%) │      Time │ Calls │  Avg time │  Min time │  Max time │ Name                    │
├──────────┼───────────┼───────┼───────────┼───────────┼───────────┼─────────────────────────┤
│    3.91% │    3.58 s │ 70375 │  50.82 µs │   1.43 µs │    1.72 s │ cuStreamSynchronize     │
│    0.41% │ 372.21 ms │ 23459 │  15.87 µs │  12.64 µs │  45.06 µs │ cuMemcpyDtoHAsync       │
│    0.32% │  291.3 ms │ 63080 │   4.62 µs │   2.62 µs │   7.47 ms │ cuLaunchKernel          │
│    0.29% │ 261.51 ms │ 25043 │  10.44 µs │   2.38 µs │ 159.61 ms │ cudaLaunchKernel        │
│    0.21% │ 189.58 ms │ 53455 │   3.55 µs │   1.43 µs │   10.1 ms │ cuMemAllocFromPoolAsync │
│    0.09% │  81.43 ms │ 23457 │   3.47 µs │   2.15 µs │  40.05 µs │ cuMemcpyHtoDAsync       │
│    0.00% │   1.19 µs │     4 │ 298.02 ns │ 238.42 ns │ 476.84 ns │ cuDeviceGetCount        │
│    0.00% │ 476.84 ns │     2 │ 238.42 ns │ 238.42 ns │ 238.42 ns │ cuDeviceGet             │
└──────────┴───────────┴───────┴───────────┴───────────┴───────────┴─────────────────────────┘

Device-side activity: GPU was busy for 4.22 s (4.62% of the trace)
┌──────────┬───────────┬───────┬───────────┬───────────┬───────────┬────────────────────────────────────────────────
│ Time (%) │      Time │ Calls │  Avg time │  Min time │  Max time │ Name                                          
├──────────┼───────────┼───────┼───────────┼───────────┼───────────┼────────────────────────────────────────────────
│    3.95% │    3.62 s │ 24932 │ 145.03 µs │   4.77 µs │  186.6 ms │ _Z16FLOAT_maxplus_NNILi64ELi32ELi64ELi4ELi4EE
│    0.11% │ 100.07 ms │     1 │ 100.07 ms │ 100.07 ms │ 100.07 ms │ _Z16broadcast_kernel15CuKernelContext13CuDevi
│    0.08% │  74.37 ms │  9382 │   7.93 µs │   3.34 µs │  11.95 ms │ _Z22partial_mapreduce_grid8identity9reduction
│    0.08% │  73.09 ms │     2 │  36.55 ms │  36.53 ms │  36.57 ms │ _Z16broadcast_kernel15CuKernelContext13CuDevi
│    0.06% │  54.96 ms │ 23459 │   2.34 µs │   1.43 µs │   3.58 µs │ [copy device to pageable memory]
│    0.04% │  32.26 ms │  9382 │   3.44 µs │   1.91 µs │   1.88 ms │ _Z16broadcast_kernel15CuKernelContext13CuDevi
│    0.03% │  31.29 ms │ 23457 │   1.33 µs │ 953.67 ns │   4.53 µs │ [copy pageable to device memory]
│    0.03% │  24.49 ms │ 10001 │   2.45 µs │   1.67 µs │   3.34 µs │ _Z18permutedims_kernel15CuKernelContext13CuDe
│    0.03% │  23.08 ms │  9382 │   2.46 µs │   1.43 µs │ 973.46 µs │ _Z2_615CuKernelContext13CuDeviceArrayI8Tropic
│    0.02% │  22.43 ms │     1 │  22.43 ms │  22.43 ms │  22.43 ms │ _Z18permutedims_kernel15CuKernelContext13CuDe
│    0.02% │  22.14 ms │  9382 │   2.36 µs │   1.43 µs │ 397.92 µs │ _Z16broadcast_kernel15CuKernelContext13CuDevi
│    0.02% │  19.92 ms │     2 │   9.96 ms │  11.44 µs │  19.91 ms │ _Z18permutedims_kernel15CuKernelContext13CuDe
│    0.02% │  16.53 ms │     1 │  16.53 ms │  16.53 ms │  16.53 ms │ _Z16broadcast_kernel15CuKernelContext13CuDevi
│    0.02% │  14.25 ms │    39 │ 365.39 µs │   2.15 µs │   7.24 ms │ _Z18permutedims_kernel15CuKernelContext13CuDe
│    0.01% │  10.38 ms │  3152 │   3.29 µs │   2.15 µs │   4.05 µs │ _Z16broadcast_kernel15CuKernelContext13CuDevi
...

The profile suggests that the slowdown may be related to stream synchronization.
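
To put a number on the time spent on neither host nor device, one can subtract the two headline percentages of the A800 profiles above from 100. This is only a sketch under an assumption worth stating: host API calls and GPU activity may overlap in time, so the result is a lower bound on the unaccounted share, not an exact figure. The helper name `unaccounted_lower_bound` is made up for illustration.

```julia
# Headline percentages reported by CUDA.@profile on the A800 (from the output above).
profiles = [
    (name = "SizeMax",         host = 21.37, device = 24.09),
    (name = "SingleConfigMax", host = 3.74,  device = 4.62),
]

# Host and device activity can overlap, so host + device is an upper bound on the
# accounted share; 100 minus that sum is a lower bound on the unaccounted share.
unaccounted_lower_bound(p) = max(0.0, 100.0 - (p.host + p.device))

for p in profiles
    println(p.name, ": at least ", round(unaccounted_lower_bound(p); digits = 2),
            "% of the trace is neither CUDA API calls nor GPU activity")
end
```

For `SingleConfigMax` this leaves more than 90% of the trace unaccounted for.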

GiggleLiu commented Oct 14, 2023

In some tasks, CuTropicalGEMM speeds up the code as expected. In others, however, stream synchronization takes too much time.

This may be caused by the stream synchronization mechanism, and may be related to #19.
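
For context, the `FLOAT_maxplus_NN` kernel that dominates the device-side profiles computes a max-plus (tropical) matrix product. The following plain-Julia CPU reference is a minimal sketch of that operation; `maxplus_matmul` is a hypothetical helper written for illustration, not the CuTropicalGEMM API or its implementation.

```julia
# Reference (CPU) max-plus matrix product: C[i, j] = max over k of (A[i, k] + B[k, j]).
# `maxplus_matmul` is a hypothetical name used only for this illustration.
function maxplus_matmul(A::AbstractMatrix{T}, B::AbstractMatrix{T}) where {T<:AbstractFloat}
    size(A, 2) == size(B, 1) || throw(DimensionMismatch("inner dimensions must match"))
    # -Inf plays the role of zero in the max-plus semiring.
    C = fill(typemin(T), size(A, 1), size(B, 2))
    for j in axes(B, 2), k in axes(A, 2), i in axes(A, 1)
        C[i, j] = max(C[i, j], A[i, k] + B[k, j])
    end
    return C
end

A = Float32[1 2; 3 4]
B = Float32[0 1; 2 3]
C = maxplus_matmul(A, B)   # Float32[4 5; 6 7]
```

Compared with an ordinary matrix product, multiplication becomes `+` and accumulation becomes `max`, which is why a dedicated GEMM kernel is needed.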

@ArrogantGao

That is strange: all stream synchronization should be done in the C code, not in Julia.
I will check that and try to fix #19.

GiggleLiu commented Oct 14, 2023

Profile the version without CuTropicalGEMM

I profiled the version without CuTropicalGEMM. It also spends a lot of time on neither the host nor the device. So weird.

julia> CUDA.@profile Array(solve(tn, SizeMax(); usecuda=true, T=Float32))
Profiler ran for 4.15 s, capturing 198954 events.

Host-side activity: calling CUDA APIs took 3.1 s (74.69% of the trace)
┌──────────┬───────────┬───────┬───────────┬───────────┬───────────┬────────────
│ Time (%) │      Time │ Calls │  Avg time │  Min time │  Max time │ Name      
├──────────┼───────────┼───────┼───────────┼───────────┼───────────┼────────────
│   72.43% │     3.0 s │ 15256 │ 196.97 µs │   2.38 µs │    2.72 s │ cuLaunchK
│    1.34% │  55.78 ms │ 14646 │   3.81 µs │   2.86 µs │  31.71 µs │ cuMemAllo
│    0.28% │  11.71 ms │  4692 │    2.5 µs │   1.91 µs │  16.45 µs │ cuMemcpyH
│    0.24% │  10.09 ms │  4694 │   2.15 µs │   1.43 µs │   4.53 µs │ cuStreamS
│    0.00% │  29.56 µs │     1 │  29.56 µs │  29.56 µs │  29.56 µs │ cuMemcpyD
│    0.00% │ 715.26 ns │     6 │ 119.21 ns │    0.0 ns │ 476.84 ns │ cuDeviceG
│    0.00% │ 238.42 ns │     1 │ 238.42 ns │ 238.42 ns │ 238.42 ns │ cuDriverG
└──────────┴───────────┴───────┴───────────┴───────────┴───────────┴────────────
                                                                1 column omitted

Device-side activity: GPU was busy for 3.07 s (74.12% of the trace)
┌──────────┬───────────┬───────┬───────────┬───────────┬──────────┬─────────────
│ Time (%) │      Time │ Calls │  Avg time │  Min time │ Max time │ Name       
├──────────┼───────────┼───────┼───────────┼───────────┼──────────┼─────────────
│   73.24% │    3.04 s │  8289 │ 366.53 µs │   1.43 µs │   2.74 s │ matmatmul_
│    0.16% │   6.77 ms │  3329 │   2.03 µs │   1.19 µs │  2.86 µs │ _Z18permut
│    0.15% │   6.05 ms │  4692 │   1.29 µs │ 953.67 ns │  2.15 µs │ [copy page
│    0.12% │   5.11 ms │    33 │ 154.94 µs │   1.67 µs │  4.97 ms │ _Z18permut
│    0.12% │   5.07 ms │    11 │ 460.88 µs │   2.86 µs │  4.94 ms │ _Z18permut
│    0.10% │   4.23 ms │  1575 │   2.68 µs │   1.91 µs │  3.81 µs │ _Z16broadc
│    0.08% │   3.25 ms │     4 │ 811.46 µs │   8.58 µs │   1.7 ms │ _Z18permut
...

julia> CUDA.@profile solve(tn, SingleConfigMax(; bounded=true); usecuda=true, T=Float64)
Profiler ran for 122.63 s, capturing 1202467 events.

Host-side activity: calling CUDA APIs took 16.83 s (13.73% of the trace)
┌──────────┬───────────┬───────┬───────────┬───────────┬───────────┬─────────────────────────┐
│ Time (%) │      Time │ Calls │  Avg time │  Min time │  Max time │ Name                    │
├──────────┼───────────┼───────┼───────────┼───────────┼───────────┼─────────────────────────┤
│   13.83% │   16.96 s │ 70375 │ 241.05 µs │ 476.84 ns │    5.15 s │ cuStreamSynchronize     │
│    4.82% │    5.91 s │ 88123 │  67.06 µs │   2.15 µs │    5.09 s │ cuLaunchKernel          │
│    0.41% │ 502.22 ms │ 53455 │    9.4 µs │   2.86 µs │ 106.68 ms │ cuMemAllocFromPoolAsync │
│    0.31% │ 375.17 ms │ 23459 │  15.99 µs │   12.4 µs │  68.43 µs │ cuMemcpyDtoHAsync       │
│    0.07% │  81.05 ms │ 23457 │   3.46 µs │   1.91 µs │  40.77 µs │ cuMemcpyHtoDAsync       │
│    0.00% │   1.67 µs │     4 │ 417.23 ns │ 238.42 ns │ 715.26 ns │ cuDeviceGetCount        │
│    0.00% │ 238.42 ns │     2 │ 119.21 ns │    0.0 ns │ 238.42 ns │ cuDeviceGet             │
└──────────┴───────────┴───────┴───────────┴───────────┴───────────┴─────────────────────────┘

Device-side activity: GPU was busy for 23.07 s (18.81% of the trace)
┌──────────┬───────────┬───────┬───────────┬───────────┬───────────┬────────────────────────────────────────────────
│ Time (%) │      Time │ Calls │  Avg time │  Min time │  Max time │ Name                                          
├──────────┼───────────┼───────┼───────────┼───────────┼───────────┼────────────────────────────────────────────────
│   18.35% │    22.5 s │ 25043 │ 898.56 µs │   1.43 µs │     5.1 s │ matmatmul_
│    0.07% │  83.97 ms │     1 │  83.97 ms │  83.97 ms │  83.97 ms │ _Z16broadcast_kernel15CuKernelContext13CuDevi
│    0.06% │  74.99 ms │     2 │  37.49 ms │  37.49 ms │   37.5 ms │ _Z16broadcast_kernel15CuKernelContext13CuDevi
│    0.06% │  68.46 ms │  9382 │    7.3 µs │   2.86 µs │   10.2 ms │ _Z22partial_mapreduce_grid8identity9reduction
│    0.04% │  54.84 ms │ 23459 │   2.34 µs │   1.67 µs │   5.01 µs │ [copy device to pageable memory]
│    0.03% │  31.13 ms │ 23457 │   1.33 µs │ 715.26 ns │   2.62 µs │ [copy pageable to device memory]
│    0.02% │  30.49 ms │  9382 │   3.25 µs │   1.43 µs │   2.95 ms │ _Z16broadcast_kernel15CuKernelContext13CuDevi
│    0.02% │  21.08 ms │     1 │  21.08 ms │  21.08 ms │  21.08 ms │ _Z18permutedims_kernel15CuKernelContext13CuDe
│    0.02% │  19.82 ms │ 10001 │   1.98 µs │   1.19 µs │    3.1 µs │ _Z18permutedims_kernel15CuKernelContext13CuDe
│    0.02% │  19.18 ms │  9382 │   2.04 µs │   1.19 µs │   1.11 ms │ _Z2_615CuKernelContext13CuDeviceArrayI8Tropic
│    0.01% │  17.54 ms │  9382 │   1.87 µs │   1.19 µs │ 322.58 µs │ _Z16broadcast_kernel15CuKernelContext13CuDevi
│    0.01% │  17.24 ms │     1 │  17.24 ms │  17.24 ms │  17.24 ms │ _Z16broadcast_kernel15CuKernelContext13CuDevi
│    0.01% │  16.85 ms │     2 │   8.42 ms │  13.35 µs │  16.83 ms │ _Z18permutedims_kernel15CuKernelContext13CuDe
│    0.01% │   16.2 ms │    39 │  415.4 µs │   2.62 µs │   8.12 ms │ _Z18permutedims_kernel15CuKernelContext13CuDe
...

@GiggleLiu

A related issue: under-Peter/OMEinsum.jl#133 (comment)

ArrogantGao commented Dec 17, 2023

The performance on the A800 seems better after I fixed the stream problem: it is now about twice as fast as the previous benchmark on the A4500.

julia> @time Array(solve(tn, SizeMax(); usecuda=true, T=Float32));
  0.558056 seconds (636.56 k allocations: 32.896 MiB)

julia> @time solve(tn, SingleConfigMax(; bounded=true); usecuda=true, T=Float32);
 11.641459 seconds (9.31 M allocations: 351.891 MiB, 2.37% gc time, 0.01% compilation time)

The profiler shows that the time spent on neither the CPU nor the GPU has disappeared.

julia> CUDA.@profile Array(solve(tn, SizeMax(); usecuda=true, T=Float32))
Profiler ran for 661.94 ms, capturing 202287 events.

Host-side activity: calling CUDA APIs took 390.1 ms (58.93% of the trace)
┌──────────┬────────────┬───────┬──────────────────────────────────────────┬─────────────────────────┐
│ Time (%) │ Total time │ Calls │ Time distribution                        │ Name                    │
├──────────┼────────────┼───────┼──────────────────────────────────────────┼─────────────────────────┤
│   23.95% │  158.53 ms │  8289 │  19.13 µs ± 1353.11 (   3.1 ‥ 123196.84) │ cudaLaunchKernel        │
│   21.22% │  140.46 ms │ 14646 │   9.59 µs ± 329.82 (  1.19 ‥ 27344.47)   │ cuMemAllocFromPoolAsync │
│    6.59% │   43.62 ms │  6967 │   6.26 µs ± 123.84 (  3.58 ‥ 10338.78)   │ cuLaunchKernel          │
│    2.18% │    14.4 ms │  4692 │   3.07 µs ± 1.29   (  2.38 ‥ 88.69)      │ cuMemcpyHtoDAsync       │
│    1.13% │    7.45 ms │  4694 │   1.59 µs ± 59.94  (  0.48 ‥ 4107.48)    │ cuStreamSynchronize     │
│    0.01% │   35.29 µs │     1 │                                          │ cuMemcpyDtoHAsync       │
│    0.00% │  238.42 ns │     6 │  39.74 ns ± 97.33  (   0.0 ‥ 238.42)     │ cuDeviceGetCount        │
│    0.00% │  238.42 ns │     1 │                                          │ cuDriverGetVersion      │
└──────────┴────────────┴───────┴──────────────────────────────────────────┴─────────────────────────┘

Device-side activity: GPU was busy for 281.15 ms (42.47% of the trace)
┌──────────┬────────────┬───────┬──────────────────────────────────────────┬───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
│ Time (%) │ Total time │ Calls │ Time distribution                        │ Name                                                                                                                                 
├──────────┼────────────┼───────┼──────────────────────────────────────────┼───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
│   37.33% │  247.08 ms │  8282 │  29.83 µs ± 1869.26 (  4.29 ‥ 169630.77) │ _Z16FLOAT_maxplus_NNILi64ELi32ELi64ELi4ELi4EEvPfS0_S0_ffiiiii
│    0.93% │    6.13 ms │    33 │ 185.66 µs ± 1045.81 (  1.67 ‥ 6011.25)   │ _Z18permutedims_kernel15CuKernelContext13CuDeviceArrayI8TropicalI7Float32ELi1ELi1EES0_IS1_IS2_ELi1ELi1EE5TupleI6UInt32S4_S4_S4_S4_S4
│    0.89% │    5.87 ms │  4692 │   1.25 µs ± 0.17   (  0.95 ‥ 3.1)        │ [copy pageable to device memory]
│    0.86% │     5.7 ms │  3329 │   1.71 µs ± 0.23   (  1.19 ‥ 2.62)       │ _Z18permutedims_kernel15CuKernelContext13CuDeviceArrayI8TropicalI7Float32ELi1ELi1EES0_IS1_IS2_ELi1ELi1EE5TupleI6UInt32S4_ES3_IS4_S4_
│    0.67% │    4.46 ms │    11 │  405.9 µs ± 1305.73 (  2.38 ‥ 4342.56)   │ _Z18permutedims_kernel15CuKernelContext13CuDeviceArrayI8TropicalI7Float32ELi1ELi1EES0_IS1_IS2_ELi1ELi1EE5TupleI6UInt32S4_S4_S4_S4_S4
│    0.55% │    3.65 ms │  1575 │   2.32 µs ± 0.21   (  1.67 ‥ 3.1)        │ _Z16broadcast_kernel15CuKernelContext13CuDeviceArrayI8TropicalI7Float32ELi2ELi1EE11BroadcastedI12CuArrayStyleILi2EE5TupleI5OneToI5In
│    0.43% │    2.85 ms │     4 │ 713.65 µs ± 618.86 ( 10.25 ‥ 1378.3)     │ _Z18permutedims_kernel15CuKernelContext13CuDeviceArrayI8TropicalI7Float32ELi1ELi1EES0_IS1_IS2_ELi1ELi1EE5TupleI6UInt32S4_S4_S4_S4_S4
│    0.25% │    1.67 ms │  1001 │   1.67 µs ± 0.21   (  1.19 ‥ 3.1)        │ _Z18permutedims_kernel15CuKernelContext13CuDeviceArrayI8TropicalI7Float32ELi1ELi1EES0_IS1_IS2_ELi1ELi1EE5TupleI6UInt32S4_S4_ES3_IS4_
│    0.23% │    1.54 ms │   122 │   12.6 µs ± 107.18 (  1.43 ‥ 1185.18)    │ _Z18permutedims_kernel15CuKernelContext13CuDeviceArrayI8TropicalI7Float32ELi1ELi1EES0_IS1_IS2_ELi1ELi1EE5TupleI6UInt32S4_S4_S4_S4_S4
│    0.11% │  758.65 µs │   435 │   1.74 µs ± 0.45   (  1.43 ‥ 6.44)       │ _Z18permutedims_kernel15CuKernelContext13CuDeviceArrayI8TropicalI7Float32ELi1ELi1EES0_IS1_IS2_ELi1ELi1EE5TupleI6UInt32S4_S4_S4_ES3_I
│    0.06% │  429.39 µs │   230 │   1.87 µs ± 0.63   (  1.43 ‥ 6.91)       │ _Z18permutedims_kernel15CuKernelContext13CuDeviceArrayI8TropicalI7Float32ELi1ELi1EES0_IS1_IS2_ELi1ELi1EE5TupleI6UInt32S4_S4_S4_S4_ES
│    0.05% │  312.33 µs │    75 │   4.16 µs ± 5.81   (  1.43 ‥ 27.89)      │ _Z18permutedims_kernel15CuKernelContext13CuDeviceArrayI8TropicalI7Float32ELi1ELi1EES0_IS1_IS2_ELi1ELi1EE5TupleI6UInt32S4_S4_S4_S4_S4
│    0.03% │  212.43 µs │    89 │   2.39 µs ± 0.4    (  1.91 ‥ 5.01)       │ _Z16broadcast_kernel15CuKernelContext13CuDeviceArrayI8TropicalI7Float32ELi2ELi1EE11BroadcastedI12CuArrayStyleILi2EE5TupleI5OneToI5In
│    0.02% │  153.54 µs │    11 │  13.96 µs ± 21.05  (  2.38 ‥ 70.57)      │ _Z18permutedims_kernel15CuKernelContext13CuDeviceArrayI8TropicalI7Float32ELi1ELi1EES0_IS1_IS2_ELi1ELi1EE5TupleI6UInt32S4_S4_S4_S4_S4
│    0.01% │   82.25 µs │     7 │  11.75 µs ± 12.38  (  2.86 ‥ 36.24)      │ _Z18permutedims_kernel15CuKernelContext13CuDeviceArrayI8TropicalI7Float32ELi1ELi1EES0_IS1_IS2_ELi1ELi1EE5TupleI6UInt32S4_S4_S4_S4_S4
│    0.01% │   79.87 µs │     4 │  19.97 µs ± 16.62  (  4.77 ‥ 35.29)      │ _Z18permutedims_kernel15CuKernelContext13CuDeviceArrayI8TropicalI7Float32ELi1ELi1EES0_IS1_IS2_ELi1ELi1EE5TupleI6UInt32S4_S4_S4_S4_S4
│    0.01% │   78.68 µs │    39 │   2.02 µs ± 0.24   (  1.43 ‥ 2.38)       │ _Z16broadcast_kernel15CuKernelContext13CuDeviceArrayI8TropicalI7Float32ELi1ELi1EE11BroadcastedI12CuArrayStyleILi1EE5TupleI5OneToI5In
│    0.01% │   35.76 µs │     1 │                                          │ _Z18permutedims_kernel15CuKernelContext13CuDeviceArrayI8TropicalI7Float32ELi1ELi1EES0_IS1_IS2_ELi1ELi1EE5TupleI6UInt32S4_S4_S4_S4_S4
│    0.00% │   21.93 µs │     4 │   5.48 µs ± 0.28   (  5.25 ‥ 5.72)       │ _Z16FLOAT_maxplus_TTILi64ELi32ELi64ELi4ELi4EEvPfS0_S0_ffiiiii
│    0.00% │   16.69 µs │     3 │   5.56 µs ± 0.28   (  5.25 ‥ 5.72)       │ _Z16FLOAT_maxplus_NTILi64ELi32ELi64ELi4ELi4EEvPfS0_S0_ffiiiii
│    0.00% │    4.53 µs │     1 │                                          │ _Z22partial_mapreduce_grid8identity1_8TropicalI7Float32E16CartesianIndicesILi2E5TupleI5OneToI5Int64ES5_IS6_EEES3_ILi2ES4_IS5_IS6_ES5
│    0.00% │    4.05 µs │     1 │                                          │ [copy device to pageable memory]
└──────────┴────────────┴───────┴──────────────────────────────────────────┴───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
                                                                                                                                                                                                   

However, I failed to profile the second benchmark. It seems that when I run

CUDA.@profile solve(tn, SingleConfigMax(; bounded=true); usecuda=true, T=Float32);

it uses up all 80 GB of global memory.

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A    187920      C   ....10.0-rc2+0.x64.linux.gnu/bin/julia    81084MiB |
+---------------------------------------------------------------------------------------+
