Parallel I/O #1399
Conversation
Thanks a lot! Are the results from a fresh run or a later run without compilation? |
As another test, you could try to isolate the I/O part and only benchmark this code. You could even extract it from Trixi and just use randomly filled arrays as dummy data. This would probably help to isolate the underlying issue. Then, I would move to a different machine and check if you see the same behavior, just eliminate the possibility that it is a system specific issue. If the overly long write times persist, you can also try to recreate it in C/C++ to see if it is a Julia issue or just an artifact of the HDF5 library that is being used. |
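For instance, a minimal sketch of such an isolated benchmark (the file name and array size are arbitrary placeholders; the MPI-enabled MWE further down in this thread covers the parallel case):
using HDF5, BenchmarkTools
A = rand(100_000)  # randomly filled dummy data instead of Trixi solution arrays
t = @benchmark begin
    h5open("dummy.h5", "w") do file  # "w" truncates, so every sample writes a fresh file
        write(file, "A", $A)
    end
end
println(t)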
@ranocha, to exclude compilation time I need to use tmpi, right? With 4 ranks using tmpi I get the following results for the second run of tree_2d_dgsem/elixir_advection_basic.jl:

Running without parallel HDF5 enabled
────────────────────────────────────────────────────────────────────────────────────
Trixi.jl Time Allocations
─────────────────────── ────────────────────────
Tot / % measured: 32.6ms / 96.7% 337KiB / 95.2%
│
Section ncalls time %tot avg alloc %tot avg
────────────────────────────────────────────────────────────────────────────────────
rhs! 91 13.0ms 41.4% 143μs 77.8KiB 24.2% 875B
volume integral 91 3.74ms 11.9% 41.1μs 0.00B 0.0% 0.00B
finish MPI send 91 2.34ms 7.4% 25.7μs 18.5KiB 5.8% 208B
interface flux 91 1.61ms 5.1% 17.7μs 0.00B 0.0% 0.00B
finish MPI receive 91 1.51ms 4.8% 16.6μs 21.3KiB 6.6% 240B
~rhs!~ 91 1.06ms 3.4% 11.7μs 15.2KiB 4.7% 171B
start MPI send 91 799μs 2.5% 8.78μs 11.4KiB 3.5% 128B
surface integral 91 524μs 1.7% 5.76μs 0.00B 0.0% 0.00B
prolong2interfaces 91 459μs 1.5% 5.05μs 0.00B 0.0% 0.00B
MPI interface flux 91 396μs 1.3% 4.35μs 0.00B 0.0% 0.00B
start MPI receive 91 197μs 0.6% 2.16μs 11.4KiB 3.5% 128B
prolong2mpiinterfaces 91 110μs 0.3% 1.20μs 0.00B 0.0% 0.00B
reset ∂u/∂t 91 78.8μs 0.3% 866ns 0.00B 0.0% 0.00B
Jacobian 91 76.6μs 0.2% 842ns 0.00B 0.0% 0.00B
MPI mortar flux 91 40.8μs 0.1% 449ns 0.00B 0.0% 0.00B
prolong2mpimortars 91 23.1μs 0.1% 254ns 0.00B 0.0% 0.00B
prolong2boundaries 91 22.0μs 0.1% 242ns 0.00B 0.0% 0.00B
mortar flux 91 21.0μs 0.1% 231ns 0.00B 0.0% 0.00B
prolong2mortars 91 20.3μs 0.1% 224ns 0.00B 0.0% 0.00B
source terms 91 11.1μs 0.0% 122ns 0.00B 0.0% 0.00B
boundary flux 91 11.1μs 0.0% 122ns 0.00B 0.0% 0.00B
I/O 3 12.2ms 38.7% 4.06ms 170KiB 52.9% 56.6KiB
save solution 2 6.81ms 21.6% 3.40ms 132KiB 41.2% 66.1KiB
~I/O~ 3 5.35ms 17.0% 1.78ms 33.9KiB 10.6% 11.3KiB
get element variables 2 27.5μs 0.1% 13.7μs 3.66KiB 1.1% 1.83KiB
save mesh 2 501ns 0.0% 250ns 0.00B 0.0% 0.00B
analyze solution 2 5.15ms 16.3% 2.58ms 69.8KiB 21.7% 34.9KiB
calculate dt 19 1.13ms 3.6% 59.4μs 3.86KiB 1.2% 208B
────────────────────────────────────────────────────────────────────────────────────

Running with parallel HDF5 enabled
────────────────────────────────────────────────────────────────────────────────────
Trixi.jl Time Allocations
─────────────────────── ────────────────────────
Tot / % measured: 35.2ms / 96.2% 276KiB / 94.2%
Section ncalls time %tot avg alloc %tot avg
────────────────────────────────────────────────────────────────────────────────────
I/O 3 15.4ms 45.5% 5.12ms 109KiB 41.8% 36.3KiB
save solution 2 11.1ms 32.8% 5.55ms 70.9KiB 27.3% 35.5KiB
~I/O~ 3 4.25ms 12.6% 1.42ms 34.2KiB 13.1% 11.4KiB
get element variables 2 25.8μs 0.1% 12.9μs 3.66KiB 1.4% 1.83KiB
save mesh 2 536ns 0.0% 268ns 0.00B 0.0% 0.00B
rhs! 91 14.0ms 41.5% 154μs 77.8KiB 29.9% 875B
volume integral 91 5.24ms 15.5% 57.6μs 0.00B 0.0% 0.00B
interface flux 91 1.99ms 5.9% 21.9μs 0.00B 0.0% 0.00B
finish MPI receive 91 1.37ms 4.0% 15.0μs 21.3KiB 8.2% 240B
~rhs!~ 91 1.13ms 3.3% 12.4μs 15.2KiB 5.8% 171B
start MPI send 91 1.01ms 3.0% 11.1μs 11.4KiB 4.4% 128B
surface integral 91 863μs 2.6% 9.48μs 0.00B 0.0% 0.00B
MPI interface flux 91 649μs 1.9% 7.13μs 0.00B 0.0% 0.00B
prolong2interfaces 91 584μs 1.7% 6.42μs 0.00B 0.0% 0.00B
finish MPI send 91 492μs 1.5% 5.40μs 18.5KiB 7.1% 208B
start MPI receive 91 221μs 0.7% 2.43μs 11.4KiB 4.4% 128B
prolong2mpiinterfaces 91 139μs 0.4% 1.53μs 0.00B 0.0% 0.00B
reset ∂u/∂t 91 99.4μs 0.3% 1.09μs 0.00B 0.0% 0.00B
Jacobian 91 96.1μs 0.3% 1.06μs 0.00B 0.0% 0.00B
prolong2boundaries 91 27.4μs 0.1% 301ns 0.00B 0.0% 0.00B

Running serial for comparison
────────────────────────────────────────────────────────────────────────────────────
Trixi.jl Time Allocations
Tot / % measured: 56.0ms / 95.6% 320KiB / 87.6%
Section ncalls time %tot avg alloc %tot avg
───────────────────────────────────────────────────────────────────────────────────
rhs! 91 35.8ms 66.8% 393μs 9.33KiB 3.3% 105B
volume integral 91 20.2ms 37.7% 222μs 0.00B 0.0% 0.00B
interface flux 91 8.66ms 16.2% 95.2μs 0.00B 0.0% 0.00B
prolong2interfaces 91 2.64ms 4.9% 29.0μs 0.00B 0.0% 0.00B
surface integral 91 2.61ms 4.9% 28.7μs 0.00B 0.0% 0.00B
~rhs!~ 91 873μs 1.6% 9.59μs 9.33KiB 3.3% 105B
Jacobian 91 352μs 0.7% 3.87μs 0.00B 0.0% 0.00B
reset ∂u/∂t 91 339μs 0.6% 3.72μs 0.00B 0.0% 0.00B
prolong2mortars 91 29.2μs 0.1% 321ns 0.00B 0.0% 0.00B
prolong2boundaries 91 28.9μs 0.1% 318ns 0.00B 0.0% 0.00B
mortar flux 91 28.4μs 0.1% 312ns 0.00B 0.0% 0.00B
boundary flux 91 13.6μs 0.0% 150ns 0.00B 0.0% 0.00B
source terms 91 12.0μs 0.0% 132ns 0.00B 0.0% 0.00B
I/O 3 11.0ms 20.6% 3.67ms 249KiB 88.6% 82.9KiB
save solution 2 5.92ms 11.0% 2.96ms 211KiB 75.3% 106KiB
~I/O~ 3 5.07ms 9.5% 1.69ms 33.8KiB 12.1% 11.3KiB
get element variables 2 30.9μs 0.1% 15.4μs 3.53KiB 1.3% 1.77KiB
save mesh 2 473ns 0.0% 236ns 0.00B 0.0% 0.00B
analyze solution 2 6.74ms 12.6% 3.37ms 22.6KiB 8.0% 11.3KiB
calculate dt 19 28.6μs 0.1% 1.50μs 0.00B 0.0% 0.00B
────────────────────────────────────────────────────────────────────────────────────

So, I/O with parallel HDF5 still takes almost twice as long as in the serial run and in the poor man's version. I already started to create a simple Trixi.jl-independent example, as @sloede also suggested. I will elaborate on this.

EDIT: I was able to (at least partly) reproduce the issue with a Trixi-independent MWE on different machines. On both machines the performance doesn't improve with parallel HDF5, see the tmpi terminals of the benchmarks below.

My laptop using 4 ranks with tmpi
julia> include("parallel_benchmark.jl") │julia> include("parallel_benchmark.jl")
rank 0 │rank 1
running parallel │running parallel
BenchmarkTools.Trial: 100 samples with 1 evaluation. │BenchmarkTools.Trial: 100 samples with 1 evaluation.
Range (min … max): 7.225 ms … 676.040 ms ┊ GC (min … max): 0.00% … 0.00% │ Range (min … max): 4.151 ms … 676.024 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 17.866 ms ┊ GC (median): 0.00% │ Time (median): 17.929 ms ┊ GC (median): 0.00%
Time (mean ± σ): 48.891 ms ± 78.380 ms ┊ GC (mean ± σ): 0.00% ± 0.00% │ Time (mean ± σ): 48.851 ms ± 78.385 ms ┊ GC (mean ± σ): 0.00% ± 0.00%
│
▄█ ▂▁ │ ██ ▁ ▄
▄▇██▆▁▁▁▁▁▁▁▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄▁▁▁▁▆█▁██▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄ ▄ │ ▅▅▆██▅▁▁▁▁▁▁▅▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▅▁▁▁▅█▁▇█▅▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▅ ▅
7.23 ms Histogram: log(frequency) by time 203 ms < │ 4.15 ms Histogram: log(frequency) by time 203 ms <
│
Memory estimate: 8.15 KiB, allocs estimate: 106. │ Memory estimate: 8.18 KiB, allocs estimate: 108.
running on_root │running on_root
BenchmarkTools.Trial: 100 samples with 1 evaluation. │BenchmarkTools.Trial: 100 samples with 1 evaluation.
Range (min … max): 4.578 ms … 502.710 ms ┊ GC (min … max): 0.00% … 0.00% │ Range (min … max): 10.249 ms … 502.645 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 11.098 ms ┊ GC (median): 0.00% │ Time (median): 11.115 ms ┊ GC (median): 0.00%
Time (mean ± σ): 46.678 ms ± 72.112 ms ┊ GC (mean ± σ): 0.06% ± 0.94% │ Time (mean ± σ): 46.831 ms ± 72.069 ms ┊ GC (mean ± σ): 0.00% ± 0.00%
│
█▆ ▄ │ █ ▃
▄██▁▁▁▁▁▁▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄█▁▁▁▁▁▁▁▁▁▁▁▁▄▁▁▁▆▁▄▁▁▄▄▄▁▄ ▄ │ █▁▄▁▁▁▁▁▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▄▁▁▁▁▁▁▁▁▁▁▁▄▁▁▁▆▁▁▄▁▄▁▄▄▁▄ ▄
4.58 ms Histogram: log(frequency) by time 199 ms < │ 10.2 ms Histogram: log(frequency) by time 199 ms <
│
Memory estimate: 3.06 MiB, allocs estimate: 110. │ Memory estimate: 128 bytes, allocs estimate: 2.
│
julia> │julia>
─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── │
julia> include("parallel_benchmark.jl") │julia> include("parallel_benchmark.jl")
rank 2 │rank 3
running parallel │running parallel
BenchmarkTools.Trial: 100 samples with 1 evaluation. │BenchmarkTools.Trial: 100 samples with 1 evaluation.
Range (min … max): 5.275 ms … 676.031 ms ┊ GC (min … max): 0.00% … 0.00% │ Range (min … max): 3.874 ms … 675.990 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 17.804 ms ┊ GC (median): 0.00% │ Time (median): 17.815 ms ┊ GC (median): 0.00%
Time (mean ± σ): 48.869 ms ± 78.394 ms ┊ GC (mean ± σ): 0.00% ± 0.00% │ Time (mean ± σ): 48.856 ms ± 78.400 ms ┊ GC (mean ± σ): 0.00% ± 0.00%
│
█▂ ▂ │ ▆█ ▁ ▃
▄▇▁██▁▁▁▁▁▁▁▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄▁▁▁▄█▁██▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄ ▄ │ ▅▁▇██▅▁▁▁▁▁▁▅▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▅▁▁▁▁█▁▆█▅▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▅ ▅
5.28 ms Histogram: log(frequency) by time 203 ms < │ 3.87 ms Histogram: log(frequency) by time 203 ms <
│
Memory estimate: 8.18 KiB, allocs estimate: 108. │ Memory estimate: 8.18 KiB, allocs estimate: 108.
running on_root │running on_root
BenchmarkTools.Trial: 100 samples with 1 evaluation. │BenchmarkTools.Trial: 100 samples with 1 evaluation.
Range (min … max): 10.250 ms … 502.647 ms ┊ GC (min … max): 0.00% … 0.00% │ Range (min … max): 117.345 μs … 502.647 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 11.099 ms ┊ GC (median): 0.00% │ Time (median): 11.055 ms ┊ GC (median): 0.00%
Time (mean ± σ): 46.783 ms ± 72.096 ms ┊ GC (mean ± σ): 0.00% ± 0.00% │ Time (mean ± σ): 46.643 ms ± 72.171 ms ┊ GC (mean ± σ): 0.00% ± 0.00%
│
█ ▃ │ █ ▃
█▄▁▁▁▁▁▁▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▄▁▁▁▁▁▁▁▁▁▁▁▄▁▁▁▄▄▁▄▁▄▁▄▄▁▄ ▄ │ ▄▁▁█▁▁▁▁▁▁▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▄▁▁▁▁▁▁▁▁▁▁▁▄▁▁▁▆▁▄▁▁▄▄▄▁▄ ▄
10.3 ms Histogram: log(frequency) by time 199 ms < │ 117 μs Histogram: log(frequency) by time 199 ms <
│
Memory estimate: 128 bytes, allocs estimate: 2. │ Memory estimate: 128 bytes, allocs estimate: 2.
│
roci using 4 ranks with tmpi
julia> include("parallel_benchmark.jl") │julia> include("parallel_benchmark.jl")
rank 0 │rank 1
running parallel │running parallel
BenchmarkTools.Trial: 100 samples with 1 evaluation. │BenchmarkTools.Trial: 100 samples with 1 evaluation.
Range (min … max): 3.033 ms … 10.288 ms ┊ GC (min … max): 0.00% … 0.00% │ Range (min … max): 3.038 ms … 10.280 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 4.310 ms ┊ GC (median): 0.00% │ Time (median): 4.308 ms ┊ GC (median): 0.00%
Time (mean ± σ): 4.400 ms ± 962.308 μs ┊ GC (mean ± σ): 0.00% ± 0.00% │ Time (mean ± σ): 4.417 ms ± 976.753 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
│
▁ █▆█▃▆▃▆▄ ▁ ▃ │ █▃█ ▅▂▃▃
▆▇▆▄▄█▇▆▆▆▇████████▆█▁█▆▁▆▄▁▄▁▁▁▁▁▁▁▁▁▄▄▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄ ▄ │ ▅▅▇▄▄█▇▅▅▅▇████████▅█▄▇▇▁▅▄▁▄▁▁▁▁▁▁▁▄▁▅▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄ ▄
3.03 ms Histogram: frequency by time 8.15 ms < │ 3.04 ms Histogram: frequency by time 8.16 ms <
│
Memory estimate: 7.74 KiB, allocs estimate: 101. │ Memory estimate: 7.77 KiB, allocs estimate: 103.
running on_root │running on_root
BenchmarkTools.Trial: 100 samples with 1 evaluation. │BenchmarkTools.Trial: 100 samples with 1 evaluation.
Range (min … max): 2.525 ms … 10.106 ms ┊ GC (min … max): 0.00% … 0.00% │ Range (min … max): 2.706 ms … 10.110 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 4.325 ms ┊ GC (median): 0.00% │ Time (median): 4.344 ms ┊ GC (median): 0.00%
Time (mean ± σ): 4.460 ms ± 1.155 ms ┊ GC (mean ± σ): 0.98% ± 3.21% │ Time (mean ± σ): 4.481 ms ± 1.126 ms ┊ GC (mean ± σ): 0.00% ± 0.00%
│
█ █ ▅▁▁█ ▂ │ ▁ ▂▂ ▄█▅ ▇
▃▃▃▁▁▁█▅▅▁▁▆▆▃█▆▆██████▃▆█▁▃▅▃▁▃▃▁▁▁▁▁▅▁▁▁▃▃▁▁▁▁▁▁▃▁▁▁▁▁▁▅ ▃ │ ▃▁▅▁█▅▃█▆█▅██▁██████▆█▅▅▁▅▁▁▃▃▁▃▁▁▃▁▁▃▁▁▁▁▁▁▃▁▁▁▁▁▁▃▁▁▁▁▁▃ ▃
2.53 ms Histogram: frequency by time 8.17 ms < │ 2.71 ms Histogram: frequency by time 8.82 ms <
│
Memory estimate: 3.06 MiB, allocs estimate: 106. │ Memory estimate: 160 bytes, allocs estimate: 3.
│
julia> │julia>
─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
│
julia> include("parallel_benchmark.jl") │julia> include("parallel_benchmark.jl")
rank 2 │rank 3
running parallel │running parallel
BenchmarkTools.Trial: 100 samples with 1 evaluation. │BenchmarkTools.Trial: 100 samples with 1 evaluation.
Range (min … max): 3.031 ms … 10.281 ms ┊ GC (min … max): 0.00% … 0.00% │ Range (min … max): 3.034 ms … 10.281 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 4.302 ms ┊ GC (median): 0.00% │ Time (median): 4.319 ms ┊ GC (median): 0.00%
Time (mean ± σ): 4.433 ms ± 1.021 ms ┊ GC (mean ± σ): 0.00% ± 0.00% │ Time (mean ± σ): 4.436 ms ± 1.030 ms ┊ GC (mean ± σ): 0.00% ± 0.00%
│
▁ ██▆▃▆▃▆▃ ▄ │ █▅▅▃▂▃▃▃ ▂
▆▆▇▄▄█▇▆▇▆▇████████▆▇▄█▁▁▇▁▄▁▁▁▁▁▁▁▁▁▄▄▄▁▁▁▁▁▁▁▁▁▁▁▁▁▄▁▁▁▄ ▄ │ ▅▅▇▄▅▇▇▄▇▇▇████████▄█▄█▄▁▅▄▄▁▁▁▁▁▁▁▁▁▄▄▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▅ ▄
3.03 ms Histogram: frequency by time 8.18 ms < │ 3.03 ms Histogram: frequency by time 8.14 ms <
│
Memory estimate: 7.77 KiB, allocs estimate: 103. │ Memory estimate: 7.77 KiB, allocs estimate: 103.
running on_root │running on_root
BenchmarkTools.Trial: 100 samples with 1 evaluation. │BenchmarkTools.Trial: 100 samples with 1 evaluation.
Range (min … max): 2.699 ms … 10.108 ms ┊ GC (min … max): 0.00% … 0.00% │ Range (min … max): 2.704 ms … 10.106 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 4.355 ms ┊ GC (median): 0.00% │ Time (median): 4.366 ms ┊ GC (median): 0.00%
Time (mean ± σ): 4.494 ms ± 1.139 ms ┊ GC (mean ± σ): 0.00% ± 0.00% │ Time (mean ± σ): 4.502 ms ± 1.157 ms ┊ GC (mean ± σ): 0.00% ± 0.00%
│
▂ █▄ ▄ │ ▃ ▂ █▄ ▄
▃▃▁▁█▆█▁▆▇▅██▃▇█████▇▅▆▃▃▅▁▃▃▃▁▁▃▁▁▁▃▁▁▃▁▁▁▁▃▁▁▁▁▁▁▃▁▁▁▁▁▃ ▃ │ ▃▃▁▁██▃▁▆▇▆▇█▃▇█████▆▆▆▃▃▅▁▁▃▃▁▁▃▁▃▁▁▃▁▁▃▁▁▁▃▁▁▁▁▁▁▃▁▁▁▁▁▃ ▃
2.7 ms Histogram: frequency by time 8.81 ms < │ 2.7 ms Histogram: frequency by time 8.8 ms <
│
Memory estimate: 160 bytes, allocs estimate: 3. │ Memory estimate: 160 bytes, allocs estimate: 3.

With more MPI ranks the times scale linearly, which indicates that the parallelization doesn't work.

EDIT2: I also tried using …
Yes, exactly. Thanks! |
Codecov Report
@@ Coverage Diff @@
## main #1399 +/- ##
==========================================
- Coverage 95.70% 92.82% -2.88%
==========================================
Files 360 356 -4
Lines 29816 29639 -177
==========================================
- Hits 28533 27511 -1022
- Misses 1283 2128 +845
I also looked at flamegraphs on each process; see the screenshot below (with some annotations) showing the flamegraphs for 2 ranks (Profile.jl measured the time for 10 runs writing 1 000 000 …).
[annotated flamegraph screenshot not preserved in this copy]
Did you already test it on different machines, @JoshuaLampert? |
Yes, see also my edited comment above. |
Thanks, I didn't notice the update there |
How hard would it be to recreate your MWE (…)? Have you encountered the same performance issue when using an HDF5 library not provided by HDF5_jll, i.e., overriding the JLL product with your local version?
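For reference, overriding the JLL-provided library is done via Preferences, roughly like the following sketch (hedged: the preference keys "libhdf5"/"libhdf5_hl" follow HDF5.jl's documented override mechanism, the paths are placeholders, and a Julia restart is needed afterwards):
using Preferences, HDF5
# Point HDF5.jl at a system HDF5 build with MPI support instead of HDF5_jll.
set_preferences!(HDF5,
    "libhdf5" => "/usr/lib/x86_64-linux-gnu/hdf5/openmpi/libhdf5.so",
    "libhdf5_hl" => "/usr/lib/x86_64-linux-gnu/hdf5/openmpi/libhdf5_hl.so";
    force = true)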
I will try to reproduce the issue in C |
🤦 Right. |
I recreated my MWE in C now. Here, the parallel case with 4 ranks is around 10 % faster than the case where root writes all the data. Compared to the Julia version, it executes about 20 % faster (but still within the standard deviation). You can find the timings and the C and Julia programs below.

My laptop using 4 ranks
[per-rank timings for "running parallel..." and "running on_root..." not preserved in this copy]

C code
#include "hdf5.h"
#include "stdlib.h"
#include <mpi.h>
#include <time.h>
#include <sys/time.h>
#define DIMS 1
void parallel(char *filename, int M, double *A, MPI_Comm comm, MPI_Info info) {
int myrank, Nproc;
MPI_Comm_rank(comm, &myrank);
MPI_Comm_size(comm, &Nproc);
hid_t plistId = H5Pcreate(H5P_FILE_ACCESS);
H5Pset_fapl_mpio(plistId, comm, info);
hid_t fileId = H5Fcreate(filename, H5F_ACC_TRUNC, H5P_DEFAULT, plistId);
H5Pclose(plistId);
// Create the dataspace for the dataset.
hsize_t dimsm[1];
hsize_t dimsf[1];
dimsm[0] = M;
dimsf[0] = M * Nproc;
hid_t mspace = H5Screate_simple(DIMS, dimsm, NULL);
hid_t fspace = H5Screate_simple(DIMS, dimsf, NULL);
// Create dataset and select subset to write
hid_t dset = H5Dcreate(fileId, "A", H5T_NATIVE_DOUBLE, fspace, H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
hsize_t start[] = {M * myrank};
hsize_t count[] = {M};
H5Sselect_hyperslab(fspace, H5S_SELECT_SET, start, NULL, count, NULL);
// Create a property list for collective data transfer
//plistId = H5Pcreate(H5P_DATASET_XFER);
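// To enable collective MPI-IO, one would additionally call
//   H5Pset_dxpl_mpio(plistId, H5FD_MPIO_COLLECTIVE);
// and pass plistId instead of H5P_DEFAULT as the transfer property list to
// H5Dwrite below; as written, this benchmark uses independent I/O.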
// Write data
H5Dwrite(dset, H5T_NATIVE_DOUBLE, mspace, fspace, H5P_DEFAULT, A);
H5Dclose(dset);
H5Sclose(fspace);
H5Sclose(mspace);
H5Fclose(fileId);
//H5Pclose(plistId);
}
void on_root(char *filename, int M, double *A, MPI_Comm comm, MPI_Info info) {
int myrank, Nproc;
MPI_Comm_rank(comm, &myrank);
MPI_Comm_size(comm, &Nproc);
int root = 0;
// Send data to root
int *displs = (int *)malloc(Nproc * sizeof(int));
int *recvcounts = (int *)malloc(Nproc * sizeof(int));
for (int i = 0; i < Nproc; i++) {
displs[i] = M * i;
recvcounts[i] = M;
}
double *rbuf = NULL;
if (myrank == root)
rbuf = (double *)malloc(Nproc * M * sizeof(double));
MPI_Gatherv(A, M, MPI_DOUBLE, rbuf, recvcounts, displs, MPI_DOUBLE, root, comm);
// Write data on root
if (myrank == root) {
hid_t fileId = H5Fcreate(filename, H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
// Create the dataspace for the dataset.
hsize_t dimsm[1];
hsize_t dimsf[1];
dimsm[0] = M * Nproc;
dimsf[0] = M * Nproc;
hid_t mspace = H5Screate_simple(DIMS, dimsm, NULL);
hid_t fspace = H5Screate_simple(DIMS, dimsf, NULL);
// Create dataset
hid_t dset = H5Dcreate(fileId, "A", H5T_NATIVE_DOUBLE, fspace, H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
// Write data
H5Dwrite(dset, H5T_NATIVE_DOUBLE, mspace, fspace, H5P_DEFAULT, rbuf);
H5Dclose(dset);
H5Sclose(fspace);
H5Sclose(mspace);
H5Fclose(fileId);
}
// Release the buffers allocated for the gather
free(displs);
free(recvcounts);
if (myrank == root)
free(rbuf);
}
double min(double *arr, int size) {
double min = arr[0];
for (int i = 1; i < size; i++) {
if (arr[i] < min)
min = arr[i];
}
return min;
}
double max(double *arr, int size) {
double max = arr[0];
for (int i = 1; i < size; i++) {
if (arr[i] > max)
max = arr[i];
}
return max;
}
double mean(double *arr, int size) {
double sum = 0.0;
for (int i = 0; i < size; i++) {
sum += arr[i];
}
double mean = sum / size;
return mean;
}
double get_time_ms(void) {
struct timeval tv;
gettimeofday(&tv, NULL);
return (double)(tv.tv_usec / 1000) + (double)(tv.tv_sec * 1000);
}
int main (int argc, char **argv) {
MPI_Comm comm = MPI_COMM_WORLD;
MPI_Info info = MPI_INFO_NULL;
MPI_Init(&argc, &argv);
int myrank;
MPI_Comm_rank(comm, &myrank);
int M = 100000;
double *A = (double *)malloc(sizeof(double) * M);
time_t t;
srand((unsigned) time(&t));
for (int i = 0; i < M; i++)
A[i] = (double)rand() / (double)RAND_MAX;
int samples = 100;
double begin, end;
double *times_parallel = (double *)malloc(sizeof(double) * samples);
printf("running parallel...\n");
for (int i = 0; i < samples; i++) {
begin = get_time_ms();
parallel("data_parallel.h5", M, A, comm, info);
end = get_time_ms();
MPI_Barrier(comm);
times_parallel[i] = (double)(end - begin);
}
printf("Parallel on rank %d:\n\tmin: %f ms\n\tmax: %f ms\n\tmean: %f ms\n\n", myrank, min(times_parallel, samples), max(times_parallel, samples), mean(times_parallel, samples));
double *times_on_root = (double *)malloc(sizeof(double) * samples);
printf("running on_root...\n");
for (int i = 0; i < samples; i++) {
begin = get_time_ms();
on_root("data_on_root.h5", M, A, comm, info);
end = get_time_ms();
MPI_Barrier(comm);
times_on_root[i] = (double)(end - begin);
}
printf("On_root on rank %d:\n\tmin: %f ms\n\tmax: %f ms\n\tmean: %f ms\n\n", myrank, min(times_on_root, samples), max(times_on_root, samples), mean(times_on_root, samples));
MPI_Finalize();
return 0;
}

julia code
using MPI
using HDF5
using BenchmarkTools
MPI.Init()
comm = MPI.COMM_WORLD
Nproc = MPI.Comm_size(comm)
myrank = MPI.Comm_rank(comm)
println("rank $myrank")
M = 100000
A = rand(M) # local data
# true parallel writing slices
function parallel(filename, A, comm)
M = size(A, 1)
myrank = MPI.Comm_rank(comm)
Nproc = MPI.Comm_size(comm)
h5open(filename, "w", comm) do file
# Create dataset
dset = create_dataset(file, "/A", datatype(eltype(A)), dataspace((M * Nproc,)))
# Write local data
slice = myrank * M + 1:(myrank + 1) * M
dset[slice] = A
end
end
# poor man's version
function on_root(filename, A, comm)
M = size(A, 1)
myrank = MPI.Comm_rank(comm)
Nproc = MPI.Comm_size(comm)
root = 0
# Send data to root
if myrank != root
MPI.Gatherv!(A, nothing, root, comm)
else
h5open(filename, "w") do file
# Create dataset
dset = create_dataset(file, "/A", datatype(eltype(A)), dataspace((M * Nproc,)))
# Receive data
recv = Vector{eltype(A)}(undef, (M * Nproc,))
MPI.Gatherv!(A, MPI.VBuffer(recv, fill(M, Nproc)), root, comm)
# Write local data
dset[:] = recv
end
end
end
samples = 100
println("running parallel...")
t1 = @benchmark parallel("data_parallel.h5", A, comm) samples=samples
println(IOContext(stdout, :compact => false), t1)
println("running on_root...")
t2 = @benchmark on_root("data_on_root.h5", A, comm) samples=samples
println(IOContext(stdout, :compact => false), t2)
#=using Profile, ProfileView
function doit_parallel(n, A, comm)
for i in 1:n
parallel("data_parallel.h5", A, comm)
end
end
function doit_on_root(n, A, comm)
for i in 1:n
on_root("data_on_root.h5", A, comm)
end
end
Profile.clear(); @profile doit_parallel(10, A, comm); ProfileView.view(windowname="parallel $myrank")
Profile.clear(); @profile doit_on_root(10, A, comm); ProfileView.view(windowname="on_root $myrank");=#
Thanks! Can you try to run the examples on roci as well? And I wouldn't run them with tmpi; I'd run them as a plain script to avoid any tmpi influence. You just have to discard the first invocation (since it includes compile time). But from the current results, do I understand correctly that in the MWE the parallel HDF5 case is slightly faster than the poor man's version, as one would expect? How big is the difference between Julia and C?
I also ran the C MWE on roci. Here, the poor man's version is even faster than the parallel one (using 4 ranks, the parallel version needs ~3.6 ms and the poor man's version ~2.8 ms; using 16 ranks, it is 16.1 ms parallel vs. 11.59 ms for the poor man's version). Whether I use …
If you don't have any other suggestions, @sloede, I think we can proceed with this. There does not seem to be a huge difference for small-scale applications, while real parallel I/O is required for large-scale simulations. Thus, this seems to be a net win to me.
When I tested this, I tried the elixir under …, which fails with:

ERROR: LoadError: type NamedTuple has no field mpi_cache
Stacktrace:
[1] getproperty
@ ./Base.jl:37 [inlined]
[2] nelementsglobal
@ ~/.julia/dev/Trixi.jl/src/solvers/dg.jl:444 [inlined]
[3] ndofsglobal
@ ~/.julia/dev/Trixi.jl/src/solvers/dg.jl:387 [inlined]
[...]

I didn't investigate further, since it also occurs on main and seems to be unrelated to this PR. Is this a known issue or should I create one? I also noticed that there are no MPI tests for the 3D …
@sloede IIRC, MPI on the |
Co-authored-by: Hendrik Ranocha <ranocha@users.noreply.github.com>
Do we have some tests for this? |
As discussed here I only tested the true parallel I/O locally, but not in CI. For this, I ran |
Great, thanks! Did you open the issue as discussed there? |
CI looks good. There is just the upstream error …
Could you please update the filter for these warnings (Lines 152 to 153 in 153ef65) accordingly?
Done in #1486. |
Thanks a lot! Good to go from my side. It would be great if @sloede could have a look, too
Correct. The parallel 2D TreeMesh is already implemented so inefficiently (by myself 😬) that I didn't think it would be worthwhile to make it work for 3D. @JoshuaLampert Could you maybe create an issue that we should add a more meaningful error message in case someone tries to run TreeMesh3D in parallel? |
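Such a check could look roughly like the following sketch (purely illustrative; the function name and the way the mesh dimension and MPI state are queried are hypothetical, not actual Trixi.jl code):
# Hypothetical early check: raise a clear error instead of a cryptic
# field-access failure when a 3D TreeMesh is combined with MPI parallelism.
function check_tree_mesh_mpi_support(mesh_dimension::Integer, mpi_parallel::Bool)
    if mpi_parallel && mesh_dimension == 3
        error("TreeMesh does not support 3D MPI-parallel runs yet. " *
              "Please use a 2D TreeMesh or a P4estMesh instead.")
    end
    return nothing
end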
LGTM overall, just one suggestion.
Co-authored-by: Michael Schlottke-Lakemper <michael@sloede.com>
This is a first draft for parallel input and output with MPI using parallel HDF5.jl, see #1332. Currently, this only implements parallel writing of the solution files if parallel HDF5 is available. Parallel writing and reading restart files will follow.
Notice that in order to use parallel I/O you need to set MPIPreferences and therefore also the preference for P4est.jl, even if you don't want to use a system-provided p4est or even p4est at all. This is not that big of a problem, I guess, but it is not the most user-friendly option. However, I don't know yet how to circumvent this issue. I am open to a discussion.
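For reference, a rough sketch of this setup step (hedged: MPIPreferences.use_system_binary() is MPI.jl's documented switch to a system MPI; the exact P4est.jl preference key and library path must be checked against its documentation):
# One-time configuration in the project environment; restart Julia afterwards.
using MPIPreferences
MPIPreferences.use_system_binary()  # let MPI.jl use the system MPI library
# The corresponding preference for P4est.jl (path to a libp4est built against the
# same MPI) has to be set as well -- see the P4est.jl/Trixi.jl docs for the key name.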
Initial benchmarks indicate that something is still not working properly (although HDF5 successfully recognizes the local hdf5 library with MPI support) since the time for writing the solution files increases significantly (see below for running
mpiexecjl -n 4 --project julia --project examples/tree_2d_dgsem/elixir_advection_basic.jl
without and with parallel HDF5 support). I am trying to figure out what's the matter. Any ideas are welcome.

Running without parallel HDF5 enabled (as it was before)
Running with parallel HDF5 enabled