Parallel I/O #1399

Merged: 26 commits, May 26, 2023

Conversation

JoshuaLampert
Member

This is a first draft of parallel input and output with MPI using a parallel build of HDF5.jl, see #1332. Currently, this only implements parallel writing of the solution files if parallel HDF5 is available. Parallel writing and reading of restart files will follow.
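The PR diff itself is not shown here, but the basic idea of a collective solution-file write with HDF5.jl and MPI can be sketched as follows. This is a hypothetical illustration, not Trixi.jl's actual implementation; the file name, dataset name, and per-rank sizes are made up, and it assumes libhdf5 was built with MPI support.

```julia
# Sketch (assumptions, not Trixi.jl code): every rank opens the same file
# with the MPI communicator and writes its own hyperslab of one global dataset.
using MPI, HDF5

MPI.Init()
comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)
nranks = MPI.Comm_size(comm)

nlocal = 1000                          # elements owned by this rank (assumed uniform)
u_local = fill(Float64(rank), nlocal)  # dummy solution data

# Passing the communicator requires an MPI-enabled libhdf5.
h5open("solution.h5", "w", comm) do file
    dset = create_dataset(file, "u", datatype(Float64),
                          dataspace((nlocal * nranks,)))
    offset = rank * nlocal
    dset[(offset + 1):(offset + nlocal)] = u_local  # each rank writes its slice
end

MPI.Finalize()
```

Run with e.g. `mpiexecjl -n 4 julia --project thisfile.jl`; afterwards `solution.h5` contains one dataset `u` assembled from all ranks.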

Note that in order to use parallel I/O you need to set up MPIPreferences, and therefore also the preference for P4est.jl, even if you don't want to use a system-provided p4est, or p4est at all. This is not a big problem, I guess, but it is not the most user-friendly option. However, I don't know yet how to circumvent this issue and am open to discussion.
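For reference, the MPIPreferences setup mentioned above roughly looks like this (a sketch of the one-time configuration; the exact preferences needed for P4est.jl depend on your system and are not shown):

```shell
# Point MPI.jl at a system MPI binary so that an MPI-enabled libhdf5 can be
# used; this writes LocalPreferences.toml into the active project.
julia --project -e 'using MPIPreferences; MPIPreferences.use_system_binary()'
```

After this, HDF5.jl needs to pick up an MPI-enabled HDF5 library for `h5open(..., comm)` to work.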

Initial benchmarks indicate that something is still not working properly (although HDF5 successfully recognizes the local HDF5 library with MPI support), since the time for writing the solution files increases significantly (see below for running mpiexecjl -n 4 --project julia --project examples/tree_2d_dgsem/elixir_advection_basic.jl without and with parallel HDF5 support). I am trying to figure out what the problem is. Any ideas are welcome.

Running without parallel HDF5 enabled (as it was before)
────────────────────────────────────────────────────────────────────────────────
          Trixi.jl                    Time                    Allocations      
                             ───────────────────────   ────────────────────────
     Tot / % measured:            5.75s /  82.0%           57.6MiB /  78.9%    

Section               ncalls     time    %tot     avg     alloc    %tot      avg
────────────────────────────────────────────────────────────────────────────────
I/O                        3    2.53s   53.7%   843ms   10.7MiB   23.6%  3.57MiB
 save solution            2    1.61s   34.1%   805ms   9.55MiB   21.0%  4.77MiB
 ~I/O~                    3    919ms   19.5%   306ms   1.01MiB    2.2%   343KiB
 get element vari...      2   1.36ms    0.0%   679μs    158KiB    0.3%  78.9KiB
 save mesh                2    642ns    0.0%   321ns     0.00B    0.0%    0.00B
analyze solution           2    2.08s   44.0%   1.04s   34.5MiB   76.1%  17.3MiB
calculate dt              19   93.3ms    2.0%  4.91ms   97.0KiB    0.2%  5.11KiB
rhs!                      91   14.9ms    0.3%   164μs   70.7KiB    0.2%     795B
 finish MPI receive      91   7.41ms    0.2%  81.4μs   17.1KiB    0.0%     192B
 volume integral         91   3.20ms    0.1%  35.1μs     0.00B    0.0%    0.00B
 interface flux          91   1.17ms    0.0%  12.9μs     0.00B    0.0%    0.00B
 ~rhs!~                  91    800μs    0.0%  8.79μs   15.2KiB    0.0%     171B
 start MPI send          91    645μs    0.0%  7.08μs   11.4KiB    0.0%     128B
 surface integral        91    433μs    0.0%  4.76μs     0.00B    0.0%    0.00B
 prolong2interfaces      91    365μs    0.0%  4.01μs     0.00B    0.0%    0.00B
 MPI interface flux      91    308μs    0.0%  3.38μs     0.00B    0.0%    0.00B
 finish MPI send         91    133μs    0.0%  1.47μs   15.6KiB    0.0%     176B
 start MPI receive       91    105μs    0.0%  1.16μs   11.4KiB    0.0%     128B
 prolong2mpiinter...     91   85.3μs    0.0%   937ns     0.00B    0.0%    0.00B
 reset ∂u/∂t             91   70.5μs    0.0%   775ns     0.00B    0.0%    0.00B
 Jacobian                91   60.5μs    0.0%   665ns     0.00B    0.0%    0.00B
 prolong2mpimortars      91   21.6μs    0.0%   238ns     0.00B    0.0%    0.00B
 mortar flux             91   20.4μs    0.0%   224ns     0.00B    0.0%    0.00B
 prolong2mortars         91   19.8μs    0.0%   218ns     0.00B    0.0%    0.00B
 MPI mortar flux         91   17.4μs    0.0%   191ns     0.00B    0.0%    0.00B
 prolong2boundaries      91   15.6μs    0.0%   171ns     0.00B    0.0%    0.00B
 boundary flux           91   10.6μs    0.0%   117ns     0.00B    0.0%    0.00B
 source terms            91   9.10μs    0.0%   100ns     0.00B    0.0%    0.00B
────────────────────────────────────────────────────────────────────────────────
Running with parallel HDF5 enabled
────────────────────────────────────────────────────────────────────────────────
          Trixi.jl                    Time                    Allocations      
                             ───────────────────────   ────────────────────────
     Tot / % measured:            15.9s /  64.5%           58.1MiB /  91.7%    

Section               ncalls     time    %tot     avg     alloc    %tot      avg
────────────────────────────────────────────────────────────────────────────────
I/O                        3    7.39s   71.9%   2.46s   19.2MiB   36.0%  6.39MiB
 save solution            2    6.27s   61.0%   3.13s   18.0MiB   33.7%  8.98MiB
 ~I/O~                    3    1.12s   10.9%   374ms   1.05MiB    2.0%   358KiB
 get element vari...      2   2.16ms    0.0%  1.08ms    158KiB    0.3%  78.9KiB
 save mesh                2    538ns    0.0%   269ns     0.00B    0.0%    0.00B
analyze solution           2    2.29s   22.3%   1.15s   33.9MiB   63.7%  17.0MiB
calculate dt              19    546ms    5.3%  28.7ms   97.0KiB    0.2%  5.11KiB
rhs!                      91   48.2ms    0.5%   529μs   77.8KiB    0.1%     875B
 finish MPI receive      91   35.0ms    0.3%   385μs   21.3KiB    0.0%     240B
 volume integral         91   4.96ms    0.0%  54.5μs     0.00B    0.0%    0.00B
 interface flux          91   2.10ms    0.0%  23.1μs     0.00B    0.0%    0.00B
 ~rhs!~                  91   1.41ms    0.0%  15.5μs   15.2KiB    0.0%     171B
 start MPI send          91   1.09ms    0.0%  12.0μs   11.4KiB    0.0%     128B
 finish MPI send         91    800μs    0.0%  8.79μs   18.5KiB    0.0%     208B
 surface integral        91    716μs    0.0%  7.87μs     0.00B    0.0%    0.00B
 prolong2interfaces      91    633μs    0.0%  6.96μs     0.00B    0.0%    0.00B
 MPI interface flux      91    560μs    0.0%  6.16μs     0.00B    0.0%    0.00B
 start MPI receive       91    323μs    0.0%  3.55μs   11.4KiB    0.0%     128B
 prolong2mpiinter...     91    145μs    0.0%  1.60μs     0.00B    0.0%    0.00B
 Jacobian                91    126μs    0.0%  1.39μs     0.00B    0.0%    0.00B
 reset ∂u/∂t             91    109μs    0.0%  1.20μs     0.00B    0.0%    0.00B
 mortar flux             91   36.6μs    0.0%   403ns     0.00B    0.0%    0.00B
 prolong2mortars         91   36.0μs    0.0%   396ns     0.00B    0.0%    0.00B
 prolong2boundaries      91   34.1μs    0.0%   375ns     0.00B    0.0%    0.00B
 MPI mortar flux         91   29.7μs    0.0%   327ns     0.00B    0.0%    0.00B
 prolong2mpimortars      91   29.2μs    0.0%   321ns     0.00B    0.0%    0.00B
 source terms            91   14.0μs    0.0%   154ns     0.00B    0.0%    0.00B
 boundary flux           91   13.8μs    0.0%   152ns     0.00B    0.0%    0.00B
────────────────────────────────────────────────────────────────────────────────

@JoshuaLampert JoshuaLampert added the parallelization Related to MPI, threading, tasks etc. label Apr 17, 2023
@ranocha
Member

ranocha commented Apr 18, 2023

Thanks a lot! Are the results from a fresh run or a later run without compilation?

Review threads on src/Trixi.jl and src/callbacks_step/save_solution_dg.jl (outdated, resolved)
@sloede
Member

sloede commented Apr 18, 2023

As another test, you could try to isolate the I/O part and benchmark only this code. You could even extract it from Trixi and just use randomly filled arrays as dummy data. This would probably help to isolate the underlying issue. Then I would move to a different machine and check whether you see the same behavior, just to eliminate the possibility that it is a system-specific issue.

If the overly long write times persist, you can also try to recreate it in C/C++ to see whether it is a Julia issue or just an artifact of the HDF5 library that is being used.

@JoshuaLampert
Member Author

JoshuaLampert commented Apr 18, 2023

@ranocha, to exclude compilation time I need to use tmpi, right? With 4 ranks using tmpi I get the following results for the second run of tree_2d_dgsem/elixir_advection_basic.jl:

Running without parallel HDF5 enabled
────────────────────────────────────────────────────────────────────────────────────
            Trixi.jl                      Time                    Allocations
                                 ───────────────────────   ────────────────────────                                
       Tot / % measured:             32.6ms /  96.7%            337KiB /  95.2%
Section                   ncalls     time    %tot     avg     alloc    %tot      avg
────────────────────────────────────────────────────────────────────────────────────
rhs!                          91   13.0ms   41.4%   143μs   77.8KiB   24.2%    875B
 volume integral             91   3.74ms   11.9%  41.1μs     0.00B    0.0%    0.00B
 finish MPI send             91   2.34ms    7.4%  25.7μs   18.5KiB    5.8%     208B
 interface flux              91   1.61ms    5.1%  17.7μs     0.00B    0.0%    0.00B
 finish MPI receive          91   1.51ms    4.8%  16.6μs   21.3KiB    6.6%     240B
 ~rhs!~                      91   1.06ms    3.4%  11.7μs   15.2KiB    4.7%     171B
 start MPI send              91    799μs    2.5%  8.78μs   11.4KiB    3.5%     128B
 surface integral            91    524μs    1.7%  5.76μs     0.00B    0.0%    0.00B
 prolong2interfaces          91    459μs    1.5%  5.05μs     0.00B    0.0%    0.00B
 MPI interface flux          91    396μs    1.3%  4.35μs     0.00B    0.0%    0.00B
 start MPI receive           91    197μs    0.6%  2.16μs   11.4KiB    3.5%     128B
 prolong2mpiinterfaces       91    110μs    0.3%  1.20μs     0.00B    0.0%    0.00B
 reset ∂u/∂t                 91   78.8μs    0.3%   866ns     0.00B    0.0%    0.00B
 Jacobian                    91   76.6μs    0.2%   842ns     0.00B    0.0%    0.00B
 MPI mortar flux             91   40.8μs    0.1%   449ns     0.00B    0.0%    0.00B
 prolong2mpimortars          91   23.1μs    0.1%   254ns     0.00B    0.0%    0.00B
 prolong2boundaries          91   22.0μs    0.1%   242ns     0.00B    0.0%    0.00B
 mortar flux                 91   21.0μs    0.1%   231ns     0.00B    0.0%    0.00B
 prolong2mortars             91   20.3μs    0.1%   224ns     0.00B    0.0%    0.00B
 source terms                91   11.1μs    0.0%   122ns     0.00B    0.0%    0.00B
 boundary flux               91   11.1μs    0.0%   122ns     0.00B    0.0%    0.00B
I/O                            3   12.2ms   38.7%  4.06ms    170KiB   52.9%  56.6KiB
 save solution                2   6.81ms   21.6%  3.40ms    132KiB   41.2%  66.1KiB
 ~I/O~                        3   5.35ms   17.0%  1.78ms   33.9KiB   10.6%  11.3KiB
 get element variables        2   27.5μs    0.1%  13.7μs   3.66KiB    1.1%  1.83KiB
 save mesh                    2    501ns    0.0%   250ns     0.00B    0.0%    0.00B
analyze solution               2   5.15ms   16.3%  2.58ms   69.8KiB   21.7%  34.9KiB
calculate dt                  19   1.13ms    3.6%  59.4μs   3.86KiB    1.2%     208B
────────────────────────────────────────────────────────────────────────────────────
Running with parallel HDF5 enabled
──────────────────────────────────────────────────────────────────────────────────── 
            Trixi.jl                      Time                    Allocations
                                 ───────────────────────   ────────────────────────                                
       Tot / % measured:             35.2ms /  96.2%            276KiB /  94.2%

Section                   ncalls     time    %tot     avg     alloc    %tot      avg
────────────────────────────────────────────────────────────────────────────────────
I/O                            3   15.4ms   45.5%  5.12ms    109KiB   41.8%  36.3KiB
 save solution                2   11.1ms   32.8%  5.55ms   70.9KiB   27.3%  35.5KiB
 ~I/O~                        3   4.25ms   12.6%  1.42ms   34.2KiB   13.1%  11.4KiB
 get element variables        2   25.8μs    0.1%  12.9μs   3.66KiB    1.4%  1.83KiB
 save mesh                    2    536ns    0.0%   268ns     0.00B    0.0%    0.00B
rhs!                          91   14.0ms   41.5%   154μs   77.8KiB   29.9%     875B
 volume integral             91   5.24ms   15.5%  57.6μs     0.00B    0.0%    0.00B
 interface flux              91   1.99ms    5.9%  21.9μs     0.00B    0.0%    0.00B
 finish MPI receive          91   1.37ms    4.0%  15.0μs   21.3KiB    8.2%     240B
 ~rhs!~                      91   1.13ms    3.3%  12.4μs   15.2KiB    5.8%     171B
 start MPI send              91   1.01ms    3.0%  11.1μs   11.4KiB    4.4%     128B
 surface integral            91    863μs    2.6%  9.48μs     0.00B    0.0%    0.00B
 MPI interface flux          91    649μs    1.9%  7.13μs     0.00B    0.0%    0.00B
 prolong2interfaces          91    584μs    1.7%  6.42μs     0.00B    0.0%    0.00B
 finish MPI send             91    492μs    1.5%  5.40μs   18.5KiB    7.1%     208B
 start MPI receive           91    221μs    0.7%  2.43μs   11.4KiB    4.4%     128B
 prolong2mpiinterfaces       91    139μs    0.4%  1.53μs     0.00B    0.0%    0.00B
 reset ∂u/∂t                 91   99.4μs    0.3%  1.09μs     0.00B    0.0%    0.00B
 Jacobian                    91   96.1μs    0.3%  1.06μs     0.00B    0.0%    0.00B
 prolong2boundaries          91   27.4μs    0.1%   301ns     0.00B    0.0%    0.00B
Running serial for comparison
────────────────────────────────────────────────────────────────────────────────────
            Trixi.jl                      Time                    Allocations
                                 ───────────────────────   ────────────────────────
       Tot / % measured:             56.0ms /  95.6%            320KiB /  87.6%

Section                   ncalls     time    %tot     avg     alloc    %tot      avg
────────────────────────────────────────────────────────────────────────────────────
rhs!                          91   35.8ms   66.8%   393μs   9.33KiB    3.3%     105B
 volume integral             91   20.2ms   37.7%   222μs     0.00B    0.0%    0.00B
 interface flux              91   8.66ms   16.2%  95.2μs     0.00B    0.0%    0.00B
 prolong2interfaces          91   2.64ms    4.9%  29.0μs     0.00B    0.0%    0.00B
 surface integral            91   2.61ms    4.9%  28.7μs     0.00B    0.0%    0.00B
 ~rhs!~                      91    873μs    1.6%  9.59μs   9.33KiB    3.3%     105B
 Jacobian                    91    352μs    0.7%  3.87μs     0.00B    0.0%    0.00B
 reset ∂u/∂t                 91    339μs    0.6%  3.72μs     0.00B    0.0%    0.00B
 prolong2mortars             91   29.2μs    0.1%   321ns     0.00B    0.0%    0.00B
 prolong2boundaries          91   28.9μs    0.1%   318ns     0.00B    0.0%    0.00B
 mortar flux                 91   28.4μs    0.1%   312ns     0.00B    0.0%    0.00B
 boundary flux               91   13.6μs    0.0%   150ns     0.00B    0.0%    0.00B
 source terms                91   12.0μs    0.0%   132ns     0.00B    0.0%    0.00B
I/O                            3   11.0ms   20.6%  3.67ms    249KiB   88.6%  82.9KiB
 save solution                2   5.92ms   11.0%  2.96ms    211KiB   75.3%   106KiB
 ~I/O~                        3   5.07ms    9.5%  1.69ms   33.8KiB   12.1%  11.3KiB
 get element variables        2   30.9μs    0.1%  15.4μs   3.53KiB    1.3%  1.77KiB
 save mesh                    2    473ns    0.0%   236ns     0.00B    0.0%    0.00B
analyze solution               2   6.74ms   12.6%  3.37ms   22.6KiB    8.0%  11.3KiB
calculate dt                  19   28.6μs    0.1%  1.50μs     0.00B    0.0%    0.00B
────────────────────────────────────────────────────────────────────────────────────

So parallel HDF5 still takes almost twice as long as the serial run and the poor man's parallel version.

I already started to create a simple Trixi.jl-independent example, as @sloede suggested. I will elaborate on this.
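The MWE itself (parallel_benchmark.jl) is not posted in the thread; a minimal sketch of what such a benchmark could look like, mirroring the "running parallel" and "running on_root" labels in the output below, might be the following. All names and sizes here are assumptions: "parallel" is a collective write where every rank writes its slice, "on_root" is the poor man's version that gathers everything to rank 0, which writes serially.

```julia
# Hypothetical reconstruction of a parallel_benchmark.jl-style MWE (assumed
# shape, not the author's actual script): compare collective parallel HDF5
# writing against gathering to the root rank and writing serially.
using MPI, HDF5, BenchmarkTools

MPI.Init()
comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)
nranks = MPI.Comm_size(comm)
println("rank ", rank)

n = 100_000          # per-rank payload (assumed)
data = rand(n)

# Collective write: each rank writes its hyperslab of one shared dataset.
function write_parallel(data, comm, rank, nranks, n)
    h5open("parallel.h5", "w", comm) do file
        dset = create_dataset(file, "data", datatype(Float64),
                              dataspace((n * nranks,)))
        dset[(rank * n + 1):((rank + 1) * n)] = data
    end
end

# "Poor man's" version: gather to rank 0, which writes the file serially.
function write_on_root(data, comm, rank)
    gathered = MPI.Gather(data, comm; root = 0)
    if rank == 0
        h5open("on_root.h5", "w") do file
            file["data"] = gathered
        end
    end
end

println("running parallel")
display(@benchmark write_parallel($data, $comm, $rank, $nranks, $n) samples = 100)
println("running on_root")
display(@benchmark write_on_root($data, $comm, $rank) samples = 100)

MPI.Finalize()
```

If the collective path scales, "running parallel" should not get slower as ranks are added, which is exactly what the measurements below test.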

EDIT: I was able to (at least partly) reproduce the issue with a Trixi-independent MWE on different machines. On both machines the performance doesn't improve with parallel HDF5, see the tmpi terminals of the benchmarks below.

My laptop using 4 ranks with tmpi
julia> include("parallel_benchmark.jl")                                                                              │julia> include("parallel_benchmark.jl")
rank 0                                                                                                               │rank 1
running parallel                                                                                                     │running parallel
BenchmarkTools.Trial: 100 samples with 1 evaluation.                                                                 │BenchmarkTools.Trial: 100 samples with 1 evaluation.
Range (min  max):   7.225 ms  676.040 ms  ┊ GC (min  max): 0.00%  0.00%                                         │ Range (min  max):   4.151 ms  676.024 ms  ┊ GC (min  max): 0.00%  0.00%
Time  (median):     17.866 ms               ┊ GC (median):    0.00%                                                 │ Time  (median):     17.929 ms               ┊ GC (median):    0.00%
Time  (mean ± σ):   48.891 ms ±  78.380 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%                                         │ Time  (mean ± σ):   48.851 ms ±  78.385 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%
                                                                                                                   │
  ▄█                               ▂▁                                                                              │     ██                            ▁  ▄                         
▄▇██▆▁▁▁▁▁▁▁▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄▁▁▁▁▆█▁██▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄ ▄                                                     │  ▅▅▆██▅▁▁▁▁▁▁▅▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▅▁▁▁▅█▁▇█▅▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▅ ▅
7.23 ms       Histogram: log(frequency) by time       203 ms <4.15 ms       Histogram: log(frequency) by time       203 ms <
                                                                                                                   │
Memory estimate: 8.15 KiB, allocs estimate: 106.                                                                    │ Memory estimate: 8.18 KiB, allocs estimate: 108.
running on_root                                                                                                      │running on_root
BenchmarkTools.Trial: 100 samples with 1 evaluation.                                                                 │BenchmarkTools.Trial: 100 samples with 1 evaluation.
Range (min  max):   4.578 ms  502.710 ms  ┊ GC (min  max): 0.00%  0.00%                                         │ Range (min  max):  10.249 ms  502.645 ms  ┊ GC (min  max): 0.00%  0.00%
Time  (median):     11.098 ms               ┊ GC (median):    0.00%                                                 │ Time  (median):     11.115 ms               ┊ GC (median):    0.00%
Time  (mean ± σ):   46.678 ms ±  72.112 ms  ┊ GC (mean ± σ):  0.06% ± 0.94%                                         │ Time  (mean ± σ):   46.831 ms ±  72.069 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%
                                                                                                                   │
 █▆                              ▄                                                                                 │  █                               ▃                             
▄██▁▁▁▁▁▁▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄█▁▁▁▁▁▁▁▁▁▁▁▁▄▁▁▁▆▁▄▁▁▄▄▄▁▄ ▄                                                     │  █▁▄▁▁▁▁▁▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▄▁▁▁▁▁▁▁▁▁▁▁▄▁▁▁▆▁▁▄▁▄▁▄▄▁▄ ▄
4.58 ms       Histogram: log(frequency) by time       199 ms <10.2 ms       Histogram: log(frequency) by time       199 ms <
                                                                                                                   │
Memory estimate: 3.06 MiB, allocs estimate: 110.                                                                    │ Memory estimate: 128 bytes, allocs estimate: 2.
                                                                                                                   │
julia>                                                                                                               │julia> 
─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────                                                                                                    │
julia> include("parallel_benchmark.jl")                                                                              │julia> include("parallel_benchmark.jl")
rank 2                                                                                                               │rank 3
running parallel                                                                                                     │running parallel
BenchmarkTools.Trial: 100 samples with 1 evaluation.                                                                 │BenchmarkTools.Trial: 100 samples with 1 evaluation.
Range (min  max):   5.275 ms  676.031 ms  ┊ GC (min  max): 0.00%  0.00%                                         │ Range (min  max):   3.874 ms  675.990 ms  ┊ GC (min  max): 0.00%  0.00%
Time  (median):     17.804 ms               ┊ GC (median):    0.00%                                                 │ Time  (median):     17.815 ms               ┊ GC (median):    0.00%
Time  (mean ± σ):   48.869 ms ±  78.394 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%                                         │ Time  (mean ± σ):   48.856 ms ±  78.400 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%
                                                                                                                   │
   █▂                               ▂                                                                              │     ▆█                            ▁  ▃                         
▄▇▁██▁▁▁▁▁▁▁▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄▁▁▁▄█▁██▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄ ▄                                                     │  ▅▁▇██▅▁▁▁▁▁▁▅▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▅▁▁▁▁█▁▆█▅▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▅ ▅
5.28 ms       Histogram: log(frequency) by time       203 ms <3.87 ms       Histogram: log(frequency) by time       203 ms <
                                                                                                                   │
Memory estimate: 8.18 KiB, allocs estimate: 108.                                                                    │ Memory estimate: 8.18 KiB, allocs estimate: 108.
running on_root                                                                                                      │running on_root
BenchmarkTools.Trial: 100 samples with 1 evaluation.                                                                 │BenchmarkTools.Trial: 100 samples with 1 evaluation.
Range (min  max):  10.250 ms  502.647 ms  ┊ GC (min  max): 0.00%  0.00%                                         │ Range (min  max):  117.345 μs  502.647 ms  ┊ GC (min  max): 0.00%  0.00%
Time  (median):     11.099 ms               ┊ GC (median):    0.00%                                                 │ Time  (median):      11.055 ms               ┊ GC (median):    0.00%
Time  (mean ± σ):   46.783 ms ±  72.096 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%                                         │ Time  (mean ± σ):    46.643 ms ±  72.171 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%
                                                                                                                   │
█                               ▃                                                                                  │     █                              ▃                            
█▄▁▁▁▁▁▁▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▄▁▁▁▁▁▁▁▁▁▁▁▄▁▁▁▄▄▁▄▁▄▁▄▄▁▄ ▄                                                     │  ▄▁▁█▁▁▁▁▁▁▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▄▁▁▁▁▁▁▁▁▁▁▁▄▁▁▁▆▁▄▁▁▄▄▄▁▄ ▄
10.3 ms       Histogram: log(frequency) by time       199 ms <117 μs        Histogram: log(frequency) by time        199 ms <
                                                                                                                   │
Memory estimate: 128 bytes, allocs estimate: 2.                                                                     │ Memory estimate: 128 bytes, allocs estimate: 2.
roci using 4 ranks with tmpi
julia> include("parallel_benchmark.jl")                                                                              │julia> include("parallel_benchmark.jl")
rank 0                                                                                                               │rank 1
running parallel                                                                                                     │running parallel
BenchmarkTools.Trial: 100 samples with 1 evaluation.                                                                 │BenchmarkTools.Trial: 100 samples with 1 evaluation.
Range (min  max):  3.033 ms   10.288 ms  ┊ GC (min  max): 0.00%  0.00%                                          │ Range (min  max):  3.038 ms   10.280 ms  ┊ GC (min  max): 0.00%  0.00%
Time  (median):     4.310 ms               ┊ GC (median):    0.00%                                                  │ Time  (median):     4.308 ms               ┊ GC (median):    0.00%
Time  (mean ± σ):   4.400 ms ± 962.308 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%                                          │ Time  (mean ± σ):   4.417 ms ± 976.753 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%
                                                                                                                   │
     ▁     █▆█▃▆▃▆▄ ▁ ▃                                                                                            │             █▃█ ▅▂▃▃                                          
▆▇▆▄▄█▇▆▆▆▇████████▆█▁█▆▁▆▄▁▄▁▁▁▁▁▁▁▁▁▄▄▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄ ▄                                                      │  ▅▅▇▄▄█▇▅▅▅▇████████▅█▄▇▇▁▅▄▁▄▁▁▁▁▁▁▁▄▁▅▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄ ▄
3.03 ms         Histogram: frequency by time        8.15 ms <3.04 ms         Histogram: frequency by time        8.16 ms <
                                                                                                                   │
Memory estimate: 7.74 KiB, allocs estimate: 101.                                                                    │ Memory estimate: 7.77 KiB, allocs estimate: 103.
running on_root                                                                                                      │running on_root
BenchmarkTools.Trial: 100 samples with 1 evaluation.                                                                 │BenchmarkTools.Trial: 100 samples with 1 evaluation.
Range (min  max):  2.525 ms  10.106 ms  ┊ GC (min  max): 0.00%  0.00%                                           │ Range (min  max):  2.706 ms  10.110 ms  ┊ GC (min  max): 0.00%  0.00%
Time  (median):     4.325 ms              ┊ GC (median):    0.00%                                                   │ Time  (median):     4.344 ms              ┊ GC (median):    0.00%
Time  (mean ± σ):   4.460 ms ±  1.155 ms  ┊ GC (mean ± σ):  0.98% ± 3.21%                                           │ Time  (mean ± σ):   4.481 ms ±  1.126 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%
                                                                                                                   │
      █       █   ▅▁▁█   ▂                                                                                         │      ▁      ▂▂ ▄█▅ ▇                                         
▃▃▃▁▁▁█▅▅▁▁▆▆▃█▆▆██████▃▆█▁▃▅▃▁▃▃▁▁▁▁▁▅▁▁▁▃▃▁▁▁▁▁▁▃▁▁▁▁▁▁▅ ▃                                                       │  ▃▁▅▁█▅▃█▆█▅██▁██████▆█▅▅▁▅▁▁▃▃▁▃▁▁▃▁▁▃▁▁▁▁▁▁▃▁▁▁▁▁▁▃▁▁▁▁▁▃ ▃
2.53 ms        Histogram: frequency by time        8.17 ms <2.71 ms        Histogram: frequency by time        8.82 ms <
                                                                                                                   │
Memory estimate: 3.06 MiB, allocs estimate: 106.                                                                    │ Memory estimate: 160 bytes, allocs estimate: 3.
                                                                                                                   │
julia>                                                                                                               │julia> 
─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
                                                                                                                   │
julia> include("parallel_benchmark.jl")                                                                              │julia> include("parallel_benchmark.jl")
rank 2                                                                                                               │rank 3
running parallel                                                                                                     │running parallel
BenchmarkTools.Trial: 100 samples with 1 evaluation.                                                                 │BenchmarkTools.Trial: 100 samples with 1 evaluation.
Range (min … max):  3.031 ms … 10.281 ms  ┊ GC (min … max): 0.00% … 0.00%
Time  (median):     4.302 ms              ┊ GC (median):    0.00%
Time  (mean ± σ):   4.433 ms ±  1.021 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

     ▁     ██▆▃▆▃▆▃   ▄
▆▆▇▄▄█▇▆▇▆▇████████▆▇▄█▁▁▇▁▄▁▁▁▁▁▁▁▁▁▄▄▄▁▁▁▁▁▁▁▁▁▁▁▁▁▄▁▁▁▄ ▄
3.03 ms        Histogram: frequency by time        8.18 ms <

Memory estimate: 7.77 KiB, allocs estimate: 103.

running on_root
BenchmarkTools.Trial: 100 samples with 1 evaluation.
Range (min … max):  2.699 ms … 10.108 ms  ┊ GC (min … max): 0.00% … 0.00%
Time  (median):     4.355 ms              ┊ GC (median):    0.00%
Time  (mean ± σ):   4.494 ms ±  1.139 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

            ▂  █▄ ▄
▃▃▁▁█▆█▁▆▇▅██▃▇█████▇▅▆▃▃▅▁▃▃▃▁▁▃▁▁▁▃▁▁▃▁▁▁▁▃▁▁▁▁▁▁▃▁▁▁▁▁▃ ▃
2.7 ms         Histogram: frequency by time        8.81 ms <

Memory estimate: 160 bytes, allocs estimate: 3.

With more MPI ranks the times scale linearly, which indicates that the parallelization doesn't work.

EDIT2: I also tried using chunk and dxpl_mpio=:collective as kwargs to create_dataset as suggested in the docs of HDF5.jl, but both options either decreased performance or had no influence, even though the data is distributed completely uniformly among the ranks.
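For reference, the C-level counterpart of HDF5.jl's dxpl_mpio=:collective keyword is a dataset transfer property list passed to H5Dwrite. This is a fragment only, not a complete program — it assumes an MPI-enabled HDF5 build and already created dset, mspace, and fspace handles:

```c
/* Fragment: requires an MPI-enabled HDF5 and existing dset/mspace/fspace
   handles; error checking omitted. */
hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE); /* request collective transfer */
H5Dwrite(dset, H5T_NATIVE_DOUBLE, mspace, fspace, dxpl, A);
H5Pclose(dxpl);
```

Independent transfer (H5FD_MPIO_INDEPENDENT) is the default, which is what passing H5P_DEFAULT as the transfer property list selects.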

@ranocha
Copy link
Member

ranocha commented Apr 18, 2023

Yes, exactly. Thanks!

@codecov
Copy link

codecov bot commented Apr 18, 2023

Codecov Report

Merging #1399 (58cebff) into main (153ef65) will decrease coverage by 2.88%.
The diff coverage is 8.82%.

❗ Current head 58cebff differs from pull request most recent head 3e844c7. Consider uploading reports for the commit 3e844c7 to get more accurate results

@@            Coverage Diff             @@
##             main    #1399      +/-   ##
==========================================
- Coverage   95.70%   92.82%   -2.88%     
==========================================
  Files         360      356       -4     
  Lines       29816    29639     -177     
==========================================
- Hits        28533    27511    -1022     
- Misses       1283     2128     +845     
Flag Coverage Δ
unittests 92.82% <8.82%> (-2.88%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
src/Trixi.jl 71.43% <ø> (+3.25%) ⬆️
src/callbacks_step/save_solution_dg.jl 69.91% <8.82%> (-26.29%) ⬇️

... and 24 files with indirect coverage changes

@JoshuaLampert
Copy link
Member Author

I also looked at flamegraphs on each process, see below a screenshot (with some annotations) with the flamegraphs for 2 ranks (Profile.jl measured the time for 10 runs, each writing 1 000 000 Float64s). A few things I noticed:

  • All relevant parts are actually inside C functions
  • The amount of time spent in h5d_write is comparatively small. This doesn't change when I increase the data size
  • On one rank, a significant amount of time is spent in h5p_set_fapl_mpio, which is a bit surprising to me. When I used 4 ranks, all of them spent around half of their time in this function.

Any more ideas, comments, @ranocha, @sloede?
Screenshot_20230419_130106

@ranocha
Copy link
Member

ranocha commented Apr 19, 2023

Then, I would move to a different machine and check if you see the same behavior, just to eliminate the possibility that it is a system-specific issue.

Did you already test it on different machines, @JoshuaLampert?

@JoshuaLampert
Copy link
Member Author

Yes, see also my edited comment above.

@ranocha
Copy link
Member

ranocha commented Apr 19, 2023

Thanks, I didn't notice the update there.

@sloede
Copy link
Member

sloede commented Apr 19, 2023

How hard would it be to recreate your MWE (parallel_benchmark.jl) in C? Though from what you wrote on [h5p_set_fapl_mpio](https://docs.hdfgroup.org/hdf5/v1_12/group___f_a_p_l.html#gaa0204810c1fea1667d62cf7c176416ff) and what I gather from its docs, it shouldn't really take up any time at all (and it is a non-collective operation...), which leads me to believe that this might be HDF5 related.

Have you encountered the same performance issue when using an HDF5 library not provided by HDF5_jll but overriding the JLL product with your local version?

@JoshuaLampert
Copy link
Member Author

I will try to reproduce the issue in C.
In all of my tests I used a local HDF5 installation. The HDF5_jll version doesn't provide MPI (at least for now), right?

@sloede
Copy link
Member

sloede commented Apr 19, 2023

The HDF5_jll version doesn't provide MPI (at least for now), right?

🤦 Right.

@JoshuaLampert
Copy link
Member Author

JoshuaLampert commented Apr 22, 2023

I recreated my MWE in C now. Here, the parallel case with 4 ranks is around 10 % faster than the case where root writes all the data. Compared to the Julia version, the C version executes about 20 % faster (but still within the standard deviation). You can find the timings and the C and Julia programs below.

My laptop using 4 ranks

running parallel...
running parallel...
running parallel...
running parallel...
Parallel on rank 1:
min: 11.000000 ms
max: 489.000000 ms
mean: 40.540000 ms

running on_root...
Parallel on rank 2:
min: 11.000000 ms
max: 489.000000 ms
mean: 40.580000 ms

running on_root...
Parallel on rank 0:
min: 11.000000 ms
max: 489.000000 ms
mean: 40.560000 ms

running on_root...
Parallel on rank 3:
min: 11.000000 ms
max: 489.000000 ms
mean: 40.540000 ms

running on_root...
On_root on rank 3:
min: 0.000000 ms
max: 3.000000 ms
mean: 1.180000 ms

On_root on rank 0:
min: 6.000000 ms
max: 870.000000 ms
mean: 44.950000 ms

On_root on rank 1:
min: 0.000000 ms
max: 3.000000 ms
mean: 0.640000 ms

On_root on rank 2:
min: 0.000000 ms
max: 3.000000 ms
mean: 0.920000 ms

C code
#include "hdf5.h"
#include "stdlib.h"
#include <mpi.h>
#include <time.h>
#include <sys/time.h>

#define DIMS 1

void parallel(char *filename, int M, double *A, MPI_Comm comm, MPI_Info info) {
    int myrank, Nproc;
    MPI_Comm_rank(comm, &myrank);
    MPI_Comm_size(comm, &Nproc);

    hid_t plistId = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(plistId, comm, info);
    hid_t fileId = H5Fcreate(filename, H5F_ACC_TRUNC, H5P_DEFAULT, plistId);
    H5Pclose(plistId);
    // Create the dataspace for the dataset.
    hsize_t dimsm[1];
    hsize_t dimsf[1];
    dimsm[0] = M;
    dimsf[0] = M * Nproc;
    hid_t mspace = H5Screate_simple(DIMS, dimsm, NULL);
    hid_t fspace = H5Screate_simple(DIMS, dimsf, NULL);

    // Create dataset and select subset to write
    hid_t dset = H5Dcreate(fileId, "A", H5T_NATIVE_DOUBLE, fspace, H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
    hsize_t start[] = {M * myrank};
    hsize_t count[] = {M};
    H5Sselect_hyperslab(fspace, H5S_SELECT_SET, start, NULL, count, NULL);

    // Create a property list for collective data transfer
    //plistId = H5Pcreate(H5P_DATASET_XFER);
    // Write data
    H5Dwrite(dset, H5T_NATIVE_DOUBLE, mspace, fspace, H5P_DEFAULT, A);

    H5Dclose(dset);
    H5Sclose(fspace);
    H5Sclose(mspace);
    H5Fclose(fileId);
    //H5Pclose(plistId);
}

void on_root(char *filename, int M, double *A, MPI_Comm comm, MPI_Info info) {
    int myrank, Nproc;
    MPI_Comm_rank(comm, &myrank);
    MPI_Comm_size(comm, &Nproc);
    int root = 0;

    // Send data to root
    int *displs = (int *)malloc(Nproc * sizeof(int));
    int *recvcounts = (int *)malloc(Nproc * sizeof(int));
    for (int i = 0; i < Nproc; i++) {
        displs[i] = M * i;
        recvcounts[i] = M;
    }
    double *rbuf;
    if (myrank == root)
        rbuf = (double *)malloc(Nproc * M * sizeof(double));
    MPI_Gatherv(A, M, MPI_DOUBLE, rbuf, recvcounts, displs, MPI_DOUBLE, root, comm);

    // Write data on root
    if (myrank == root) {
        hid_t fileId = H5Fcreate(filename, H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
        // Create the dataspace for the dataset.
        hsize_t dimsm[1];
        hsize_t dimsf[1];
        dimsm[0] = M * Nproc;
        dimsf[0] = M * Nproc;
        hid_t mspace = H5Screate_simple(DIMS, dimsm, NULL);
        hid_t fspace = H5Screate_simple(DIMS, dimsf, NULL);

        // Create datset
        hid_t dset = H5Dcreate(fileId, "A", H5T_NATIVE_DOUBLE, fspace, H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

        // Write data
        H5Dwrite(dset, H5T_NATIVE_DOUBLE, mspace, fspace, H5P_DEFAULT, rbuf);

        H5Dclose(dset);
        H5Sclose(fspace);
        H5Sclose(mspace);
        H5Fclose(fileId);
    }
}

double min(double *arr, int size) {
    double min = arr[0];
    for (int i = 1; i < size; i++) {
        if (arr[i] < min)
            min = arr[i];
    }
    return min;
}

double max(double *arr, int size) {
    double max = arr[0];
    for (int i = 1; i < size; i++) {
        if (arr[i] > max)
            max = arr[i];
    }
    return max;
}

double mean(double *arr, int size) {
    double sum = 0.0;
    for (int i = 0; i < size; i++) {
        sum += arr[i];
    }
    double mean = sum / size;
    return mean;
}

double get_time_ms(void) {
    struct timeval tv;

    gettimeofday(&tv, NULL);
    return (double)(tv.tv_usec / 1000) + (double)(tv.tv_sec * 1000);
}

int main (int argc, char **argv) {
    MPI_Comm comm  = MPI_COMM_WORLD;
    MPI_Info info  = MPI_INFO_NULL;
    MPI_Init(&argc, &argv);
    int myrank;
    MPI_Comm_rank(comm, &myrank);

    int M = 100000;
    double *A = (double *)malloc(sizeof(double) * M);
    time_t t;
    srand((unsigned) time(&t));
    for (int i = 0; i < M; i++)
        A[i] = (double)rand() / (double)RAND_MAX;

    int samples = 100;
    double begin, end;

    double *times_parallel = (double *)malloc(sizeof(double) * samples);
    printf("running parallel...\n");
    for (int i = 0; i < samples; i++) {
        begin = get_time_ms();
        parallel("data_parallel.h5", M, A, comm, info);
        end = get_time_ms();
        MPI_Barrier(comm);
        times_parallel[i] = (double)(end - begin);
    }
    printf("Parallel on rank %d:\n\tmin: %f ms\n\tmax: %f ms\n\tmean: %f ms\n\n", myrank, min(times_parallel, samples), max(times_parallel, samples), mean(times_parallel, samples));
    
    double *times_on_root = (double *)malloc(sizeof(double) * samples);
    printf("running on_root...\n");
    for (int i = 0; i < samples; i++) {
        begin = get_time_ms();
        on_root("data_on_root.h5", M, A, comm, info);
        end = get_time_ms();
        MPI_Barrier(comm);
        times_on_root[i] = (double)(end - begin);
    }
    printf("On_root on rank %d:\n\tmin: %f ms\n\tmax: %f ms\n\tmean: %f ms\n\n", myrank, min(times_on_root, samples), max(times_on_root, samples), mean(times_on_root, samples));

    MPI_Finalize();
    return 0;
}
Julia code
using MPI
using HDF5
using BenchmarkTools

MPI.Init()
comm = MPI.COMM_WORLD

Nproc = MPI.Comm_size(comm)
myrank = MPI.Comm_rank(comm)

println("rank $myrank")

M = 100000
A = rand(M)  # local data

# true parallel writing slices
function parallel(filename, A, comm)
  M = size(A, 1)
  myrank = MPI.Comm_rank(comm)
  Nproc = MPI.Comm_size(comm)

  h5open(filename, "w", comm) do file
    # Create dataset
    dset = create_dataset(file, "/A", datatype(eltype(A)), dataspace((M * Nproc,)))

    # Write local data
    slice = myrank * M + 1:(myrank + 1) * M
    dset[slice] = A
  end
end

# poor man's version
function on_root(filename, A, comm)
  M = size(A, 1)
  myrank = MPI.Comm_rank(comm)
  Nproc = MPI.Comm_size(comm)
  root = 0
  # Send data to root
  if myrank != root
    MPI.Gatherv!(A, nothing, root, comm)
  else
    h5open(filename, "w") do file
      # Create dataset
      dset = create_dataset(file, "/A", datatype(eltype(A)), dataspace((M * Nproc,)))

      # Receive data
      recv = Vector{eltype(A)}(undef, (M * Nproc,))
      MPI.Gatherv!(A, MPI.VBuffer(recv, fill(M, Nproc)), root, comm)
      # Write local data
      dset[:] = recv
    end
  end
end

samples = 100
println("running parallel...")
t1 = @benchmark parallel("data_parallel.h5", A, comm) samples=samples
println(IOContext(stdout, :compact => false), t1)

println("running on_root...")
t2 = @benchmark on_root("data_on_root.h5", A, comm) samples=samples
println(IOContext(stdout, :compact => false), t2)

#=using Profile, ProfileView
function doit_parallel(n, A, comm)
  for i in 1:n
    parallel("data_parallel.h5", A, comm)
  end
end
function doit_on_root(n, A, comm)
  for i in 1:n
    on_root("data_on_root.h5", A, comm)
  end
end
Profile.clear(); @profile doit_parallel(10, A, comm); ProfileView.view(windowname="parallel $myrank")
Profile.clear(); @profile doit_on_root(10, A, comm); ProfileView.view(windowname="on_root $myrank");=#

@sloede
Copy link
Member

sloede commented Apr 23, 2023

I recreated my MWE in C now. Here, the parallel case with 4 ranks is around 10 % faster than the case where root writes all the data. Compared to julia it executes about 20 % faster (but still within the standard deviation). You can find the timings and the C and julia programs below.

Thanks! Can you try to run the examples on roci as well? And I wouldn't run them on tmpi, I'd run them as a script to avoid any tmpi influence. You just have to discard the first invocation (since it includes compile time).

But from the current results, do I understand correctly that in the MWE, the parallel HDF5 case is slightly faster than the poor man's version, as one would expect? How is the difference between Julia and C?

@JoshuaLampert
Copy link
Member Author

I also ran the C MWE on roci. Here, the poor man's version is even faster than the parallel one (using 4 ranks: the parallel version needs ~3.6 ms and the poor man's version ~2.8 ms; using 16 ranks: 16.1 ms parallel vs. 11.59 ms for the poor man's version). Whether I use mpiexecjl or tmpi doesn't have a qualitative impact on the results.
So in total, the issue occurs in Julia as well as in C, and the code runs a bit faster in C than in Julia.

@ranocha
Copy link
Member

ranocha commented Apr 25, 2023

If you don't have any other suggestions, @sloede, I think we can proceed with this. There does not seem to be a huge difference for small-scale applications, while real parallel I/O is required for large-scale simulations. Thus, this seems to be a net win to me.

@JoshuaLampert
Copy link
Member Author

Finally, I found the time to implement the parallel I/O for the restart files. Sorry for the delay. This is now ready for a review from my point of view, @sloede, @ranocha.

@JoshuaLampert
Copy link
Member Author

JoshuaLampert commented May 22, 2023

When I tested this, I tried the elixir under examples/tree_3d_dgsem/elixir_advection_restart.jl, but got the error

ERROR: LoadError: type NamedTuple has no field mpi_cache
Stacktrace:
  [1] getproperty
    @ ./Base.jl:37 [inlined]
  [2] nelementsglobal
    @ ~/.julia/dev/Trixi.jl/src/solvers/dg.jl:444 [inlined]
  [3] ndofsglobal
    @ ~/.julia/dev/Trixi.jl/src/solvers/dg.jl:387 [inlined]
[...]

I didn't investigate further, since it also occurs on main and seems to be unrelated to this PR. Is this a known issue or should I create one? I also noticed that there are no MPI tests for the 3d TreeMesh.

@ranocha
Copy link
Member

ranocha commented May 23, 2023

@sloede IIRC, MPI on the TreeMesh was just implemented as a prototype in 2D and not in 3D, correct?

docs/src/parallelization.md Outdated Show resolved Hide resolved
src/Trixi.jl Outdated Show resolved Hide resolved
src/callbacks_step/save_restart_dg.jl Outdated Show resolved Hide resolved
@ranocha
Copy link
Member

ranocha commented May 23, 2023

Do we have some tests for this?

@JoshuaLampert
Copy link
Member Author

I was also thinking about testing the parallel I/O. As long as HDF5_jll.jl doesn't support MPI, we would need to install a custom HDF5 with MPI support on GitHub Actions and set the environment variables and preferences accordingly. We could use a similar setup as we have in P4est.jl, I guess.

That would be an option. We could add another CI job running MPI tests on Ubuntu with parallel HDF5 enabled. This shouldn't be too expensive.

Ok, I can try that. Maybe I'll need some help of you to set up the CI job correctly.

Not sure if this is necessary. If we get parallel I/O support from an MPI-enabled HDF5_jll soon, I don't think we need to invest time in a sophisticated CI setup right now. I'd be ok with omitting this but creating an issue for it once JuliaPackaging/Yggdrasil#6551 is merged and HDF5 is updated to use it.

As discussed here I only tested the true parallel I/O locally, but not in CI. For this, I ran examples/tree_2d_dgsem/elixir_advection_restart.jl, examples/p4est_2d_dgsem/elixir_advection_restart.jl and examples/p4est_3d_dgsem/elixir_advection_restart.jl in parallel with parallel HDF5 enabled on 2 and 3 ranks for each elixir.

@ranocha
Copy link
Member

ranocha commented May 24, 2023

Great, thanks! Did you open the issue as discussed there?

@ranocha
Copy link
Member

ranocha commented May 24, 2023

CI looks good. There is just the upstream error

elixir_advection_amr_visualization.jl: Test Failed at /home/runner/work/Trixi.jl/Trixi.jl/test/test_trixi.jl:167
  Expression: occursin(r"^(WARNING: replacing module .+\.\n)*$", stderr_content)
   Evaluated: occursin(r"^(WARNING: replacing module .+\.\n)*$", "WARNING: importing deprecated binding Colors.RGB1 into Plots.\nWARNING: importing deprecated binding Colors.RGB4 into Plots.\n")

at https://github.com/trixi-framework/Trixi.jl/actions/runs/5056931269/jobs/9075043660?pr=1399#step:7:19534

Could you please update the filter for these warnings in

Trixi.jl/test/test_trixi.jl

Lines 152 to 153 in 153ef65

"WARNING: importing deprecated binding Colors.RGB1 into PlotUtils.\n",
"WARNING: importing deprecated binding Colors.RGB4 into PlotUtils.\n",

accordingly?

@JoshuaLampert
Copy link
Member Author

Great, thanks! Did you open the issue as discussed there?

Done in #1486.

ranocha
ranocha previously approved these changes May 25, 2023
Copy link
Member

@ranocha ranocha left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot! Good to go from my side. It would be great if @sloede could have a look, too

@sloede
Copy link
Member

sloede commented May 25, 2023

@sloede IIRC, MPI on the TreeMesh was just implemented as a prototype in 2D and not in 3D, correct?

Correct. The parallel 2D TreeMesh is already implemented so inefficiently (by myself 😬) that I didn't think it would be worthwhile to make it work for 3D.

@JoshuaLampert Could you maybe create an issue that we should add a more meaningful error message in case someone tries to run TreeMesh3D in parallel?

Copy link
Member

@sloede sloede left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM overall, just one suggestion.

docs/src/parallelization.md Show resolved Hide resolved
Co-authored-by: Michael Schlottke-Lakemper <michael@sloede.com>
@sloede sloede enabled auto-merge (squash) May 25, 2023 11:44
@ranocha ranocha disabled auto-merge May 26, 2023 08:20
@ranocha ranocha merged commit 711d4d8 into trixi-framework:main May 26, 2023
@JoshuaLampert JoshuaLampert deleted the parallel-io branch May 26, 2023 08:37
Labels
parallelization Related to MPI, threading, tasks etc.