Parallel I/O #1399
Conversation
Thanks a lot! Are the results from a fresh run or a later run without compilation? |
As another test, you could try to isolate the I/O part and only benchmark this code. You could even extract it from Trixi and just use randomly filled arrays as dummy data. This would probably help to isolate the underlying issue. Then, I would move to a different machine and check if you see the same behavior, just eliminate the possibility that it is a system specific issue. If the overly long write times persist, you can also try to recreate it in C/C++ to see if it is a Julia issue or just an artifact of the HDF5 library that is being used. |
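For instance, a minimal sketch of such an isolated benchmark (the file name and array size are arbitrary placeholders; the MPI-enabled MWE further down in this thread covers the parallel case):
using HDF5, BenchmarkTools
A = rand(100_000)  # randomly filled dummy data instead of Trixi solution arrays
t = @benchmark begin
    h5open("dummy.h5", "w") do file  # "w" truncates, so every sample writes a fresh file
        write(file, "A", $A)
    end
end
println(t)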
@ranocha, to exclude compilation time I need to use tmpi, right? With 4 ranks using tmpi I get the following results for the second run of tree_2d_dgsem/elixir_advection_basic.jl:

Running without parallel HDF5 enabled
────────────────────────────────────────────────────────────────────────────────────
Trixi.jl Time Allocations
─────────────────────── ────────────────────────
Tot / % measured: 32.6ms / 96.7% 337KiB / 95.2%
│
Section ncalls time %tot avg alloc %tot avg
────────────────────────────────────────────────────────────────────────────────────
rhs! 91 13.0ms 41.4% 143μs 77.8KiB 24.2% 875B
volume integral 91 3.74ms 11.9% 41.1μs 0.00B 0.0% 0.00B
finish MPI send 91 2.34ms 7.4% 25.7μs 18.5KiB 5.8% 208B
interface flux 91 1.61ms 5.1% 17.7μs 0.00B 0.0% 0.00B
finish MPI receive 91 1.51ms 4.8% 16.6μs 21.3KiB 6.6% 240B
~rhs!~ 91 1.06ms 3.4% 11.7μs 15.2KiB 4.7% 171B
start MPI send 91 799μs 2.5% 8.78μs 11.4KiB 3.5% 128B
surface integral 91 524μs 1.7% 5.76μs 0.00B 0.0% 0.00B
prolong2interfaces 91 459μs 1.5% 5.05μs 0.00B 0.0% 0.00B
MPI interface flux 91 396μs 1.3% 4.35μs 0.00B 0.0% 0.00B
start MPI receive 91 197μs 0.6% 2.16μs 11.4KiB 3.5% 128B
prolong2mpiinterfaces 91 110μs 0.3% 1.20μs 0.00B 0.0% 0.00B
reset ∂u/∂t 91 78.8μs 0.3% 866ns 0.00B 0.0% 0.00B
Jacobian 91 76.6μs 0.2% 842ns 0.00B 0.0% 0.00B
MPI mortar flux 91 40.8μs 0.1% 449ns 0.00B 0.0% 0.00B
prolong2mpimortars 91 23.1μs 0.1% 254ns 0.00B 0.0% 0.00B
prolong2boundaries 91 22.0μs 0.1% 242ns 0.00B 0.0% 0.00B
mortar flux 91 21.0μs 0.1% 231ns 0.00B 0.0% 0.00B
prolong2mortars 91 20.3μs 0.1% 224ns 0.00B 0.0% 0.00B
source terms 91 11.1μs 0.0% 122ns 0.00B 0.0% 0.00B
boundary flux 91 11.1μs 0.0% 122ns 0.00B 0.0% 0.00B
I/O 3 12.2ms 38.7% 4.06ms 170KiB 52.9% 56.6KiB
save solution 2 6.81ms 21.6% 3.40ms 132KiB 41.2% 66.1KiB
~I/O~ 3 5.35ms 17.0% 1.78ms 33.9KiB 10.6% 11.3KiB
get element variables 2 27.5μs 0.1% 13.7μs 3.66KiB 1.1% 1.83KiB
save mesh 2 501ns 0.0% 250ns 0.00B 0.0% 0.00B
analyze solution 2 5.15ms 16.3% 2.58ms 69.8KiB 21.7% 34.9KiB
calculate dt 19 1.13ms 3.6% 59.4μs 3.86KiB 1.2% 208B
────────────────────────────────────────────────────────────────────────────────────

Running with parallel HDF5 enabled
────────────────────────────────────────────────────────────────────────────────────
Trixi.jl Time Allocations
─────────────────────── ────────────────────────
Tot / % measured: 35.2ms / 96.2% 276KiB / 94.2%
Section ncalls time %tot avg alloc %tot avg
────────────────────────────────────────────────────────────────────────────────────
I/O 3 15.4ms 45.5% 5.12ms 109KiB 41.8% 36.3KiB
save solution 2 11.1ms 32.8% 5.55ms 70.9KiB 27.3% 35.5KiB
~I/O~ 3 4.25ms 12.6% 1.42ms 34.2KiB 13.1% 11.4KiB
get element variables 2 25.8μs 0.1% 12.9μs 3.66KiB 1.4% 1.83KiB
save mesh 2 536ns 0.0% 268ns 0.00B 0.0% 0.00B
rhs! 91 14.0ms 41.5% 154μs 77.8KiB 29.9% 875B
volume integral 91 5.24ms 15.5% 57.6μs 0.00B 0.0% 0.00B
interface flux 91 1.99ms 5.9% 21.9μs 0.00B 0.0% 0.00B
finish MPI receive 91 1.37ms 4.0% 15.0μs 21.3KiB 8.2% 240B
~rhs!~ 91 1.13ms 3.3% 12.4μs 15.2KiB 5.8% 171B
start MPI send 91 1.01ms 3.0% 11.1μs 11.4KiB 4.4% 128B
surface integral 91 863μs 2.6% 9.48μs 0.00B 0.0% 0.00B
MPI interface flux 91 649μs 1.9% 7.13μs 0.00B 0.0% 0.00B
prolong2interfaces 91 584μs 1.7% 6.42μs 0.00B 0.0% 0.00B
finish MPI send 91 492μs 1.5% 5.40μs 18.5KiB 7.1% 208B
start MPI receive 91 221μs 0.7% 2.43μs 11.4KiB 4.4% 128B
prolong2mpiinterfaces 91 139μs 0.4% 1.53μs 0.00B 0.0% 0.00B
reset ∂u/∂t 91 99.4μs 0.3% 1.09μs 0.00B 0.0% 0.00B
Jacobian 91 96.1μs 0.3% 1.06μs 0.00B 0.0% 0.00B
prolong2boundaries 91 27.4μs 0.1% 301ns 0.00B 0.0% 0.00B

Running serial for comparison
────────────────────────────────────────────────────────────────────────────────────
Trixi.jl Time Allocations
Tot / % measured: 56.0ms / 95.6% 320KiB / 87.6%
Section ncalls time %tot avg alloc %tot avg
───────────────────────────────────────────────────────────────────────────────────
rhs! 91 35.8ms 66.8% 393μs 9.33KiB 3.3% 105B
volume integral 91 20.2ms 37.7% 222μs 0.00B 0.0% 0.00B
interface flux 91 8.66ms 16.2% 95.2μs 0.00B 0.0% 0.00B
prolong2interfaces 91 2.64ms 4.9% 29.0μs 0.00B 0.0% 0.00B
surface integral 91 2.61ms 4.9% 28.7μs 0.00B 0.0% 0.00B
~rhs!~ 91 873μs 1.6% 9.59μs 9.33KiB 3.3% 105B
Jacobian 91 352μs 0.7% 3.87μs 0.00B 0.0% 0.00B
reset ∂u/∂t 91 339μs 0.6% 3.72μs 0.00B 0.0% 0.00B
prolong2mortars 91 29.2μs 0.1% 321ns 0.00B 0.0% 0.00B
prolong2boundaries 91 28.9μs 0.1% 318ns 0.00B 0.0% 0.00B
mortar flux 91 28.4μs 0.1% 312ns 0.00B 0.0% 0.00B
boundary flux 91 13.6μs 0.0% 150ns 0.00B 0.0% 0.00B
source terms 91 12.0μs 0.0% 132ns 0.00B 0.0% 0.00B
I/O 3 11.0ms 20.6% 3.67ms 249KiB 88.6% 82.9KiB
save solution 2 5.92ms 11.0% 2.96ms 211KiB 75.3% 106KiB
~I/O~ 3 5.07ms 9.5% 1.69ms 33.8KiB 12.1% 11.3KiB
get element variables 2 30.9μs 0.1% 15.4μs 3.53KiB 1.3% 1.77KiB
save mesh 2 473ns 0.0% 236ns 0.00B 0.0% 0.00B
analyze solution 2 6.74ms 12.6% 3.37ms 22.6KiB 8.0% 11.3KiB
calculate dt 19 28.6μs 0.1% 1.50μs 0.00B 0.0% 0.00B
────────────────────────────────────────────────────────────────────────────────────

So, I/O with parallel HDF5 still takes almost twice as long as in the serial run and in the poor man's version. I already started to create a simple Trixi.jl-independent example, as @sloede also suggested. I will elaborate on this.

EDIT: I was able to (at least partly) reproduce the issue with a Trixi-independent MWE on different machines. On both machines the performance doesn't improve with parallel HDF5, see the tmpi terminals of the benchmarks below.

My laptop using 4 ranks with tmpi
julia> include("parallel_benchmark.jl") │julia> include("parallel_benchmark.jl")
rank 0 │rank 1
running parallel │running parallel
BenchmarkTools.Trial: 100 samples with 1 evaluation. │BenchmarkTools.Trial: 100 samples with 1 evaluation.
Range (min … max): 7.225 ms … 676.040 ms ┊ GC (min … max): 0.00% … 0.00% │ Range (min … max): 4.151 ms … 676.024 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 17.866 ms ┊ GC (median): 0.00% │ Time (median): 17.929 ms ┊ GC (median): 0.00%
Time (mean ± σ): 48.891 ms ± 78.380 ms ┊ GC (mean ± σ): 0.00% ± 0.00% │ Time (mean ± σ): 48.851 ms ± 78.385 ms ┊ GC (mean ± σ): 0.00% ± 0.00%
│
▄█ ▂▁ │ ██ ▁ ▄
▄▇██▆▁▁▁▁▁▁▁▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄▁▁▁▁▆█▁██▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄ ▄ │ ▅▅▆██▅▁▁▁▁▁▁▅▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▅▁▁▁▅█▁▇█▅▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▅ ▅
7.23 ms Histogram: log(frequency) by time 203 ms < │ 4.15 ms Histogram: log(frequency) by time 203 ms <
│
Memory estimate: 8.15 KiB, allocs estimate: 106. │ Memory estimate: 8.18 KiB, allocs estimate: 108.
running on_root │running on_root
BenchmarkTools.Trial: 100 samples with 1 evaluation. │BenchmarkTools.Trial: 100 samples with 1 evaluation.
Range (min … max): 4.578 ms … 502.710 ms ┊ GC (min … max): 0.00% … 0.00% │ Range (min … max): 10.249 ms … 502.645 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 11.098 ms ┊ GC (median): 0.00% │ Time (median): 11.115 ms ┊ GC (median): 0.00%
Time (mean ± σ): 46.678 ms ± 72.112 ms ┊ GC (mean ± σ): 0.06% ± 0.94% │ Time (mean ± σ): 46.831 ms ± 72.069 ms ┊ GC (mean ± σ): 0.00% ± 0.00%
│
█▆ ▄ │ █ ▃
▄██▁▁▁▁▁▁▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄█▁▁▁▁▁▁▁▁▁▁▁▁▄▁▁▁▆▁▄▁▁▄▄▄▁▄ ▄ │ █▁▄▁▁▁▁▁▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▄▁▁▁▁▁▁▁▁▁▁▁▄▁▁▁▆▁▁▄▁▄▁▄▄▁▄ ▄
4.58 ms Histogram: log(frequency) by time 199 ms < │ 10.2 ms Histogram: log(frequency) by time 199 ms <
│
Memory estimate: 3.06 MiB, allocs estimate: 110. │ Memory estimate: 128 bytes, allocs estimate: 2.
│
julia> │julia>
─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── │
julia> include("parallel_benchmark.jl") │julia> include("parallel_benchmark.jl")
rank 2 │rank 3
running parallel │running parallel
BenchmarkTools.Trial: 100 samples with 1 evaluation. │BenchmarkTools.Trial: 100 samples with 1 evaluation.
Range (min … max): 5.275 ms … 676.031 ms ┊ GC (min … max): 0.00% … 0.00% │ Range (min … max): 3.874 ms … 675.990 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 17.804 ms ┊ GC (median): 0.00% │ Time (median): 17.815 ms ┊ GC (median): 0.00%
Time (mean ± σ): 48.869 ms ± 78.394 ms ┊ GC (mean ± σ): 0.00% ± 0.00% │ Time (mean ± σ): 48.856 ms ± 78.400 ms ┊ GC (mean ± σ): 0.00% ± 0.00%
│
█▂ ▂ │ ▆█ ▁ ▃
▄▇▁██▁▁▁▁▁▁▁▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄▁▁▁▄█▁██▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄ ▄ │ ▅▁▇██▅▁▁▁▁▁▁▅▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▅▁▁▁▁█▁▆█▅▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▅ ▅
5.28 ms Histogram: log(frequency) by time 203 ms < │ 3.87 ms Histogram: log(frequency) by time 203 ms <
│
Memory estimate: 8.18 KiB, allocs estimate: 108. │ Memory estimate: 8.18 KiB, allocs estimate: 108.
running on_root │running on_root
BenchmarkTools.Trial: 100 samples with 1 evaluation. │BenchmarkTools.Trial: 100 samples with 1 evaluation.
Range (min … max): 10.250 ms … 502.647 ms ┊ GC (min … max): 0.00% … 0.00% │ Range (min … max): 117.345 μs … 502.647 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 11.099 ms ┊ GC (median): 0.00% │ Time (median): 11.055 ms ┊ GC (median): 0.00%
Time (mean ± σ): 46.783 ms ± 72.096 ms ┊ GC (mean ± σ): 0.00% ± 0.00% │ Time (mean ± σ): 46.643 ms ± 72.171 ms ┊ GC (mean ± σ): 0.00% ± 0.00%
│
█ ▃ │ █ ▃
█▄▁▁▁▁▁▁▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▄▁▁▁▁▁▁▁▁▁▁▁▄▁▁▁▄▄▁▄▁▄▁▄▄▁▄ ▄ │ ▄▁▁█▁▁▁▁▁▁▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▄▁▁▁▁▁▁▁▁▁▁▁▄▁▁▁▆▁▄▁▁▄▄▄▁▄ ▄
10.3 ms Histogram: log(frequency) by time 199 ms < │ 117 μs Histogram: log(frequency) by time 199 ms <
│
Memory estimate: 128 bytes, allocs estimate: 2. │ Memory estimate: 128 bytes, allocs estimate: 2.
│
roci using 4 ranks with tmpi
julia> include("parallel_benchmark.jl") │julia> include("parallel_benchmark.jl")
rank 0 │rank 1
running parallel │running parallel
BenchmarkTools.Trial: 100 samples with 1 evaluation. │BenchmarkTools.Trial: 100 samples with 1 evaluation.
Range (min … max): 3.033 ms … 10.288 ms ┊ GC (min … max): 0.00% … 0.00% │ Range (min … max): 3.038 ms … 10.280 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 4.310 ms ┊ GC (median): 0.00% │ Time (median): 4.308 ms ┊ GC (median): 0.00%
Time (mean ± σ): 4.400 ms ± 962.308 μs ┊ GC (mean ± σ): 0.00% ± 0.00% │ Time (mean ± σ): 4.417 ms ± 976.753 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
│
▁ █▆█▃▆▃▆▄ ▁ ▃ │ █▃█ ▅▂▃▃
▆▇▆▄▄█▇▆▆▆▇████████▆█▁█▆▁▆▄▁▄▁▁▁▁▁▁▁▁▁▄▄▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄ ▄ │ ▅▅▇▄▄█▇▅▅▅▇████████▅█▄▇▇▁▅▄▁▄▁▁▁▁▁▁▁▄▁▅▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄ ▄
3.03 ms Histogram: frequency by time 8.15 ms < │ 3.04 ms Histogram: frequency by time 8.16 ms <
│
Memory estimate: 7.74 KiB, allocs estimate: 101. │ Memory estimate: 7.77 KiB, allocs estimate: 103.
running on_root │running on_root
BenchmarkTools.Trial: 100 samples with 1 evaluation. │BenchmarkTools.Trial: 100 samples with 1 evaluation.
Range (min … max): 2.525 ms … 10.106 ms ┊ GC (min … max): 0.00% … 0.00% │ Range (min … max): 2.706 ms … 10.110 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 4.325 ms ┊ GC (median): 0.00% │ Time (median): 4.344 ms ┊ GC (median): 0.00%
Time (mean ± σ): 4.460 ms ± 1.155 ms ┊ GC (mean ± σ): 0.98% ± 3.21% │ Time (mean ± σ): 4.481 ms ± 1.126 ms ┊ GC (mean ± σ): 0.00% ± 0.00%
│
█ █ ▅▁▁█ ▂ │ ▁ ▂▂ ▄█▅ ▇
▃▃▃▁▁▁█▅▅▁▁▆▆▃█▆▆██████▃▆█▁▃▅▃▁▃▃▁▁▁▁▁▅▁▁▁▃▃▁▁▁▁▁▁▃▁▁▁▁▁▁▅ ▃ │ ▃▁▅▁█▅▃█▆█▅██▁██████▆█▅▅▁▅▁▁▃▃▁▃▁▁▃▁▁▃▁▁▁▁▁▁▃▁▁▁▁▁▁▃▁▁▁▁▁▃ ▃
2.53 ms Histogram: frequency by time 8.17 ms < │ 2.71 ms Histogram: frequency by time 8.82 ms <
│
Memory estimate: 3.06 MiB, allocs estimate: 106. │ Memory estimate: 160 bytes, allocs estimate: 3.
│
julia> │julia>
─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
│
julia> include("parallel_benchmark.jl") │julia> include("parallel_benchmark.jl")
rank 2 │rank 3
running parallel │running parallel
BenchmarkTools.Trial: 100 samples with 1 evaluation. │BenchmarkTools.Trial: 100 samples with 1 evaluation.
Range (min … max): 3.031 ms … 10.281 ms ┊ GC (min … max): 0.00% … 0.00% │ Range (min … max): 3.034 ms … 10.281 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 4.302 ms ┊ GC (median): 0.00% │ Time (median): 4.319 ms ┊ GC (median): 0.00%
Time (mean ± σ): 4.433 ms ± 1.021 ms ┊ GC (mean ± σ): 0.00% ± 0.00% │ Time (mean ± σ): 4.436 ms ± 1.030 ms ┊ GC (mean ± σ): 0.00% ± 0.00%
│
▁ ██▆▃▆▃▆▃ ▄ │ █▅▅▃▂▃▃▃ ▂
▆▆▇▄▄█▇▆▇▆▇████████▆▇▄█▁▁▇▁▄▁▁▁▁▁▁▁▁▁▄▄▄▁▁▁▁▁▁▁▁▁▁▁▁▁▄▁▁▁▄ ▄ │ ▅▅▇▄▅▇▇▄▇▇▇████████▄█▄█▄▁▅▄▄▁▁▁▁▁▁▁▁▁▄▄▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▅ ▄
3.03 ms Histogram: frequency by time 8.18 ms < │ 3.03 ms Histogram: frequency by time 8.14 ms <
│
Memory estimate: 7.77 KiB, allocs estimate: 103. │ Memory estimate: 7.77 KiB, allocs estimate: 103.
running on_root │running on_root
BenchmarkTools.Trial: 100 samples with 1 evaluation. │BenchmarkTools.Trial: 100 samples with 1 evaluation.
Range (min … max): 2.699 ms … 10.108 ms ┊ GC (min … max): 0.00% … 0.00% │ Range (min … max): 2.704 ms … 10.106 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 4.355 ms ┊ GC (median): 0.00% │ Time (median): 4.366 ms ┊ GC (median): 0.00%
Time (mean ± σ): 4.494 ms ± 1.139 ms ┊ GC (mean ± σ): 0.00% ± 0.00% │ Time (mean ± σ): 4.502 ms ± 1.157 ms ┊ GC (mean ± σ): 0.00% ± 0.00%
│
▂ █▄ ▄ │ ▃ ▂ █▄ ▄
▃▃▁▁█▆█▁▆▇▅██▃▇█████▇▅▆▃▃▅▁▃▃▃▁▁▃▁▁▁▃▁▁▃▁▁▁▁▃▁▁▁▁▁▁▃▁▁▁▁▁▃ ▃ │ ▃▃▁▁██▃▁▆▇▆▇█▃▇█████▆▆▆▃▃▅▁▁▃▃▁▁▃▁▃▁▁▃▁▁▃▁▁▁▃▁▁▁▁▁▁▃▁▁▁▁▁▃ ▃
2.7 ms Histogram: frequency by time 8.81 ms < │ 2.7 ms Histogram: frequency by time 8.8 ms <
│
Memory estimate: 160 bytes, allocs estimate: 3. │ Memory estimate: 160 bytes, allocs estimate: 3.

With more MPI ranks the times scale linearly, which indicates that the parallelization doesn't work.

EDIT2: I also tried using …
Yes, exactly. Thanks! |
Codecov Report
@@ Coverage Diff @@
## main #1399 +/- ##
==========================================
- Coverage 95.70% 92.82% -2.88%
==========================================
Files 360 356 -4
Lines 29816 29639 -177
==========================================
- Hits 28533 27511 -1022
- Misses 1283 2128 +845
I also looked at flamegraphs on each process; see the screenshot below (with some annotations) showing the flamegraphs for 2 ranks (Profile.jl measured the time for 10 runs writing 1 000 000 …).
[annotated flamegraph screenshot not preserved in this copy]
Did you already test it on different machines, @JoshuaLampert? |
Yes, see also my edited comment above. |
Thanks, I didn't notice the update there |
How hard would it be to recreate your MWE (…)? Have you encountered the same performance issue when using an HDF5 library not provided by HDF5_jll, i.e., overriding the JLL product with your local version?
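For reference, overriding the JLL-provided library is done via Preferences, roughly like the following sketch (hedged: the preference keys "libhdf5"/"libhdf5_hl" follow HDF5.jl's documented override mechanism, the paths are placeholders, and a Julia restart is needed afterwards):
using Preferences, HDF5
# Point HDF5.jl at a system HDF5 build with MPI support instead of HDF5_jll.
set_preferences!(HDF5,
    "libhdf5" => "/usr/lib/x86_64-linux-gnu/hdf5/openmpi/libhdf5.so",
    "libhdf5_hl" => "/usr/lib/x86_64-linux-gnu/hdf5/openmpi/libhdf5_hl.so";
    force = true)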
I will try to reproduce the issue in C |
🤦 Right. |
I recreated my MWE in C now. Here, the parallel case with 4 ranks is around 10 % faster than the case where root writes all the data. Compared to the Julia version, it executes about 20 % faster (but still within the standard deviation). You can find the timings and the C and Julia programs below.

My laptop using 4 ranks
[per-rank timings for "running parallel..." and "running on_root..." not preserved in this copy]

C code
#include "hdf5.h"
#include "stdlib.h"
#include <mpi.h>
#include <time.h>
#include <sys/time.h>
#define DIMS 1
void parallel(char *filename, int M, double *A, MPI_Comm comm, MPI_Info info) {
int myrank, Nproc;
MPI_Comm_rank(comm, &myrank);
MPI_Comm_size(comm, &Nproc);
hid_t plistId = H5Pcreate(H5P_FILE_ACCESS);
H5Pset_fapl_mpio(plistId, comm, info);
hid_t fileId = H5Fcreate(filename, H5F_ACC_TRUNC, H5P_DEFAULT, plistId);
H5Pclose(plistId);
// Create the dataspace for the dataset.
hsize_t dimsm[1];
hsize_t dimsf[1];
dimsm[0] = M;
dimsf[0] = M * Nproc;
hid_t mspace = H5Screate_simple(DIMS, dimsm, NULL);
hid_t fspace = H5Screate_simple(DIMS, dimsf, NULL);
// Create dataset and select subset to write
hid_t dset = H5Dcreate(fileId, "A", H5T_NATIVE_DOUBLE, fspace, H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
hsize_t start[] = {M * myrank};
hsize_t count[] = {M};
H5Sselect_hyperslab(fspace, H5S_SELECT_SET, start, NULL, count, NULL);
// Create a property list for collective data transfer
//plistId = H5Pcreate(H5P_DATASET_XFER);
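// To enable collective MPI-IO, one would additionally call
//   H5Pset_dxpl_mpio(plistId, H5FD_MPIO_COLLECTIVE);
// and pass plistId instead of H5P_DEFAULT as the transfer property list to
// H5Dwrite below; as written, this benchmark uses independent I/O.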
// Write data
H5Dwrite(dset, H5T_NATIVE_DOUBLE, mspace, fspace, H5P_DEFAULT, A);
H5Dclose(dset);
H5Sclose(fspace);
H5Sclose(mspace);
H5Fclose(fileId);
//H5Pclose(plistId);
}
void on_root(char *filename, int M, double *A, MPI_Comm comm, MPI_Info info) {
int myrank, Nproc;
MPI_Comm_rank(comm, &myrank);
MPI_Comm_size(comm, &Nproc);
int root = 0;
// Send data to root
int *displs = (int *)malloc(Nproc * sizeof(int));
int *recvcounts = (int *)malloc(Nproc * sizeof(int));
for (int i = 0; i < Nproc; i++) {
displs[i] = M * i;
recvcounts[i] = M;
}
double *rbuf = NULL;
if (myrank == root)
rbuf = (double *)malloc(Nproc * M * sizeof(double));
MPI_Gatherv(A, M, MPI_DOUBLE, rbuf, recvcounts, displs, MPI_DOUBLE, root, comm);
// Write data on root
if (myrank == root) {
hid_t fileId = H5Fcreate(filename, H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
// Create the dataspace for the dataset.
hsize_t dimsm[1];
hsize_t dimsf[1];
dimsm[0] = M * Nproc;
dimsf[0] = M * Nproc;
hid_t mspace = H5Screate_simple(DIMS, dimsm, NULL);
hid_t fspace = H5Screate_simple(DIMS, dimsf, NULL);
// Create dataset
hid_t dset = H5Dcreate(fileId, "A", H5T_NATIVE_DOUBLE, fspace, H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
// Write data
H5Dwrite(dset, H5T_NATIVE_DOUBLE, mspace, fspace, H5P_DEFAULT, rbuf);
H5Dclose(dset);
H5Sclose(fspace);
H5Sclose(mspace);
H5Fclose(fileId);
}
// Release the buffers allocated for the gather
free(displs);
free(recvcounts);
if (myrank == root)
free(rbuf);
}
double min(double *arr, int size) {
double min = arr[0];
for (int i = 1; i < size; i++) {
if (arr[i] < min)
min = arr[i];
}
return min;
}
double max(double *arr, int size) {
double max = arr[0];
for (int i = 1; i < size; i++) {
if (arr[i] > max)
max = arr[i];
}
return max;
}
double mean(double *arr, int size) {
double sum = 0.0;
for (int i = 0; i < size; i++) {
sum += arr[i];
}
double mean = sum / size;
return mean;
}
double get_time_ms(void) {
struct timeval tv;
gettimeofday(&tv, NULL);
return (double)(tv.tv_usec / 1000) + (double)(tv.tv_sec * 1000);
}
int main (int argc, char **argv) {
MPI_Comm comm = MPI_COMM_WORLD;
MPI_Info info = MPI_INFO_NULL;
MPI_Init(&argc, &argv);
int myrank;
MPI_Comm_rank(comm, &myrank);
int M = 100000;
double *A = (double *)malloc(sizeof(double) * M);
time_t t;
srand((unsigned) time(&t));
for (int i = 0; i < M; i++)
A[i] = (double)rand() / (double)RAND_MAX;
int samples = 100;
double begin, end;
double *times_parallel = (double *)malloc(sizeof(double) * samples);
printf("running parallel...\n");
for (int i = 0; i < samples; i++) {
begin = get_time_ms();
parallel("data_parallel.h5", M, A, comm, info);
end = get_time_ms();
MPI_Barrier(comm);
times_parallel[i] = (double)(end - begin);
}
printf("Parallel on rank %d:\n\tmin: %f ms\n\tmax: %f ms\n\tmean: %f ms\n\n", myrank, min(times_parallel, samples), max(times_parallel, samples), mean(times_parallel, samples));
double *times_on_root = (double *)malloc(sizeof(double) * samples);
printf("running on_root...\n");
for (int i = 0; i < samples; i++) {
begin = get_time_ms();
on_root("data_on_root.h5", M, A, comm, info);
end = get_time_ms();
MPI_Barrier(comm);
times_on_root[i] = (double)(end - begin);
}
printf("On_root on rank %d:\n\tmin: %f ms\n\tmax: %f ms\n\tmean: %f ms\n\n", myrank, min(times_on_root, samples), max(times_on_root, samples), mean(times_on_root, samples));
MPI_Finalize();
return 0;
}

julia code
using MPI
using HDF5
using BenchmarkTools
MPI.Init()
comm = MPI.COMM_WORLD
Nproc = MPI.Comm_size(comm)
myrank = MPI.Comm_rank(comm)
println("rank $myrank")
M = 100000
A = rand(M) # local data
# true parallel writing slices
function parallel(filename, A, comm)
M = size(A, 1)
myrank = MPI.Comm_rank(comm)
Nproc = MPI.Comm_size(comm)
h5open(filename, "w", comm) do file
# Create dataset
dset = create_dataset(file, "/A", datatype(eltype(A)), dataspace((M * Nproc,)))
# Write local data
slice = myrank * M + 1:(myrank + 1) * M
dset[slice] = A
end
end
# poor man's version
function on_root(filename, A, comm)
M = size(A, 1)
myrank = MPI.Comm_rank(comm)
Nproc = MPI.Comm_size(comm)
root = 0
# Send data to root
if myrank != root
MPI.Gatherv!(A, nothing, root, comm)
else
h5open(filename, "w") do file
# Create dataset
dset = create_dataset(file, "/A", datatype(eltype(A)), dataspace((M * Nproc,)))
# Receive data
recv = Vector{eltype(A)}(undef, (M * Nproc,))
MPI.Gatherv!(A, MPI.VBuffer(recv, fill(M, Nproc)), root, comm)
# Write local data
dset[:] = recv
end
end
end
samples = 100
println("running parallel...")
t1 = @benchmark parallel("data_parallel.h5", A, comm) samples=samples
println(IOContext(stdout, :compact => false), t1)
println("running on_root...")
t2 = @benchmark on_root("data_on_root.h5", A, comm) samples=samples
println(IOContext(stdout, :compact => false), t2)
#=using Profile, ProfileView
function doit_parallel(n, A, comm)
for i in 1:n
parallel("data_parallel.h5", A, comm)
end
end
function doit_on_root(n, A, comm)
for i in 1:n
on_root("data_on_root.h5", A, comm)
end
end
Profile.clear(); @profile doit_parallel(10, A, comm); ProfileView.view(windowname="parallel $myrank")
Profile.clear(); @profile doit_on_root(10, A, comm); ProfileView.view(windowname="on_root $myrank");=#
Thanks! Can you try to run the examples on roci as well? And I wouldn't run them with tmpi; I'd run them as a plain script to avoid any tmpi influence. You just have to discard the first invocation (since it includes compile time). But from the current results, do I understand correctly that in the MWE the parallel HDF5 case is slightly faster than the poor man's version, as one would expect? How big is the difference between Julia and C?
I also ran the C MWE on roci. Here, the poor man's version is even faster than the parallel one (using 4 ranks, the parallel version needs ~3.6 ms and the poor man's version ~2.8 ms; using 16 ranks, it is 16.1 ms parallel vs. 11.59 ms for the poor man's version). Whether I use …
If you don't have any other suggestions, @sloede, I think we can proceed with this. There does not seem to be a huge difference for small-scale applications, while real parallel I/O is required for large-scale simulations. Thus, this seems to be a net win to me.
When I tested this, I tried the elixir under …, which fails with:

ERROR: LoadError: type NamedTuple has no field mpi_cache
Stacktrace:
[1] getproperty
@ ./Base.jl:37 [inlined]
[2] nelementsglobal
@ ~/.julia/dev/Trixi.jl/src/solvers/dg.jl:444 [inlined]
[3] ndofsglobal
@ ~/.julia/dev/Trixi.jl/src/solvers/dg.jl:387 [inlined]
[...]

I didn't investigate further, since it also occurs on main and seems to be unrelated to this PR. Is this a known issue or should I create one? I also noticed that there are no MPI tests for the 3D …
@sloede IIRC, MPI on the |
Co-authored-by: Hendrik Ranocha <ranocha@users.noreply.github.com>
Do we have some tests for this? |
As discussed here I only tested the true parallel I/O locally, but not in CI. For this, I ran |
Great, thanks! Did you open the issue as discussed there? |
CI looks good. There is just the upstream error …
Could you please update the filter for these warnings (Lines 152 to 153 in 153ef65) accordingly?
Done in #1486. |
Thanks a lot! Good to go from my side. It would be great if @sloede could have a look, too
Correct. The parallel 2D TreeMesh is already implemented so inefficiently (by myself 😬) that I didn't think it would be worthwhile to make it work for 3D. @JoshuaLampert Could you maybe create an issue that we should add a more meaningful error message in case someone tries to run TreeMesh3D in parallel? |
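Such a check could look roughly like the following sketch (purely illustrative; the function name and the way the mesh dimension and MPI state are queried are hypothetical, not actual Trixi.jl code):
# Hypothetical early check: raise a clear error instead of a cryptic
# field-access failure when a 3D TreeMesh is combined with MPI parallelism.
function check_tree_mesh_mpi_support(mesh_dimension::Integer, mpi_parallel::Bool)
    if mpi_parallel && mesh_dimension == 3
        error("TreeMesh does not support 3D MPI-parallel runs yet. " *
              "Please use a 2D TreeMesh or a P4estMesh instead.")
    end
    return nothing
end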
LGTM overall, just one suggestion.
Co-authored-by: Michael Schlottke-Lakemper <michael@sloede.com>
This is a first draft for parallel input and output with MPI using parallel HDF5.jl, see #1332. Currently, this only implements parallel writing of the solution files if parallel HDF5 is available. Parallel writing and reading restart files will follow.
Notice that in order to use parallel I/O you need to set MPIPreferences and therefore also the preference for P4est.jl, even if you don't want to use a system-provided p4est or even p4est at all. This is not that big of a problem, I guess, but it is not the most user-friendly option. However, I don't know yet how to circumvent this issue. I am open to a discussion.
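For reference, a rough sketch of this setup step (hedged: MPIPreferences.use_system_binary() is MPI.jl's documented switch to a system MPI; the exact P4est.jl preference key and library path must be checked against its documentation):
# One-time configuration in the project environment; restart Julia afterwards.
using MPIPreferences
MPIPreferences.use_system_binary()  # let MPI.jl use the system MPI library
# The corresponding preference for P4est.jl (path to a libp4est built against the
# same MPI) has to be set as well -- see the P4est.jl/Trixi.jl docs for the key name.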
Initial benchmarks indicate that something is still not working properly (although HDF5 successfully recognizes the local hdf5 library with MPI support) since the time for writing the solution files increases significantly (see below for running
mpiexecjl -n 4 --project julia --project examples/tree_2d_dgsem/elixir_advection_basic.jl
without and with parallel HDF5 support). I am trying to figure out what's the matter. Any ideas are welcome.

Running without parallel HDF5 enabled (as it was before)
Running with parallel HDF5 enabled