Solver crashes when using ASDF output with a large number of stations #531
This may be related to the memory leak that I saw when I profiled the performance. It only affects runs with a very large number of stations. The memory use blows up around 5000 stations. I meant to test the latest version of hdf5 (1.8.17), but I can't remember if I did or not. I seem to remember having issues compiling the library on the local cluster.
Hi James, I used hdf5 versions 1.8.17 and also 1.10.0 yesterday, but got the same errors. The memory blow-up seems to be related to the allocation of very large HDF5 structures.
Bernhard
Thanks for testing the latest hdf5. My plan to resolve this issue had been to implement Lion's suggestion of writing out the metadata in serial as a skeleton file, then closing the file, reopening it in parallel, and writing the waveforms. I had issues getting the hdf5 library to build with both serial and parallel options when I was working on this a few months ago, so I never got very far in seeing whether this idea would work. Another idea would be to create another communicator group for the MPI processes, but this is a bit messy, and SPECFEM already has at least one of these in the code, I believe, so it is maybe not a good idea.
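A minimal sketch of that two-phase idea (rank 0 writes the HDF5 skeleton serially, then all ranks reopen the file with MPI-IO), assuming an HDF5 build with parallel support; the file name, group layout, and dataset handling below are placeholders rather than the actual ASDF layout:

/* Two-phase write: serial skeleton first, then a parallel reopen. */
#include <hdf5.h>
#include <mpi.h>

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  /* Phase 1: rank 0 creates the file and its group/dataset skeleton serially. */
  if (rank == 0) {
    hid_t file = H5Fcreate("synthetics.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
    hid_t grp  = H5Gcreate2(file, "/Waveforms", H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
    /* ... create one group/dataset per station here (metadata only) ... */
    H5Gclose(grp);
    H5Fclose(file);
  }
  MPI_Barrier(MPI_COMM_WORLD);  /* make sure the skeleton exists before reopening */

  /* Phase 2: all ranks reopen the finished skeleton with MPI-IO. */
  hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
  H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);
  hid_t file = H5Fopen("synthetics.h5", H5F_ACC_RDWR, fapl);
  /* ... each rank opens its own datasets and writes its traces ... */
  H5Fclose(file);
  H5Pclose(fapl);

  MPI_Finalize();
  return 0;
}

The point is that all object creation happens in ordinary serial code on one rank, and the parallel phase only opens existing objects and writes data into them.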
I have been looking into the first option (writing the metadata in serial as a skeleton file). My changes can be found in my fork: https://github.com/bschuber/specfem3d_globe
Bernhard
Hi Bernhard,
Hi James, thanks for your reply. There is no immediate rush on this issue right now. Cheers, Bernhard
Is this solved?
(if so please close the issue) |
Hi Dimitri,
I had worked on this problem in my own fork, but only arrived at a
partial (still viable) solution. It works fine to write output for many
thousands of stations, but only when the parameter
WRITE_SEISMOGRAMS_BY_MASTER is set to ".true." (so no parallel I/O,
which actually runs awfully slowly: > 20 minutes for ~40,000 stations).
Using the master proc alone, it takes less than 2 minutes. According to
James Smith, defining the HDF5 structures in memory (which has to be
done collectively for stations and waveforms) is the problematic step
that makes the parallel I/O so slow.
For my code to run, I had to slightly modify the ASDF library.
Therefore, I think it would be good if someone from the ASDF developers
could take a look at my changes and whether there is any possibility to
speed up the parallel I/O. My codes can be found at:
https://github.com/bschuber/specfem3d_globe
https://github.com/bschuber/asdf-library
Best wishes,
Bernhard
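To make the bottleneck described above concrete: on a file opened with an MPI-IO file access property list, HDF5 object creation (H5Gcreate2 / H5Dcreate2) is a collective operation, so a layout with one dataset per station forces every rank to take part in tens of thousands of metadata calls. The sketch below only illustrates that pattern, it is not the ASDF library code; names and sizes are placeholders.

#include <hdf5.h>
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);

  /* Open one shared file with MPI-IO; all subsequent object creation
   * on this file is collective across MPI_COMM_WORLD. */
  hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
  H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);
  hid_t file = H5Fcreate("asdf_demo.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);
  hid_t grp  = H5Gcreate2(file, "/Waveforms", H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

  const int nstations = 5000;   /* placeholder station count */
  hsize_t nsamples = 10000;     /* placeholder trace length  */
  hid_t space = H5Screate_simple(1, &nsamples, NULL);

  /* Every rank must execute every iteration: one collective dataset
   * creation per station, which is what makes this scale so badly. */
  for (int i = 0; i < nstations; i++) {
    char name[64];
    snprintf(name, sizeof(name), "/Waveforms/ST%05d", i);
    hid_t dset = H5Dcreate2(file, name, H5T_NATIVE_FLOAT, space,
                            H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
    H5Dclose(dset);
  }

  H5Sclose(space);
  H5Gclose(grp);
  H5Fclose(file);
  H5Pclose(fapl);
  MPI_Finalize();
  return 0;
}

Setting WRITE_SEISMOGRAMS_BY_MASTER = .true. in DATA/Par_file sidesteps this path: only the master rank opens the file, so the per-station metadata calls are local operations rather than collectives, which is consistent with the < 2 minutes versus > 20 minutes timings reported above.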
Hi Bernhard,
Great, thank you very much. Very useful!
Thus let me cc Matthieu and Lion (please do not hesitate to forward to
other people if needed).
Best wishes,
Dimitri.
Got a workaround from @krischer. @Jas11 is working on it in SeismicData/asdf-library#19.
Perfect! Please remember to close the GitHub issue when the problem is fixed.
Hi all, please let me know if/when I can close this GitHub issue. Thanks.
No answer, so I assume this is now fixed. Closing it.
The solver seems to have problems in handling the ASDF output for very large numbers of stations. Simulations with 2000 stations in DATA/STATIONS run fine, while simulations with 5000 stations fail with HDF5 error messages (see below). A typical use case would be ~50,000 stations and it would be great if ASDF output could be used for that. A somewhat smaller example STATIONS file with ~7000 stations is attached, for which the error also showed up.
STATIONS_example.txt
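For reference, a DATA/STATIONS file lists one station per line with the columns station name, network code, latitude (degrees), longitude (degrees), elevation (m), and burial depth (m); the values below are made up and only illustrate the expected layout (compare with the attached STATIONS_example.txt):

AAK   II  42.6390   74.4940  1645.0  30.0
STA0  XX  10.1234  -20.5678     0.0   0.0
STA1  XX  11.2345  -21.6789     0.0   0.0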
Environment in which the error occurred:
HDF5-DIAG: Error detected in HDF5 (1.8.14) MPI-process 430:
#000: H5G.c line 314 in H5Gcreate2(): unable to create group
major: Symbol table
minor: Unable to initialize object
#1: H5Gint.c line 194 in H5G__create_named(): unable to create and link to group
major: Symbol table
minor: Unable to initialize object
#2: H5L.c line 1638 in H5L_link_object(): unable to create new link to object
major: Links
minor: Unable to initialize object
#3: H5L.c line 1882 in H5L_create_real(): can't insert link
major: Symbol table
minor: Unable to insert object
#4: H5Gtraverse.c line 861 in H5G_traverse(): internal path traversal failed
major: Symbol table
minor: Object not found
#5: H5Gtraverse.c line 596 in H5G_traverse_real(): can't look up component
major: Symbol table
minor: Object not found
#6: H5Gobj.c line 1145 in H5G__obj_lookup(): can't locate object
major: Symbol table
minor: Object not found
#7: H5Gdense.c line 574 in H5G__dense_lookup(): unable to locate link in name index
major: Symbol table
minor: Unable to insert object
#8: H5B2.c line 504 in H5B2_find(): unable to protect B-tree leaf node
major: B-Tree node
minor: Unable to protect metadata
#9: H5B2int.c line 1821 in H5B2_protect_leaf(): unable to protect B-tree leaf node
major: B-Tree node
minor: Unable to protect metadata
#10: H5AC.c line 1320 in H5AC_protect(): H5C_protect() failed.
major: Object cache
minor: Unable to protect metadata
#11: H5C.c line 3574 in H5C_protect(): can't load entry
major: Object cache
minor: Unable to load metadata into cache
#12: H5C.c line 7954 in H5C_load_entry(): unable to load entry
major: Object cache
minor: Unable to load metadata into cache
#13: H5B2cache.c line 874 in H5B2__cache_leaf_load(): wrong B-tree leaf node signature
major: B-Tree node
minor: Unable to load metadata into cache
HDF5-DIAG: Error detected in HDF5 (1.8.14) MPI-process 461:
#000: H5D.c line 165 in H5Dcreate2(): not a location ID
major: Invalid arguments to routine
minor: Inappropriate type
#1: H5Gloc.c line 253 in H5G_loc(): invalid object ID
major: Invalid arguments to routine
minor: Bad value
HDF5-DIAG: Error detected in HDF5 (1.8.14) MPI-process 461:
#000: H5D.c line 165 in H5Dcreate2(): not a location ID
major: Invalid arguments to routine
minor: Inappropriate type
#1: H5Gloc.c line 253 in H5G_loc(): invalid object ID
major: Invalid arguments to routine
minor: Bad value
HDF5-DIAG: Error detected in HDF5 (1.8.14) MPI-process 461:
#000: H5D.c line 165 in H5Dcreate2(): not a location ID
major: Invalid arguments to routine
minor: Inappropriate type
#1: H5Gloc.c line 253 in H5G_loc(): invalid object ID
major: Invalid arguments to routine
minor: Bad value
HDF5-DIAG: Error detected in HDF5 (1.8.14) MPI-process 461:
#000: H5D.c line 165 in H5Dcreate2(): not a location ID
major: Invalid arguments to routine
minor: Inappropriate type
#1: H5Gloc.c line 253 in H5G_loc(): invalid object ID
major: Invalid arguments to routine
minor: Bad value