
Solver crashes when using ASDF output with a large number of stations #531

Closed
bschuber opened this issue Oct 5, 2016 · 14 comments
bschuber commented Oct 5, 2016

The solver seems to have problems handling ASDF output for very large numbers of stations. Simulations with 2,000 stations in DATA/STATIONS run fine, while simulations with 5,000 stations fail with the HDF5 error messages below. A typical use case would be ~50,000 stations, so it would be great if ASDF output could be used at that scale. A somewhat smaller example STATIONS file with ~7,000 stations, which also triggers the error, is attached.
STATIONS_example.txt
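For anyone trying to reproduce this at different station counts, a dummy STATIONS file of arbitrary size can be generated with a few lines of Python. The six-column layout assumed here (station, network, latitude, longitude, elevation, burial) follows the usual SPECFEM STATIONS convention; the coordinates are synthetic:

```python
import math

def write_stations(path, n):
    """Write n dummy stations on a rough lat/lon grid.

    Assumed column order (station, network, latitude, longitude,
    elevation, burial) matches the usual SPECFEM STATIONS convention.
    """
    side = int(math.ceil(math.sqrt(n)))
    with open(path, "w") as f:
        for i in range(n):
            lat = -80.0 + 160.0 * (i // side) / side
            lon = -180.0 + 360.0 * (i % side) / side
            f.write("S%06d XX %9.4f %9.4f 0.0 0.0\n" % (i, lat, lon))

# e.g. a file big enough to trigger the failure reported above
write_stations("STATIONS_test", 5000)
```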

Environment in which the error occurred:

  • SPECFEM3D_GLOBE devel (commit 65089ee)
  • Compilers: Intel (15.0), Intel MPI (5.1)
  • HDF5 version 1.8.14
  • asdf-library master (commit dd669db0da634c9e1573d09e41d1fd0186032fc5)

HDF5-DIAG: Error detected in HDF5 (1.8.14) MPI-process 430:
#000: H5G.c line 314 in H5Gcreate2(): unable to create group
major: Symbol table
minor: Unable to initialize object
#1: H5Gint.c line 194 in H5G__create_named(): unable to create and link to group
major: Symbol table
minor: Unable to initialize object
#2: H5L.c line 1638 in H5L_link_object(): unable to create new link to object
major: Links
minor: Unable to initialize object
#3: H5L.c line 1882 in H5L_create_real(): can't insert link
major: Symbol table
minor: Unable to insert object
#4: H5Gtraverse.c line 861 in H5G_traverse(): internal path traversal failed
major: Symbol table
minor: Object not found
#5: H5Gtraverse.c line 596 in H5G_traverse_real(): can't look up component
major: Symbol table
minor: Object not found
#6: H5Gobj.c line 1145 in H5G__obj_lookup(): can't locate object
major: Symbol table
minor: Object not found
#7: H5Gdense.c line 574 in H5G__dense_lookup(): unable to locate link in name index
major: Symbol table
minor: Unable to insert object
#8: H5B2.c line 504 in H5B2_find(): unable to protect B-tree leaf node
major: B-Tree node
minor: Unable to protect metadata
#9: H5B2int.c line 1821 in H5B2_protect_leaf(): unable to protect B-tree leaf node
major: B-Tree node
minor: Unable to protect metadata
#10: H5AC.c line 1320 in H5AC_protect(): H5C_protect() failed.
major: Object cache
minor: Unable to protect metadata
#11: H5C.c line 3574 in H5C_protect(): can't load entry
major: Object cache
minor: Unable to load metadata into cache
#12: H5C.c line 7954 in H5C_load_entry(): unable to load entry
major: Object cache
minor: Unable to load metadata into cache
#13: H5B2cache.c line 874 in H5B2__cache_leaf_load(): wrong B-tree leaf node signature
major: B-Tree node
minor: Unable to load metadata into cache
HDF5-DIAG: Error detected in HDF5 (1.8.14) MPI-process 461:
#000: H5D.c line 165 in H5Dcreate2(): not a location ID
major: Invalid arguments to routine
minor: Inappropriate type
#1: H5Gloc.c line 253 in H5G_loc(): invalid object ID
major: Invalid arguments to routine
minor: Bad value
HDF5-DIAG: Error detected in HDF5 (1.8.14) MPI-process 461:
#000: H5D.c line 165 in H5Dcreate2(): not a location ID
major: Invalid arguments to routine
minor: Inappropriate type
#1: H5Gloc.c line 253 in H5G_loc(): invalid object ID
major: Invalid arguments to routine
minor: Bad value
HDF5-DIAG: Error detected in HDF5 (1.8.14) MPI-process 461:
#000: H5D.c line 165 in H5Dcreate2(): not a location ID
major: Invalid arguments to routine
minor: Inappropriate type
#1: H5Gloc.c line 253 in H5G_loc(): invalid object ID
major: Invalid arguments to routine
minor: Bad value
HDF5-DIAG: Error detected in HDF5 (1.8.14) MPI-process 461:
#000: H5D.c line 165 in H5Dcreate2(): not a location ID
major: Invalid arguments to routine
minor: Inappropriate type
#1: H5Gloc.c line 253 in H5G_loc(): invalid object ID
major: Invalid arguments to routine
minor: Bad value

@ghost

ghost commented Oct 13, 2016

This may be related to the memory leak that I saw when I profiled the performance. It only affects runs with a very large number of stations; the memory use blows up around 5,000 stations. I meant to test the latest version of hdf5 (1.8.17), but I can't remember if I did. I seem to remember having issues compiling the library on the local cluster.

@bschuber
Author

Hi James,

I used hdf5 versions 1.8.17 and also 1.10.0 yesterday, but got the same error as with 1.8.14.

The memory blow-up seems to be related to the allocation of very large arrays (e.g., station_grps_gather, station_xml_gather, data_ids) whose size depends on NPROC. In other words, the error shows up at different numbers of stations when running fewer or more processes.

Bernhard
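The NPROC-dependent blow-up described above can be illustrated with a back-of-the-envelope sketch. The buffer model and the 4096-byte per-station cost are illustrative assumptions, not numbers taken from the actual ASDF code:

```python
# Toy model of a master-side gather buffer whose footprint depends on both
# NPROC and the number of stations: the master allocates one slot per rank,
# each padded to the largest per-rank chunk. All constants here are
# illustrative assumptions, not measured from the ASDF code.

def gather_buffer_bytes(nproc, nstations_total, bytes_per_station=4096):
    per_rank = -(-nstations_total // nproc)      # ceil division
    return nproc * per_rank * bytes_per_station  # master-side allocation

# More ranks means more padding per slot, so the same station count costs
# more memory at higher NPROC, which shifts the station count at which the
# allocation starts to fail.
cost_96 = gather_buffer_bytes(nproc=96, nstations_total=2000)
cost_192 = gather_buffer_bytes(nproc=192, nstations_total=2000)
```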



@ghost

ghost commented Oct 14, 2016

Thanks for testing the latest hdf5.

My plan to resolve this issue had been to implement Lion's suggestion: write out the metadata in serial as a skeleton file, close the file, then reopen it in parallel and write the waveforms. I had issues getting the hdf5 library to build with both serial and parallel options when I was working on this a few months ago, so I never got very far in seeing whether this idea would work.
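The skeleton-file idea can be sketched with h5py, used here purely as a stand-in for the ASDF C library; the group and dataset names and the two-station layout are illustrative, not the actual ASDF schema:

```python
import h5py
import numpy as np

# Phase 1 (serial, e.g. rank 0 only): create the complete file layout,
# groups and fixed-size datasets, without writing any waveform samples,
# then close the file.
with h5py.File("skeleton.h5", "w") as f:
    wave = f.create_group("Waveforms")
    for sta in ("XX.S0001", "XX.S0002"):
        grp = wave.create_group(sta)
        grp.create_dataset("displacement", shape=(100,), dtype="f4")

# Phase 2 (in the real scheme every rank would reopen collectively with an
# MPI file driver, e.g. h5py.File(..., driver="mpio", comm=comm); shown
# serially here) and fill in only the data it owns.
with h5py.File("skeleton.h5", "r+") as f:
    f["Waveforms/XX.S0001/displacement"][:] = np.arange(100, dtype="f4")
```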

Another idea would be to create a separate communicator group of MPI processes, but this is a bit messy, and I believe SPECFEM already has at least one of these in the code, so maybe not a good idea.

@bschuber
Author

I have been looking into the first option (writing the metadata in serial by the master proc). By now, I have a version that runs with ~40,000 stations (so about 120,000 seismograms), but it is awfully slow: it took more than 20 minutes to write in parallel. I therefore also implemented the serial WRITE_SEISMOGRAMS_BY_MASTER version, which is much faster (less than 2 minutes).

I have tried to figure out why the parallel writing with HDF5 is so slow, and have come to the conclusion that it is not the actual writing of the data that takes so long, but the opening of the many different HDF5 objects (e.g., the station groups and waveform datasets). I then tried to get an increase in IO performance using the H5Pset_all_coll_metadata_ops flag. This, however, is only available in HDF5 version 1.10, and write_output_ASDF() does not run with this version at all on the HPC system I am working on (LRZ SuperMUC). The code stops after the first call to the ASDF library, complaining about a wrong file identifier. Whether this is a problem of the system or the code itself, I do not know; maybe someone more familiar with HDF5 can take a look. If someone wants to try, my version of specfem3d_globe with the new ASDF output and the correspondingly adapted asdf-library can be found on GitHub:

https://github.com/bschuber/specfem3d_globe
https://github.com/bschuber/asdf-library

Bernhard

@ghost

ghost commented Nov 16, 2016

Hi Bernhard,
Thanks for doing this. I profiled the parallel writer and also found the writing itself to be pretty quick, but defining everything in memory (which has to be done collectively for stations and waveforms) was slow. I will try to look into this issue again, but probably won't have time until next year.
I have a feeling this may require redefining the container for performance reasons for the solver.
James

@bschuber
Author

Hi James,

thanks for your reply. There is no immediate rush on this issue right now, as I have a running version, and with WRITE_SEISMOGRAMS_BY_MASTER = .true. the IO speed is in a reasonable range.

Cheers,

Bernhard

@komatits
Contributor

Is this solved?

@komatits
Contributor

(if so please close the issue)

@komatits komatits added the bug label Mar 18, 2017
@bschuber
Author

bschuber commented Mar 21, 2017 via email

@komatits
Contributor

komatits commented Mar 22, 2017 via email

@mpbl

mpbl commented Apr 5, 2017

Got a workaround from @krischer. @Jas11 is working on it in SeismicData/asdf-library#19

@komatits
Contributor

komatits commented Apr 5, 2017 via email

@komatits
Contributor

komatits commented Jun 30, 2017

Hi all,

Please let me know if/when I can close this Git issue.
cc'ing @mpbl

Thanks,
Dimitri.

@komatits
Contributor

komatits commented Jul 6, 2017

No answer, I assume this is now fixed. Closing it.

@komatits komatits closed this as completed Jul 6, 2017