Slow Green function checkpointing on large setups risks unusable gf file #73

Open
Thomas-Ulrich opened this issue Jul 17, 2024 · 6 comments
Labels
bug Something isn't working

Comments

@Thomas-Ulrich
Collaborator

Describe the bug
I'm running BP5.toml on the branch from #72 (at commit ee87ac9),
which is a few commits on top of #59.

I changed res_f to 5 to get a very small mesh for testing.
In BP5.toml, I added:

[gf_checkpoint]
prefix = "GreensFunctions/bp6_hf250"
freq_cputime = 0.01

so that the Green's functions are checkpointed after every newly computed Green's function.
This generally works,
but several times the run could not restart from the checkpoint.
For example, here is a job killed during GF generation:

num_nodes: 1 ntasks: 48

               ___          ___         _____         ___          ___
      ___     /  /\        /__/\       /  /::\       /  /\        /__/\
     /  /\   /  /::\       \  \:\     /  /:/\:\     /  /:/_      |  |::\
    /  /:/  /  /:/\:\       \  \:\   /  /:/  \:\   /  /:/ /\     |  |:|:\
   /  /:/  /  /:/~/::\  _____\__\:\ /__/:/ \__\:| /  /:/ /:/_  __|__|:|\:\
  /  /::\ /__/:/ /:/\:\/__/::::::::\\  \:\ /  /://__/:/ /:/ /\/__/::::| \:\
 /__/:/\:\\  \:\/:/__\/\  \:\~~\~~\/ \  \:\  /:/ \  \:\/:/ /:/\  \:\~~\__\/
 \__\/  \:\\  \::/      \  \:\  ~~~   \  \:\/:/   \  \::/ /:/  \  \:\
      \  \:\\  \:\       \  \:\        \  \::/     \  \:\/:/    \  \:\
       \__\/ \  \:\       \  \:\        \__\/       \  \::/      \  \:\
              \__\/        \__\/                     \__\/        \__\/

                          tandem version ee87ac9

                        stack size limit = 2048 MiB

                              Worker affinity
    0---------|----------|----------|----------|--------8-|----------|
    ----------|----------|----------|------


Multigrid P-levels: 1 2
Using GF checkpoint path: GreensFunctions/bp6_hf250
create_discrete_greens_function()
Green's function operator size: 8856 x 5904
GF loaded was created with commsize matching current (48).
load_discrete_greens_operator() 1.95e+00 (sec)
  status: loaded 7 / pending 5897
partial_assemble_discrete_greens_function() [7 , 5904)
Computing Green's function 7 / 5904
write_discrete_greens_operator():matrix 3.47e+00 (sec)
  status: computed 8 / pending 5896
write_discrete_greens_operator():facets 8.07e-03 (sec)
Computing Green's function 8 / 5904
write_discrete_greens_operator():matrix 3.39e+00 (sec)
  status: computed 9 / pending 5895
write_discrete_greens_operator():facets 6.91e-03 (sec)
Computing Green's function 9 / 5904
slurmstepd: error: *** STEP 3451849.0 ON i01r01c04s04 CANCELLED AT 2024-07-17T10:05:43 ***
[0]PETSC ERROR: ------------------------------------------------------------------------
[0]PETSC ERROR: Caught signal number 15 Terminate: Some process (or the batch system) has told this process to end

The next job then fails:

num_nodes: 1 ntasks: 48

               ___          ___         _____         ___          ___
      ___     /  /\        /__/\       /  /::\       /  /\        /__/\
     /  /\   /  /::\       \  \:\     /  /:/\:\     /  /:/_      |  |::\
    /  /:/  /  /:/\:\       \  \:\   /  /:/  \:\   /  /:/ /\     |  |:|:\
   /  /:/  /  /:/~/::\  _____\__\:\ /__/:/ \__\:| /  /:/ /:/_  __|__|:|\:\
  /  /::\ /__/:/ /:/\:\/__/::::::::\\  \:\ /  /://__/:/ /:/ /\/__/::::| \:\
 /__/:/\:\\  \:\/:/__\/\  \:\~~\~~\/ \  \:\  /:/ \  \:\/:/ /:/\  \:\~~\__\/
 \__\/  \:\\  \::/      \  \:\  ~~~   \  \:\/:/   \  \::/ /:/  \  \:\
      \  \:\\  \:\       \  \:\        \  \::/     \  \:\/:/    \  \:\
       \__\/ \  \:\       \  \:\        \__\/       \  \::/      \  \:\
              \__\/        \__\/                     \__\/        \__\/

                          tandem version ee87ac9

                        stack size limit = 2048 MiB

                              Worker affinity
    0---------|----------|----------|----------|--------8-|----------|
    ----------|----------|----------|------


Multigrid P-levels: 1 2 
Using GF checkpoint path: GreensFunctions/bp6_hf250
create_discrete_greens_function()
Green's function operator size: 8856 x 5904
GF loaded was created with commsize matching current (48).
[0]PETSC ERROR: --------------------- Error Message --------------------------------------------------------------
[0]PETSC ERROR: Read from file failed
[0]PETSC ERROR: Read past end of file
[0]PETSC ERROR: WARNING! There are unused option(s) set! Could be the program crashed before usage or a spelling mistake, etc!
[0]PETSC ERROR:   Option left: name:-mg_coarse_ksp_rtol value: 1.0e-1 source: command line
[0]PETSC ERROR:   Option left: name:-mg_coarse_ksp_type value: cg source: command line
[0]PETSC ERROR:   Option left: name:-mg_coarse_pc_type value: gamg source: command line
[0]PETSC ERROR:   Option left: name:-mg_levels_ksp_max_it value: 4 source: command line
[0]PETSC ERROR:   Option left: name:-mg_levels_ksp_type value: cg source: command line
[0]PETSC ERROR:   Option left: name:-mg_levels_pc_type value: bjacobi source: command line
[0]PETSC ERROR: See https://petsc.org/release/faq/ for trouble shooting.
[0]PETSC ERROR: Petsc Release Version 3.20.1, Oct 31, 2023 
[0]PETSC ERROR: --petsc on a  named i01r01c04s04 by di73yeq4 Wed Jul 17 10:07:21 2024
[0]PETSC ERROR: Configure options --prefix=/hppfs/work/pn49ha/di73yeq4/user_spack23.1/linux-sles15-skylake_avx512/petsc/3.20.1-gcc-12.2.0-vlbrevt --with-ssl=0 --download-c2html=0 --download-sowing=0 --download-hwloc=0 --with-make-exec=make --with-cc=/dss/lrzsys/sys/spack/release/23.1.0/opt/skylake_avx512/intel-oneapi-mpi/2021.9.0-gcc-xizuusf/mpi/2021.9.0/bin/mpiicc --with-cxx=/dss/lrzsys/sys/spack/release/23.1.0/opt/skylake_avx512/intel-oneapi-mpi/2021.9.0-gcc-xizuusf/mpi/2021.9.0/bin/mpiicpc --with-fc=/dss/lrzsys/sys/spack/release/23.1.0/opt/skylake_avx512/intel-oneapi-mpi/2021.9.0-gcc-xizuusf/mpi/2021.9.0/bin/mpiifort --with-precision=double --with-scalar-type=real --with-shared-libraries=1 --with-debugging=0 --with-openmp=0 --with-64-bit-indices=1 --with-blaslapack-lib="/dss/lrzsys/sys/spack/release/23.1.0/opt/skylake_avx512/intel-oneapi-mkl/2023.1.0-gcc-3x7vfpz/mkl/2023.1.0/lib/intel64/libmkl_scalapack_lp64.so /dss/lrzsys/sys/spack/release/23.1.0/opt/skylake_avx512/intel-oneapi-mkl/2023.1.0-gcc-3x7vfpz/mkl/2023.1.0/lib/intel64/libmkl_cdft_core.so /dss/lrzsys/sys/spack/release/23.1.0/opt/skylake_avx512/intel-oneapi-mkl/2023.1.0-gcc-3x7vfpz/mkl/2023.1.0/lib/intel64/libmkl_intel_lp64.so /dss/lrzsys/sys/spack/release/23.1.0/opt/skylake_avx512/intel-oneapi-mkl/2023.1.0-gcc-3x7vfpz/mkl/2023.1.0/lib/intel64/libmkl_sequential.so /dss/lrzsys/sys/spack/release/23.1.0/opt/skylake_avx512/intel-oneapi-mkl/2023.1.0-gcc-3x7vfpz/mkl/2023.1.0/lib/intel64/libmkl_core.so /dss/lrzsys/sys/spack/release/23.1.0/opt/skylake_avx512/intel-oneapi-mkl/2023.1.0-gcc-3x7vfpz/mkl/2023.1.0/lib/intel64/libmkl_blacs_intelmpi_lp64.so /usr/lib64/libpthread.so /usr/lib64/libm.so /usr/lib64/libdl.so" --with-avx-512-kernels --with-memalign=64 --with-x=0 --with-sycl=0 --with-clanguage=C --with-cuda=0 --with-hip=0 --with-metis=1 --with-metis-include=/dss/lrzsys/sys/spack/release/23.1.0/opt/skylake_avx512/metis/5.1.0-gcc-kougmmh/include 
--with-metis-lib=/dss/lrzsys/sys/spack/release/23.1.0/opt/skylake_avx512/metis/5.1.0-gcc-kougmmh/lib/libmetis.so --with-hypre=1 --with-hypre-include=/hppfs/work/pn49ha/di73yeq4/user_spack23.1/linux-sles15-skylake_avx512/hypre/develop-gcc-12.2.0-ngxzdup/include --with-hypre-lib=/hppfs/work/pn49ha/di73yeq4/user_spack23.1/linux-sles15-skylake_avx512/hypre/develop-gcc-12.2.0-ngxzdup/lib/libHYPRE.so --with-parmetis=1 --with-parmetis-include=/dss/lrzsys/sys/spack/release/23.1.0/opt/skylake_avx512/parmetis/4.0.3-gcc-nypuwzn/include --with-parmetis-lib=/dss/lrzsys/sys/spack/release/23.1.0/opt/skylake_avx512/parmetis/4.0.3-gcc-nypuwzn/lib/libparmetis.so --with-kokkos=0 --with-kokkos-kernels=0 --with-superlu_dist=1 --with-superlu_dist-include=/hppfs/work/pn49ha/di73yeq4/user_spack23.1/linux-sles15-skylake_avx512/superlu-dist/develop-gcc-12.2.0-z2v2xhr/include --with-superlu_dist-lib=/hppfs/work/pn49ha/di73yeq4/user_spack23.1/linux-sles15-skylake_avx512/superlu-dist/develop-gcc-12.2.0-z2v2xhr/lib/libsuperlu_dist.so --with-ptscotch=0 --with-suitesparse=0 --with-hdf5=1 --with-hdf5-include=/dss/lrzsys/sys/spack/release/23.1.0/opt/skylake_avx512/hdf5/1.10.9-gcc-hbsptk3/include --with-hdf5-lib="/dss/lrzsys/sys/spack/release/23.1.0/opt/skylake_avx512/hdf5/1.10.9-gcc-hbsptk3/lib/libhdf5_hl_fortran.so /dss/lrzsys/sys/spack/release/23.1.0/opt/skylake_avx512/hdf5/1.10.9-gcc-hbsptk3/lib/libhdf5_hl_f90cstub.so /dss/lrzsys/sys/spack/release/23.1.0/opt/skylake_avx512/hdf5/1.10.9-gcc-hbsptk3/lib/libhdf5_hl.so /dss/lrzsys/sys/spack/release/23.1.0/opt/skylake_avx512/hdf5/1.10.9-gcc-hbsptk3/lib/libhdf5_fortran.so /dss/lrzsys/sys/spack/release/23.1.0/opt/skylake_avx512/hdf5/1.10.9-gcc-hbsptk3/lib/libhdf5_f90cstub.so /dss/lrzsys/sys/spack/release/23.1.0/opt/skylake_avx512/hdf5/1.10.9-gcc-hbsptk3/lib/libhdf5.so" --with-zlib=1 --with-zlib-include=/dss/lrzsys/sys/spack/release/23.1.0/opt/skylake_avx512/zlib/1.2.13-gcc-p5ywc53/include 
--with-zlib-lib=/dss/lrzsys/sys/spack/release/23.1.0/opt/skylake_avx512/zlib/1.2.13-gcc-p5ywc53/lib/libz.so --with-mumps=1 --with-mumps-include=/hppfs/work/pn49ha/di73yeq4/user_spack23.1/linux-sles15-skylake_avx512/mumps/5.5.1-gcc-12.2.0-g6h6l34/include --with-mumps-lib="/hppfs/work/pn49ha/di73yeq4/user_spack23.1/linux-sles15-skylake_avx512/mumps/5.5.1-gcc-12.2.0-g6h6l34/lib/libdmumps.so /hppfs/work/pn49ha/di73yeq4/user_spack23.1/linux-sles15-skylake_avx512/mumps/5.5.1-gcc-12.2.0-g6h6l34/lib/libzmumps.so /hppfs/work/pn49ha/di73yeq4/user_spack23.1/linux-sles15-skylake_avx512/mumps/5.5.1-gcc-12.2.0-g6h6l34/lib/libsmumps.so /hppfs/work/pn49ha/di73yeq4/user_spack23.1/linux-sles15-skylake_avx512/mumps/5.5.1-gcc-12.2.0-g6h6l34/lib/libcmumps.so /hppfs/work/pn49ha/di73yeq4/user_spack23.1/linux-sles15-skylake_avx512/mumps/5.5.1-gcc-12.2.0-g6h6l34/lib/libmumps_common.so /hppfs/work/pn49ha/di73yeq4/user_spack23.1/linux-sles15-skylake_avx512/mumps/5.5.1-gcc-12.2.0-g6h6l34/lib/libpord.so" --with-trilinos=0 --with-fftw=0 --with-valgrind=0 --with-gmp=0 --with-libpng=0 --with-giflib=0 --with-mpfr=0 --with-netcdf=0 --with-pnetcdf=0 --with-moab=0 --with-random123=0 --with-exodusii=0 --with-cgns=0 --with-memkind=0 --with-p4est=0 --with-saws=0 --with-yaml=0 --with-hwloc=0 --with-libjpeg=0 --with-scalapack=1 --with-scalapack-lib="/dss/lrzsys/sys/spack/release/23.1.0/opt/skylake_avx512/intel-oneapi-mkl/2023.1.0-gcc-3x7vfpz/mkl/2023.1.0/lib/intel64/libmkl_scalapack_lp64.so /dss/lrzsys/sys/spack/release/23.1.0/opt/skylake_avx512/intel-oneapi-mkl/2023.1.0-gcc-3x7vfpz/mkl/2023.1.0/lib/intel64/libmkl_cdft_core.so /dss/lrzsys/sys/spack/release/23.1.0/opt/skylake_avx512/intel-oneapi-mkl/2023.1.0-gcc-3x7vfpz/mkl/2023.1.0/lib/intel64/libmkl_intel_lp64.so /dss/lrzsys/sys/spack/release/23.1.0/opt/skylake_avx512/intel-oneapi-mkl/2023.1.0-gcc-3x7vfpz/mkl/2023.1.0/lib/intel64/libmkl_sequential.so 
/dss/lrzsys/sys/spack/release/23.1.0/opt/skylake_avx512/intel-oneapi-mkl/2023.1.0-gcc-3x7vfpz/mkl/2023.1.0/lib/intel64/libmkl_core.so /dss/lrzsys/sys/spack/release/23.1.0/opt/skylake_avx512/intel-oneapi-mkl/2023.1.0-gcc-3x7vfpz/mkl/2023.1.0/lib/intel64/libmkl_blacs_intelmpi_lp64.so /usr/lib64/libpthread.so /usr/lib64/libm.so /usr/lib64/libdl.so" --with-strumpack=0 --with-mmg=0 --with-parmmg=0 --with-tetgen=0
[0]PETSC ERROR: #1 PetscBinaryRead() at /hppfs/scratch/0A/di73yeq4/tmp/build_stage/spack-stage-petsc-3.20.1-vlbrevtepleszjprszhrmwuv5l6azakr/spack-src/src/sys/fileio/sysio.c:327
[0]PETSC ERROR: #2 PetscViewerBinaryWriteReadAll() at /hppfs/scratch/0A/di73yeq4/tmp/build_stage/spack-stage-petsc-3.20.1-vlbrevtepleszjprszhrmwuv5l6azakr/spack-src/src/sys/classes/viewer/impls/binary/binv.c:1076
[0]PETSC ERROR: #3 PetscViewerBinaryReadAll() at /hppfs/scratch/0A/di73yeq4/tmp/build_stage/spack-stage-petsc-3.20.1-vlbrevtepleszjprszhrmwuv5l6azakr/spack-src/src/sys/classes/viewer/impls/binary/binv.c:1118
[0]PETSC ERROR: #4 MatLoad_Dense_Binary() at /hppfs/scratch/0A/di73yeq4/tmp/build_stage/spack-stage-petsc-3.20.1-vlbrevtepleszjprszhrmwuv5l6azakr/spack-src/src/mat/impls/dense/seq/dense.c:1408
[0]PETSC ERROR: #5 MatLoad_MPIDense() at /hppfs/scratch/0A/di73yeq4/tmp/build_stage/spack-stage-petsc-3.20.1-vlbrevtepleszjprszhrmwuv5l6azakr/spack-src/src/mat/impls/dense/mpi/mpidense.c:1900
[0]PETSC ERROR: #6 MatLoad() at /hppfs/scratch/0A/di73yeq4/tmp/build_stage/spack-stage-petsc-3.20.1-vlbrevtepleszjprszhrmwuv5l6azakr/spack-src/src/mat/interface/matrix.c:1339
[0]PETSC ERROR: #7 load_discrete_greens_operator() at /hppfs/scratch/0A/di73yeq4/tmp/build_stage/spack-stage-tandem-tscp-omrqpkb5k5ca6s67eap67wcvpa5xijea/spack-src/app/form/SeasQDDiscreteGreenOperator.cpp:512
terminate called after throwing an instance of 'tndm::petsc_error'

I noticed similar issues on kernelpanic.

Expected behavior
The Green's function generation should have restarted from the checkpoint.

To Reproduce
Steps to reproduce the behavior:
Installed with Spack on SuperMUC-NG with:
spack install -j 30 tandem@tscp polynomial_degree=2 domain_dimension=3

Here is the list of tandem's dependencies and their specs:

di73yeq4@login03:/hppfs/work/pn49ha/di73yeq4/tandem/examples/tandem/3d> spack spec -I  tandem@tscp polynomial_degree=2 domain_dimension=3


Input spec
--------------------------------
 -   tandem@tscp domain_dimension=3 polynomial_degree=2

Concretized
--------------------------------
 -   tandem@tscp%gcc@12.2.0~cuda~ipo~libxsmm~python~rocm build_system=cmake build_type=Release domain_dimension=3 generator=make min_quadrature_order=0 polynomial_degree=2 arch=linux-sles15-skylake_avx512
[^]      ^cmake@3.26.3%gcc@12.2.0~doc+ncurses+ownlibs~qt build_system=generic build_type=Release arch=linux-sles15-skylake_avx512
[^]          ^ncurses@6.4%gcc@12.2.0~symlinks+termlib abi=none build_system=autotools arch=linux-sles15-skylake_avx512
[^]              ^pkgconf@1.8.0%gcc@12.2.0 build_system=autotools arch=linux-sles15-skylake_avx512
[^]          ^openssl@1.1.1t%gcc@12.2.0~docs~shared build_system=generic certs=mozilla arch=linux-sles15-skylake_avx512
[^]              ^ca-certificates-mozilla@2023-01-10%gcc@12.2.0 build_system=generic arch=linux-sles15-skylake_avx512
[^]              ^perl@5.36.0%gcc@12.2.0+cpanm+open+shared+threads build_system=generic arch=linux-sles15-skylake_avx512
[^]                  ^berkeley-db@18.1.40%gcc@12.2.0+cxx~docs+stl build_system=autotools patches=26090f4,b231fcc arch=linux-sles15-skylake_avx512
[^]      ^eigen@3.4.0%gcc@12.2.0~ipo build_system=cmake build_type=RelWithDebInfo generator=make arch=linux-sles15-skylake_avx512
[^]          ^cmake@3.26.3%gcc@12.2.0~doc+ncurses+ownlibs~qt build_system=generic build_type=Release arch=linux-sles15-skylake_avx512
[^]              ^openssl@1.1.1t%gcc@12.2.0~docs~shared build_system=generic certs=mozilla arch=linux-sles15-skylake_avx512
[^]                  ^perl@5.36.0%gcc@12.2.0+cpanm+open+shared+threads build_system=generic arch=linux-sles15-skylake_avx512
[^]                      ^gdbm@1.23%gcc@12.2.0 build_system=autotools arch=linux-sles15-skylake_avx512
[^]                          ^readline@8.2%gcc@12.2.0 build_system=autotools patches=bbf97f1 arch=linux-sles15-skylake_avx512
[^]          ^gmake@4.4.1%gcc@12.2.0~guile build_system=autotools arch=linux-sles15-skylake_avx512
[^]      ^gmake@4.4.1%gcc@12.2.0~guile build_system=autotools arch=linux-sles15-skylake_avx512
[^]      ^intel-oneapi-mpi@2021.9.0%gcc@12.2.0+envmods~external-libfabric~generic-names~ilp64 build_system=generic arch=linux-sles15-skylake_avx512
[^]      ^lua@5.4.4%gcc@12.2.0~pcfile+shared build_system=makefile fetcher=curl arch=linux-sles15-skylake_avx512
[^]          ^curl@8.0.1%gcc@12.2.0~gssapi~ldap~libidn2~librtmp~libssh~libssh2~nghttp2 build_system=autotools libs=shared,static tls=openssl arch=linux-sles15-skylake_avx512
[^]          ^readline@8.2%gcc@12.2.0 build_system=autotools patches=bbf97f1 arch=linux-sles15-skylake_avx512
[^]          ^unzip@6.0%gcc@12.2.0 build_system=makefile arch=linux-sles15-skylake_avx512
[^]      ^metis@5.1.0%gcc@12.2.0~gdb+int64~ipo~real64+shared build_system=cmake build_type=Release generator=make patches=4991da9,93a7903,b1225da arch=linux-sles15-skylake_avx512
[^]          ^cmake@3.26.3%gcc@12.2.0~doc+ncurses+ownlibs~qt build_system=generic build_type=Release arch=linux-sles15-skylake_avx512
[^]              ^openssl@1.1.1t%gcc@12.2.0~docs~shared build_system=generic certs=mozilla arch=linux-sles15-skylake_avx512
[^]                  ^perl@5.36.0%gcc@12.2.0+cpanm+open+shared+threads build_system=generic arch=linux-sles15-skylake_avx512
[^]      ^parmetis@4.0.3%gcc@12.2.0~gdb+int64~ipo+shared build_system=cmake build_type=Release generator=make patches=4f89253,50ed208,704b84f arch=linux-sles15-skylake_avx512
[+]      ^petsc@3.20.1%gcc@12.2.0~X~batch~cgns~complex~cuda~debug+double~exodusii~fftw+fortran~giflib+hdf5~hpddm~hwloc+hypre+int64~jpeg+knl~kokkos~libpng~libyaml~memkind+metis~mkl-pardiso~mmg~moab~mpfr+mpi+mumps~openmp~p4est~parmmg~ptscotch~random123~rocm~saws+scalapack+shared~strumpack~suite-sparse+superlu-dist~sycl~tetgen~trilinos~valgrind build_system=generic clanguage=C memalign=32 arch=linux-sles15-skylake_avx512
[^]          ^diffutils@3.9%gcc@12.2.0 build_system=autotools arch=linux-sles15-skylake_avx512
[^]              ^libiconv@1.17%gcc@12.2.0 build_system=autotools libs=shared,static arch=linux-sles15-skylake_avx512
[^]          ^hdf5@1.10.9%gcc@12.2.0+cxx+fortran+hl~ipo~java+mpi+shared+szip+threadsafe+tools api=default build_system=cmake build_type=Release generator=make arch=linux-sles15-skylake_avx512
[^]              ^libaec@1.0.6%gcc@12.2.0~ipo+shared build_system=cmake build_type=Release generator=make arch=linux-sles15-skylake_avx512
[+]          ^hypre@develop%gcc@12.2.0~caliper~complex~cuda~debug+fortran~gptune+int64~internal-superlu~magma~mixedint+mpi~openmp~rocm+shared~superlu-dist~sycl~umpire~unified-memory build_system=autotools arch=linux-sles15-skylake_avx512
[^]          ^intel-oneapi-mkl@2023.1.0%gcc@12.2.0+cluster+envmods~ilp64+shared build_system=generic threads=none arch=linux-sles15-skylake_avx512
[^]              ^intel-oneapi-tbb@2021.9.0%gcc@12.2.0+envmods build_system=generic arch=linux-sles15-skylake_avx512
[+]          ^mumps@5.5.1%gcc@12.2.0~blr_mt+complex+double+float~incfort~int64+metis+mpi~openmp+parmetis~ptscotch~scotch+shared build_system=generic patches=373d736 arch=linux-sles15-skylake_avx512
[^]          ^python@3.10.10%gcc@12.2.0+bz2+crypt+ctypes+dbm~debug+libxml2+lzma~nis~optimizations+pic+pyexpat+pythoncmd+readline+shared+sqlite3+ssl~tkinter+uuid+zlib build_system=generic patches=0d98e93,7d40923,f2fd060 arch=linux-sles15-skylake_avx512
[^]              ^bzip2@1.0.8%gcc@12.2.0~debug~pic+shared build_system=generic arch=linux-sles15-skylake_avx512
[^]              ^expat@2.5.0%gcc@12.2.0+libbsd build_system=autotools arch=linux-sles15-skylake_avx512
[^]                  ^libbsd@0.11.7%gcc@12.2.0 build_system=autotools arch=linux-sles15-skylake_avx512
[^]                      ^libmd@1.0.4%gcc@12.2.0 build_system=autotools arch=linux-sles15-skylake_avx512
[^]              ^gdbm@1.23%gcc@12.2.0 build_system=autotools arch=linux-sles15-skylake_avx512
[^]              ^gettext@0.21.1%gcc@12.2.0+bzip2+curses+git~libunistring+libxml2+tar+xz build_system=autotools arch=linux-sles15-skylake_avx512
[^]                  ^libxml2@2.10.3%gcc@12.2.0~python build_system=autotools arch=linux-sles15-skylake_avx512
[^]                  ^tar@1.30%gcc@12.2.0 build_system=autotools zip=pigz arch=linux-sles15-skylake_avx512
[^]                      ^pigz@2.7%gcc@12.2.0 build_system=makefile arch=linux-sles15-skylake_avx512
[^]              ^libffi@3.4.4%gcc@12.2.0 build_system=autotools arch=linux-sles15-skylake_avx512
[^]              ^libxcrypt@4.4.33%gcc@12.2.0~obsolete_api build_system=autotools arch=linux-sles15-skylake_avx512
[^]              ^sqlite@3.40.1%gcc@12.2.0+column_metadata+dynamic_extensions+fts~functions+rtree build_system=autotools arch=linux-sles15-skylake_avx512
[^]              ^util-linux-uuid@2.38.1%gcc@12.2.0 build_system=autotools arch=linux-sles15-skylake_avx512
[^]              ^xz@5.4.1%gcc@12.2.0~pic build_system=autotools libs=shared,static arch=linux-sles15-skylake_avx512
[+]          ^superlu-dist@develop%gcc@12.2.0~cuda+int64~ipo~openmp+parmetis~rocm+shared build_system=cmake build_type=Release generator=make arch=linux-sles15-skylake_avx512
[^]      ^zlib@1.2.13%gcc@12.2.0+optimize+pic+shared build_system=makefile arch=linux-sles15-skylake_avx512
@Thomas-Ulrich Thomas-Ulrich added the bug Something isn't working label Jul 17, 2024
@Thomas-Ulrich
Collaborator Author

I think the problem occurs when the job is killed while the Green's functions are being written
(the write is probably overwriting the old file).
Note that the mesh is tiny, yet write_discrete_greens_operator takes ages (several seconds):

num_nodes: 1 ntasks: 48

               ___          ___         _____         ___          ___
      ___     /  /\        /__/\       /  /::\       /  /\        /__/\
     /  /\   /  /::\       \  \:\     /  /:/\:\     /  /:/_      |  |::\
    /  /:/  /  /:/\:\       \  \:\   /  /:/  \:\   /  /:/ /\     |  |:|:\
   /  /:/  /  /:/~/::\  _____\__\:\ /__/:/ \__\:| /  /:/ /:/_  __|__|:|\:\
  /  /::\ /__/:/ /:/\:\/__/::::::::\\  \:\ /  /://__/:/ /:/ /\/__/::::| \:\
 /__/:/\:\\  \:\/:/__\/\  \:\~~\~~\/ \  \:\  /:/ \  \:\/:/ /:/\  \:\~~\__\/
 \__\/  \:\\  \::/      \  \:\  ~~~   \  \:\/:/   \  \::/ /:/  \  \:\
      \  \:\\  \:\       \  \:\        \  \::/     \  \:\/:/    \  \:\
       \__\/ \  \:\       \  \:\        \__\/       \  \::/      \  \:\
              \__\/        \__\/                     \__\/        \__\/

                          tandem version ee87ac9

                        stack size limit = 2048 MiB

                              Worker affinity
    0---------|----------|----------|----------|--------8-|----------|
    ----------|----------|----------|------


Multigrid P-levels: 1 2
Using GF checkpoint path: GreensFunctions/bp6_hf250
create_discrete_greens_function()
Green's function operator size: 8856 x 5904
partial_assemble_discrete_greens_function() [0 , 5904)
Computing Green's function 0 / 5904
write_discrete_greens_operator():matrix 3.54e+00 (sec)
  status: computed 1 / pending 5903
write_discrete_greens_operator():facets 1.59e-02 (sec)
Computing Green's function 1 / 5904
write_discrete_greens_operator():matrix 3.36e+00 (sec)
  status: computed 2 / pending 5902
write_discrete_greens_operator():facets 6.75e-03 (sec)
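
To illustrate the overwrite hypothesis above: if a checkpoint is written in place, a kill during the write leaves a truncated file and the previous checkpoint is lost. A common mitigation is to write to a temporary file and rename it over the old one only once the write has completed. The following is a minimal Python sketch of that general pattern, not tandem's implementation; the function name `write_checkpoint_atomically` and the `.tmp_gf_` prefix are hypothetical.

```python
import os
import tempfile


def write_checkpoint_atomically(path: str, payload: bytes) -> None:
    """Write payload to path so that an interrupted write keeps the old file.

    Hypothetical helper: the data goes to a temporary file in the same
    directory and is renamed over the target only after it is fully written.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp_gf_")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(payload)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the data to disk before renaming
        # os.replace is atomic on POSIX when source and target share a filesystem
        os.replace(tmp_path, path)
    except BaseException:
        # On a Python-level failure, remove the partial temporary file
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

If the process is killed by the batch system mid-write, the cleanup branch does not run and a stale temporary file may remain, but the previous checkpoint at `path` stays readable.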

@Thomas-Ulrich
Collaborator Author

Example timing:

  Total time:      4.29e+00 sec
  Open file:       4.90e-05 sec
  Write commsize:  2.88e-01 sec
  Write current_gf:2.17e-06 sec
  MatView:         3.69e+00 sec
  Close file:      3.16e-01 sec
  Print status:    5.60e-05 sec
  Write facet:     1.42e-03 sec

@Thomas-Ulrich
Collaborator Author

OK, I guess the problem is that the full Green's function matrix (including the zeros) has to be written at each call.
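
As a rough check of that guess, the sketch below estimates how much data a full dense write involves for the small test case (operator size 8856 x 5904) compared with a single column. It assumes 8 bytes per entry, based on the double-precision real PETSc build shown in the error log, and ignores any file header; the helper `dense_bytes` is only for illustration.

```python
BYTES_PER_DOUBLE = 8  # assumption: PETSc built with double precision, real scalars


def dense_bytes(rows: int, cols: int) -> int:
    """Size in bytes of a dense rows x cols matrix of doubles (no header)."""
    return rows * cols * BYTES_PER_DOUBLE


# Operator size reported in the log of the small test case
full = dense_bytes(8856, 5904)
one_column = dense_bytes(8856, 1)

print(f"full matrix per checkpoint: {full / 1e6:.0f} MB")
print(f"one column:                 {one_column / 1e3:.0f} kB")
```

Under these assumptions, each checkpoint writes about 418 MB even though only one new column of about 71 kB has been computed since the previous one.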

@Thomas-Ulrich Thomas-Ulrich changed the title Sporadic green function checkpointing error on 3D when restarting Slow Green function checkpointing on large setups risks unusable gf file Jul 17, 2024
@Thomas-Ulrich
Collaborator Author

Here is an example of BP5 with the default mesh.
Checkpointing 152 GB takes 19 min!

num_nodes: 6 ntasks: 288

               ___          ___         _____         ___          ___
      ___     /  /\        /__/\       /  /::\       /  /\        /__/\
     /  /\   /  /::\       \  \:\     /  /:/\:\     /  /:/_      |  |::\
    /  /:/  /  /:/\:\       \  \:\   /  /:/  \:\   /  /:/ /\     |  |:|:\
   /  /:/  /  /:/~/::\  _____\__\:\ /__/:/ \__\:| /  /:/ /:/_  __|__|:|\:\
  /  /::\ /__/:/ /:/\:\/__/::::::::\\  \:\ /  /://__/:/ /:/ /\/__/::::| \:\
 /__/:/\:\\  \:\/:/__\/\  \:\~~\~~\/ \  \:\  /:/ \  \:\/:/ /:/\  \:\~~\__\/
 \__\/  \:\\  \::/      \  \:\  ~~~   \  \:\/:/   \  \::/ /:/  \  \:\
      \  \:\\  \:\       \  \:\        \  \::/     \  \:\/:/    \  \:\
       \__\/ \  \:\       \  \:\        \__\/       \  \::/      \  \:\
              \__\/        \__\/                     \__\/        \__\/

                          tandem version ee87ac9

                        stack size limit = 2048 MiB

                              Worker affinity
    0---------|----------|----------|----------|--------8-|----------|
    ----------|----------|----------|------


Multigrid P-levels: 1 2
Using GF checkpoint path: GreensFunctions/bp6_hf250
create_discrete_greens_function()
Green's function operator size: 167796 x 111864
partial_assemble_discrete_greens_function() [0 , 111864)
Computing Green's function 0 / 111864
write_discrete_greens_operator():matrix 1.14e+03 (sec)
  status: computed 1 / pending 111863
write_discrete_greens_operator():facets 1.18e-02 (sec)
Computing Green's function 1 / 111864
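
The numbers in this log can be cross-checked with a short calculation. The sketch below assumes a dense matrix of 8-byte doubles without header overhead (an assumption based on the PETSc build options, not a measurement of the actual file), which gives roughly 150 GB, close to the size quoted above.

```python
BYTES_PER_DOUBLE = 8  # assumption: double-precision real scalars


def dense_bytes(rows: int, cols: int) -> int:
    """Size in bytes of a dense rows x cols matrix of doubles (no header)."""
    return rows * cols * BYTES_PER_DOUBLE


rows, cols = 167796, 111864  # operator size from the log
write_seconds = 1.14e3       # write_discrete_greens_operator():matrix

size_bytes = dense_bytes(rows, cols)
throughput = size_bytes / write_seconds  # effective write rate in bytes/s

print(f"dense matrix size: {size_bytes / 1e9:.1f} GB")
print(f"write time:        {write_seconds / 60:.0f} min")
print(f"throughput:        {throughput / 1e6:.0f} MB/s")
```

At this rate a single checkpoint costs far more than computing one Green's function, so checkpointing after every Green's function is not practical for this mesh.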

@Thomas-Ulrich
Collaborator Author

Thomas-Ulrich commented Jul 17, 2024

OK, it seems I fixed one of the problems with this simple commit:

739b36d

Now checkpointing is much faster!

Multigrid P-levels: 1 2 
Using GF checkpoint path: GreensFunctions/bp6_hf250
create_discrete_greens_function()
Green's function operator size: 167796 x 111864
partial_assemble_discrete_greens_function() [0 , 111864)
Computing Green's function 0 / 111864
write_discrete_greens_operator():matrix 1.55e+01 (sec)
  status: computed 1 / pending 111863
write_discrete_greens_operator():facets 8.62e-03 (sec)
Computing Green's function 1 / 111864
write_discrete_greens_operator():matrix 1.65e+01 (sec)
  status: computed 2 / pending 111862
write_discrete_greens_operator():facets 8.93e-03 (sec)
Computing Green's function 2 / 111864
write_discrete_greens_operator():matrix 1.56e+01 (sec)
  status: computed 3 / pending 111861
write_discrete_greens_operator():facets 1.10e-02 (sec)
Computing Green's function 3 / 111864
write_discrete_greens_operator():matrix 1.62e+01 (sec)

Thomas-Ulrich added a commit that referenced this issue Jul 17, 2024
@Thomas-Ulrich
Collaborator Author

Thomas-Ulrich commented Jul 17, 2024

And with cfd7a25, I fixed the rest of the issue.

Thomas-Ulrich added a commit that referenced this issue Jul 17, 2024
Thomas-Ulrich added a commit that referenced this issue Jul 23, 2024
Thomas-Ulrich added a commit that referenced this issue Jul 23, 2024
Thomas-Ulrich added a commit that referenced this issue Oct 22, 2024
Thomas-Ulrich added a commit that referenced this issue Nov 26, 2024