F(a) and F(b) must have different sign on first time step of BP5 #74

Closed
Thomas-Ulrich opened this issue Jul 17, 2024 · 2 comments
Labels
bug Something isn't working


Thomas-Ulrich commented Jul 17, 2024

Describe the bug
Using tandem with polynomial degree 2 (p2) and the BP5 example from the repository, I get an error at the first time step:

di73yeq4@login03:/hppfs/work/pn49ha/di73yeq4/tandem/examples/tandem/3d> head 3451988.tandem.out -n 200
num_nodes: 4 ntasks: 192

               ___          ___         _____         ___          ___
      ___     /  /\        /__/\       /  /::\       /  /\        /__/\
     /  /\   /  /::\       \  \:\     /  /:/\:\     /  /:/_      |  |::\
    /  /:/  /  /:/\:\       \  \:\   /  /:/  \:\   /  /:/ /\     |  |:|:\
   /  /:/  /  /:/~/::\  _____\__\:\ /__/:/ \__\:| /  /:/ /:/_  __|__|:|\:\
  /  /::\ /__/:/ /:/\:\/__/::::::::\\  \:\ /  /://__/:/ /:/ /\/__/::::| \:\
 /__/:/\:\\  \:\/:/__\/\  \:\~~\~~\/ \  \:\  /:/ \  \:\/:/ /:/\  \:\~~\__\/
 \__\/  \:\\  \::/      \  \:\  ~~~   \  \:\/:/   \  \::/ /:/  \  \:\
      \  \:\\  \:\       \  \:\        \  \::/     \  \:\/:/    \  \:\
       \__\/ \  \:\       \  \:\        \__\/       \  \::/      \  \:\
              \__\/        \__\/                     \__\/        \__\/

                          tandem version ee87ac9

                        stack size limit = 2048 MiB

                              Worker affinity
    0---------|----------|----------|----------|--------8-|----------|
    ----------|----------|----------|------


Multigrid P-levels: 1 2 
TS ts_checkpoint.storage_type limited
TS ts_checkpoint.save_directory checkpoint
TS ts_checkpoint.freq_step 1000
TS ts_checkpoint.freq_cputime 3.0000e+01
TS ts_checkpoint.freq_physical_time 1.0000e+10
TS ts_checkpoint.storage_limited_size 2
[checkpoint] directory created
DOFs (domain): 1891590
DOFs (fault): 167796
Mesh size: 71.6532
sigma_n = 11.0811
|tau| = 13525.3
psi = -0.220103
L = 0
U = 2924.74
F(L) = 13525.3
sigma_n = 196.612
|tau| = 26418.9
psi = -0.993655
L = 0
U = 5712.89
F(L) = 26418.9
F(U) = 1.61031e-12
sigma_n = 54.621
|tau| = 105097
psi = -6.47109
L = 0
U = 22726.5
F(L) = 105097
F(U) = 5.31919e-12
terminate called after throwing an instance of 'std::logic_error'
sigma_n = 41.6383
|tau| = 13866.2
psi = -0.204948
L = 0
U = 2998.47
F(L) = 13866.2
F(U) = 7.89669e-14
terminate called after throwing an instance of 'std::logic_error'
sigma_n = 19.8669
|tau| = 14586.5
psi = -0.25234
L = 0
U = 3154.22
F(L) = 14586.5
F(U) = 6.96785e-13
  what():  F(a) and F(b) must have different sign.
F(U) = 8.03797e-13
terminate called after throwing an instance of 'std::logic_error'
sigma_n = 58.7748
|tau| = 16364
psi = -0.525257
L = 0
U = 3538.6
F(L) = 16364
F(U) = 7.50726e-13
terminate called after throwing an instance of 'std::logic_error'
sigma_n = 51.8792
|tau| = 15802.3
psi = -0.306186
L = 0
U = 3417.13
F(L) = 15802.3
F(U) = 1.0331e-12
  what():  F(a) and F(b) must have different sign.
terminate called after throwing an instance of 'std::logic_error'
  what():  F(a) and F(b) must have different sign.
terminate called after throwing an instance of 'std::logic_error'
  what():  F(a) and F(b) must have different sign.
terminate called after throwing an instance of 'std::logic_error'
  what():  F(a) and F(b) must have different sign.
  what():  F(a) and F(b) must have different sign.
  what():  F(a) and F(b) must have different sign.
srun: error: i01r01c05s07: task 134: Aborted (core dumped)
srun: launch/slurm: _step_signal: Terminating StepId=3451988.0
slurmstepd: error: *** STEP 3451988.0 ON i01r01c05s05 CANCELLED AT 2024-07-17T11:37:51 ***
[148]PETSC ERROR: ------------------------------------------------------------------------
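
For readers of the log above: in every record that prints both values, F(L) is large and positive while F(U) is tiny (around 1e-12) but also positive, so the interval [L, U] does not contain a sign change, which is what the exception message complains about. The following is a minimal sketch of such a bracketing precondition. The helper names `brackets_root` and `require_bracket` are hypothetical and are not taken from tandem; the actual check inside `zeroIn` may be written differently.

```cpp
#include <cmath>
#include <stdexcept>

// Hypothetical helper (not tandem's code): true when F(a) and F(b) have
// opposite signs, or when one of them is exactly zero (a root at an endpoint).
inline bool brackets_root(double Fa, double Fb) {
    return Fa == 0.0 || Fb == 0.0 || std::signbit(Fa) != std::signbit(Fb);
}

// Hypothetical precondition check for a bracketing root finder. It throws the
// same exception type and message that appear in the log when the bracket is
// invalid.
inline void require_bracket(double Fa, double Fb) {
    if (!brackets_root(Fa, Fb)) {
        throw std::logic_error("F(a) and F(b) must have different sign.");
    }
}
```

With the values from the log, for example F(L) = 26418.9 and F(U) = 1.61031e-12, `brackets_root` returns false, so `require_bracket` would throw.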

Expected behavior
No error.
To Reproduce
Steps to reproduce the behavior:

I'm running BP5.toml on the branch of #72 (at commit ee87ac9),
which is a few commits on top of #59.

tandem was installed with Spack on SuperMUC-NG with:

spack install -j 30 tandem@tscp polynomial_degree=2 domain_dimension=3

Here is the list of tandem's dependencies and their specs:

di73yeq4@login03:/hppfs/work/pn49ha/di73yeq4/tandem/examples/tandem/3d> spack spec -I  tandem@tscp polynomial_degree=2 domain_dimension=3


Input spec
--------------------------------
 -   tandem@tscp domain_dimension=3 polynomial_degree=2

Concretized
--------------------------------
 -   tandem@tscp%gcc@12.2.0~cuda~ipo~libxsmm~python~rocm build_system=cmake build_type=Release domain_dimension=3 generator=make min_quadrature_order=0 polynomial_degree=2 arch=linux-sles15-skylake_avx512
[^]      ^cmake@3.26.3%gcc@12.2.0~doc+ncurses+ownlibs~qt build_system=generic build_type=Release arch=linux-sles15-skylake_avx512
[^]          ^ncurses@6.4%gcc@12.2.0~symlinks+termlib abi=none build_system=autotools arch=linux-sles15-skylake_avx512
[^]              ^pkgconf@1.8.0%gcc@12.2.0 build_system=autotools arch=linux-sles15-skylake_avx512
[^]          ^openssl@1.1.1t%gcc@12.2.0~docs~shared build_system=generic certs=mozilla arch=linux-sles15-skylake_avx512
[^]              ^ca-certificates-mozilla@2023-01-10%gcc@12.2.0 build_system=generic arch=linux-sles15-skylake_avx512
[^]              ^perl@5.36.0%gcc@12.2.0+cpanm+open+shared+threads build_system=generic arch=linux-sles15-skylake_avx512
[^]                  ^berkeley-db@18.1.40%gcc@12.2.0+cxx~docs+stl build_system=autotools patches=26090f4,b231fcc arch=linux-sles15-skylake_avx512
[^]      ^eigen@3.4.0%gcc@12.2.0~ipo build_system=cmake build_type=RelWithDebInfo generator=make arch=linux-sles15-skylake_avx512
[^]          ^cmake@3.26.3%gcc@12.2.0~doc+ncurses+ownlibs~qt build_system=generic build_type=Release arch=linux-sles15-skylake_avx512
[^]              ^openssl@1.1.1t%gcc@12.2.0~docs~shared build_system=generic certs=mozilla arch=linux-sles15-skylake_avx512
[^]                  ^perl@5.36.0%gcc@12.2.0+cpanm+open+shared+threads build_system=generic arch=linux-sles15-skylake_avx512
[^]                      ^gdbm@1.23%gcc@12.2.0 build_system=autotools arch=linux-sles15-skylake_avx512
[^]                          ^readline@8.2%gcc@12.2.0 build_system=autotools patches=bbf97f1 arch=linux-sles15-skylake_avx512
[^]          ^gmake@4.4.1%gcc@12.2.0~guile build_system=autotools arch=linux-sles15-skylake_avx512
[^]      ^gmake@4.4.1%gcc@12.2.0~guile build_system=autotools arch=linux-sles15-skylake_avx512
[^]      ^intel-oneapi-mpi@2021.9.0%gcc@12.2.0+envmods~external-libfabric~generic-names~ilp64 build_system=generic arch=linux-sles15-skylake_avx512
[^]      ^lua@5.4.4%gcc@12.2.0~pcfile+shared build_system=makefile fetcher=curl arch=linux-sles15-skylake_avx512
[^]          ^curl@8.0.1%gcc@12.2.0~gssapi~ldap~libidn2~librtmp~libssh~libssh2~nghttp2 build_system=autotools libs=shared,static tls=openssl arch=linux-sles15-skylake_avx512
[^]          ^readline@8.2%gcc@12.2.0 build_system=autotools patches=bbf97f1 arch=linux-sles15-skylake_avx512
[^]          ^unzip@6.0%gcc@12.2.0 build_system=makefile arch=linux-sles15-skylake_avx512
[^]      ^metis@5.1.0%gcc@12.2.0~gdb+int64~ipo~real64+shared build_system=cmake build_type=Release generator=make patches=4991da9,93a7903,b1225da arch=linux-sles15-skylake_avx512
[^]          ^cmake@3.26.3%gcc@12.2.0~doc+ncurses+ownlibs~qt build_system=generic build_type=Release arch=linux-sles15-skylake_avx512
[^]              ^openssl@1.1.1t%gcc@12.2.0~docs~shared build_system=generic certs=mozilla arch=linux-sles15-skylake_avx512
[^]                  ^perl@5.36.0%gcc@12.2.0+cpanm+open+shared+threads build_system=generic arch=linux-sles15-skylake_avx512
[^]      ^parmetis@4.0.3%gcc@12.2.0~gdb+int64~ipo+shared build_system=cmake build_type=Release generator=make patches=4f89253,50ed208,704b84f arch=linux-sles15-skylake_avx512
[+]      ^petsc@3.20.1%gcc@12.2.0~X~batch~cgns~complex~cuda~debug+double~exodusii~fftw+fortran~giflib+hdf5~hpddm~hwloc+hypre+int64~jpeg+knl~kokkos~libpng~libyaml~memkind+metis~mkl-pardiso~mmg~moab~mpfr+mpi+mumps~openmp~p4est~parmmg~ptscotch~random123~rocm~saws+scalapack+shared~strumpack~suite-sparse+superlu-dist~sycl~tetgen~trilinos~valgrind build_system=generic clanguage=C memalign=32 arch=linux-sles15-skylake_avx512
[^]          ^diffutils@3.9%gcc@12.2.0 build_system=autotools arch=linux-sles15-skylake_avx512
[^]              ^libiconv@1.17%gcc@12.2.0 build_system=autotools libs=shared,static arch=linux-sles15-skylake_avx512
[^]          ^hdf5@1.10.9%gcc@12.2.0+cxx+fortran+hl~ipo~java+mpi+shared+szip+threadsafe+tools api=default build_system=cmake build_type=Release generator=make arch=linux-sles15-skylake_avx512
[^]              ^libaec@1.0.6%gcc@12.2.0~ipo+shared build_system=cmake build_type=Release generator=make arch=linux-sles15-skylake_avx512
[+]          ^hypre@develop%gcc@12.2.0~caliper~complex~cuda~debug+fortran~gptune+int64~internal-superlu~magma~mixedint+mpi~openmp~rocm+shared~superlu-dist~sycl~umpire~unified-memory build_system=autotools arch=linux-sles15-skylake_avx512
[^]          ^intel-oneapi-mkl@2023.1.0%gcc@12.2.0+cluster+envmods~ilp64+shared build_system=generic threads=none arch=linux-sles15-skylake_avx512
[^]              ^intel-oneapi-tbb@2021.9.0%gcc@12.2.0+envmods build_system=generic arch=linux-sles15-skylake_avx512
[+]          ^mumps@5.5.1%gcc@12.2.0~blr_mt+complex+double+float~incfort~int64+metis+mpi~openmp+parmetis~ptscotch~scotch+shared build_system=generic patches=373d736 arch=linux-sles15-skylake_avx512
[^]          ^python@3.10.10%gcc@12.2.0+bz2+crypt+ctypes+dbm~debug+libxml2+lzma~nis~optimizations+pic+pyexpat+pythoncmd+readline+shared+sqlite3+ssl~tkinter+uuid+zlib build_system=generic patches=0d98e93,7d40923,f2fd060 arch=linux-sles15-skylake_avx512
[^]              ^bzip2@1.0.8%gcc@12.2.0~debug~pic+shared build_system=generic arch=linux-sles15-skylake_avx512
[^]              ^expat@2.5.0%gcc@12.2.0+libbsd build_system=autotools arch=linux-sles15-skylake_avx512
[^]                  ^libbsd@0.11.7%gcc@12.2.0 build_system=autotools arch=linux-sles15-skylake_avx512
[^]                      ^libmd@1.0.4%gcc@12.2.0 build_system=autotools arch=linux-sles15-skylake_avx512
[^]              ^gdbm@1.23%gcc@12.2.0 build_system=autotools arch=linux-sles15-skylake_avx512
[^]              ^gettext@0.21.1%gcc@12.2.0+bzip2+curses+git~libunistring+libxml2+tar+xz build_system=autotools arch=linux-sles15-skylake_avx512
[^]                  ^libxml2@2.10.3%gcc@12.2.0~python build_system=autotools arch=linux-sles15-skylake_avx512
[^]                  ^tar@1.30%gcc@12.2.0 build_system=autotools zip=pigz arch=linux-sles15-skylake_avx512
[^]                      ^pigz@2.7%gcc@12.2.0 build_system=makefile arch=linux-sles15-skylake_avx512
[^]              ^libffi@3.4.4%gcc@12.2.0 build_system=autotools arch=linux-sles15-skylake_avx512
[^]              ^libxcrypt@4.4.33%gcc@12.2.0~obsolete_api build_system=autotools arch=linux-sles15-skylake_avx512
[^]              ^sqlite@3.40.1%gcc@12.2.0+column_metadata+dynamic_extensions+fts~functions+rtree build_system=autotools arch=linux-sles15-skylake_avx512
[^]              ^util-linux-uuid@2.38.1%gcc@12.2.0 build_system=autotools arch=linux-sles15-skylake_avx512
[^]              ^xz@5.4.1%gcc@12.2.0~pic build_system=autotools libs=shared,static arch=linux-sles15-skylake_avx512
[+]          ^superlu-dist@develop%gcc@12.2.0~cuda+int64~ipo~openmp+parmetis~rocm+shared build_system=cmake build_type=Release generator=make arch=linux-sles15-skylake_avx512
[^]      ^zlib@1.2.13%gcc@12.2.0+optimize+pic+shared build_system=makefile arch=linux-sles15-skylake_avx512

The job was launched with:

#!/bin/bash
# Job Name and Files (also --job-name)
#SBATCH -J tandem
#Output and error (also --output, --error):
#SBATCH -o ./%j.%x.out
#SBATCH -e ./%j.%x.out

#Initial working directory:
#SBATCH --chdir=./

#Notification and type
#SBATCH --mail-type=END
#SBATCH --mail-user=thomas.ulrich@lmu.de
#SBATCH --no-requeue

#Setup of execution environment
#SBATCH --export=ALL
#SBATCH --account=pn49ha

#SBATCH --ntasks-per-node=48
#SBATCH --cpus-per-task=1
#EAR may impact code performance
#SBATCH --ear=off

##SBATCH --nodes=20 --partition=general --time=00:35:00
#SBATCH --nodes=4 --partition=test --time=00:30:00 
#--exclude="i01r01c[01-02]s[01-12]"

module load slurm_setup

export MP_SINGLE_THREAD=yes
export OMP_NUM_THREADS=1
export MP_TASK_AFFINITY=core:$OMP_NUM_THREADS

echo 'num_nodes:' $SLURM_JOB_NUM_NODES 'ntasks:' $SLURM_NTASKS
ulimit -Ss 2097152

srun tandem bp5.toml  --mg_strategy twolevel --mg_coarse_level 1  --petsc -ksp_max_it 400 -pc_type mg -mg_levels_ksp_max_it 4 -mg_levels_ksp_type cg -mg_levels_pc_type bjacobi -ksp_rtol 1.0e-6 -mg_coarse_pc_type gamg -mg_coarse_ksp_type cg -mg_coarse_ksp_rtol 1.0e-1 -ksp_type gcr -log_view                                            
Thomas-Ulrich added the bug label Jul 17, 2024

Thomas-Ulrich commented Sep 11, 2024

I've added some additional error logging:

diff --git a/app/localoperator/DieterichRuinaAgeing.h b/app/localoperator/DieterichRuinaAgeing.h
index 5d4b5b6..019edf0 100644
--- a/app/localoperator/DieterichRuinaAgeing.h
+++ b/app/localoperator/DieterichRuinaAgeing.h
@@ -106,7 +106,11 @@ public:
                     V = zeroIn(a, b, fF);
                 } catch (std::exception const&) {
                     std::cout << "sigma_n = " << snAbs << std::endl
+                              << "-sn = " << -sn << std::endl
+                              << "SnPre = " << p_[index].get<SnPre>() << std::endl
                               << "|tau| = " << tauAbs << std::endl
+                              << "|tau_inc| = " << norm(tau) << std::endl
+                              << "|TauPre| = " << norm(p_[index].get<TauPre>()) << std::endl
                               << "psi = " << psi << std::endl
                               << "L = " << a << std::endl
                               << "U = " << b << std::endl

The extra output suggests that tau_ini is probably correct:

sigma_n = 28.5945
-sn = 3.59447
SnPre = 25
|tau| = 7012.44
|tau_inc| = 6991.29
|TauPre| = 21.1481
psi = -0.790723
L = 0
sigma_n = 80.8889
-sn = 55.8889
SnPre = 25
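
As a quick consistency check on the values printed above, the logged quantities add up as follows. This assumes, from the printed labels only, that sigma_n combines -sn with SnPre and that |tau| combines the incremental traction with TauPre; the sum of norms matches |tau| only if the two traction contributions are nearly parallel.

```latex
\begin{align*}
-s_n + \mathrm{SnPre} &= 3.59447 + 25 = 28.59447 \approx 28.5945 = \sigma_n \\
-s_n + \mathrm{SnPre} &= 55.8889 + 25 = 80.8889 = \sigma_n \quad \text{(second record)} \\
|\tau_{\mathrm{inc}}| + |\mathrm{TauPre}| &= 6991.29 + 21.1481 = 7012.4381 \approx 7012.44 = |\tau|
\end{align*}
```

In other words, the prescribed pre-stress (SnPre = 25, |TauPre| of about 21) is present in the totals, while the large |tau| comes almost entirely from the incremental part.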

I also tested v1.0 and get the same issue (with both p1 and p2).
I also tested Nico's setup.

Thomas-Ulrich commented

This was because I was not setting the PETSc parameters for the TS file!
Maybe we could catch this missing parameter in the future.
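
One possible shape for such a check, as a rough sketch only: collect the options that were actually provided and report any required ones that are absent before the first time step. Everything here is hypothetical. `missing_options` is not a tandem or PETSc function, the `std::map` merely stands in for however the options are stored, and which keys count as required would have to be decided by the maintainers.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical sketch (not tandem's or PETSc's API): `options` stands in for
// the key/value options gathered from the input file and command line, and
// `required` lists the keys that must be present. Returns the required keys
// that were not found, in the order they were requested.
inline std::vector<std::string>
missing_options(std::map<std::string, std::string> const& options,
                std::vector<std::string> const& required) {
    std::vector<std::string> missing;
    for (auto const& key : required) {
        if (options.find(key) == options.end()) {
            missing.push_back(key);
        }
    }
    return missing;
}
```

A caller could then abort early with a message naming the missing keys if the returned vector is non-empty, instead of failing later inside the root finder.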
