large file tests hang at run_diskless2.sh #2315

Closed
edwardhartnett opened this issue Apr 28, 2022 · 15 comments

@edwardhartnett
Contributor

@DennisHeimbigner when trying to run the large file tests (i.e., with --enable-large-file-tests), the test run_diskless2.sh hangs. It has been stuck for more than an hour.

This is on a powerful multi-core machine with plenty of memory, so if the test can't run on this machine, it's too hard! ;-)

ed@mikado:~/netcdf-c/nc_test$ bash -x ./run_diskless2.sh 
+ test x = x
++ pwd
+ srcdir=/home/ed/netcdf-c/nc_test
+ . ../test_common.sh
++ TOPSRCDIR=/home/ed/netcdf-c
++ TOPBUILDDIR=/home/ed/netcdf-c
++ FP_ISCMAKE=
++ FP_ISMSVC=
++ FP_WINVERMAJOR=0
++ FP_WINVERBUILD=0
++ FP_ISCYGWIN=
++ FP_ISMINGW=
++ FP_ISMSYS=
++ FP_ISOSX=
++ FP_ISREGEDIT=yes
++ FP_USEPLUGINS=yes
++ FP_ISREGEDIT=yes
++ FEATURE_HDF5=yes
++ FEATURE_HDF5=yes
++ FEATURE_S3TESTS=no
++ FEATURE_NCZARR_ZIP=no
++ FEATURE_FILTERTESTS=yes
++ set -e
++ test x = x1
+++ uname
++ system=Linux
++ test xLinux = x
++ top_srcdir=/home/ed/netcdf-c
++ top_builddir=/home/ed/netcdf-c
++ test x/home/ed/netcdf-c/nc_test = x
+++ pwd
++ builddir=/home/ed/netcdf-c/nc_test
++ execdir=/home/ed/netcdf-c/nc_test
+++ basename /home/ed/netcdf-c/nc_test
++ thisdir=nc_test
+++ pwd
++ WD=/home/ed/netcdf-c/nc_test
++ cd /home/ed/netcdf-c/nc_test
+++ pwd
++ srcdir=/home/ed/netcdf-c/nc_test
++ cd /home/ed/netcdf-c/nc_test
++ cd /home/ed/netcdf-c
+++ pwd
++ top_srcdir=/home/ed/netcdf-c
++ cd /home/ed/netcdf-c/nc_test
++ cd /home/ed/netcdf-c/nc_test
+++ pwd
++ builddir=/home/ed/netcdf-c/nc_test
++ cd /home/ed/netcdf-c/nc_test
++ cd /home/ed/netcdf-c
+++ pwd
++ top_builddir=/home/ed/netcdf-c
++ cd /home/ed/netcdf-c/nc_test
++ cd /home/ed/netcdf-c/nc_test
+++ pwd
++ execdir=/home/ed/netcdf-c/nc_test
++ cd /home/ed/netcdf-c/nc_test
++ export srcdir top_srcdir builddir top_builddir execdir
++ test -e /home/ed/netcdf-c/ncdump/ncdump.exe
++ ext=
++ export NCDUMP=/home/ed/netcdf-c/ncdump/ncdump
++ NCDUMP=/home/ed/netcdf-c/ncdump/ncdump
++ export NCCOPY=/home/ed/netcdf-c/ncdump/nccopy
++ NCCOPY=/home/ed/netcdf-c/ncdump/nccopy
++ export NCGEN=/home/ed/netcdf-c/ncgen/ncgen
++ NCGEN=/home/ed/netcdf-c/ncgen/ncgen
++ export NCGEN3=/home/ed/netcdf-c/ncgen3/ncgen3
++ NCGEN3=/home/ed/netcdf-c/ncgen3/ncgen3
++ export NCPATHCVT=/home/ed/netcdf-c/ncdump/ncpathcvt
++ NCPATHCVT=/home/ed/netcdf-c/ncdump/ncpathcvt
++ ncgen3c0=/home/ed/netcdf-c/ncgen3/c0.cdl
++ ncgenc0=/home/ed/netcdf-c/ncgen/c0.cdl
++ ncgenc04=/home/ed/netcdf-c/ncgen/c0_4.cdl
++ test x = xyes
++ test x = xyes
++ cd /home/ed/netcdf-c/nc_test
+ set -e
+ test x/home/ed/netcdf-c/nc_test = x
+ . ../test_common.sh
++ TOPSRCDIR=/home/ed/netcdf-c
++ TOPBUILDDIR=/home/ed/netcdf-c
++ FP_ISCMAKE=
++ FP_ISMSVC=
++ FP_WINVERMAJOR=0
++ FP_WINVERBUILD=0
++ FP_ISCYGWIN=
++ FP_ISMINGW=
++ FP_ISMSYS=
++ FP_ISOSX=
++ FP_ISREGEDIT=yes
++ FP_USEPLUGINS=yes
++ FP_ISREGEDIT=yes
++ FEATURE_HDF5=yes
++ FEATURE_HDF5=yes
++ FEATURE_S3TESTS=no
++ FEATURE_NCZARR_ZIP=no
++ FEATURE_FILTERTESTS=yes
++ set -e
++ test x = x1
+++ uname
++ system=Linux
++ test xLinux = x
++ top_srcdir=/home/ed/netcdf-c
++ top_builddir=/home/ed/netcdf-c
++ test x/home/ed/netcdf-c/nc_test = x
+++ pwd
++ builddir=/home/ed/netcdf-c/nc_test
++ execdir=/home/ed/netcdf-c/nc_test
+++ basename /home/ed/netcdf-c/nc_test
++ thisdir=nc_test
+++ pwd
++ WD=/home/ed/netcdf-c/nc_test
++ cd /home/ed/netcdf-c/nc_test
+++ pwd
++ srcdir=/home/ed/netcdf-c/nc_test
++ cd /home/ed/netcdf-c/nc_test
++ cd /home/ed/netcdf-c
+++ pwd
++ top_srcdir=/home/ed/netcdf-c
++ cd /home/ed/netcdf-c/nc_test
++ cd /home/ed/netcdf-c/nc_test
+++ pwd
++ builddir=/home/ed/netcdf-c/nc_test
++ cd /home/ed/netcdf-c/nc_test
++ cd /home/ed/netcdf-c
+++ pwd
++ top_builddir=/home/ed/netcdf-c
++ cd /home/ed/netcdf-c/nc_test
++ cd /home/ed/netcdf-c/nc_test
+++ pwd
++ execdir=/home/ed/netcdf-c/nc_test
++ cd /home/ed/netcdf-c/nc_test
++ export srcdir top_srcdir builddir top_builddir execdir
++ test -e /home/ed/netcdf-c/ncdump/ncdump.exe
++ ext=
++ export NCDUMP=/home/ed/netcdf-c/ncdump/ncdump
++ NCDUMP=/home/ed/netcdf-c/ncdump/ncdump
++ export NCCOPY=/home/ed/netcdf-c/ncdump/nccopy
++ NCCOPY=/home/ed/netcdf-c/ncdump/nccopy
++ export NCGEN=/home/ed/netcdf-c/ncgen/ncgen
++ NCGEN=/home/ed/netcdf-c/ncgen/ncgen
++ export NCGEN3=/home/ed/netcdf-c/ncgen3/ncgen3
++ NCGEN3=/home/ed/netcdf-c/ncgen3/ncgen3
++ export NCPATHCVT=/home/ed/netcdf-c/ncdump/ncpathcvt
++ NCPATHCVT=/home/ed/netcdf-c/ncdump/ncpathcvt
++ ncgen3c0=/home/ed/netcdf-c/ncgen3/c0.cdl
++ ncgenc0=/home/ed/netcdf-c/ncgen/c0.cdl
++ ncgenc04=/home/ed/netcdf-c/ncgen/c0_4.cdl
++ test x = xyes
++ test x = xyes
++ cd /home/ed/netcdf-c/nc_test
++ uname -p
+ CPU=x86_64
++ uname
+ OS=Linux
+ SIZE=500000000
+ FILE4=tst_diskless4.nc
+ rm -fr ref_tst_diskless4.cdl
+ cat
+ echo ''

+ rm -f tst_diskless4.nc
+ ./tst_diskless4 500000000 create

*** Create file
ok.
+ /home/ed/netcdf-c/ncdump/ncdump -h tst_diskless4.nc
+ diff -w - ref_tst_diskless4.cdl
+ echo ''

+ rm -f tst_diskless4.nc
+ ./tst_diskless4 500000000 creatediskless

*** Create file diskless
ok.
+ /home/ed/netcdf-c/ncdump/ncdump -h tst_diskless4.nc
+ diff -w - ref_tst_diskless4.cdl
+ echo ''

+ ./tst_diskless4 500000000 open

*** Open file
ok.
+ echo ''

+ ./tst_diskless4 500000000 opendiskless

*** Open file diskless
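
The hang is at the opendiskless step, which presumably opens the 500 MB file just written using NC_DISKLESS, so the whole file is read into memory at open time. For context, here is a minimal sketch of that kind of open, assuming the standard netCDF-C API (this is not the actual tst_diskless4.c code):

```c
#include <stdio.h>
#include <netcdf.h>

int main(void)
{
    int ncid;
    /* NC_DISKLESS on open tells the library to read the entire file into
       memory and serve all subsequent accesses from that in-memory copy. */
    int ret = nc_open("tst_diskless4.nc", NC_NOWRITE | NC_DISKLESS, &ncid);
    if (ret != NC_NOERR) {
        fprintf(stderr, "open failed: %s\n", nc_strerror(ret));
        return 1;
    }
    /* ... read back and verify the variable here ... */
    return nc_close(ncid);
}
```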

@edwardhartnett
Contributor Author

When I comment out running run_diskless2.sh, then all tests pass.

@DennisHeimbigner
Collaborator

The purpose of that test is to create a large in-memory file of size 500 megabytes.
So I think the issue is not the number of processors, but rather the amount of
virtual memory available.
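
Roughly, the diskless create path in the C API looks like the sketch below. This is not the code from tst_diskless4.c; the dimension and variable names are illustrative, and the NC_PERSIST flag (which asks the library to flush the in-memory file to disk at close) is assumed to be available in the netCDF-C release under test.

```c
#include <stdio.h>
#include <stdlib.h>
#include <netcdf.h>

#define CHECK(e) do { int r = (e); if (r != NC_NOERR) { \
    fprintf(stderr, "error: %s\n", nc_strerror(r)); exit(1); } } while (0)

int main(void)
{
    int ncid, dimid, varid;
    size_t len = 500000000 / sizeof(int);   /* roughly the 500 MB SIZE used by run_diskless2.sh */

    /* Create the file entirely in memory; NC_PERSIST asks for it to be
       written to disk at nc_close so that ncdump can inspect it afterwards. */
    CHECK(nc_create("tst_diskless4.nc", NC_CLOBBER | NC_DISKLESS | NC_PERSIST, &ncid));
    CHECK(nc_def_dim(ncid, "x", len, &dimid));
    CHECK(nc_def_var(ncid, "data", NC_INT, 1, &dimid, &varid));
    CHECK(nc_enddef(ncid));

    /* Fill the variable so the in-memory buffer is actually populated. */
    int *buf = malloc(len * sizeof(int));
    if (buf == NULL) { fprintf(stderr, "out of memory\n"); return 1; }
    for (size_t i = 0; i < len; i++)
        buf[i] = (int)i;
    CHECK(nc_put_var_int(ncid, varid, buf));
    free(buf);

    CHECK(nc_close(ncid));
    return 0;
}
```

The 500 MB of data lives entirely in process memory until the close, which is why the concern here is available virtual memory rather than processor count.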

@edwardhartnett
Contributor Author

I have ~57 GB of available memory:

free -g
              total        used        free      shared  buff/cache   available
Mem:             62           4          27           0          31          57
Swap:            93           0          93

@DennisHeimbigner
Collaborator

Refresh my memory; will each processor try to allocate the memory or will only
one processor do it?

@edwardhartnett
Contributor Author

This is a sequential test, so it is only running on one processor...

@DennisHeimbigner
Collaborator

Well, I will suppress this test when running in parallel. Hopefully that will fix the problem.

DennisHeimbigner added a commit to DennisHeimbigner/netcdf-c that referenced this issue Apr 28, 2022
## Include <getopt.h> in various utilities
re: Unidata#2303
As noted, some utilities are using getopt() without including
getopt.h, so add as needed.

## Turn off run_diskless2.sh when ENABLE_PARALLEL is true
re: Unidata#2315
Ed notes that this test hangs when running in parallel. The test
is attempting to create a very large in-memory file, which is
the proximate cause, but the underlying cause has not been identified.
@DennisHeimbigner
Collaborator

Fixed by PR #2316?

@dopplershift
Member

I would suggest that while #2316 addresses the fact that the test suite hangs on this configuration, it would be good to leave this issue open since no root cause has been identified. For all we know, this could be due to some nasty bug lurking somewhere. Unless I missed something and we would expect a test running on a single processor to fail in a parallel configuration.

@WardF
Member

WardF commented Apr 28, 2022

Agreed that this has an underlying issue that will need to be resolved.

WardF self-assigned this Apr 28, 2022
WardF added this to the 4.9.1 milestone Apr 28, 2022
@DennisHeimbigner
Collaborator

Since this test is allocating a 500 MB block of virtual memory, my suspicion
is that the one processor creating the block has to somehow pass (or copy) it
to all the other processors before it can continue.
Thoughts from Ed or Wei-keng would be welcome.

@edwardhartnett
Contributor Author

I don't think so. This is not a parallel I/O problem - it hangs in sequential mode too.

@DennisHeimbigner does this test work for you on your machine?

@dopplershift
Member

dopplershift commented Apr 29, 2022

"my suspicion is that the one processor creating the block has to somehow pass (or copy) it to all the other processors before it can continue"

Even if it was running in parallel, I would be completely and utterly shocked if that necessitated an actual copy rather than being handled by virtual addressing. And even if it did copy, with today's memory bandwidth that should take well under a second (500 MB at even 10 GB/s is about 50 ms).

@DennisHeimbigner
Collaborator

I pass. I have no other ideas about what might be happening.
Perhaps someone can apply a debugger or do some profiling to
find out.

@wkliao
Contributor

wkliao commented Apr 30, 2022

This should be resolved in PR #2319

WardF modified the milestones: 4.9.1, 4.9.2 Feb 13, 2023
WardF modified the milestones: 4.9.2, 4.9.3 May 16, 2023
@edwardhartnett
Contributor Author

This is fixed. I will close this issue.
