Assertion failed on parallel reads #72

Closed
vpipkt opened this issue Jan 1, 2020 · 15 comments · Fixed by #76
Labels
bug Something isn't working

Comments

@vpipkt

vpipkt commented Jan 1, 2020

Describe the bug

I am developing a RasterFrames job that uses this library, and I sometimes receive several illegal-argument messages followed by a failed assertion, a process crash, and a core dump. The job reads many JP2 (Sentinel-2 PDS) files in parallel. The files implicated in the errors vary non-deterministically from run to run. I have tried reading the files mentioned in the errors on their own, and the errors do not occur outside the context of the larger job, which runs on many cores.

A sample of the illegal argument messages:

[116 of 1000] FAILURE(3) CPLE_IllegalArg(5) "Illegal argument." /vsis3/sentinel-s2-l2a/tiles/22/L/FP/2019/8/14/0/R60m/B12.jp2, band 1: Illegal nBlockXOff value (171800590) in GDALRasterBand::GetLockedBlockRef()

[117 of 1000] FAILURE(3) CPLE_IllegalArg(5) "Illegal argument." /vsis3/sentinel-s2-l2a/tiles/22/L/FP/2019/8/14/0/R60m/B12.jp2, band 1: Illegal nBlockXOff value (-494424126) in GDALRasterBand::TryGetLockedBlockRef()

[118 of 1000] FAILURE(3) CPLE_IllegalArg(5) "Illegal argument." /vsis3/sentinel-s2-l2a/tiles/22/L/FP/2019/8/14/0/R60m/B12.jp2, band 1: Illegal nBlockXOff value (-494424126) in GDALRasterBand::GetLockedBlockRef()

[119 of 1000] FAILURE(3) CPLE_IllegalArg(5) "Illegal argument." /vsis3/sentinel-s2-l2a/tiles/22/L/FP/2019/8/14/0/R60m/B12.jp2, band 1: Illegal nBlockXOff value (22130851) in GDALRasterBand::TryGetLockedBlockRef()

[120 of 1000] FAILURE(3) CPLE_IllegalArg(5) "Illegal argument." /vsis3/sentinel-s2-l2a/tiles/22/L/FP/2019/8/14/0/R60m/B12.jp2, band 1: Illegal nBlockXOff value (22130851) in GDALRasterBand::GetLockedBlockRef()

The assertion error is:

java: locked_dataset.hpp:119: locked_dataset& locked_dataset::operator=(locked_dataset&&): Assertion `m_use_count == 0' failed.

I suspect the error and process crash have been foretold in this code comment.
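
For context, the assertion message points at the move-assignment operator of the library's locked_dataset wrapper (locked_dataset.hpp:119). The following is a paraphrased sketch of the invariant being enforced, not the actual gdal-warp-bindings source: only the assertion expression and its location come from the message above; the member layout and comments are illustrative assumptions.

// Paraphrased sketch, not the actual gdal-warp-bindings source: only the
// assertion `m_use_count == 0` and its home in the move-assignment operator
// are taken from the error message; everything else is illustrative.
#include <atomic>
#include <cassert>

struct locked_dataset_sketch {
    std::atomic<int> m_use_count{0};  // readers currently working through this dataset
    // GDAL dataset handles, locks, URI, etc. elided

    locked_dataset_sketch &operator=(locked_dataset_sketch &&rhs) noexcept {
        // A slot may only be overwritten (e.g. on cache eviction) when no
        // thread is still using it; otherwise this is exactly the
        // "Assertion `m_use_count == 0' failed" crash reported above.
        assert(m_use_count == 0);
        (void)rhs;  // transfer of handles elided in this sketch
        return *this;
    }
};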

To Reproduce

Steps to reproduce the behavior:

  1. Launch an EC2 instance:
    1. Amazon Linux 2 AMI,
    2. perhaps c5.9xlarge; reproduced on ~16 or more cores,
    3. in the eu-central-1 region (same as the PDS location),
    4. probably 50-100 GB of storage to collect the core dump,
    5. set up with an IAM role (or other permissions) to read the S3 PDS bucket (requester pays).
  2. Install Docker.
  3. docker pull s22s/rasterframes-notebook:0.9.0-RC1
  4. docker run -p 8888:8888 -p 4040:4040 -v /home/ec2-user/:/home/jovyan/work s22s/rasterframes-notebook:0.9.0-RC1
  5. Upload / scp minimal.py.zip and unzip it.
  6. From within the Jupyter notebook's terminal: $ python minimal.py
  7. The error may take several minutes to occur.

Expected behavior

I expect the job to complete with all JP2 files read without error, whether the job reads one file or many. As mentioned above, I have tried small single-file jobs on some of the files mentioned in the error messages, and those jobs complete fine.

Environment

  • Docker container: s22s/rasterframes-notebook:0.9.0-RC1
  • Running on an EC2 c5.9xlarge instance, 36 cores on the Spark driver.
  • JVM version: 1.8.0_222 (Private Build)
  • GDAL version: 2.4.2

Core Files

I have captured an 18 GB core dump. We'll have to figure out how to share it.

@jamesmcclain
Member

Thank you for the detailed report!

Would it be possible for you to put the core file on S3 somewhere and share the location with me?

@vpipkt
Author

vpipkt commented Jan 3, 2020

The Python script references this file: a dataframe full of S3 URIs for Sentinel-2 PDS JP2 files.

large_catalog.csv.zip

@jamesmcclain jamesmcclain added the bug Something isn't working label Jan 3, 2020
@jamesmcclain
Member

The Python script references this file: a dataframe full of S3 URIs for Sentinel-2 PDS JP2 files.

large_catalog.csv.zip

Thanks much.

@pomadchin
Member

pomadchin commented Jan 16, 2020

I still could not reproduce the issue. However, if the [116 of 1000] FAILURE(3) CPLE_IllegalArg(5) "Illegal argument." messages and the assertion failure are somehow connected, this could also be a driver / vsicurl issue.

GDAL 2.4.3 and 2.4.4 contained lots of fixes related to JP2K and VSICURL multithreading issues.

It would help a lot if you could rerun the same job with GDAL 2.4.4. /cc @vpipkt

Or perhaps you have already tried it?

P.S. This is certainly not a resolution to this issue, just a thought that may help in further investigation of the bug.

@vpipkt
Author

vpipkt commented Jan 16, 2020

@pomadchin I have not yet tried GDAL 2.4.4; that is excellent to know.

Another note: this bug is very much non-deterministic. I think the procedure above is fairly reliable at producing it, but the environment specifics seem to matter a lot.

If my hunch is correct, we may be able to tweak the Spark job to make the bug more likely to happen (for debugging) or less likely (for getting the job to run).

I will post again with more investigation around GDAL 2.4.4 and those job tweaks.

@pomadchin
Member

@vpipkt thanks! I'm working towards making this bug more deterministic and finding a way to make the debugging cycle less complicated.

@vpipkt
Author

vpipkt commented Jan 16, 2020

Yes, I just reproduced it again in the exact configuration described above. However, I realize that presents challenges for debugging and development ...

@vpipkt
Author

vpipkt commented Jan 16, 2020

I'm currently trying to reproduce with GDAL 2.4.4... the hurdle here is getting GDAL upgraded.

I also tried changing the partitioning scheme in the Spark job to co-locate reads of the same file on the same executor / core. That did not help: with GDAL 2.4.2 I get the same error with either partitioning strategy.

@pomadchin
Member

pomadchin commented Jan 16, 2020

@vpipkt I appreciate your help with that. If you could run the same job with GDAL 2.4.4 and reproduce it, that would be awesome.

I remember having similar issues with rasterio + vsis3: it could randomly throw the same or similar exceptions, it was impossible to reread the file in the same session, and the only workaround was to reboot or to copy the files onto a local filesystem and work with them locally.

@vpipkt
Author

vpipkt commented Jan 16, 2020

Output from minimal.py with this container, upgraded to GDAL 2.4.4 and with pip install --user pyrasterframes==0.9.0RC1:

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
{'scalaVersion': '2.11.12', 'sbtVersion': '1.2.8', 'name': 'core', 'rfSparkVersion': '2.4.4', 'rfGeoMesaVersion': '2.2.1', 'GDAL': 'GDAL 2.4.4, released 2020/01/08', 'rfGeoTrellisVersion': '3.1.0', 'version': '0.9.0-RC1'}
Number unique dates in the dataframe: 1012
[Stage 1:===>                                                  (65 + 36) / 1000]
java: locked_dataset.hpp:119: locked_dataset& locked_dataset::operator=(locked_dataset&&): Assertion `m_use_count == 0' failed.

A second attempt also failed in the same manner after a much longer run, finishing ~154 tasks.

So GDAL 2.4.4 is no magic bullet.

@pomadchin
Member

@vpipkt thanks for checking that!

@pomadchin
Member

pomadchin commented Jan 18, 2020

@jamesmcclain I think our theory about bad behaviour when the LRU cache is much smaller than the number of threads is correct (link to a modified code; only init(1) is important here). The idea of this experiment is that the LRU cache size is 1 and we spawn lots of threads.
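
To make that failure mode concrete, here is a self-contained toy program; it is not the pattern.cpp experiment and not the library's code, and every name in it is invented. It has the same shape, though: a single cache slot shared by many threads, where a simulated cache miss move-assigns over the slot without coordinating with readers that still hold it in use. Built without NDEBUG, the assert stands in for the `m_use_count == 0` assertion.

// Toy illustration only: a "cache" with a single slot and many threads.
// Eviction (move-assignment over the slot) races with readers that still
// hold the slot in use, eventually tripping the use-count assertion.
#include <atomic>
#include <cassert>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

struct slot {
    std::atomic<int> use_count{0};
    std::atomic<int> payload{0};

    slot &operator=(slot &&rhs) noexcept {
        assert(use_count == 0);        // same invariant as locked_dataset
        payload = rhs.payload;
        return *this;
    }
};

int main() {
    slot cache;                        // stands in for an LRU cache of size 1
    std::mutex eviction_lock;          // serializes evictions, but not readers

    auto worker = [&](int id) {
        for (int i = 0; i < 100000; ++i) {
            cache.use_count.fetch_add(1);   // "get" the cached dataset
            int v = cache.payload;          // pretend to read through it
            (void)v;
            cache.use_count.fetch_sub(1);   // release it

            // Simulate a miss: evict and replace the only slot.
            std::lock_guard<std::mutex> g(eviction_lock);
            slot fresh;
            fresh.payload = id;
            cache = std::move(fresh);       // may fire the assert
        }
    };

    std::vector<std::thread> threads;
    for (int t = 0; t < 16; ++t) threads.emplace_back(worker, t);
    for (auto &t : threads) t.join();
    return 0;
}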

P.S. I compiled it with the -ggdb3 flag and used the container that is used for tests.

docker run -it --rm \
      -v $(pwd):/workdir \
      -e CC=gcc -e CXX=g++ \
      -e CFLAGS="-Wall -Wno-sign-compare -Werror -O0 -ggdb3 -DSO_FINI -D_GNU_SOURCE" \
      -e JAVA_HOME="/usr/lib/jvm/java-8-openjdk-amd64" \
      jamesmcclain/gdal-build-environment:4 make -j4 -C src/experiments/thread pattern oversubscribe || exit -1
# gdb pattern
(gdb) run /tmp/c41078a1.tif 100 1000
Starting program: /workdir/src/experiments/thread/pattern /tmp/c41078a1.tif 100 1000
warning: Error disabling address space randomization: Operation not permitted
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
lg_steps = 100
n = 1000
[New Thread 0x7fa563c27700 (LWP 21)]
[New Thread 0x7fa563426700 (LWP 22)]
[New Thread 0x7fa562c25700 (LWP 23)]
[New Thread 0x7fa562424700 (LWP 24)]
[New Thread 0x7fa561c23700 (LWP 25)]
[New Thread 0x7fa561422700 (LWP 26)]
[New Thread 0x7fa5609b7700 (LWP 27)]
[New Thread 0x7fa547fff700 (LWP 28)]
[New Thread 0x7fa5477fe700 (LWP 29)]
[New Thread 0x7fa546ffd700 (LWP 30)]
[New Thread 0x7fa5467fc700 (LWP 31)]
[New Thread 0x7fa545ffb700 (LWP 32)]
[New Thread 0x7fa5457fa700 (LWP 33)]
[New Thread 0x7fa544ff9700 (LWP 34)]
[New Thread 0x7fa52bfff700 (LWP 35)]
[New Thread 0x7fa52b7fe700 (LWP 36)]

Thread 7 "pattern" received signal SIGABRT, Aborted.
[Switching to Thread 0x7fa561422700 (LWP 26)]
__GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
51	../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) thread 7
[Switching to thread 7 (Thread 0x7fa561422700 (LWP 26))]
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
51	in ../sysdeps/unix/sysv/linux/raise.c
(gdb) bt
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1  0x00007fa565722801 in __GI_abort () at abort.c:79
#2  0x00007fa56571239a in __assert_fail_base (fmt=0x7fa5658997d8 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", 
    assertion=assertion@entry=0x7fa565f66ce7 "m_use_count == 0", file=file@entry=0x7fa565f66cd4 "locked_dataset.hpp", 
    line=line@entry=119, 
    function=function@entry=0x7fa565f66e00 <locked_dataset::operator=(locked_dataset&&)::__PRETTY_FUNCTION__> "locked_dataset& locked_dataset::operator=(locked_dataset&&)") at assert.c:92
#3  0x00007fa565712412 in __GI___assert_fail (assertion=0x7fa565f66ce7 "m_use_count == 0", 
    file=0x7fa565f66cd4 "locked_dataset.hpp", line=119, 
    function=0x7fa565f66e00 <locked_dataset::operator=(locked_dataset&&)::__PRETTY_FUNCTION__> "locked_dataset& locked_dataset::operator=(locked_dataset&&)") at assert.c:101
#4  0x00007fa565f45ca8 in locked_dataset::operator= (this=0x5604706d35b0, rhs=...) at locked_dataset.hpp:119
#5  0x00007fa565f4792c in flat_lru_cache::insert (this=0x560470722070, tag=2793181351846702632, key=...) at flat_lru_cache.hpp:282
#6  0x00007fa565f47672 in flat_lru_cache::get (this=0x560470722070, key=..., copies=-4) at flat_lru_cache.hpp:229
#7  0x00007fa565f43b6b in get_crs_wkt (token=11549000030553240900, dataset=0, attempts=1048576, copies=-4, crs=0x7fa54c000b20 "", 
    max_size=65536) at bindings.cpp:544
#8  0x000056046efb4460 in reader (argv1=0x7ffca2dc7599) at pattern.cpp:92
#9  0x00007fa565cf26db in start_thread (arg=0x7fa561422700) at pthread_create.c:463
#10 0x00007fa56580388f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

To summarize the error:

(gdb) bt
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1  0x00007fa565722801 in __GI_abort () at abort.c:79
#2  0x00007fa56571239a in __assert_fail_base (fmt=0x7fa5658997d8 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", 
    assertion=assertion@entry=0x7fa565f66ce7 "m_use_count == 0", file=file@entry=0x7fa565f66cd4 "locked_dataset.hpp", 
    line=line@entry=119, 
    function=function@entry=0x7fa565f66e00 <locked_dataset::operator=(locked_dataset&&)::__PRETTY_FUNCTION__> "locked_dataset& locked_dataset::operator=(locked_dataset&&)") at assert.c:92
#3  0x00007fa565712412 in __GI___assert_fail (assertion=0x7fa565f66ce7 "m_use_count == 0", 
    file=0x7fa565f66cd4 "locked_dataset.hpp", line=119, 
    function=0x7fa565f66e00 <locked_dataset::operator=(locked_dataset&&)::__PRETTY_FUNCTION__> "locked_dataset& locked_dataset::operator=(locked_dataset&&)") at assert.c:101
#4  0x00007fa565f45ca8 in locked_dataset::operator= (this=0x5604706d35b0, rhs=...) at locked_dataset.hpp:119
#5  0x00007fa565f4792c in flat_lru_cache::insert (this=0x560470722070, tag=2793181351846702632, key=...) at flat_lru_cache.hpp:282
#6  0x00007fa565f47672 in flat_lru_cache::get (this=0x560470722070, key=..., copies=-4) at flat_lru_cache.hpp:229
#7  0x00007fa565f43b6b in get_crs_wkt (token=11549000030553240900, dataset=0, attempts=1048576, copies=-4, crs=0x7fa54c000b20 "", 
    max_size=65536) at bindings.cpp:544

It looks very similar to what we got from @vpipkt's core dump:

// gdal-warp-bindings-33.61199eb.jar 
(gdb) bt
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1  0x00007f87b2c5c801 in __GI_abort () at abort.c:79
#2  0x00007f87b2c4c39a in __assert_fail_base (fmt=0x7f87b2dd37d8 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", 
    assertion=assertion@entry=0x7f79665eb27f "m_use_count == 0", file=file@entry=0x7f79665eb26c "locked_dataset.hpp", 
    line=line@entry=119, 
    function=function@entry=0x7f79665eb3a0 <locked_dataset::operator=(locked_dataset&&)::__PRETTY_FUNCTION__> "locked_dataset& locked_dataset::operator=(locked_dataset&&)") at assert.c:92
#3  0x00007f87b2c4c412 in __GI___assert_fail (assertion=0x7f79665eb27f "m_use_count == 0", 
    file=0x7f79665eb26c "locked_dataset.hpp", line=119, 
    function=0x7f79665eb3a0 <locked_dataset::operator=(locked_dataset&&)::__PRETTY_FUNCTION__> "locked_dataset& locked_dataset::operator=(locked_dataset&&)") at assert.c:101
#4  0x00007f79665ca376 in locked_dataset::operator= (this=0x7f78798240c0, rhs=...) at locked_dataset.hpp:119
#5  0x00007f79665cbf14 in flat_lru_cache::insert (this=0x7f7879820ce0, tag=9497314517541510089, key=...) at flat_lru_cache.hpp:282
#6  0x00007f79665cbc5a in flat_lru_cache::get (this=0x7f7879820ce0, key=..., copies=-4) at flat_lru_cache.hpp:229
#7  0x00007f79665c85d9 in get_band_nodata (token=6327110207382174371, dataset=0, attempts=1048576, copies=-4, band_number=1, 
    nodata=0x7f77289a8290, success=0x7f7728113bd0) at bindings.cpp:541
#8  0x00007f79665c5361 in Java_com_azavea_gdal_GDALWarp_get_1band_1nodata (env=0x7f783403d1e0, obj=0x7f78d91e24a0, 
    token=6327110207382174371, dataset=0, attempts=1048576, band=1, _nodata=0x7f78d91e2490, _success=0x7f78d91e2498)
    at com_azavea_gdal_GDALWarp.c:315

@jamesmcclain
Member

Okay, that is great information. It looks like active datasets are being evicted.

@jamesmcclain
Member

jamesmcclain commented Jan 21, 2020

@pomadchin

It looks like the TRYLOCK in lock_for_deletion may be causing the function to erroneously return true sometimes.

Use of that macro there is clearly an error. We should look for similar errors elsewhere in the code and verify that fixing it resolves the reported issue. This code should also pay attention to the return value of the insert method.
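
To illustrate the class of mistake only (the actual lock_for_deletion is not reproduced here): pthread_mutex_trylock returns 0 on success and a non-zero code (EBUSY) when the mutex is already held, so a check that misreads a non-zero result as "safe to delete", or that does not also consult the dataset's use count, can report an in-use dataset as deletable. The function names and the use_count parameter below are invented for this sketch.

// Illustrative only; these are not the gdal-warp-bindings functions.
#include <pthread.h>

// One way a trylock-based check can erroneously report success:
// pthread_mutex_trylock returns 0 when the lock WAS acquired and a
// non-zero error code (EBUSY) when it was not, so this condition is
// inverted and returns true exactly when another thread holds the lock.
bool lock_for_deletion_inverted(pthread_mutex_t *m) {
    if (pthread_mutex_trylock(m)) {
        return true;               // trylock FAILED, but treated as success
    }
    pthread_mutex_unlock(m);
    return false;
}

// Safer shape: only report success when the lock was actually acquired
// and no reader still holds a reference to the dataset.
bool lock_for_deletion_sketch(pthread_mutex_t *m, int use_count) {
    if (pthread_mutex_trylock(m) != 0) {
        return false;              // contended: do not evict
    }
    if (use_count != 0) {
        pthread_mutex_unlock(m);   // still in use: back off
        return false;
    }
    return true;                   // caller unlocks after deleting the dataset
}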

@vpipkt
Author

vpipkt commented Jan 22, 2020

To reproduce with PR #76 applied, use the image here: https://hub.docker.com/layers/vpipkt/rasterframes-notebook/0.9.0-SNAPSHOT/images/sha256-81a664df8882cae8058850a90acb6f247c8c17cb0403ed416dd71f16c26840db
