Assertion failed on parallel reads #72

Closed
vpipkt opened this issue Jan 1, 2020 · 15 comments · Fixed by #76
Labels
bug Something isn't working

Comments

@vpipkt

vpipkt commented Jan 1, 2020

Describe the bug

I am developing a RasterFrames job that uses this library, and I sometimes receive several illegal-argument messages followed by a failed assertion, a process crash, and a core dump. The job reads many JP2 (Sentinel-2 PDS) files in parallel. The files implicated in the errors vary non-deterministically from run to run. I have tried reading the files mentioned in the errors on their own, and the errors do not occur outside the context of the larger job, which runs on many cores.

A sample of the illegal argument messages:

[116 of 1000] FAILURE(3) CPLE_IllegalArg(5) "Illegal argument." /vsis3/sentinel-s2-l2a/tiles/22/L/FP/2019/8/14/0/R60m/B12.jp2, band 1: Illegal nBlockXOff value (171800590) in GDALRasterBand::GetLockedBlockRef()

[117 of 1000] FAILURE(3) CPLE_IllegalArg(5) "Illegal argument." /vsis3/sentinel-s2-l2a/tiles/22/L/FP/2019/8/14/0/R60m/B12.jp2, band 1: Illegal nBlockXOff value (-494424126) in GDALRasterBand::TryGetLockedBlockRef()

[118 of 1000] FAILURE(3) CPLE_IllegalArg(5) "Illegal argument." /vsis3/sentinel-s2-l2a/tiles/22/L/FP/2019/8/14/0/R60m/B12.jp2, band 1: Illegal nBlockXOff value (-494424126) in GDALRasterBand::GetLockedBlockRef()

[119 of 1000] FAILURE(3) CPLE_IllegalArg(5) "Illegal argument." /vsis3/sentinel-s2-l2a/tiles/22/L/FP/2019/8/14/0/R60m/B12.jp2, band 1: Illegal nBlockXOff value (22130851) in GDALRasterBand::TryGetLockedBlockRef()

[120 of 1000] FAILURE(3) CPLE_IllegalArg(5) "Illegal argument." /vsis3/sentinel-s2-l2a/tiles/22/L/FP/2019/8/14/0/R60m/B12.jp2, band 1: Illegal nBlockXOff value (22130851) in GDALRasterBand::GetLockedBlockRef()

The assertion error is:

java: locked_dataset.hpp:119: locked_dataset& locked_dataset::operator=(locked_dataset&&): Assertion `m_use_count == 0' failed.

I suspect the error and process crash have been foretold in this code comment.
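
For context, the assertion message points at the move-assignment operator of the library's locked_dataset wrapper (locked_dataset.hpp:119). The following is a paraphrased sketch of the invariant being enforced, not the actual gdal-warp-bindings source: only the assertion expression and its location come from the message above; the member layout and comments are illustrative assumptions.

// Paraphrased sketch, not the actual gdal-warp-bindings source: only the
// assertion `m_use_count == 0` and its home in the move-assignment operator
// are taken from the error message; everything else is illustrative.
#include <atomic>
#include <cassert>

struct locked_dataset_sketch {
    std::atomic<int> m_use_count{0};  // readers currently working through this dataset
    // GDAL dataset handles, locks, URI, etc. elided

    locked_dataset_sketch &operator=(locked_dataset_sketch &&rhs) noexcept {
        // A slot may only be overwritten (e.g. on cache eviction) when no
        // thread is still using it; otherwise this is exactly the
        // "Assertion `m_use_count == 0' failed" crash reported above.
        assert(m_use_count == 0);
        (void)rhs;  // transfer of handles elided in this sketch
        return *this;
    }
};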

To Reproduce

Steps to reproduce the behavior:

  1. Launch an EC2 instance:
    1. Amazon Linux 2 AMI,
    2. perhaps c5.9xlarge; reproduced on ~16 or more cores,
    3. in the eu-central-1 region (same as the PDS location),
    4. probably 50-100 GB of storage to collect the core dump,
    5. set up with an IAM role (or other permissions) to read the S3 PDS bucket (requester pays).
  2. Install Docker.
  3. docker pull s22s/rasterframes-notebook:0.9.0-RC1
  4. docker run -p 8888:8888 -p 4040:4040 -v /home/ec2-user/:/home/jovyan/work s22s/rasterframes-notebook:0.9.0-RC1
  5. Upload / scp minimal.py.zip and unzip it.
  6. From within the Jupyter notebook's terminal: $ python minimal.py
  7. The error may take several minutes to occur.

Expected behavior

I expect the job to complete with all JP2 files read without error, whether the job reads one file or many. As mentioned above, I have tried small single-file jobs on some of the files mentioned in the error messages, and those jobs complete fine.

Environment

  • Docker container: s22s/rasterframes-notebook:0.9.0-RC1
  • Running on an EC2 c5.9xlarge instance, 36 cores on the Spark driver.
  • JVM version: 1.8.0_222 (Private Build)
  • GDAL version: 2.4.2

Core Files

I have captured an 18 GB core dump. We'll have to figure out how to share it.

@jamesmcclain
Member

Thank you for the detailed report!

Would it be possible for you to put the core file on S3 somewhere and share the location with me?

@vpipkt
Author

vpipkt commented Jan 3, 2020

The Python script references this file: a dataframe full of S3 URIs for Sentinel-2 PDS JP2 files.

large_catalog.csv.zip

@jamesmcclain jamesmcclain added the bug Something isn't working label Jan 3, 2020
@jamesmcclain
Member

The Python script references this file: a dataframe full of S3 URIs for Sentinel-2 PDS JP2 files.

large_catalog.csv.zip

Thanks much.

@pomadchin
Member

pomadchin commented Jan 16, 2020

I still could not reproduce the issue. However, if the [116 of 1000] FAILURE(3) CPLE_IllegalArg(5) "Illegal argument." messages and the assertion failure are somehow connected, this could also be a driver / vsicurl issue.

GDAL 2.4.3 and 2.4.4 contained lots of fixes related to JP2K and VSICURL multithreading issues.

It would help a lot if you could rerun the same job with GDAL 2.4.4. /cc @vpipkt

Or perhaps you have already tried it?

P.S. This is certainly not a resolution to this issue, just a thought that may help in further investigation of the bug.

@vpipkt
Author

vpipkt commented Jan 16, 2020

@pomadchin I have not yet tried GDAL 2.4.4; that is excellent to know.

Another note: this bug is very much non-deterministic. I think the procedure above is fairly reliable at producing it, but the environment specifics seem to matter a lot.

If my hunch is correct, we may be able to tweak the Spark job to make the bug more likely to happen (for debugging) or less likely (for getting the job to run).

I will post again with more investigation around GDAL 2.4.4 and those job tweaks.

@pomadchin
Member

@vpipkt thanks! I'm working towards making this bug more deterministic and finding a way to make the debugging cycle less complicated.

@vpipkt
Author

vpipkt commented Jan 16, 2020

Yes, I just reproduced it again in the exact configuration described above. However, I realize that presents challenges for debugging and development ...

@vpipkt
Author

vpipkt commented Jan 16, 2020

I'm currently trying to reproduce with GDAL 2.4.4... the hurdle here is getting GDAL upgraded.

I also tried changing the partitioning scheme in the Spark job to co-locate reads of the same file on the same executor / core. That did not help: with GDAL 2.4.2 I get the same error with either partitioning strategy.

@pomadchin
Member

pomadchin commented Jan 16, 2020

@vpipkt I appreciate your help with that. If you could run the same job with GDAL 2.4.4 and reproduce it, that would be awesome.

I remember having similar issues with rasterio + vsis3: it could randomly throw the same or similar exceptions, it was impossible to reread the file in the same session, and the only workaround was to reboot or to copy the files onto a local filesystem and work with them locally.

@vpipkt
Author

vpipkt commented Jan 16, 2020

Output from minimal.py with this container, upgraded to GDAL 2.4.4 and with pip install --user pyrasterframes==0.9.0RC1:

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
{'scalaVersion': '2.11.12', 'sbtVersion': '1.2.8', 'name': 'core', 'rfSparkVersion': '2.4.4', 'rfGeoMesaVersion': '2.2.1', 'GDAL': 'GDAL 2.4.4, released 2020/01/08', 'rfGeoTrellisVersion': '3.1.0', 'version': '0.9.0-RC1'}
Number unique dates in the dataframe: 1012
[Stage 1:===>                                                  (65 + 36) / 1000]
java: locked_dataset.hpp:119: locked_dataset& locked_dataset::operator=(locked_dataset&&): Assertion `m_use_count == 0' failed.

A second attempt also failed in the same manner after a much longer run, finishing ~154 tasks.

So GDAL 2.4.4 is no magic bullet.

@pomadchin
Member

@vpipkt thanks for checking that!

@pomadchin
Member

pomadchin commented Jan 18, 2020

@jamesmcclain I think our theory about bad behaviour when the LRU cache is much smaller than the number of threads is correct (link to a modified code; only init(1) is important here). The idea of this experiment is that the LRU cache size is 1 and we spawn lots of threads.
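
To make that failure mode concrete, here is a self-contained toy program; it is not the pattern.cpp experiment and not the library's code, and every name in it is invented. It has the same shape, though: a single cache slot shared by many threads, where a simulated cache miss move-assigns over the slot without coordinating with readers that still hold it in use. Built without NDEBUG, the assert stands in for the `m_use_count == 0` assertion.

// Toy illustration only: a "cache" with a single slot and many threads.
// Eviction (move-assignment over the slot) races with readers that still
// hold the slot in use, eventually tripping the use-count assertion.
#include <atomic>
#include <cassert>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

struct slot {
    std::atomic<int> use_count{0};
    std::atomic<int> payload{0};

    slot &operator=(slot &&rhs) noexcept {
        assert(use_count == 0);        // same invariant as locked_dataset
        payload = rhs.payload;
        return *this;
    }
};

int main() {
    slot cache;                        // stands in for an LRU cache of size 1
    std::mutex eviction_lock;          // serializes evictions, but not readers

    auto worker = [&](int id) {
        for (int i = 0; i < 100000; ++i) {
            cache.use_count.fetch_add(1);   // "get" the cached dataset
            int v = cache.payload;          // pretend to read through it
            (void)v;
            cache.use_count.fetch_sub(1);   // release it

            // Simulate a miss: evict and replace the only slot.
            std::lock_guard<std::mutex> g(eviction_lock);
            slot fresh;
            fresh.payload = id;
            cache = std::move(fresh);       // may fire the assert
        }
    };

    std::vector<std::thread> threads;
    for (int t = 0; t < 16; ++t) threads.emplace_back(worker, t);
    for (auto &t : threads) t.join();
    return 0;
}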

P.S. I compiled it with the -ggdb3 flag and used the container that is used for tests.

docker run -it --rm \
      -v $(pwd):/workdir \
      -e CC=gcc -e CXX=g++ \
      -e CFLAGS="-Wall -Wno-sign-compare -Werror -O0 -ggdb3 -DSO_FINI -D_GNU_SOURCE" \
      -e JAVA_HOME="/usr/lib/jvm/java-8-openjdk-amd64" \
      jamesmcclain/gdal-build-environment:4 make -j4 -C src/experiments/thread pattern oversubscribe || exit -1
# gdb pattern
(gdb) run /tmp/c41078a1.tif 100 1000
Starting program: /workdir/src/experiments/thread/pattern /tmp/c41078a1.tif 100 1000
warning: Error disabling address space randomization: Operation not permitted
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
lg_steps = 100
n = 1000
[New Thread 0x7fa563c27700 (LWP 21)]
[New Thread 0x7fa563426700 (LWP 22)]
[New Thread 0x7fa562c25700 (LWP 23)]
[New Thread 0x7fa562424700 (LWP 24)]
[New Thread 0x7fa561c23700 (LWP 25)]
[New Thread 0x7fa561422700 (LWP 26)]
[New Thread 0x7fa5609b7700 (LWP 27)]
[New Thread 0x7fa547fff700 (LWP 28)]
[New Thread 0x7fa5477fe700 (LWP 29)]
[New Thread 0x7fa546ffd700 (LWP 30)]
[New Thread 0x7fa5467fc700 (LWP 31)]
[New Thread 0x7fa545ffb700 (LWP 32)]
[New Thread 0x7fa5457fa700 (LWP 33)]
[New Thread 0x7fa544ff9700 (LWP 34)]
[New Thread 0x7fa52bfff700 (LWP 35)]
[New Thread 0x7fa52b7fe700 (LWP 36)]

Thread 7 "pattern" received signal SIGABRT, Aborted.
[Switching to Thread 0x7fa561422700 (LWP 26)]
__GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
51	../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) thread 7
[Switching to thread 7 (Thread 0x7fa561422700 (LWP 26))]
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
51	in ../sysdeps/unix/sysv/linux/raise.c
(gdb) bt
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1  0x00007fa565722801 in __GI_abort () at abort.c:79
#2  0x00007fa56571239a in __assert_fail_base (fmt=0x7fa5658997d8 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", 
    assertion=assertion@entry=0x7fa565f66ce7 "m_use_count == 0", file=file@entry=0x7fa565f66cd4 "locked_dataset.hpp", 
    line=line@entry=119, 
    function=function@entry=0x7fa565f66e00 <locked_dataset::operator=(locked_dataset&&)::__PRETTY_FUNCTION__> "locked_dataset& locked_dataset::operator=(locked_dataset&&)") at assert.c:92
#3  0x00007fa565712412 in __GI___assert_fail (assertion=0x7fa565f66ce7 "m_use_count == 0", 
    file=0x7fa565f66cd4 "locked_dataset.hpp", line=119, 
    function=0x7fa565f66e00 <locked_dataset::operator=(locked_dataset&&)::__PRETTY_FUNCTION__> "locked_dataset& locked_dataset::operator=(locked_dataset&&)") at assert.c:101
#4  0x00007fa565f45ca8 in locked_dataset::operator= (this=0x5604706d35b0, rhs=...) at locked_dataset.hpp:119
#5  0x00007fa565f4792c in flat_lru_cache::insert (this=0x560470722070, tag=2793181351846702632, key=...) at flat_lru_cache.hpp:282
#6  0x00007fa565f47672 in flat_lru_cache::get (this=0x560470722070, key=..., copies=-4) at flat_lru_cache.hpp:229
#7  0x00007fa565f43b6b in get_crs_wkt (token=11549000030553240900, dataset=0, attempts=1048576, copies=-4, crs=0x7fa54c000b20 "", 
    max_size=65536) at bindings.cpp:544
#8  0x000056046efb4460 in reader (argv1=0x7ffca2dc7599) at pattern.cpp:92
#9  0x00007fa565cf26db in start_thread (arg=0x7fa561422700) at pthread_create.c:463
#10 0x00007fa56580388f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

To summarize the error:

(gdb) bt
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1  0x00007fa565722801 in __GI_abort () at abort.c:79
#2  0x00007fa56571239a in __assert_fail_base (fmt=0x7fa5658997d8 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", 
    assertion=assertion@entry=0x7fa565f66ce7 "m_use_count == 0", file=file@entry=0x7fa565f66cd4 "locked_dataset.hpp", 
    line=line@entry=119, 
    function=function@entry=0x7fa565f66e00 <locked_dataset::operator=(locked_dataset&&)::__PRETTY_FUNCTION__> "locked_dataset& locked_dataset::operator=(locked_dataset&&)") at assert.c:92
#3  0x00007fa565712412 in __GI___assert_fail (assertion=0x7fa565f66ce7 "m_use_count == 0", 
    file=0x7fa565f66cd4 "locked_dataset.hpp", line=119, 
    function=0x7fa565f66e00 <locked_dataset::operator=(locked_dataset&&)::__PRETTY_FUNCTION__> "locked_dataset& locked_dataset::operator=(locked_dataset&&)") at assert.c:101
#4  0x00007fa565f45ca8 in locked_dataset::operator= (this=0x5604706d35b0, rhs=...) at locked_dataset.hpp:119
#5  0x00007fa565f4792c in flat_lru_cache::insert (this=0x560470722070, tag=2793181351846702632, key=...) at flat_lru_cache.hpp:282
#6  0x00007fa565f47672 in flat_lru_cache::get (this=0x560470722070, key=..., copies=-4) at flat_lru_cache.hpp:229
#7  0x00007fa565f43b6b in get_crs_wkt (token=11549000030553240900, dataset=0, attempts=1048576, copies=-4, crs=0x7fa54c000b20 "", 
    max_size=65536) at bindings.cpp:544

It looks very similar to what we got from @vpipkt's core dump:

// gdal-warp-bindings-33.61199eb.jar 
(gdb) bt
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1  0x00007f87b2c5c801 in __GI_abort () at abort.c:79
#2  0x00007f87b2c4c39a in __assert_fail_base (fmt=0x7f87b2dd37d8 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", 
    assertion=assertion@entry=0x7f79665eb27f "m_use_count == 0", file=file@entry=0x7f79665eb26c "locked_dataset.hpp", 
    line=line@entry=119, 
    function=function@entry=0x7f79665eb3a0 <locked_dataset::operator=(locked_dataset&&)::__PRETTY_FUNCTION__> "locked_dataset& locked_dataset::operator=(locked_dataset&&)") at assert.c:92
#3  0x00007f87b2c4c412 in __GI___assert_fail (assertion=0x7f79665eb27f "m_use_count == 0", 
    file=0x7f79665eb26c "locked_dataset.hpp", line=119, 
    function=0x7f79665eb3a0 <locked_dataset::operator=(locked_dataset&&)::__PRETTY_FUNCTION__> "locked_dataset& locked_dataset::operator=(locked_dataset&&)") at assert.c:101
#4  0x00007f79665ca376 in locked_dataset::operator= (this=0x7f78798240c0, rhs=...) at locked_dataset.hpp:119
#5  0x00007f79665cbf14 in flat_lru_cache::insert (this=0x7f7879820ce0, tag=9497314517541510089, key=...) at flat_lru_cache.hpp:282
#6  0x00007f79665cbc5a in flat_lru_cache::get (this=0x7f7879820ce0, key=..., copies=-4) at flat_lru_cache.hpp:229
#7  0x00007f79665c85d9 in get_band_nodata (token=6327110207382174371, dataset=0, attempts=1048576, copies=-4, band_number=1, 
    nodata=0x7f77289a8290, success=0x7f7728113bd0) at bindings.cpp:541
#8  0x00007f79665c5361 in Java_com_azavea_gdal_GDALWarp_get_1band_1nodata (env=0x7f783403d1e0, obj=0x7f78d91e24a0, 
    token=6327110207382174371, dataset=0, attempts=1048576, band=1, _nodata=0x7f78d91e2490, _success=0x7f78d91e2498)
    at com_azavea_gdal_GDALWarp.c:315

@jamesmcclain
Member

Okay, that is great information. It looks like active datasets are being evicted.

@jamesmcclain
Member

jamesmcclain commented Jan 21, 2020

@pomadchin

It looks like the TRYLOCK in lock_for_deletion may be causing the function to erroneously return true sometimes.

Use of that macro there is clearly an error. We should look for similar errors elsewhere in the code and verify that fixing it resolves the reported issue. This code should also pay attention to the return value of the insert method.
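
To illustrate the class of mistake only (the actual lock_for_deletion is not reproduced here): pthread_mutex_trylock returns 0 on success and a non-zero code (EBUSY) when the mutex is already held, so a check that misreads a non-zero result as "safe to delete", or that does not also consult the dataset's use count, can report an in-use dataset as deletable. The function names and the use_count parameter below are invented for this sketch.

// Illustrative only; these are not the gdal-warp-bindings functions.
#include <pthread.h>

// One way a trylock-based check can erroneously report success:
// pthread_mutex_trylock returns 0 when the lock WAS acquired and a
// non-zero error code (EBUSY) when it was not, so this condition is
// inverted and returns true exactly when another thread holds the lock.
bool lock_for_deletion_inverted(pthread_mutex_t *m) {
    if (pthread_mutex_trylock(m)) {
        return true;               // trylock FAILED, but treated as success
    }
    pthread_mutex_unlock(m);
    return false;
}

// Safer shape: only report success when the lock was actually acquired
// and no reader still holds a reference to the dataset.
bool lock_for_deletion_sketch(pthread_mutex_t *m, int use_count) {
    if (pthread_mutex_trylock(m) != 0) {
        return false;              // contended: do not evict
    }
    if (use_count != 0) {
        pthread_mutex_unlock(m);   // still in use: back off
        return false;
    }
    return true;                   // caller unlocks after deleting the dataset
}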

@vpipkt
Author

vpipkt commented Jan 22, 2020

To reproduce with PR #76 applied, use the image here: https://hub.docker.com/layers/vpipkt/rasterframes-notebook/0.9.0-SNAPSHOT/images/sha256-81a664df8882cae8058850a90acb6f247c8c17cb0403ed416dd71f16c26840db
