Assertion failed on parallel reads #72
Comments
Thank you for the detailed report! Would it be possible for you to put the core file on S3 somewhere and share the location with me?
The Python script references this file: a dataframe full of S3 URIs for Sentinel-2 PDS JP2 files.
Thanks much.
I still could not reproduce the issue. However, GDAL 2.4.3 and 2.4.4 contained lots of fixes related to JP2K and VSICURL multithreading issues, so it could help a lot if you could rerun the same job with GDAL 2.4.4. /cc @vpipkt Or have you perhaps already tried it? P.S. This is certainly not a resolution to this issue, just a random thought that may help you / us in further bug investigation.
@pomadchin I have not tried GDAL 2.4.4 yet; that is excellent to know. Another note is that this bug is very much non-deterministic. I think the procedure above is pretty reliable at producing the bug, but the environment specifics seem to matter a lot. If my hunch about it is correct, we may be able to tweak the Spark job to make the bug more likely (for debugging) or less likely (for getting the job to run). I will post again with some more investigation around GDAL 2.4.4 and those job tweaks.
@vpipkt thanks! I'm working towards making this bug more deterministic and finding a way to make this debugging cycle less complicated.
Yes, I just reproduced it again in the exact configuration described above. However, I realize that presents challenges for debugging and development ...
I'm currently trying to reproduce with GDAL 2.4.4... the hurdle here is getting GDAL upgraded ... I also tried changing the partitioning scheme in the Spark job to try to co-locate reads of the same file on the same executor / core. My attempt did not help: with GDAL 2.4.2 I get the same error for either partitioning strategy.
@vpipkt I appreciate your help with that; if you could run the same with GDAL 2.4.4 and reproduce it, that would be awesome. I remember that I had some similar issues with
Output from
A second attempt also failed in the same manner after a much longer compute time and finishing ~154 tasks. So GDAL 2.4.4 is no magic bullet.
@vpipkt thanks for checking that!
@jamesmcclain I think our theory about bad behaviour when the LRU cache is much smaller than the number of threads is more or less correct (link to the modified code); only init(1) is important here. The idea of this experiment is that the LRU cache size is 1 and we spawn lots of threads. P.S. I compiled it with:
docker run -it --rm \
-v $(pwd):/workdir \
-e CC=gcc -e CXX=g++ \
-e CFLAGS="-Wall -Wno-sign-compare -Werror -O0 -ggdb3 -DSO_FINI -D_GNU_SOURCE" \
-e JAVA_HOME="/usr/lib/jvm/java-8-openjdk-amd64" \
jamesmcclain/gdal-build-environment:4 make -j4 -C src/experiments/thread pattern oversubscribe || exit -1
To summarize the error:
It looks very similar to what we got from @vpipkt's core dump:
Okay, that is great information. It looks like active datasets are being evicted.
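To make the suspected failure mode concrete, here is a minimal sketch in Python rather than the library's own code. It assumes a naive LRU cache that evicts purely by recency and ignores whether an entry still has active readers; `Dataset`, `NaiveLRU`, and `evicted_while_in_use` are hypothetical names invented for this illustration. With a capacity of 1 and several readers active at once, every entry but the newest is closed while still in use.

```python
from collections import OrderedDict


class Dataset:
    """Hypothetical stand-in for an open dataset handle."""

    def __init__(self, uri):
        self.uri = uri
        self.closed = False
        self.users = 0  # number of callers currently reading


class NaiveLRU:
    """Illustrative cache that evicts by recency only (an assumption)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()

    def acquire(self, uri):
        ds = self.entries.get(uri)
        if ds is None:
            ds = Dataset(uri)
            self.entries[uri] = ds
            # Evict the least recently used entries, ignoring victim.users.
            while len(self.entries) > self.capacity:
                _, victim = self.entries.popitem(last=False)
                victim.closed = True
        else:
            self.entries.move_to_end(uri)
        ds.users += 1
        return ds


def evicted_while_in_use(capacity, uris):
    """Return the URIs whose handles were closed while still held."""
    cache = NaiveLRU(capacity)
    held = [cache.acquire(u) for u in uris]  # all readers active at once
    return [ds.uri for ds in held if ds.closed and ds.users > 0]
```

Calling `evicted_while_in_use(1, ["a.jp2", "b.jp2", "c.jp2"])` reports `a.jp2` and `b.jp2` as evicted while in use, whereas a capacity of 3 reports none. The readers are simulated sequentially so the result is deterministic; real threads would hit the same condition only some of the time.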
It looks like the TRYLOCK in lock_for_deletion may be causing the function to erroneously return true sometimes. Use of that macro there is clearly an error. We should look for similar errors elsewhere in the code and try to verify that fixing this one fixes the reported issue. This code should also pay attention to the return value of the insert method.
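As a rough illustration of how a try-lock can lead to a false "true", the Python sketch below contrasts a function that discards the try-lock result with one that returns it. This is an assumed mechanism for illustration only, not the library's actual code: `Entry`, `lock_for_deletion_buggy`, `lock_for_deletion_checked`, and `demo` are hypothetical names, and `threading.Lock.acquire(blocking=False)` stands in for the TRYLOCK macro.

```python
import threading


class Entry:
    """Hypothetical cache entry guarded by a non-reentrant lock."""

    def __init__(self):
        self.lock = threading.Lock()


def lock_for_deletion_buggy(entry):
    # Ignores the try-lock result, so it reports success even when
    # another holder still owns the lock.
    entry.lock.acquire(blocking=False)
    return True


def lock_for_deletion_checked(entry):
    # Reports success only if the lock was actually taken.
    return entry.lock.acquire(blocking=False)


def demo():
    entry = Entry()
    entry.lock.acquire()  # a reader currently holds the entry
    buggy = lock_for_deletion_buggy(entry)
    checked = lock_for_deletion_checked(entry)
    entry.lock.release()
    return buggy, checked
```

Here `demo()` returns `(True, False)`: the unchecked variant claims the entry is safe to delete while a reader still holds it, which is the kind of outcome that would let an active dataset be evicted.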
To reproduce with PR #76 applied, use the image here: https://hub.docker.com/layers/vpipkt/rasterframes-notebook/0.9.0-SNAPSHOT/images/sha256-81a664df8882cae8058850a90acb6f247c8c17cb0403ed416dd71f16c26840db
Describe the bug
I am developing a job in RasterFrames that uses this library, and I (sometimes) receive several illegal argument messages, followed by a failed assertion error, a process crash, and a core dump. The job reads many JP2 (Sentinel-2 PDS) files in parallel. The files on which the error occurs are non-deterministic. I have tried reading the files mentioned in the errors on their own, and the errors do not occur outside the context of the larger job, which I believe means outside of running on many cores.
A sample of the illegal argument messages:
The assertion error is:
I suspect the error and process crash have been foretold in this code comment.
To Reproduce
Steps to reproduce the behavior:
docker pull s22s/rasterframes-notebook:0.9.0-RC1
docker run -p 8888:8888 -p 4040:4040 -v /home/ec2-user/:/home/jovyan/work s22s/rasterframes-notebook:0.9.0-RC1
$ python minimal.py
Expected behavior
I expect the job to complete with all JP2 files read without error, whether it reads one file or many files. As mentioned above, I have tried small single-file jobs on some of the files mentioned in the error messages. The job completes fine for these.
Environment
Core Files
I have captured a core dump that is 18GB. We'll have to figure out how to share it.