LGBM_BoosterPredictForMatSingleRow crashes the JVM with EXCEPTION_ACCESS_VIOLATION #80
Comments
Hi @jornd13, kudos for the complete hs_err log! There are some strange things I observe there:
java_class_path (initial): .;../l/deeplearning4j-core-1.0.0-M2.jar;../l/javax.activation-1.2.0.jar;../l/deeplearning4j-datasets-1.0.0-M2.jar;../l/deeplearning4j-datavec-iterators-1.0.0-M2.jar;../l/deeplearning4j-modelimport-1.0.0-M2.jar;../l/gson-2.8.0.jar;../l/hdf5-platform-1.12.1-1.5.7.jar;../l/hdf5-1.12.1-1.5.7.jar;../l/hdf5-1.12.1-1.5.7-windows-x86.jar;../l/hdf5-1.12.1-1.5.7-windows-x86_64.jar;../l/slf4j-api-1.7.21.jar;../l/deeplearning4j-nn-1.0.0-M2.jar;../l/deeplearning4j-utility-iterators-1.0.0-M2.jar;../l/nd4j-common-1.0.0-M2.jar;../l/guava-1.0.0-M2.jar;../l/fastutil-6.5.7.jar;../l/commons-io-2.7.jar;../l/commons-compress-1.21.jar;../l/nd4j-api-1.0.0-M2.jar;../l/byteunits-0.9.1.jar;../l/commons-collections4-4.1.jar;../l/flatbuffers-java-1.12.0.jar;../l/protobuf-1.0.0-M2.jar;../l/commons-net-3.1.jar;../l/neoitertools-1.0.0.jar;../l/commons-lang3-3.6.jar;../l/jackson-1.0.0-M2.jar;../l/datavec-api-1.0.0-M2.jar;../l/freemarker-2.3.23.jar;../l/stream-2.9.8.jar;../l/opencsv-2.3.jar;../l/t-digest-3.2.jar;../l/datavec-data-image-1.0.0-M2.jar;../l/jai-imageio-core-1.3.0.jar;../l/imageio-jpeg-3.1.1.jar;../l/imageio-core-3.1.1.jar;../l/imageio-metadata-3.1.1.jar;../l/common-lang-3.1.1.jar;../l/common-io-3.1.1.jar;../l/common-image-3.1.1.jar;../l/imageio-tiff-3.1.1.jar;../l/imageio-psd-3.1.1.jar;../l/imageio-bmp-3.1.1.jar;../l/javacv-1.5.7.jar;../l/openblas-0.3.19-1.5.7.jar;../l/ffmpeg-5.0-1.5.7.jar;../l/flycapture-2.13.3.31-1.5.7.jar;../l/libdc1394-2.2.6-1.5.7.jar;../l/libfreenect-0.5.7-1.5.7.jar;../l/libfreenect2-0.2.0-1.5.7.jar;../l/librealsense-1.12.4-1.5.7.jar;../l/librealsense2-2.50.0-1.5.7.jar;../l/videoinput-0.200-1.5.7.jar;../l/artoolkitplus-2.3.1-1.5.7.jar;../l/flandmark-1.07-1.5.7.jar;../l/leptonica-1.82.0-1.5.7.jar;../l/tesseract-5.0.1-1.5.7.jar;../l/openblas-platform-0.3.19-1.5.7.jar;../l/openblas-0.3.19-1.5.7-windows-x86.jar;../l/openblas-0.3.19-1.5.7-windows-x86_64.jar;../l/leptonica-platform-1.82.0-1.5.7.jar;../l/leptonica-1.82.0-1.5.7-windows-x86.jar;
It would be great if you could make a reproducer for this case that can be [semi-]publicly shared, so I can take a look.
Thank you! I dug in some more, and the issue revolves around this warning, which is logged before the JVM crashes: Somehow the wrong DLL gets loaded, or it is missing entirely. If this rings a bell, please let me know. I'm not even sure what triggers loading of this native linear algebra library; I don't think it is my Java code. Training runs just fine. Inference is invoked in this simple snippet: public double classifyInstance(Instance instance) throws Exception { The Java package com.github.fommil.netlib is not even required by metarank, is it? (It is not imported.) Even if I include it in my POM via ... I get the same error at runtime. It is an old library, by the way. There is a newer version, dev.ludovic.netlib, but it does not solve the issue either. The key to debugging this is knowing what requests that library at runtime in the first place. Apr 28, 2024 10:14:15 AM com.github.fommil.netlib.ARPACK I will keep searching nevertheless, but I'd appreciate any further hints! Thanks a lot.
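For the question above of where the netlib classes come from, one low-risk check is to ask the JVM which classpath entry supplies a given class. The following is only a minimal sketch: the class name ClassOrigin and the method locate are made up for illustration, and it reports where a class would be loaded from, not which caller triggers the load. It uses Class.forName with initialize=false so that no static initializer (and therefore no native library load) runs.

```java
import java.security.CodeSource;

/** Hypothetical helper: reports which jar or directory supplies a class. */
final class ClassOrigin {

    /** Returns the location of the class, or a short note if it cannot be determined. */
    static String locate(String className) {
        try {
            // initialize=false: resolve the class without running static initializers
            Class<?> c = Class.forName(className, false, ClassOrigin.class.getClassLoader());
            CodeSource src = c.getProtectionDomain().getCodeSource();
            // JDK core classes have no code source
            return src == null ? "bootstrap or unknown" : String.valueOf(src.getLocation());
        } catch (ClassNotFoundException | LinkageError e) {
            return "not on classpath";
        }
    }

    public static void main(String[] args) {
        // Class name taken from the log line quoted in the comment above
        System.out.println(locate("com.github.fommil.netlib.ARPACK"));
    }
}
```

If the printed location is one of the jars on the classpath, that jar bundles the netlib classes, which narrows down which dependency pulls them in.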
The problem does not seem to be related to the ARPACK libs; instead, it revolves around the MSVCP140.dll used at runtime (see below). I don't have experience with JNI, and it seems to be a compatibility issue, as you indicated. Is there any documentation on JNI usage for LightGBM and its requirements in terms of DLL versions, etc.?
A fatal error has been detected by the Java Runtime Environment:
EXCEPTION_ACCESS_VIOLATION (0xc0000005) at pc=0x000007feee273278, pid=40024, tid=0x000000000000b568
JRE version: Java(TM) SE Runtime Environment (8.0_151-b12) (build 1.8.0_151-b12)
(The same issue arises when using Java 11, as in the previous post.)
The main problem is that we don't build the win64 binaries for the |
Thank you. It's definitely extremely hard to get to the bottom of this. In the meantime I have busted a whole bunch of prior assumptions:
To me this means that the system/JVM state somehow changes in non-trivial ways between training and inference, such that the same call eventually crashes the JVM. Very odd indeed. Does any of this help you narrow it down further? Thanks a lot!
It turns out that this issue is completely unrelated to an incompatibility of native libs. Instead, I think I discovered a major issue with the lightgbm4j API: rather than throwing a useful exception, the entire JVM crashes when inference (predictForMatSingleRow(...)) is invoked on a closed booster (after booster.close()). @shuttie I'd suggest reworking that part of the API accordingly. It's hard to imagine that this has not been an issue for others so far; maybe they simply remained quiet about it.
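The kind of guard suggested above can be sketched as follows. This is an illustrative pattern only, not lightgbm4j's actual code or its eventual fix: GuardedModel, its handle field, and the stand-in predict body are all hypothetical, and no native library is involved. The point is that the wrapper checks a closed flag before touching the handle and throws an ordinary Java exception instead of passing a freed pointer to native code.

```java
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical stand-in for a wrapper that owns a native model handle. */
final class GuardedModel implements AutoCloseable {
    private final AtomicBoolean closed = new AtomicBoolean(false);
    private long handle; // stands in for a native pointer; 0 means released

    GuardedModel(long handle) {
        this.handle = handle;
    }

    /** Fails fast with a Java exception if the model was already closed. */
    double predict(float[] row) {
        ensureOpen();
        // A real wrapper would pass `handle` to native code here; this sketch sums the row.
        double sum = 0;
        for (float v : row) {
            sum += v;
        }
        return sum;
    }

    private void ensureOpen() {
        if (closed.get() || handle == 0) {
            throw new IllegalStateException("model is closed");
        }
    }

    /** Idempotent: a second close() is a no-op instead of a double free. */
    @Override
    public void close() {
        if (closed.compareAndSet(false, true)) {
            handle = 0; // a real wrapper would release the native resource here
        }
    }
}
```

Note that this check alone does not make concurrent use safe: another thread could still call close() between ensureOpen() and the native call, so a real wrapper would also need locking or a documented single-thread contract.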
Can you please make a clean reproducer for the issue?
I am not sure how to do that exactly. It's probably not necessary either, since you can take your own code snippet from the main wiki page and append two simple lines at the end, and it will crash on Windows, with no dependency on data or libraries. I have done this for you: LGBMDataset train = LGBMDataset.createFromFile("cancer.csv", "header=true label=name:Classification", null); for (int i=0; i<10; i++) { /** added lines to trigger inference in a trivial way */
So please try the new version, and report if it fixes your issue.
Thanks a lot! Yes, use-after-close is fixed now. Closing the issue.
Training works fine now, but the application crashes outside the JVM at inference time with EXCEPTION_ACCESS_VIOLATION (0xc0000005) when calling LGBM_BoosterPredictForMatSingleRow.
System/Java version:
JRE version: Java(TM) SE Runtime Environment (17.0.1+12) (build 17.0.1+12-LTS-39)
Java VM: Java HotSpot(TM) 64-Bit Server VM (17.0.1+12-LTS-39, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, g1 gc, windows-amd64)
Error:
EXCEPTION_ACCESS_VIOLATION (0xc0000005), data execution prevention violation at address 0x000000000000000f
Details (log attached: hs_err_pid55304.log):
--------------- T H R E A D ---------------
Current thread (0x000002d6fc0f5c90): JavaThread "Thread-4" [_thread_in_native, id=54640, stack(0x000000e7ab280000,0x000000e7ab300000)]
Stack: [0x000000e7ab280000,0x000000e7ab300000], sp=0x000000e7ab2fe138, free space=504k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
C 0x000000000000000f
Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
j com.microsoft.ml.lightgbm.lightgbmlibJNI.LGBM_BoosterPredictForMatSingleRow(JJIIIIIILjava/lang/String;JJ)I+0
j com.microsoft.ml.lightgbm.lightgbmlib.LGBM_BoosterPredictForMatSingleRow(Lcom/microsoft/ml/lightgbm/SWIGTYPE_p_void;Lcom/microsoft/ml/lightgbm/SWIGTYPE_p_void;IIIIIILjava/lang/String;Lcom/microsoft/ml/lightgbm/SWIGTYPE_p_long_long;Lcom/microsoft/ml/lightgbm/SWIGTYPE_p_double;)I+30
j io.github.metarank.lightgbm4j.LGBMBooster.predictForMatSingleRow([FLcom/microsoft/ml/lightgbm/PredictionType;)D+88
j weka.classifiers.functions.LGBMClassifier.classifyInstance(Lweka/core/Instance;)D+49
j knowledgeConnector.externalMLMethods.WekaClassifierAdapter.classifyInstance(Lweka/classifiers/Classifier;Lweka/core/Instance;ZI)D+107
j util.TrainEvaluation.evaluateClassificationQuality(Ljava/util/Vector;ZLweka/classifiers/Classifier;Lweka/core/Instances;IZZZ)Ljava/util/Map;+1232
j control.MFNController.evaluateClones(Ljava/util/List;Lweka/core/Instances;Ljava/util/Vector;ZZLjava/util/List;ZZ)[D+402
j control.MFNController.evaluateClones(Ljava/util/List;Lweka/core/Instances;Ljava/util/Vector;ZZLjava/util/List;Z)[D+13
j control.MFNController.generatePQReportWekaClassifier(Ljava/lang/String;Ljava/util/Vector;Ljava/util/Map;Ljava/lang/String;ZZZIZZLjava/util/List;)I+1710
j control.ParameterSpaceExplorationAgent.run()V+9109
v ~StubRoutines::call_stub
siginfo: EXCEPTION_ACCESS_VIOLATION (0xc0000005), data execution prevention violation at address 0x000000000000000f