LGBM_BoosterPredictForMatSingleRow crashes the JVM with EXCEPTION_ACCESS_VIOLATION #80

Closed
jornd13 opened this issue Apr 20, 2024 · 10 comments

Comments

@jornd13

jornd13 commented Apr 20, 2024

Training works fine now, but the application crashes outside the JVM at inference time with EXCEPTION_ACCESS_VIOLATION (0xc0000005) when calling LGBM_BoosterPredictForMatSingleRow.

System/Java version:

JRE version: Java(TM) SE Runtime Environment (17.0.1+12) (build 17.0.1+12-LTS-39)

Java VM: Java HotSpot(TM) 64-Bit Server VM (17.0.1+12-LTS-39, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, g1 gc, windows-amd64)

Error:
EXCEPTION_ACCESS_VIOLATION (0xc0000005), data execution prevention violation at address 0x000000000000000f

Details (log attached: hs_err_pid55304.log):

--------------- T H R E A D ---------------

Current thread (0x000002d6fc0f5c90): JavaThread "Thread-4" [_thread_in_native, id=54640, stack(0x000000e7ab280000,0x000000e7ab300000)]

Stack: [0x000000e7ab280000,0x000000e7ab300000], sp=0x000000e7ab2fe138, free space=504k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
C 0x000000000000000f

Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
j com.microsoft.ml.lightgbm.lightgbmlibJNI.LGBM_BoosterPredictForMatSingleRow(JJIIIIIILjava/lang/String;JJ)I+0
j com.microsoft.ml.lightgbm.lightgbmlib.LGBM_BoosterPredictForMatSingleRow(Lcom/microsoft/ml/lightgbm/SWIGTYPE_p_void;Lcom/microsoft/ml/lightgbm/SWIGTYPE_p_void;IIIIIILjava/lang/String;Lcom/microsoft/ml/lightgbm/SWIGTYPE_p_long_long;Lcom/microsoft/ml/lightgbm/SWIGTYPE_p_double;)I+30
j io.github.metarank.lightgbm4j.LGBMBooster.predictForMatSingleRow([FLcom/microsoft/ml/lightgbm/PredictionType;)D+88
j weka.classifiers.functions.LGBMClassifier.classifyInstance(Lweka/core/Instance;)D+49
j knowledgeConnector.externalMLMethods.WekaClassifierAdapter.classifyInstance(Lweka/classifiers/Classifier;Lweka/core/Instance;ZI)D+107
j util.TrainEvaluation.evaluateClassificationQuality(Ljava/util/Vector;ZLweka/classifiers/Classifier;Lweka/core/Instances;IZZZ)Ljava/util/Map;+1232
j control.MFNController.evaluateClones(Ljava/util/List;Lweka/core/Instances;Ljava/util/Vector;ZZLjava/util/List;ZZ)[D+402
j control.MFNController.evaluateClones(Ljava/util/List;Lweka/core/Instances;Ljava/util/Vector;ZZLjava/util/List;Z)[D+13
j control.MFNController.generatePQReportWekaClassifier(Ljava/lang/String;Ljava/util/Vector;Ljava/util/Map;Ljava/lang/String;ZZZIZZLjava/util/List;)I+1710
j control.ParameterSpaceExplorationAgent.run()V+9109
v ~StubRoutines::call_stub

siginfo: EXCEPTION_ACCESS_VIOLATION (0xc0000005), data execution prevention violation at address 0x000000000000000f

@shuttie
Contributor

shuttie commented Apr 22, 2024

Hi @jornd13, kudos for the complete hs_err log! I noticed a few strange things in it:

java_class_path (initial): .;../l/deeplearning4j-core-1.0.0-M2.jar;../l/javax.activation-1.2.0.jar;../l/deeplearning4j-datasets-1.0.0-M2.jar;../l/deeplearning4j-datavec-iterators-1.0.0-M2.jar;../l/deeplearning4j-modelimport-1.0.0-M2.jar;../l/gson-2.8.0.jar;../l/hdf5-platform-1.12.1-1.5.7.jar;../l/hdf5-1.12.1-1.5.7.jar;../l/hdf5-1.12.1-1.5.7-windows-x86.jar;../l/hdf5-1.12.1-1.5.7-windows-x86_64.jar;../l/slf4j-api-1.7.21.jar;../l/deeplearning4j-nn-1.0.0-M2.jar;../l/deeplearning4j-utility-iterators-1.0.0-M2.jar;../l/nd4j-common-1.0.0-M2.jar;../l/guava-1.0.0-M2.jar;../l/fastutil-6.5.7.jar;../l/commons-io-2.7.jar;../l/commons-compress-1.21.jar;../l/nd4j-api-1.0.0-M2.jar;../l/byteunits-0.9.1.jar;../l/commons-collections4-4.1.jar;../l/flatbuffers-java-1.12.0.jar;../l/protobuf-1.0.0-M2.jar;../l/commons-net-3.1.jar;../l/neoitertools-1.0.0.jar;../l/commons-lang3-3.6.jar;../l/jackson-1.0.0-M2.jar;../l/datavec-api-1.0.0-M2.jar;../l/freemarker-2.3.23.jar;../l/stream-2.9.8.jar;../l/opencsv-2.3.jar;../l/t-digest-3.2.jar;../l/datavec-data-image-1.0.0-M2.jar;../l/jai-imageio-core-1.3.0.jar;../l/imageio-jpeg-3.1.1.jar;../l/imageio-core-3.1.1.jar;../l/imageio-metadata-3.1.1.jar;../l/common-lang-3.1.1.jar;../l/common-io-3.1.1.jar;../l/common-image-3.1.1.jar;../l/imageio-tiff-3.1.1.jar;../l/imageio-psd-3.1.1.jar;../l/imageio-bmp-3.1.1.jar;../l/javacv-1.5.7.jar;../l/openblas-0.3.19-1.5.7.jar;../l/ffmpeg-5.0-1.5.7.jar;../l/flycapture-2.13.3.31-1.5.7.jar;../l/libdc1394-2.2.6-1.5.7.jar;../l/libfreenect-0.5.7-1.5.7.jar;../l/libfreenect2-0.2.0-1.5.7.jar;../l/librealsense-1.12.4-1.5.7.jar;../l/librealsense2-2.50.0-1.5.7.jar;../l/videoinput-0.200-1.5.7.jar;../l/artoolkitplus-2.3.1-1.5.7.jar;../l/flandmark-1.07-1.5.7.jar;../l/leptonica-1.82.0-1.5.7.jar;../l/tesseract-5.0.1-1.5.7.jar;../l/openblas-platform-0.3.19-1.5.7.jar;../l/openblas-0.3.19-1.5.7-windows-x86.jar;../l/openblas-0.3.19-1.5.7-windows-x86_64.jar;../l/leptonica-platform-1.82.0-1.5.7.jar;../l/leptonica-1.82.0-1.5.7-windows-x86.jar;

  • I have often seen such issues when multiple JNI libraries that use different msvcrt runtimes are loaded at the same time. One of yours looks a bit suspicious: 0x000000006b3c0000 - 0x000000006b993000 C:\Users\joern\AppData\Local\Temp\jniloader3739708092842027418netlib-native_ref-win-x86_64.dll

  • so I guess it's again related to one of your dependencies, but I can't say which one without access to the code.

It would be great if you could make a reproducer for this case that can be [semi-]publicly shared, so I can take a look.

@jornd13
Author

jornd13 commented Apr 28, 2024

Thank you! I dug in some more and the issue revolves around this warning that is thrown before the JVM crashes:
"WARNING: Failed to load implementation from: com.github.fommil.netlib.NativeSystemARPACK"

Somehow the wrong dll gets loaded or is missing entirely. If this rings a bell please let me know.

I'm not even sure what triggers loading of this native linalg library. It's not in my Java code I suppose. Training runs just fine. Inference is invoked in this simple snippet:

public double classifyInstance(Instance instance) throws Exception {
    int nFeat = instance.numAttributes() - 1;
    float[] input = new float[nFeat];
    for (int i = 0; i < nFeat; i++) {
        input[i] = (float) instance.value(i);
    }
    //double[] predArr = booster.predictForMat(input, 1, nFeat, true, PredictionType.C_API_PREDICT_NORMAL);
    double pred = booster.predictForMatSingleRow(input, PredictionType.C_API_PREDICT_NORMAL);
    //double pred = booster.predictForMatSingleRow(input);
    return pred;
}

The Java package com.github.fommil.netlib is not even required by metarank, is it? (not being imported)

Even if I include it in my POM via

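<dependency>
    <groupId>com.github.fommil.netlib</groupId>
    <artifactId>all</artifactId>
    <version>1.1.2</version>
    <type>pom</type>
</dependency>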

... I get the same error at runtime. It is an old library btw. There is a newer version by dev.ludovic.netlib, but it does not solve the issue either.

The key to debugging this is knowing what triggers loading of that library at runtime in the first place.

Apr 28, 2024 10:14:15 AM com.github.fommil.netlib.ARPACK
WARNING: Failed to load implementation from: com.github.fommil.netlib.NativeSystemARPACK
Apr 28, 2024 10:14:15 AM com.github.fommil.jni.JniLoader liberalLoad
INFO: successfully loaded C:\Users***\AppData\Local\Temp\jniloader8955788243747193699netlib-native_ref-win-x86_64.dll

I will keep searching nevertheless, but appreciate any further hints! Thanks a lot.

@jornd13
Author

jornd13 commented Apr 29, 2024

The problem does not seem to be related to the ARPACK libs, but revolves around the MSVCP140.dll being used at runtime (see below). I don't have experience with JNI and it seems to be some compatibility issue like you indicated. Is there any documentation on JNI usage for LightGBM and requirements in terms of dll versions etc.?

A fatal error has been detected by the Java Runtime Environment:

EXCEPTION_ACCESS_VIOLATION (0xc0000005) at pc=0x000007feee273278, pid=40024, tid=0x000000000000b568

JRE version: Java(TM) SE Runtime Environment (8.0_151-b12) (build 1.8.0_151-b12)
Java VM: Java HotSpot(TM) 64-Bit Server VM (25.151-b12 mixed mode windows-amd64 compressed oops)
Problematic frame:
C [MSVCP140.dll+0x13278]

(the same issue arises when using Java 11 like in the previous post)
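
When comparing several crash logs like the ones in this thread, it can help to pull out the native module named in the frame line automatically. The following is only a minimal sketch: the class name `ProblematicFrame` and the regular expression are illustrative assumptions, based solely on the frame format visible in the excerpts above (`C [MSVCP140.dll+0x13278]`), and it returns the first native frame line it finds rather than parsing the whole log.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper, not part of the JDK or lightgbm4j. */
final class ProblematicFrame {
    // Matches a native frame line such as "C [MSVCP140.dll+0x13278]",
    // optionally prefixed with the '#' used in the hs_err header.
    private static final Pattern FRAME = Pattern.compile(
            "^#?\\s*C\\s+\\[([^+\\]]+)\\+0x([0-9a-fA-F]+)\\]",
            Pattern.MULTILINE);

    /** Returns the module of the first native frame line, if any. */
    static Optional<String> module(String logText) {
        Matcher m = FRAME.matcher(logText);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }
}
```

For the Java 8 excerpt above this yields `MSVCP140.dll`. For the first log in this thread, where the frame is a bare address (`C 0x000000000000000f`), it yields an empty result, which is itself a hint that the crash site is not inside any loaded module.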

@shuttie
Contributor

shuttie commented Apr 30, 2024

The main problem is that we don't build the win64 binaries for lightgbm.dll ourselves, but just bundle the ones made by the upstream LightGBM project. To me it looks like some sort of compatibility issue between ARPACK and LightGBM dlls built for different versions of MSVCP.

@jornd13
Author

jornd13 commented May 3, 2024

Thank you. It's definitely extremely hard to get to the bottom of this. In the meantime I have busted a whole bunch of prior assumptions:

  • it is not related to the NativeSystemARPACK lib, netlib-native_ref-win-x86_64, or any of these dlls. Even when they are not present, it crashes at the same position (see the native libs extracted at runtime, marked *) below).
    -- the native dlls loaded at training time and at inference time are exactly the same, so that cannot be the issue.
  • the Java version is not the problem either (both Java 8 and 11 fail identically)
  • it is not even related to the particular method call booster.predictForMat(...); the same call works just fine when I invoke it at training time, earlier in the application flow.

To me it means that somehow the system/JVM state changes in non-trivial ways in between training and inference such that the same call then eventually crashes the JVM. Very odd indeed.

Does any of this help you to narrow down further? Thanks a lot!

*)
I:\Programme\Java\jdk1.8.0_151\jre\bin\zip.dll
I:\Programme\Java\jdk1.8.0_151\jre\bin\awt.dll
I:\Programme\Java\jdk1.8.0_151\jre\bin\fontmanager.dll
I:\Programme\Java\jdk1.8.0_151\jre\bin\net.dll
I:\Programme\Java\jdk1.8.0_151\jre\bin\nio.dll
I:\Programme\Java\jdk1.8.0_151\jre\bin\t2k.dll
C:\Users*\AppData\Local\Temp\lib_lightgbm.dll
C:\Users*\AppData\Local\Temp\lib_lightgbm_swig.dll

@jornd13
Author

jornd13 commented May 17, 2024

It turns out that this issue is completely unrelated to incompatibility of native libs.

Instead, I think I discovered a major issue with the lightgbm4j API. Instead of throwing a useful exception, the entire JVM crashes when trying to invoke inference (predictForMatSingleRow(...)) on a closed booster (booster.close()).

@shuttie I'd suggest reworking that part of the API accordingly. It's hard to imagine that this has not been an issue for others so far - maybe they simply remained quiet about it.
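
One common way to turn this kind of use-after-close into an ordinary Java exception is a closed flag that every native-backed method checks before delegating. The following is only a sketch of that general pattern, not lightgbm4j's actual code or the fix that was eventually shipped: `NativePredictor`, `GuardedPredictor` and `UseAfterCloseDemo` are hypothetical names, and the native side is replaced by a fake so the example runs without any native library.

```java
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical stand-in for a native-backed predictor (not a lightgbm4j type). */
interface NativePredictor {
    double predict(float[] row);
    void free();
}

/** Hypothetical wrapper that rejects calls made after close(). */
final class GuardedPredictor implements AutoCloseable {
    private final NativePredictor delegate;
    private final AtomicBoolean closed = new AtomicBoolean(false);

    GuardedPredictor(NativePredictor delegate) {
        this.delegate = delegate;
    }

    double predict(float[] row) {
        // Fail in Java before anything touches the freed native handle.
        if (closed.get()) {
            throw new IllegalStateException("predict() called on a closed predictor");
        }
        return delegate.predict(row);
    }

    @Override
    public void close() {
        // compareAndSet makes close() idempotent: free() runs at most once.
        if (closed.compareAndSet(false, true)) {
            delegate.free();
        }
    }
}

final class UseAfterCloseDemo {
    public static void main(String[] args) {
        // Fake "native" implementation, purely for illustration.
        NativePredictor fake = new NativePredictor() {
            public double predict(float[] row) { return 0.5; }
            public void free() { }
        };
        GuardedPredictor p = new GuardedPredictor(fake);
        System.out.println(p.predict(new float[2]));
        p.close();
        try {
            p.predict(new float[2]);
        } catch (IllegalStateException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Note that the flag check alone does not protect against one thread closing the predictor while another is in the middle of a prediction; a real implementation would need additional synchronization for that case.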

@shuttie
Contributor

shuttie commented May 17, 2024

Can you please make a clean reproducer for the issue? Ideally:

  • it can be shared in public, so I can reproduce the issue locally. For now you're the only one with access to the problematic code.
  • it has no dependencies on private packages and as few dependencies as possible overall. In a perfect case it should depend only on LightGBM4j.
  • it doesn't require any private datasets.

@jornd13
Author

jornd13 commented May 17, 2024

I am not sure how to do that exactly. It's probably not necessary either, since you can take your own snippet code from the main wikipage and append 2 simple lines at the end, and it will crash on Windows - no dependency on data or libraries. I have done this for you:

LGBMDataset train = LGBMDataset.createFromFile("cancer.csv", "header=true label=name:Classification", null);
LGBMDataset test = LGBMDataset.createFromFile("cancer-test.csv", "header=true label=name:Classification", train);
LGBMBooster booster = LGBMBooster.create(train, "objective=binary label=name:Classification");
booster.addValidData(test);

for (int i = 0; i < 10; i++) {
    booster.updateOneIter();
    double[] evalTrain = booster.getEval(0);
    double[] evalTest = booster.getEval(1);
    System.out.println("train: " + evalTrain[0] + " test: " + evalTest[0]);
}
booster.close();

/** added lines to trigger inference in a trivial way */
float[] input = new float[2];
// the booster is already closed at this point; this call crashes the JVM
double pred = booster.predictForMatSingleRow(input, PredictionType.C_API_PREDICT_NORMAL);
@shuttie
Contributor

shuttie commented Jun 3, 2024

@jornd13

So please try the new version, and report if it fixes your issue.

@jornd13
Author

jornd13 commented Jun 22, 2024

Thanks a lot! Yes, use-after-close is fixed now. Closing the issue.

@jornd13 jornd13 closed this as completed Jun 22, 2024