mmlspark lightgbm performs poorly when dealing with unbalanced label dataset compared with native lightgbm #1276
hi @chulminkw, can you please try:
Another problem that might occur only in the distributed case (not single node): if your minority class appears in only a single partition, performance may be worse than if it is spread across partitions. So it may be good to ensure the data is stratified by class across all partitions.
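The stratification idea above can be illustrated with a small sketch. This is plain Python rather than Spark (to keep it self-contained), and the helper name and toy data are illustrative, not from the thread: assigning rows round-robin within each class guarantees every partition sees minority examples.

```python
from collections import defaultdict

def stratified_partition(rows, n_partitions, label_of):
    """Assign rows to partitions round-robin within each class,
    so every partition gets roughly the same class mix."""
    partitions = [[] for _ in range(n_partitions)]
    counters = defaultdict(int)
    for row in rows:
        label = label_of(row)
        idx = counters[label] % n_partitions
        partitions[idx].append(row)
        counters[label] += 1
    return partitions

# Toy data: 96 majority rows (label 0) and 4 minority rows (label 1),
# with the minority rows clustered together in the input order.
rows = [(i, 0) for i in range(96)] + [(i, 1) for i in range(96, 100)]
parts = stratified_partition(rows, 4, label_of=lambda r: r[1])
print([sum(1 for r in p if r[1] == 1) for p in parts])  # [1, 1, 1, 1]
```

In Spark one would approximate the same effect by repartitioning on a key derived from the label plus a salt, but the exact mechanics depend on the cluster setup, so the sketch above only shows the intended distribution property.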
Hello Ilya Matiach,

First, the reason I didn't use the latest SynapseML, `com.microsoft.azure:synapseml_2.12:0.9.4`, is that it throws an error when instantiating `LightGBMClassifier`. I installed the library from those coordinates, and the error is:

```
java.lang.NoSuchMethodError: spray.json.BasicFormats.$init$(Lspray/json/BasicFormats;)V
```

It occurs as soon as I run `lgbm_classifier = LightGBMClassifier(featuresCol="features", labelCol="TARGET")`. The source code:

```python
vector_assembler = VectorAssembler(inputCols=santander_columns, outputCol="features")
lgbm_classifier = LightGBMClassifier(featuresCol="features", labelCol="TARGET")
lgbm_param_grid = ParamGridBuilder().addGrid(lgbm_classifier.learningRate, [0.01, 0.1]).build()
lgbm_cv = CrossValidator(estimator=lgbm_classifier, evaluator=roc_eval,
                         estimatorParamMaps=lgbm_param_grid, numFolds=3, seed=2021)
```

The traceback (abridged; the `__init__` parameter list is omitted):

```
Py4JJavaError Traceback (most recent call last)
/databricks/spark/python/pyspark/__init__.py in wrapper(self, *args, **kwargs)
/local_disk0/spark-941f58d5-9ccf-4ca7-835c-77d57fc48fab/userFiles-0ad410fd-d119-4877-b572-cc282b56aa35/addedFile5753353993912334267synapseml_lightgbm_2_12_0_9_4-a075c.jar/synapse/ml/lightgbm/LightGBMClassifier.py in __init__(self, java_obj, baggingFraction, baggingFreq, ..., weightCol, xgboostDartMode)
/databricks/spark/python/pyspark/ml/wrapper.py in _new_java_obj(java_class, *args)
/databricks/spark/python/lib/py4j-0.10.9.1-src.zip/py4j/java_gateway.py in __call__(self, *args)
/databricks/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
/databricks/spark/python/lib/py4j-0.10.9.1-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
Py4JJavaError: An error occurred while calling None.com.microsoft.azure.synapse.ml.lightgbm.LightGBMClassifier.
java.lang.NoSuchMethodError: spray.json.BasicFormats.$init$(Lspray/json/BasicFormats;)V
```

Second, I tried your recommendation with `com.microsoft.ml.spark:mmlspark_2.12:1.0.0-rc3-43-54379bf7-SNAPSHOT`. On the Kaggle credit card fraud dataset (the extremely unbalanced case), recall is extremely good (above 0.95) but precision is extremely bad (below 0.1), so the F1 score and ROC-AUC are lower than in the previous run with `isUnbalance=False`. On the Kaggle Santander dataset, `isUnbalance=True` doesn't show much improvement; it gave a ROC-AUC of 0.72.

I would like to test my code on the latest version with `useSingleDatasetMode=True` and `numThreads`. Can you please look into the error messages?
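An aside on the precision collapse reported above with `isUnbalance=True`: a common alternative to the unbalance flag is explicit positive-class weighting (native LightGBM exposes `scale_pos_weight`; the Spark API's parameter list in the traceback above includes a `weightCol` for per-row weights). The conventional weight is the negative/positive ratio. A minimal sketch of that arithmetic, with dataset counts taken from the public credit card fraud dataset (492 frauds out of 284,807 rows), not from this thread:

```python
def pos_weight(labels):
    """Conventional LightGBM heuristic: weight positives by the
    negative/positive ratio so both classes contribute equally."""
    pos = sum(1 for y in labels if y == 1)
    neg = len(labels) - pos
    if pos == 0:
        raise ValueError("no positive samples")
    return neg / pos

# ~0.17% fraud rate: 284,315 legitimate rows, 492 frauds.
labels = [0] * 284315 + [1] * 492
w = pos_weight(labels)
print(round(w, 1))  # 577.9
```

Tuning this weight down from the full ratio is one way to trade some recall back for precision instead of the all-or-nothing `isUnbalance` switch; whether the distributed trainer matches the native one at the same weight would still need to be verified.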
hi @chulminkw, it looks like the error is not in synapseml but in a dependency, the spray-json package. synapseml depends on spray-json 1.3.2: https://github.com/microsoft/SynapseML/blob/master/build.sbt#L32 Actually, I also see 1.3.4 referenced (line 2 in f00272e), but I'm not sure whether or how that is used.
I'm actually not sure why this is not working for you. I wonder if it's somehow picking up a bad version, or if something else is going on. Maybe you can try the latest coordinates (https://mmlspark.azureedge.net/maven) for the PR that was just recently merged. I actually tested that build on Databricks and it was working fine; no idea why it's failing for you.
Hello Ilya Matiach, thanks for your effort. `com.microsoft.azure:synapseml_2.12:0.9.4-16-ff2aa020-SNAPSHOT` is also not working for me. I will continue testing that build on another version of the Databricks cluster.
Also adding @mhamilton723: it looks like we may need a special upgrade to work with the latest Databricks/Spark version. I'm guessing our package is still on an older version, and that may be why it's not working with Databricks 10 ML (Spark 3.2.0).
Nothing is verified yet, though; I still need to try the latest version to see if I can reproduce the same error.
Thanks for the info. I'll also look into it. |
Hello,
I'm testing some models with mmlspark lightgbm on Databricks 10.0 ML (Spark 3.2, Scala 2.12).
The installed mmlspark version is com.microsoft.ml.spark:mmlspark_2.12:1.0.0-rc3-43-54379bf7-SNAPSHOT
I'm using credit card fraud dataset from kaggle ( https://www.kaggle.com/mlg-ulb/creditcardfraud )
The dataset's target label is heavily unbalanced: less than 1 percent of the rows are fraud (labeled 1), and more than 99 percent are labeled 0.
When I trained the dataset with native lightgbm packages, the performance was very good.
precision was 0.95, recall was 0.763, and ROC-AUC was 0.978.
I only used a few hyperparameters (by the way, with boost_from_average=True the performance was poor on this unbalanced dataset):
lgbm_clf = LGBMClassifier(n_estimators=1000, num_leaves=64, boost_from_average=False)
But when I use mmlspark lightgbm, the model performs poorly:
lgbm_classifier = LightGBMClassifier(featuresCol="features", labelCol="Class"
, numIterations=1000
, boostFromAverage=False)
lgbm_model = lgbm_classifier.fit(train_over_sdf_vectorized)
lgbm_predictions = lgbm_model.transform(test_sdf_vectorized)
precision: 0.0009, recall: 0.0189, roc_auc: 0.4895
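For context on how bad those numbers are: with the credit card dataset's base fraud rate of roughly 0.17% (492 frauds out of 284,807 rows), they are close to what a random scorer would produce. A toy calculation (my own illustrative numbers, not measurements from this thread) shows that a classifier flagging rows at random gets precision near the base rate and recall near its flagging rate:

```python
def precision_recall(tp, fp, fn):
    """precision = tp/(tp+fp); recall = tp/(tp+fn)."""
    return tp / (tp + fp), tp / (tp + fn)

# 492 frauds among 284,807 rows. A random classifier that flags ~2%
# of rows catches ~2% of frauds, and its precision collapses to
# roughly the base rate (~0.0017):
flagged = round(0.02 * 284807)   # ~5696 rows flagged
tp = round(0.02 * 492)           # ~10 true frauds among them
p, r = precision_recall(tp=tp, fp=flagged - tp, fn=492 - tp)
print(round(p, 4), round(r, 4))  # 0.0018 0.0203
```

That the observed metrics (and the ROC-AUC of 0.4895, versus 0.5 for pure chance) sit in this regime suggests the distributed model is learning essentially nothing about the minority class, rather than just learning it poorly.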
I've tested various hyperparameter settings with the hyperopt package for a couple of days, but the results were almost the same.
I also tested with another unbalanced dataset such as kaggle's santander dataset (https://www.kaggle.com/c/santander-customer-satisfaction)
The result was not as bad as with the credit card fraud dataset (this dataset is less unbalanced).
However, native Python lightgbm showed ROC-AUC 0.8448, while mmlspark lightgbm showed ROC-AUC 0.71.
It seems that mmlspark lightgbm suffers heavy performance degradation on heavily unbalanced datasets compared with native lightgbm.
If you have any recommendations to correct this, that would be a big help to me.
Best regards.