mmlspark lightgbm performs poorly when dealing with unbalanced label dataset compared with native lightgbm #1276

Closed
chulminkw opened this issue Nov 26, 2021 · 9 comments

chulminkw commented Nov 26, 2021

Hello,

I'm testing a model with mmlspark LightGBM on Databricks 10.0 ML (Spark 3.2, Scala 2.12).

The installed mmlspark version is com.microsoft.ml.spark:mmlspark_2.12:1.0.0-rc3-43-54379bf7-SNAPSHOT.

I'm using the credit card fraud dataset from Kaggle ( https://www.kaggle.com/mlg-ulb/creditcardfraud ).
The target label is heavily unbalanced: less than 1 percent of the rows are fraud (labeled 1) and more than 99 percent are labeled 0.

When I trained on the dataset with the native lightgbm package, performance was very good:
precision was 0.95, recall was 0.763, and ROC-AUC was 0.978.
I only set a few hyperparameters (by the way, with boost_from_average=True, performance on the unbalanced dataset was poor):

from lightgbm import LGBMClassifier
lgbm_clf = LGBMClassifier(n_estimators=1000, num_leaves=64, boost_from_average=False)

But when I use mmlspark LightGBM, the model performs poorly:

lgbm_classifier = LightGBMClassifier(
    featuresCol="features", labelCol="Class",
    numIterations=1000,
    boostFromAverage=False,
)

lgbm_model = lgbm_classifier.fit(train_over_sdf_vectorized)
lgbm_predictions = lgbm_model.transform(test_sdf_vectorized)

precision: 0.0009, recall: 0.0189, roc_auc: 0.4895

I tested various hyperparameter settings with the hyperopt package for a couple of days, but the results were almost the same.

I also tested another unbalanced dataset, Kaggle's Santander dataset (https://www.kaggle.com/c/santander-customer-satisfaction).
The result was not as bad as on the credit card fraud dataset (this dataset is less unbalanced).
Still, native Python lightgbm reached ROC-AUC 0.8448, while mmlspark LightGBM reached ROC-AUC 0.71.

It seems that mmlspark LightGBM degrades heavily on heavily unbalanced datasets compared with native lightgbm.

If you have any recommendation to correct this, it would be a big help to me.

Best regards.

@chulminkw changed the title from "mmlspark lightgbm performs poorly when dealing with unbalanced label dataset compared native lightgbm" to "mmlspark lightgbm performs poorly when dealing with unbalanced label dataset compared with native lightgbm" on Nov 26, 2021
@imatiach-msft
Contributor

Hi @chulminkw, can you please try:
1.) Using the very latest synapseml package. com.microsoft.ml.spark:mmlspark_2.12:1.0.0-rc3-43-54379bf7-SNAPSHOT is older than the new synapseml packages and still has the old mmlspark name.
2.) Setting these parameters:
isUnbalance=True
useSingleDatasetMode=True
numThreads=(number of cores per machine)-1
This combination currently gives the fastest performance and best accuracy in our benchmarking.
Please let me know if you don't see any improvement. I can also take a look and try to reproduce the issue.
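
The three settings above could be collected as in the sketch below. It assumes the SynapseML Python API (`synapse.ml.lightgbm.LightGBMClassifier`) and that the core count per machine is known. The helper name `recommended_lightgbm_params` is hypothetical, and the Spark call is left commented out because it needs a running cluster.

```python
def recommended_lightgbm_params(cores_per_machine: int) -> dict:
    """Build the keyword arguments suggested above (hypothetical helper)."""
    return {
        "isUnbalance": True,
        "useSingleDatasetMode": True,
        # Leave one core free; never go below one thread.
        "numThreads": max(1, cores_per_machine - 1),
    }


params = recommended_lightgbm_params(cores_per_machine=8)
print(params)

# On a cluster with SynapseML installed (assumption), these would be passed as:
# from synapse.ml.lightgbm import LightGBMClassifier
# clf = LightGBMClassifier(featuresCol="features", labelCol="Class", **params)
```

Keeping the settings in a plain dict makes it easy to log exactly which parameters a benchmark run used.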

@imatiach-msft
Contributor

Another problem can occur only in the distributed case, not on a single node: if your minority class appears in only a single partition, performance may be worse than if it is spread across the partitions. It may therefore help to ensure the data is stratified by class across all partitions.
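
To illustrate what "stratified by class across all partitions" means, here is a minimal pure-Python sketch that assigns rows to partitions round-robin within each class. It is not Spark code: the function name `stratified_partition_ids` is hypothetical, and turning these ids into an actual Spark partitioning is a separate step that is not shown.

```python
from collections import Counter, defaultdict


def stratified_partition_ids(labels, num_partitions):
    """Assign each row a partition id, round-robin within each class."""
    next_slot = defaultdict(int)  # per-class running counter
    ids = []
    for label in labels:
        ids.append(next_slot[label] % num_partitions)
        next_slot[label] += 1
    return ids


# Toy data: 96 negatives and 4 positives, spread over 4 partitions.
labels = [0] * 96 + [1] * 4
ids = stratified_partition_ids(labels, num_partitions=4)

# Count how many positive rows land in each partition.
positives_per_partition = Counter(pid for pid, y in zip(ids, labels) if y == 1)
print(dict(positives_per_partition))
```

With this assignment every partition receives the same share of the minority class, instead of all positives ending up wherever they happened to sit in the input.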

@chulminkw
Author

chulminkw commented Dec 6, 2021

Hello Ilya Matiach,
Thanks for your reply. I didn't notice it until yesterday; sorry for the late update.

First, the reason I didn't use the latest SynapseML, com.microsoft.azure:synapseml_2.12:0.9.4, is that it throws an error when initializing LightGBMClassifier. I installed the library with the coordinates com.microsoft.azure:synapseml_2.12:0.9.4
on Databricks Community Edition with 10.0 ML (Apache Spark 3.2.0, Scala 2.12).

The error is "java.lang.NoSuchMethodError: spray.json.BasicFormats.$init$(Lspray/json/BasicFormats;)" and it occurs when I run:

lgbm_classifier = LightGBMClassifier(
    featuresCol="features", labelCol="TARGET",
    numIterations=50,
    boostFromAverage=False,
)

The source code and the error messages follow.
==================================>
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.evaluation import BinaryClassificationEvaluator, MulticlassClassificationEvaluator
from synapse.ml.lightgbm import LightGBMClassifier

vector_assembler = VectorAssembler(inputCols=santander_columns, outputCol="features")

lgbm_classifier = LightGBMClassifier(
    featuresCol="features",
    labelCol="TARGET",
    numIterations=50,
    boostFromAverage=False,
)

lgbm_param_grid = ParamGridBuilder().addGrid(lgbm_classifier.learningRate, [0.01, 0.1]).build()
roc_eval = BinaryClassificationEvaluator(labelCol='TARGET', metricName='areaUnderROC' )

lgbm_cv = CrossValidator(estimator=lgbm_classifier, evaluator=roc_eval, estimatorParamMaps=lgbm_param_grid, numFolds=3, seed=2021)

===============================================>

java.lang.NoSuchMethodError: spray.json.BasicFormats.$init$(Lspray/json/BasicFormats;)V

Py4JJavaError Traceback (most recent call last)
in
9 vector_assembler = VectorAssembler(inputCols=santander_columns, outputCol="features")
10
---> 11 lgbm_classifier = LightGBMClassifier(featuresCol="features", labelCol="TARGET"
12 , numIterations=50
13 ,isUnbalance=True

/databricks/spark/python/pyspark/__init__.py in wrapper(self, *args, **kwargs)
112 raise TypeError("Method %s forces keyword arguments." % func.__name__)
113 self._input_kwargs = kwargs
--> 114 return func(self, **kwargs)
115 return wrapper
116

/local_disk0/spark-941f58d5-9ccf-4ca7-835c-77d57fc48fab/userFiles-0ad410fd-d119-4877-b572-cc282b56aa35/addedFile5753353993912334267synapseml_lightgbm_2_12_0_9_4-a075c.jar/synapse/ml/lightgbm/LightGBMClassifier.py in __init__(self, java_obj, baggingFraction, baggingFreq, baggingSeed, binSampleCount, boostFromAverage, boostingType, categoricalSlotIndexes, categoricalSlotNames, chunkSize, defaultListenPort, driverListenPort, dropRate, earlyStoppingRound, featureFraction, featuresCol, featuresShapCol, fobj, improvementTolerance, initScoreCol, isProvideTrainingMetric, isUnbalance, labelCol, lambdaL1, lambdaL2, leafPredictionCol, learningRate, matrixType, maxBin, maxBinByFeature, maxDeltaStep, maxDepth, maxDrop, metric, minDataInLeaf, minGainToSplit, minSumHessianInLeaf, modelString, negBaggingFraction, numBatches, numIterations, numLeaves, numTasks, numThreads, objective, parallelism, posBaggingFraction, predictionCol, probabilityCol, rawPredictionCol, repartitionByGroupingColumn, skipDrop, slotNames, thresholds, timeout, topK, uniformDrop, useBarrierExecutionMode, useSingleDatasetMode, validationIndicatorCol, verbosity, weightCol, xgboostDartMode)
283 super(LightGBMClassifier, self).__init__()
284 if java_obj is None:
--> 285 self._java_obj = self._new_java_obj("com.microsoft.azure.synapse.ml.lightgbm.LightGBMClassifier", self.uid)
286 else:
287 self._java_obj = java_obj

/databricks/spark/python/pyspark/ml/wrapper.py in _new_java_obj(java_class, *args)
64 java_obj = getattr(java_obj, name)
65 java_args = [_py2java(sc, arg) for arg in args]
---> 66 return java_obj(*java_args)
67
68 @staticmethod

/databricks/spark/python/lib/py4j-0.10.9.1-src.zip/py4j/java_gateway.py in __call__(self, *args)
1566
1567 answer = self._gateway_client.send_command(command)
-> 1568 return_value = get_return_value(
1569 answer, self._gateway_client, None, self._fqn)
1570

/databricks/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
115 def deco(*a, **kw):
116 try:
--> 117 return f(*a, **kw)
118 except py4j.protocol.Py4JJavaError as e:
119 converted = convert_exception(e.java_exception)

/databricks/spark/python/lib/py4j-0.10.9.1-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
324 value = OUTPUT_CONVERTER[type](answer[2:], gateway_client)
325 if answer[1] == REFERENCE_TYPE:
--> 326 raise Py4JJavaError(
327 "An error occurred while calling {0}{1}{2}.\n".
328 format(target_id, ".", name), value)

Py4JJavaError: An error occurred while calling None.com.microsoft.azure.synapse.ml.lightgbm.LightGBMClassifier.
: java.lang.NoSuchMethodError: spray.json.BasicFormats.$init$(Lspray/json/BasicFormats;)V
at com.microsoft.azure.synapse.ml.logging.LogJsonProtocol$.<init>(BasicLogging.scala:18)
at com.microsoft.azure.synapse.ml.logging.LogJsonProtocol$.<clinit>(BasicLogging.scala)
at com.microsoft.azure.synapse.ml.logging.BasicLogging.logBase(BasicLogging.scala:31)
at com.microsoft.azure.synapse.ml.logging.BasicLogging.logBase$(BasicLogging.scala:30)
at com.microsoft.azure.synapse.ml.lightgbm.LightGBMClassifier.logBase(LightGBMClassifier.scala:26)
at com.microsoft.azure.synapse.ml.logging.BasicLogging.logClass(BasicLogging.scala:41)
at com.microsoft.azure.synapse.ml.logging.BasicLogging.logClass$(BasicLogging.scala:40)
at com.microsoft.azure.synapse.ml.lightgbm.LightGBMClassifier.logClass(LightGBMClassifier.scala:26)
at com.microsoft.azure.synapse.ml.lightgbm.LightGBMClassifier.<init>(LightGBMClassifier.scala:29)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:380)
at py4j.Gateway.invoke(Gateway.java:250)
at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
at py4j.GatewayConnection.run(GatewayConnection.java:251)
at java.lang.Thread.run(Thread.java:748)

Second, I tried your recommendation with com.microsoft.ml.spark:mmlspark_2.12:1.0.0-rc3-43-54379bf7-SNAPSHOT,
but this version has no useSingleDatasetMode or numThreads parameters. Maybe I should move to the latest version. For now, I tested my code with only isUnbalance=True set.

On the Kaggle credit card fraud dataset, the extremely unbalanced case, recall is extremely good (more than 0.95) but precision is extremely bad (lower than 0.1). So the F1 score and ROC-AUC are lower than in the previous case with isUnbalance=False.
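
As a small worked example of why the F1 score drops despite the high recall, the sketch below computes F1 as the harmonic mean of precision and recall. The second call uses the bounds quoted above (precision 0.1, recall 0.95) as illustrative values, not as measured results; the first uses the native lightgbm numbers reported at the start of this issue.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Native lightgbm run reported earlier in this issue.
native_f1 = round(f1_score(0.95, 0.763), 3)
# isUnbalance=True run, using the quoted bounds as illustrative values.
unbalanced_f1 = round(f1_score(0.1, 0.95), 3)

print(native_f1, unbalanced_f1)
```

Because the harmonic mean is dominated by the smaller of the two values, a precision near 0.1 keeps F1 low no matter how high recall gets.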

On the Kaggle Santander dataset, isUnbalance=True doesn't show much improvement: ROC-AUC was 0.72.

I would like to test my code on the latest version with useSingleDatasetMode=True and numThreads set.

Can you please look into the error messages?

@imatiach-msft
Contributor

Hi @chulminkw, it looks like the error is not in synapseml itself but in a dependency, the spray-json package:
https://github.com/spray/spray-json

synapseml depends on spray-json 1.3.2:

https://github.com/microsoft/SynapseML/blob/master/build.sbt#L32

Actually, I also see 1.3.4 here, but I'm not sure whether or how it is used:

"io.spray" %% "spray-json" % "1.3.4",

I'm not sure why this is not working for you. I wonder if it's somehow picking up a bad version, or if something else is going on.
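
A `NoSuchMethodError` on a trait's `$init$` method is often a sign that a library built for a different Scala binary version was loaded, so one way to probe the "bad version" guess is to compare the Scala suffix of the jars on the cluster with the `_2.12` in the synapseml coordinates. The sketch below assumes the usual `artifact_<scalaVersion>-<version>.jar` naming that sbt's `%%` produces. The jar names in the example are hypothetical, and since Databricks appears to rename uploaded jars (see the path in the traceback), the pattern would likely need adjusting there.

```python
import re

# Matches e.g. "spray-json_2.12-1.3.2.jar": artifact, Scala suffix, version.
_JAR_RE = re.compile(r"^(?P<artifact>.+)_(?P<scala>\d+\.\d+)-(?P<version>.+)\.jar$")


def scala_suffix_mismatches(jar_names, expected_scala="2.12"):
    """Return jars whose Scala binary suffix differs from the expected one."""
    mismatches = []
    for name in jar_names:
        match = _JAR_RE.match(name)
        # Jars without a Scala suffix (plain Java libraries) are skipped.
        if match and match.group("scala") != expected_scala:
            mismatches.append(name)
    return mismatches


# Hypothetical classpath listing, for illustration only.
jars = [
    "synapseml-lightgbm_2.12-0.9.4.jar",
    "spray-json_2.11-1.3.2.jar",
    "py4j-0.10.9.1.jar",
]
print(scala_suffix_mismatches(jars))
```

An empty result would not rule out a conflict (two jars with the same suffix but different versions can still clash), so this is only a first check.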

Maybe you can try the latest coordinates:
com.microsoft.azure:synapseml_2.12:0.9.4-16-ff2aa020-SNAPSHOT

https://mmlspark.azureedge.net/maven

This build is from a PR that was just recently merged:
#1282

I tested that build on Databricks and it was working fine. No idea why it's failing for you.

@chulminkw
Author

Hello Ilya Matiach, thanks for your effort.

com.microsoft.azure:synapseml_2.12:0.9.4-16-ff2aa020-SNAPSHOT is also not working for me.

I will continue testing that build on another version of the Databricks cluster.
By the way, can you please tell me which Databricks cluster version you tested on?

@chulminkw chulminkw reopened this Dec 7, 2021
@imatiach-msft
Contributor

I was just recently using this on that build for benchmarking:

[image: screenshot not preserved]

Hmm, I guess that is an older version of Databricks runtime. Maybe I need to try the very latest to repro this issue.

@imatiach-msft
Contributor

Also adding @mhamilton723. It looks like we may need a special upgrade to work with the latest Databricks/Spark version. I'm guessing our package is still on an older version, and that may be why it's not working with Databricks 10.0 ML (Spark 3.2.0).

@imatiach-msft
Contributor

Nothing verified though. I need to try out the latest version still to see if I repro the same error.

@chulminkw
Author

Thanks for the info.

I'll also look into it.
