
Difference in accuracy between distributed scala.spark.xgboost4j and single-node xgboost #5977

Closed
PivovarA opened this issue Aug 4, 2020 · 7 comments



PivovarA commented Aug 4, 2020

I noticed a difference in accuracy between the distributed implementation of xgboost and the single-node one.
Moreover, the difference in accuracy grows as the number of estimators increases.
I would like to clarify whether this behavior is expected and what the root cause could be.
I've prepared a simple reproducer just to show the difference and how the accuracy changes.

Steps to reproduce

>>> import pandas as pd
>>> pd.read_csv("/Users/aleksandr/Documents/xgboost_bench/boston.csv")
     Unnamed: 0     CRIM    ZN  INDUS  CHAS    NOX     RM   AGE     DIS  RAD    TAX  PTRATIO       B  LSTAT  target
0             0  0.00632  18.0   2.31   0.0  0.538  6.575  65.2  4.0900  1.0  296.0     15.3  396.90   4.98    24.0
1             1  0.02731   0.0   7.07   0.0  0.469  6.421  78.9  4.9671  2.0  242.0     17.8  396.90   9.14    21.6
2             2  0.02729   0.0   7.07   0.0  0.469  7.185  61.1  4.9671  2.0  242.0     17.8  392.83   4.03    34.7
3             3  0.03237   0.0   2.18   0.0  0.458  6.998  45.8  6.0622  3.0  222.0     18.7  394.63   2.94    33.4
4             4  0.06905   0.0   2.18   0.0  0.458  7.147  54.2  6.0622  3.0  222.0     18.7  396.90   5.33    36.2
..          ...      ...   ...    ...   ...    ...    ...   ...     ...  ...    ...      ...     ...    ...     ...
501         501  0.06263   0.0  11.93   0.0  0.573  6.593  69.1  2.4786  1.0  273.0     21.0  391.99   9.67    22.4
502         502  0.04527   0.0  11.93   0.0  0.573  6.120  76.7  2.2875  1.0  273.0     21.0  396.90   9.08    20.6
503         503  0.06076   0.0  11.93   0.0  0.573  6.976  91.0  2.1675  1.0  273.0     21.0  396.90   5.64    23.9
504         504  0.10959   0.0  11.93   0.0  0.573  6.794  89.3  2.3889  1.0  273.0     21.0  393.45   6.48    22.0
505         505  0.04741   0.0  11.93   0.0  0.573  6.030  80.8  2.5050  1.0  273.0     21.0  396.90   7.88    11.9

[506 rows x 15 columns]
>>> df = pd.read_csv("boston.csv")
>>> X = df[df.columns[:-1]]
>>> y = df["target"]
>>> import xgboost as xgb
>>> from xgboost.sklearn import XGBRegressor
>>> model = XGBRegressor(n_estimators=20, max_depth=10, tree_method="hist")
>>> model.fit(X, y)
XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
             importance_type='gain', interaction_constraints='',
             learning_rate=0.300000012, max_delta_step=0, max_depth=10,
             min_child_weight=1, missing=nan, monotone_constraints='()',
             n_estimators=20, n_jobs=0, num_parallel_tree=1,
             objective='reg:squarederror', random_state=0, reg_alpha=0,
             reg_lambda=1, scale_pos_weight=1, subsample=1, tree_method='hist',
             validate_parameters=1, verbosity=None)
>>> from sklearn.metrics import mean_squared_error
>>> prediction = model.predict(X)
>>> mean_squared_error(y, prediction)
0.08756953747034339

I am using the simple Boston dataset. It can be loaded via

from sklearn.datasets import load_boston
X, y = load_boston(return_X_y=True)

And on Scala (the session below assumes the standard imports for SparkSession, org.apache.spark.sql.functions.col, org.apache.spark.sql.types.DoubleType, VectorAssembler, and ml.dmlc.xgboost4j.scala.spark.XGBoostRegressor):

scala>   val spark = SparkSession.builder().getOrCreate()
spark: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@6c1e2161

scala>   import spark.implicits._
import spark.implicits._

scala>   val df = spark.read.format("csv").option("header", "true").load("file://boston.csv")
df: org.apache.spark.sql.DataFrame = [_c0: string, CRIM: string ... 13 more fields]

scala>   df.show()
CSV file: file://boston.csv
+---+-------+----+-----+----+-----+-----+-----+------+---+-----+-------+------+-----+------+
|_c0|   CRIM|  ZN|INDUS|CHAS|  NOX|   RM|  AGE|   DIS|RAD|  TAX|PTRATIO|     B|LSTAT|target|
+---+-------+----+-----+----+-----+-----+-----+------+---+-----+-------+------+-----+------+
|  0|0.00632|18.0| 2.31| 0.0|0.538|6.575| 65.2|  4.09|1.0|296.0|   15.3| 396.9| 4.98|  24.0|
|  1|0.02731| 0.0| 7.07| 0.0|0.469|6.421| 78.9|4.9671|2.0|242.0|   17.8| 396.9| 9.14|  21.6|
|  2|0.02729| 0.0| 7.07| 0.0|0.469|7.185| 61.1|4.9671|2.0|242.0|   17.8|392.83| 4.03|  34.7|
|  3|0.03237| 0.0| 2.18| 0.0|0.458|6.998| 45.8|6.0622|3.0|222.0|   18.7|394.63| 2.94|  33.4|
|  4|0.06905| 0.0| 2.18| 0.0|0.458|7.147| 54.2|6.0622|3.0|222.0|   18.7| 396.9| 5.33|  36.2|
|  5|0.02985| 0.0| 2.18| 0.0|0.458| 6.43| 58.7|6.0622|3.0|222.0|   18.7|394.12| 5.21|  28.7|
|  6|0.08829|12.5| 7.87| 0.0|0.524|6.012| 66.6|5.5605|5.0|311.0|   15.2| 395.6|12.43|  22.9|
|  7|0.14455|12.5| 7.87| 0.0|0.524|6.172| 96.1|5.9505|5.0|311.0|   15.2| 396.9|19.15|  27.1|
|  8|0.21124|12.5| 7.87| 0.0|0.524|5.631|100.0|6.0821|5.0|311.0|   15.2|386.63|29.93|  16.5|
|  9|0.17004|12.5| 7.87| 0.0|0.524|6.004| 85.9|6.5921|5.0|311.0|   15.2|386.71| 17.1|  18.9|
| 10|0.22489|12.5| 7.87| 0.0|0.524|6.377| 94.3|6.3467|5.0|311.0|   15.2|392.52|20.45|  15.0|
| 11|0.11747|12.5| 7.87| 0.0|0.524|6.009| 82.9|6.2267|5.0|311.0|   15.2| 396.9|13.27|  18.9|
| 12|0.09378|12.5| 7.87| 0.0|0.524|5.889| 39.0|5.4509|5.0|311.0|   15.2| 390.5|15.71|  21.7|
| 13|0.62976| 0.0| 8.14| 0.0|0.538|5.949| 61.8|4.7075|4.0|307.0|   21.0| 396.9| 8.26|  20.4|
| 14|0.63796| 0.0| 8.14| 0.0|0.538|6.096| 84.5|4.4619|4.0|307.0|   21.0|380.02|10.26|  18.2|
| 15|0.62739| 0.0| 8.14| 0.0|0.538|5.834| 56.5|4.4986|4.0|307.0|   21.0|395.62| 8.47|  19.9|
| 16|1.05393| 0.0| 8.14| 0.0|0.538|5.935| 29.3|4.4986|4.0|307.0|   21.0|386.85| 6.58|  23.1|
| 17| 0.7842| 0.0| 8.14| 0.0|0.538| 5.99| 81.7|4.2579|4.0|307.0|   21.0|386.75|14.67|  17.5|
| 18|0.80271| 0.0| 8.14| 0.0|0.538|5.456| 36.6|3.7965|4.0|307.0|   21.0|288.99|11.69|  20.2|
| 19| 0.7258| 0.0| 8.14| 0.0|0.538|5.727| 69.5|3.7965|4.0|307.0|   21.0|390.95|11.28|  18.2|
+---+-------+----+-----+----+-----+-----+-----+------+---+-----+-------+------+-----+------+
only showing top 20 rows

scala>   val result = df.select(df.columns.map(c=> col(c).cast(DoubleType)): _*).drop("_c0")
result: org.apache.spark.sql.DataFrame = [CRIM: double, ZN: double ... 12 more fields]

scala>   val nFeatures = result.columns.size - 1
nFeatures: Int = 13

scala>   val vectorAssembler = new VectorAssembler().setInputCols(result.columns.take(nFeatures)).setOutputCol("features")
vectorAssembler: org.apache.spark.ml.feature.VectorAssembler = vecAssembler_98e500861f3f

scala>   val xgbInput = vectorAssembler.transform(result).select("features", result.columns.last)
xgbInput: org.apache.spark.sql.DataFrame = [features: vector, target: double]

scala> println(xgbInput.select("features").first)
[[0.00632,18.0,2.31,0.0,0.538,6.575,65.2,4.09,1.0,296.0,15.3,396.9,4.98]]

scala>   val xgbParam = Map(
     |     "tree_method" -> "hist",
     |      "max_depth" -> 10,
     |     "num_round" -> 20,
     |     "num_workers" -> 1
     |   )
xgbParam: scala.collection.immutable.Map[String,Any] = Map(tree_method -> hist, max_depth -> 10, num_round -> 20, num_workers -> 1)

scala> val xgbRegressor = new XGBoostRegressor(xgbParam).setFeaturesCol("features").setLabelCol(result.columns.last)
xgbRegressor: ml.dmlc.xgboost4j.scala.spark.XGBoostRegressor = xgbr_1a5c56ecd942

scala> val xgbRegressionModel = xgbRegressor.fit(xgbInput)
xgbRegressionModel: ml.dmlc.xgboost4j.scala.spark.XGBoostRegressionModel = xgbr_1a5c56ecd942

scala> val res = xgbRegressionModel.transform(xgbInput)
res: org.apache.spark.sql.DataFrame = [features: vector, target: double ... 1 more field]

scala> val res_rdd = res.select("target", "prediction").rdd
res_rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[41] at rdd at <console>:41

scala> val valuesAndPreds = res_rdd.map { point =>
     | val array = point.toSeq.toArray.map(_.asInstanceOf[Double])
     | (array(1), array(0))
     | }
valuesAndPreds: org.apache.spark.rdd.RDD[(Double, Double)] = MapPartitionsRDD[42] at map at <console>:41

scala> import org.apache.spark.mllib.evaluation.RegressionMetrics
import org.apache.spark.mllib.evaluation.RegressionMetrics

scala> val metrics = new RegressionMetrics(valuesAndPreds)
metrics: org.apache.spark.mllib.evaluation.RegressionMetrics = org.apache.spark.mllib.evaluation.RegressionMetrics@9992404

scala> metrics.meanSquaredError
res3: Double = 0.11138415118383496

If I increase the number of estimators to 30, I receive the following results:
0.022200668623300408 for distributed
0.009170220935654705 for single-node

I also tried saving the prediction result from Scala and reading it in Python for a more direct comparison. The result is the same.
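
For reference, a minimal sketch (not from the original report) of one way to dump the Spark predictions to CSV for such an offline comparison; the output path is just an example:

// Illustrative only: write target/prediction pairs out as a single CSV
// so they can be loaded with pandas and compared against the single-node model.
res.select("target", "prediction")
  .coalesce(1)
  .write
  .option("header", "true")
  .csv("file:///tmp/spark_predictions")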

In a real project we have about 6000 estimators, and in that case we get a difference of more than 10x.

Environment info

XGBoost version: 1.1.0
git clone --recursive https://github.com/dmlc/xgboost
git checkout release_1.1.0
cd xgboost/jvm-packages/
mvn clean -DskipTests install package

@PivovarA (Author) commented:

Are there any updates here? Maybe I should provide some additional information?

@trivialfis (Member) commented:

There are some floating point differences that cannot be prevented. Take sketching: we need to merge sketches from different workers, which can generate additional differences. The same goes for reducing histograms across workers, which generates floating point error due to the non-associativity of floating point summation. Right now we test these kinds of things with a tolerance. For a large cluster I think the floating point difference can be larger. With dask now being supported, we are adding more and more tests for it, but so far no obvious bug has been found.
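
As an illustration (not from the thread), a small Scala sketch of the non-associativity mentioned above: summing the same values in a different per-worker order can give a slightly different total, which is one source of divergence when histograms are reduced across workers.

// Illustrative only: floating point addition is not associative, so the order
// in which per-worker partial sums are combined changes the result slightly.
val grads = Array(1e16, 3.14, -1e16, 2.71, 1e-3)

// One worker summing left to right.
val singleNode = grads.foldLeft(0.0)(_ + _)

// Two "workers" each summing a partition, then combining the partial sums.
val (left, right) = grads.splitAt(2)
val distributed = left.sum + right.sum

println(singleNode == distributed)          // false
println(math.abs(singleNode - distributed)) // non-zero difference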

@FelixYBW (Contributor) commented Aug 27, 2020:

I did some tests; here are the results:

  • The default values of max_bin and tree_method are different. Single node is 256 and "tree", Spark is 16 and "auto". @trivialfis @CodingCat are these defaults on purpose? (See the sketch after the parameter list below for aligning max_bin explicitly.)
  • Using the same parameters and making sure the dataset row order is the same, single-node, single-rank Spark, and single-rank dmlc_submit generate exactly the same model. The accuracy is 0.08576235795386919 in all cases.
  • With two ranks, how the dataset is split impacts the accuracy a lot. How the data is ordered also matters. In my tests I got:
    -- 0.11525561936939305 (253 rows and 253 rows, random order)
    -- 0.09395430881665474 (253 rows and 253 rows, ordered by index)
    -- 0.10249318660825679 (252 rows and 254 rows)
    -- 0.08658100097273445 (254 rows and 252 rows)
  • Using 2 ranks, I still can't get exactly the same model between dmlc_submit and Spark, even though I made sure the input DMatrix is the same. It looks like some configs are still different.
    -- Spark: 0.08658100097273445
    -- dmlc-submit: 0.07866148830213669

The parameters I used:
param = {
    'max_depth': 10,
    'tree_method': 'hist',
    'n_jobs': 1,
    'max_bin': 256,
    'verbosity': 1,
    'skip_drop': 0,
    'normalize_type': 'tree'
}
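
In the same spirit, a hedged Scala sketch of working around the default mismatch by setting max_bin explicitly on the Spark side; whether the map-based constructor accepts the "max_bin" key (as opposed to the estimator's maxBins param) may depend on the xgboost4j-spark version, so treat this as an assumption to verify:

// Assumption: the param map accepts "max_bin"; otherwise set the estimator's
// maxBins param directly. 256 matches the single-node default reported above.
val alignedParam = Map(
  "tree_method" -> "hist",
  "max_depth"   -> 10,
  "max_bin"     -> 256,
  "num_round"   -> 20,
  "num_workers" -> 1
)
val alignedRegressor = new XGBoostRegressor(alignedParam)
  .setFeaturesCol("features")
  .setLabelCol("target")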

@trivialfis (Member) commented:

I'm sure 16 bins is not an optimal parameter ...

@PivovarA (Author) commented:

Different default values of parameters between languages and modes can lead to serious problems: an unpredictable increase in model size and different accuracy, as in this case.
Are there any good reasons why these parameters cannot be aligned?

@FelixYBW (Contributor) commented:

I compared all params. Spark doesn't support all the params the C implementation has, but those it does support all have the same default value except max_bin.
tree_method defaults to auto in both cases.

Created PR #6066 to fix max_bin.

@hcho3 (Collaborator) commented Sep 8, 2020:

The default for the parameter maxBins was revised in Spark to be consistent with C++: #6066

@hcho3 closed this as completed Sep 8, 2020