[jvm-packages] XGBoost Spark training quite slow - Good practices #4774

kumarprabhu1988 · 2019-08-14T20:56:26Z

Hello!

I posted a question about some OOM errors I was facing with training here: https://discuss.xgboost.ai/t/xgboost4j-spark-fails-with-oom-errors/1054. Thankfully, I was able to resolve these issues and found a configuration that works. However, it takes ~8 min for 1 round. I'm looking for tips to speed this up.

My data is ~120GB before transforming into a vector. I'm using spark 2.3.2 and xgboost 0.82. My configuration ensures most of the training and validation data are present in memory. I'm using spark.memory.storageFraction = 0.16 which is quite low. Execution apparently requires lot of memory. I get OOM errors if I increase this value. I've tried increasing spark.memory.fraction as well and I get OOM errors if I increase it to 0.7.

Here's my full spark configuration:

spark.executor.cores             8
spark.driver.memory              11171M
spark.executor.memory            10000M
spark.executor.memoryOverhead    15600M
spark.default.parallelism        1280
spark.sql.shuffle.partitions     2000
spark.memory.fraction            0.6
spark.memory.storageFraction     0.16
spark.task.cpus                  4

For xgboost, I use this configuration:

val booster = new XGBoostClassifier(
  Map(
    "missing" -> -999999.0,
    "booster" -> "gbtree",
    "objective" -> "binary:logistic",
    "eval_metric" -> "logloss",
    "tree_method" -> "approx",
    "eta" -> 0.2,
    "gamma" -> 1.0,
    "alpha" -> 20,
    "max_depth" -> 4,
    "num_round" -> 1800,
    "num_workers" -> 160,
    "nthread" -> 4,
    "timeout_request_workers" -> 60000L
  )
).setLabelCol(targetVar)
.setEvalSets(evalSet)
.setUseExternalMemory(true)
.setCheckpointInterval(2)
.setCheckpointPath("checkpoints_path")

FYI, I can train this data on a single super large machine and it takes ~1 min per iteration (though the first iteration takes more than 1 hour in addition to 0.5 hours for loading data) on this machine. The goal is to move this whole training process to xgboost-spark so it can scale with the data and we don't have to get larger machines.

Posting here because I didn't get any responses on the discussion forum.

@CodingCat
@trivialfis

Any help will be appreciated.
Thank you!

The text was updated successfully, but these errors were encountered:

chenqin · 2019-08-16T16:08:16Z

@kumarprabhu1988 please try 0.9, there was a critical bug in .82

CodingCat · 2019-08-16T16:12:04Z

@kumarprabhu1988 please try 0.9, there was a critical bug in .82

it's not relevant, the bug I told you is only involved in a prediction process with a upstream shuffle

@kumarprabhu1988 number of features in your data?

CodingCat · 2019-08-16T16:13:23Z

try to remove

.setCheckpointInterval(2)
.setCheckpointPath("checkpoints_path")

for now, a fix is coming soon

kumarprabhu1988 · 2019-08-16T19:35:02Z

Thank you for your response @chenqin and @CodingCat. I have ~1300 features in the data. I've tried without checkpoints, it's not much faster. Here's how long each step takes.

repartition at line 420 in XGBoost.scala for training data takes ~5 min.
repartition at line 199 in XGBoost.scala for each eval set takes ~3 min - but it's not 3 additional 3 min. these 2 steps are in parallel and take overall ~6-7 min.
foreachPartition at line 397 in XGBoost.scala which actually trains the model takes 8 min for a total of 14-15 min.

chenqin · 2019-08-17T05:27:45Z

@kumarprabhu1988 will you be able to share sample/fake dataset that can simulate what you observed. It can help us drive investigation forward more effectively.

CodingCat · 2019-08-17T21:55:39Z

line 420 and 199 happen for only once

line 397 involved reading the shuffle data from 420 and establish xgb cluster as well so it is not only training...you can try to run more iterations and calculate the average cost for each iteration

1300 features is not a trivial workload for a tree model

kumarprabhu1988 · 2019-08-19T13:59:56Z

@chenqin Unfortunately I cannot share the original dataset. I can remove anonymize the data and remove column names and share. What's the best way to share?

@CodingCat I ran 10 iterations and it seems the time taken for each iteration after the first is almost the same. Here is a screenshot for each stage from the spark application page:

This is how a single stage looks:

It looks very similar for multiple iterations.

It is possible that some of the data does not fit in memory and is loaded from the disk. However, any increase in spark.memory.fraction or spark.memory.storageFraction leads to OOM errors.

chenqin · 2019-08-19T18:34:25Z

@chenqin Unfortunately I cannot share the original dataset. I can remove de-anonymize the data and remove column names and share. What's the best way to share?

will you be able to share file link to cloud drive e.g dropbox, google drive etc to qinnchen@gmail.com

kumarprabhu1988 · 2019-08-19T19:00:35Z

@chenqin I'm generating randomized data and will upload it in a couple of hours.

Meanwhile I had another question - how do you know the progress of training if you don't use checkpoints?

CodingCat · 2019-08-19T19:11:29Z

@kumarprabhu1988 again, remove checkpoint or set it to a large value for now, you can save ~50% of time for each checkpoint interval

with 1300 features, 12 mins for each iteration is not a surprise to me, otherwise you can do col_sampling (check https://xgboost.readthedocs.io/en/latest/parameter.html)

chenqin · 2019-08-19T20:09:13Z

Meanwhile I had another question - how do you know the progress of training if you don't use checkpoints?

I think you might be able to track executor log or driver log and see how many iterations has been done (a.k.a 🌲 )

kumarprabhu1988 · 2019-08-19T20:10:31Z

@CodingCat Will do, thank you. I'll set it to 200 so that I can track progress. Will also explore col_sampling.

kumarprabhu1988 · 2019-08-20T19:36:03Z

@chenqin Sorry for the delay, just sent you the email with some sample data.

@CodingCat I set it to 200 and ran 10 rounds. The average time reduced to 3.5 min now. Guess it'll reduce further if I run it for longer than that. Also can I further improve training time by just increasing number of nodes in the cluster?

Additionally noticed something else now. The same configuration I used fails with OOM errors if I use auc as the eval_metric instead of logloss. Specifically, I get the same errors as this issue: #4440.

You asked for executor logs on that issue. Here they are.

stderr.txt
stdout.txt

kumarprabhu1988 · 2019-08-21T15:48:06Z

@CodingCat I set it to 200 and ran 400 rounds. The first 200 rounds took 7.1 hours which means the average is ~2.1 min. However, the second 200 rounds has been running for 11.2 hours. I looked at the CPU utilization in the instances and it is 5-10% which is half of the 10-20% it used in the first 200 rounds. Previously each time it took the same amount of time for each checkpoint.

Any ideas why this might happen?

kumarprabhu1988 · 2019-09-10T18:12:44Z

@CodingCat @chenqin Any ideas guys?

chenqin · 2019-09-11T01:38:03Z

#3946

kumarprabhu1988 · 2019-09-11T20:38:07Z

Thank you @chenqin. This issue looks the same as mine. What about the OOM issue when using AUC?

billbargens · 2019-09-22T13:36:38Z

@chenqin Unfortunately I cannot share the original dataset. I can remove anonymize the data and remove column names and share. What's the best way to share?

@CodingCat I ran 10 iterations and it seems the time taken for each iteration after the first is almost the same. Here is a screenshot for each stage from the spark application page:

This is how a single stage looks:

It looks very similar for multiple iterations.

It is possible that some of the data does not fit in memory and is loaded from the disk. However, any increase in spark.memory.fraction or spark.memory.storageFraction leads to OOM errors.

what does the iteration mean here?

chenqin · 2019-09-22T22:40:22Z

@billbargens
one iteration = building one tree here.

kumarprabhu1988 · 2020-01-22T20:32:04Z

@CodingCat @chenqin FYI, it was a matter of finding the right spark.memory.fraction and spark.memory.storageFraction configurations and once I did, training works smoothly. This issue can be closed.

JackerGao · 2020-07-10T13:13:12Z

@CodingCat @chenqin FYI, it was a matter of finding the right spark.memory.fraction and spark.memory.storageFraction configurations and once I did, training works smoothly. This issue can be closed.

how do you set the config? can you give me some suggestion? i have same issue.

kumarprabhu1988 changed the title ~~XGBoost Spark training quite slow - Good practices~~ [jvm-packages] XGBoost Spark training quite slow - Good practices Aug 14, 2019

CodingCat mentioned this issue Feb 2, 2020

[native] [jvm-packages] allow rebuild prediction cache when it is not initialized #5272

Closed

trivialfis mentioned this issue Feb 10, 2020

Move prediction cache to Learner. #5220

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[jvm-packages] XGBoost Spark training quite slow - Good practices #4774

[jvm-packages] XGBoost Spark training quite slow - Good practices #4774

kumarprabhu1988 commented Aug 14, 2019 •

edited

Loading

chenqin commented Aug 16, 2019

CodingCat commented Aug 16, 2019

CodingCat commented Aug 16, 2019

kumarprabhu1988 commented Aug 16, 2019

chenqin commented Aug 17, 2019

CodingCat commented Aug 17, 2019

kumarprabhu1988 commented Aug 19, 2019 •

edited

Loading

chenqin commented Aug 19, 2019 •

edited by hcho3

Loading

kumarprabhu1988 commented Aug 19, 2019

CodingCat commented Aug 19, 2019

chenqin commented Aug 19, 2019 •

edited

Loading

kumarprabhu1988 commented Aug 19, 2019 •

edited

Loading

kumarprabhu1988 commented Aug 20, 2019 •

edited

Loading

kumarprabhu1988 commented Aug 21, 2019

kumarprabhu1988 commented Sep 10, 2019

chenqin commented Sep 11, 2019

kumarprabhu1988 commented Sep 11, 2019

billbargens commented Sep 22, 2019

chenqin commented Sep 22, 2019

kumarprabhu1988 commented Jan 22, 2020 •

edited

Loading

JackerGao commented Jul 10, 2020

[jvm-packages] XGBoost Spark training quite slow - Good practices #4774

[jvm-packages] XGBoost Spark training quite slow - Good practices #4774

Comments

kumarprabhu1988 commented Aug 14, 2019 • edited Loading

chenqin commented Aug 16, 2019

CodingCat commented Aug 16, 2019

CodingCat commented Aug 16, 2019

kumarprabhu1988 commented Aug 16, 2019

chenqin commented Aug 17, 2019

CodingCat commented Aug 17, 2019

kumarprabhu1988 commented Aug 19, 2019 • edited Loading

chenqin commented Aug 19, 2019 • edited by hcho3 Loading

kumarprabhu1988 commented Aug 19, 2019

CodingCat commented Aug 19, 2019

chenqin commented Aug 19, 2019 • edited Loading

kumarprabhu1988 commented Aug 19, 2019 • edited Loading

kumarprabhu1988 commented Aug 20, 2019 • edited Loading

kumarprabhu1988 commented Aug 21, 2019

kumarprabhu1988 commented Sep 10, 2019

chenqin commented Sep 11, 2019

kumarprabhu1988 commented Sep 11, 2019

billbargens commented Sep 22, 2019

chenqin commented Sep 22, 2019

kumarprabhu1988 commented Jan 22, 2020 • edited Loading

JackerGao commented Jul 10, 2020

kumarprabhu1988 commented Aug 14, 2019 •

edited

Loading

kumarprabhu1988 commented Aug 19, 2019 •

edited

Loading

chenqin commented Aug 19, 2019 •

edited by hcho3

Loading

chenqin commented Aug 19, 2019 •

edited

Loading

kumarprabhu1988 commented Aug 19, 2019 •

edited

Loading

kumarprabhu1988 commented Aug 20, 2019 •

edited

Loading

kumarprabhu1988 commented Jan 22, 2020 •

edited

Loading