
Rabit Poll Timeout Error Flakiness when using Approx Tree Method #7250

Closed

tristers-at-square opened this issue Sep 22, 2021 · 9 comments

tristers-at-square commented Sep 22, 2021

Using the "approx" tree method during distributed XGBoost training seems rather flaky. By my observation, the exact same training job with the exact same data and configurations will fail around 50% of the time when using this method.

When it does fail, the job gets to the point where the RabitTracker has received the start signal from all of the workers and prints "All workers have started." It then just hangs for 30 minutes and eventually fails with an opaque Rabit poll timeout error.

The "hist" tree method is far more stable and almost always works, but unfortunately consumes vastly more memory than the "approx" method so I'd like to stick to the approx method if possible.

The dataset contains ~6,500 columns and 40,000 rows. This is for testing purposes, but I've observed the same flaky behavior on the full-size dataset as well.

Environment

  • Spark 3.1.1
  • xgboost4j 1.4.1
  • xgboost4j-spark 1.4.1
  • YARN
  • Google Cloud Dataproc image 2.0.19-debian10
  • Scala 2.12

Spark Config:

  • spark.yarn.am.memory: 1024m
  • spark.yarn.am.memoryOverhead: 1024m
  • spark.dynamicAllocation.enabled: false
  • spark.kryoserializer.buffer.max: 1048
  • spark.task.cpus: 4
  • spark.executor.memory: 45g
  • spark.executor.memoryOverhead: 5g
  • spark.executor.cores: 8
  • spark.executor.heartbeatInterval: 1000000
  • spark.network.timeout: 2000000
  • spark.driver.memory: 26g
  • spark.driver.cores: 8
  • spark.sql.shuffle.partitions: 1000
  • spark.default.parallelism: 1000
  • spark.sql.parquet.enableVectorizedReader: false
  • spark.memory.fraction: 0.8
  • spark.executor.instances: 8
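
A minimal sketch (illustrative only; the actual job submission isn't shown in this issue) of how a few of the properties above could be applied when building the SparkSession:

import org.apache.spark.sql.SparkSession

// Illustrative: a handful of the properties listed above applied via the
// SparkSession builder. The real job may set them through spark-submit or
// Dataproc cluster properties instead; the app name is hypothetical.
val spark = SparkSession.builder()
  .appName("xgboost-approx-repro")
  .config("spark.task.cpus", "4")
  .config("spark.executor.cores", "8")
  .config("spark.executor.memory", "45g")
  .config("spark.executor.memoryOverhead", "5g")
  .config("spark.dynamicAllocation.enabled", "false")
  .config("spark.network.timeout", "2000000")
  .getOrCreate()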

Learning Algorithm Parameters:

  • num_workers: 16
  • nthread: 4
  • importance_type: total_gain
  • colsample_bytree: 0.8
  • objective: "binary:logistic"
  • alpha: 0.1
  • lambda: 1.2
  • subsample: 0.8
  • verbosity: 3
  • max_depth: 8
  • learning_rate: 0.25
  • min_child_weight: 5.2
  • gamma: 0.15
  • num_round: 10
  • scale_pos_weight: 1.52
  • tree_method: "approx"
  • random_state: 666
  • eval_metric: "logloss"
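
A minimal sketch of how parameters like these can be passed to XGBoostClassifier in xgboost4j-spark (illustrative variable names, not the exact code from the failing job; learning_rate and random_state are shown under their canonical XGBoost names eta and seed, and importance_type is omitted):

import ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier

// Illustrative parameter map mirroring the values listed above.
val xgboostParams: Map[String, Any] = Map(
  "num_workers"      -> 16,
  "nthread"          -> 4,
  "objective"        -> "binary:logistic",
  "tree_method"      -> "approx",
  "max_depth"        -> 8,
  "eta"              -> 0.25,   // learning_rate above
  "min_child_weight" -> 5.2,
  "gamma"            -> 0.15,
  "subsample"        -> 0.8,
  "colsample_bytree" -> 0.8,
  "alpha"            -> 0.1,
  "lambda"           -> 1.2,
  "scale_pos_weight" -> 1.52,
  "num_round"        -> 10,
  "eval_metric"      -> "logloss",
  "verbosity"        -> 3,
  "seed"             -> 666     // random_state above
)

val xgbClassifier = new XGBoostClassifier(xgboostParams)
  .setLabelCol("label")
  .setFeaturesCol("features")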

Logs:
21/09/22 16:36:13 WARN XGBoostSpark: train_test_ratio is deprecated since XGBoost 0.82, we recommend to explicitly pass a training and multiple evaluation datasets by passing 'eval_sets' and 'eval_set_names'
[INFO] [09/22/2021 16:36:14.082] [RabitTracker-akka.actor.default-dispatcher-2] [akka://RabitTracker/user/Handler] Tracker listening @ 10.0.0.38:44077
[INFO] [09/22/2021 16:36:14.083] [RabitTracker-akka.actor.default-dispatcher-2] [akka://RabitTracker/user/Handler] Worker connection timeout is 5 minutes.
21/09/22 16:36:14 INFO XGBoostSpark: starting training with timeout set as 1800000 ms for waiting for resources
21/09/22 16:36:14 WARN org.apache.spark.scheduler.DAGScheduler: Broadcasting large task binary with size 1672.6 KiB
[INFO] [09/22/2021 16:36:56.078] [RabitTracker-akka.actor.default-dispatcher-4] [akka://RabitTracker/user/Handler] Received start signal from fa-test-scala-trackerconf-3-w-1.c.ds-risk-prod.internal [rank: 0]
[INFO] [09/22/2021 16:36:56.080] [RabitTracker-akka.actor.default-dispatcher-4] [akka://RabitTracker/user/Handler] Received start signal from fa-test-scala-trackerconf-3-w-1.c.ds-risk-prod.internal [rank: 1]
[INFO] [09/22/2021 16:36:56.086] [RabitTracker-akka.actor.default-dispatcher-5] [akka://RabitTracker/user/Handler] Received start signal from fa-test-scala-trackerconf-3-w-1.c.ds-risk-prod.internal [rank: 2]
[INFO] [09/22/2021 16:36:56.088] [RabitTracker-akka.actor.default-dispatcher-5] [akka://RabitTracker/user/Handler] Received start signal from fa-test-scala-trackerconf-3-w-1.c.ds-risk-prod.internal [rank: 3]
[INFO] [09/22/2021 16:36:56.088] [RabitTracker-akka.actor.default-dispatcher-7] [akka://RabitTracker/user/Handler] Received start signal from fa-test-scala-trackerconf-3-w-1.c.ds-risk-prod.internal [rank: 4]
[INFO] [09/22/2021 16:36:56.089] [RabitTracker-akka.actor.default-dispatcher-3] [akka://RabitTracker/user/Handler] Worker 10.0.0.30 (rank: 0) has started.
[INFO] [09/22/2021 16:36:56.098] [RabitTracker-akka.actor.default-dispatcher-7] [akka://RabitTracker/user/Handler] Received start signal from fa-test-scala-trackerconf-3-w-0.c.ds-risk-prod.internal [rank: 5]
[INFO] [09/22/2021 16:36:56.098] [RabitTracker-akka.actor.default-dispatcher-7] [akka://RabitTracker/user/Handler] Worker 10.0.0.30 (rank: 1) has started.
[INFO] [09/22/2021 16:36:56.099] [RabitTracker-akka.actor.default-dispatcher-7] [akka://RabitTracker/user/Handler] Received start signal from fa-test-scala-trackerconf-3-w-0.c.ds-risk-prod.internal [rank: 6]
[INFO] [09/22/2021 16:36:56.099] [RabitTracker-akka.actor.default-dispatcher-7] [akka://RabitTracker/user/Handler] Received start signal from fa-test-scala-trackerconf-3-w-0.c.ds-risk-prod.internal [rank: 7]
[INFO] [09/22/2021 16:36:56.100] [RabitTracker-akka.actor.default-dispatcher-7] [akka://RabitTracker/user/Handler] Worker 10.0.0.30 (rank: 2) has started.
[INFO] [09/22/2021 16:36:56.101] [RabitTracker-akka.actor.default-dispatcher-16] [akka://RabitTracker/user/Handler] Received start signal from fa-test-scala-trackerconf-3-w-0.c.ds-risk-prod.internal [rank: 8]
[INFO] [09/22/2021 16:36:56.107] [RabitTracker-akka.actor.default-dispatcher-21] [akka://RabitTracker/user/Handler] Worker 10.0.0.30 (rank: 3) has started.
[INFO] [09/22/2021 16:36:56.108] [RabitTracker-akka.actor.default-dispatcher-21] [akka://RabitTracker/user/Handler] Received start signal from fa-test-scala-trackerconf-3-w-0.c.ds-risk-prod.internal [rank: 9]
[INFO] [09/22/2021 16:36:56.108] [RabitTracker-akka.actor.default-dispatcher-21] [akka://RabitTracker/user/Handler] Received start signal from fa-test-scala-trackerconf-3-w-0.c.ds-risk-prod.internal [rank: 10]
[INFO] [09/22/2021 16:36:56.112] [RabitTracker-akka.actor.default-dispatcher-11] [akka://RabitTracker/user/Handler] Worker 10.0.0.30 (rank: 4) has started.
[INFO] [09/22/2021 16:36:56.114] [RabitTracker-akka.actor.default-dispatcher-22] [akka://RabitTracker/user/Handler] Received start signal from fa-test-scala-trackerconf-3-w-0.c.ds-risk-prod.internal [rank: 11]
[INFO] [09/22/2021 16:36:56.114] [RabitTracker-akka.actor.default-dispatcher-22] [akka://RabitTracker/user/Handler] Received start signal from fa-test-scala-trackerconf-3-w-1.c.ds-risk-prod.internal [rank: 12]
[INFO] [09/22/2021 16:36:56.116] [RabitTracker-akka.actor.default-dispatcher-2] [akka://RabitTracker/user/Handler] Worker 10.0.0.39 (rank: 5) has started.
[INFO] [09/22/2021 16:36:56.119] [RabitTracker-akka.actor.default-dispatcher-31] [akka://RabitTracker/user/Handler] Worker 10.0.0.39 (rank: 6) has started.
[INFO] [09/22/2021 16:36:56.125] [RabitTracker-akka.actor.default-dispatcher-31] [akka://RabitTracker/user/Handler] Worker 10.0.0.39 (rank: 7) has started.
[INFO] [09/22/2021 16:36:56.126] [RabitTracker-akka.actor.default-dispatcher-12] [akka://RabitTracker/user/Handler] Worker 10.0.0.39 (rank: 8) has started.
[INFO] [09/22/2021 16:36:56.129] [RabitTracker-akka.actor.default-dispatcher-21] [akka://RabitTracker/user/Handler] Worker 10.0.0.39 (rank: 9) has started.
[INFO] [09/22/2021 16:36:56.130] [RabitTracker-akka.actor.default-dispatcher-27] [akka://RabitTracker/user/Handler] Worker 10.0.0.39 (rank: 10) has started.
[INFO] [09/22/2021 16:36:56.132] [RabitTracker-akka.actor.default-dispatcher-25] [akka://RabitTracker/user/Handler] Worker 10.0.0.39 (rank: 11) has started.
[INFO] [09/22/2021 16:36:56.134] [RabitTracker-akka.actor.default-dispatcher-14] [akka://RabitTracker/user/Handler] Worker 10.0.0.30 (rank: 12) has started.
[INFO] [09/22/2021 16:36:56.149] [RabitTracker-akka.actor.default-dispatcher-33] [akka://RabitTracker/user/Handler] Received start signal from fa-test-scala-trackerconf-3-w-1.c.ds-risk-prod.internal [rank: 13]
[INFO] [09/22/2021 16:36:56.151] [RabitTracker-akka.actor.default-dispatcher-5] [akka://RabitTracker/user/Handler] Worker 10.0.0.30 (rank: 13) has started.
[INFO] [09/22/2021 16:36:56.169] [RabitTracker-akka.actor.default-dispatcher-25] [akka://RabitTracker/user/Handler] Received start signal from fa-test-scala-trackerconf-3-w-0.c.ds-risk-prod.internal [rank: 14]
[INFO] [09/22/2021 16:36:56.171] [RabitTracker-akka.actor.default-dispatcher-25] [akka://RabitTracker/user/Handler] Worker 10.0.0.39 (rank: 14) has started.
[INFO] [09/22/2021 16:36:56.181] [RabitTracker-akka.actor.default-dispatcher-25] [akka://RabitTracker/user/Handler] Received start signal from fa-test-scala-trackerconf-3-w-1.c.ds-risk-prod.internal [rank: 15]
[INFO] [09/22/2021 16:36:56.184] [RabitTracker-akka.actor.default-dispatcher-20] [akka://RabitTracker/user/Handler] Worker 10.0.0.30 (rank: 15) has started.
[INFO] [09/22/2021 16:36:56.184] [RabitTracker-akka.actor.default-dispatcher-20] [akka://RabitTracker/user/Handler] All workers have started.
[INFO] [09/22/2021 17:06:56.409] [RabitTracker-akka.actor.default-dispatcher-4] [akka://RabitTracker/user/Handler] Received shutdown signal from 11
21/09/22 17:06:56 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 10.0 in stage 41.0 (TID 9758) (fa-test-scala-trackerconf-3-w-0.c.ds-risk-prod.internal executor 8): ml.dmlc.xgboost4j.java.XGBoostError: [17:06:56] /workspace/rabit/include/rabit/internal/socket.h:630: Poll timeout
Stack trace:
[bt] (0) /hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1632327675372_0001/container_1632327675372_0001_01_000008/tmp/libxgboost4j3745035270672734986.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x53) [0x7f5f57f31843]
[bt] (1) /hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1632327675372_0001/container_1632327675372_0001_01_000008/tmp/libxgboost4j3745035270672734986.so(rabit::engine::AllreduceBase::TryAllreduceTree(void*, unsigned long, unsigned long, void (*)(void const*, void*, int, MPI::Datatype const&))+0xa16) [0x7f5f58214166]
[bt] (2) /hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1632327675372_0001/container_1632327675372_0001_01_000008/tmp/libxgboost4j3745035270672734986.so(rabit::engine::AllreduceBase::TryAllreduce(void*, unsigned long, unsigned long, void (*)(void const*, void*, int, MPI::Datatype const&))+0x12) [0x7f5f58216c52]
[bt] (3) /hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1632327675372_0001/container_1632327675372_0001_01_000008/tmp/libxgboost4j3745035270672734986.so(rabit::engine::ReduceHandle::Allreduce(void*, unsigned long, unsigned long, void (*)(void*), void*)+0xa4) [0x7f5f58226c14]
[bt] (4) /hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1632327675372_0001/container_1632327675372_0001_01_000008/tmp/libxgboost4j3745035270672734986.so(rabit::SerializeReducer<xgboost::common::QuantileSketchTemplate<float, float, xgboost::common::WXQSummary<float, float> >::SummaryContainer>::Allreduce(xgboost::common::QuantileSketchTemplate<float, float, xgboost::common::WXQSummary<float, float> >::SummaryContainer*, unsigned long, unsigned long, void (*)(void*), void*)+0xc0) [0x7f5f581a1fb0]
[bt] (5) /hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1632327675372_0001/container_1632327675372_0001_01_000008/tmp/libxgboost4j3745035270672734986.so(xgboost::tree::CQHistMaker::ResetPosAndPropose(std::vector<xgboost::detail::GradientPairInternal<float>, std::allocator<xgboost::detail::GradientPairInternal<float> > > const&, xgboost::DMatrix*, std::vector<unsigned int, std::allocator<unsigned int> > const&, xgboost::RegTree const&)+0x1c4c) [0x7f5f581a76cc]
[bt] (6) /hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1632327675372_0001/container_1632327675372_0001_01_000008/tmp/libxgboost4j3745035270672734986.so(xgboost::tree::GlobalProposalHistMaker::ResetPosAndPropose(std::vector<xgboost::detail::GradientPairInternal<float>, std::allocator<xgboost::detail::GradientPairInternal<float> > > const&, xgboost::DMatrix*, std::vector<unsigned int, std::allocator<unsigned int> > const&, xgboost::RegTree const&)+0x11c) [0x7f5f581a80dc]
[bt] (7) /hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1632327675372_0001/container_1632327675372_0001_01_000008/tmp/libxgboost4j3745035270672734986.so(xgboost::tree::HistMaker::UpdateTree(std::vector<xgboost::detail::GradientPairInternal<float>, std::allocator<xgboost::detail::GradientPairInternal<float> > > const&, xgboost::DMatrix*, xgboost::RegTree*)+0xb5) [0x7f5f58191295]
[bt] (8) /hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1632327675372_0001/container_1632327675372_0001_01_000008/tmp/libxgboost4j3745035270672734986.so(xgboost::tree::HistMaker::Update(xgboost::HostDeviceVector<xgboost::detail::GradientPairInternal<float> >*, xgboost::DMatrix*, std::vector<xgboost::RegTree*, std::allocator<xgboost::RegTree*> > const&)+0xb8) [0x7f5f5818c1b8]

at ml.dmlc.xgboost4j.java.XGBoostJNI.checkCall(XGBoostJNI.java:48)
at ml.dmlc.xgboost4j.java.Booster.update(Booster.java:172)
at ml.dmlc.xgboost4j.java.XGBoost.trainAndSaveCheckpoint(XGBoost.java:218)
at ml.dmlc.xgboost4j.java.XGBoost.train(XGBoost.java:300)
at ml.dmlc.xgboost4j.scala.XGBoost$.$anonfun$trainAndSaveCheckpoint$5(XGBoost.scala:66)
at scala.Option.getOrElse(Option.scala:121)
at ml.dmlc.xgboost4j.scala.XGBoost$.trainAndSaveCheckpoint(XGBoost.scala:62)
at ml.dmlc.xgboost4j.scala.XGBoost$.train(XGBoost.scala:106)
at ml.dmlc.xgboost4j.scala.spark.XGBoost$.buildDistributedBooster(XGBoost.scala:416)
at ml.dmlc.xgboost4j.scala.spark.XGBoost$.$anonfun$trainForNonRanking$1(XGBoost.scala:499)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2(RDD.scala:863)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2$adapted(RDD.scala:863)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
at org.apache.spark.rdd.RDD.$anonfun$getOrCompute$1(RDD.scala:386)
at org.apache.spark.storage.BlockManager.$anonfun$doPutIterator$1(BlockManager.scala:1423)
at org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doPut(BlockManager.scala:1350)
at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1414)
at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:1237)
at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:384)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:335)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:131)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
hcho3 (Collaborator) commented Sep 22, 2021

Can you post an example program with which we can reproduce the problem?

tristers-at-square (Author) commented Sep 23, 2021

@hcho3

For privacy reasons I can't share the exact data, but we can create synthetic data like this (in Python) to mimic the dataset I used:

import random

import pandas

# Synthetic dataset: 7000 random float features plus a binary label column.
num_features = 7000
num_rows = 30000
data = [[random.random() for _ in range(num_features)] + [random.randint(0, 1)]
        for _ in range(num_rows)]
columns = [f"feature{i}" for i in range(num_features)] + ["label"]
df = pandas.DataFrame(data, columns=columns)
# Write via the DataFrame (requires pyarrow or fastparquet);
# `pandas.to_parquet(...)` is not a module-level function.
df.to_parquet("path_to_your_parquet_file.pq")

Using the approx method and the same conditions described in this post, I got 2/5 successful runs with the synthetic data.

For the training code (in Scala), here's a pared-down version of my work code. Nothing fancy:

val dataPath = "path_to_your_parquet_file.pq"
val trainDF = spark.read.parquet(dataPath)

val featureColumns = Range(0, 7000).map(x => s"feature${x}").toArray
val vectorAssembler = new VectorAssembler().setInputCols(featureColumns)
                                           .setOutputCol("features")
                                           .setHandleInvalid("keep")
// SparseToDenseTransformer is a custom transformer (not part of Spark ML)
// that converts the assembled sparse vectors to dense vectors.
val sparseToDenseTransformer = new SparseToDenseTransformer().setInputCol("features")
                                                             .setOutputCol("denseFeatures")
// xgboostParams is the parameter map described in the issue.
val xgboostTrainer = new XGBoostClassifier(xgboostParams).setLabelCol("label")
                                                         .setFeaturesCol("denseFeatures")
val pipeline = new Pipeline().setStages(Array(
  vectorAssembler,
  sparseToDenseTransformer,
  xgboostTrainer
))
val pipelineModel = pipeline.fit(trainDF)
It doesn't seem to be a memory issue either, as I get the same flakiness even when I use more machines. The job seemingly hangs after the tracker receives the start signal from all the workers, until it hits a poll timeout in Rabit.

hcho3 (Collaborator) commented Sep 23, 2021

@tristers-at-square Thanks. Let me see if I can reproduce this issue locally using a single machine.

tristers-at-square (Author) commented Sep 24, 2021

@hcho3 Thanks!

It seems other people on the forums are having the same issue; see, for example, here.

trivialfis (Member) commented

@hcho3 If you manage to reproduce it, could you please also test #7214?

hcho3 (Collaborator) commented Oct 8, 2021

@tristers-at-square I tried running the script you provided and got this error:

Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/ml/param/shared/HasInputCol$class 

What should I do?

dchristle commented

I'm running into a similar issue when using Spark on k8s with the approx method; it does not seem to trigger with the hist method. I'm using a SNAPSHOT XGBoost compiled from the master branch within the last few days, with a SNAPSHOT of Spark (~3.3.x). Java 11.0.13 and Python 3.8 are used on all hosts.

After I start the job, it takes only a few minutes to get to the forEachPartition step. In this job, nWorkers is set to 250, there are 3 cores per executor, nthread is set to 3, and spark.task.cpus is also set to 3.
[screenshot: forEachPartition stage]
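
A minimal sketch (illustrative names and values only, not the actual job code) of how that alignment of nthread, spark.task.cpus, and executor cores is typically expressed:

import org.apache.spark.sql.SparkSession
import ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier

// One XGBoost/Rabit worker runs per Spark task; setting nthread equal to
// spark.task.cpus (3 here) gives each worker exclusive use of its task's cores.
val spark = SparkSession.builder()
  .config("spark.task.cpus", "3")
  .config("spark.executor.cores", "3")
  .getOrCreate()

val classifier = new XGBoostClassifier(Map(
  "num_workers" -> 250,               // 250 Spark tasks => 250 Rabit workers
  "nthread"     -> 3,                 // matches spark.task.cpus
  "tree_method" -> "approx",
  "objective"   -> "binary:logistic"  // illustrative
))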

From the logs, the driver acknowledges the correct number of nodes are available and appears to start the training process.
[screenshot: logStart]

However, just a few minutes in, the CPU usage goes to 0% and the memory usage does not change further.

[screenshot: executor1_stats]

This suggests to me that the training process is hung and no useful work is being done. The driver CPU usage also goes down and is mostly flat, but does not quite go to zero.

After about 30 minutes, the job gives up and the logs report the following error:

 WARN TaskSetManager: Lost task 42.0 in stage 11.0 (TID 2064) (10.100.1.114 executor 70): ml.dmlc.xgboost4j.java.XGBoostError: [20:54:20] /opt/xgboost/rabit/include/rabit/internal/socket.h:630: Poll timeout
Stack trace:
 [bt] (0) /tmp/libxgboost4j11483072497966676486.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x59) [0x7f3b835b72c5]
 [bt] (1) /tmp/libxgboost4j11483072497966676486.so(rabit::utils::PollHelper::Poll(std::chrono::duration<long, std::ratio<1l, 1l> >)+0x141) [0x7f3b839129d5]
 [bt] (2) /tmp/libxgboost4j11483072497966676486.so(rabit::engine::AllreduceBase::TryAllreduceTree(void*, unsigned long, unsigned long, void (*)(void const*, void*, int, MPI::Datatype const&))+0x474) [0x7f3b8390f584]
 [bt] (3) /tmp/libxgboost4j11483072497966676486.so(rabit::engine::AllreduceBase::TryAllreduce(void*, unsigned long, unsigned long, void (*)(void const*, void*, int, MPI::Datatype const&))+0x71) [0x7f3b8390f10d]
 [bt] (4) /tmp/libxgboost4j11483072497966676486.so(rabit::engine::AllreduceBase::Allreduce(void*, unsigned long, unsigned long, void (*)(void const*, void*, int, MPI::Datatype const&), void (*)(void*), void*)+0x93) [0x7f3b83912d2b]
 [bt] (5) /tmp/libxgboost4j11483072497966676486.so(rabit::engine::ReduceHandle::Allreduce(void*, unsigned long, unsigned long, void (*)(void*), void*)+0x82) [0x7f3b8391f2a0]
 [bt] (6) /tmp/libxgboost4j11483072497966676486.so(rabit::SerializeReducer<xgboost::common::QuantileSketchTemplate<float, float, xgboost::common::WXQSummary<float, float> >::SummaryContainer>::Allreduce(xgboost::common::QuantileSketchTemplate<float, float, xgboost::common::WXQSummary<float, float> >::SummaryContainer*, unsigned long, unsigned long, void (*)(void*), void*)+0xfe) [0x7f3b83887dcc]
 [bt] (7) /tmp/libxgboost4j11483072497966676486.so(xgboost::tree::CQHistMaker::ResetPosAndPropose(std::vector<xgboost::detail::GradientPairInternal<float>, std::allocator<xgboost::detail::GradientPairInternal<float> > > const&, xgboost::DMatrix*, std::vector<unsigned int, std::allocator<unsigned int> > const&, xgboost::RegTree const&)+0xa26) [0x7f3b8388224e]
 [bt] (8) /tmp/libxgboost4j11483072497966676486.so(xgboost::tree::GlobalProposalHistMaker::ResetPosAndPropose(std::vector<xgboost::detail::GradientPairInternal<float>, std::allocator<xgboost::detail::GradientPairInternal<float> > > const&, xgboost::DMatrix*, std::vector<unsigned int, std::allocator<unsigned int> > const&, xgboost::RegTree const&)+0x1a2) [0x7f3b83883c4c]

I've attached a thread dump from the driver recorded about 15 minutes after the nodes have nominally started in case it may help in finding the root cause.

driver_threaddump.log

trivialfis (Member) commented

I can't reproduce the issue; it would be great if someone could test the nightly build. Feel free to ping me if you need any help. ;-)

trivialfis (Member) commented

Closing, since approx has been rewritten on top of the hist tree method; the two now share much of their code structure.
