-
-
Notifications
You must be signed in to change notification settings - Fork 8.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Rabit Poll Timeout Error Flakiness when using Approx Tree Method #7250
Comments
Can you post an example program with which we can reproduce the problem? |
For privacy reasons I can't give the exact same data, but we can create synthetic data like this to mimic the dataset I used (in Python):
Using approx method and the same conditions as I described in this post, I got 2/5 successful runs using the synthetic data. For the training code (in Scala), here's a pared down version of my work code. Nothing fancy:
It doesn't seem to be a memory issue either as I get the same flakiness even when I use more machines. It just seemingly hangs after receiving the worker start command from all the workers until it hits a poll timeout in rabit. |
@tristers-at-square Thanks. Let me see if I can reproduce this issue locally using a single machine. |
@tristers-at-square I tried running the script you provided and got error
What should I do? |
I can't reproduce the issue, it would be great if someone can test the nightly build. Feel free to ping me if you need any help. ;-) |
Closing since the approx is rewritten based on the hist tree method. Now they share a lot of common code structures. |
Using the "approx" tree method during distributed XGBoost training seems rather flaky. By my observation, the exact same training job with the exact same data and configurations will fail around 50% of the time when using this method.
When it does fail, it will get to the part where the RabitTracker receives the start signal from all of the workers and print "All Workers have started". However, it will then just hang there for 30 minutes and eventually fail with some opaque Rabit poll timeout error.
The "hist" tree method is far more stable and almost always works, but unfortunately consumes vastly more memory than the "approx" method so I'd like to stick to the approx method if possible.
Dataset contains ~6500 columns and 40000 rows. This is for testing purposes, but I've observed the same flaky behavior in the full size dataset as well.
Environment
Spark Config:
Learning Algorithm Parameters
Logs:
21/09/22 16:36:13 WARN XGBoostSpark: train_test_ratio is deprecated since XGBoost 0.82, we recommend to explicitly pass a training and multiple evaluation datasets by passing 'eval_sets' and 'eval_set_names'
[INFO] [09/22/2021 16:36:14.082] [RabitTracker-akka.actor.default-dispatcher-2] [akka://RabitTracker/user/Handler] Tracker listening @ 10.0.0.38:44077
[INFO] [09/22/2021 16:36:14.083] [RabitTracker-akka.actor.default-dispatcher-2] [akka://RabitTracker/user/Handler] Worker connection timeout is 5 minutes.
21/09/22 16:36:14 INFO XGBoostSpark: starting training with timeout set as 1800000 ms for waiting for resources
21/09/22 16:36:14 WARN org.apache.spark.scheduler.DAGScheduler: Broadcasting large task binary with size 1672.6 KiB
[INFO] [09/22/2021 16:36:56.078] [RabitTracker-akka.actor.default-dispatcher-4] [akka://RabitTracker/user/Handler] Received start signal from fa-test-scala-trackerconf-3-w-1.c.ds-risk-prod.internal [rank: 0]
[INFO] [09/22/2021 16:36:56.080] [RabitTracker-akka.actor.default-dispatcher-4] [akka://RabitTracker/user/Handler] Received start signal from fa-test-scala-trackerconf-3-w-1.c.ds-risk-prod.internal [rank: 1]
[INFO] [09/22/2021 16:36:56.086] [RabitTracker-akka.actor.default-dispatcher-5] [akka://RabitTracker/user/Handler] Received start signal from fa-test-scala-trackerconf-3-w-1.c.ds-risk-prod.internal [rank: 2]
[INFO] [09/22/2021 16:36:56.088] [RabitTracker-akka.actor.default-dispatcher-5] [akka://RabitTracker/user/Handler] Received start signal from fa-test-scala-trackerconf-3-w-1.c.ds-risk-prod.internal [rank: 3]
[INFO] [09/22/2021 16:36:56.088] [RabitTracker-akka.actor.default-dispatcher-7] [akka://RabitTracker/user/Handler] Received start signal from fa-test-scala-trackerconf-3-w-1.c.ds-risk-prod.internal [rank: 4]
[INFO] [09/22/2021 16:36:56.089] [RabitTracker-akka.actor.default-dispatcher-3] [akka://RabitTracker/user/Handler] Worker 10.0.0.30 (rank: 0) has started.
[INFO] [09/22/2021 16:36:56.098] [RabitTracker-akka.actor.default-dispatcher-7] [akka://RabitTracker/user/Handler] Received start signal from fa-test-scala-trackerconf-3-w-0.c.ds-risk-prod.internal [rank: 5]
[INFO] [09/22/2021 16:36:56.098] [RabitTracker-akka.actor.default-dispatcher-7] [akka://RabitTracker/user/Handler] Worker 10.0.0.30 (rank: 1) has started.
[INFO] [09/22/2021 16:36:56.099] [RabitTracker-akka.actor.default-dispatcher-7] [akka://RabitTracker/user/Handler] Received start signal from fa-test-scala-trackerconf-3-w-0.c.ds-risk-prod.internal [rank: 6]
[INFO] [09/22/2021 16:36:56.099] [RabitTracker-akka.actor.default-dispatcher-7] [akka://RabitTracker/user/Handler] Received start signal from fa-test-scala-trackerconf-3-w-0.c.ds-risk-prod.internal [rank: 7]
[INFO] [09/22/2021 16:36:56.100] [RabitTracker-akka.actor.default-dispatcher-7] [akka://RabitTracker/user/Handler] Worker 10.0.0.30 (rank: 2) has started.
[INFO] [09/22/2021 16:36:56.101] [RabitTracker-akka.actor.default-dispatcher-16] [akka://RabitTracker/user/Handler] Received start signal from fa-test-scala-trackerconf-3-w-0.c.ds-risk-prod.internal [rank: 8]
[INFO] [09/22/2021 16:36:56.107] [RabitTracker-akka.actor.default-dispatcher-21] [akka://RabitTracker/user/Handler] Worker 10.0.0.30 (rank: 3) has started.
[INFO] [09/22/2021 16:36:56.108] [RabitTracker-akka.actor.default-dispatcher-21] [akka://RabitTracker/user/Handler] Received start signal from fa-test-scala-trackerconf-3-w-0.c.ds-risk-prod.internal [rank: 9]
[INFO] [09/22/2021 16:36:56.108] [RabitTracker-akka.actor.default-dispatcher-21] [akka://RabitTracker/user/Handler] Received start signal from fa-test-scala-trackerconf-3-w-0.c.ds-risk-prod.internal [rank: 10]
[INFO] [09/22/2021 16:36:56.112] [RabitTracker-akka.actor.default-dispatcher-11] [akka://RabitTracker/user/Handler] Worker 10.0.0.30 (rank: 4) has started.
[INFO] [09/22/2021 16:36:56.114] [RabitTracker-akka.actor.default-dispatcher-22] [akka://RabitTracker/user/Handler] Received start signal from fa-test-scala-trackerconf-3-w-0.c.ds-risk-prod.internal [rank: 11]
[INFO] [09/22/2021 16:36:56.114] [RabitTracker-akka.actor.default-dispatcher-22] [akka://RabitTracker/user/Handler] Received start signal from fa-test-scala-trackerconf-3-w-1.c.ds-risk-prod.internal [rank: 12]
[INFO] [09/22/2021 16:36:56.116] [RabitTracker-akka.actor.default-dispatcher-2] [akka://RabitTracker/user/Handler] Worker 10.0.0.39 (rank: 5) has started.
[INFO] [09/22/2021 16:36:56.119] [RabitTracker-akka.actor.default-dispatcher-31] [akka://RabitTracker/user/Handler] Worker 10.0.0.39 (rank: 6) has started.
[INFO] [09/22/2021 16:36:56.125] [RabitTracker-akka.actor.default-dispatcher-31] [akka://RabitTracker/user/Handler] Worker 10.0.0.39 (rank: 7) has started.
[INFO] [09/22/2021 16:36:56.126] [RabitTracker-akka.actor.default-dispatcher-12] [akka://RabitTracker/user/Handler] Worker 10.0.0.39 (rank: 8) has started.
[INFO] [09/22/2021 16:36:56.129] [RabitTracker-akka.actor.default-dispatcher-21] [akka://RabitTracker/user/Handler] Worker 10.0.0.39 (rank: 9) has started.
[INFO] [09/22/2021 16:36:56.130] [RabitTracker-akka.actor.default-dispatcher-27] [akka://RabitTracker/user/Handler] Worker 10.0.0.39 (rank: 10) has started.
[INFO] [09/22/2021 16:36:56.132] [RabitTracker-akka.actor.default-dispatcher-25] [akka://RabitTracker/user/Handler] Worker 10.0.0.39 (rank: 11) has started.
[INFO] [09/22/2021 16:36:56.134] [RabitTracker-akka.actor.default-dispatcher-14] [akka://RabitTracker/user/Handler] Worker 10.0.0.30 (rank: 12) has started.
[INFO] [09/22/2021 16:36:56.149] [RabitTracker-akka.actor.default-dispatcher-33] [akka://RabitTracker/user/Handler] Received start signal from fa-test-scala-trackerconf-3-w-1.c.ds-risk-prod.internal [rank: 13]
[INFO] [09/22/2021 16:36:56.151] [RabitTracker-akka.actor.default-dispatcher-5] [akka://RabitTracker/user/Handler] Worker 10.0.0.30 (rank: 13) has started.
[INFO] [09/22/2021 16:36:56.169] [RabitTracker-akka.actor.default-dispatcher-25] [akka://RabitTracker/user/Handler] Received start signal from fa-test-scala-trackerconf-3-w-0.c.ds-risk-prod.internal [rank: 14]
[INFO] [09/22/2021 16:36:56.171] [RabitTracker-akka.actor.default-dispatcher-25] [akka://RabitTracker/user/Handler] Worker 10.0.0.39 (rank: 14) has started.
[INFO] [09/22/2021 16:36:56.181] [RabitTracker-akka.actor.default-dispatcher-25] [akka://RabitTracker/user/Handler] Received start signal from fa-test-scala-trackerconf-3-w-1.c.ds-risk-prod.internal [rank: 15]
[INFO] [09/22/2021 16:36:56.184] [RabitTracker-akka.actor.default-dispatcher-20] [akka://RabitTracker/user/Handler] Worker 10.0.0.30 (rank: 15) has started.
[INFO] [09/22/2021 16:36:56.184] [RabitTracker-akka.actor.default-dispatcher-20] [akka://RabitTracker/user/Handler] All workers have started.
[INFO] [09/22/2021 17:06:56.409] [RabitTracker-akka.actor.default-dispatcher-4] [akka://RabitTracker/user/Handler] Received shutdown signal from 11
21/09/22 17:06:56 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 10.0 in stage 41.0 (TID 9758) (fa-test-scala-trackerconf-3-w-0.c.ds-risk-prod.internal executor 8): ml.dmlc.xgboost4j.java.XGBoostError: [17:06:56] /workspace/rabit/include/rabit/internal/socket.h:630: Poll timeout
Stack trace:
[bt] (0) /hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1632327675372_0001/container_1632327675372_0001_01_000008/tmp/libxgboost4j3745035270672734986.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x53) [0x7f5f57f31843]
[bt] (1) /hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1632327675372_0001/container_1632327675372_0001_01_000008/tmp/libxgboost4j3745035270672734986.so(rabit::engine::AllreduceBase::TryAllreduceTree(void*, unsigned long, unsigned long, void ()(void const, void*, int, MPI::Datatype const&))+0xa16) [0x7f5f58214166]
[bt] (2) /hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1632327675372_0001/container_1632327675372_0001_01_000008/tmp/libxgboost4j3745035270672734986.so(rabit::engine::AllreduceBase::TryAllreduce(void*, unsigned long, unsigned long, void ()(void const, void*, int, MPI::Datatype const&))+0x12) [0x7f5f58216c52]
[bt] (3) /hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1632327675372_0001/container_1632327675372_0001_01_000008/tmp/libxgboost4j3745035270672734986.so(rabit::engine::ReduceHandle::Allreduce(void*, unsigned long, unsigned long, void ()(void), void*)+0xa4) [0x7f5f58226c14]
[bt] (4) /hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1632327675372_0001/container_1632327675372_0001_01_000008/tmp/libxgboost4j3745035270672734986.so(rabit::SerializeReducer<xgboost::common::QuantileSketchTemplate<float, float, xgboost::common::WXQSummary<float, float> >::SummaryContainer>::Allreduce(xgboost::common::QuantileSketchTemplate<float, float, xgboost::common::WXQSummary<float, float> >::SummaryContainer*, unsigned long, unsigned long, void ()(void), void*)+0xc0) [0x7f5f581a1fb0]
[bt] (5) /hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1632327675372_0001/container_1632327675372_0001_01_000008/tmp/libxgboost4j3745035270672734986.so(xgboost::tree::CQHistMaker::ResetPosAndPropose(std::vector<xgboost::detail::GradientPairInternal, std::allocator<xgboost::detail::GradientPairInternal > > const&, xgboost::DMatrix*, std::vector<unsigned int, std::allocator > const&, xgboost::RegTree const&)+0x1c4c) [0x7f5f581a76cc]
[bt] (6) /hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1632327675372_0001/container_1632327675372_0001_01_000008/tmp/libxgboost4j3745035270672734986.so(xgboost::tree::GlobalProposalHistMaker::ResetPosAndPropose(std::vector<xgboost::detail::GradientPairInternal, std::allocator<xgboost::detail::GradientPairInternal > > const&, xgboost::DMatrix*, std::vector<unsigned int, std::allocator > const&, xgboost::RegTree const&)+0x11c) [0x7f5f581a80dc]
[bt] (7) /hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1632327675372_0001/container_1632327675372_0001_01_000008/tmp/libxgboost4j3745035270672734986.so(xgboost::tree::HistMaker::UpdateTree(std::vector<xgboost::detail::GradientPairInternal, std::allocator<xgboost::detail::GradientPairInternal > > const&, xgboost::DMatrix*, xgboost::RegTree*)+0xb5) [0x7f5f58191295]
[bt] (8) /hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1632327675372_0001/container_1632327675372_0001_01_000008/tmp/libxgboost4j3745035270672734986.so(xgboost::tree::HistMaker::Update(xgboost::HostDeviceVector<xgboost::detail::GradientPairInternal >, xgboost::DMatrix, std::vector<xgboost::RegTree*, std::allocatorxgboost::RegTree* > const&)+0xb8) [0x7f5f5818c1b8]
The text was updated successfully, but these errors were encountered: