I ran XGBoost on YARN to test whether fault tolerance works. I started 4 workers. When XGBoost began updating the model, I killed one worker (worker0). YARN started a replacement worker named worker0_1, but that worker eventually failed with this error: Allreduce Recovered data size do not match the specification of function call.
The corresponding code is in allreduce_robust.cc (line 817):
    if (role == kRequestData || role == kHaveData) {
      utils::Check(data_size == size,
                   "Allreduce Recovered data size do not match the specification of function call.\n"
                   "Please check if calling sequence of recovered program is the "
                   "same the original one in current VersionNumber");
    }
I then printed some details: data_size is 800 and size is 8, but I don't know why they differ.
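Going by the error text, the check compares the size of the result being recovered against the size the restarted worker passes in its replayed call. A minimal sketch of that idea follows. It is not the rabit implementation: `ResultCache` and `CheckReplay` are hypothetical names, and caching results by sequence number is an assumption made only for illustration.

```cpp
#include <cstddef>
#include <map>
#include <string>

// Hypothetical stand-in for a store of completed Allreduce results,
// keyed by call sequence number. Only the byte size is kept here.
struct ResultCache {
  std::map<int, std::size_t> size_by_seqno;

  void Record(int seqno, std::size_t nbytes) { size_by_seqno[seqno] = nbytes; }
};

// Returns an empty string when the replayed call agrees with the cached
// result (or nothing is cached for that seqno), otherwise a diagnostic.
std::string CheckReplay(const ResultCache& cache, int seqno,
                        std::size_t requested) {
  auto it = cache.size_by_seqno.find(seqno);
  if (it == cache.size_by_seqno.end()) return "";  // nothing to compare
  if (it->second == requested) return "";          // sizes agree
  return "size mismatch at seqno " + std::to_string(seqno) +
         ": cached " + std::to_string(it->second) +
         ", requested " + std::to_string(requested);
}
```

Under these assumptions, recording an 800-byte result at sequence number 0 and then replaying an 8-byte call at the same sequence number produces the mismatch diagnostic. That would be consistent with the restarted worker issuing its Allreduce calls in a different order than the original run, though the numbers above do not confirm this.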