Support vertical federated learning #8932
Conversation
@trivialfis @hcho3 this is working end to end, please take a look. I'll see if I can add more tests. Thanks!
}

dmlc::TemporaryDirectory tmpdir;
std::string path = tmpdir.path + "/small" + std::to_string(rank) + ".csv";
Could you please extract this into helpers.h if it's not test-case-specific?
Done.
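For illustration, a minimal sketch of what such a shared helper could look like once moved into helpers.h; the name CreateRankedTestCSV and the file contents are hypothetical placeholders, not the helper the PR actually adds:

```cpp
// tests/cpp/helpers.h (sketch)
#include <dmlc/filesystem.h>  // dmlc::TemporaryDirectory

#include <fstream>
#include <string>

// Writes a small per-rank CSV into `tmpdir` and returns its path.
// The rows written here are placeholders for illustration only.
inline std::string CreateRankedTestCSV(dmlc::TemporaryDirectory const& tmpdir, int rank) {
  std::string path = tmpdir.path + "/small" + std::to_string(rank) + ".csv";
  std::ofstream fout(path);
  fout << "1,2,3\n4,5,6\n";
  return path;
}
```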
collective::Allreduce<collective::Operation::kSum>(
    reinterpret_cast<double*>(h_sum.Values().data()), h_sum.Size() * 2);

// In vertical federated learning, only worker 0 needs to call this, no need to do an allreduce.
Maybe we can simply run it for all workers to remove a condition? We have a validate method in the learner model param, which is a good place for checking whether the label is correctly distributed across workers for federated learning. If labels are the same for all workers, the base_score should also be the same. Also, we don't need an additional info parameter.
The issue is that in learner.cc we only call InitEstimation for worker 0, which in turn calls this method. If we don't skip this allreduce, we'd get a mismatch in non-0 workers.
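Roughly, the guard being discussed looks like the sketch below; IsVerticalFederated() is a placeholder for whatever condition the PR actually checks, so treat the structure as illustrative only:

```cpp
// Sketch: in vertical federated learning only worker 0 holds the labels and
// runs this code path, so summing across workers would desynchronise workers.
// IsVerticalFederated() is a hypothetical predicate, not the real check.
if (!IsVerticalFederated()) {
  collective::Allreduce<collective::Operation::kSum>(
      reinterpret_cast<double*>(h_sum.Values().data()), h_sum.Size() * 2);
}
```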
@@ -857,6 +857,25 @@ class LearnerConfiguration : public Learner {
      mparam_.num_target = n_targets;
    }
  }

  void InitEstimation(MetaInfo const& info, linalg::Tensor<float, 1>* base_score) {
What happens if we just calculate the gradient using individual workers? Is the gradient still the same? If so, we can just let them calculate.
Since we don't have labels in non-0 workers, they won't be able to calculate the gradient.
  collective::Broadcast(out_gpair->HostPointer(), out_gpair->Size() * sizeof(GradientPair),
                        0);
} else {
  CHECK_EQ(info.labels.Size(), 0)
I think it would be difficult for users to specify their own worker rank once we put xgboost in an automated pipeline. Looking at your nvflare example, the rank is not assigned by the user.
I think we can check if the label size is 0 here to determine who needs to calculate the gradient. But in general we need stable ranks for the trained model to be useful for inference. That's more of an nvflare requirement. I'll ask them.
Is there any way to automatically agree on who should be the one to own the label? Maybe it's easier to have a fully automated pipeline if everyone has equal access to labels? Just curious from a user's perspective.
Sometimes (most times?) it's not possible for all the parties to have access to the labels. For example, a hospital may have the diagnosis results for a patient, while labs only have access to blood work, DNA tests, etc.
I think the best way to guarantee the ordering for now is to always launch the workers in the same sequence. Since federated learning is usually run by a single admin, this is a reasonable solution. I'll ask the NVFLARE team to see if they can add some new features to better support this.
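To make the label-size idea from earlier in this thread concrete, here is a sketch of the pattern; obj_->GetGradient and the variable names are borrowed from the surrounding code, but the overall wiring is an assumption, not the merged implementation:

```cpp
// Sketch: only the worker that actually holds labels computes the gradient,
// then everyone receives worker 0's result so tree construction stays in sync.
if (info.labels.Size() != 0) {
  obj_->GetGradient(predt, info, iter, out_gpair);  // label owner (worker 0)
} else {
  out_gpair->Resize(info.num_row_);  // non-owners only allocate the buffer
}
collective::Broadcast(out_gpair->HostPointer(),
                      out_gpair->Size() * sizeof(GradientPair), /*root=*/0);
```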
src/data/simple_dmatrix.cu (outdated diff)
@@ -35,12 +36,14 @@ SimpleDMatrix::SimpleDMatrix(AdapterT* adapter, float missing, int32_t /*nthread
  info_.num_col_ = adapter->NumColumns();
  info_.num_row_ = adapter->NumRows();
  // Synchronise worker columns
  collective::Allreduce<collective::Operation::kMax>(&info_.num_col_, 1);
  info_.data_split_mode = data_split_mode;
  ReindexFeatures();
Let's mark it not implemented for now. This may pull the data back to the CPU.
Done.
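As a rough illustration of marking it not implemented in the GPU path, something like the sketch below could sit in the simple_dmatrix.cu constructor; the exact message and the use of the DataSplitMode::kCol enumerator here are assumptions, not the merged code:

```cpp
// Sketch: refuse column-wise (vertical) data splits in the device constructor
// rather than calling ReindexFeatures(), which would pull the data back to CPU.
CHECK(data_split_mode != DataSplitMode::kCol)
    << "Column-wise data split is not yet supported on the GPU.";
```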
@@ -1051,6 +1063,13 @@ void SparsePage::SortIndices(int32_t n_threads) {
  });
}

void SparsePage::Reindex(uint64_t feature_offset, int32_t n_threads) {
  auto& h_data = this->data.HostVector();
This potentially pulls data from the device to the host.
Agreed, but it's not much different from some of the other methods there.
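For reference, a sketch of what a host-side Reindex with this signature could look like, using XGBoost's common::ParallelFor; this is an illustrative reconstruction, not necessarily the merged implementation:

```cpp
// Sketch: shift every entry's feature index by this worker's column offset so
// that workers in a column-wise (vertical) split agree on global feature ids.
// HostVector() is the catch discussed above: if the data currently lives on
// the device, this call copies it back to the host.
void SparsePage::Reindex(uint64_t feature_offset, int32_t n_threads) {
  auto& h_data = this->data.HostVector();
  common::ParallelFor(h_data.size(), n_threads, [&](std::size_t i) {
    h_data[i].index += feature_offset;
  });
}
```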
There are two main sets of changes: