
Support vertical federated learning #8932

Merged · 9 commits · Mar 22, 2023
Conversation

@rongou (Contributor) commented Mar 17, 2023

There are two main sets of changes:

  • When loading the data, we assume each worker has its own set of features and that its feature indices start from 0. So after loading, we figure out the global feature indices and add an offset to the feature indices on each worker to make them globally consistent (see the sketch after this list).
  • When calculating gradients (and, similarly, when getting the initial base score), we assume the labels are only available on the first worker, so the gradients are calculated there and broadcast to the other workers.
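
Conceptually (this is an illustrative sketch, not the PR's exact code; the helper name is hypothetical), the offset a worker adds is the total number of columns owned by all lower-ranked workers, once the per-worker column counts have been shared through a collective:

#include <cstdint>
#include <numeric>
#include <vector>

// Hypothetical helper: given every worker's local column count (already
// gathered via a collective), return the offset this worker must add to its
// local feature indices to make them globally consistent.
std::uint64_t GlobalFeatureOffset(std::vector<std::uint64_t> const& column_counts, int rank) {
  // Sum the column counts of all workers with a lower rank.
  return std::accumulate(column_counts.cbegin(), column_counts.cbegin() + rank,
                         std::uint64_t{0});
}

For example, with three workers owning 3, 2, and 4 features respectively, worker 1 maps its local feature 0 to global feature 3, and worker 2 maps its local feature 0 to global feature 5.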

@rongou (Contributor, Author) commented Mar 17, 2023

@trivialfis @hcho3 this is working end to end, please take a look. I'll see if I can add more tests. Thanks!

}

dmlc::TemporaryDirectory tmpdir;
std::string path = tmpdir.path + "/small" + std::to_string(rank) + ".csv";
@trivialfis (Member):
Could you please extract this into helpers.h if it's not test-case-specific?

@rongou (Contributor, Author):
Done.
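
For illustration only, the extracted helper might look roughly like this; the name, signature, and file contents are assumptions rather than the exact code added to helpers.h:

#include <fstream>
#include <string>

// Hypothetical test helper: write a small per-rank CSV into the given
// directory so each worker in a federated test gets its own feature slice.
inline std::string CreateRankedTestCSV(std::string const& dir, int rank) {
  std::string path = dir + "/small" + std::to_string(rank) + ".csv";
  std::ofstream fout(path);
  fout << "1,2,3\n4,5,6\n";  // placeholder rows
  return path;
}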

collective::Allreduce<collective::Operation::kSum>(
    reinterpret_cast<double*>(h_sum.Values().data()), h_sum.Size() * 2);

// In vertical federated learning, only worker 0 needs to call this, no need to do an allreduce.
@trivialfis (Member) commented Mar 17, 2023:
Maybe we can simply run it for all workers to remove a condition? We have a validate method in the learner model param, which is a good place for checking whether the label is correctly distributed across workers for federated learning. If labels are the same for all workers, the base_score should also be the same. Also, we don't need an additional info parameter.

@rongou (Contributor, Author):
The issue is that in learner.cc we only call InitEstimation for worker 0, which in turn calls this method. If we don't skip this allreduce, we'd get a mismatch on non-0 workers.
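
In other words, the allreduce shown above ends up guarded so it is skipped in the vertical federated case; a minimal sketch, where the flag name is an assumption and not the PR's exact condition:

// Hypothetical guard: in vertical federated learning only worker 0 has the
// labels and computes this sum, so the cross-worker reduction is skipped to
// avoid a collective-call mismatch on the other workers.
if (!is_vertical_federated) {  // assumed flag
  collective::Allreduce<collective::Operation::kSum>(
      reinterpret_cast<double*>(h_sum.Values().data()), h_sum.Size() * 2);
}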

@@ -857,6 +857,25 @@ class LearnerConfiguration : public Learner {
mparam_.num_target = n_targets;
}
}

void InitEstimation(MetaInfo const& info, linalg::Tensor<float, 1>* base_score) {
@trivialfis (Member):
What happens if we just calculate the gradient using individual workers? Is the gradient still the same? If so, we can just let them calculate.

@rongou (Contributor, Author):
Since we don't have labels in non-0 workers, they won't be able to calculate the gradient.

collective::Broadcast(out_gpair->HostPointer(), out_gpair->Size() * sizeof(GradientPair),
0);
} else {
CHECK_EQ(info.labels.Size(), 0)
@trivialfis (Member):
I think it would be difficult for users to specify their own worker rank once we put xgboost in an automated pipeline. I looked at your NVFlare example; the rank is not assigned by the user.

@rongou (Contributor, Author):
I think we can check whether the label size is 0 here to determine who needs to calculate the gradient. But in general we need stable ranks for the trained model to be useful for inference. That's more of an NVFlare requirement; I'll ask them.
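
A minimal sketch of that check, assuming (as in the PR) that the label owner is rank 0 so the broadcast root is unchanged; obj_, preds, iter, and out_gpair stand in for the surrounding learner context:

// Only workers that actually hold labels compute the gradients locally.
if (info.labels.Size() > 0) {
  obj_->GetGradient(preds, info, iter, out_gpair);
}
// Every worker joins the broadcast; the label owner (assumed rank 0) is the
// root and the label-less workers receive the gradients from it.
collective::Broadcast(out_gpair->HostPointer(),
                      out_gpair->Size() * sizeof(GradientPair), 0);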

@trivialfis (Member):
Is there any way to automatically agree on who should be the one to own the label? Maybe it's easier to have a fully automated pipeline if everyone has equal access to labels? Just curious from a user's perspective.

@rongou (Contributor, Author):
Sometimes (most of the time?) it's not possible for all the parties to have access to the labels. For example, a hospital may have the diagnosis results for a patient, while labs only have access to blood work, DNA tests, etc.

I think the best way to guarantee the ordering for now is to always launch the workers in the same sequence. Since federated learning is usually done by a single admin, this is a reasonable solution. I'll ask the NVFLARE team to see if they can add some new features to better support this.

@@ -35,12 +36,14 @@ SimpleDMatrix::SimpleDMatrix(AdapterT* adapter, float missing, int32_t /*nthread
info_.num_col_ = adapter->NumColumns();
info_.num_row_ = adapter->NumRows();
// Synchronise worker columns
collective::Allreduce<collective::Operation::kMax>(&info_.num_col_, 1);
info_.data_split_mode = data_split_mode;
ReindexFeatures();
@trivialfis (Member):
Let's mark it as not implemented for now. This may pull the data back to the CPU.

@rongou (Contributor, Author):
Done.
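
For reference, "mark it as not implemented" could be as simple as a guard on the device path; the condition and message below are assumptions, not the PR's exact code:

// Hypothetical guard: reject column-wise splits on the device data path for
// now instead of silently copying the data back to the host to reindex it.
if (data_split_mode == DataSplitMode::kCol) {
  LOG(FATAL) << "Column-wise data split is not yet supported for device data.";
}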

@@ -1051,6 +1063,13 @@ void SparsePage::SortIndices(int32_t n_threads) {
});
}

void SparsePage::Reindex(uint64_t feature_offset, int32_t n_threads) {
auto& h_data = this->data.HostVector();
@trivialfis (Member):
This potentially pulls data from the device to the host.

@rongou (Contributor, Author):
Agreed, but it's not much different from some of the other methods there.
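
Roughly, the method under discussion adds the feature offset to every stored entry's column index on the host; a sketch consistent with the snippet above (the parallel loop is an assumption about the exact implementation):

void SparsePage::Reindex(uint64_t feature_offset, int32_t n_threads) {
  // As noted above, this pulls the data to the host, then shifts each entry's
  // feature index by this worker's global column offset.
  auto& h_data = this->data.HostVector();
  common::ParallelFor(h_data.size(), n_threads,
                      [&](auto i) { h_data[i].index += feature_offset; });
}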
