Support vertical federated learning #8932
Conversation
@trivialfis @hcho3 this is working end to end, please take a look. I'll see if I can add more tests. Thanks!
}

dmlc::TemporaryDirectory tmpdir;
std::string path = tmpdir.path + "/small" + std::to_string(rank) + ".csv";
Could you please extract this into helpers.h if it's not test-case-specific?
Done.
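For illustration, a minimal sketch of what such a shared helper could look like once moved into helpers.h; the name CreateRankedTestCSV and the file contents are hypothetical placeholders, not the helper the PR actually adds:

```cpp
// tests/cpp/helpers.h (sketch)
#include <dmlc/filesystem.h>  // dmlc::TemporaryDirectory

#include <fstream>
#include <string>

// Writes a small per-rank CSV into `tmpdir` and returns its path.
// The rows written here are placeholders for illustration only.
inline std::string CreateRankedTestCSV(dmlc::TemporaryDirectory const& tmpdir, int rank) {
  std::string path = tmpdir.path + "/small" + std::to_string(rank) + ".csv";
  std::ofstream fout(path);
  fout << "1,2,3\n4,5,6\n";
  return path;
}
```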
collective::Allreduce<collective::Operation::kSum>(
    reinterpret_cast<double*>(h_sum.Values().data()), h_sum.Size() * 2);

// In vertical federated learning, only worker 0 needs to call this, no need to do an allreduce.
Maybe we can simply run it for all workers to remove a condition? We have a validate method in the learner model param, which is a good place for checking whether the label is correctly distributed across workers for federated learning. If labels are the same for all workers, the base_score should also be the same. Also, we don't need an additional info parameter.
The issue is that in learner.cc we only call InitEstimation for worker 0, which in turn calls this method. If we don't skip this allreduce, we'd get a mismatch in non-0 workers.
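Roughly, the guard being discussed looks like the sketch below; IsVerticalFederated() is a placeholder for whatever condition the PR actually checks, so treat the structure as illustrative only:

```cpp
// Sketch: in vertical federated learning only worker 0 holds the labels and
// runs this code path, so summing across workers would desynchronise workers.
// IsVerticalFederated() is a hypothetical predicate, not the real check.
if (!IsVerticalFederated()) {
  collective::Allreduce<collective::Operation::kSum>(
      reinterpret_cast<double*>(h_sum.Values().data()), h_sum.Size() * 2);
}
```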
@@ -857,6 +857,25 @@ class LearnerConfiguration : public Learner {
      mparam_.num_target = n_targets;
    }
  }

  void InitEstimation(MetaInfo const& info, linalg::Tensor<float, 1>* base_score) {
What happens if we just calculate the gradient using individual workers? Is the gradient still the same? If so, we can just let them calculate.
Since we don't have labels in non-0 workers, they won't be able to calculate the gradient.
  collective::Broadcast(out_gpair->HostPointer(), out_gpair->Size() * sizeof(GradientPair),
                        0);
} else {
  CHECK_EQ(info.labels.Size(), 0)
I think it would be difficult for users to specify their own worker rank once we put xgboost in an automated pipeline. Looking at your nvflare example, the rank is not assigned by the user.
I think we can check if the label size is 0 here to determine who needs to calculate the gradient. But in general we need stable ranks for the trained model to be useful for inference. That's more of an nvflare requirement. I'll ask them.
Is there any way to automatically agree on who should be the one to own the label? Maybe it's easier to have a fully automated pipeline if everyone has equal access to labels? Just curious from a user's perspective.
Sometimes (most times?) it's not possible for all the parties to have access to the labels. For example, a hospital may have the diagnosis results for a patient, while labs only have access to blood work, DNA tests, etc.
I think the best way to guarantee the ordering for now is to always launch the workers in the same sequence. Since federated learning is usually run by a single admin, this is a reasonable solution. I'll ask the NVFLARE team to see if they can add some new features to better support this.
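To make the label-size idea from earlier in this thread concrete, here is a sketch of the pattern; obj_->GetGradient and the variable names are borrowed from the surrounding code, but the overall wiring is an assumption, not the merged implementation:

```cpp
// Sketch: only the worker that actually holds labels computes the gradient,
// then everyone receives worker 0's result so tree construction stays in sync.
if (info.labels.Size() != 0) {
  obj_->GetGradient(predt, info, iter, out_gpair);  // label owner (worker 0)
} else {
  out_gpair->Resize(info.num_row_);  // non-owners only allocate the buffer
}
collective::Broadcast(out_gpair->HostPointer(),
                      out_gpair->Size() * sizeof(GradientPair), /*root=*/0);
```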
src/data/simple_dmatrix.cu (outdated diff)
@@ -35,12 +36,14 @@ SimpleDMatrix::SimpleDMatrix(AdapterT* adapter, float missing, int32_t /*nthread
  info_.num_col_ = adapter->NumColumns();
  info_.num_row_ = adapter->NumRows();
  // Synchronise worker columns
  collective::Allreduce<collective::Operation::kMax>(&info_.num_col_, 1);
  info_.data_split_mode = data_split_mode;
  ReindexFeatures();
Let's mark it not implemented for now. This may pull the data back to the CPU.
Done.
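As a rough illustration of marking it not implemented in the GPU path, something like the sketch below could sit in the simple_dmatrix.cu constructor; the exact message and the use of the DataSplitMode::kCol enumerator here are assumptions, not the merged code:

```cpp
// Sketch: refuse column-wise (vertical) data splits in the device constructor
// rather than calling ReindexFeatures(), which would pull the data back to CPU.
CHECK(data_split_mode != DataSplitMode::kCol)
    << "Column-wise data split is not yet supported on the GPU.";
```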
@@ -1051,6 +1063,13 @@ void SparsePage::SortIndices(int32_t n_threads) {
  });
}

void SparsePage::Reindex(uint64_t feature_offset, int32_t n_threads) {
  auto& h_data = this->data.HostVector();
This potentially pulls data from the device to the host.
Agreed, but it's not much different from some of the other methods there.
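For reference, a sketch of what a host-side Reindex with this signature could look like, using XGBoost's common::ParallelFor; this is an illustrative reconstruction, not necessarily the merged implementation:

```cpp
// Sketch: shift every entry's feature index by this worker's column offset so
// that workers in a column-wise (vertical) split agree on global feature ids.
// HostVector() is the catch discussed above: if the data currently lives on
// the device, this call copies it back to the host.
void SparsePage::Reindex(uint64_t feature_offset, int32_t n_threads) {
  auto& h_data = this->data.HostVector();
  common::ParallelFor(h_data.size(), n_threads, [&](std::size_t i) {
    h_data[i].index += feature_offset;
  });
}
```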
There are two main sets of changes: