
Initial support for column-wise data split #8468

Merged: 14 commits into dmlc:master on Dec 3, 2022
Conversation

@rongou (Contributor) commented Nov 16, 2022

Support splitting data by column for the in-memory DMatrix. When loading data, first construct the full DMatrix, then slice the columns based on rank and world_size. MetaInfo is kept as is except for the num_nonzero_ field.

Part of #8424
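
For illustration, here is a minimal standalone sketch of the slicing idea described above: each rank keeps a contiguous block of columns and rebases the column indices to zero. The Entry/CsrMatrix types and the SliceColumns function are simplified stand-ins for this sketch, not the PR's actual code.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Simplified stand-ins for illustration; not the PR's actual types.
struct Entry {
  std::size_t index;  // feature (column) index
  float fvalue;       // feature value
};

struct CsrMatrix {
  std::vector<std::size_t> offsets;  // row pointers, size = n_rows + 1
  std::vector<Entry> data;           // non-zero entries
};

// Keep only the contiguous block of columns assigned to `rank` out of
// `world_size` workers, rebasing column indices to start at zero.
CsrMatrix SliceColumns(CsrMatrix const& in, std::size_t n_cols,
                       std::size_t rank, std::size_t world_size) {
  std::size_t const cols_per_rank = (n_cols + world_size - 1) / world_size;
  std::size_t const begin = rank * cols_per_rank;
  std::size_t const end = std::min(begin + cols_per_rank, n_cols);

  CsrMatrix out;
  out.offsets.push_back(0);
  for (std::size_t row = 0; row + 1 < in.offsets.size(); ++row) {
    for (std::size_t i = in.offsets[row]; i < in.offsets[row + 1]; ++i) {
      Entry const& e = in.data[i];
      if (e.index >= begin && e.index < end) {
        out.data.push_back({e.index - begin, e.fvalue});
      }
    }
    // A row may become empty after slicing; its two offsets then coincide.
    out.offsets.push_back(out.data.size());
  }
  return out;
}
```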

@rongou (Contributor, Author) commented Nov 16, 2022

@trivialfis @hcho3

One question I have is what happens with sparse data, when a row may end up empty after the column slice. Do we support having multiple row pointers pointing to the same Entry? I guess we can also say this only supports a dense DMatrix for now.

@hcho3 (Collaborator) commented Nov 17, 2022

@rongou XGBoost currently allows empty rows in a CSR matrix.
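
As a toy illustration (not XGBoost code), an empty row in CSR form is simply two consecutive equal row pointers referencing the same position in the entry array:

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// A 3x4 CSR matrix whose middle row is empty: offsets[1] == offsets[2],
// so both row pointers reference the same entry position.
std::vector<std::size_t> const offsets = {0, 2, 2, 3};
std::vector<std::pair<std::size_t, float>> const entries = {
    {0, 1.0f}, {3, 2.0f},  // row 0: columns 0 and 3
                           // row 1: empty
    {1, 5.0f},             // row 2: column 1
};
```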

@rongou (Contributor, Author) commented Nov 17, 2022

@hcho3 good to know! Removed the empty row comment.

@trivialfis (Member) left a comment

Exciting feature!

A couple of questions:

  • Do we need to assume every participant has complete access to the data? Otherwise, we can't split them.
  • We can use GetBatch<CSCPage> (see the sketch after this list).
  • The MetaInfo::Copy can be implemented based on Extend.
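
As a side note on the CSC suggestion, here is a generic sketch (not XGBoost's actual CSCPage API) of why a CSC layout makes column slicing cheap: a rank's columns form a contiguous range of the column-pointer array, so the slice reduces to a block copy.

```cpp
#include <cstddef>
#include <vector>

// Generic CSC container for illustration only.
struct CscMatrix {
  std::vector<std::size_t> col_ptr;  // size = n_cols + 1
  std::vector<std::size_t> row_idx;  // row index of each entry
  std::vector<float> values;         // entry values
};

// Keep columns [begin, end); their entries are contiguous in CSC.
CscMatrix SliceCsc(CscMatrix const& in, std::size_t begin, std::size_t end) {
  std::size_t const lo = in.col_ptr[begin];
  std::size_t const hi = in.col_ptr[end];
  CscMatrix out;
  for (std::size_t c = begin; c <= end; ++c) {
    out.col_ptr.push_back(in.col_ptr[c] - lo);  // rebase pointers to zero
  }
  out.row_idx.assign(in.row_idx.begin() + lo, in.row_idx.begin() + hi);
  out.values.assign(in.values.begin() + lo, in.values.begin() + hi);
  return out;
}
```

The trade-off, discussed below, is that the rest of the training code consumes CSR, so going through CSC would cost a transpose each way.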

@rongou (Contributor, Author) commented Nov 18, 2022

  • Do we need to assume every participant has complete access to the data? Otherwise, we can't split them.

Here we assume we're doing distributed training, so all data is available to all the workers. For vertical federated learning, the data is already "split", so this is not needed.

  • We can use GetBatch<CSCPage>.

This just does a transpose of the CSR page, right? Then once we slice the columns, we'd have to transpose it back to CSR since most of the code uses that. That seems less efficient.

  • The MetaInfo::Copy can be implemented based on Extend.

Done.

@trivialfis (Member) commented Nov 21, 2022

Apologies for the ambiguity, I meant MetaInfo::Extend.

This just does a transpose of the CSR page, right? Then once we slice the columns, we'd have to transpose it back to CSR since most of the code uses that. That seems less efficient.

Yeah, you are right that we need to get the CSR back anyway. I just thought using CSC might be simpler in code. Also, SparsePage::GetTranspose is implemented in parallel. I will leave the decision to you.

so all data is available to all the workers.

Hmm.. I'm confused by this assumption. Why is it necessary for all data to be available to all workers? How's it different from federated learning?

@rongou (Contributor, Author) commented Nov 21, 2022

Apologies for the ambiguity, I meant MetaInfo::Extend.

Done.

This just does a transpose of the CSR page, right? Then once we slice the columns, we'd have to transpose it back to CSR since most of the code uses that. That seems less efficient.

Yeah, you are right that we need to get the CSR back anyway. I just thought using CSC might be simpler in code. Also, SparsePage::GetTranspose is implemented in parallel. I will leave the decision to you.

Yeah, I'm not sure about the running time, but this is probably more memory efficient. Anyway, I think we just need correctness here; column split probably doesn't make much sense if the data fits in memory. It gets more interesting when we add support for external memory to train on super-wide datasets.

so all data is available to all the workers.

Hmm.. I'm confused by this assumption. Why is it necessary for all data to be available to all workers? How's it different from federated learning?

Right, this is just the assumption for distributed training. We make the same assumption for row split. I guess we could add an option to not split the data if the user has already done some preprocessing to split it beforehand, but I don't believe this is currently supported.

For federated learning, it's the opposite: the data is already split and cannot be combined or exchanged.

@trivialfis (Member) commented

For federated learning, it's the opposite: the data is already split and cannot be combined or exchanged.

I thought in the case of distributed learning it's the same? The distributed framework/user would split the features, and XGBoost just trains on those inputs?

@rongou (Contributor, Author) commented Nov 28, 2022

For federated learning, it's the opposite: the data is already split and cannot be combined or exchanged.

I thought in the case of distributed learning it's the same? The distributed framework/user would split the features, and XGBoost just trains on those inputs?

Not sure about Dask or Spark, but if we're just using Python, loading data in the distributed setting automatically triggers splitting: https://github.com/dmlc/xgboost/blob/master/src/c_api/c_api.cc#L213

@rongou (Contributor, Author) commented Dec 1, 2022

Can this be merged? Thanks!

@trivialfis (Member) commented
Apologies, I triggered the CI. Will merge it once it's finished.

@trivialfis merged commit 78d65a1 into dmlc:master on Dec 3, 2022
@trivialfis (Member) commented

@rongou We need to slice feature_weights/names/types accordingly, as they have length n_features.
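
For context, a hedged sketch of what slicing length-n_features metadata could look like; the field names here are illustrative stand-ins rather than the actual MetaInfo members, and whether this slicing is needed at all is the question under discussion:

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical per-feature metadata; names are stand-ins, not MetaInfo's.
struct FeatureMeta {
  std::vector<float> feature_weights;      // length = n_features
  std::vector<std::string> feature_names;  // length = n_features
  std::vector<std::string> feature_types;  // length = n_features
};

// Slice each per-feature array down to the column range [begin, end),
// skipping arrays the user never set.
FeatureMeta SliceFeatureMeta(FeatureMeta const& in, std::size_t begin,
                             std::size_t end) {
  FeatureMeta out;
  if (!in.feature_weights.empty()) {
    out.feature_weights.assign(in.feature_weights.begin() + begin,
                               in.feature_weights.begin() + end);
  }
  if (!in.feature_names.empty()) {
    out.feature_names.assign(in.feature_names.begin() + begin,
                             in.feature_names.begin() + end);
  }
  if (!in.feature_types.empty()) {
    out.feature_types.assign(in.feature_types.begin() + begin,
                             in.feature_types.begin() + end);
  }
  return out;
}
```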

@rongou (Contributor, Author) commented Dec 12, 2022

@trivialfis Hmm, we are not changing the metadata, so I'm not sure these need to be changed. Slicing a DMatrix only slices the feature values stored for each row so that we may reduce memory usage during boosting; the metadata is kept as is.
