
Initial support for column-wise data split #8468

Merged: 14 commits into dmlc:master on Dec 3, 2022
Conversation

@rongou (Contributor) commented Nov 16, 2022

Support splitting data by column for the in-memory DMatrix. When loading data, first construct the full DMatrix, then slice the columns based on rank and world_size. MetaInfo is kept as is except for the num_nonzero_ field.

Part of #8424
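
For illustration, here is a minimal standalone sketch of the slicing idea described above: each rank keeps a contiguous block of columns and rebases the column indices to zero. The Entry/CsrMatrix types and the SliceColumns function are simplified stand-ins for this sketch, not the PR's actual code.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Simplified stand-ins for illustration; not the PR's actual types.
struct Entry {
  std::size_t index;  // feature (column) index
  float fvalue;       // feature value
};

struct CsrMatrix {
  std::vector<std::size_t> offsets;  // row pointers, size = n_rows + 1
  std::vector<Entry> data;           // non-zero entries
};

// Keep only the contiguous block of columns assigned to `rank` out of
// `world_size` workers, rebasing column indices to start at zero.
CsrMatrix SliceColumns(CsrMatrix const& in, std::size_t n_cols,
                       std::size_t rank, std::size_t world_size) {
  std::size_t const cols_per_rank = (n_cols + world_size - 1) / world_size;
  std::size_t const begin = rank * cols_per_rank;
  std::size_t const end = std::min(begin + cols_per_rank, n_cols);

  CsrMatrix out;
  out.offsets.push_back(0);
  for (std::size_t row = 0; row + 1 < in.offsets.size(); ++row) {
    for (std::size_t i = in.offsets[row]; i < in.offsets[row + 1]; ++i) {
      Entry const& e = in.data[i];
      if (e.index >= begin && e.index < end) {
        out.data.push_back({e.index - begin, e.fvalue});
      }
    }
    // A row may become empty after slicing; its two offsets then coincide.
    out.offsets.push_back(out.data.size());
  }
  return out;
}
```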

@rongou (Contributor, Author) commented Nov 16, 2022

@trivialfis @hcho3

One question I have is what happens with sparse data, when a row may end up empty after the column slice. Do we support having multiple row pointers pointing to the same Entry? I guess we can also say this only supports a dense DMatrix for now.

@hcho3 (Collaborator) commented Nov 17, 2022

@rongou XGBoost currently allows empty rows in a CSR matrix.
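
As a toy illustration (not XGBoost code), an empty row in CSR form is simply two consecutive equal row pointers referencing the same position in the entry array:

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// A 3x4 CSR matrix whose middle row is empty: offsets[1] == offsets[2],
// so both row pointers reference the same entry position.
std::vector<std::size_t> const offsets = {0, 2, 2, 3};
std::vector<std::pair<std::size_t, float>> const entries = {
    {0, 1.0f}, {3, 2.0f},  // row 0: columns 0 and 3
                           // row 1: empty
    {1, 5.0f},             // row 2: column 1
};
```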

@rongou (Contributor, Author) commented Nov 17, 2022

@hcho3 good to know! Removed the empty row comment.

@trivialfis (Member) left a comment

Exciting feature!

A couple of questions:

  • Do we need to assume every participant has complete access to the data? Otherwise, we can't split them.
  • We can use GetBatch<CSCPage> (see the sketch after this list).
  • The MetaInfo::Copy can be implemented based on Extend.
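
As a side note on the CSC suggestion, here is a generic sketch (not XGBoost's actual CSCPage API) of why a CSC layout makes column slicing cheap: a rank's columns form a contiguous range of the column-pointer array, so the slice reduces to a block copy.

```cpp
#include <cstddef>
#include <vector>

// Generic CSC container for illustration only.
struct CscMatrix {
  std::vector<std::size_t> col_ptr;  // size = n_cols + 1
  std::vector<std::size_t> row_idx;  // row index of each entry
  std::vector<float> values;         // entry values
};

// Keep columns [begin, end); their entries are contiguous in CSC.
CscMatrix SliceCsc(CscMatrix const& in, std::size_t begin, std::size_t end) {
  std::size_t const lo = in.col_ptr[begin];
  std::size_t const hi = in.col_ptr[end];
  CscMatrix out;
  for (std::size_t c = begin; c <= end; ++c) {
    out.col_ptr.push_back(in.col_ptr[c] - lo);  // rebase pointers to zero
  }
  out.row_idx.assign(in.row_idx.begin() + lo, in.row_idx.begin() + hi);
  out.values.assign(in.values.begin() + lo, in.values.begin() + hi);
  return out;
}
```

The trade-off, discussed below, is that the rest of the training code consumes CSR, so going through CSC would cost a transpose each way.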

@rongou (Contributor, Author) commented Nov 18, 2022

  • Do we need to assume every participant has complete access to the data? Otherwise, we can't split them.

Here we assume we're doing distributed training, so all data is available to all the workers. For vertical federated learning, the data is already "split", so this is not needed.

  • We can use GetBatch<CSCPage>.

This just does a transpose of the CSR page, right? Then once we slice the columns, we'd have to transpose it back to CSR since most of the code uses that. That seems less efficient.

  • The MetaInfo::Copy can be implemented based on Extend.

Done.

@trivialfis (Member) commented Nov 21, 2022

Apologies for the ambiguity, I meant MetaInfo::Extend.

This just does a transpose of the CSR page, right? Then once we slice the columns, we'd have to transpose it back to CSR since most of the code uses that. That seems less efficient.

Yeah, you are right that we need to get the CSR back anyway. I just thought using CSC might be simpler in code. Also, SparsePage::GetTranspose is implemented in parallel. I will leave the decision to you.

so all data is available to all the workers.

Hmm.. I'm confused by this assumption. Why is it necessary for all data to be available to all workers? How's it different from federated learning?

@rongou (Contributor, Author) commented Nov 21, 2022

Apologies for the ambiguity, I meant MetaInfo::Extend.

Done.

This just does a transpose of the CSR page, right? Then once we slice the columns, we'd have to transpose it back to CSR since most of the code uses that. That seems less efficient.

Yeah, you are right that we need to get the CSR back anyway. I just thought using CSC might be simpler in code. Also, SparsePage::GetTranspose is implemented in parallel. I will leave the decision to you.

Yeah, I'm not sure about the running time, but this is probably more memory efficient. Anyway, I think we just need correctness here; column split probably doesn't make much sense if the data fits in memory. It gets more interesting when we add support for external memory to train on super-wide datasets.

so all data is available to all the workers.

Hmm.. I'm confused by this assumption. Why is it necessary for all data to be available to all workers? How's it different from federated learning?

Right, this is just the assumption for distributed training. We make the same assumption for row split. I guess we could add an option to not split the data if the user has already done some preprocessing to split it beforehand, but I don't believe this is currently supported.

For federated learning, it's the opposite: the data is already split and cannot be combined or exchanged.

@trivialfis (Member) commented

For federated learning, it's the opposite: the data is already split and cannot be combined or exchanged.

I thought in the case of distributed learning it's the same? The distributed framework/user would split the features, and XGBoost just trains on those inputs?

@rongou (Contributor, Author) commented Nov 28, 2022

For federated learning, it's the opposite: the data is already split and cannot be combined or exchanged.

I thought in the case of distributed learning it's the same? The distributed framework/user would split the features, and XGBoost just trains on those inputs?

Not sure about Dask or Spark, but if we're just using Python, loading data in the distributed setting automatically triggers splitting: https://github.com/dmlc/xgboost/blob/master/src/c_api/c_api.cc#L213

@rongou (Contributor, Author) commented Dec 1, 2022

Can this be merged? Thanks!

@trivialfis (Member) commented
Apologies, I triggered the CI. Will merge it once it's finished.

@trivialfis merged commit 78d65a1 into dmlc:master on Dec 3, 2022
@trivialfis (Member) commented

@rongou We need to slice feature_weights/names/types accordingly, as they have length n_features.
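
For context, a hedged sketch of what slicing length-n_features metadata could look like; the field names here are illustrative stand-ins rather than the actual MetaInfo members, and whether this slicing is needed at all is the question under discussion:

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical per-feature metadata; names are stand-ins, not MetaInfo's.
struct FeatureMeta {
  std::vector<float> feature_weights;      // length = n_features
  std::vector<std::string> feature_names;  // length = n_features
  std::vector<std::string> feature_types;  // length = n_features
};

// Slice each per-feature array down to the column range [begin, end),
// skipping arrays the user never set.
FeatureMeta SliceFeatureMeta(FeatureMeta const& in, std::size_t begin,
                             std::size_t end) {
  FeatureMeta out;
  if (!in.feature_weights.empty()) {
    out.feature_weights.assign(in.feature_weights.begin() + begin,
                               in.feature_weights.begin() + end);
  }
  if (!in.feature_names.empty()) {
    out.feature_names.assign(in.feature_names.begin() + begin,
                             in.feature_names.begin() + end);
  }
  if (!in.feature_types.empty()) {
    out.feature_types.assign(in.feature_types.begin() + begin,
                             in.feature_types.begin() + end);
  }
  return out;
}
```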

@rongou (Contributor, Author) commented Dec 12, 2022

@trivialfis Hmm, we are not changing the metadata, so I'm not sure these need to be changed. Slicing a DMatrix only slices the feature values stored for each row so that we may reduce memory usage during boosting; the metadata is kept as is.
