Initial support for column-wise data split #8468
One question I have is what happens with sparse data, when a row may come up empty for the column slice. Do we support having multiple row pointers pointing to the same value?
@rongou XGBoost currently allows empty rows in a CSR matrix.
@hcho3 Good to know! Removed the empty row comment.
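For context on the empty-row question: in the CSR layout, an empty row shows up as two consecutive entries of the row-pointer array holding the same value. A minimal sketch with scipy (illustrative only, not XGBoost's internal representation):

```python
import numpy as np
from scipy.sparse import csr_matrix

# 3x4 matrix whose middle row is entirely empty.
data = np.array([1.0, 2.0, 3.0])
indices = np.array([0, 2, 1])
indptr = np.array([0, 2, 2, 3])  # indptr[1] == indptr[2]: row 1 holds no values

m = csr_matrix((data, indices, indptr), shape=(3, 4))
print(m.getnnz(axis=1))  # per-row nonzero counts; the middle row reports 0
```

This is exactly the situation a column slice can create when all of a row's nonzeros fall outside the selected columns.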
Exciting feature!
A couple questions:
- Do we need to assume every participant has complete access to the data? Otherwise, we can't split it.
- We can use `GetBatch<CSCPage>`.
- The `MetaInfo::Copy` can be implemented based on `Extend`.
Here we assume we're doing distributed training, so all data is available to all the workers. For vertical federated learning, data is already "split", so this is not needed.
This just does a transpose of the CSR page, right? Then once we slice the columns, we'd have to transpose it back to CSR since most of the code uses that. That seems less efficient.
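To make the round trip under discussion concrete, here is a scipy sketch of the CSC route: transpose the index structure to CSC, slice columns there, then convert back to CSR for the rest of training. This is only an illustration of the cost being debated, not the PR's implementation:

```python
import numpy as np
from scipy.sparse import csr_matrix

m = csr_matrix(np.array([[1.0, 0.0, 2.0],
                         [0.0, 3.0, 0.0]]))

# CSR -> CSC is effectively a transpose of the index structure (an O(nnz) pass).
csc = m.tocsc()

# Column slicing is cheap in CSC, but the result must then be
# converted back to CSR, which most of the training code expects.
sliced = csc[:, :2].tocsr()
```

Slicing columns directly on the CSR structure avoids the second conversion, which is the efficiency point raised above.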
Done.
apologies for the ambiguity, I meant
Yeah, you are right that we need to get the CSR back anyway. I just thought using CSC might be simpler in code. Also, the
Hmm... I'm confused by this assumption. Why is it necessary for all data to be available to all workers? How's it different from federated learning?
Done.
Yeah not sure about the running time, but this is probably more memory efficient. Anyway, I think we just need correctness here, it probably doesn't make too much sense for column split if the data fits in memory. It gets more interesting when we add support for external memory to train on super wide datasets.
Right, this is just the assumption for distributed training. We make the same assumption for row split. I guess we could add an option to not split the data if the user has already done some preprocessing to split it beforehand, but I don't believe this is currently supported. For federated learning, it's the opposite: the data is already split and cannot be combined or exchanged.
I thought in the case of distributed learning it's the same? The distributed framework/user would split the features and XGBoost just trains on that input?
Not sure about Dask or Spark, but if we're just using Python, loading data in the distributed setting automatically triggers splitting: https://github.com/dmlc/xgboost/blob/master/src/c_api/c_api.cc#L213
Can this be merged? Thanks!
Apologies, triggered the CI. Will merge it once it's finished.
@rongou We need to slice the |
@trivialfis hmm we are not changing the metadata, so I'm not sure these need to be changed. Slicing a dmatrix only slices the feature values stored for each row so that we may reduce memory usage during boosting, but the metadata is kept as is. |
Support splitting data by column for in-memory `DMatrix`. When loading data, first construct the full `DMatrix`, and then slice the columns based on `rank` and `world_size`. `MetaInfo` is kept as is except for the `num_nonzero_` field.

Part of #8424