[WIP] cuDF integration into XGBoost #3997
Conversation
- DMatrix can use GDF for features
- DMatrix can use GDF DataFrame or Series for labels
- in the GPU histogram algorithm, DMatrix data is not copied to host if it is already on the GPU
- these changes go into a separate pull request
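To make the intent concrete, here is a minimal usage sketch. It assumes the Python bindings in this branch let DMatrix accept a cuDF DataFrame for features and a cuDF Series for labels; the exact constructor signature may differ in the finished PR.

```python
# Illustrative only: assumes this branch lets DMatrix accept cuDF objects
# directly; the constructor signature may differ in the finished PR.
import cudf
import xgboost as xgb

# Features as a cuDF DataFrame, labels as a cuDF Series (both resident on the GPU)
X = cudf.DataFrame({"f0": [1.0, 2.0, 3.0, 4.0], "f1": [0.5, 0.1, 0.9, 0.7]})
y = cudf.Series([0, 1, 1, 0])

dtrain = xgb.DMatrix(X, label=y)  # data should stay on the GPU, no host copy
booster = xgb.train({"tree_method": "gpu_hist", "max_depth": 3},
                    dtrain, num_boost_round=10)
```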
"rebasing to master"
In general, the impact of this PR on the rest of the code base is too high. The conversion from cudf should occur in c_api.h/c_api.cu, and the rest of the code base should have no idea what cudf is.
We should consider putting <cudf/types.h> in src/c_api/external/cudf_types.h. This would mean xgboost can be built with support for cudf without any extra dependencies (although you will obviously need cudf to actually use the functionality via Python), and the official release will support this.
include/xgboost/c_api.h (outdated)

@@ -16,6 +16,10 @@
#include <stdint.h>
#endif

#ifdef XGBOOST_USE_CUDF
#include <cudf/types.h>
Just a suggestion: is it possible to use an opaque type in c_api.h? I have been trying to rewrite the CMake scripts and added an installation target; bringing in an external header will cause some trouble.
I'm not sure what you mean. Could you clarify and provide an example?
@mt-jones in general there is no dependency on cudf in Python, due to the compat.py module creating dummy classes if there is no cudf. Obviously you will need cudf installed to actually pass a cudf dataframe. I will tidy up the Python code when I see what the cudf side looks like. @hcho3 can I get a review please? The Python code is in flux, but the native code is ready.
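For readers unfamiliar with the pattern, the optional-dependency trick works roughly as below. This is a sketch of the approach, not the actual contents of compat.py; the class and flag names are illustrative.

```python
# Sketch of the optional-dependency pattern described above; names are
# illustrative, not the actual compat.py contents.
try:
    from cudf import DataFrame as CUDFDataFrame
    from cudf import Series as CUDFSeries
    CUDF_INSTALLED = True
except ImportError:

    class CUDFDataFrame:
        """Dummy stand-in so isinstance() checks still work without cudf."""

    class CUDFSeries:
        """Dummy stand-in so isinstance() checks still work without cudf."""

    CUDF_INSTALLED = False
```

Code elsewhere can then branch on isinstance(data, CUDFDataFrame) or on CUDF_INSTALLED without ever importing cudf at module load time.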
@hcho3 I just noticed we have some strange methods for DMatrix in the Python API, such as …
@RAMitchell Will review this after 0.82 release.
@hcho3 @trivialfis @RAMitchell given that this will not be integrated into the 0.82 release, we're thinking of generalizing this PR, which would involve the following changes:
I believe this is a more robust solution that gives a larger community access to XGB, but I wanted to probe for thoughts on this matter. Thanks.
@mt-jones Sounds interesting. :)
@mt-jones Interesting. I particularly like the idea of integrating Apache Arrow.
@mt-jones interesting idea. I'd like to share some experience with Arrow & XGB first: we tried to integrate with Apache Arrow, but in the middle of the journey we found that it doesn't provide enough benefit to outweigh the concern about code complexity.

Our original concern was that the memory footprint of copying data from the JVM to the native layer was too high in XGBoost-Spark, which hurt scalability, so we chose Arrow as the intermediate layer to avoid the data copying. However, after we fixed the memory leak caused by iterator.duplicate() in Scala, we found the savings would only be the size of 32K records in memory (which is the number of records we copy from the JVM to native for every batch). After this fix, xgb 0.81 can scale to 10+ TB training datasets with XGB-Spark.

More details, with a back-of-envelope calculation:

- With Arrow: we translate training data from Spark's LabeledPoint to Arrow format.
- Without Arrow: we translate Spark's LabeledPoints to XGBoost's LabeledPoints and then copy them to native memory in 32K-record batches.

So the additional memory footprint without Arrow is only one 32K-record batch.

BTW, I think this type of potentially huge change and roadmap work is better brought up as an RFC for discussion, especially regarding what we want to get from the Arrow integration (e.g. Spark's Arrow integration was triggered by wanting to interact with Pandas more efficiently; maybe we can come up with a similar story and do the Arrow integration together with better Pandas perf in another PR).
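To put a rough number on the batch-sized footprint described above: the 32K records per batch comes from the comment, while the feature count and element width below are assumed purely for illustration.

```python
# Rough arithmetic only: 32K records per batch is from the comment above;
# the feature count and element width are assumed values for illustration.
records_per_batch = 32 * 1024
num_features = 100        # assumed
bytes_per_value = 4       # float32, assumed
batch_bytes = records_per_batch * num_features * bytes_per_value
print(f"~{batch_bytes / 2**20:.1f} MiB extra per batch")  # ~12.5 MiB
```

Roughly ten MiB of extra footprint per batch is negligible next to a 10+ TB training set, which matches the conclusion that batching alone removes the scalability concern.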
Another concern that made us discard the Arrow experiment midway is that Arrow essentially puts everything in memory, and in most shared Hadoop clusters the memory each container may use is strictly limited, so Arrow brings more user complaints about "why was my container killed" (actually because they used too much memory).
To keep this aligned with other ongoing work, maybe we should focus on one thing in this PR.
@CodingCat these are excellent considerations, and I think there are a lot of moving parts to this. I would say they can be broken down into three components.
Regarding (1), I think your points are well made. Probing how data migrates through different ecosystems helps us better understand the path of greatest performance and of least code impact. I'm not sure I'm sold on a good solution for handling data with Arrow on the host.

Regarding (2), the proposition is as follows: cuDF's specification is itself in flux as developments are made and features are added. One thing we work diligently to achieve is cohesion with the Arrow specification of the data structure; we are always compliant. So rather than building an … Ideally, there would be a matching feature addition for Arrow host-side. I don't think that belongs in this PR because it would touch a lot more code and delay this PR unnecessarily.

Regarding (3), I think co-designing a specification for a data structure will help us create matching interoperability in cuDF, making it easier to maintain a functional workflow that involves both libraries. It would be awesome if we could perform a conversion internally, pass a pointer to XGB, and have it build a distributed DMatrix for training, etc.

Let me know what your thinking is on the above. Again, if you think it's better to build in direct support for cuDF, we're open to the idea; though it would be predicated on (3), I think.
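As a small host-side illustration of what "compliant with the Arrow specification" buys a consumer like XGBoost (this uses pyarrow purely to show the columnar layout, assumes a reasonably recent pyarrow, and is not part of this PR):

```python
# Host-side illustration with pyarrow (not part of this PR): each primitive
# Arrow column is a validity bitmap plus a contiguous value buffer, the same
# columnar layout that cuDF follows on the device.
import pyarrow as pa

table = pa.table({"f0": [1.0, 2.0, None, 4.0], "label": [0, 1, 1, 0]})
col = table.column("f0").chunk(0)   # one pyarrow.Array
validity, values = col.buffers()    # [validity bitmap, value buffer]
print(col.type, len(col), "nulls:", col.null_count)
```

A consumer that targets this layout can ingest columns from any Arrow-producing library without having to know that library's own types.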
@mt-jones thanks for the reply! IIUC, Arrow is chosen as a standardized specification for the intermediate data layer between "any framework running on hardware other than the host" and XGBoost. In the future, if XGBoost supports other devices beyond the GPU, say an XPU, the frameworks operating on those devices are supposed to transform data to Arrow first so that XGBoost can consume it directly.
@CodingCat thanks for the insight. Let me make sure I understand.
Is this correct? If so, (3) prompts us to go a bit further by defining some of the infrastructural components in XGBoost in conjunction with co-designing the DMatrix specification. Looking forward to your thinking.
@CodingCat @mt-jones So should we review this PR as it is? Can I review it now?
@hcho3 I think we may go in a different direction, so there is no need right now.
@mt-jones yes, that's what I meant, and what I got from your first reply to me. Did I bring up something beyond what you originally scoped?
Not at all. This is entirely consistent with our plans for this PR. Just making sure we're aligned :) I was surprised to hear about the host-side issues with Arrow. But it makes sense.
Enable XGBoost by providing native support for cuDF: https://github.com/rapidsai/cudf
Note: this PR is currently a work in progress. We are publishing it to acquire early feedback, so that we can ensure smooth integration with DMLC/XGBoost.
This PR provides the following:
- cudf.Dataframe at data ingest in Python
- DMatrix construction from gdf_column types
- cudf support