[WIP] cuDF integration into XGBoost #3997

Closed
wants to merge 36 commits

Conversation

mtjrider
Contributor

Extend XGBoost by providing native support for cuDF: https://github.com/rapidsai/cudf

Note: this PR is currently a work in progress. We are publishing it to acquire early feedback, so that we can ensure smooth integration with DMLC/XGBoost.

This PR provides the following:

  • Support for cudf.DataFrame at data ingest in Python
  • Support for DMatrix construction from gdf_column types (see the sketch after this list)
  • Updates to the build infrastructure for compiling with cuDF support.
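
As a rough illustration of what the gdf_column types in the second bullet carry (the field names below approximate the libgdf-era layout and are not taken from this PR), a column is essentially a device data pointer plus a validity bitmask and metadata:

#include <cstddef>
#include <cstdint>

// Sketch only: approximate shape of a cuDF/libgdf column of that era.
struct gdf_column {
  void*    data;   // device pointer to the column's values
  uint8_t* valid;  // device pointer to the null/validity bitmask
  size_t   size;   // number of rows in the column
  int      dtype;  // enum tag describing the element type
};

DMatrix construction would then read a set of such columns directly from device memory instead of going through a host-side NumPy array.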

@RAMitchell
Member

RAMitchell left a comment

In general, the impact of this PR on the rest of the code base is too high. The conversion from cudf should occur in c_api.h/c_api.cu, and the rest of the code base should have no idea what cudf is.

We should consider putting <cudf/types.h> in src/c_api/external/cudf_types.h. This would mean xgboost can be built with cudf support without any extra dependencies (although you will obviously need cudf to actually use the functionality via Python), and the official release will support this.
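
A minimal sketch of what this could look like, assuming the vendored header src/c_api/external/cudf_types.h provides the gdf_column definition; the helper below is hypothetical and only illustrates that cuDF types never escape c_api.cu:

// c_api.cu (sketch, not this PR's actual code)
#ifdef XGBOOST_USE_CUDF
#include <cuda_runtime.h>
#include <vector>
#include "./external/cudf_types.h"  // vendored subset of <cudf/types.h>

// Hypothetical helper: pull one float32 column back to host memory.
// A real implementation would build the DMatrix from device data directly,
// but the point is that nothing outside c_api.cu ever sees a gdf_column.
std::vector<float> CopyColumnToHost(const gdf_column* col) {
  std::vector<float> host(col->size);
  cudaMemcpy(host.data(), col->data, col->size * sizeof(float),
             cudaMemcpyDeviceToHost);
  return host;
}
#endif  // XGBOOST_USE_CUDF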

Outdated review threads (resolved) on:
  • cmake/modules/FindGDF.cmake
  • include/xgboost/data.h
  • src/data/gdf.cuh
  • src/tree/updater_gpu_hist.cu
  • tests/python-gpu/test_gpu_gdf.py (3 threads)

@@ -16,6 +16,10 @@
#include <stdint.h>
#endif

#ifdef XGBOOST_USE_CUDF
#include <cudf/types.h>
@trivialfis
Member

trivialfis commented Jan 3, 2019

Just a suggestion: is it possible to use an opaque type in c_api.h? I have been trying to rewrite the CMake scripts and have added an installation target; bringing in an external header will cause some trouble.
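
One way to read this suggestion, sketched with hypothetical names (not code from this PR): the public header only forward-declares the type, so building or installing against c_api.h never requires <cudf/types.h>:

// c_api.h fragment (sketch); c_api.h already provides XGB_DLL,
// DMatrixHandle, and the standard integer headers.
struct gdf_column;                           // opaque forward declaration
typedef struct gdf_column* GDFColumnHandle;  // hypothetical handle name

// Hypothetical entry point: the public API passes only opaque handles;
// c_api.cu, compiled with XGBOOST_USE_CUDF, includes the real header and
// dereferences them.
XGB_DLL int XGDMatrixCreateFromGDF(GDFColumnHandle* columns,
                                   size_t n_columns,
                                   DMatrixHandle* out);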

@mtjrider
Contributor Author

I'm not sure what you mean. Could you clarify and provide an example?

@RAMitchell
Member

@mt-jones in general there is no dependency on cudf in python due to the compat.py module creating dummy classes if there is no cudf. Obviously you will need cudf installed to actually pass a cudf dataframe.

I will tidy up the python code when I see what the cudf side looks like.

@hcho3 can I get a review please. The python code is in flux but the native code is ready.

@hcho3 self-requested a review February 28, 2019 01:18
@RAMitchell
Member

@hcho3 I just noticed we have some strange methods for DMatrix in the Python API such as set_weight_npy2d. Shouldn't set_float_info automatically detect the input type and call the appropriate c_api function? All of these _npy2d functions seem redundant.

@hcho3
Collaborator

hcho3 commented Mar 1, 2019

@RAMitchell Will review this after 0.82 release.

@mtjrider
Contributor Author

mtjrider commented Mar 6, 2019

@hcho3 @trivialfis @RAMitchell given that this will not be integrated into the 0.82 release, we're thinking of generalizing this PR, which would involve the following changes:

  1. Remove the interchange format altogether
  2. Enable XGBoost ingest of Apache Arrow data by passing opaque pointers to the data structure (device side)

I believe this is a more robust solution that gives a larger community access to XGBoost, but I wanted to probe for thoughts on this matter.

Thanks.

@trivialfis
Member

@mt-jones Sounds interesting. :)

@hcho3
Collaborator

hcho3 commented Mar 6, 2019

@mt-jones Interesting. I particularly like the idea of integrating Apache Arrow.

@CodingCat
Member

CodingCat commented Mar 7, 2019

@mt-jones interesting idea. I'd like to share some prior experience related to Arrow and XGBoost.

We tried to integrate with Apache Arrow, but partway through we found it did not provide enough benefit to outweigh the concern about code complexity.

Our original concern was that the memory footprint of copying data from the JVM to the native layer was too high in XGBoost-Spark, leading to scalability concerns. So we chose Arrow as the intermediate layer to avoid data copying.

However, after we fixed the memory leak caused by iterator.duplicate() in Scala, we found the savings would only be the size of 32K records in memory (the number of records we copy from the JVM to native for each batch). After this fix, XGBoost 0.81 can scale to 10+ TB training datasets with XGBoost-Spark.

More details, with a back-of-envelope calculation:

With Arrow: we translate training data from Spark's LabeledPoint to the Arrow format.

Without Arrow: we translate Spark's LabeledPoint to XGBoost's LabeledPoint and then copy to native memory in 32K-record batches.

So the additional memory footprint in the without-Arrow case is just those 32K records.
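
To put a rough number on that (the feature count is an assumption for illustration, not from the thread): with dense float32 features and a float32 label, the extra footprint is about 32,768 records × (d + 1) values × 4 bytes ≈ 0.13 × (d + 1) MB, i.e. roughly 13 MB for d = 100 features, which is negligible next to a multi-terabyte training set.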

BTW, I think this kind of potentially huge change and roadmap item is better introduced as an RFC for discussion, especially on what we want to get from the Arrow integration (e.g. Spark's Arrow work was triggered by wanting to interact with Pandas more efficiently; maybe we can come up with a similar story and pursue Arrow integration and better Pandas performance in another PR).

@CodingCat
Member

Another concern that made us abandon the Arrow experiment midway is that Arrow essentially puts everything in memory. In most shared Hadoop clusters the memory each container may use is strictly limited, so Arrow brought more user complaints about why their container was killed (actually because they used too much memory).

To align better with other ongoing work, maybe we should focus on one thing in this PR: integrate cuDF, and work with others on a co-design of what the data layer of XGBoost should look like, e.g. as @trivialfis mentioned in the thread of #4193.

@mtjrider
Contributor Author

mtjrider commented Mar 7, 2019

@CodingCat these are excellent considerations, and I think there are a lot of moving parts to this. I would say they can be broken down into three components.

  1. Arrow ingest from the host
  2. Arrow ingest from the device
  3. Defining a data specification

Regarding (1), I think your points are well made. Probing how data migrates through different ecosystems helps us better understand the path of greatest performance, and of least code impact. I’m not sure I’m sold on a good solution for handling data with Arrow on the host.

Regarding (2), the proposition is as follows: cuDF's specification is itself in flux as developments are made and features are added. One thing we work diligently to achieve is cohesion with the Arrow specification of the data structure; we are always compliant. So rather than building an interchange format that tries to encapsulate numerical types as floats, and building a matching structure in cuDF, we thought it made more sense to explicitly match the Arrow specification for the device-side struct. That way, we guarantee some consistency moving forward. Materially, this will place a definition of the Arrow data struct, device side, into XGBoost. This is what we'd use to perform instantiation.
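
Concretely, a device-side, Arrow-style column might look roughly like the sketch below; the field names are simplified from the Arrow columnar format and are not this PR's actual definition:

#include <cstdint>

// Sketch of an Arrow-layout column resident in GPU memory.
struct ArrowDeviceColumn {
  const uint8_t* validity;    // device pointer to the null bitmap (1 bit per value)
  const void*    values;      // device pointer to the contiguous values buffer
  int64_t        length;      // number of values
  int64_t        null_count;  // number of nulls, or -1 if not yet computed
  int32_t        type_id;     // element type tag (e.g. float32, int32)
};

Because cuDF already tracks the Arrow layout, handing XGBoost such a struct avoids defining a separate interchange format.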

Ideally, there would be a matching feature addition for Arrow host-side. I don’t think it belongs in this PR because it would touch a lot more code, and delay this PR unnecessarily.

Regarding (3) I think co-designing a specification for a data structure will help us create matching interoperability in cuDF, making it easier to maintain a functional workflow that involves both libraries.

It would be awesome if we could perform a conversion internally, and pass a pointer to XGB and have it build a distributed DMatrix for training, etc.

Let me know what your thinking is on the above. Again, if you think it's better to build in direct support for cuDF, we're open to the idea, though it would be predicated on (3), I think.

@CodingCat
Member

@mt-jones thanks for the reply!

IIUC, Arrow is chosen as a standardized specification for the intermediate data layer between "any framework running on hardware other than the host" and XGBoost.

In the future, if XGBoost supports other devices beyond the GPU, say an XPU, the frameworks operating on those devices would be expected to transform their data to Arrow first so that XGBoost can consume it directly.

@mtjrider
Contributor Author

mtjrider commented Mar 7, 2019

@CodingCat thanks for the insight. Let me make sure I understand.

  1. Arrow is the standard for device interoperability.
  2. If cuDF outputs an Arrow-compliant data structure, XGBoost will handle it in the future.
  3. As far as I can tell, cuDF interoperability is the first implementable undertaking that would require defining infrastructural components for ingest from GPU DataFrame (or similar) libraries.

Is this correct?

If so, (3) prompts us to go a bit further by defining some of the infrastructural components in XGBoost in conjunction with co-designing the DMatrix specification.

Looking forward to your thinking.

@hcho3
Collaborator

hcho3 commented Mar 7, 2019

@CodingCat @mt-jones So should we review this PR as it is? Can I review it now?

@RAMitchell
Member

@hcho3 I think we may go in a different direction, so there is no need right now.

@CodingCat
Member

@mt-jones yes, that’s what I meant and what I got from your first reply to me?

Did I bring something beyond what you originally scoped?

@mtjrider
Contributor Author

mtjrider commented Mar 8, 2019

> @mt-jones yes, that’s what I meant and what I got from your first reply to me?
>
> Did I bring something beyond what you originally scoped?

Not at all. This is entirely consistent with our plans for this PR. Just making sure we’re aligned :)

I was surprised to hear about the host side issues with Arrow. But it makes sense.
