[WIP] cuDF integration into XGBoost #3997
Conversation
- DMatrix can use GDF for features
- DMatrix can use GDF DataFrame or Series for labels
- in the GPU histogram algorithm, DMatrix data is not copied to host if it is already on the GPU
- these changes go into a separate pull request
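To make the intent concrete, here is a minimal usage sketch. It assumes the Python bindings in this branch let DMatrix accept a cuDF DataFrame for features and a cuDF Series for labels; the exact constructor signature may differ in the finished PR.

```python
# Illustrative only: assumes this branch lets DMatrix accept cuDF objects
# directly; the constructor signature may differ in the finished PR.
import cudf
import xgboost as xgb

# Features as a cuDF DataFrame, labels as a cuDF Series (both resident on the GPU)
X = cudf.DataFrame({"f0": [1.0, 2.0, 3.0, 4.0], "f1": [0.5, 0.1, 0.9, 0.7]})
y = cudf.Series([0, 1, 1, 0])

dtrain = xgb.DMatrix(X, label=y)  # data should stay on the GPU, no host copy
booster = xgb.train({"tree_method": "gpu_hist", "max_depth": 3},
                    dtrain, num_boost_round=10)
```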
"rebasing to master"
In general, the impact of this PR on the rest of the code base is too high. The conversion from cudf should occur in c_api.h/c_api.cu, and the rest of the code base should have no idea what cudf is.
We should consider putting <cudf/types.h> in src/c_api/external/cudf_types.h. This would mean xgboost can be built with support for cudf without any extra dependencies (although you will obviously need cudf to actually use the functionality via Python), and the official release will support this.
include/xgboost/c_api.h (outdated)

@@ -16,6 +16,10 @@
#include <stdint.h>
#endif

#ifdef XGBOOST_USE_CUDF
#include <cudf/types.h>
Just a suggestion: is it possible to use an opaque type in c_api.h? I have been trying to rewrite the CMake scripts and added an installation target; bringing in an external header will cause some trouble.
I'm not sure what you mean. Could you clarify and provide an example?
@mt-jones in general there is no dependency on cudf in Python, due to the compat.py module creating dummy classes if there is no cudf. Obviously you will need cudf installed to actually pass a cudf dataframe. I will tidy up the Python code when I see what the cudf side looks like. @hcho3 can I get a review please? The Python code is in flux, but the native code is ready.
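For readers unfamiliar with the pattern, the optional-dependency trick works roughly as below. This is a sketch of the approach, not the actual contents of compat.py; the class and flag names are illustrative.

```python
# Sketch of the optional-dependency pattern described above; names are
# illustrative, not the actual compat.py contents.
try:
    from cudf import DataFrame as CUDFDataFrame
    from cudf import Series as CUDFSeries
    CUDF_INSTALLED = True
except ImportError:

    class CUDFDataFrame:
        """Dummy stand-in so isinstance() checks still work without cudf."""

    class CUDFSeries:
        """Dummy stand-in so isinstance() checks still work without cudf."""

    CUDF_INSTALLED = False
```

Code elsewhere can then branch on isinstance(data, CUDFDataFrame) or on CUDF_INSTALLED without ever importing cudf at module load time.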
@hcho3 I just noticed we have some strange methods for DMatrix in the Python API, such as …
@RAMitchell Will review this after 0.82 release.
@hcho3 @trivialfis @RAMitchell given that this will not be integrated into the 0.82 release, we're thinking of generalizing this PR, which would involve the following changes:
I believe this is a more robust solution that gives a larger community access to XGB, but I wanted to probe for thoughts on this matter. Thanks.
@mt-jones Sounds interesting. :)
@mt-jones Interesting. I particularly like the idea of integrating Apache Arrow.
@mt-jones interesting idea. I'd like to share some experience with Arrow & XGB first: we tried to integrate with Apache Arrow, but in the middle of the journey we found that it doesn't provide enough benefit to outweigh the concern about code complexity.

Our original concern was that the memory footprint of copying data from the JVM to the native layer was too high in XGBoost-Spark, which hurt scalability, so we chose Arrow as the intermediate layer to avoid the data copying. However, after we fixed the memory leak caused by iterator.duplicate() in Scala, we found the savings would only be the size of 32K records in memory (which is the number of records we copy from the JVM to native for every batch). After this fix, xgb 0.81 can scale to 10+ TB training datasets with XGB-Spark.

More details, with a back-of-envelope calculation:

- With Arrow: we translate training data from Spark's LabeledPoint to Arrow format.
- Without Arrow: we translate Spark's LabeledPoints to XGBoost's LabeledPoints and then copy them to native memory in 32K-record batches.

So the additional memory footprint without Arrow is only one 32K-record batch.

BTW, I think this type of potentially huge change and roadmap work is better brought up as an RFC for discussion, especially regarding what we want to get from the Arrow integration (e.g. Spark's Arrow integration was triggered by wanting to interact with Pandas more efficiently; maybe we can come up with a similar story and do the Arrow integration together with better Pandas perf in another PR).
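To put a rough number on the batch-sized footprint described above: the 32K records per batch comes from the comment, while the feature count and element width below are assumed purely for illustration.

```python
# Rough arithmetic only: 32K records per batch is from the comment above;
# the feature count and element width are assumed values for illustration.
records_per_batch = 32 * 1024
num_features = 100        # assumed
bytes_per_value = 4       # float32, assumed
batch_bytes = records_per_batch * num_features * bytes_per_value
print(f"~{batch_bytes / 2**20:.1f} MiB extra per batch")  # ~12.5 MiB
```

Roughly ten MiB of extra footprint per batch is negligible next to a 10+ TB training set, which matches the conclusion that batching alone removes the scalability concern.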
Another concern that made us discard the Arrow experiment midway is that Arrow essentially puts everything in memory, and in most shared Hadoop clusters the memory each container may use is strictly limited, so Arrow brings more user complaints about "why was my container killed" (actually because they used too much memory).
To keep this aligned with other ongoing work, maybe we should focus on one thing in this PR.
@CodingCat these are excellent considerations, and I think there are a lot of moving parts to this. I would say they can be broken down into three components.
Regarding (1), I think your points are well made. Probing how data migrates through different ecosystems helps us better understand the path of greatest performance and of least code impact. I'm not sure I'm sold on a good solution for handling data with Arrow on the host.

Regarding (2), the proposition is as follows: cuDF's specification is itself in flux as developments are made and features are added. One thing we work diligently to achieve is cohesion with the Arrow specification of the data structure; we are always compliant. So rather than building an … Ideally, there would be a matching feature addition for Arrow host-side. I don't think that belongs in this PR because it would touch a lot more code and delay this PR unnecessarily.

Regarding (3), I think co-designing a specification for a data structure will help us create matching interoperability in cuDF, making it easier to maintain a functional workflow that involves both libraries. It would be awesome if we could perform a conversion internally, pass a pointer to XGB, and have it build a distributed DMatrix for training, etc.

Let me know what your thinking is on the above. Again, if you think it's better to build in direct support for cuDF, we're open to the idea; though it would be predicated on (3), I think.
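As a small host-side illustration of what "compliant with the Arrow specification" buys a consumer like XGBoost (this uses pyarrow purely to show the columnar layout, assumes a reasonably recent pyarrow, and is not part of this PR):

```python
# Host-side illustration with pyarrow (not part of this PR): each primitive
# Arrow column is a validity bitmap plus a contiguous value buffer, the same
# columnar layout that cuDF follows on the device.
import pyarrow as pa

table = pa.table({"f0": [1.0, 2.0, None, 4.0], "label": [0, 1, 1, 0]})
col = table.column("f0").chunk(0)   # one pyarrow.Array
validity, values = col.buffers()    # [validity bitmap, value buffer]
print(col.type, len(col), "nulls:", col.null_count)
```

A consumer that targets this layout can ingest columns from any Arrow-producing library without having to know that library's own types.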
@mt-jones thanks for the reply! IIUC, Arrow is chosen as a standardized specification for the intermediate data layer between "any framework running on hardware other than the host" and XGBoost. In the future, if XGBoost supports other devices beyond the GPU, say an XPU, the frameworks operating on those devices are supposed to transform data to Arrow first so that XGBoost can consume it directly.
@CodingCat thanks for the insight. Let me make sure I understand.
Is this correct? If so, (3) prompts us to go a bit further by defining some of the infrastructural components in XGBoost in conjunction with co-designing the DMatrix specification. Looking forward to your thinking.
@CodingCat @mt-jones So should we review this PR as it is? Can I review it now?
@hcho3 I think we may go in a different direction, so there is no need right now.
@mt-jones yes, that's what I meant, and what I got from your first reply to me. Did I bring up something beyond what you originally scoped?
Not at all. This is entirely consistent with our plans for this PR. Just making sure we're aligned :) I was surprised to hear about the host-side issues with Arrow. But it makes sense.
Enable XGBoost by providing native support for cuDF: https://github.com/rapidsai/cudf
Note: this PR is currently a work in progress. We are publishing it to acquire early feedback, so that we can ensure smooth integration with DMLC/XGBoost.
This PR provides the following:
- cudf.Dataframe at data ingest in Python
- DMatrix construction from gdf_column types
- cudf support