Add support for Apache Arrow #5667
Conversation
We have built-in support for cudf, which uses Arrow. Could you use a similar code path? I'm happy to answer questions.
@trivialfis The support for cudf only uses the same physical memory layout defined by Arrow. It doesn't directly work with Arrow tables or other Arrow constructs. Did I miss anything? If an in-memory Arrow table is already available, then its content can just be copied into a DMatrix. It shouldn't be necessary to use a sequence of JSON objects.
@zhangzhang10 We first export a handle to the columnar memory block (in cudf's case, we chose it to be a list of __cuda_array_interface__ objects). Also, once an adapter is defined, there's no need for a new DMatrix class.
python-package/xgboost/core.py
def _maybe_chunked_array(data, feature_names, feature_types):
    if PYARROW_INSTALLED and isinstance(data, ArrowChunkedArray):
        if not PANDAS_INSTALLED:
            raise Exception(('pandas must be available to use this method.'
I suppose better to say: 'Pandas must be available to use Apache Arrow as an input format'
Thanks. Will incorporate your input.
std::shared_ptr<arrow::Table> table;
CHECK(arrow::py::unwrap_table(data, &table).ok());
data::ArrowAdapter adapter(table, nrow, ncol);
*out = new std::shared_ptr<DMatrix>(
Where is the delete for this new?
It is in XGDMatrixFree(), which is supposed to be called by client code.
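For context on this exchange, here is a minimal, self-contained sketch of the handle ownership convention being described (toy names, not the actual XGBoost C API code): creation allocates a shared_ptr on the heap and hands it out as an opaque handle, and the free function performs the matching delete.

#include <memory>

// Toy stand-ins, for illustration only.
struct DMatrix {};
typedef void* DMatrixHandle;

// Creation: allocate a shared_ptr<DMatrix> on the heap and return it as an
// opaque handle (this is the `new` being asked about).
int ExampleCreate(DMatrixHandle* out) {
  *out = new std::shared_ptr<DMatrix>(std::make_shared<DMatrix>());
  return 0;
}

// The matching delete lives in the free function, which client code is
// expected to call once it is done with the handle.
int ExampleFree(DMatrixHandle handle) {
  delete static_cast<std::shared_ptr<DMatrix>*>(handle);
  return 0;
}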
Is it aligned with other DMatrixes?
// Set number of threads but keep old value so we can reset it after
const int nthreadmax = omp_get_max_threads();
if (nthread <= 0) nthread = nthreadmax;
int nthread_original = omp_get_max_threads();
omp_set_num_threads(nthread);
Why do you need a different number of threads for the whole library versus this piece of code?
Also, we can set the number of threads for just this region with the following option:
#pragma omp parallel for schedule(static) num_threads(nthread)
...
It looks simpler.
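To make the comparison concrete, a small sketch of the two approaches (hypothetical helper functions, not code from this PR): the PR's save/set/restore of the global OpenMP setting versus the reviewer's suggestion of scoping the thread count to a single parallel region with a num_threads clause.

#include <omp.h>

void process_with_set_and_restore(int nthread, int n) {
  // Approach in the PR: change the global setting, then restore it afterwards.
  const int nthread_original = omp_get_max_threads();
  if (nthread <= 0) nthread = nthread_original;
  omp_set_num_threads(nthread);
  #pragma omp parallel for schedule(static)
  for (int i = 0; i < n; ++i) {
    // per-row work
  }
  omp_set_num_threads(nthread_original);
}

void process_with_clause(int nthread, int n) {
  // Reviewer's suggestion: scope the thread count to this region only,
  // leaving the library-wide OpenMP setting untouched.
  if (nthread <= 0) nthread = omp_get_max_threads();
  #pragma omp parallel for schedule(static) num_threads(nthread)
  for (int i = 0; i < n; ++i) {
    // per-row work
  }
}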
Here I'm just following the convention used by the SparsePage::Push template function, defined above this particular specialization.
#if defined(XGBOOST_BUILD_ARROW_SUPPORT)
template SimpleDMatrix::SimpleDMatrix(ArrowAdapter* adapter, float missing,
                                      int nthread);
#endif
Overall, it would be good to add tests for the Apache Arrow support on the C++/Python side.
Will add tests.
Codecov Report
@@ Coverage Diff @@
## master #5667 +/- ##
=======================================
Coverage 78.49% 78.50%
=======================================
Files 12 12
Lines 3013 3028 +15
=======================================
+ Hits 2365 2377 +12
- Misses 648 651 +3
Continue to review full report at Codecov.
@hcho3, @CodingCat, @trivialfis, @SmirnovEgorRu, can you please review the recent changes to PR 5667? This PR introduces support for Apache Arrow; specifically, it allows creating a DMatrix directly from Arrow tables. To move forward, I am about to introduce dependencies on the Arrow libs and PyArrow. For example, I am thinking about adding a new Dockerfile in ci_build that installs these dependencies. Is this the right approach? I'd appreciate your feedback. Thanks!
@zhangzhang10 If I build XGBoost with Arrow support, can I use that built binary on another machine without Arrow installed? If so, we can enable Arrow support in the default builds and distribute it via pip. As a maintainer of CI, I'd prefer not to add more special cases if we can avoid it. Currently, we already build CUDA support into the XGBoost binary, and the binary can be used on machines without NVIDIA GPUs. (Requesting a GPU algorithm would produce a run-time error on these machines.)
From #5754 (comment):
Integration with Arrow satisfies the first criterion: many users stand to benefit from fast ingestion of tabular data. It would be even better if XGBoost could support it out of the box. On the other hand, the Arrow support does introduce a lot of new C++ code, and I'm concerned about the maintenance burden. @zhangzhang10 Would you be willing to take care of this code in the medium and long term, or find people who would be? Mainly, I want to avoid a situation in the future where the Arrow ingestion feature breaks and no one is around to diagnose or fix the bug.
@hcho3, since it's part of Intel's efforts to improve XGBoost, I can be responsible for this.
@SmirnovEgorRu Ah, that's good news. Thanks for letting us know.
@hcho3 Thanks for the comments. I will take a look at how CUDA support is integrated into the XGBoost binary, and then see if we can do the same for Arrow support.
After looking into Arrow a little bit more, I think the array interface can be obtained by:

import pyarrow as pa
import pandas as pd
df = pd.DataFrame({"a": [1, 2, 3]})
table: pa.lib.Table = pa.Table.from_pandas(df)
interface = table.column('a').chunks[0].__array__().__array_interface__
print(interface)
Having this, we can parse the memory buffer directly. Just not sure whether calling __array__() creates a copy.
A column of an Arrow table is of type ChunkedArray. We could combine the chunks of a column into a single array, but this will likely lead to data copying because multiple buffers need to be combined. Or, we could iterate over the chunks and call __array__() on each chunk.
To make it more complicated, columns in a table can have different numbers of chunks. So we would end up parsing many buffers and keeping track of their relative positions ourselves, whereas the Arrow lib already provides this facility at no cost (zero copy).
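To illustrate the bookkeeping described above, here is a rough sketch against the Arrow C++ API (illustrative only, not code from this PR): each column of a table is a ChunkedArray, and because chunk boundaries can differ per column, a consumer walking the table has to track a separate running row offset for every column.

#include <arrow/api.h>
#include <cstdint>
#include <iostream>
#include <memory>

// Print the chunk layout of each column; chunk counts and lengths can differ
// from column to column, which is the bookkeeping discussed above.
void DescribeChunkLayout(const std::shared_ptr<arrow::Table>& table) {
  for (int c = 0; c < table->num_columns(); ++c) {
    const auto& column = table->column(c);  // std::shared_ptr<arrow::ChunkedArray>
    std::cout << "column " << c << ": " << column->num_chunks() << " chunk(s)\n";
    std::int64_t row_offset = 0;
    for (int k = 0; k < column->num_chunks(); ++k) {
      const auto& chunk = column->chunk(k);  // std::shared_ptr<arrow::Array>
      std::cout << "  chunk " << k << " covers rows [" << row_offset << ", "
                << row_offset + chunk->length() << ")\n";
      row_offset += chunk->length();
    }
  }
}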
Hi, we are having some headaches supporting more and more data types as input (currently 15 of them!), along with the corresponding support for metadata like labels (which is handled differently from the data X; for example, you can't push a CSR matrix as a label). I would like to create something really simple for data dispatching and make it work uniformly. So is it okay if I push to your branch in the coming days?
@trivialfis Yes, please feel free to push to my branch. Thank you so much for your help! I really appreciate it.
Let me try working on this during the weekend.
The specification has grown a lot since the last time I looked into it...
This feature allows creating DMatrix from the Arrow memory format. It works with all file formats that PyArrow supports, and it provides better performance than, for example, XGBoost's CSV interface and Pandas' read_parquet interface.

- Python: Create DMatrix from pyarrow.Table
- C: Add XGDMatrixCreateFromArrowTable()
- C++: Add class ArrowAdapter and class ArrowAdapterBatch
- CMake: Add FindArrow.cmake

Build:
- Set env variable ARROW_ROOT to the path where Arrow libs, PyArrow libs, and Arrow C++ headers are found.
- Configure the build by passing '-DUSE_ARROW=ON' to the cmake command.

Usage:
import xgboost as xgb
import pyarrow.csv as pc
import pyarrow.parquet as pq

# CSV input
table = pc.read_csv('/path/to/csv/input')
dmat = xgb.DMatrix(table)

# Parquet input
table = pq.read_table('/path/to/parquet/input')
dmat = xgb.DMatrix(table)
- Modify run_test.sh cmake command
- Modify setup.sh to include pyarrow 0.17
Note to myself: compilation is currently failing on this branch.
I will continue the work on this on a private branch first to avoid spamming the CI.
@trivialfis Are you still working on this PR? Any update? Please let me know if I can help. Thanks!
Sorry for leaving this PR open without an update. At the moment, no. I really don't want to have a C/C++ level dependency. But Arrow is specified such that it can chunk the data arbitrarily, which makes it really painful to write a simple adapter.
@trivialfis It's been a long while since we touched this PR. One major reason it has been stuck so long is the dependency on the Arrow C++ library that it introduces. It would be a headache to maintain yet another third-party dependency, given the complex dependencies XGBoost already has. I'd like to propose a solution that completely removes the Arrow C++ dependency but still supports the Arrow data format. The Arrow C data interface is just a couple of C structs that let a project integrate with the Arrow format alone, without linking against the Arrow libraries. It makes it possible to easily exchange columnar data between different runtime components, e.g. between the Python and C++ layers. This can also be used to support other types of columnar data, such as Spark ColumnarBatch, as long as they can be marshalled into the Arrow C data interface. Shall we continue to use this PR to further discuss this idea? Or shall we close this and create another one? Thanks!
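For context, the entire surface of the Arrow C data interface is the two structs below, reproduced roughly as they appear in the published Arrow specification (shown here only for illustration; they are not part of this PR's diff). A producer such as PyArrow fills them in; a consumer reads the buffers and later invokes the release callback.

#include <cstdint>

// Reproduced (approximately) from the Arrow C data interface specification.
struct ArrowSchema {
  // Type description of one array: format string, field name, metadata.
  const char* format;
  const char* name;
  const char* metadata;
  int64_t flags;
  int64_t n_children;
  struct ArrowSchema** children;
  struct ArrowSchema* dictionary;
  // Called by the consumer when it is done with this struct.
  void (*release)(struct ArrowSchema*);
  void* private_data;
};

struct ArrowArray {
  // Data description of one array: lengths plus raw buffer pointers.
  int64_t length;
  int64_t null_count;
  int64_t offset;
  int64_t n_buffers;
  int64_t n_children;
  const void** buffers;
  struct ArrowArray** children;
  struct ArrowArray* dictionary;
  // Called by the consumer when it is done with this struct.
  void (*release)(struct ArrowArray*);
  void* private_data;
};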
Sorry for the stalled progress. Revisiting. |
It's not the dependency that I'm worried about, it's the specialized code.
I will pick up the progress starting Monday, just cleared out some other to-do items. |
Doing some refactorings. |
If by "specialized code" you mean code written with the sole purpose of supporting Arrow, then I believe the worry is unnecessary.

Firstly, my new proposal of using the Arrow C data interface aims at supporting a general columnar data format. Currently there isn't general columnar data support in XGBoost, except for the cudf support on GPUs. But my proposal is a solution that all kinds of columnar data sources can benefit from. It integrates only the Arrow format; it doesn't integrate the Arrow API. Thanks to the Arrow C data interface, the format can be expressed with only two C structs, and their specification is frozen by the Arrow project. This means we have a robust mechanism to exchange columnar data between different XGBoost components, for example, between the C++ lib and the Python layer (with the help of Python FFI), and between the C++ lib and the JVM layer (with the help of JNI).

Secondly, many ingredients of the implementation already exist. The Arrow project itself maintains a producer that generates the aforementioned C structs from Arrow arrays or Arrow record batches, and this mechanism is also available in PyArrow. Other columnar data sources can easily model the same approach (and even reuse much of the code) of the producer. On the other hand, within XGBoost, creating DMatrix from cudf, whose data format is based on the Arrow columnar format, is already supported. We should be able to borrow some of that mechanism to provide more general support for the columnar format.

Let me know what you think. Thanks.
@zhangzhang10 Thanks for the detailed explanation! For the "specialized code" part, I was referring to the new ArrowAdapter. But I will look deeper and keep you posted here.
I think @SmirnovEgorRu and @ShvetsKS are also familiar with the inplace predict implementation. Feel free to share your opinion. I can help refactor the code.
@trivialfis Thanks for your thoughts! You are correct that in order to support the Arrow data format we need a new ArrowAdapter. It is true we need to maintain the new adapter code.

As to the code path for CUDA, cudf itself is based on the Arrow data format, so it sort of already achieves the same goal we want to achieve: supporting the Arrow columnar data format. The work I propose should not affect the GPU backend and will not change the existing situation of having both CPU and GPU adapters. But it would be desirable to structure our implementation in a way that reuses and unifies some key data structures across the CPU and GPU code paths. One example, probably, is the ArrayInterface.

Thanks.
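For readers less familiar with the adapter machinery being discussed: DMatrix construction is templated over adapter types that present any input source as (row, column, value) elements. The sketch below is purely illustrative (simplified names and a dense column-major toy source, not the actual XGBoost or Arrow adapter definitions); it only shows the shape of the interface an Arrow-backed adapter batch would have to provide.

#include <cstddef>

// Illustrative only: a simplified version of the COO-style element and
// batch interface that DMatrix construction code iterates over.
struct COOTuple {
  std::size_t row_idx;
  std::size_t column_idx;
  float value;
};

// A hypothetical adapter batch over a dense column-major block, e.g. one
// record batch whose buffers have already been resolved.
class ExampleColumnarBatch {
 public:
  ExampleColumnarBatch(const float* data, std::size_t n_rows, std::size_t n_cols)
      : data_{data}, n_rows_{n_rows}, n_cols_{n_cols} {}

  // Total number of elements in this batch.
  std::size_t Size() const { return n_rows_ * n_cols_; }

  // Element i, expressed as a (row, column, value) triple; the DMatrix
  // builder only needs this uniform view, regardless of the source format.
  COOTuple GetElement(std::size_t idx) const {
    std::size_t col = idx / n_rows_;  // column-major layout
    std::size_t row = idx % n_rows_;
    return {row, col, data_[col * n_rows_ + row]};
  }

 private:
  const float* data_;
  std::size_t n_rows_;
  std::size_t n_cols_;
};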
@trivialfis @hcho3, I have a rework of this PR. Let's close this one and continue the discussion in the new PR (#7283). Thanks.
Closed. See #7283 |
This feature allows creating DMatrix from the Arrow memory format. It works with all file formats that PyArrow supports. And it provides better performance than, for example, XGBoost's CSV interface and Pandas' read_parquet interface.

Performance of DMatrix creation:

Changes:
- Python: Create DMatrix from pyarrow.Table
- C: Add XGDMatrixCreateFromArrowTable()
- C++: Add class ArrowAdapter and class ArrowAdapterBatch
- CMake: Add FindArrow.cmake

Build:
- Set env variable ARROW_HOME to the path where Arrow libs, PyArrow libs, and Arrow C++ headers are found. Alternatively, install the arrow-python, pyarrow, and gxx_linux-64 packages in a Conda environment.
- Configure the build by passing -DUSE_ARROW=ON to the cmake command.

Usage:
import xgboost as xgb
import pyarrow.csv as pc
import pyarrow.parquet as pq

# CSV input
table = pc.read_csv('/path/to/csv/input')
dmat = xgb.DMatrix(table)

# Parquet input
table = pq.read_table('/path/to/parquet/input')
dmat = xgb.DMatrix(table)