
Performance of converting datatable frames to DMatrix #7642

Closed · oleksiyskononenko opened this issue Feb 9, 2022 · 13 comments · Fixed by #8472

@oleksiyskononenko (Author) commented Feb 9, 2022

Dear developers, I'm currently looking into a performance issue when converting datatable frames to XGBoost DMatrix. Here is a simple Python script that I wrote to benchmark datatable->XGBoost vs numpy->XGBoost conversion performance:

import datatable as dt
import xgboost as xgb
import numpy as np
import time

nrows = 10**8
ncols = 1
DATA = [range(nrows)] * ncols
print("Shape: (%d; %d)" % (nrows, ncols))

NP = np.transpose(np.array(DATA))  # shape (nrows, ncols), matching the printed shape
t0 = time.time()
XGB = xgb.DMatrix(NP)
tnp = time.time() - t0

DT = dt.Frame(NP)
t0 = time.time()
XGB = xgb.DMatrix(DT)
tdt = time.time() - t0

print("t(DT->XG): ", tdt)
print("t(NP->XG): ", tnp)
print("t(DT->XG)/t(NP->XG): ", tdt/tnp)

I can see that the conversion of a one-column numpy array takes somewhat less time than the conversion of a one-column datatable frame:

Shape: (100000000; 1)
t(DT->XG):  3.133333921432495
t(NP->XG):  2.115936040878296
t(DT->XG)/t(NP->XG):  1.480826386478058

When I keep the number of elements the same but go to 10 columns, the dt->XGBoost conversion becomes even slower:

Shape: (10000000; 10)
t(DT->XG):  2.4032959938049316
t(NP->XG):  1.1607439517974854
t(DT->XG)/t(NP->XG):  2.0704790148449845

When I go to 100 columns, still keeping the number of elements the same, it becomes worse still:

Shape: (1000000; 100)
t(DT->XG):  3.0988237857818604
t(NP->XG):  0.9062139987945557
t(DT->XG)/t(NP->XG):  3.41952760595611

For some other multicolumn data I found that the t(DT->XG)/t(NP->XG) ratio could be as bad as 20; for some smaller data it could be around 1.

Do you have any ideas as to why the performance of dt->XGBoost could be so bad compared to numpy? At first I thought it could be related to the number of threads used, but it seems that the conversion is single-threaded for both np->XGBoost and dt->XGBoost, no matter how many threads I ask for in xgb.DMatrix(). Thanks in advance for any help on this issue.

P.S. For the benchmarks above I'm using the latest packages from the PyPI repo: XGBoost 1.5.2, numpy 1.22.2, and datatable 1.0.0.

@trivialfis (Member)

I think it's because datatable is a column-major data structure while DMatrix is row-major. We can speed up the conversion, but memory usage would increase significantly. A numpy ndarray, on the other hand, is dense and can be indexed arbitrarily, so we can handle both C and F layouts efficiently.
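To make the layout distinction concrete, here is a minimal numpy sketch (my own illustration, not xgboost code) showing that the same values sit in different physical orders under C (row-major) and F (column-major) layouts:

```python
import numpy as np

a = np.arange(12, dtype=np.float32).reshape(3, 4)  # C (row-major) layout
f = np.asfortranarray(a)                           # same values, F (column-major) layout

assert a.flags["C_CONTIGUOUS"] and f.flags["F_CONTIGUOUS"]
# In row-major storage the first 4 buffer elements are the first row;
# in column-major storage the first 3 buffer elements are the first column.
assert a.ravel(order="K").tolist()[:4] == [0.0, 1.0, 2.0, 3.0]
assert f.ravel(order="K").tolist()[:3] == [0.0, 4.0, 8.0]
```

Filling a row-major CSR structure from column-major storage therefore strides across every column buffer per row, which is the access pattern being discussed here.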

@oleksiyskononenko (Author)

Thanks, @trivialfis

Yes, I had a feeling that the issue is the memory layout of dt vs np vs XGBoost. I tried varying the nthread parameter in .DMatrix() to make it faster; however, it seems that both the dt->XGBoost and np->XGBoost conversions are single-threaded. Do you know why parallelization is not applicable in this case?

My current finding is that, in order to achieve better performance, instead of dt->XGBoost it is faster to do dt->np->XGBoost, since the dt->np conversion is fully parallel and fast enough.

@trivialfis (Member) commented Feb 10, 2022

I think np -> DMatrix is running in parallel; the effect might not be obvious since there's not much work. For dt, it's running single-threaded. We used to do the transform in parallel, but the memory overhead is significant for large datasets and we have to remove it. See #6552. I did some searching back then but couldn't find a satisfying solution: #6552 (comment)
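A rough pure-Python sketch of why the parallel version needed extra memory (names and structure are my own, not xgboost's actual code): each worker materializes its own private buffer of rows from the column-major input before the results are merged, so peak memory grows with the number of workers:

```python
# Column-major source: a list of columns (2 columns x 3 rows).
columns = [[1, 2, 3], [4, 5, 6]]

def build_rows(row_range):
    # Each "worker" builds its own private buffer of rows.
    return [[col[r] for col in columns] for r in row_range]

chunks = [range(0, 2), range(2, 3)]                # work split across two workers
buffers = [build_rows(chunk) for chunk in chunks]  # per-worker memory, O(nthread * batch)
rows = [row for buf in buffers for row in buf]     # merge step

assert rows == [[1, 4], [2, 5], [3, 6]]
```

With real threads the per-worker buffers exist simultaneously, which is where the O(nthread * batch_size) overhead discussed later in this thread comes from.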

@oleksiyskononenko (Author) commented Feb 10, 2022

the memory overhead is significant for large datasets and we have to remove it

Instead of fully disabling nthread for dt, could it be kept as an option defaulting to 1? Then, when the memory overhead is not an issue, one could profit from parallelization and get reasonable conversion times for big data.

@trivialfis (Member)

Note to myself:

We can build some heuristic here by passing the ctx into SparsePage::Push, then checking whether the user has specified the number of threads.
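A hypothetical sketch of that heuristic in Python (the names choose_nthread and DEFAULT_NTHREAD are mine, not xgboost's API): the parallel, memory-hungry path would be taken only when the caller explicitly opted in to extra threads:

```python
DEFAULT_NTHREAD = 0  # "not specified by the user"

def choose_nthread(user_nthread):
    # Stay single-threaded (and memory-frugal) unless the user opted in.
    if user_nthread and user_nthread > 0:
        return user_nthread
    return 1

assert choose_nthread(DEFAULT_NTHREAD) == 1  # default: safe, single-threaded
assert choose_nthread(8) == 8                # explicit request: go parallel
```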

@oleksiyskononenko (Author) commented Feb 11, 2022

Thanks, @trivialfis. What I still don't understand is why F-layout numpy arrays are handled faster than datatable, given that both are column-major. For instance, when instead of XGB = xgb.DMatrix(DT) I do

NP = DT.to_numpy()  # creates an F_CONTIGUOUS numpy array
XGB = xgb.DMatrix(NP)

the performance improves by about a factor of two.

@trivialfis (Member)

I don't have an answer to that; we might need to run some profiling to be sure.

@oleksiyskononenko (Author)

@trivialfis I see. Could you please point me to the code that does the transpose of the data? I wonder why it needs a significant amount of memory in addition to the transposed matrix. It looks like the code that sets the number of threads to one (https://github.com/dmlc/xgboost/pull/6774/files) is not there anymore.

@trivialfis (Member) commented Feb 16, 2022

@oleksiyskononenko

This is the starting point of a matrix being pushed into a DMatrix:

constexpr bool kIsRowMajor = AdapterBatchT::kIsRowMajor;

This is the structure used to allocate working memory (and run out of it):

common::ParallelGroupBuilder<

It's defined here: https://github.com/dmlc/xgboost/blob/master/src/common/group_data.h

The adapter parameter that represents a datatable is defined here:

class DataTableAdapter

Feel free to revert some of the new commits that check col/row-major input while you are digging into the code.

A related paper: https://synergy.cs.vt.edu/pubs/papers/wang-transposition-ics16.pdf

@oleksiyskononenko (Author) commented Feb 18, 2022

Thanks @trivialfis.

The paper seems to deal with the transposition of sparse data, while datatable frames are not sparse. How is it applicable in this case? Also, it looks like the SparsePage class expects sparse data?

@trivialfis (Member)

The sparse page is a CSR matrix used internally as a uniform data structure for all xgboost components, so the target of the transpose is sparse. The input is also often sparse, so we focus on the sparse transposition.
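For reference, a CSR matrix stores a dense matrix as three arrays: non-zero values, their column indices, and row offsets. A tiny self-contained Python sketch of building one (my own illustration, not xgboost's SparsePage code):

```python
dense = [[10, 0, 0],
         [0, 0, 20],
         [0, 30, 40]]

data, indices, indptr = [], [], [0]
for row in dense:
    for j, v in enumerate(row):
        if v != 0:
            data.append(v)      # non-zero values, row by row
            indices.append(j)   # column index of each stored value
    indptr.append(len(data))    # row i spans data[indptr[i]:indptr[i+1]]

assert data == [10, 20, 30, 40]
assert indices == [0, 2, 1, 2]
assert indptr == [0, 1, 2, 4]
```

Because row boundaries (indptr) are only known once the non-zeros of each row are counted, pushing column-major input into this layout is less straightforward than for a dense row-major target.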

@oleksiyskononenko (Author)

I see, so is it the reason you need O(nthread*batch_size) memory when going parallel? For a dense matrix, I would imagine transposition would just require allocating the output once and having all threads, sharing the same source matrix, write to it in parallel; it should not need O(nthread*batch_size) of additional memory.
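The dense argument here can be sketched in pure Python (my own illustration): workers write disjoint row ranges of a single preallocated row-major output while reading a shared column-major source, so no per-worker buffers are needed:

```python
nrows, ncols = 3, 2
src = [1, 2, 3, 4, 5, 6]     # column-major buffer: col0 = [1, 2, 3], col1 = [4, 5, 6]
dst = [0] * (nrows * ncols)  # row-major output, allocated once up front

def transpose_rows(row_range):
    # Each "worker" writes only its own rows of dst; reads from shared src.
    for r in row_range:
        for c in range(ncols):
            dst[r * ncols + c] = src[c * nrows + r]

for chunk in (range(0, 2), range(2, 3)):  # two "workers" with disjoint rows
    transpose_rows(chunk)

assert dst == [1, 4, 2, 5, 3, 6]
```

For a dense target the output size is known in advance, which is what makes the single shared allocation possible; for a CSR target the row offsets are not known until the data is scanned.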

@trivialfis
Copy link
Member

I see, so is it the reason you need O(nthread*batch_size) memory when going parallel?

Yes.
