
Performance of converting datatable frames to DMatrix #7642

Closed · oleksiyskononenko opened this issue Feb 9, 2022 · 13 comments · Fixed by #8472

@oleksiyskononenko (Author) commented Feb 9, 2022

Dear developers, I'm currently looking into a performance issue when converting datatable frames to XGBoost DMatrix. Here is a simple Python script that I wrote to benchmark datatable->XGBoost vs numpy->XGBoost conversion performance:

import datatable as dt
import xgboost as xgb
import numpy as np
import time

nrows = 10**8
ncols = 1
DATA = [range(nrows)] * ncols
print("Shape: (%d; %d)" % (nrows, ncols))

NP = np.transpose(np.array(DATA))  # shape (nrows, ncols), matching the printed shape
t0 = time.time()
XGB = xgb.DMatrix(NP)
tnp = time.time() - t0

DT = dt.Frame(NP)
t0 = time.time()
XGB = xgb.DMatrix(DT)
tdt = time.time() - t0

print("t(DT->XG): ", tdt)
print("t(NP->XG): ", tnp)
print("t(DT->XG)/t(NP->XG): ", tdt/tnp)

I can see that the conversion of a one-column numpy array takes somewhat less time than the conversion of a one-column datatable frame:

Shape: (100000000; 1)
t(DT->XG):  3.133333921432495
t(NP->XG):  2.115936040878296
t(DT->XG)/t(NP->XG):  1.480826386478058

When I keep the number of elements the same but go to 10 columns, the dt->XGBoost conversion becomes even slower:

Shape: (10000000; 10)
t(DT->XG):  2.4032959938049316
t(NP->XG):  1.1607439517974854
t(DT->XG)/t(NP->XG):  2.0704790148449845

When I go to 100 columns, still keeping the number of elements the same, it becomes worse still:

Shape: (1000000; 100)
t(DT->XG):  3.0988237857818604
t(NP->XG):  0.9062139987945557
t(DT->XG)/t(NP->XG):  3.41952760595611

For some other multicolumn data I found that the t(DT->XG)/t(NP->XG) ratio could be as bad as 20; for some smaller data it could be around 1.

Do you have any ideas as to why the performance of dt->XGBoost could be so bad compared to numpy? At first I thought it could be related to the number of threads used, but it seems that the conversion is single-threaded for both np->XGBoost and dt->XGBoost, no matter how many threads I ask for in xgb.DMatrix(). Thanks in advance for any help on this issue.

P.S. For the benchmarks above I'm using the latest packages from the PyPI repo: XGBoost 1.5.2, numpy 1.22.2, and datatable 1.0.0.

@trivialfis (Member)

I think it's because datatable is a column-major data structure while DMatrix is row-major. We can speed up the conversion, but memory usage would increase significantly. A numpy ndarray, on the other hand, is dense and can be indexed arbitrarily, so we can handle both C and F layouts efficiently.
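To make the layout distinction concrete, here is a minimal numpy sketch (my own illustration, not xgboost code) showing that the same values sit in different physical orders under C (row-major) and F (column-major) layouts:

```python
import numpy as np

a = np.arange(12, dtype=np.float32).reshape(3, 4)  # C (row-major) layout
f = np.asfortranarray(a)                           # same values, F (column-major) layout

assert a.flags["C_CONTIGUOUS"] and f.flags["F_CONTIGUOUS"]
# In row-major storage the first 4 buffer elements are the first row;
# in column-major storage the first 3 buffer elements are the first column.
assert a.ravel(order="K").tolist()[:4] == [0.0, 1.0, 2.0, 3.0]
assert f.ravel(order="K").tolist()[:3] == [0.0, 4.0, 8.0]
```

Filling a row-major CSR structure from column-major storage therefore strides across every column buffer per row, which is the access pattern being discussed here.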

@oleksiyskononenko (Author)

Thanks, @trivialfis

Yes, I had a feeling that the issue is the memory layout of dt vs np vs XGBoost. I tried varying the nthread parameter in .DMatrix() to make it faster; however, it seems that both the dt->XGBoost and np->XGBoost conversions are single-threaded. Do you know why parallelization is not applicable in this case?

My current finding is that, in order to achieve better performance, instead of dt->XGBoost it is faster to do dt->np->XGBoost, since the dt->np conversion is fully parallel and fast enough.

@trivialfis (Member) commented Feb 10, 2022

I think np -> DMatrix is running in parallel; the effect might not be obvious since there's not much work. For dt, it's running single-threaded. We used to do the transform in parallel, but the memory overhead is significant for large datasets and we have to remove it. See #6552. I did some searching back then but couldn't find a satisfying solution: #6552 (comment)
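A rough pure-Python sketch of why the parallel version needed extra memory (names and structure are my own, not xgboost's actual code): each worker materializes its own private buffer of rows from the column-major input before the results are merged, so peak memory grows with the number of workers:

```python
# Column-major source: a list of columns (2 columns x 3 rows).
columns = [[1, 2, 3], [4, 5, 6]]

def build_rows(row_range):
    # Each "worker" builds its own private buffer of rows.
    return [[col[r] for col in columns] for r in row_range]

chunks = [range(0, 2), range(2, 3)]                # work split across two workers
buffers = [build_rows(chunk) for chunk in chunks]  # per-worker memory, O(nthread * batch)
rows = [row for buf in buffers for row in buf]     # merge step

assert rows == [[1, 4], [2, 5], [3, 6]]
```

With real threads the per-worker buffers exist simultaneously, which is where the O(nthread * batch_size) overhead discussed later in this thread comes from.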

@oleksiyskononenko (Author) commented Feb 10, 2022

the memory overhead is significant for large datasets and we have to remove it

Instead of fully disabling nthread for dt, could it be kept as an option defaulting to 1? Then, when the memory overhead is not an issue, one could profit from parallelization and get reasonable conversion times for big data.

@trivialfis (Member)

Note to myself:

We can build some heuristic here by passing the ctx into SparsePage::Push, then checking whether the user has specified the number of threads.
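A hypothetical sketch of that heuristic in Python (the names choose_nthread and DEFAULT_NTHREAD are mine, not xgboost's API): the parallel, memory-hungry path would be taken only when the caller explicitly opted in to extra threads:

```python
DEFAULT_NTHREAD = 0  # "not specified by the user"

def choose_nthread(user_nthread):
    # Stay single-threaded (and memory-frugal) unless the user opted in.
    if user_nthread and user_nthread > 0:
        return user_nthread
    return 1

assert choose_nthread(DEFAULT_NTHREAD) == 1  # default: safe, single-threaded
assert choose_nthread(8) == 8                # explicit request: go parallel
```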

@oleksiyskononenko (Author) commented Feb 11, 2022

Thanks, @trivialfis. What I still don't understand is why F-layout numpy arrays are handled faster than datatable, given that both are column-major. For instance, when instead of XGB = xgb.DMatrix(DT) I do

NP = DT.to_numpy()  # creates an F_CONTIGUOUS numpy array
XGB = xgb.DMatrix(NP)

the performance improves by about a factor of two.

@trivialfis (Member)

I don't have an answer to that; we might need to run some profiling to be sure.

@oleksiyskononenko (Author)

@trivialfis I see. Could you please point me to the code that does the transpose of the data? I wonder why it needs a significant amount of memory in addition to the transposed matrix. It looks like the code that sets the number of threads to one (https://github.com/dmlc/xgboost/pull/6774/files) is not there anymore.

@trivialfis (Member) commented Feb 16, 2022

@oleksiyskononenko

This is the starting point of a matrix being pushed into a DMatrix:

constexpr bool kIsRowMajor = AdapterBatchT::kIsRowMajor;

This is the structure used to allocate working memory (and run out of it):

common::ParallelGroupBuilder<

It's defined here: https://github.com/dmlc/xgboost/blob/master/src/common/group_data.h

The adapter parameter that represents a datatable is defined here:

class DataTableAdapter

Feel free to revert some of the new commits that check col/row-major input while you are digging into the code.

A related paper: https://synergy.cs.vt.edu/pubs/papers/wang-transposition-ics16.pdf

@oleksiyskononenko (Author) commented Feb 18, 2022

Thanks @trivialfis.

The paper seems to deal with the transposition of sparse data, while datatable frames are not sparse. How is it applicable in this case? Also, it looks like the SparsePage class expects sparse data?

@trivialfis (Member)

The sparse page is a CSR matrix used internally as a uniform data structure for all xgboost components, so the target of the transpose is sparse. The input is also often sparse, so we focus on the sparse transposition.
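For reference, a CSR matrix stores a dense matrix as three arrays: non-zero values, their column indices, and row offsets. A tiny self-contained Python sketch of building one (my own illustration, not xgboost's SparsePage code):

```python
dense = [[10, 0, 0],
         [0, 0, 20],
         [0, 30, 40]]

data, indices, indptr = [], [], [0]
for row in dense:
    for j, v in enumerate(row):
        if v != 0:
            data.append(v)      # non-zero values, row by row
            indices.append(j)   # column index of each stored value
    indptr.append(len(data))    # row i spans data[indptr[i]:indptr[i+1]]

assert data == [10, 20, 30, 40]
assert indices == [0, 2, 1, 2]
assert indptr == [0, 1, 2, 4]
```

Because row boundaries (indptr) are only known once the non-zeros of each row are counted, pushing column-major input into this layout is less straightforward than for a dense row-major target.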

@oleksiyskononenko (Author)

I see, so is it the reason you need O(nthread*batch_size) memory when going parallel? For a dense matrix, I would imagine transposition would just require allocating the output once and having all threads, sharing the same source matrix, write to it in parallel; it should not need O(nthread*batch_size) of additional memory.
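The dense argument here can be sketched in pure Python (my own illustration): workers write disjoint row ranges of a single preallocated row-major output while reading a shared column-major source, so no per-worker buffers are needed:

```python
nrows, ncols = 3, 2
src = [1, 2, 3, 4, 5, 6]     # column-major buffer: col0 = [1, 2, 3], col1 = [4, 5, 6]
dst = [0] * (nrows * ncols)  # row-major output, allocated once up front

def transpose_rows(row_range):
    # Each "worker" writes only its own rows of dst; reads from shared src.
    for r in row_range:
        for c in range(ncols):
            dst[r * ncols + c] = src[c * nrows + r]

for chunk in (range(0, 2), range(2, 3)):  # two "workers" with disjoint rows
    transpose_rows(chunk)

assert dst == [1, 4, 2, 5, 3, 6]
```

For a dense target the output size is known in advance, which is what makes the single shared allocation possible; for a CSR target the row offsets are not known until the data is scanned.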

@trivialfis
Copy link
Member

I see, so is it the reason you need O(nthread*batch_size) memory when going parallel?

Yes.
