Performance of converting datatable frames to DMatrix #7642
I think it's a column-major data structure, while DMatrix is row-major. We can speed up the conversion, but memory usage would increase significantly. Numpy ndarrays, on the other hand, are dense and can be arbitrarily indexed, so we can handle both the c layout and the f layout efficiently.
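The layout mismatch can be sketched with plain numpy (a minimal illustration only, not xgboost's actual code): a Fortran-order buffer has to be copied into a fresh row-major buffer before it can be scanned row by row, which is where the extra memory goes.

```python
import numpy as np

# Column-major (Fortran-order) buffer, like a columnar frame.
f_arr = np.asfortranarray(np.arange(12, dtype=np.float32).reshape(3, 4))
assert f_arr.flags["F_CONTIGUOUS"] and not f_arr.flags["C_CONTIGUOUS"]

# Making it row-major requires a full extra copy of the matrix:
# the memory-for-speed trade-off described above.
c_arr = np.ascontiguousarray(f_arr)
assert c_arr.flags["C_CONTIGUOUS"]
assert np.array_equal(c_arr, f_arr)   # same values, new layout
assert c_arr.nbytes == f_arr.nbytes   # the copy doubles peak memory
```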
Thanks, @trivialfis. Yes, I had a feeling that the issue is the memory layout of dt vs np vs XGBoost. My current finding is that in order to achieve better performance, instead of dt->XGBoost it is faster to do dt->np->XGBoost, since the dt->np conversion is fully parallel and fast enough.
I think np -> DMatrix is running in parallel; the effect might not be obvious since there's not much work. For dt, it's running single-threaded. We used to do the transform in parallel, but the memory overhead is significant for large datasets and we had to remove it. See #6552. I did some searching back then but couldn't find a satisfying solution: #6552 (comment)
Instead of fully disabling
Note to myself: We can build some heuristic here by passing the ctx into
Thanks, @trivialfis. What I still don't understand is why f-layout numpy arrays are handled faster than datatable frames, if they are both column-major. For instance, when instead of passing the frame directly I do

```python
NP = DT.to_numpy()   # creates an F_CONTIGUOUS numpy array
XGB = xgb.DMatrix(NP)
```

performance is improved by about a factor of two.
I don't have an answer to that, we might need to run some profiling to be sure. |
@trivialfis I see, could you please point me to the code that does the transpose of the data? I wonder why it needs a significant amount of memory in addition to the transposed matrix. It looks like the code that sets the number of threads to one (https://github.com/dmlc/xgboost/pull/6774/files) is not there anymore.
This is the starting point of a matrix being pushed in: Line 1068 in 0149f81
This is the structure used to allocate working memory (and run out of it): Line 1078 in 0149f81
The "adapter" parameter that represents a datatable is defined here: Line 584 in 0149f81
Feel free to revert some of the new commits that check col/row-major input while you are digging into the code. A related paper: https://synergy.cs.vt.edu/pubs/papers/wang-transposition-ics16.pdf
Thanks @trivialfis. The paper seems to deal with the transposition of sparse data, while datatable frames are not sparse. How is it applicable in this case? Also, it looks like the
The sparse page is a CSR matrix used internally as a uniform data structure for all xgboost components, so the target of the transpose is sparse. The input is also often sparse, so we focus on the sparse transformation.
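A rough analogy with scipy (a sketch only; xgboost's internal SparsePage code is separate): transposing a CSR matrix means materialising brand-new index and data buffers, which is the kind of working memory being discussed.

```python
from scipy import sparse

# Random sparse input in CSR layout (row-major, like a sparse page).
m = sparse.random(1000, 50, density=0.1, format="csr", random_state=0)

# Transposing and converting back to CSR allocates fresh
# indptr/indices/data arrays; the transpose cannot be done in place.
mt = m.T.tocsr()
assert mt.shape == (50, 1000)
assert mt.nnz == m.nnz   # same non-zeros, freshly allocated buffers
```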
I see, so is that the reason you need
Yes. |
Dear developers, I'm now looking into a performance issue when converting datatable frames to XGBoost DMatrix. Here is a simple Python script that I wrote to benchmark datatable->XGBoost vs numpy->XGBoost performance:
I can see that the conversion of a one-column numpy array takes somewhat less time than the conversion of a one-column datatable frame.
When I keep the number of elements the same but go to 10 columns, the dt->XGBoost conversion becomes even slower.
When I go to 100 columns, still keeping the number of elements the same, it becomes even worse.
For some other multicolumn data I found that the `t(DT->XG)/t(NP->XG)` ratio could be as bad as 20. For some smaller data it could be around 1. Do you have any ideas as to why the performance of dt->XGBoost could be so bad compared to numpy? At first I thought it could be related to the number of threads used, but it seems that the conversion is single-threaded for both np->XGBoost and dt->XGBoost, no matter how many threads I ask for in `xgb.DMatrix()`. Thanks in advance for any help on this issue.

P.S. For the benchmarks above I'm using the latest packages from the PyPI repo: XGBoost 1.5.2, numpy 1.22.2, and datatable 1.0.0.