[RFC] Possible DMatrix refactor #4354
Comments
I would suggest implementing lazy initialization, even if we were to provide an explicit interface for specifying the type of DMatrix, something like:

```cpp
std::vector<DMatrixType> types;
for (auto const& up : updaters) {
  types.emplace_back(up->PreferredFormat());
}
dmat->BuildData(types);
```

In an updater, say:

```cpp
DMatrixType PreferredFormat() const { return DMatrixType::Quantile; }
```

And lastly in …
I'll echo this, as I am facing several scenarios that really need low-latency prediction. Making DMatrix a wrapper, instead of actually allocating memory to store the data, would reduce the overhead involved in the whole code path.
Isn't it very hard to eliminate virtual functions in this case, whether in the iterator layer or the lower memory-format layer? Do you have other suggestions?
I would vote for stopping training in this case; silently doing something for the user (even with a warning that no one will read) seems risky to me.
Thanks for your feedback. I think the action on this is to produce a prototype demonstrating DMatrix as a thin wrapper over an existing data structure, and possibly to explore the idea of lazy DMatrix construction; then we will reevaluate. It may be some time before I get to this.
Hi @hcho3, could you post some references for feature grouping when time allows? Like:
@trivialfis Feature grouping is a heuristic to cut down the number of feature columns in a sparse, high-dimensional dataset. It reduces memory cost and improves performance on multi-core CPUs (by improving work balance among worker threads). Addressing your questions:
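The heuristic can be sketched roughly as follows. This is illustrative code, not xgboost's actual implementation, and all names are hypothetical: greedily assign each column to the first group it does not conflict with, where two columns conflict when they are both non-zero in the same row.

```cpp
// Illustrative sketch of greedy feature grouping (hypothetical code,
// not xgboost's implementation): bundle columns whose non-zero rows
// never overlap, so several sparse columns share one physical column.
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <vector>

using Column = std::vector<std::size_t>;  // sorted row indices of non-zeros

// Two columns conflict if some row has a non-zero in both.
bool Conflicts(const Column& a, const Column& b) {
  std::size_t i = 0, j = 0;
  while (i < a.size() && j < b.size()) {
    if (a[i] == b[j]) return true;
    if (a[i] < b[j]) ++i; else ++j;
  }
  return false;
}

// Returns the group index assigned to each feature column.
std::vector<std::size_t> GroupFeatures(const std::vector<Column>& columns) {
  std::vector<std::size_t> assignment(columns.size());
  std::vector<Column> group_rows;  // union of non-zero rows per group
  for (std::size_t f = 0; f < columns.size(); ++f) {
    std::size_t g = 0;
    while (g < group_rows.size() && Conflicts(group_rows[g], columns[f])) ++g;
    if (g == group_rows.size()) group_rows.emplace_back();  // open a new group
    // Merge this column's rows into the group's row set.
    Column merged;
    std::merge(group_rows[g].begin(), group_rows[g].end(),
               columns[f].begin(), columns[f].end(),
               std::back_inserter(merged));
    group_rows[g] = std::move(merged);
    assignment[f] = g;
  }
  return assignment;
}
```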
Absolutely. Thanks for the information.
Just a note on ongoing work for refactoring: we currently have a flow of data that looks like this: (a) external data → CSR DMatrix → internal representation. It has been proposed that we instead do this: (b) external data → internal representation directly, to save memory in the intermediate step. A consequence of this is that we must implement combinatorially many constructors, one for every possible pair of external data format and internal representation, and there are a lot of combinations. In the first example the CSR DMatrix works like an intermediate format, where we only need one constructor for each external data format and each internal representation. This doesn't make it impossible to implement example (b), but it is something to consider. Maybe it's possible to use iterators in the middle step to generalise data copying without extra storage? (See the sketch below.)
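One way the iterator idea could look, as a minimal sketch; all names here are illustrative and not actual xgboost API. Each external format supplies one adapter producing a stream of entries, and each internal representation has one constructor consuming any such stream, so M external formats and N internal formats need M + N pieces instead of M × N constructors:

```cpp
#include <cstddef>
#include <functional>
#include <memory>
#include <vector>

struct EntryT {           // one non-missing matrix element
  std::size_t row;
  std::size_t column;
  float value;
};

// Generic element stream: writes the next entry and returns false at end.
using EntryStream = std::function<bool(EntryT*)>;

// One adapter per external format, e.g. a dense row-major buffer.
EntryStream DenseAdapter(const float* data, std::size_t rows, std::size_t cols) {
  auto i = std::make_shared<std::size_t>(0);
  return [=](EntryT* out) {
    if (*i >= rows * cols) return false;
    out->row = *i / cols;
    out->column = *i % cols;
    out->value = data[*i];
    ++*i;
    return true;
  };
}

// One constructor per internal representation, e.g. CSR; no intermediate
// copy of the whole dataset is materialised.
struct CsrMatrix {
  std::vector<std::size_t> row_ptr;
  std::vector<std::size_t> col_idx;
  std::vector<float> values;

  CsrMatrix(EntryStream next, std::size_t rows) : row_ptr(rows + 1, 0) {
    EntryT e;
    while (next(&e)) {    // assumes entries arrive in row order
      ++row_ptr[e.row + 1];
      col_idx.push_back(e.column);
      values.push_back(e.value);
    }
    for (std::size_t r = 0; r < rows; ++r) row_ptr[r + 1] += row_ptr[r];
  }
};
```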
Another thing I'm worried about is changing the updater during an experiment.
My goal is to make external memory work for GPUs. I think we likely need something along the lines of https://arxiv.org/abs/1901.09047, but tackling it directly may be too risky/disruptive to the existing code base, thus the need for refactoring. Possible next steps, in no particular order:
@rongou Thanks for the link. The paper looks quite interesting; let me read it.
Just to add a thing to the list: is it possible we turn the field …
@trivialfis looks like the …
Groups are used by ranking; qid seems to be intermediate storage that is used to construct the groups.
Added here as a convenient reference: we need to consider removing the strong reference between booster and DMatrix, see #4482.
Probably this line (and its effects): https://github.com/dmlc/xgboost/blob/master/src/learner.cc#L646
Hmm, sorry, currently I can't look into that. Putting it here as a reminder.
See #5044 for a proof of concept on refactoring DMatrix construction from external data.
@RAMitchell Do you have a roadmap or checklists for this?
As we add different types of algorithms to xgboost and optimise their performance, the DMatrix class may also need to evolve in order to improve performance and memory usage for these particular cases.
Current DMatrix
The current DMatrix class has two subclasses. The first is the primary in-memory representation, where the entire dataset is ingested and converted into a CSR format with values stored as 32-bit floats. The entire dataset may also be transposed in memory into a column format for exact-style algorithms.
The second is the external-memory representation, which constructs a number of binary page files (32 MB in size by default) that are streamed from disk. If a column format is requested, new transposed pages are generated and these are also streamed from disk.
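A rough sketch of the page-streaming idea (hypothetical names, not the actual implementation):

```cpp
// Stream fixed-size binary pages from a cache file on disk, one at a
// time, so only a single page need be resident in memory.
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

constexpr std::size_t kPageBytes = 32 << 20;  // 32 MB default page size

class PageStream {
 public:
  explicit PageStream(const std::string& path) : in_(path, std::ios::binary) {}

  // Reads the next page into `page`; returns false at end of file.
  bool Next(std::vector<char>* page) {
    page->resize(kPageBytes);
    in_.read(page->data(), kPageBytes);
    page->resize(static_cast<std::size_t>(in_.gcount()));
    return !page->empty();
  }

 private:
  std::ifstream in_;
};
```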
The end user does not directly choose the type of DMatrix, and these subclasses are not exposed via external APIs. The type of underlying DMatrix is selected automatically according to whether the user supplies a cache prefix in the constructor.
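For example, with the C API of the time, the selection looks roughly like this (error-code checks omitted); the `#` cache-prefix suffix in the file name is what switches to the external-memory implementation:

```cpp
#include <xgboost/c_api.h>

int main() {
  DMatrixHandle dmat;

  // No cache prefix: the in-memory DMatrix is used.
  XGDMatrixCreateFromFile("train.libsvm", /*silent=*/0, &dmat);
  XGDMatrixFree(dmat);

  // A '#<prefix>' suffix requests the external-memory DMatrix; binary
  // page files are written under the given cache prefix.
  XGDMatrixCreateFromFile("train.libsvm#dtrain.cache", /*silent=*/0, &dmat);
  XGDMatrixFree(dmat);
  return 0;
}
```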
Currently a batch of data is stored inside the DMatrix as a 'SparsePage' class. This uses HostDeviceVector in order to expose the data to both the CPU and the GPU.
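A simplified, self-contained sketch of that layout; the placeholder HostDeviceVector here stands in for xgboost's real class of the same name, which mirrors the data on the GPU:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

template <typename T>
class HostDeviceVector {  // placeholder: the real class syncs a device copy
 public:
  std::vector<T>& HostVector() { return host_; }
 private:
  std::vector<T> host_;
};

struct Entry {
  uint32_t index;  // feature index of this non-missing element
  float fvalue;    // feature value, stored as a 32-bit float
};

class SparsePage {
 public:
  // offset[i] .. offset[i + 1] delimits row i's entries within data.
  HostDeviceVector<std::size_t> offset;
  HostDeviceVector<Entry> data;
};
```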
Current fast histogram algorithms
The current fast histogram algorithms (hist, gpu_hist) convert the DMatrix object on the fly inside their respective TreeUpdater implementations, not as a part of the DMatrix implementation, although I believe this was eventually intended to be integrated with DMatrix (#1673). 'hist' converts the data into a CSR matrix with integers instead of floats. 'gpu_hist' converts the data into an ELLPACK format with integers instead of floats, and additionally applies bitwise compression at run time to both the matrix elements and the indices, commonly resulting in 4-5x compression over the standard CSR matrix. In gpu_hist we avoid ever copying the training DMatrix to the GPU if prediction caching is used.
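The bitwise compression can be illustrated with a small bit-packing sketch (illustrative only; the real implementation lives in xgboost's compressed-iterator utilities): bin indices drawn from n_symbols possible values need only ⌈log2(n_symbols)⌉ bits each, so they can be packed contiguously instead of stored in 32-bit words.

```cpp
// Illustrative bit-packing over 64-bit words: each symbol occupies
// ceil(log2(n_symbols)) bits. Assumes n_symbols >= 2 and bits per
// symbol < 64. Not the actual xgboost implementation.
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

class PackedBins {
 public:
  PackedBins(std::size_t n_symbols, std::size_t n_entries)
      : bits_(static_cast<std::size_t>(
            std::ceil(std::log2(static_cast<double>(n_symbols))))),
        buffer_((n_entries * bits_ + 63) / 64, 0) {}

  void Set(std::size_t i, uint64_t symbol) {
    std::size_t bit = i * bits_;
    buffer_[bit / 64] |= symbol << (bit % 64);
    if (bit % 64 + bits_ > 64) {  // value straddles a word boundary
      buffer_[bit / 64 + 1] |= symbol >> (64 - bit % 64);
    }
  }

  uint64_t Get(std::size_t i) const {
    std::size_t bit = i * bits_;
    uint64_t v = buffer_[bit / 64] >> (bit % 64);
    if (bit % 64 + bits_ > 64) {
      v |= buffer_[bit / 64 + 1] << (64 - bit % 64);
    }
    return v & ((uint64_t{1} << bits_) - 1);
  }

 private:
  std::size_t bits_;              // bits per stored symbol
  std::vector<uint64_t> buffer_;  // packed storage
};
```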
Some desirable features for DMatrix
Here I will list some features that we would like to have. Not all of these will be practical under the same design.
Notes on possible implementation
I realise the above is somewhat rambling and does not propose a concrete solution, but I hope to start discussion on these ideas.
@tqchen @hcho3 @trivialfis @CodingCat