[RFC] Possible DMatrix refactor #4354

Closed
RAMitchell opened this issue Apr 10, 2019 · 18 comments

@RAMitchell
Member

RAMitchell commented Apr 10, 2019

As we add different types of algorithms to xgboost, and as the performance of these algorithms improves, the DMatrix class may also need to evolve to improve performance and memory usage for these particular cases.

Current DMatrix
The current DMatrix class has two subclasses. The first is the primary in-memory representation, where the entire dataset is ingested and converted into a CSR format with values stored as 32-bit floats. The entire dataset may be transposed in memory into a column format for exact-style algorithms.

The second is the external memory representation, which constructs a number of binary page files (32MB in size by default) that are streamed from disk. If a column format is requested, new transposed pages are generated and these are streamed from disk.

The end user does not directly choose the type of DMatrix, and these subclasses are not exposed via external APIs. The underlying DMatrix type is selected automatically according to whether the user supplies a cache prefix in the constructor.

Currently a batch of data is stored inside the DMatrix as a 'SparsePage' class. This uses HostDeviceVector in order to expose data to both the CPU and GPU.
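
For reference, a simplified sketch of that layout (illustrative only, not the actual class definitions; in XGBoost the buffers are HostDeviceVector rather than std::vector, which is what lets both the CPU and GPU see the data):

#include <cstddef>
#include <vector>

// Illustrative CSR batch layout; std::vector stands in for HostDeviceVector.
struct Entry {
  std::size_t index;  // feature (column) index
  float fvalue;       // feature value, stored as a 32-bit float
};

struct SparsePageSketch {
  std::vector<std::size_t> offset;  // row i spans data[offset[i], offset[i+1])
  std::vector<Entry> data;          // non-zero entries of all rows, concatenated
};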

Current fast histogram algorithms
The current fast histogram algorithms (hist, gpu_hist) convert the DMatrix object on the fly inside their respective TreeUpdater implementations, not as part of the DMatrix implementation, although I believe it was eventually intended to be integrated with DMatrix (#1673). 'hist' converts the data into a CSR matrix with integers instead of floats. 'gpu_hist' converts the data into an ELLPACK format with integers instead of floats and additionally applies bitwise compression at run time to both the matrix elements and indices, commonly resulting in 4-5x compression over the standard CSR matrix. In gpu_hist we avoid ever copying the training DMatrix to the GPU if prediction caching is used.
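
To make the gpu_hist representation above more concrete, a rough sketch (names and types are illustrative, not the real XGBoost classes):

#include <cmath>
#include <cstdint>
#include <vector>

// ELLPACK-style layout: every row is padded to the same stride and each cell
// stores an integer bin index rather than a 32-bit float.
struct EllpackSketch {
  std::size_t row_stride;         // maximum non-zeros per row; shorter rows are padded
  std::vector<uint32_t> bin_idx;  // n_rows * row_stride bin indices, row-major
};

// With n_bins symbols (plus one padding symbol), each cell only needs this many
// bits, which is where the 4-5x compression over the float CSR matrix comes from.
inline int SymbolBits(std::size_t n_bins) {
  return static_cast<int>(std::ceil(std::log2(static_cast<double>(n_bins + 1))));
}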

Some desirable features for DMatrix
Here I will list some features that we would like to have. Not all of these will be practical under the same design.

  • The ability to train entirely on a compressed, quantised DMatrix. We could save around 4x memory and be much more competitive with LightGBM on memory usage.
  • Support for dense matrices. We would directly halve memory usage by using dense formats where appropriate.
  • DMatrix as a thin wrapper over an already allocated data matrix, e.g. a numpy ndarray (see the sketch after this list). Given a dense numpy input, we currently copy it into DMatrix format, resulting in 3n memory usage. Simply using the underlying numpy array as our storage uses 1n memory.
  • Support for ingest of data from GPU memory (see [WIP] cuDF integration into XGBoost #3997)
  • Support for external memory for all of the above
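
To make the thin-wrapper idea concrete, here is a minimal sketch of a non-owning view over a dense, row-major buffer such as a numpy ndarray's data pointer (the class name and interface are hypothetical):

#include <cstddef>

class DenseMatrixView {
 public:
  // The buffer stays owned by the caller (e.g. the numpy array); no copy is
  // made, so memory usage stays at 1n instead of 3n.
  DenseMatrixView(const float* data, std::size_t n_rows, std::size_t n_cols)
      : data_(data), n_rows_(n_rows), n_cols_(n_cols) {}

  float operator()(std::size_t row, std::size_t col) const {
    return data_[row * n_cols_ + col];
  }
  std::size_t Rows() const { return n_rows_; }
  std::size_t Cols() const { return n_cols_; }

 private:
  const float* data_;
  std::size_t n_rows_;
  std::size_t n_cols_;
};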

Notes on possible implementation

  • The current iterators used to access DMatrix could be made into virtual functions, abstracting away the underlying memory format (a sketch of such an interface follows this list). There could be a significant performance penalty here from virtual function lookup.
  • The question is then how the histogram learning algorithms access the underlying integer representation, if it exists. Do we provide another set of iterators specifically for quantised data? What happens if these iterators are called but the underlying data is stored as floats? We can either raise an error and tell the user to construct a different type of DMatrix for this particular learning algorithm, or generate the required representation on the fly with some silent performance penalty.
  • Another possible solution is to have the DMatrix be a completely 'lazy' object that holds a pointer to an external data source but defers constructing an internal representation until it knows the optimal format, based on whichever learning algorithm is chosen. This solution has the downside of introducing complicated state to the DMatrix object.
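
A minimal sketch of the virtual-iterator idea from the first bullet, assuming hypothetical names (the real DMatrix API may look quite different):

struct SparseBatch { /* CSR rows, as in SparsePage */ };

// Abstract data source: learning algorithms ask for batches and never see the
// concrete storage format (in-memory, external memory, dense, quantised, ...).
class DMatrixSource {
 public:
  virtual ~DMatrixSource() = default;
  // Returns the next batch, or nullptr once the data source is exhausted.
  // Each call is virtual; amortising the lookup cost over whole batches rather
  // than individual elements keeps the overhead small.
  virtual const SparseBatch* NextBatch() = 0;
  virtual void Reset() = 0;
};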

I realise the above is somewhat rambling and does not propose a concrete solution, but I hope to start discussion on these ideas.

@tqchen @hcho3 @trivialfis @CodingCat

@trivialfis
Member

trivialfis commented Apr 10, 2019

silent performance penalty

There's always LOG(WARNING).

downside of introducing complicated state

I would suggest implementing lazy initialization. Even if we were to provide an explicit interface for specifying the type of DMatrix, ideally we would still need to add a check for users; I doubt that anyone would first try to understand what this jargon means before using the library, so there would be lots of mistakes. To implement this check, those states are necessary. The state can be encoded in the Updater, with the following code built into the Learner:

// Collect the preferred data format from each configured tree updater ...
std::vector<DMatrixType> types;
for (auto& up : updaters) {
  types.emplace_back(up->PreferredFormat());
}
// ... then let the DMatrix build a suitable internal representation.
dmat->BuildData(types);

In an updater, say gpu_hist:

DMatrixType PreferredFormat() const { return DMatrixType::Quantile; }

And lastly, DMatrix::BuildData(DMatrixType type) just dispatches on the requested type.
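
A minimal sketch of what that dispatch could look like (all names here are hypothetical; in particular the BuildQuantileData/BuildCSRData helpers do not exist in XGBoost):

#include <vector>

enum class DMatrixType { kCSR, kQuantile };

class DMatrixSketch {
 public:
  void BuildData(const std::vector<DMatrixType>& types) {
    // Use the format preferred by the configured updaters; fall back to CSR.
    DMatrixType type = types.empty() ? DMatrixType::kCSR : types.front();
    switch (type) {
      case DMatrixType::kQuantile:
        BuildQuantileData();  // binned/quantised representation
        break;
      case DMatrixType::kCSR:
      default:
        BuildCSRData();       // plain 32-bit float CSR
        break;
    }
  }

 private:
  void BuildQuantileData() { /* construct the quantised representation */ }
  void BuildCSRData() { /* construct the CSR representation */ }
};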

@CodingCat
Member

DMatrix as a thin wrapper over an already allocated data matrix.

Echoing this, as I am facing several scenarios that really need low-latency prediction. Making DMatrix a wrapper, instead of actually allocating memory to store the data, would reduce the overhead involved in the whole code path.

There could be a significant performance penalty here from virtual function look up.

Isn't it very hard to eliminate virtual functions in this case, either in the iterator layer or the lower memory-format layer? Do you have other suggestions?

What happens if these iterators are called but the underlying data is stored as a float

I would vote for stopping training in this case; silently doing something for the user (even with a warning that no one will read) seems kind of risky to me.

@RAMitchell
Member Author

Thanks for your feedback. I think the action on this is to produce a prototype demonstrating DMatrix as a thin wrapper over an existing data structure, and possibly to explore the idea of lazy DMatrix construction; then we will reevaluate.

It may be some time before I get to this.

@trivialfis
Member

Hi @hcho3, could you post some references for feature grouping when time allows?

@hcho3
Collaborator

hcho3 commented May 23, 2019

@trivialfis Feature grouping is a heuristic to cut down the number of feature columns in a sparse, high-dimensional dataset. It reduces memory cost and improves performance on multi-core CPUs (by improving work balance among worker threads). Addressing your questions:

  • Reference for grouping algorithm: See the LightGBM paper. The term they use is Exclusive Feature Bundling (EFB). The description given is as follows:

Usually in real applications, although there are a large number of features, the feature space is quite sparse, which provides us a possibility of designing a nearly lossless approach to reduce the number of effective features. Specifically, in a sparse feature space, many features are (almost) exclusive, i.e., they rarely take nonzero values simultaneously. Examples include the one-hot features (e.g., one-hot word representation in text mining). We can safely bundle such exclusive features. To this end, we design an efficient algorithm by reducing the optimal bundling problem to a graph coloring problem (by taking features as vertices and adding edges for every two features if they are not mutually exclusive), and solving it by a greedy algorithm with a constant approximation ratio.
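
A rough sketch of that greedy bundling step, assuming a precomputed pairwise conflict matrix (illustrative only; the real EFB algorithm also tolerates a small number of conflicts per bundle rather than requiring strict exclusivity):

#include <cstddef>
#include <vector>

// conflicts[i][j] is true when features i and j take nonzero values in the same
// row at least once, i.e. they cannot be bundled losslessly.
std::vector<std::vector<std::size_t>> GreedyBundle(
    const std::vector<std::vector<bool>>& conflicts) {
  std::vector<std::vector<std::size_t>> bundles;
  for (std::size_t f = 0; f < conflicts.size(); ++f) {
    bool placed = false;
    for (auto& bundle : bundles) {
      bool has_conflict = false;
      for (std::size_t other : bundle) {
        if (conflicts[f][other]) { has_conflict = true; break; }
      }
      if (!has_conflict) {  // f is exclusive with every feature in this bundle
        bundle.push_back(f);
        placed = true;
        break;
      }
    }
    if (!placed) bundles.push_back({f});  // start a new bundle ("colour")
  }
  return bundles;
}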

@trivialfis
Member

Can we come back to this after merging #4433 ?

Absolutely. Thanks for the information.

@RAMitchell
Member Author

RAMitchell commented Aug 4, 2019

Just a note on ongoing work for refactoring:

We currently have a flow of data that looks like this:
External data (e.g. numpy 2D array) -> DMatrix CSR -> DMatrix other (column or quantile matrix)

It has been proposed that we instead do this, to save memory in the intermediate step:
External data -> DMatrix other

A consequence of this is that we must implement combinatorially many constructors, one for every possible pair of external data format and internal representation. There are a lot of combinations.

In the first flow the CSR DMatrix acts as an intermediate format, so we only need one constructor per external data format and one per internal representation.

This doesn't make the second flow impossible to implement, but it is something to consider. Maybe it's possible to use iterators in the middle step to generalise data copying without extra storage? A rough sketch of that idea is below.
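
Sketch of the iterator idea, with hypothetical names: each external format implements one row adapter, and each internal representation has one constructor that consumes any adapter, so the number of conversion paths grows additively rather than combinatorially.

#include <cstddef>
#include <vector>

struct AdapterEntry {
  std::size_t column;
  float value;
};

// One adapter per external data format.
class RowAdapter {
 public:
  virtual ~RowAdapter() = default;
  // Fills 'row' with the non-zero entries of the next row; returns false at the end.
  virtual bool NextRow(std::vector<AdapterEntry>* row) = 0;
};

class DenseArrayAdapter : public RowAdapter { /* wraps a row-major float buffer */ };
class CudfAdapter : public RowAdapter { /* wraps GPU column data, see #3997 */ };

// One constructor per internal representation, each consuming any adapter, e.g.:
//   CSRMatrix BuildCSR(RowAdapter* adapter);
//   EllpackMatrix BuildEllpack(RowAdapter* adapter);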

@rongou @sriramch @trivialfis

@trivialfis
Member

Another thing I'm worried about is changing the updater during an experiment.

@rongou
Contributor

rongou commented Aug 5, 2019

My goal is to make external memory work for GPUs. I think we likely need something along the lines of https://arxiv.org/abs/1901.09047, but tackling it directly may be too risky/disruptive to the existing code base, hence the need for refactoring.

Possible next steps, in no particular order:

  • The existing code handling SparsePage/CSCPage/SortedCSCPage could use some cleanup, especially the external memory portion.
  • Create a new page format (EllpackPage?) that can hold the binned/compressed features. This basically means moving the quantile sketch/compression out of the tree updater.
  • Enable reading/writing EllpackPages from/to disk. To start with, we can stick with the current approach and keep the whole final dataset in GPU memory.
  • Stream in the data during tree construction. This may just be too slow to make sense for GPUs.
  • Implement stratified storage and importance sampling as in https://arxiv.org/abs/1901.09047.

@hcho3
Collaborator

hcho3 commented Aug 5, 2019

@rongou Thanks for the link. The paper looks quite interesting, let me read it.

@trivialfis
Member

Just adding a thing to the list: is it possible to turn the field qid_ into a local variable? It was added by #2749; glancing through the diff, it doesn't seem to be used by any algorithm.

@rongou
Contributor

rongou commented Aug 6, 2019

@trivialfis looks like the qids_ are used by ranking objectives.

@trivialfis
Member

Groups are used by ranking; qid seems to be intermediate storage that is used to construct the groups.

@trivialfis
Member

Added here as a convenient reference: we need to consider removing the strong reference between the booster and the DMatrix, see #4482.

@rongou
Contributor

rongou commented Aug 9, 2019

Probably this line (and its effects): https://github.com/dmlc/xgboost/blob/master/src/learner.cc#L646

@trivialfis
Member

Em, sorry, currently I can't look into that. Putting it here as a reminder.

@RAMitchell
Member Author

See #5044 for a proof of concept on re-factoring DMatrix construction from external data.

@trivialfis
Member

@RAMitchell Do you have a roadmap or check lists for this?
