
Fast histogram algorithm exhibits long start-up time and high memory usage #2326

Closed
Laurae2 opened this issue May 20, 2017 · 24 comments · Fixed by #4625

@Laurae2 (Contributor) commented May 20, 2017

Two issues in one:

  • Fast histogram xgboost does not start training for a long time (stuck building the matrix?)
  • Fast histogram xgboost uses a massive amount of memory (over 50GB of RAM, whereas exact xgboost peaks at about 10GB)

Before running, note:

  • This will not work with Microsoft R versions (incompatible, because Microsoft R spams messages about sending data collection to Microsoft while compiling packages)
  • The dataset download is 240MB, 2GB uncompressed; the RDS files sum to 650MB
  • Make sure to have about 25GB of RAM free for fast histogram (an extra 25GB seems to be committed in memory but never used, for unknown reasons)
  • Creating the dataset (3GB, 7,744,201,107,060 elements) requires my svmlight package in R
  • My scripts can be run copy & paste; just change the folders in setwd appropriately
  • I expected fast histogram to be slower than exact xgboost due to sparsity, but not by this much

Created files info:

  File: url_svmlight.tar.gz
CRC-32: c152f632
   MD4: b05d0a58ad6f53f9ad20a5f350c25df2
   MD5: 2eb74f59d06807a3845f90ed36fe8ded
 SHA-1: 605121dde12ce7baf098e30cd8856ef4ed6d5e69
  File: reput_sparse.rds
CRC-32: 846b1907
   MD4: 2ea36b208118b7d46d18ccb5fad8da98
   MD5: ce371f418463ebc799367ed41f5fd654
 SHA-1: 7f2a73609d72ff61a24a6d176c44752b68954ea6
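
To verify the downloads against the checksums above, base R can compute MD5 sums directly (a quick sketch; tools::md5sum is part of base R):

# Compare against the MD5 values listed above
tools::md5sum(c("url_svmlight.tar.gz", "reput_sparse.rds"))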

Environment info

Operating System: Windows Server 2012 R2 (bare metal) - the OS doesn't matter; it also happens on Ubuntu 17.04

Compiler: MinGW 7.1

Package used (python/R/jvm/C++): R 3.4

xgboost version used: built from source, latest commit as of today (e5e7217)

Computer specs:

  • Quad E7-88xx v4 (72 cores), but 3 CPUs were disabled to get rid of NUMA
  • 1TB of 1866MHz RAM (256GB available with NUMA disabled)
  • Not virtualized

Used 8 threads for xgboost because it was for a benchmark comparison.

Steps to reproduce

  1. Download this dataset: http://archive.ics.uci.edu/ml/datasets/URL+Reputation (svmlight format, 121 parts, 2396130 rows, 3231961 features)
  2. Load the data using my svmlight parser
  3. Train an xgboost model with the fast histogram method (the default is 255 bins; one can start directly with 15 bins)

What have you tried?

  1. Changing xgboost to exact: training starts immediately
  2. Lowering the number of threads to 4: still has not started after 30 minutes
  3. Lowering the number of bins to 15: still has not started after 30 minutes
  4. Lowering the number of bins and setting the number of threads to 1 (on a small laptop with the same environment): training eventually starts after 5 minutes

Scripts:

LIBRARY DOWNLOAD:

install.packages("devtools")
install.packages("Matrix")
install.packages("Rcpp")
install.packages("RcppEigen")
devtools::install_github("Laurae2/sparsity")
# Install xgboost separately from source

DATA LOAD:

# Libraries
library(sparsity)
library(Matrix)

# SET YOUR WORKING DIRECTORY
setwd("E:/benchmark_lot/data")

# Read the first day and pad the matrix to the full feature count
data <- read.svmlight("Day0.svm")
data$matrix@Dim[2] <- 3231962L
# Extend the column pointer slot @p to match the padded column count
data$matrix@p[length(data$matrix@p):3231963] <- data$matrix@p[length(data$matrix@p)]
# Drop the spurious last row produced by the parser
data$matrix <- data$matrix[1:(data$matrix@Dim[1] - 1), ]
# Map labels from {-1, 1} to {0, 1}
label <- (data$labels[1:(data$matrix@Dim[1])] + 1) / 2
data <- data$matrix

new_data <- list()

for (i in 1:120) {
  # Rotating buffer slot 1..10; files are merged in batches of 10
  indexed <- (i %% 10) + (10 * ((i %% 10) == 0))
  new_data[[indexed]] <- read.svmlight(paste0("Day", i, ".svm"))
  # Same padding and cleanup as for Day0 above
  new_data[[indexed]]$matrix@Dim[2] <- 3231962L
  new_data[[indexed]]$matrix@p[length(new_data[[indexed]]$matrix@p):3231963] <- new_data[[indexed]]$matrix@p[length(new_data[[indexed]]$matrix@p)]
  new_data[[indexed]]$matrix <- new_data[[indexed]]$matrix[1:(new_data[[indexed]]$matrix@Dim[1] - 1), ]
  label <- c(label, (new_data[[indexed]]$labels[1:(new_data[[indexed]]$matrix@Dim[1])] + 1) / 2)
  
  if ((i %% 10) == 0) {
    
    # Append the 10 buffered parts to the main matrix in one rbind call
    data <- rbind(data, new_data[[1]]$matrix, new_data[[2]]$matrix, new_data[[3]]$matrix, new_data[[4]]$matrix, new_data[[5]]$matrix, new_data[[6]]$matrix, new_data[[7]]$matrix, new_data[[8]]$matrix, new_data[[9]]$matrix, new_data[[10]]$matrix)
    gc(verbose = FALSE)
    
    cat("Parsed element 'Day", i, ".svm'. Sparsity: ", sprintf("%05.0f", as.numeric(data@Dim[1]) * as.numeric(data@Dim[2]) / length(data@i)), ":1. Balance: ", sprintf("%04.02f", length(label) / sum(label)), ":1.\n", sep = "")
    
  }
  
}

# Save to RDS
gc()
saveRDS(data, file = "reput_sparse.rds", compress = TRUE)

# Save labels
saveRDS(label, file = "reput_label.rds", compress = TRUE)

RUN FAST HISTOGRAM, will take forever(?) to start:

# SET YOUR WORKING DIRECTORY
setwd("E:/benchmark_lot/data")

my_data <- readRDS("reput_sparse.rds")
label <- readRDS("reput_label.rds")

library(xgboost)
data <- xgb.DMatrix(data = my_data, label = label)

gc(verbose = FALSE)
set.seed(11111)
model <- xgb.train(params = list(nthread = 8,
                                 max_depth = 5,
                                 tree_method = "hist",
                                 grow_policy = "depthwise",
                                 eta = 0.10,
                                 max_bin = 15,
                                 eval_metric = "auc",
                                 debug_verbose = 1),
                   data = data,
                   nrounds = 100,
                   watchlist = list(train = data),
                   verbose = 2,
                   early_stopping_rounds = 50)

RUN EXACT, will start immediately, and very fast. 10GB peak:

# SET YOUR WORKING DIRECTORY
setwd("E:/benchmark_lot/data")

my_data <- readRDS("reput_sparse.rds")
label <- readRDS("reput_label.rds")

library(xgboost)
data <- xgb.DMatrix(data = my_data, label = label)

gc(verbose = FALSE)
set.seed(11111)
model <- xgb.train(params = list(nthread = 8,
                                 max_depth = 5,
                                 tree_method = "exact",
                                 eta = 0.10,
                                 eval_metric = "auc"),
                   data = data,
                   nrounds = 100,
                   watchlist = list(train = data),
                   verbose = 2,
                   early_stopping_rounds = 50)

ping @hcho3

@hcho3 (Collaborator) commented Jul 5, 2017

@Laurae2 I've been working to fix the poor scaling issue for a while, for which I submitted a patch today. Let me take a look at this issue. Sorry for the delay.

@Laurae2 (Contributor, Author) commented Jul 8, 2017

@hcho3 Do you know which parameters can be used to lower RAM usage / build the dataset faster with the new version (#2493)?

It went from 30GB of RAM to 120+GB (I cancelled after 1 hour of dataset building, when it took only 4 minutes in a previous version). I am using my customized URL Reputation dataset (https://github.com/Laurae2/gbt_benchmarks#reput)

hcho3 added a commit to hcho3/xgboost that referenced this issue Jul 9, 2017
It has been reported that the new parallel algorithm (dmlc#2493) results in excessive
memory usage (see issue dmlc#2326). Until the issues are resolved, XGBoost should use
the old parallel algorithm by default. The user would have to specify
`enable_feature_grouping=1` manually to enable the new algorithm.
@hcho3 (Collaborator) commented Jul 9, 2017

Now that I look at it, both the old and new algorithms suffer from similar issues (long start-up time, excessive memory usage). The change of algorithm was primarily for faster training, so the new algorithm won't help with whatever problem is occurring at the initialization stage. I will go ahead and try to reproduce the issue on my side. Thanks!

@Laurae2 (Contributor, Author) commented Jul 9, 2017

@hcho3 Below is my training script, in case you are interested in reproducing my 120GB result:

library(Matrix)
library(xgboost)

# SET YOUR WORKING DIRECTORY
setwd("E:/benchmark_lot/data")

my_data <- readRDS("reput_sparse_final.rds") # Requires the reput customized dataset
label <- readRDS("reput_label.rds")

train_1 <- my_data[1:2250000, ]
train_2 <- label[1:2250000]
test_1 <- my_data[2250001:2396130, ]
test_2 <- label[2250001:2396130]

train <- xgb.DMatrix(data = train_1, label = train_2)
test <- xgb.DMatrix(data = test_1, label = test_2)
rm(train_1, train_2, test_1, test_2, my_data, label)
gc()

# rm(model)
gc(verbose = FALSE)
set.seed(11111)
model <- xgb.train(params = list(nthread = 8,
                                 #max_depth = 3,
                                 num_leaves = 127,
                                 tree_method = "hist",
                                 grow_policy = "depthwise",
                                 eta = 0.25,
                                 max_bin = 255,
                                 eval_metric = "auc",
                                 debug_verbose = 2),
                   data = train,
                   nrounds = 10,
                   watchlist = list(test = test),
                   verbose = 2,
                   early_stopping_rounds = 50)

I left it running for one hour; it used 130GB and was still growing. I also tried max_search_group = 5: same issue. However, with enable_feature_grouping = 0, training started after the usual expected time (about 4 minutes).

@hcho3 (Collaborator) commented Jul 10, 2017

@Laurae2 I have managed to reproduce the issue on my end. Both the old and new algorithms suffer from excessive memory usage and setup time, but the new one makes it even worse. I will submit a patch as soon as possible.

P.S. Thanks so much for writing the sparsity package. I was looking for a way to export svmlight files from sparse matrices in R, and other packages took forever(?) to complete.

@RAMitchell (Member) commented:

Is the slowdown in the hmat/gmat initialisation? Our GPU algorithms would also massively benefit if you can speed up these functions.

tqchen pushed a commit that referenced this issue Jul 10, 2017
It has been reported that the new parallel algorithm (#2493) results in excessive
memory usage (see issue #2326). Until the issues are resolved, XGBoost should use
the old parallel algorithm by default. The user would have to specify
`enable_feature_grouping=1` manually to enable the new algorithm.
@hcho3 (Collaborator) commented Jul 10, 2017

@RAMitchell Yes. The gmat initialization needs to be updated for better performance and memory usage.

@pseudotensor (Contributor) commented Jul 14, 2017

@hcho3 As an example, for the full airlines dataset (http://stat-computing.org/dataexpo/2009/, with some categoricals converted to numerical; 13GB csv, 121888025 rows and 31 columns), here are some timing numbers:

Time to read data: 163.64 # bad pandas single-threaded reader; can use data.table instead, so not a concern.
Time to np.array: 3.44
Time to DMatrix: 32.52
Training with 'gpu_hist'
[03:07:30] Device: [0] TITAN X (Pascal) with 10283 MB available device memory.
[03:07:30] Device: [1] TITAN X (Pascal) with 10284 MB available device memory.
[03:07:30] Device: [2] TITAN X (Pascal) with 11846 MB available device memory.
[03:07:31] Device: [3] TITAN X (Pascal) with 10284 MB available device memory.
[03:07:44] Allocated 3953 MB # so sparse, but not as much as Laurae's data.
Time in gmat: 16.97 seconds
Time on GPU after gmat init: 1.206 seconds # for a reasonable depth=6 doing 2 iterations. Doing 20 iterations would balance things out, but I've seen other datasets (e.g. Higgs) where depth=6 and iterations=5000 (excessive by far) would be required to balance out the time spent in gmat init.

So most of the time is spent in DMatrix creation and gmat initialization, making the GPU much less effective than it could be.

But notice that DMatrix creation is also a major issue, which I'll look at making multi-threaded on the CPU (it's single-threaded right now) or doing on the GPU. This is common for big-data problems.
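
For an R-side comparison, one can time the DMatrix construction step separately from training. A minimal sketch on synthetic data (rsparsematrix comes from the Matrix package; the sizes here are arbitrary):

# Time xgb.DMatrix construction alone, analogous to the Python timings above
library(Matrix)
library(xgboost)
set.seed(1)
X <- rsparsematrix(1e6, 1000, density = 0.01)  # ~10M nonzeros
y <- rbinom(1e6, 1, 0.5)
print(system.time(dtrain <- xgb.DMatrix(data = X, label = y)))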

@hcho3 (Collaborator) commented Jul 14, 2017

@pseudotensor It appears to me that quantile extraction is also causing difficulties with the URL dataset. I'll spend time on this issue in the next few days and get back to you.

@hcho3 (Collaborator) commented Jul 24, 2017

@Laurae2 I've submitted pull request #2543. With it, memory consumption goes down to roughly 15-30GB for the URL dataset. Setup is also faster, taking 5 minutes or less on a 32-core machine.

You can further reduce memory consumption and improve setup time by setting use_columnar_access=0.
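
For reference, a minimal sketch of how these options would be passed from R, assuming a build of the #2543 branch (enable_feature_grouping and use_columnar_access are parameters introduced by that pull request, not part of a standard release):

# Assumes `train` is the DMatrix built in the scripts above
model <- xgb.train(params = list(nthread = 8,
                                 tree_method = "hist",
                                 max_bin = 255,
                                 enable_feature_grouping = 0,
                                 use_columnar_access = 0),
                   data = train,
                   nrounds = 10)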

@hcho3 (Collaborator) commented Jul 24, 2017

Here are the results on a c3.8xlarge instance.

Configuration                                      Setup time   Peak memory usage   Full log
enable_feature_grouping=1, use_columnar_access=1   190 sec      30.5 GB             log.txt
enable_feature_grouping=0, use_columnar_access=1   69 sec       12.4 GB             log2.txt
enable_feature_grouping=0, use_columnar_access=0   40 sec       12.4 GB             log3.txt

@RAMitchell (Member) commented:

Is there any reason why we don't use a conventional weighted quantile algorithm instead of a sketch in the single machine case?

@tqchen (Member) commented Jul 25, 2017

A sketch has lower memory consumption and lower memory complexity than a sort-and-find algorithm.

@RAMitchell (Member) commented:

Consider that we have not yet allocated the quantised matrix (gmat). This means we can use up to that amount of memory to find the quantiles without increasing the peak memory usage of the algorithm.

Using an in-place sort, and accounting for storing the weights, this would be about 8 bytes per nonzero matrix element for a sort-based quantile algorithm (correct me if I'm wrong here), compared to 4 bytes per nonzero for the quantised matrix.

This means you could calculate the quantiles for roughly half of your features simultaneously and never increase the peak memory usage.
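
To make the proposal concrete, here is a minimal R sketch of a sort-based weighted quantile computation for a single feature (an illustration of the idea only, not xgboost's actual sketch implementation; weighted_cuts is a hypothetical helper):

# Sort values, accumulate weights, and cut at evenly spaced weighted
# ranks to get up to max_bin - 1 candidate split points.
weighted_cuts <- function(x, w, max_bin = 255) {
  ord <- order(x)                  # the O(n log n) comparison sort under discussion
  x <- x[ord]; w <- w[ord]
  cum_w <- cumsum(w)               # weighted rank of each sorted value
  targets <- sum(w) * seq_len(max_bin - 1) / max_bin
  idx <- pmin(findInterval(targets, cum_w) + 1, length(x))
  unique(x[idx])                   # candidate split points
}

set.seed(1)
cuts <- weighted_cuts(rnorm(1e6), runif(1e6), max_bin = 15)

An in-place radix sort would replace order() here and drop the log n factor, at the cost of extra implementation work.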

@tqchen (Member) commented Jul 26, 2017

Sure. In terms of time complexity, the sketch has a smaller time cost than sorting (that additional log n factor is actually quite a lot for a big dataset). That is one reason most people use a sketch-based algorithm for summaries even when the data is in memory.

@RAMitchell (Member) commented:

We could use radix sort, though, which doesn't have the log n factor and can still be performed in place.

@tqchen (Member) commented Jul 28, 2017

@RAMitchell I get your point :) but let us keep the current method for now, as it is also natural for the distributed version.

@RAMitchell (Member) commented:

Ok no problem :) I might look at doing this on the GPU to resolve the bottleneck for us.

It also occurred to me that it would be nice to have a multi-core radix sort implementation within dmlc-core. We could speed up xgboost by replacing std::sort in places.

@tqchen closed this as completed Jul 4, 2018
@Laurae2 (Contributor, Author) commented Jul 4, 2018

Is this issue really resolved? #2543 was not merged, so this does not seem fixed yet (the issue is still reproducible).

@hcho3 reopened this Jul 4, 2018
@hcho3 changed the title from "Fast histogram: Extremely Sparse Datasets require massive amount of RAM" to "Fast histogram algorithm exhibits long start-up time and huge memory usage" Jul 4, 2018
@hcho3 changed the title from "Fast histogram algorithm exhibits long start-up time and huge memory usage" to "Fast histogram algorithm exhibits long start-up time and high memory usage" Jul 4, 2018
@hcho3 (Collaborator) commented Jul 4, 2018

@Laurae2 This issue was closed as part of the current triaging. We are trying to aggressively close issues from now on so that open issues like this one can get attention. I went ahead and re-opened it.

@trivialfis (Member) commented:

@hcho3 This issue is getting quite old by now. Can you summarize a bit of what's happening? From the discussion, it seems there are several problems, including long setup time, bloated memory usage from feature grouping, and some crashes ...

@Laurae2 (Contributor, Author) commented Jan 13, 2019

@trivialfis This issue covers multiple problems found when using xgboost's fast histogram initialization on large datasets with many (millions of) features.

The problems encountered, summarized:

  • fast histogram uses significantly more RAM than exact
  • fast histogram takes a very long time to start training, while exact starts immediately (histogram initialization is not expected to take many minutes or hours)
  • the older the (Intel) CPU generation, the poorer the scalability of fast histogram during histogram initialization (negative efficiency even with a low number of threads)

Possible xgboost solutions:

Possible end-user solutions, from highest to lowest priority:

  • use far fewer threads (the older the CPU generation, the lower the maximum number of threads that should be used); see the sketch after this list
  • use fewer features
  • use exact mode instead of fast histogram
  • upgrade to a newer CPU generation for better scalability (seems to have a major impact)
  • use higher-frequency RAM for additional bandwidth and better scalability
  • fill all RAM channels to get peak RAM bandwidth
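
As a concrete example of the first workaround, capping the thread count is a one-parameter change (a minimal sketch; nthread is a standard xgboost parameter, and `train` stands for any DMatrix):

params <- list(tree_method = "hist",
               max_bin = 255,
               nthread = 4)  # far fewer threads than available cores
model <- xgb.train(params = params, data = train, nrounds = 100)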

@trivialfis (Member) commented:

@Laurae2 Thanks.

@trivialfis (Member) commented:

@hcho3 I'm working on this, hopefully can get a solution soon.

@lock bot locked as resolved and limited conversation to collaborators Oct 3, 2019