
Fast histogram algorithm exhibits long start-up time and high memory usage #2326

Closed
Laurae2 opened this issue May 20, 2017 · 24 comments · Fixed by #4625

@Laurae2 (Contributor) commented May 20, 2017

Two issues in one:

  • Fast histogram xgboost does not start training for a long time (stuck building the matrix?)
  • Fast histogram xgboost uses a massive amount of memory (over 50GB of RAM, whereas exact xgboost peaks at about 10GB)

Before running, note:

  • This will not work with Microsoft R versions (incompatible, because Microsoft R spams messages about sending data collection to Microsoft while compiling packages)
  • The dataset download is 240MB, 2GB uncompressed; the RDS files sum to 650MB
  • Make sure to have about 25GB of RAM free for fast histogram (an extra 25GB seems to be committed in memory but never used, for unknown reasons)
  • Creating the dataset (3GB, 7,744,201,107,060 elements) requires my svmlight package in R
  • My scripts can be run copy & paste; just change the folders in setwd appropriately
  • I expected fast histogram to be slower than exact xgboost due to sparsity, but not by this much

Created files info:

  File: url_svmlight.tar.gz
CRC-32: c152f632
   MD4: b05d0a58ad6f53f9ad20a5f350c25df2
   MD5: 2eb74f59d06807a3845f90ed36fe8ded
 SHA-1: 605121dde12ce7baf098e30cd8856ef4ed6d5e69
  File: reput_sparse.rds
CRC-32: 846b1907
   MD4: 2ea36b208118b7d46d18ccb5fad8da98
   MD5: ce371f418463ebc799367ed41f5fd654
 SHA-1: 7f2a73609d72ff61a24a6d176c44752b68954ea6
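
To verify the downloads against the checksums above, base R can compute MD5 sums directly (a quick sketch; tools::md5sum is part of base R):

# Compare against the MD5 values listed above
tools::md5sum(c("url_svmlight.tar.gz", "reput_sparse.rds"))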

Environment info

Operating System: Windows Server 2012 R2 (bare metal) - the OS doesn't matter; it also happens on Ubuntu 17.04

Compiler: MinGW 7.1

Package used (python/R/jvm/C++): R 3.4

xgboost version used: built from source, latest commit as of today (e5e7217)

Computer specs:

  • Quad E7-88xx v4 (72 cores), but 3 CPUs were disabled to get rid of NUMA
  • 1TB of 1866MHz RAM (256GB available with NUMA disabled)
  • Not virtualized

Used 8 threads for xgboost because it was for a benchmark comparison.

Steps to reproduce

  1. Download this dataset: http://archive.ics.uci.edu/ml/datasets/URL+Reputation (svmlight format, 121 parts, 2396130 rows, 3231961 features)
  2. Load the data using my svmlight parser
  3. Train an xgboost model with the fast histogram method (the default is 255 bins; one can start directly with 15 bins)

What have you tried?

  1. Changing xgboost to exact: training starts immediately
  2. Lowering the number of threads to 4: still has not started after 30 minutes
  3. Lowering the number of bins to 15: still has not started after 30 minutes
  4. Lowering the number of bins and setting the number of threads to 1 (on a small laptop with the same environment): training eventually starts after 5 minutes

Scripts:

LIBRARY DOWNLOAD:

install.packages("devtools")
install.packages("Matrix")
install.packages("Rcpp")
install.packages("RcppEigen")
devtools::install_github("Laurae2/sparsity")
# Install xgboost separately from source

DATA LOAD:

# Libraries
library(sparsity)
library(Matrix)

# SET YOUR WORKING DIRECTORY
setwd("E:/benchmark_lot/data")

# Read the first day and pad the matrix to the full feature count
data <- read.svmlight("Day0.svm")
data$matrix@Dim[2] <- 3231962L
# Extend the column pointer slot @p to match the padded column count
data$matrix@p[length(data$matrix@p):3231963] <- data$matrix@p[length(data$matrix@p)]
# Drop the spurious last row produced by the parser
data$matrix <- data$matrix[1:(data$matrix@Dim[1] - 1), ]
# Map labels from {-1, 1} to {0, 1}
label <- (data$labels[1:(data$matrix@Dim[1])] + 1) / 2
data <- data$matrix

new_data <- list()

for (i in 1:120) {
  # Rotating buffer slot 1..10; files are merged in batches of 10
  indexed <- (i %% 10) + (10 * ((i %% 10) == 0))
  new_data[[indexed]] <- read.svmlight(paste0("Day", i, ".svm"))
  # Same padding and cleanup as for Day0 above
  new_data[[indexed]]$matrix@Dim[2] <- 3231962L
  new_data[[indexed]]$matrix@p[length(new_data[[indexed]]$matrix@p):3231963] <- new_data[[indexed]]$matrix@p[length(new_data[[indexed]]$matrix@p)]
  new_data[[indexed]]$matrix <- new_data[[indexed]]$matrix[1:(new_data[[indexed]]$matrix@Dim[1] - 1), ]
  label <- c(label, (new_data[[indexed]]$labels[1:(new_data[[indexed]]$matrix@Dim[1])] + 1) / 2)
  
  if ((i %% 10) == 0) {
    
    # Append the 10 buffered parts to the main matrix in one rbind call
    data <- rbind(data, new_data[[1]]$matrix, new_data[[2]]$matrix, new_data[[3]]$matrix, new_data[[4]]$matrix, new_data[[5]]$matrix, new_data[[6]]$matrix, new_data[[7]]$matrix, new_data[[8]]$matrix, new_data[[9]]$matrix, new_data[[10]]$matrix)
    gc(verbose = FALSE)
    
    cat("Parsed element 'Day", i, ".svm'. Sparsity: ", sprintf("%05.0f", as.numeric(data@Dim[1]) * as.numeric(data@Dim[2]) / length(data@i)), ":1. Balance: ", sprintf("%04.02f", length(label) / sum(label)), ":1.\n", sep = "")
    
  }
  
}

# Save to RDS
gc()
saveRDS(data, file = "reput_sparse.rds", compress = TRUE)

# Save labels
saveRDS(label, file = "reput_label.rds", compress = TRUE)

RUN FAST HISTOGRAM, will take forever(?) to start:

# SET YOUR WORKING DIRECTORY
setwd("E:/benchmark_lot/data")

my_data <- readRDS("reput_sparse.rds")
label <- readRDS("reput_label.rds")

library(xgboost)
data <- xgb.DMatrix(data = my_data, label = label)

gc(verbose = FALSE)
set.seed(11111)
model <- xgb.train(params = list(nthread = 8,
                                 max_depth = 5,
                                 tree_method = "hist",
                                 grow_policy = "depthwise",
                                 eta = 0.10,
                                 max_bin = 15,
                                 eval_metric = "auc",
                                 debug_verbose = 1),
                   data = data,
                   nrounds = 100,
                   watchlist = list(train = data),
                   verbose = 2,
                   early_stopping_rounds = 50)

RUN EXACT, will start immediately, and very fast. 10GB peak:

# SET YOUR WORKING DIRECTORY
setwd("E:/benchmark_lot/data")

my_data <- readRDS("reput_sparse.rds")
label <- readRDS("reput_label.rds")

library(xgboost)
data <- xgb.DMatrix(data = my_data, label = label)

gc(verbose = FALSE)
set.seed(11111)
model <- xgb.train(params = list(nthread = 8,
                                 max_depth = 5,
                                 tree_method = "exact",
                                 eta = 0.10,
                                 eval_metric = "auc"),
                   data = data,
                   nrounds = 100,
                   watchlist = list(train = data),
                   verbose = 2,
                   early_stopping_rounds = 50)

ping @hcho3

@hcho3 (Collaborator) commented Jul 5, 2017

@Laurae2 I've been working to fix the poor scaling issue for a while, for which I submitted a patch today. Let me take a look at this issue. Sorry for the delay.

@Laurae2 (Contributor, Author) commented Jul 8, 2017

@hcho3 Do you know which parameters can be used to lower RAM usage / build the dataset faster with the new version (#2493)?

It went from 30GB of RAM to 120+GB (I cancelled after 1 hour of dataset building, when it took only 4 minutes in a previous version). I am using my customized URL Reputation dataset (https://github.com/Laurae2/gbt_benchmarks#reput)

hcho3 added a commit to hcho3/xgboost that referenced this issue Jul 9, 2017
It has been reported that the new parallel algorithm (dmlc#2493) results in excessive
memory usage (see issue dmlc#2326). Until the issues are resolved, XGBoost should use
the old parallel algorithm by default. The user would have to specify
`enable_feature_grouping=1` manually to enable the new algorithm.
@hcho3 (Collaborator) commented Jul 9, 2017

Now that I look at it, both the old and new algorithms suffer from similar issues (long start-up time, excessive memory usage). The change of algorithm was primarily for faster training, so the new algorithm won't help with whatever problem is occurring at the initialization stage. I will go ahead and try to reproduce the issue on my side. Thanks!

@Laurae2 (Contributor, Author) commented Jul 9, 2017

@hcho3 Below is my training script, in case you are interested in reproducing my 120GB result:

library(Matrix)
library(xgboost)

# SET YOUR WORKING DIRECTORY
setwd("E:/benchmark_lot/data")

my_data <- readRDS("reput_sparse_final.rds") # Requires the reput customized dataset
label <- readRDS("reput_label.rds")

train_1 <- my_data[1:2250000, ]
train_2 <- label[1:2250000]
test_1 <- my_data[2250001:2396130, ]
test_2 <- label[2250001:2396130]

train <- xgb.DMatrix(data = train_1, label = train_2)
test <- xgb.DMatrix(data = test_1, label = test_2)
rm(train_1, train_2, test_1, test_2, my_data, label)
gc()

# rm(model)
gc(verbose = FALSE)
set.seed(11111)
model <- xgb.train(params = list(nthread = 8,
                                 #max_depth = 3,
                                 num_leaves = 127,
                                 tree_method = "hist",
                                 grow_policy = "depthwise",
                                 eta = 0.25,
                                 max_bin = 255,
                                 eval_metric = "auc",
                                 debug_verbose = 2),
                   data = train,
                   nrounds = 10,
                   watchlist = list(test = test),
                   verbose = 2,
                   early_stopping_rounds = 50)

I left it running for one hour; it used 130GB and was still growing. I also tried max_search_group = 5: same issue. However, with enable_feature_grouping = 0, training started after the usual expected time (about 4 minutes).

@hcho3 (Collaborator) commented Jul 10, 2017

@Laurae2 I have managed to reproduce the issue on my end. Both the old and new algorithms suffer from excessive memory usage and setup time, but the new one makes it even worse. I will submit a patch as soon as possible.

P.S. Thanks so much for writing the sparsity package. I was looking for a way to export svmlight files from sparse matrices in R, and other packages took forever(?) to complete.

@RAMitchell (Member) commented:

Is the slowdown in the hmat/gmat initialisation? Our GPU algorithms would also massively benefit if you can speed up these functions.

tqchen pushed a commit that referenced this issue Jul 10, 2017
It has been reported that the new parallel algorithm (#2493) results in excessive
memory usage (see issue #2326). Until the issues are resolved, XGBoost should use
the old parallel algorithm by default. The user would have to specify
`enable_feature_grouping=1` manually to enable the new algorithm.
@hcho3 (Collaborator) commented Jul 10, 2017

@RAMitchell Yes. The gmat initialization needs to be updated for better performance and memory usage.

@pseudotensor (Contributor) commented Jul 14, 2017

@hcho3 As an example, for the full airlines dataset (http://stat-computing.org/dataexpo/2009/, with some categoricals converted to numerical; 13GB csv, 121888025 rows and 31 columns), here are some timing numbers:

Time to read data: 163.64 # bad pandas single-threaded reader; can use data.table instead, so not a concern.
Time to np.array: 3.44
Time to DMatrix: 32.52
Training with 'gpu_hist'
[03:07:30] Device: [0] TITAN X (Pascal) with 10283 MB available device memory.
[03:07:30] Device: [1] TITAN X (Pascal) with 10284 MB available device memory.
[03:07:30] Device: [2] TITAN X (Pascal) with 11846 MB available device memory.
[03:07:31] Device: [3] TITAN X (Pascal) with 10284 MB available device memory.
[03:07:44] Allocated 3953 MB # so sparse, but not as much as Laurae's data.
Time in gmat: 16.97 seconds
Time on GPU after gmat init: 1.206 seconds # for a reasonable depth=6 doing 2 iterations. Doing 20 iterations would balance things out, but I've seen other datasets (e.g. Higgs) where depth=6 and iterations=5000 (excessive by far) would be required to balance out the time spent in gmat init.

So most of the time is spent in DMatrix creation and gmat initialization, making the GPU much less effective than it could be.

But notice that DMatrix creation is also a major issue, which I'll look at making multi-threaded on the CPU (it's single-threaded right now) or doing on the GPU. This is common for big-data problems.
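
For an R-side comparison, one can time the DMatrix construction step separately from training. A minimal sketch on synthetic data (rsparsematrix comes from the Matrix package; the sizes here are arbitrary):

# Time xgb.DMatrix construction alone, analogous to the Python timings above
library(Matrix)
library(xgboost)
set.seed(1)
X <- rsparsematrix(1e6, 1000, density = 0.01)  # ~10M nonzeros
y <- rbinom(1e6, 1, 0.5)
print(system.time(dtrain <- xgb.DMatrix(data = X, label = y)))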

@hcho3 (Collaborator) commented Jul 14, 2017

@pseudotensor It appears to me that quantile extraction is also causing difficulties with the URL dataset. I'll spend time on this issue in the next few days and get back to you.

@hcho3 (Collaborator) commented Jul 24, 2017

@Laurae2 I've submitted pull request #2543. With it, memory consumption goes down to roughly 15-30GB for the URL dataset. Setup is also faster, taking 5 minutes or less on a 32-core machine.

You can further reduce memory consumption and improve setup time by setting use_columnar_access=0.
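
For reference, a minimal sketch of how these options would be passed from R, assuming a build of the #2543 branch (enable_feature_grouping and use_columnar_access are parameters introduced by that pull request, not part of a standard release):

# Assumes `train` is the DMatrix built in the scripts above
model <- xgb.train(params = list(nthread = 8,
                                 tree_method = "hist",
                                 max_bin = 255,
                                 enable_feature_grouping = 0,
                                 use_columnar_access = 0),
                   data = train,
                   nrounds = 10)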

@hcho3 (Collaborator) commented Jul 24, 2017

Here are the results on a c3.8xlarge instance.

Configuration                                      Setup time   Peak memory usage   Full log
enable_feature_grouping=1, use_columnar_access=1   190 sec      30.5 GB             log.txt
enable_feature_grouping=0, use_columnar_access=1   69 sec       12.4 GB             log2.txt
enable_feature_grouping=0, use_columnar_access=0   40 sec       12.4 GB             log3.txt

@RAMitchell (Member) commented:

Is there any reason why we don't use a conventional weighted quantile algorithm instead of a sketch in the single machine case?

@tqchen (Member) commented Jul 25, 2017

A sketch has lower memory consumption and lower memory complexity than a sort-and-find algorithm.

@RAMitchell (Member) commented:

Consider that we have not yet allocated the quantised matrix (gmat). This means we can use up to that amount of memory to find the quantiles without increasing the peak memory usage of the algorithm.

Using an in-place sort, and accounting for storing the weights, this would be about 8 bytes per nonzero matrix element for a sort-based quantile algorithm (correct me if I'm wrong here), compared to 4 bytes per nonzero for the quantised matrix.

This means you could calculate the quantiles for roughly half of your features simultaneously and never increase the peak memory usage.
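
To make the proposal concrete, here is a minimal R sketch of a sort-based weighted quantile computation for a single feature (an illustration of the idea only, not xgboost's actual sketch implementation; weighted_cuts is a hypothetical helper):

# Sort values, accumulate weights, and cut at evenly spaced weighted
# ranks to get up to max_bin - 1 candidate split points.
weighted_cuts <- function(x, w, max_bin = 255) {
  ord <- order(x)                  # the O(n log n) comparison sort under discussion
  x <- x[ord]; w <- w[ord]
  cum_w <- cumsum(w)               # weighted rank of each sorted value
  targets <- sum(w) * seq_len(max_bin - 1) / max_bin
  idx <- pmin(findInterval(targets, cum_w) + 1, length(x))
  unique(x[idx])                   # candidate split points
}

set.seed(1)
cuts <- weighted_cuts(rnorm(1e6), runif(1e6), max_bin = 15)

An in-place radix sort would replace order() here and drop the log n factor, at the cost of extra implementation work.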

@tqchen (Member) commented Jul 26, 2017

Sure. In terms of time complexity, the sketch has a smaller time cost than sorting (that additional log n factor is actually quite a lot for a big dataset). That is one reason most people use a sketch-based algorithm for summaries even when the data is in memory.

@RAMitchell (Member) commented:

We could use radix sort, though, which doesn't have the log n factor and can still be performed in place.

@tqchen (Member) commented Jul 28, 2017

@RAMitchell I get your point :) but let us keep the current method for now, as it is also natural for the distributed version.

@RAMitchell (Member) commented:

Ok no problem :) I might look at doing this on the GPU to resolve the bottleneck for us.

It also occurred to me that it would be nice to have a multi-core radix sort implementation within dmlc-core. We could speed up xgboost by replacing std::sort in places.

@tqchen closed this as completed Jul 4, 2018
@Laurae2 (Contributor, Author) commented Jul 4, 2018

Is this issue really resolved? #2543 was not merged, so this does not seem fixed yet (the issue is still reproducible).

@hcho3 reopened this Jul 4, 2018
@hcho3 changed the title from "Fast histogram: Extremely Sparse Datasets require massive amount of RAM" to "Fast histogram algorithm exhibits long start-up time and huge memory usage" Jul 4, 2018
@hcho3 changed the title from "Fast histogram algorithm exhibits long start-up time and huge memory usage" to "Fast histogram algorithm exhibits long start-up time and high memory usage" Jul 4, 2018
@hcho3 (Collaborator) commented Jul 4, 2018

@Laurae2 This issue was closed as part of the current triaging. We are trying to aggressively close issues from now on so that open issues like this one can get attention. I went ahead and re-opened it.

@trivialfis (Member) commented:

@hcho3 This issue is getting quite old by now. Can you summarize a bit of what's happening? From the discussion, it seems there are several problems, including long setup time, bloated memory usage from feature grouping, and some crashes ...

@Laurae2 (Contributor, Author) commented Jan 13, 2019

@trivialfis This issue covers multiple problems found when using xgboost's fast histogram initialization on large datasets with many (millions of) features.

The problems encountered, summarized:

  • fast histogram uses significantly more RAM than exact
  • fast histogram takes a very long time to start training, while exact starts immediately (histogram initialization is not expected to take many minutes or hours)
  • the older the (Intel) CPU generation, the poorer the scalability of fast histogram during histogram initialization (negative efficiency even with a low number of threads)

Possible xgboost solutions:

Possible end-user solutions, from highest to lowest priority:

  • use far fewer threads (the older the CPU generation, the lower the maximum number of threads that should be used); see the sketch after this list
  • use fewer features
  • use exact mode instead of fast histogram
  • upgrade to a newer CPU generation for better scalability (seems to have a major impact)
  • use higher-frequency RAM for additional bandwidth and better scalability
  • fill all RAM channels to get peak RAM bandwidth
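
As a concrete example of the first workaround, capping the thread count is a one-parameter change (a minimal sketch; nthread is a standard xgboost parameter, and `train` stands for any DMatrix):

params <- list(tree_method = "hist",
               max_bin = 255,
               nthread = 4)  # far fewer threads than available cores
model <- xgb.train(params = params, data = train, nrounds = 100)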

@trivialfis (Member) commented:

@Laurae2 Thanks.

@trivialfis (Member) commented:

@hcho3 I'm working on this, hopefully can get a solution soon.

@lock bot locked as resolved and limited conversation to collaborators Oct 3, 2019