
try #9


Open · wants to merge 57 commits into base: dist_fast_histogram_per_level
Commits (57)
7920edb
fix scalastyle error
CodingCat Oct 10, 2016
9374712
add back train method but mark as deprecated
CodingCat Sep 14, 2016
9ad2e94
fix scalastyle error
CodingCat Oct 10, 2016
3917c5d
add back train method but mark as deprecated
CodingCat Sep 14, 2016
0eca1d3
fix scalastyle error
CodingCat Oct 10, 2016
251b242
add back train method but mark as deprecated
CodingCat Sep 14, 2016
93b7f64
fix scalastyle error
CodingCat Oct 10, 2016
e510199
init
Nov 30, 2018
1de55d7
more changes
Dec 3, 2018
8b900e3
temp
Dec 6, 2018
4539e9d
update
Dec 10, 2018
8cf97a7
update rabit
Dec 10, 2018
2c5641e
change the histogram
Dec 10, 2018
e386139
update kfactor
Dec 10, 2018
fb7e77c
sync per node stats
Dec 12, 2018
1212fbf
temp
Dec 12, 2018
5be7b8b
update
Dec 19, 2018
2068a6f
final
Dec 19, 2018
40ef3fb
code clean
Dec 19, 2018
d15a359
update rabit
Dec 19, 2018
ee29b22
more cleanup
Dec 19, 2018
bd6d2f8
fix errors
Dec 19, 2018
ccfdf63
fix failed tests
Dec 19, 2018
d1593fc
enforce c++11
Dec 30, 2018
06f3be0
broadcast subsampled feature correctly
Jan 6, 2019
02b3184
revert some changes
Jan 23, 2019
edd7dde
fix lint issue
Jan 28, 2019
b7af1a5
enable monotone and interaction constraints
Jan 28, 2019
44b4e40
don't specify default for monotone and interactions
Jan 30, 2019
b7424b6
init col
Jan 10, 2019
e07c414
temp
Jan 10, 2019
f36fa28
col sampling
Jan 19, 2019
9b4b1a4
fix histmatrix init
Jan 19, 2019
4b91193
fix col sampling
Jan 20, 2019
ca26f57
remove cout
Jan 20, 2019
8a4f009
fix out of bound access
Jan 22, 2019
98c8c30
fix core dump
Jan 22, 2019
9cf4aa0
disable test temporarily
Jan 22, 2019
0d97228
update
Jan 23, 2019
4bac204
add fid
Jan 23, 2019
6f148c4
print perf data
Jan 23, 2019
31a41bc
update
Jan 23, 2019
94f243d
temp
Jan 26, 2019
7db8242
temp
Jan 28, 2019
4e7cbce
pass all tests
Jan 28, 2019
c8c6572
bring back some tests
Jan 28, 2019
db4e7b8
recover some changes
Jan 28, 2019
f4e5811
recover column init part
Jan 31, 2019
50566bd
more recovery
Jan 31, 2019
8a30e7f
fix core dumps
Jan 31, 2019
9abded0
code clean
Jan 31, 2019
b0c5a22
revert some changes
Feb 5, 2019
0c8a96a
fix test compilation issue
Feb 5, 2019
be17c1f
fix lint issue
Feb 5, 2019
fb36fd4
resolve compilation issue
Feb 7, 2019
198b841
fix issues of lint caused by rebase
Feb 7, 2019
df10dcb
try
Feb 8, 2019
2 changes: 1 addition & 1 deletion jvm-packages/dev/build.sh
@@ -17,5 +17,5 @@ rm /usr/bin/python
ln -s /opt/rh/python27/root/usr/bin/python /usr/bin/python

# build xgboost
cd /xgboost/jvm-packages;mvn package
cd /xgboost/jvm-packages;ulimit -c unlimited;mvn package
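Context for the change above: ulimit -c unlimited lifts the core-file size cap, so a native crash during the mvn package run leaves a core dump for post-mortem debugging (several commits in this PR fix core dumps in the new histogram code).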

@@ -0,0 +1,84 @@
/*
Copyright (c) 2014 by Contributors

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
*/

package ml.dmlc.xgboost4j.scala.spark

import scala.collection.mutable.ListBuffer

import ml.dmlc.xgboost4j.java.XGBoostError
import ml.dmlc.xgboost4j.scala.{DMatrix, ObjectiveTrait}
import org.apache.commons.logging.{Log, LogFactory}

/**
* log-likelihood loss objective function
*/
class LogRegObj(x: Int) extends ObjectiveTrait {
private val logger: Log = LogFactory.getLog(classOf[LogRegObj])
/**
* user-defined objective function; returns the gradient and second-order gradient
*
* @param predicts untransformed margin predicts
* @param dtrain training data
* @return List with two float arrays, corresponding to the first-order and second-order gradients
*/
override def getGradient(predicts: Array[Array[Float]], dtrain: DMatrix)
: List[Array[Float]] = {
val nrow = predicts.length
val gradients = new ListBuffer[Array[Float]]
var labels: Array[Float] = null
try {
labels = dtrain.getLabel
} catch {
case e: XGBoostError =>
logger.error(e)
return null
case _: Throwable =>
return null
}
val grad = new Array[Float](nrow)
val hess = new Array[Float](nrow)
val transPredicts = transform(predicts)

for (i <- 0 until nrow) {
val predict = transPredicts(i)(0)
grad(i) = predict - labels(i)
hess(i) = predict * (1 - predict)
}
gradients += grad
gradients += hess
gradients.toList
}

/**
* simple sigmoid function
*
* @param input untransformed margin value
* @return sigmoid of the input; note that this function makes no attempt at
* numerical stability and is only used as an example
*/
def sigmoid(input: Float): Float = {
(1 / (1 + Math.exp(-input))).toFloat
}

def transform(predicts: Array[Array[Float]]): Array[Array[Float]] = {
val nrow = predicts.length
val transPredicts = Array.fill[Float](nrow, 1)(0)
for (i <- 0 until nrow) {
transPredicts(i)(0) = sigmoid(predicts(i)(0))
}
transPredicts
}

}
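For reviewers unfamiliar with the custom-objective hook, here is a minimal sketch of how LogRegObj plugs into the single-machine Scala API; the object name and data paths are illustrative assumptions, not part of this PR.

import ml.dmlc.xgboost4j.scala.{DMatrix, XGBoost}
import ml.dmlc.xgboost4j.scala.spark.LogRegObj

object LogRegObjExample {
  def main(args: Array[String]): Unit = {
    // Assumed demo-data paths (the same files the Java tests below use).
    val trainMat = new DMatrix("../../demo/data/agaricus.txt.train")
    val testMat = new DMatrix("../../demo/data/agaricus.txt.test")
    val params: Map[String, Any] = Map("eta" -> 1.0, "max_depth" -> 2, "silent" -> 1)
    val watches = Map("train" -> trainMat, "test" -> testMat)
    // The obj argument swaps in LogRegObj for the built-in objective.
    val booster = XGBoost.train(trainMat, params, round = 2,
      watches = watches, obj = new LogRegObj(1))
  }
}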
@@ -19,14 +19,22 @@ package ml.dmlc.xgboost4j.scala.spark
import java.io.{File, FileNotFoundException}
import java.util.Arrays

import ml.dmlc.xgboost4j.scala.DMatrix
import scala.collection.mutable.ListBuffer

import ml.dmlc.xgboost4j.scala.{DMatrix, ObjectiveTrait}
import scala.util.Random

import ml.dmlc.xgboost4j.java.XGBoostError
import ml.dmlc.xgboost4j.scala.spark.params.DefaultXGBoostParamsWriter
import org.apache.commons.logging.{Log, LogFactory}

import org.apache.spark.ml.feature._
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.network.util.JavaUtils
import org.scalatest.{BeforeAndAfterAll, FunSuite}

import org.apache.spark.SparkContext

class PersistenceSuite extends FunSuite with PerTest with BeforeAndAfterAll {

private var tempDir: File = _
@@ -162,5 +170,25 @@ class PersistenceSuite extends FunSuite with PerTest with BeforeAndAfterAll {
assert(xgbModel.getNumRound === xgbModel2.getNumRound)
assert(xgbModel.getRawPredictionCol === xgbModel2.getRawPredictionCol)
}

test("test persistence of XGBoostClassificationModel with customizedObj and" +
" customizedEval") {
val r = new Random(0)
// maybe move to shared context, but requires session to import implicits
val df = ss.createDataFrame(Seq.fill(100)(r.nextInt(2)).map(i => (i, i))).
toDF("feature", "label")

val assembler = new VectorAssembler()
.setInputCols(df.columns.filter(!_.contains("label")))
.setOutputCol("features")
val transformedDF = assembler.transform(df)

val paramMap = Map("eta" -> "0.1", "max_depth" -> "6", "silent" -> "1",
"objective" -> "binary:logistic", "num_round" -> "10", "num_workers" -> numWorkers)
val xgb = new XGBoostClassifier(paramMap).setCustomObj(new LogRegObj(1)).
setObjectiveType("regression")
val xgbModel = xgb.fit(transformedDF)
DefaultXGBoostParamsWriter.getMetadataToSave(xgbModel, SparkContext.getOrCreate(), None, None)
}
}

@@ -16,7 +16,7 @@

package ml.dmlc.xgboost4j.scala.spark

import ml.dmlc.xgboost4j.scala.{DMatrix, XGBoost => ScalaXGBoost}
import ml.dmlc.xgboost4j.scala.{DMatrix, ObjectiveTrait, XGBoost => ScalaXGBoost}

import org.apache.spark.ml.linalg._
import org.apache.spark.ml.param.ParamMap
@@ -99,7 +99,6 @@ class XGBoostGeneralSuite extends FunSuite with PerTest {
assert(eval.eval(model._booster.predict(testDM, outPutMargin = true), testDM) < 0.1)
}


test("training with Scala-implemented Rabit tracker") {
val eval = new EvalError()
val training = buildDataFrame(Classification.train)
@@ -111,7 +110,31 @@ class XGBoostGeneralSuite extends FunSuite with PerTest {
assert(eval.eval(model._booster.predict(testDM, outPutMargin = true), testDM) < 0.1)
}

test("test with fast histo with monotone_constraints") {
test("test with fast histo with monotone_constraints (lossguide)") {
val eval = new EvalError()
val training = buildDataFrame(Classification.train)
val testDM = new DMatrix(Classification.test.iterator)
val paramMap = Map("eta" -> "1",
"max_depth" -> "6", "silent" -> "1",
"objective" -> "binary:logistic", "tree_method" -> "hist", "grow_policy" -> "lossguide",
"num_round" -> 5, "num_workers" -> numWorkers, "monotone_constraints" -> "(1, 0)")
val model = new XGBoostClassifier(paramMap).fit(training)
assert(eval.eval(model._booster.predict(testDM, outPutMargin = true), testDM) < 0.1)
}

test("test with fast histo with interaction_constraints (lossguide)") {
val eval = new EvalError()
val training = buildDataFrame(Classification.train)
val testDM = new DMatrix(Classification.test.iterator)
val paramMap = Map("eta" -> "1",
"max_depth" -> "6", "silent" -> "1",
"objective" -> "binary:logistic", "tree_method" -> "hist", "grow_policy" -> "lossguide",
"num_round" -> 5, "num_workers" -> numWorkers, "interaction_constraints" -> "[[1,2],[2,3,4]]")
val model = new XGBoostClassifier(paramMap).fit(training)
assert(eval.eval(model._booster.predict(testDM, outPutMargin = true), testDM) < 0.1)
}
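// For reference: "(1, 0)" asks for a non-decreasing relation on the first
// feature and no constraint on the second, while "[[1,2],[2,3,4]]" allows a
// branch to combine only features drawn from one of the listed index groups.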

test("test with fast histo with monotone_constraints (depthwise)") {
val eval = new EvalError()
val training = buildDataFrame(Classification.train)
val testDM = new DMatrix(Classification.test.iterator)
@@ -123,7 +146,7 @@ class XGBoostGeneralSuite extends FunSuite with PerTest {
assert(eval.eval(model._booster.predict(testDM, outPutMargin = true), testDM) < 0.1)
}

test("test with fast histo with interaction_constraints") {
test("test with fast histo with interaction_constraints (depthwise)") {
val eval = new EvalError()
val training = buildDataFrame(Classification.train)
val testDM = new DMatrix(Classification.test.iterator)
@@ -83,7 +83,7 @@ private Booster trainBooster(DMatrix trainMat, DMatrix testMat) throws XGBoostError {
//train a boost model
return XGBoost.train(trainMat, paramMap, round, watches, null, null);
}

@Test
public void testBoosterBasic() throws XGBoostError, IOException {

@@ -536,15 +536,12 @@ public void testGetFeatureImportanceTotalCover() throws XGBoostError {
@Test
public void testFastHistoDepthwiseMaxDepth() throws XGBoostError {
DMatrix trainMat = new DMatrix("../../demo/data/agaricus.txt.train");
DMatrix testMat = new DMatrix("../../demo/data/agaricus.txt.test");
// testBoosterWithFastHistogram(trainMat, testMat);
Map<String, Object> paramMap = new HashMap<String, Object>() {
{
put("max_depth", 3);
put("silent", 1);
put("objective", "binary:logistic");
put("tree_method", "hist");
put("max_depth", 2);
put("grow_policy", "depthwise");
put("eval_metric", "auc");
}
2 changes: 0 additions & 2 deletions src/common/column_matrix.h
@@ -71,7 +71,6 @@ class ColumnMatrix {
double sparse_threshold) {
const auto nfeature = static_cast<bst_uint>(gmat.cut.row_ptr.size() - 1);
const size_t nrow = gmat.row_ptr.size() - 1;

// identify type of each column
feature_counts_.resize(nfeature);
type_.resize(nfeature);
@@ -131,7 +130,6 @@
// max() indicates missing values
}
}

// loop over all rows and fill column entries
// num_nonzeros[fid] = how many nonzeros have this feature accumulated so far?
std::vector<size_t> num_nonzeros;
13 changes: 3 additions & 10 deletions src/common/hist_util.cc
@@ -1,8 +1,6 @@
/*!
* Copyright 2017-2018 by Contributors
* Copyright 2019 by Contributors
* \file hist_util.h
* \brief Utilities to store histograms
* \author Philip Cho, Tianqi Chen
*/
#include <rabit/rabit.h>
#include <dmlc/omp.h>
@@ -57,15 +55,13 @@ void HistCutMatrix::Init(DMatrix* p_fmat, uint32_t max_num_bins) {
SparsePage::Inst inst = batch[i];
for (auto& ins : inst) {
if (ins.index >= begin && ins.index < end) {
sketchs[ins.index].Push(ins.fvalue,
weights.size() > 0 ? weights[ridx] : 1.0f);
sketchs[ins.index].Push(ins.fvalue, weights.size() > 0 ? weights[ridx] : 1.0f);
}
}
}
}
}
}

Init(&sketchs, max_num_bins);
}

@@ -135,7 +131,6 @@ uint32_t HistCutMatrix::GetBinIdx(const Entry& e) {

void GHistIndexMatrix::Init(DMatrix* p_fmat, int max_num_bins) {
cut.Init(p_fmat, max_num_bins);

const int nthread = omp_get_max_threads();
const uint32_t nbins = cut.row_ptr.back();
hit_count.resize(nbins, 0);
@@ -148,7 +143,6 @@ void GHistIndexMatrix::Init(DMatrix* p_fmat, int max_num_bins) {
row_ptr.push_back(batch[i].size() + row_ptr.back());
}
index.resize(row_ptr.back());

CHECK_GT(cut.cut.size(), 0U);
CHECK_EQ(cut.row_ptr.back(), cut.cut.size());

@@ -159,11 +153,10 @@ void GHistIndexMatrix::Init(DMatrix* p_fmat, int max_num_bins) {
size_t ibegin = row_ptr[rbegin + i];
size_t iend = row_ptr[rbegin + i + 1];
SparsePage::Inst inst = batch[i];

CHECK_EQ(ibegin + inst.size(), iend);

for (bst_uint j = 0; j < inst.size(); ++j) {
uint32_t idx = cut.GetBinIdx(inst[j]);

index[ibegin + j] = idx;
++hit_count_tloc_[tid * nbins + idx];
}
2 changes: 1 addition & 1 deletion src/common/hist_util.h
@@ -26,7 +26,7 @@ struct HistCutMatrix {
std::vector<bst_float> min_val;
/*! \brief the cut field */
std::vector<bst_float> cut;
uint32_t GetBinIdx(const Entry &e);
uint32_t GetBinIdx(const Entry& e);

using WXQSketch = common::WXQuantileSketch<bst_float, bst_float>;

2 changes: 0 additions & 2 deletions src/common/random.h
@@ -103,10 +103,8 @@ class ColumnSampler {
std::shuffle(new_features.begin(), new_features.end(), common::GlobalRandom());
new_features.resize(n);
std::sort(new_features.begin(), new_features.end());

// ensure that new_features are the same across ranks
rabit::Broadcast(&new_features, 0);

return p_new_features;
}

1 change: 0 additions & 1 deletion src/tree/split_evaluator.cc
@@ -412,7 +412,6 @@ class InteractionConstraint final : public SplitEvaluator {
if (!CheckInteractionConstraint(featureid, nodeid)) {
return -std::numeric_limits<bst_float>::infinity();
}

// Otherwise, get score from inner evaluator
bst_float score = inner_->ComputeSplitScore(
nodeid, featureid, left_stats, right_stats, left_weight, right_weight);
3 changes: 1 addition & 2 deletions src/tree/updater_histmaker.cc
@@ -364,8 +364,7 @@ class CQHistMaker: public HistMaker {
&thread_stats_, &node_stats_);
for (int const nid : this->qexpand_) {
const int wid = this->node2workindex_[nid];
this->wspace_.hset[0][fset.size() + wid * (fset.size() + 1)]
.data[0] = node_stats_[nid];
this->wspace_.hset[0][fset.size() + wid * (fset.size() + 1)].data[0] = node_stats_[nid];
}
};
// sync the histogram