[jvm-packages] group data is only set for training set and is set incorrectly #3097

gaofan0905 · 2018-02-06T19:13:28Z

When creating a watch, the input data is split randomly into trainMatrix and testMatrix, but the input groupData is set only on trainMatrix. Moreover, the groupData param describes the original data set, so it no longer fits the split trainMatrix.
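To make the mismatch concrete, here is a minimal sketch in plain Scala with no XGBoost dependency. The names `GroupMismatch`, `randomSplit` and `rowsCovered` are hypothetical and exist only for this illustration. It draws one random number per row, as the watch does, and compares the number of rows that land in the training split with the number of rows the original group sizes describe.

```scala
import scala.util.Random

object GroupMismatch {
  /** Randomly assigns each row to train (true) or test (false), one draw per row. */
  def randomSplit(numRows: Int, trainTestRatio: Double, seed: Long): Seq[Boolean] = {
    val r = new Random(seed)
    Seq.fill(numRows)(r.nextDouble() <= trainTestRatio)
  }

  /** Number of rows that a list of group sizes claims to cover. */
  def rowsCovered(groups: Seq[Int]): Int = groups.sum

  def main(args: Array[String]): Unit = {
    // Group sizes for the full data set: 3 + 2 + 5 = 10 rows.
    val groups = Seq(3, 2, 5)
    val inTrain = randomSplit(groups.sum, trainTestRatio = 0.5, seed = 42L)
    val trainRows = inTrain.count(identity)
    // The group sizes still describe all 10 rows, while the training split
    // will usually hold fewer, and the rows it holds are no longer contiguous
    // within their original groups.
    println(s"rows in train split: $trainRows, rows described by groups: ${rowsCovered(groups)}")
  }
}
```

Whenever `trainTestRatio` is below 1.0, the two numbers will usually disagree, which is the situation reported here.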

https://github.com/dmlc/xgboost/blob/master/jvm-packages/xgboost4j-spark/src/main/scala/ml/dmlc/xgboost4j/scala/spark/XGBoost.scala#L520

superbobry · 2018-02-06T23:39:25Z

Yes, this is a known limitation of the current group data support. The "right" way to fix this is to make the group explicitly available for each row in the input data frame, e.g. via #2749.
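As a rough sketch of why per-row group information helps (this is not the implementation proposed in #2749; `PerRowGroups`, `Row`, `groupSizes` and `splitByGroup` are hypothetical names, and the feature payload is left out): once every row carries its own group id, group sizes can be recomputed for any subset of rows, and a split can send whole groups to one side.

```scala
object PerRowGroups {
  /** A row tagged with its group (query) id; features are omitted for brevity. */
  final case class Row(qid: Int, label: Float)

  /** Sizes of consecutive runs of equal qid, i.e. an array of group sizes. */
  def groupSizes(rows: Seq[Row]): Array[Int] = {
    rows.foldLeft(List.empty[(Int, Int)]) {
      // Same qid as the previous row: extend the current run.
      case ((qid, n) :: rest, row) if qid == row.qid => (qid, n + 1) :: rest
      // Different qid (or first row): start a new run.
      case (acc, row) => (row.qid, 1) :: acc
    }.reverse.map(_._2).toArray
  }

  /** Sends whole groups to train or test, so no group straddles the split. */
  def splitByGroup(rows: Seq[Row], isTrain: Int => Boolean): (Seq[Row], Seq[Row]) =
    rows.partition(row => isTrain(row.qid))
}
```

With this shape, the group sizes passed to each matrix are derived from the rows that actually ended up in it, instead of from the unsplit data set.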

gaofan0905 · 2018-02-07T01:36:05Z

It seems #2749 is merged. Is there any example of how to set the group id in a DataFrame and pass it to XGBoost?

superbobry · 2018-02-08T15:34:07Z

No, it has not been merged yet. Exposing this in the JVM wrapper would require a little bit of work as well.

gaofan0905 · 2018-02-08T22:21:05Z

I made some local changes like the following. Would that work?

Also, when looking at the results, I found that for some records the prediction differs for the same input. Is that inherent to distributed training, in that only a random subset of all trees is used?

```scala
import scala.collection.mutable
import scala.util.Random
import scala.util.control.Breaks.{break, breakable}

import org.apache.spark.TaskContext

private object Watches {
  def apply(
      params: Map[String, Any],
      labeledPoints: Iterator[XGBLabeledPoint],
      baseMarginsOpt: Option[Array[Float]],
      cacheDirName: Option[String]): Watches = {
    val trainTestRatio = params.get("trainTestRatio").map(_.toString.toDouble).getOrElse(1.0)
    if (params.contains("groupData") && params("groupData") != null) {
      // Group sizes for the current partition.
      val groups = params("groupData")
        .asInstanceOf[Seq[Seq[Int]]](TaskContext.getPartitionId()).toArray
      val total = groups.sum
      var cnt = 0
      var index = 0
      // Find the first group boundary at which the cumulative row fraction
      // reaches trainTestRatio; cnt ends up as the number of training rows.
      breakable {
        for (i <- groups.indices) {
          index = i
          cnt += groups(i)
          if (1.0 * cnt / total >= trainTestRatio) break
        }
      }

      val (trainGroups, testGroups) = groups.splitAt(index + 1)

      val trainPoints = mutable.ArrayBuffer.empty[XGBLabeledPoint]
      val testPoints = mutable.ArrayBuffer.empty[XGBLabeledPoint]

      // The first cnt points (whole groups) go to train, the rest to test.
      while (labeledPoints.hasNext) {
        val p = labeledPoints.next()
        if (trainPoints.size < cnt) {
          trainPoints += p
        } else {
          testPoints += p
        }
      }

      val trainMatrix = new DMatrix(trainPoints.iterator, cacheDirName.map(_ + "/train").orNull)
      val testMatrix = new DMatrix(testPoints.iterator, cacheDirName.map(_ + "/test").orNull)

      for (baseMargins <- baseMarginsOpt) {
        val (trainMargin, testMargin) = baseMargins.splitAt(cnt)
        trainMatrix.setBaseMargin(trainMargin)
        testMatrix.setBaseMargin(testMargin)
      }

      trainMatrix.setGroup(trainGroups)
      testMatrix.setGroup(testGroups)
      new Watches(trainMatrix, testMatrix, cacheDirName)
    } else {
      // No group data: random per-row split.
      val seed = params.get("seed").map(_.toString.toLong).getOrElse(System.nanoTime())
      val r = new Random(seed)
      val testPoints = mutable.ArrayBuffer.empty[XGBLabeledPoint]
      val trainPoints = labeledPoints.filter { labeledPoint =>
        val accepted = r.nextDouble() <= trainTestRatio
        if (!accepted) {
          testPoints += labeledPoint
        }

        accepted
      }
      val trainMatrix = new DMatrix(trainPoints, cacheDirName.map(_ + "/train").orNull)
      val testMatrix = new DMatrix(testPoints.iterator, cacheDirName.map(_ + "/test").orNull)
      // Replay the same random sequence so the margins follow the same split.
      r.setSeed(seed)
      for (baseMargins <- baseMarginsOpt) {
        val (trainMargin, testMargin) = baseMargins.partition(_ => r.nextDouble() <= trainTestRatio)
        trainMatrix.setBaseMargin(trainMargin)
        testMatrix.setBaseMargin(testMargin)
      }

      new Watches(trainMatrix, testMatrix, cacheDirName)
    }
  }
}
```
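The group-boundary search in the first branch can be isolated and checked on its own. The following is a sketch only, assuming the same rule as the snippet (cut at the first group whose cumulative row fraction reaches `trainTestRatio`); `GroupAlignedSplit` and `splitGroups` are hypothetical names, and nothing here touches `DMatrix`.

```scala
object GroupAlignedSplit {
  /**
   * Splits group sizes at the first group boundary where the cumulative
   * row fraction reaches trainTestRatio.
   * Returns (trainGroups, testGroups, number of rows in the train part).
   */
  def splitGroups(groups: Array[Int], trainTestRatio: Double): (Array[Int], Array[Int], Int) = {
    val total = groups.sum
    // cumulative(i) = number of rows in groups 0..i
    val cumulative = groups.scanLeft(0)(_ + _).tail
    // Index of the first group at which the ratio is reached, or -1 if none.
    val boundary = cumulative.indexWhere(c => total > 0 && c.toDouble / total >= trainTestRatio)
    // If the ratio is never reached, every group goes to train.
    val numTrainGroups = if (boundary < 0) groups.length else boundary + 1
    val (trainGroups, testGroups) = groups.splitAt(numTrainGroups)
    (trainGroups, testGroups, trainGroups.sum)
  }
}
```

For group sizes `Array(3, 2, 5)` and a ratio of 0.5, the cut falls after the second group, so the first five rows go to train and the last five to test. Because the cut can only fall on a group boundary, the achieved ratio may exceed the requested one when groups are large.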

hcho3 · 2018-07-04T23:05:33Z

#2749 has been merged now. I think additional work would be needed to enable grouping in the test set.

All feature requests are now consolidated to #3439. This issue should be re-opened if someone decides to actively work on implementing this feature.

CodingCat · 2018-07-07T18:12:50Z

In the master branch of XGBoost, users can now supply per-instance group info (like qid); see #3369.

hcho3 closed this as completed Jul 4, 2018

hcho3 mentioned this issue Jul 4, 2018 in Roadmap: feature requests #3439

lock bot locked as resolved and limited conversation to collaborators Oct 24, 2018
