[JVM-Package][WIP] Add missing value as parameter for DMatrix. #4954
@CodingCat I got some kind help for xgboost/jvm-packages/xgboost4j-spark/src/main/scala/ml/dmlc/xgboost4j/scala/spark/XGBoost.scala Line 256 in 185e3f1
Could you elaborate on why Scala needs to remove missing values itself? This needs more investigation, but I still don't understand why Scala needs to handle missing values itself.
@CodingCat I think I found the error, here this dataframe is passed as multiple data iterators: Line 52 in 010b8f1
The first row is passed as a standalone iter (hence creating a DMatrix with single row), and Scala side removes the I think the primary reason of doing this is because creating xgboost/src/data/simple_csr_source.cc Line 73 in 010b8f1
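For readers following the thread, here is a minimal sketch of what removing missing values on the JVM side could look like before a row reaches the native library. `MissingValueFilter`, `isMissing`, and `keptIndices` are hypothetical names invented for illustration; this is not the code in XGBoost.scala, and the actual Scala logic may differ.

```java
// Hypothetical helper, NOT part of xgboost4j: shows the idea of dropping
// entries equal to the configured missing value from a dense feature row.
final class MissingValueFilter {

    // NaN never compares equal to itself, so a NaN missing value needs
    // Float.isNaN rather than ==.
    static boolean isMissing(float value, float missing) {
        return Float.isNaN(missing) ? Float.isNaN(value) : value == missing;
    }

    // Returns the column indices of the entries that are kept, i.e. the
    // index array of the sparse row that would be handed to the native side.
    static int[] keptIndices(float[] dense, float missing) {
        int kept = 0;
        for (float v : dense) {
            if (!isMissing(v, missing)) {
                kept++;
            }
        }
        int[] indices = new int[kept];
        int k = 0;
        for (int i = 0; i < dense.length; i++) {
            if (!isMissing(dense[i], missing)) {
                indices[k++] = i;
            }
        }
        return indices;
    }
}
```

If the native side accepted the missing value directly, as this PR proposes, a filter like this would no longer be needed in the JVM code.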
I can't go all the way up to Scala ...
I can help to remove missing value handling code in Scala
   * @param cacheInfo Cache path information, used for external memory setting, null by default.
   * @throws XGBoostError native error
   */
  def this(dataIter: Iterator[LabeledPoint], cacheInfo: String = null) {
    this(new JDMatrix(dataIter.asJava, cacheInfo))
instead of doing this, can we just add a new constructor?
  if (iter == null) {
    throw new NullPointerException("iter: null");
  }
  // 32k as batch size
  int batchSize = 32 << 10;
  Iterator<DataBatch> batchIter = new DataBatch.BatchIterator(iter, batchSize);
  long[] out = new long[1];
- XGBoostJNI.checkCall(XGBoostJNI.XGDMatrixCreateFromDataIter(batchIter, cacheInfo, out));
+ XGBoostJNI.checkCall(XGBoostJNI.XGDMatrixCreateFromDataIterEx(
+     batchIter, missing, cacheInfo, out));
let's keep the original one and add new constructor
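A minimal sketch of the shape this suggestion points at: the existing constructor stays untouched and a second overload carries the missing value. `SketchDMatrix` is a hypothetical stand-in that makes no native calls; it only records which entry point each constructor would use. The default of `Float.NaN` for the original constructor is an assumption, not something stated in this thread.

```java
// Hypothetical stand-in for the JVM DMatrix wrapper; it is NOT the xgboost4j
// class and makes no native calls. It only records which entry point a
// constructor would have used.
final class SketchDMatrix {
    final String nativeCall; // name of the native function that would be invoked
    final float missing;     // value treated as missing

    // Original constructor, kept as-is so existing callers are unaffected.
    // Assumption: without an explicit argument, NaN is the missing value.
    SketchDMatrix(java.util.Iterator<float[]> iter, String cacheInfo) {
        if (iter == null) {
            throw new NullPointerException("iter: null");
        }
        this.nativeCall = "XGDMatrixCreateFromDataIter";
        this.missing = Float.NaN;
    }

    // New overload that forwards a caller-supplied missing value.
    SketchDMatrix(java.util.Iterator<float[]> iter, float missing, String cacheInfo) {
        if (iter == null) {
            throw new NullPointerException("iter: null");
        }
        this.nativeCall = "XGDMatrixCreateFromDataIterEx";
        this.missing = missing;
    }
}
```

Keeping both constructors means callers that never pass a missing value keep their current behavior, while new callers opt in explicitly.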
@@ -48,6 +48,7 @@ class MissingValueHandlingSuite extends FunSuite with PerTest {
   test("handle Float.NaN as missing value correctly") {
     val spark = ss
     import spark.implicits._
+    println("handle Float.NaN as missing value correctly")
and remove printlns
@CodingCat So glad that you are here.
How do you want me to proceed? Should I simply add the missing value handling in C++ in this PR and remove the other changes? There's a check in this PR that will fail the current missing value handling test in Scala.
Continue #4594.