Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[JVM-Package][WIP] Add missing value as parameter for DMatrix. #4954

Closed
wants to merge 6 commits into from

Conversation

trivialfis
Copy link
Member

Continue #4594 .

@trivialfis
Copy link
Member Author

trivialfis commented Oct 16, 2019

@CodingCat I got some kind help for narrowing the test failure to this (see the aboved linked issue for a refresh):

Could you elaborate on why Scala needs to remove missing value itself?

Need some more investigations, but still don't understand why Scala needs to handle missing value itself.

@trivialfis
Copy link
Member Author

trivialfis commented Oct 21, 2019

@CodingCat I think I found the error, here this dataframe is passed as multiple data iterators:

The first row is passed as a standalone iter (hence creating a DMatrix with single row), and Scala side removes the float.NaN before pushing it into c++, so there are only 2 columns left in there.

I think the primary reason of doing this is because creating SimpleCSRSource from parser does not handle missing value,

data_vec.emplace_back(index, fvalue);
. I can implement the missing value handling in C++. Could you help removing the scala version of missing value handling? It will make some breaking changes as missing value is added as a parameter to DMatrix. It's long established in Python, but seems not be the case for jvm package.

@trivialfis
Copy link
Member Author

I can't go all the way up to scala ...

@trivialfis trivialfis changed the title Check DMatrix. [JVM-Package] Add missing value as parameter for DMatrix. Oct 21, 2019
@trivialfis trivialfis changed the title [JVM-Package] Add missing value as parameter for DMatrix. [JVM-Package][WIP] Add missing value as parameter for DMatrix. Oct 21, 2019
Copy link
Member

@CodingCat CodingCat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can help to remove missing value handling code in Scala

* @param cacheInfo Cache path information, used for external memory setting, null by default.
* @throws XGBoostError native error
*/
def this(dataIter: Iterator[LabeledPoint], cacheInfo: String = null) {
this(new JDMatrix(dataIter.asJava, cacheInfo))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

instead of doing this, can we just add a new constructor?

if (iter == null) {
throw new NullPointerException("iter: null");
}
// 32k as batch size
int batchSize = 32 << 10;
Iterator<DataBatch> batchIter = new DataBatch.BatchIterator(iter, batchSize);
long[] out = new long[1];
XGBoostJNI.checkCall(XGBoostJNI.XGDMatrixCreateFromDataIter(batchIter, cacheInfo, out));
XGBoostJNI.checkCall(XGBoostJNI.XGDMatrixCreateFromDataIterEx(
batchIter, missing, cacheInfo, out));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's keep the original one and add new constructor

@@ -48,6 +48,7 @@ class MissingValueHandlingSuite extends FunSuite with PerTest {
test("handle Float.NaN as missing value correctly") {
val spark = ss
import spark.implicits._
println("handle Float.NaN as missing value correctly")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and remove printlns

@trivialfis
Copy link
Member Author

@CodingCat So glad that you are here ..

@trivialfis
Copy link
Member Author

@CodingCat

I can help to remove missing value handling code in Scala

How do you want me to proceed? Should I simply add the missing value handling in c++ in this PR and remove other changes? There's a check in this PR that will fail the current missing value handling test in scala.

@trivialfis trivialfis closed this Feb 29, 2020
@trivialfis trivialfis deleted the validate-features branch February 29, 2020 11:24
@lock lock bot locked as resolved and limited conversation to collaborators Jun 24, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants