Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Emit predicted category using an appropriate JSON type. #877

Merged
merged 7 commits into from
Dec 6, 2019

Conversation

przemekwitek
Copy link
Contributor

@przemekwitek przemekwitek commented Dec 5, 2019

Currently, classification analysis allows dependent variable of integer or boolean type but in the results field, the prediction field is always emitted as JSON string (so true becomes "true", 1 becomes "1" etc.).

A solution to that problem is to pass desired prediction_field_type from Java to C++ and make C++ emit bool, int or string JSON field depending on the prediction_field_type passed.
This PR implements the C++ part of this solution.

Relates elastic/elasticsearch#49796

@przemekwitek przemekwitek force-pushed the prediction_field_type branch from 0825066 to 1eb3b44 Compare December 5, 2019 09:42
Copy link
Contributor

@droberts195 droberts195 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you'd rather use standard library functions to do the string to number conversions instead of my core::CStringUtils suggestions then that's fine, but please make sure exceptions won't fail the entire analysis, all possible values a 64 bit signed integer can hold are covered on all platforms and we log when there are unexpected conversion errors.

lib/api/CDataFrameTrainBoostedTreeClassifierRunner.cc Outdated Show resolved Hide resolved
lib/api/CDataFrameTrainBoostedTreeClassifierRunner.cc Outdated Show resolved Hide resolved
lib/api/CDataFrameTrainBoostedTreeClassifierRunner.cc Outdated Show resolved Hide resolved
Copy link
Contributor

@tveasey tveasey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM (modulo adhering to naming conventions).

@przemekwitek przemekwitek removed the WIP label Dec 5, 2019
@przemekwitek przemekwitek marked this pull request as ready for review December 5, 2019 10:32
Copy link
Contributor Author

@przemekwitek przemekwitek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you'd rather use standard library functions to do the string to number conversions instead of my core::CStringUtils suggestions then that's fine, but please make sure exceptions won't fail the entire analysis, all possible values a 64 bit signed integer can hold are covered on all platforms and we log when there are unexpected conversion errors.

You might think that problem 1 can be solved by using stol instead of stoi, but this is not the case, because on Windows a Java long is 64 bits but a C++ long is 32 bits. The C++ type that reliably corresponds to Java's long is int64_t.

Sounds like you've gone through this kind of issues before. Thanks for sharing. I'll gladly use the library method for doing conversions.

Additionally, I've renamed dependent_variable_type to prediction_field_type as that's how the field is really used in the code

lib/api/CDataFrameTrainBoostedTreeClassifierRunner.cc Outdated Show resolved Hide resolved
lib/api/CDataFrameTrainBoostedTreeClassifierRunner.cc Outdated Show resolved Hide resolved
lib/api/CDataFrameTrainBoostedTreeClassifierRunner.cc Outdated Show resolved Hide resolved
Copy link
Contributor

@droberts195 droberts195 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@przemekwitek
Copy link
Contributor Author

run elasticsearch-ci

@droberts195
Copy link
Contributor

@przemekwitek to get the tests to pass you need to run clang-format:

A format error has been detected within the following files:
lib/api/unittest/CDataFrameTrainBoostedTreeClassifierRunnerTest.cc

@przemekwitek
Copy link
Contributor Author

A format error has been detected

Thanks for the hint. I'm wondering if it was possible to make this message stand out more in the console output. I was looking for the failure reason yesterday before you hint but couldn't find it.

@przemekwitek przemekwitek merged commit 881e5d0 into elastic:master Dec 6, 2019
@przemekwitek przemekwitek deleted the prediction_field_type branch December 6, 2019 10:05
przemekwitek added a commit to przemekwitek/ml-cpp that referenced this pull request Dec 6, 2019
@droberts195
Copy link
Contributor

I'm wondering if it was possible to make this message stand out more in the console output.

It gets printed from here:

if [ -n "${WRONG_FORMAT_FILES}" ] ; then
echo "A format error has been detected within the following files:"
printf "%s\n" "${WRONG_FORMAT_FILES[@]}"
RC=4
else

You are welcome to open a PR to change that so that it stands out more. If you want to use something like escape sequences to change colours you can use the PR build to check the escape sequences are interpreted correctly by Jenkins - make a PR that messes up the formatting in a source file and modifies check-style.sh, check the PR build log, iterate on check-style.sh, when it's working as desired correct the formatting of the chosen source file leaving just the change to check-style.sh in the PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants