Implement new analysis type: classification #46537

Merged Oct 4, 2019 (4 commits)

Contributor

przemekwitek commented Sep 10, 2019

Implement new analysis type: Classification.
Also, extract the common parameters between Classification and Regression to a separate class: BoostedTreeParams.

This PR is not fully functional until the corresponding C++ changes are made (WIP).
However, I've sent it for review to gather feedback on the Java part.

Relates #46735

przemekwitek force-pushed the classification branch 2 times, most recently from 0193e12 to 6f92943 (September 17, 2019 05:58)
przemekwitek force-pushed the classification branch 3 times, most recently from 0f6518c to 1429840 (September 20, 2019 08:20)
przemekwitek force-pushed the classification branch 4 times, most recently from ede5e3d to 07ffd52 (September 26, 2019 13:38)
przemekwitek removed the WIP label (Sep 26, 2019)
przemekwitek marked this pull request as ready for review (September 26, 2019 13:42)
przemekwitek force-pushed the classification branch 3 times, most recently from e40d5a3 to 1d71028 (September 26, 2019 14:01)
@elasticmachine
Collaborator

Pinging @elastic/ml-core

@przemekwitek
Contributor Author

run elasticsearch-ci/bwc
run elasticsearch-ci/default-distro

if (analysis instanceof Classification) {
    Classification classification = (Classification) analysis;
    return new DatasetSplittingCustomProcessor(
        fieldNames, classification.getDependentVariable(), classification.getTrainingPercent());
Member

It almost seems like we need a new interface for the different analysis.

Unsupervised vs supervised... But that might be a future refactoring

Contributor Author

> It almost seems like we need a new interface for the different analysis.

Yes, we may end up doing that.

> But that might be a future refactoring

Agreed, let's not add more interfaces too early.
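
For illustration only, the refactoring floated in this thread might look roughly like the sketch below. Every type name here is hypothetical and none of it is part of this PR; `OutlierDetectionSketch` stands in for an unsupervised analysis, and the `double` return type of `getTrainingPercent()` is an assumption.

```java
// Hypothetical sketch: none of these types exist in the PR under these names.
interface AnalysisSketch {
    String name();
}

// Analyses that learn from a labelled field and hold out part of the data.
interface SupervisedAnalysisSketch extends AnalysisSketch {
    String getDependentVariable();

    double getTrainingPercent(); // assumed return type
}

final class ClassificationSketch implements SupervisedAnalysisSketch {
    private final String dependentVariable;
    private final double trainingPercent;

    ClassificationSketch(String dependentVariable, double trainingPercent) {
        this.dependentVariable = dependentVariable;
        this.trainingPercent = trainingPercent;
    }

    @Override
    public String name() {
        return "classification";
    }

    @Override
    public String getDependentVariable() {
        return dependentVariable;
    }

    @Override
    public double getTrainingPercent() {
        return trainingPercent;
    }
}

// Stand-in for an unsupervised analysis with no dependent variable.
final class OutlierDetectionSketch implements AnalysisSketch {
    @Override
    public String name() {
        return "outlier_detection";
    }
}

final class ProcessorChooserSketch {
    // One instanceof check covers every supervised analysis,
    // instead of one branch per concrete analysis type.
    static String describeProcessor(AnalysisSketch analysis) {
        if (analysis instanceof SupervisedAnalysisSketch) {
            SupervisedAnalysisSketch supervised = (SupervisedAnalysisSketch) analysis;
            return "split on " + supervised.getDependentVariable()
                + " at " + supervised.getTrainingPercent() + "%";
        }
        return "no splitting";
    }
}
```

With such an interface, adding another supervised analysis would not require touching the dispatch code, which is the main appeal of the idea; the cost is one more abstraction to maintain.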

Member

benwtrent left a comment

Might be good to have @dimitris-athanasiou give it a once-over :). I don't see any major problems.

przemekwitek force-pushed the classification branch 5 times, most recently from dde7a9f to 4a2583f (October 2, 2019 09:47)
przemekwitek force-pushed the classification branch 2 times, most recently from c35932e to ba8bd1d (October 2, 2019 10:09)
dimitris-athanasiou self-requested a review (October 3, 2019 08:01)
Contributor

dimitris-athanasiou left a comment

Looks good! Left a few minor comments.

(Integer) a[7],
(Double) a[8]));
parser.declareString(constructorArg(), DEPENDENT_VARIABLE);
BoostedTreeParams.declareFields(parser);
Contributor

Clever trick for reusing code.

However, this made me wonder whether those params should be in a nested object. It'd be ugly though, wouldn't it?

Contributor Author

It's a matter of taste ;)
Parsing code would actually become a bit cleaner as I could just declare the BoostedTreeParams field here and it would have its own parser.

However, with nested object:

  1. we need to double-check which parameters we want to move there. I just picked the obvious ones but maybe e.g. dependentVariable should live there as well?
  2. we need to add BWC handling

WDYT?

Contributor

Yeah, I agree we can leave it as is.

}

public Classification(String dependentVariable) {
    this(dependentVariable, new BoostedTreeParams(null, null, null, null, null), null, null, null);
Contributor

Perhaps add a default constructor for BoostedTreeParams to avoid those nulls?

Contributor Author

Done.
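
As a rough sketch of the suggestion in this thread (not the code that was actually committed), a no-arg constructor on the parameter holder could remove the run of nulls. The class names below are hypothetical stand-ins; `eta`, `maximumNumberTrees` and `featureBagFraction` appear elsewhere in this review, while `lambda` and `gamma` are assumed names for the remaining two of the five parameters.

```java
import java.util.Objects;

// Hypothetical, trimmed-down stand-in for the parameter holder discussed above.
final class BoostedTreeParamsSketch {
    // Every field is nullable; null means "not set by the user".
    final Double lambda;              // assumed name
    final Double gamma;               // assumed name
    final Double eta;
    final Integer maximumNumberTrees;
    final Double featureBagFraction;

    BoostedTreeParamsSketch(Double lambda, Double gamma, Double eta,
                            Integer maximumNumberTrees, Double featureBagFraction) {
        this.lambda = lambda;
        this.gamma = gamma;
        this.eta = eta;
        this.maximumNumberTrees = maximumNumberTrees;
        this.featureBagFraction = featureBagFraction;
    }

    // No-arg constructor: leaves every parameter unset.
    BoostedTreeParamsSketch() {
        this(null, null, null, null, null);
    }
}

// Hypothetical stand-in for the analysis class; other fields are omitted.
final class ClassificationCtorSketch {
    final String dependentVariable;
    final BoostedTreeParamsSketch boostedTreeParams;

    ClassificationCtorSketch(String dependentVariable, BoostedTreeParamsSketch boostedTreeParams) {
        this.dependentVariable = Objects.requireNonNull(dependentVariable, "dependent_variable");
        this.boostedTreeParams = Objects.requireNonNull(boostedTreeParams, "boosted_tree_params");
    }

    // The convenience constructor no longer has to spell out five nulls.
    ClassificationCtorSketch(String dependentVariable) {
        this(dependentVariable, new BoostedTreeParamsSketch());
    }
}
```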

this.dependentVariable = ExceptionsHelper.requireNonNull(dependentVariable, DEPENDENT_VARIABLE);
this.boostedTreeParams = ExceptionsHelper.requireNonNull(boostedTreeParams, BoostedTreeParams.NAME);
this.predictionFieldName = predictionFieldName;
this.numTopClasses = numTopClasses;
Contributor

Does num_top_classes have a fixed default value? If so we should set it explicitly.

Contributor Author

Done.

I think the default value should be "0".
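
A minimal sketch of what setting the default explicitly could look like, assuming the value 0 proposed in the reply above. The class and constant names are hypothetical, and the real constructor takes more arguments than shown here.

```java
// Hypothetical stand-in showing only the num_top_classes handling.
final class NumTopClassesSketch {
    // Default taken from the comment above; an assumption about the final code.
    static final int DEFAULT_NUM_TOP_CLASSES = 0;

    private final int numTopClasses;

    // A null argument means the user did not specify num_top_classes.
    NumTopClassesSketch(Integer numTopClasses) {
        this.numTopClasses = numTopClasses == null ? DEFAULT_NUM_TOP_CLASSES : numTopClasses;
    }

    int getNumTopClasses() {
        return numTopClasses;
    }
}
```

Resolving the default in the constructor means the field is never null afterwards, so serialization and equality checks do not need a separate "unset" case.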

eta = in.readOptionalDouble();
maximumNumberTrees = in.readOptionalVInt();
featureBagFraction = in.readOptionalDouble();
boostedTreeParams = new BoostedTreeParams(in);
Contributor

We need to add BWC handling here.

Contributor Author

I think the code (as it is written right now) is backward-compatible, as the sequence of StreamInput reads in the old version is the same as in the new version (the new version has the reads wrapped in the new BoostedTreeParams(in) constructor).
It would change, however, if I introduced a nested object.

Contributor

Ah, true.
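
The argument in this thread can be checked with a small, self-contained sketch. It is a simplification under stated assumptions: `java.io.DataInputStream` and `DataOutputStream` stand in for Elasticsearch's `StreamInput` and `StreamOutput`, the fields are written as plain (non-optional) values, and all class and method names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Arrays;

final class WireOrderSketch {

    // Grouped parameters; reads and writes follow the same order as the flat layout.
    static final class Params {
        final double eta;
        final int maximumNumberTrees;
        final double featureBagFraction;

        Params(double eta, int maximumNumberTrees, double featureBagFraction) {
            this.eta = eta;
            this.maximumNumberTrees = maximumNumberTrees;
            this.featureBagFraction = featureBagFraction;
        }

        Params(DataInputStream in) throws IOException {
            this.eta = in.readDouble();
            this.maximumNumberTrees = in.readInt();
            this.featureBagFraction = in.readDouble();
        }

        void writeTo(DataOutputStream out) throws IOException {
            out.writeDouble(eta);
            out.writeInt(maximumNumberTrees);
            out.writeDouble(featureBagFraction);
        }
    }

    // "Old" layout: the analysis writes each field itself.
    static byte[] writeFlat(String dependentVariable, double eta, int trees, double fraction) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeUTF(dependentVariable);
            out.writeDouble(eta);
            out.writeInt(trees);
            out.writeDouble(fraction);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // "New" layout: the same writes, delegated to the grouped object.
    static byte[] writeGrouped(String dependentVariable, Params params) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeUTF(dependentVariable);
            params.writeTo(out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // The new reader consumes bytes produced by either writer.
    static Params readGrouped(byte[] data) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            in.readUTF(); // dependent variable, ignored in this sketch
            return new Params(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] flat = writeFlat("label", 0.1, 50, 0.5);
        byte[] grouped = writeGrouped("label", new Params(0.1, 50, 0.5));
        System.out.println("identical wire format: " + Arrays.equals(flat, grouped));
    }
}
```

Because delegation does not add framing bytes, the two layouts are byte-for-byte identical. A nested object in the REST representation would be a different matter, which is why that option was tied to extra BWC handling earlier in the review.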

out.writeOptionalDouble(eta);
out.writeOptionalVInt(maximumNumberTrees);
out.writeOptionalDouble(featureBagFraction);
boostedTreeParams.writeTo(out);
Contributor

BWC handling.

Contributor Author

See my other comment.

Contributor

dimitris-athanasiou left a comment

LGTM

@przemekwitek
Contributor Author

run elasticsearch-ci/bwc

przemekwitek force-pushed the classification branch 2 times, most recently from 374f275 to 18ee05b (October 4, 2019 07:26)
@przemekwitek
Contributor Author

run elasticsearch-ci/bwc

@przemekwitek
Contributor Author

run elasticsearch-ci/default-distro
