Allow evaluation to consist of multiple steps. #46653

Merged Sep 27, 2019 (1 commit)

Conversation

przemekwitek
Contributor

@przemekwitek przemekwitek commented Sep 12, 2019

This PR adds the possibility for an evaluation to consist of more than one search step.
This is needed when the results of one aggregation are input to another aggregation and pipeline aggregations cannot be used.
TypedChainExecutor is used to execute a (dynamically built) sequence of steps.

Relates #46735

@przemekwitek przemekwitek force-pushed the classification_evaluation branch 6 times, most recently from 5915583 to 72e6565 on September 16, 2019 09:01
@przemekwitek przemekwitek added :ml Machine learning and removed WIP labels Sep 16, 2019
@elasticmachine
Collaborator

Pinging @elastic/ml-core

@przemekwitek przemekwitek marked this pull request as ready for review September 16, 2019 09:05
@dimitris-athanasiou
Contributor

This should certainly live under a new evaluation type. We need to determine a suitable name for it.

@przemekwitek przemekwitek force-pushed the classification_evaluation branch 2 times, most recently from 50177fd to c269036 on September 16, 2019 09:54
@przemekwitek
Contributor Author

> This should certainly live under a new evaluation type. We need to determine a suitable name for it.

Ok, we can discuss the final naming offline.
I'll start with "hard_classification" and will introduce the new type in this PR.
Codewise, it will reuse most of the code for "regression" (the difference will be that "hard_classification" will require categorical actual and predicted fields and the default metric will be multiclass confusion matrix).

@dimitris-athanasiou
Contributor

Let's just start with classification. I think we'll have some metrics requiring a probability and some that don't and I can't imagine how we're helping our users by forcing them to do 2 different API calls to gather them all. I think there's a chance we'll find that a classification evaluation can also do what our binary_soft_classification is now doing and that way replace it.

@przemekwitek przemekwitek force-pushed the classification_evaluation branch 4 times, most recently from f47e3c9 to ae64b7f on September 16, 2019 11:26
Member

@benwtrent benwtrent left a comment

Please add 1-2 yaml tests for coverage.

```java
new TermsValuesSourceBuilder(ACTUAL_CLASS_FIELD).field(actualField),
new TermsValuesSourceBuilder(PREDICTED_CLASS_FIELD).field(predictedField)
    )
).size(MAX_NUM_CLASSES * MAX_NUM_CLASSES)
```
Member

This is an interesting hard limit to have.

@tveasey when it comes to classification, do you think we should support > 100 classes?

@przemekwitek if we do need to support > 100 classes, I think chaining together callbacks to scroll through the composite aggregation would be necessary. It is not overly complicated, but may cause some frustrating refactoring in the search execution.
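The callback-chaining needed to scroll through a composite aggregation can be sketched generically (a simplified, synchronous stand-in; `PagedFetcher` and `fetchPage` are hypothetical names, not Elasticsearch APIs): fetch one page of buckets and follow the `after` cursor until the response reports there are no more pages.

```java
import java.util.AbstractMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;
import java.util.function.Function;

// Hypothetical, synchronous stand-in for paging through a composite
// aggregation: fetchPage maps a cursor to (buckets, next cursor or null).
class PagedFetcher<C, B> {
    private final Function<C, Map.Entry<List<B>, C>> fetchPage;

    PagedFetcher(Function<C, Map.Entry<List<B>, C>> fetchPage) {
        this.fetchPage = fetchPage;
    }

    // Consumes every bucket across all pages, following the "after" cursor
    // until the source reports there are no more pages (null cursor).
    void fetchAll(C startCursor, Consumer<B> onBucket) {
        C cursor = startCursor;
        while (cursor != null) {
            Map.Entry<List<B>, C> page = fetchPage.apply(cursor);
            page.getKey().forEach(onBucket);
            cursor = page.getValue();
        }
    }
}
```

In the real implementation each `fetchPage` call would be an asynchronous search with a listener, so the loop becomes a chain of callbacks, which is the refactoring concern raised here.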

Contributor

There are certainly cases where people would have more than 100 classes, but I think they'll be rare. We could consider this an enhancement.

Contributor Author

Ok, let's stick to the limit of 100 classes for now.
Increasing that limit may require code refactoring but should be invisible from the user's perspective.

@przemekwitek
Contributor Author

run elasticsearch-ci/2

Contributor Author

@przemekwitek przemekwitek left a comment

> Please add 1-2 yaml tests for coverage.

Done

@dimitris-athanasiou
Contributor

I haven't had the chance to look through this as closely as I'd like. Could you please hold off merging it?

Contributor

@dimitris-athanasiou dimitris-athanasiou left a comment

Looks good! Left a couple of points to consider.

@dimitris-athanasiou
Contributor

Also, could you please add documentation for this?

@przemekwitek przemekwitek force-pushed the classification_evaluation branch from 486b5ec to fc56c98 on September 18, 2019 11:49
```java
Map<String, Long> subCounts = new TreeMap<>();
counts.put(actualClass, subCounts);
Terms subAgg = bucket.getAggregations().get(AGGREGATE_BY_PREDICTED_CLASS);
maxSumOfOtherDocCounts = Math.max(maxSumOfOtherDocCounts, subAgg.getSumOfOtherDocCounts());
```
Contributor

I am a bit confused about this part and how it'd work. Here are my thoughts.

The set of actual classes may differ from that of the predicted classes. We're working with the first 1000 actual classes and, for each of them, with the first 1000 predicted classes.

I think it's fine in terms of the result matrix. It won't be a symmetric matrix, but I don't think it matters as we can still answer the question "how many times was class X classified as Y?".

However, when it comes to reporting the number of unhandled classes, I think what we do now may be confusing. There are 2 different counts at play. First, the count of unhandled actual classes which we get from the outer aggregation. Second, the count of unhandled predicted classes for each actual class we handle. I am not sure how helpful the max of all those is. Let's think a bit about this and discuss a solution.

Contributor

I like the general idea...

I think the natural way to implement this would be as follows:

  1. Order the classes by frequency (unless there is some extrinsic notion of importance, i.e. user defined list),
  2. Limit to 100 classes subject to the order defined in 1,
  3. Introduce a new class "other" which is every class not selected in 2,
  4. Report error statistics for "actual is a selected class, prediction is other" and "actual is other, prediction is a selected class".

I'd probably omit the other vs other diagonal entry. Filling this in implies the classification is correct, whereas of course we can't determine that by examining the actual classes.
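As a rough illustration of the four steps above (working on in-memory (actual, predicted) pairs rather than aggregation buckets; `OtherAwareConfusionMatrix` is a hypothetical name, not part of the Elasticsearch code base):

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Sketch: keep the N most frequent actual classes, fold everything else
// into "other", and skip the other-vs-other cell since correctness there
// cannot be determined.
class OtherAwareConfusionMatrix {
    static final String OTHER = "_other_";

    // pairs: each element is {actualClass, predictedClass}
    static Map<String, Map<String, Long>> build(List<String[]> pairs, int maxClasses) {
        // 1. order classes by frequency of the actual label
        Map<String, Long> freq = pairs.stream()
                .collect(Collectors.groupingBy(p -> p[0], Collectors.counting()));
        // 2. limit to maxClasses, most frequent first
        Set<String> selected = freq.entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
                .limit(maxClasses)
                .map(Map.Entry::getKey)
                .collect(Collectors.toSet());
        // 3. + 4. count matrix cells, folding unselected classes into "other"
        Map<String, Map<String, Long>> matrix = new TreeMap<>();
        for (String[] p : pairs) {
            String actual = selected.contains(p[0]) ? p[0] : OTHER;
            String predicted = selected.contains(p[1]) ? p[1] : OTHER;
            if (OTHER.equals(actual) && OTHER.equals(predicted)) {
                continue; // omit the other-vs-other diagonal entry
            }
            matrix.computeIfAbsent(actual, k -> new TreeMap<>())
                    .merge(predicted, 1L, Long::sum);
        }
        return matrix;
    }
}
```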

Contributor

I think that would need 2 searches: the first to figure out the most frequent actual classes and the second to get the predicted classes after filtering out classes not in the above set.

We'll need to stretch the framework a bit to allow multiple searches but it might be good to do anyhow for paving the road for auc_roc, etc.

Contributor Author

@przemekwitek przemekwitek Sep 24, 2019

This is now done: metric evaluation can consist of many steps, and the evaluation process gathers the results. PTAL

I'm also exploring if using TypedChainTaskExecutor would make sense here.

Contributor Author

Update:
I used TypedChainTaskExecutor to simplify the task chaining code.
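The chaining idea can be sketched with a minimal, synchronous stand-in (the `ChainExecutor` below is hypothetical; the real `TypedChainTaskExecutor` is asynchronous and listener-based): a task may enqueue follow-up steps while it runs, and the executor collects the typed results in order.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.function.Function;

// Minimal synchronous sketch of chained task execution.
class ChainExecutor<T> {
    private final Deque<Function<ChainExecutor<T>, T>> tasks = new ArrayDeque<>();
    private final List<T> collected = new ArrayList<>();

    // A task receives the executor so it can enqueue follow-up tasks.
    void add(Function<ChainExecutor<T>, T> task) {
        tasks.addLast(task);
    }

    // Runs tasks in order; tasks added during execution run afterwards,
    // which is how the next search step can be built dynamically from
    // the results of the previous one.
    List<T> execute() {
        while (!tasks.isEmpty()) {
            collected.add(tasks.pollFirst().apply(this));
        }
        return collected;
    }
}
```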

@dimitris-athanasiou
Contributor

I have discussed offline with @przemekwitek to do the following changes:

  • separate metric processing into 3 steps: 1. build search (aggs), 2. extract data from search response, 3. evaluate result
  • eventually split the multi-search refactoring into a separate PR
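The three-step split might look roughly like this (illustrative names only, not the actual Elasticsearch interfaces; the search response is reduced to a plain list of values to keep the sketch self-contained):

```java
import java.util.List;

// Hypothetical interface mirroring the three steps listed above.
interface MetricStep<R> {
    String buildSearch();                      // 1. build search (aggs)
    void processResponse(List<Double> values); // 2. extract data from response
    R evaluate();                              // 3. evaluate result
}

// Toy metric: accumulates values from the "response", then computes a mean.
class MeanMetric implements MetricStep<Double> {
    private double sum;
    private long count;

    @Override
    public String buildSearch() {
        // Placeholder for the aggregation request this metric would issue.
        return "{\"aggs\":{\"avg_value\":{\"avg\":{\"field\":\"value\"}}}}";
    }

    @Override
    public void processResponse(List<Double> values) {
        for (double v : values) {
            sum += v;
            count++;
        }
    }

    @Override
    public Double evaluate() {
        return count == 0 ? Double.NaN : sum / count;
    }
}
```

Separating the steps this way lets the executor interleave step 1 and step 2 across several searches before asking the metric for its final result.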

@przemekwitek przemekwitek force-pushed the classification_evaluation branch 2 times, most recently from 4ee034b to af09c1d on September 25, 2019 15:54
@przemekwitek przemekwitek changed the title from "Implement evaluation API for multiclass classification problem" to "Allow evaluation to consist of multiple steps." Sep 25, 2019
@przemekwitek przemekwitek force-pushed the classification_evaluation branch 4 times, most recently from 75a1955 to 1162769 on September 26, 2019 07:27
@przemekwitek
Contributor Author

przemekwitek commented Sep 26, 2019

> I have discussed offline with @przemekwitek to do the following changes:
>
>   • separate metric processing into 3 steps: 1. build search (aggs), 2. extract data from search response, 3. evaluate result
>   • eventually split the multi-search refactoring into a separate PR

Done.
This PR has become the refactoring PR and, as such, is ready for review.
The actual work on classification evaluation is in a separate follow-up PR: #47126

Contributor

@dimitris-athanasiou dimitris-athanasiou left a comment

I think this looks much better! A few minor comments.

```java
));
EvaluationExecutor evaluationExecutor = new EvaluationExecutor(threadPool, client, request);
// Add one task only. Other tasks will be added as needed during execution.
evaluationExecutor.add(evaluationExecutor.newTask());
```
Contributor

Perhaps we should add the first task in the constructor of the executor and then we won't need this at all.

Contributor Author

Done.

```java
    this.evaluation = request.getEvaluation();
}

private TypedChainTaskExecutor.ChainTask<Void> newTask() {
```
Contributor

Shall we call this nextTask? In a way this is like an iterator of sorts.

Contributor Author

Done.

```java
    result = evaluate(aggs);
}

private EvaluationMetricResult evaluate(Aggregations aggs) {
```
Contributor

Do we need this method now? We could inline that in process, no?

Contributor Author

Done.

```java
    result = evaluate(aggs);
}

private EvaluationMetricResult evaluate(Aggregations aggs) {
```
Contributor

ditto

Contributor Author

Done.

@przemekwitek
Contributor Author

run elasticsearch-ci/bwc

Contributor

@dimitris-athanasiou dimitris-athanasiou left a comment

LGTM

@przemekwitek przemekwitek force-pushed the classification_evaluation branch from ec88b42 to 782443b on September 26, 2019 19:22
This is groundwork for introducing classification evaluation which actually needs multistep evaluation.
@przemekwitek przemekwitek force-pushed the classification_evaluation branch from 782443b to aaf8206 on September 27, 2019 04:39
@przemekwitek
Contributor Author

run elasticsearch-ci/2

@przemekwitek przemekwitek merged commit 41d82f6 into elastic:master Sep 27, 2019
@przemekwitek przemekwitek deleted the classification_evaluation branch September 27, 2019 07:29
przemekwitek added a commit to przemekwitek/elasticsearch that referenced this pull request Sep 27, 2019
This is groundwork for introducing classification evaluation which actually needs multistep evaluation.
przemekwitek added a commit that referenced this pull request Sep 27, 2019
