-
Notifications
You must be signed in to change notification settings - Fork 24.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allow evaluation to consist of multiple steps. #46653
Allow evaluation to consist of multiple steps. #46653
Conversation
5915583
to
72e6565
Compare
Pinging @elastic/ml-core |
This should certainly live under a new evaluation type. We need to determine a suitable name for it. |
50177fd
to
c269036
Compare
Ok, we can discuss the final naming offline. |
Let's just start with |
f47e3c9
to
ae64b7f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add 1-2 yaml
tests for coverage.
...java/org/elasticsearch/xpack/core/ml/dataframe/evaluation/classification/Classification.java
Outdated
Show resolved
Hide resolved
new TermsValuesSourceBuilder(ACTUAL_CLASS_FIELD).field(actualField), | ||
new TermsValuesSourceBuilder(PREDICTED_CLASS_FIELD).field(predictedField) | ||
) | ||
).size(MAX_NUM_CLASSES * MAX_NUM_CLASSES) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is an interesting hard limit to have.
@tveasey when it comes to classification, do you think we should support > 100 classes?
@przemekwitek if we do need to support > 100 classes, I think chaining together callbacks to scroll through the composite aggregation would be necessary. It is not overly complicated, but may cause some frustrating refactoring in the search execution.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are certainly cases where people would have more than 100 classes, but I think they'll be rare. We could consider this as an enhancement
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, let's stick to the limit of 100 classes for now.
Increasing that limit may cause code refactoring but should be invisible from user's perspective.
run elasticsearch-ci/2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add 1-2
yaml
tests for coverage.
Done
...java/org/elasticsearch/xpack/core/ml/dataframe/evaluation/classification/Classification.java
Outdated
Show resolved
Hide resolved
new TermsValuesSourceBuilder(ACTUAL_CLASS_FIELD).field(actualField), | ||
new TermsValuesSourceBuilder(PREDICTED_CLASS_FIELD).field(predictedField) | ||
) | ||
).size(MAX_NUM_CLASSES * MAX_NUM_CLASSES) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, let's stick to the limit of 100 classes for now.
Increasing that limit may cause code refactoring but should be invisible from user's perspective.
I haven't had the chance to look through this as closely as I'd like. Could you please hold off merging it? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good! Left a couple of points to consider.
...asticsearch/xpack/core/ml/dataframe/evaluation/classification/MulticlassConfusionMatrix.java
Outdated
Show resolved
Hide resolved
...asticsearch/xpack/core/ml/dataframe/evaluation/classification/MulticlassConfusionMatrix.java
Outdated
Show resolved
Hide resolved
Also, could you please add documentation for this? |
486b5ec
to
fc56c98
Compare
Map<String, Long> subCounts = new TreeMap<>(); | ||
counts.put(actualClass, subCounts); | ||
Terms subAgg = bucket.getAggregations().get(AGGREGATE_BY_PREDICTED_CLASS); | ||
maxSumOfOtherDocCounts = Math.max(maxSumOfOtherDocCounts, subAgg.getSumOfOtherDocCounts()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am a bit confused about this part and how it'd work. Here are my thoughts.
The set of actual classes may differ from that of the predicted classes. We're working with the first 1000 actual classes. For each of them, we're working with the first 1000 predicted classes for a given class.
I think it's fine in terms of the result matrix. It won't be a symmetric matrix, but I don't think it matters as we can still answer the question "how many times was class X classified as Y?".
However, when it comes to reporting the number of unhandled classes, I think what we do now may be confusing. There are 2 different counts at play. First, the count of unhandled actual classes which we get from the outer aggregation. Second, the count of unhandled predicted classes for each actual class we handle. I am not sure how helpful the max of all those is. Let's think a bit about this and discuss a solution.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the general idea...
I think the natural way to implement this would be as follows:
- Order the classes by frequency (unless there is some extrinsic notion of importance, i.e. user defined list),
- Limit to 100 classes subject to the order defined in 1,
- Introduce a new class "other" which is every class not selected in 2,
- Report errors statistics for "actual is selected class prediction is other" and "actual is other prediction is selected type"
I'd probably omit the other vs other diagonal entry. Filling this in implies the classification is correct, where as of course we can't determine that by examining the actual classes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that would need 2 searches: the first to figure out the most frequent actual classes and the second to get the predicted classes after filtering out classes not in the above set.
We'll need to stretch the framework a bit to allow multiple searches but it might be good to do anyhow for paving the road for auc_roc
, etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is now done. Metric evaluation can consist of many steps. Evaluation process gathers the results. PTAL
I'm also exploring if using TypedChainTaskExecutor
would make sense here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Update:
I used TypedChainTaskExecutor
to simplify the task chaining code.
I have discussed offline with @przemekwitek to do the following changes:
|
4ee034b
to
af09c1d
Compare
75a1955
to
1162769
Compare
Done. |
1162769
to
955c712
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this looks much better! A few minor comments.
)); | ||
EvaluationExecutor evaluationExecutor = new EvaluationExecutor(threadPool, client, request); | ||
// Add one task only. Other tasks will be added as needed during execution. | ||
evaluationExecutor.add(evaluationExecutor.newTask()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps we should add the first task in the constructor of the executor and then we won't need this at all.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
this.evaluation = request.getEvaluation(); | ||
} | ||
|
||
private TypedChainTaskExecutor.ChainTask<Void> newTask() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shall we call this nextTask
? In a way this is like an iterator of sorts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
result = evaluate(aggs); | ||
} | ||
|
||
private EvaluationMetricResult evaluate(Aggregations aggs) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need this method now? We could inline that in process
, no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
result = evaluate(aggs); | ||
} | ||
|
||
private EvaluationMetricResult evaluate(Aggregations aggs) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
run elasticsearch-ci/bwc |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
ec88b42
to
782443b
Compare
This is groundwork for introducing classification evaluation which actually needs multistep evaluation.
782443b
to
aaf8206
Compare
run elasticsearch-ci/2 |
This is groundwork for introducing classification evaluation which actually needs multistep evaluation.
This PR adds a possibility for an evaluation to consist of more than one search step.
This is needed when the results of one aggregation are input to another aggregation and pipeline aggregations cannot be used.
TypedChainExecutor is used to execute a (dynamically built) sequence of steps.
Relates #46735