add anomaly feature attribution to model output #232
Conversation
Codecov Report
@@ Coverage Diff @@
## master #232 +/- ##
============================================
- Coverage 72.25% 72.08% -0.18%
- Complexity 1278 1289 +11
============================================
Files 139 139
Lines 6045 5993 -52
Branches 469 476 +7
============================================
- Hits 4368 4320 -48
+ Misses 1465 1461 -4
Partials 212 212
vec.renormalize(1d);
double[] attribution = new double[vec.getDimensions()];
for (int i = 0; i < attribution.length; i++) {
    attribution[i] = vec.getHighLowSum(i);
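The snippet above renormalizes the attribution vector so the per-dimension high/low sums add up to 1 before they are read out. A minimal self-contained sketch of that renormalization step, using plain arrays instead of the library's `DiVector` (the helper name and signature here are illustrative assumptions, not the plugin's actual API):

```java
public class AttributionSketch {
    // Scale highLowSums so the entries add up to `total` (1.0 in the PR),
    // mirroring what vec.renormalize(1d) does before the per-dimension read-out.
    static double[] normalize(double[] highLowSums, double total) {
        double sum = 0;
        for (double v : highLowSums) {
            sum += v;
        }
        double[] out = new double[highLowSums.length];
        if (sum == 0) {
            return out; // an all-zero input stays all-zero
        }
        for (int i = 0; i < out.length; i++) {
            out[i] = highLowSums[i] * total / sum;
        }
        return out;
    }

    public static void main(String[] args) {
        // Raw high/low sums of 2 and 6 renormalize to 0.25 and 0.75.
        double[] a = normalize(new double[] {2.0, 6.0}, 1.0);
        System.out.println(a[0] + "," + a[1]); // prints 0.25,0.75
    }
}
```

After this step, each entry can be read directly as that feature's share of the anomaly score.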
Questions:
First, how should we interpret the case where both high and low are non-zero? Is the value really high, or low? Does it mean the RCF trees think the value could be either higher or lower than the recently observed data trend for that column? Do we need a majority-win rule to decide whether it is actually high or low?
Second, when taking a high-low sum, we lose the direction. Is there any way to preserve it?
Third, when users see two features' attributions like x: 1% and y: 99%, it tells them y is where the anomaly happens. It might as well not show x's 1%. I feel an attribution score less than 1/d (where d is the number of features) is not useful to users. Any comments on this?
- For a given sum, the larger value indicates the direction/relative position. Note that even data within the normal range has non-zero values for both.
- Since the current UX design only shows feature contribution, only the contribution is computed here. The direction is not lost; it can be added when it's needed.
- 1/99 or 0/100 probably won't make a difference to users. In general, additional rules should be avoided for simplicity and correctness, as they might introduce their own problems. As an extreme example, suppose the contribution from two features is 49/51: if the 49 is omitted, after renormalization the result becomes 0/100.
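If direction ever becomes needed, one way to preserve it (a sketch of the idea discussed above, not current plugin behavior) is to keep the signed difference `high - low` per dimension alongside the unsigned contribution `high + low`:

```java
public class DirectionSketch {
    // For each feature dimension, a positive result means the trees lean
    // toward "value too high", a negative result toward "value too low".
    static double[] signedAttribution(double[] high, double[] low) {
        double[] out = new double[high.length];
        for (int i = 0; i < high.length; i++) {
            out[i] = high[i] - low[i];
        }
        return out;
    }

    public static void main(String[] args) {
        // Feature 0 leans high (3.0 vs 1.0), feature 1 leans low (0.5 vs 2.5).
        double[] s = signedAttribution(new double[] {3.0, 0.5}, new double[] {1.0, 2.5});
        System.out.println(s[0] + "," + s[1]); // prints 2.0,-2.0
    }
}
```

The magnitude still reflects the contribution, so both views can be derived from the same high/low pair.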
} else {
    double score = rcfResults.stream().mapToDouble(r -> r.getScore() * r.getForestSize()).sum() / totalForestSize;
    double confidence = rcfResults.stream().mapToDouble(r -> r.getConfidence() * r.getForestSize()).sum() / Math
        .max(rcfNumTrees, totalForestSize);
    combinedResult = new CombinedRcfResult(score, confidence);
    double[] attribution = combineAttribution(rcfResults, numFeatures, totalForestSize);
    combinedResult = new CombinedRcfResult(score, confidence, combineAttribution(rcfResults, numFeatures, totalForestSize));
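The combining step follows the same forest-size weighting used for score and confidence above. A self-contained sketch of that weighted combination over per-partition attribution vectors (names and signature are assumptions based on the snippet, not the exact plugin code):

```java
import java.util.List;

public class CombineSketch {
    // Combine one attribution vector per forest partition into a single
    // vector, weighting each partition by its share of the total forest.
    static double[] combineAttribution(
            List<double[]> attributions, List<Integer> forestSizes,
            int numFeatures, int totalForestSize) {
        double[] combined = new double[numFeatures];
        for (int p = 0; p < attributions.size(); p++) {
            double weight = (double) forestSizes.get(p) / totalForestSize;
            for (int i = 0; i < numFeatures; i++) {
                combined[i] += attributions.get(p)[i] * weight;
            }
        }
        return combined;
    }

    public static void main(String[] args) {
        // Partition of 10 trees says 0.2/0.8; partition of 30 trees says 0.6/0.4.
        double[] c = combineAttribution(
            List.of(new double[] {0.2, 0.8}, new double[] {0.6, 0.4}),
            List.of(10, 30), 2, 40);
        System.out.println(c[0] + "," + c[1]); // prints 0.5,0.5
    }
}
```

Because each input vector sums to 1 and the weights sum to 1, the combined vector also sums to 1, so no further renormalization is needed.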
[minor] It looks like the variable attribution was meant to be used in line 229, but didn't get used; instead combineAttribution was invoked again with identical args.
Good catch. I fixed it in the new commit.
Description of changes: This PR adds normalized anomaly score attribution to the anomaly detection model output. The anomaly score is attributed to each feature dimension of the current single data point and normalized to sum to 1 for easier consumption. Further changes to the external API and persistence are not included in this PR.
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.