This repository has been archived by the owner on Aug 2, 2022. It is now read-only.

add anomaly feature attribution to model output #232

Merged
merged 2 commits into from
Oct 1, 2020

@wnbts wnbts commented Sep 14, 2020

Description of changes: This PR adds normalized anomaly score attribution to the anomaly detection model output. The anomaly score is attributed to each feature dimension of the current data point, and the attributions are normalized to sum to 1 for easier consumption. Further changes to the external API and persistence are not included in this PR.
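To make "normalized to 1" concrete, here is a minimal sketch of the idea, not the code merged in this PR. The class and the `normalize` helper are hypothetical names introduced for illustration; the sketch assumes non-negative per-feature attributions and returns all zeros when there is nothing to attribute.

```java
import java.util.Arrays;

class AttributionNormalizationSketch {

    /**
     * Hypothetical helper: scales non-negative per-feature attributions
     * so that they sum to 1. Returns all zeros if the total is not positive
     * (an assumption made for this sketch).
     */
    static double[] normalize(double[] raw) {
        double total = Arrays.stream(raw).sum();
        double[] normalized = new double[raw.length];
        if (total <= 0) {
            return normalized;
        }
        for (int i = 0; i < raw.length; i++) {
            normalized[i] = raw[i] / total;
        }
        return normalized;
    }

    public static void main(String[] args) {
        // Three features with raw attributions 0.2, 0.6 and 1.2 (made-up values).
        double[] shares = normalize(new double[] { 0.2, 0.6, 1.2 });
        // Each entry is that feature's share of the anomaly score.
        System.out.println(Arrays.toString(shares));
    }
}
```

With shares that sum to 1, a consumer can read each entry directly as a fraction of the total score, regardless of the score's magnitude.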

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

codecov bot commented Sep 14, 2020

Codecov Report

Merging #232 into master will decrease coverage by 0.17%.
The diff coverage is 100.00%.

@@             Coverage Diff              @@
##             master     #232      +/-   ##
============================================
- Coverage     72.25%   72.08%   -0.18%     
- Complexity     1278     1289      +11     
============================================
  Files           139      139              
  Lines          6045     5993      -52     
  Branches        469      476       +7     
============================================
- Hits           4368     4320      -48     
+ Misses         1465     1461       -4     
  Partials        212      212              
| Flag | Coverage Δ | Complexity Δ |
|---|---|---|
| #cli | 78.11% <ø> (-2.20%) | 0.00 <ø> (ø) |

| Impacted Files | Coverage Δ | Complexity Δ |
|---|---|---|
| ...opendistroforelasticsearch/ad/ml/ModelManager.java | 91.39% <100.00%> (+0.45%) | 106.00 <7.00> (+6.00) |
| ...on/opendistroforelasticsearch/ad/ml/RcfResult.java | 100.00% <100.00%> (ø) | 14.00 <3.00> (+2.00) |
| ...oforelasticsearch/ad/ml/rcf/CombinedRcfResult.java | 100.00% <100.00%> (ø) | 12.00 <3.00> (+2.00) |
| ...rch/ad/transport/AnomalyResultTransportAction.java | 78.59% <100.00%> (+0.12%) | 59.00 <0.00> (ø) |
| ...relasticsearch/ad/transport/RCFResultResponse.java | 100.00% <100.00%> (ø) | 8.00 <2.00> (+1.00) |
| ...csearch/ad/transport/RCFResultTransportAction.java | 89.65% <100.00%> (+1.65%) | 5.00 <0.00> (ø) |
| cli/internal/gateway/ad/ad.go | 60.43% <0.00%> (-4.95%) | 0.00% <0.00%> (ø%) |
| cli/internal/controller/ad/ad.go | 75.17% <0.00%> (-2.22%) | 0.00% <0.00%> (ø%) |
| cli/internal/gateway/es/es.go | 84.61% <0.00%> (-2.06%) | 0.00% <0.00%> (ø%) |
| ...asticsearch/ad/cluster/ADClusterEventListener.java | 92.00% <0.00%> (-2.00%) | 14.00% <0.00%> (-1.00%) |

... and 129 more

@wnbts wnbts marked this pull request as ready for review September 15, 2020 00:10
vec.renormalize(1d);
double[] attribution = new double[vec.getDimensions()];
for (int i = 0; i < attribution.length; i++) {
    attribution[i] = vec.getHighLowSum(i);
Member

Questions:
First, how should we interpret the case where both high and low are non-zero? Is the value really high or really low? Does it mean the RCF trees think the value could be either higher or lower than the recently observed trend for that column? Do we need a majority-wins rule to decide whether it is actually high or low?
Second, summing high and low loses the direction. Is there any way to preserve it?
Third, when users see two features' attributions such as x: 1% and y: 99%, that tells them the anomaly is in y. We might as well not show x's 1%. I feel an attribution score below 1/d (where d is the number of features) is not useful to users. Any comments on this?

Contributor Author

  1. For a given sum, the larger of the two values indicates the direction/relative position. Note that even data within the normal range has non-zero values for both.
  2. Since the current UX design only shows feature contribution, only the contribution is computed here. The direction is not lost; it can be added when it's needed.
  3. 1/99 versus 0/100 probably won't make a difference to users. In general, additional rules should be avoided for simplicity and correctness, as they might introduce their own problems. As an extreme example, if the contribution from two features is 49/51 and the 49 is omitted, the result after normalization could be 0/100.
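The two points above can be illustrated with a small sketch. It is not part of the PR: `direction` and `dropBelowUniformShare` are hypothetical helpers, the first reading direction from the larger of the high/low components (ties are resolved as "high" purely by assumption), the second applying the 1/d cutoff proposed in the review so its effect on the 49/51 example is visible.

```java
import java.util.Arrays;

class AttributionRulesSketch {

    /** Hypothetical: the larger of the high/low components gives the direction. */
    static String direction(double high, double low) {
        return high >= low ? "high" : "low";
    }

    /**
     * Hypothetical: zeroes every contribution below 1/d and renormalizes
     * the remainder. This is the rule proposed in the review, not the
     * behavior merged in this PR.
     */
    static double[] dropBelowUniformShare(double[] contribution) {
        int d = contribution.length;
        double[] kept = new double[d];
        double total = 0;
        for (int i = 0; i < d; i++) {
            if (contribution[i] >= 1.0 / d) {
                kept[i] = contribution[i];
                total += kept[i];
            }
        }
        if (total > 0) {
            for (int i = 0; i < d; i++) {
                kept[i] /= total;
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Both components are non-zero; the larger one decides the direction.
        System.out.println(direction(0.7, 0.1)); // prints "high"

        // Two nearly equal contributors: 0.49 falls below 1/2 and is dropped.
        double[] result = dropBelowUniformShare(new double[] { 0.49, 0.51 });
        System.out.println(Arrays.toString(result)); // prints "[0.0, 1.0]"
    }
}
```

The second output shows the concern raised in the reply: a near-even 49/51 split is reported as 0/100 once the cutoff is applied.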

  } else {
      double score = rcfResults.stream().mapToDouble(r -> r.getScore() * r.getForestSize()).sum() / totalForestSize;
      double confidence = rcfResults.stream().mapToDouble(r -> r.getConfidence() * r.getForestSize()).sum() / Math
          .max(rcfNumTrees, totalForestSize);
-     combinedResult = new CombinedRcfResult(score, confidence);
+     double[] attribution = combineAttribution(rcfResults, numFeatures, totalForestSize);
+     combinedResult = new CombinedRcfResult(score, confidence, combineAttribution(rcfResults, numFeatures, totalForestSize));
Contributor

[minor] It looks like the variable attribution was meant to be used in line 229, but didn't get used and instead combineAttribution was invoked again with identical args.

Contributor Author

good catch. I fixed it in the new commit.
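For readers following this thread, a minimal sketch of the pattern under discussion: compute the combined attribution once and reuse the variable. The body of `combineAttribution` is not shown in the hunk, so the forest-size weighting below is an assumption modeled on how the score is combined; `PartitionResult` and the sample values are hypothetical stand-ins.

```java
import java.util.Arrays;
import java.util.List;

class CombineAttributionSketch {

    /** Hypothetical stand-in for one partition's RCF result. */
    static final class PartitionResult {
        final double score;
        final int forestSize;
        final double[] attribution;

        PartitionResult(double score, int forestSize, double[] attribution) {
            this.score = score;
            this.forestSize = forestSize;
            this.attribution = attribution;
        }
    }

    /**
     * Assumed behavior: average per-feature attributions across partitions,
     * weighting each partition by its share of the total forest size.
     */
    static double[] combineAttribution(List<PartitionResult> results, int numFeatures, int totalForestSize) {
        double[] combined = new double[numFeatures];
        for (PartitionResult r : results) {
            for (int i = 0; i < numFeatures; i++) {
                combined[i] += r.attribution[i] * r.forestSize / totalForestSize;
            }
        }
        return combined;
    }

    public static void main(String[] args) {
        List<PartitionResult> results = Arrays.asList(
            new PartitionResult(1.0, 10, new double[] { 0.2, 0.8 }),
            new PartitionResult(2.0, 30, new double[] { 0.6, 0.4 }));
        int totalForestSize = 40;

        double score = results.stream().mapToDouble(r -> r.score * r.forestSize).sum() / totalForestSize;
        // Computed once and reused, rather than calling combineAttribution twice.
        double[] attribution = combineAttribution(results, 2, totalForestSize);

        System.out.println(score + " " + Arrays.toString(attribution));
    }
}
```

Reusing the local variable avoids a second pass over every partition's attribution array and keeps the constructor call shorter.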

@wnbts wnbts merged commit 2d29d47 into opendistro-for-elasticsearch:master Oct 1, 2020
@ohltyler ohltyler added the enhancement New feature or request label Oct 19, 2020
4 participants