Extract a new class for entity frequency tracking #389
HC detectors use a one-pass algorithm to estimate heavy hitters in a stream. The method maintains a time-decayed count for each entity, which lets us compare the frequencies of entities from different detectors in the stream. To reuse the code in historical detectors, I created a new class, PriorityTracker, and moved all related logic there. When an entity is hit, the caller calls PriorityTracker.updatePriority to update the entity's priority. Callers can find the most frequently occurring entities in the stream using PriorityTracker.getTopNEntities.
This PR also adds tests for NodeStateManager.
Testing done:
1. Manually verified that the basic workflow of HC detectors still works.
2. Added new tests for PriorityTracker.
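To make the description concrete, here is a minimal sketch of a tracker with the two operations named above. It is not the class from this PR: the name SimplePriorityTracker, the explicit period argument, and the exponential increment e^(0.125 * period) are assumptions made for illustration (the review thread only confirms the decay constant 0.125).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified stand-in for the PriorityTracker described above.
class SimplePriorityTracker {
    // Decay constant quoted in the Javadoc under review.
    private static final double DECAY_CONSTANT = 0.125;

    // Accumulated priority per entity id.
    private final Map<String, Double> priorities = new HashMap<>();

    // Record one hit for entityId at `period` (periods elapsed since the landmark).
    // Assumption: each hit adds e^(0.125 * period), so later hits weigh more.
    double updatePriority(String entityId, long period) {
        double increment = Math.exp(DECAY_CONSTANT * period);
        return priorities.merge(entityId, increment, Double::sum);
    }

    // Return up to n entity ids in descending order of priority.
    List<String> getTopNEntities(int n) {
        List<Map.Entry<String, Double>> entries = new ArrayList<>(priorities.entrySet());
        entries.sort(Map.Entry.<String, Double>comparingByValue().reversed());
        List<String> top = new ArrayList<>();
        for (int i = 0; i < Math.min(n, entries.size()); i++) {
            top.add(entries.get(i).getKey());
        }
        return top;
    }
}
```

With this sketch, two hits on one entity at period 0 accumulate a priority of 2.0, while a single hit on another entity at period 8 accumulates e^1 (about 2.718), so the more recent entity ranks first.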
Codecov Report
@@           Coverage Diff            @@
##              main    #389    +/-   ##
=========================================
  Coverage    79.14%  79.15%
- Complexity    2662    2680    +18
=========================================
  Files          247     248     +1
  Lines        11717   11749    +32
  Branches      1008    1010     +2
=========================================
+ Hits          9274    9300    +26
- Misses        1964    1971     +7
+ Partials       479     478     -1
 * @param n the number of entities to return. Can be less than n if there are not enough entities stored.
 * @return top entities in the descending order of priority
 */
public List<String> getTopNEntities(int n) {
Is this method going to be used elsewhere eventually? So far it's only called by tests.
Yes, historical detectors are expected to call it.
Other than a minor question I had about a currently unused new method, the refactor looks good.
 * @param entityModelId Entity model Id
 * @param priority priority
 */
protected void addPriority(String entityModelId, float priority) {
Should the initial priority of an entity be 0? I see that line 210 uses new PriorityNode(entityModelId, 0f).
Why do we need to set the key as entityModelId? Can we just use the entity as the key?
"Should the initial priority of an entity be 0? I see that line 210 uses new PriorityNode(entityModelId, 0f)."
Yes.
"Why do we need to set the key as entityModelId? Can we just use the entity as the key?"
You asked me to use this name :) How about entityId?
For real-time HC detectors, it stores the entity model ID (detectorId + "entity" + entity?). For historical HC detectors, I plan to use just the entity value. As we discussed, each detector will have its own PriorityTracker, so we can use just the entity as the key, which saves some memory.
" as detector will have separate PriorityTracker, so we can just use entity, that can save some memory." Could you explain this sentence? I don't understand.
For example, we currently store "<detector_id>+<entity_value>_entity" in the cache. Because the cache is at the detector level, we can change the key to "<entity_value>" and avoid storing the duplicate "<detector_id>" and "_entity" information.
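The key change discussed in this thread can be sketched as follows. This is only an illustration: the class EntityKeyDemo and its methods are hypothetical, and the "+" in the quoted key format is read as plain concatenation, which is an assumption.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of the two key formats discussed in this thread.
class EntityKeyDemo {
    // Longer key shape quoted above; "+" is assumed to mean concatenation.
    static String entityModelId(String detectorId, String entityValue) {
        return detectorId + entityValue + "_entity";
    }

    // One priority map per detector, keyed by the bare entity value.
    private final Map<String, Map<String, Float>> prioritiesByDetector = new HashMap<>();

    // Add priority for an entity inside its detector's own map.
    void addPriority(String detectorId, String entityValue, float priority) {
        prioritiesByDetector
            .computeIfAbsent(detectorId, id -> new HashMap<>())
            .merge(entityValue, priority, Float::sum);
    }

    // Returns null when the detector or the entity is unknown.
    Float getPriority(String detectorId, String entityValue) {
        Map<String, Float> perDetector = prioritiesByDetector.get(detectorId);
        return perDetector == null ? null : perDetector.get(entityValue);
    }
}
```

Because each detector owns its own map, the same entity value under two detectors does not collide, and the detector id is stored once per detector rather than once per key.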
 * detector is enabled. i - L measures the elapsed periods since detector starts.
 * 0.125 is the decay constant.
 *
 * Since g(p−L) is changing and they are the same for all entities of the same detector,
It seems all other places in this Javadoc use g(i - L). Change g(p-L) to g(i - L)?
Good catch. Changed.
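For readers following the g(i - L) discussion, the short derivation below sketches why a shared g(i - L) can be left out when ranking entities. It assumes g(n) = e^{0.125 n}; only the constant 0.125 and the argument i - L appear in the quoted Javadoc, so the exponential form and the hit times t_1, ..., t_k are assumptions introduced for illustration.

```latex
% Assumed weight function; L is the landmark, i the current period.
g(n) = e^{0.125\,n}

% Time-decayed count at period i for an entity hit at periods t_1, \dots, t_k:
C(i) = \sum_{j=1}^{k} e^{-0.125\,(i - t_j)}
     = \frac{\sum_{j=1}^{k} e^{0.125\,(t_j - L)}}{e^{0.125\,(i - L)}}
     = \frac{\sum_{j=1}^{k} g(t_j - L)}{g(i - L)}

% g(i - L) is identical for every entity of the same detector at period i,
% so ordering entities by \sum_j g(t_j - L) matches ordering them by C(i).
```

Under these assumptions, each hit only needs to add g(t_j - L) to a running sum, and existing sums never have to be rescaled as time advances.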
…odelId to entityId
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.