-
Notifications
You must be signed in to change notification settings - Fork 36
Integration with Ultrawarm - Follow up #97
Integration with Ultrawarm - Follow up #97
Conversation
This is a follow up PR to address comments. Testing done: 1. gradle build passes 2. Verified AD runs only in hot nodes. 3. stats API and HourlyCron still works.
|
||
static class HotDataNodePredicate implements Predicate<DiscoveryNode> { | ||
@Override | ||
public boolean test(DiscoveryNode discoveryNode) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
question. Just to confirm this if-and-only-if behavior since it is different from the last change, any data nodes that are marked but not marked hot are not eligible? if a data node is marked with some unknown new value, it would be ineligible based on this rule.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently, there are only hot and warm data nodes.
Previously, if a data node is marked with warm node, we ignore it
Now, if a data node is marked with hot node or not marked at all, we use it.
If a data node is marked with some unknown new value in the future, it would be ineligible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggestion. Since this business logic/decision has important effects on system behavior, it is better to document it. Also, the documentation at line 49 is outdated based on this new change.
* @return whether we should use this node for AD | ||
*/ | ||
public boolean isEligibleNode(DiscoveryNode node) { | ||
return new HotDataNodePredicate().test(node); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
issue. this predicate should be instantiated, maybe injected, and reused for calls and should not be instantiated once for every call.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good point. Changed.
ClusterState state = this.clusterService.state(); | ||
final List<DiscoveryNode> eligibleNodes = new ArrayList<>(); | ||
final HotDataNodePredicate eligibleNodeFilter = new HotDataNodePredicate(); | ||
for (DiscoveryNode node : state.nodes()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestion. stream helps improve code efficiency (eliminating need for intermediary data structures, variables creation and operation) and readability.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is personal coding style. Thanks for the suggestions.
|
||
import com.amazon.opendistroforelasticsearch.ad.constant.CommonName; | ||
|
||
public class DiscoveryNodeFilterer { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor. class documentation is missing. For public classes and methods, the responsibilities should be summarized for clients and readers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
* model partitions to all data nodes in the cluster randomly, which | ||
* could cause a model performance downgrade issue once warm nodes | ||
* are throttled due to resource limitations. The PR excludes warm nodes to place model partitions. | ||
* @return an array of eligible data nodes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor: indentation is not needed in java docs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed.
@@ -119,7 +119,7 @@ public String getName() { | |||
/** | |||
* Constructor. | |||
* | |||
* @param clusterStateUtils cluster info | |||
* @param nodeFilter utility class to select nodesr |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor: typo nodes r
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed.
|
||
import com.amazon.opendistroforelasticsearch.ad.constant.CommonName; | ||
|
||
public class DiscoveryNodeFilterer { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see any UT for this class. Would be great to add some.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have coverage tool to check coverage (75% line and 60% branch coverage). Without enough coverage, build would fail. This class is covered by its callers’ tests.
This is a follow up PR to address comments. Testing done: 1. gradle build passes 2. Verified AD runs only in hot nodes. 3. stats API and HourlyCron still works.
* Integration with Ultrawarm (#95) Ultrawarm introduces warm nodes into the ES cluster. Currently, we distribute model partitions to all data nodes in the cluster randomly, which could cause a model performance downgrade issue once warm nodes are throttled due to resource limitations. The PR excludes warm nodes to place model partitions. Since index shards are hosted on hot nodes, AD's coordinating nodes are in hot nodes as well. We don't need to send HourlyCron job and stats requests to warm nodes anymore. This PR implements those changes. Testing done: 1. Verified AD runs only in hot nodes. 2. stats API and HourlyCron still works. * Integration with Ultrawarm - Follow up (#97) This is a follow up PR to address comments. Testing done: 1. gradle build passes 2. Verified AD runs only in hot nodes. 3. stats API and HourlyCron still works.
Issue #, if available:
Description of changes:
This is a follow up PR to address comments.
See context in #95
Testing done:
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.