-
Notifications
You must be signed in to change notification settings - Fork 1.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add listenable TransportRequestHandler in TransportNodesAction #15166
base: main
Are you sure you want to change the base?
Add listenable TransportRequestHandler in TransportNodesAction #15166
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #15166 +/- ##
============================================
+ Coverage 72.03% 72.13% +0.10%
- Complexity 65230 65261 +31
============================================
Files 5318 5318
Lines 304051 304069 +18
Branches 43990 43991 +1
============================================
+ Hits 219021 219355 +334
+ Misses 67121 66747 -374
- Partials 17909 17967 +58 ☔ View full report in Codecov by Sentry. |
❌ Gradle check result for f17a5c0: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
❌ Gradle check result for 3c68c3f: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
@peternied @gaobinlong @dblock Can you help review this PR? |
This PR is stalled because it has been open for 30 days with no activity. |
3c68c3f
to
62c17d5
Compare
❌ Gradle check result for cdd58e2: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@zane-neo Could you give some use cases about this change to help understanding the intention, thanks!
@gaobinlong Thanks, Binlong. I'll add more cases in the description and the issue. |
cdd58e2
to
2e6023c
Compare
❌ Gradle check result for 2e6023c: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
@rajiv-kv @shwetathareja could you also help to take a look at this PR, thanks! |
c9a2500
to
067fc68
Compare
❕ Gradle check result for 067fc68: UNSTABLE Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure. |
Signed-off-by: zane-neo <zaniu@amazon.com>
Signed-off-by: zane-neo <zaniu@amazon.com>
067fc68
to
a11a8a1
Compare
Description
Add listenable TransportRequestHandler in TransportNodesAction.
Use case:
In ml-commons, a scenario is to deploy model to the cluster, this action needs to read the model metadata from index first and then do time-consuming operation for locally running model and non time-consuming operations for remote running model.
The thing is the current
TransportNodesAction
limited the API that developers have to return a response when performing nodeOperation, so we have to workaround this by forwarding the deploy result to a listener to update the model status, the workflow looks like below:This works fine for the time-consuming models(run locally) but not for remote models because deploying remote models is lightweight which only needs create several object in memory after metadata read. Once we enhanced this API to support listenableTransportRequestHandler, then we can pass the listener to the noOperation and return actual model status instead of
deploying
.Also recently we found an issue of model deployment which is caused by the
forwarding deploy result to listener
part, simply put, if a node crash during the model deploy, then the listener will not able to receive all responses and not updating the model status, please refer: opensearch-project/ml-commons#2970.So enhancing this class offering a straightforward solution to this scenario and I believe it can benefit other similar cases which accept a listener during the nodeOperation.
Related Issues
#15165
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.