[CI] MLModelDeploymentFullClusterRestartIT bwc tests failing #103808
Pinging @elastic/ml-core (Team:ML)
The linked failure was on a PR branch, but not one that changes anything in this area, and there have been other related failures today, e.g. https://gradle-enterprise.elastic.co/s/ecrinp2ljstju/tests/task/:x-pack:qa:full-cluster-restart:v8.12.0%23bwcTest/details/org.elasticsearch.xpack.restart.MLModelDeploymentFullClusterRestartIT
This is probably related to #103591, as there is a timeout waiting for the process to finish. There is also an NPE in the model assignment code to investigate.
Mute incoming.
The two linked build scans show different problems. The one in the issue description is an ML bug: there's been a problem connecting the logging stream. The problem in #103808 (comment) is different: the nodes seem to be deciding to shut down partway through startup, and trip assertions while doing so.
Both these failures were using the same kernel:
Whatever the problem is, 8.11.4 doesn't seem to have it, while 8.13.0 does:
So I was thinking this would be caused by reverting IPEX, but
A couple of test failures (#103808 and #103868) reveal that an exception thrown while connecting to the pytorch_inference process can be uncaught and hence cause the whole node to stop. This change does not fix the underlying problem of failure to connect to the process that those issues relate to, but it converts the error from one that crashes a whole node to one that just fails the affected model deployment.
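The fix described above amounts to making sure no exception raised while connecting to the pytorch_inference process can escape uncaught: the error is routed to the deployment's failure handler, so it fails only that model deployment instead of crashing the node. The following is a minimal sketch of that pattern; all class and method names here are hypothetical, not the actual Elasticsearch code.

```java
// Hypothetical sketch of scoping a process-connection failure to one
// deployment. None of these names come from the Elasticsearch codebase.
import java.io.IOException;
import java.util.function.Consumer;

public class DeploymentStarter {

    /** Stands in for connecting to the pytorch_inference process; may throw. */
    interface ProcessConnector {
        void connect() throws IOException;
    }

    /**
     * Start a model deployment. Any exception thrown while connecting to the
     * process is caught and passed to onFailure, failing only this
     * deployment, rather than being left uncaught where it would take down
     * the whole node.
     */
    static void startDeployment(ProcessConnector connector,
                                Runnable onStarted,
                                Consumer<Exception> onFailure) {
        try {
            connector.connect();
            onStarted.run();
        } catch (Exception e) {      // catch broadly: nothing may escape
            onFailure.accept(e);     // fail the deployment, not the node
        }
    }

    public static void main(String[] args) {
        // Connection succeeds: the deployment starts normally.
        startDeployment(() -> { },
                () -> System.out.println("deployment started"),
                e -> System.out.println("deployment failed: " + e.getMessage()));

        // Connection throws: the error is contained and reported.
        startDeployment(() -> { throw new IOException("cannot connect logging stream"); },
                () -> System.out.println("deployment started"),
                e -> System.out.println("deployment failed: " + e.getMessage()));
    }
}
```

Note that this only changes where the error surfaces; as the comment says, it does not fix the underlying connection failure itself.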
I've opened #105728 to see if upgrading to PyTorch 2.1.2 has fixed the problem that caused these failures.
The failure of #103808 might be fixed by the upgrade to PyTorch 2.1.2. It's not good that we have this entire suite muted, so I'll try unmuting for a bit and see if we get any repeat failures.
It seems this test has been unmuted since February and hasn't caused any more failures.
Build scan:
https://gradle-enterprise.elastic.co/s/d66ig3jihd6pq/tests/:x-pack:qa:full-cluster-restart:v8.11.4%23bwcTest/org.elasticsearch.xpack.restart.MLModelDeploymentFullClusterRestartIT/testDeploymentSurvivesRestart%20%7Bcluster=UPGRADED%7D
Reproduction line:
Applicable branches:
main
Reproduces locally?:
Didn't try
Failure history:
Failure dashboard for org.elasticsearch.xpack.restart.MLModelDeploymentFullClusterRestartIT#testDeploymentSurvivesRestart {cluster=UPGRADED}
Failure excerpt: