
[CI] MLModelDeploymentFullClusterRestartIT bwc tests failing #103808

Closed
DaveCTurner opened this issue Jan 2, 2024 · 8 comments
Labels
:ml (Machine learning), needs:risk (Requires assignment of a risk label: low, medium, blocker), Team:ML (Meta label for the ML team), >test-failure (Triaged test failures from CI)

Comments

@DaveCTurner (Contributor)

Build scan:
https://gradle-enterprise.elastic.co/s/d66ig3jihd6pq/tests/:x-pack:qa:full-cluster-restart:v8.11.4%23bwcTest/org.elasticsearch.xpack.restart.MLModelDeploymentFullClusterRestartIT/testDeploymentSurvivesRestart%20%7Bcluster=UPGRADED%7D

Reproduction line:

./gradlew ':x-pack:qa:full-cluster-restart:v8.11.4#bwcTest' -Dtests.class="org.elasticsearch.xpack.restart.MLModelDeploymentFullClusterRestartIT" -Dtests.method="testDeploymentSurvivesRestart {cluster=UPGRADED}" -Dtests.seed=339D93ECF7B44080 -Dtests.bwc=true -Dtests.locale=ga -Dtests.timezone=Africa/Algiers -Druntime.java=21

Applicable branches:
main

Reproduces locally?:
Didn't try

Failure history:
Failure dashboard for org.elasticsearch.xpack.restart.MLModelDeploymentFullClusterRestartIT#testDeploymentSurvivesRestart {cluster=UPGRADED}

Failure excerpt:

java.lang.AssertionError: Inference failed

  at __randomizedtesting.SeedInfo.seed([339D93ECF7B44080:DAC69A8C811AA6A0]:0)
  at org.elasticsearch.xpack.restart.MLModelDeploymentFullClusterRestartIT.lambda$testDeploymentSurvivesRestart$2(MLModelDeploymentFullClusterRestartIT.java:120)
  at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:1278)
  at org.elasticsearch.xpack.restart.MLModelDeploymentFullClusterRestartIT.testDeploymentSurvivesRestart(MLModelDeploymentFullClusterRestartIT.java:113)
  at jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103)
  at java.lang.reflect.Method.invoke(Method.java:580)
  at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1758)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:946)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:982)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:996)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.tests.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:48)
  at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
  at org.apache.lucene.tests.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:45)
  at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
  at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:843)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:490)
  at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:955)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:840)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:891)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:902)
  at org.elasticsearch.test.cluster.local.DefaultLocalElasticsearchCluster$1.evaluate(DefaultLocalElasticsearchCluster.java:47)
  at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.tests.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:38)
  at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
  at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.tests.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
  at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
  at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
  at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
  at org.apache.lucene.tests.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:47)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl.lambda$forkTimeoutingTask$0(ThreadLeakControl.java:850)
  at java.lang.Thread.run(Thread.java:1583)

  Caused by: org.elasticsearch.client.ResponseException: method [POST], host [http://[::1]:37373], URI [/_ml/trained_models/trained-model-full-cluster-restart/deployment/_infer], status line [HTTP/1.1 408 Request Timeout]
  Warnings: [[POST /_ml/trained_models/{model_id}/deployment/_infer] is deprecated! Use [POST /_ml/trained_models/{model_id}/_infer] instead.]
  {"error":{"root_cause":[{"type":"status_exception","reason":"timeout [10s] waiting for inference result"}],"type":"status_exception","reason":"timeout [10s] waiting for inference result"},"status":408}

    at org.elasticsearch.client.RestClient.convertResponse(RestClient.java:351)
    at org.elasticsearch.client.RestClient.performRequest(RestClient.java:317)
    at org.elasticsearch.client.RestClient.performRequest(RestClient.java:292)
    at org.elasticsearch.xpack.restart.MLModelDeploymentFullClusterRestartIT.infer(MLModelDeploymentFullClusterRestartIT.java:233)
    at org.elasticsearch.xpack.restart.MLModelDeploymentFullClusterRestartIT.assertInfer(MLModelDeploymentFullClusterRestartIT.java:145)
    at org.elasticsearch.xpack.restart.MLModelDeploymentFullClusterRestartIT.lambda$testDeploymentSurvivesRestart$2(MLModelDeploymentFullClusterRestartIT.java:115)
    at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:1278)
    at org.elasticsearch.xpack.restart.MLModelDeploymentFullClusterRestartIT.testDeploymentSurvivesRestart(MLModelDeploymentFullClusterRestartIT.java:113)
    at jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103)
    at java.lang.reflect.Method.invoke(Method.java:580)
    at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1758)
    at com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:946)
    at com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:982)
    at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:996)
    at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
    at org.apache.lucene.tests.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:48)
    at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
    at org.apache.lucene.tests.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:45)
    at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
    at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
    at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
    at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
    at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:843)
    at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:490)
    at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:955)
    at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:840)
    at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:891)
    at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:902)
    at org.elasticsearch.test.cluster.local.DefaultLocalElasticsearchCluster$1.evaluate(DefaultLocalElasticsearchCluster.java:47)
    at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
    at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
    at org.apache.lucene.tests.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:38)
    at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
    at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
    at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
    at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
    at org.apache.lucene.tests.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
    at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
    at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
    at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
    at org.apache.lucene.tests.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:47)
    at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
    at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
    at com.carrotsearch.randomizedtesting.ThreadLeakControl.lambda$forkTimeoutingTask$0(ThreadLeakControl.java:850)
    at java.lang.Thread.run(Thread.java:1583)
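For context, the outer `AssertionError: Inference failed` comes from `ESTestCase.assertBusy`, which retries the inference assertion until a deadline and rethrows the last failure. A simplified, illustrative sketch of that retry pattern (the real implementation backs off between attempts and takes a configurable timeout):

```java
public class AssertBusySketch {
    // Illustrative sketch of the ESTestCase.assertBusy pattern, not the real code:
    // retry the assertion until it passes or the deadline expires, then rethrow
    // the most recent failure (which is what surfaces in the excerpt above).
    public static void assertBusy(Runnable assertion, long timeoutMillis) throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (true) {
            try {
                assertion.run();
                return; // condition satisfied
            } catch (AssertionError e) {
                if (System.currentTimeMillis() >= deadline) {
                    throw e; // timed out: surface the last failure
                }
                Thread.sleep(10); // wait and retry while the cluster settles
            }
        }
    }
}
```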

@DaveCTurner DaveCTurner added :ml Machine learning >test-failure Triaged test failures from CI labels Jan 2, 2024
@elasticsearchmachine elasticsearchmachine added blocker Team:ML Meta label for the ML team labels Jan 2, 2024
@elasticsearchmachine (Collaborator)

Pinging @elastic/ml-core (Team:ML)

@DaveCTurner (Contributor, Author)

The linked failure was on a PR branch but not one which changes anything in this area, and there have been other related failures today e.g. https://gradle-enterprise.elastic.co/s/ecrinp2ljstju/tests/task/:x-pack:qa:full-cluster-restart:v8.12.0%23bwcTest/details/org.elasticsearch.xpack.restart.MLModelDeploymentFullClusterRestartIT

@davidkyle (Member)

This is probably related to #103591, as there is a timeout waiting for the process to finish: "Timed out waiting for pytorch results processor to complete for model id trained-model-full-cluster-restart".

[2024-01-02T16:31:39,769][WARN ][o.e.x.m.p.AbstractNativeProcess] [test-cluster-0] [trained-model-full-cluster-restart] Exception closing the running pytorch_inference process
java.util.concurrent.TimeoutException: Timed out waiting for pytorch results processor to complete for model id trained-model-full-cluster-restart
  at org.elasticsearch.ml@8.13.0-SNAPSHOT/org.elasticsearch.xpack.ml.inference.pytorch.process.PyTorchResultProcessor.awaitCompletion(PyTorchResultProcessor.java:326)
  at org.elasticsearch.ml@8.13.0-SNAPSHOT/org.elasticsearch.xpack.ml.inference.deployment.DeploymentManager$ProcessContext.lambda$startAndLoad$1(DeploymentManager.java:503)
  at org.elasticsearch.ml@8.13.0-SNAPSHOT/org.elasticsearch.xpack.ml.inference.pytorch.process.NativePyTorchProcess.afterProcessInStreamClose(NativePyTorchProcess.java:81)
  at org.elasticsearch.ml@8.13.0-SNAPSHOT/org.elasticsearch.xpack.ml.process.AbstractNativeProcess.close(AbstractNativeProcess.java:192)
  at org.elasticsearch.base@8.13.0-SNAPSHOT/org.elasticsearch.core.IOUtils.close(IOUtils.java:71)
  at org.elasticsearch.ml@8.13.0-SNAPSHOT/org.elasticsearch.xpack.ml.inference.pytorch.process.NativePyTorchProcessFactory.createProcess(NativePyTorchProcessFactory.java:94)
  at org.elasticsearch.ml@8.13.0-SNAPSHOT/org.elasticsearch.xpack.ml.inference.pytorch.process.NativePyTorchProcessFactory.createProcess(NativePyTorchProcessFactory.java:31)
  at org.elasticsearch.ml@8.13.0-SNAPSHOT/org.elasticsearch.xpack.ml.inference.deployment.DeploymentManager$ProcessContext.startAndLoad(DeploymentManager.java:500)
  at org.elasticsearch.ml@8.13.0-SNAPSHOT/org.elasticsearch.xpack.ml.inference.deployment.DeploymentManager.lambda$startDeployment$4(DeploymentManager.java:204)
  at org.elasticsearch.server@8.13.0-SNAPSHOT/org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:917)
  at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
  at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
  at java.base/java.lang.Thread.run(Thread.java:1583)
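The `awaitCompletion` timeout in the trace above can be sketched as a latch that the results-processing thread counts down when it finishes, with `close()` waiting on it for a bounded time. This is an illustrative sketch only, not the actual `PyTorchResultProcessor` code:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Illustrative sketch of the completion-wait pattern: the processor thread
// signals completion via a latch; awaitCompletion throws the TimeoutException
// seen in the log if the processor never finishes in time.
public class ResultProcessorSketch {
    private final CountDownLatch completion = new CountDownLatch(1);

    // called by the results-processing thread once the result stream is exhausted
    public void onProcessingFinished() {
        completion.countDown();
    }

    public void awaitCompletion(long timeout, TimeUnit unit) throws TimeoutException, InterruptedException {
        if (completion.await(timeout, unit) == false) {
            throw new TimeoutException("Timed out waiting for results processor to complete");
        }
    }
}
```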

There is also an NPE in the model assignment code to investigate:

[2024-01-02T16:28:23,953][WARN ][o.e.c.s.ClusterApplierService] [test-cluster-0] failed to notify ClusterStateListener
java.lang.NullPointerException
  at java.base/java.util.Objects.requireNonNull(Objects.java:233)
  at org.elasticsearch.ml@8.13.0-SNAPSHOT/org.elasticsearch.xpack.ml.inference.assignment.TrainedModelAssignmentClusterService.detectReasonIfMlJobsStopped(TrainedModelAssignmentClusterService.java:1060)
  at org.elasticsearch.ml@8.13.0-SNAPSHOT/org.elasticsearch.xpack.ml.inference.assignment.TrainedModelAssignmentClusterService.detectReasonToRebalanceModels(TrainedModelAssignmentClusterService.java:1037)
  at org.elasticsearch.ml@8.13.0-SNAPSHOT/org.elasticsearch.xpack.ml.inference.assignment.TrainedModelAssignmentClusterService.clusterChanged(TrainedModelAssignmentClusterService.java:182)
  at org.elasticsearch.server@8.13.0-SNAPSHOT/org.elasticsearch.cluster.service.ClusterApplierService.callClusterStateListener(ClusterApplierService.java:560)
  at org.elasticsearch.server@8.13.0-SNAPSHOT/org.elasticsearch.cluster.service.ClusterApplierService.callClusterStateListeners(ClusterApplierService.java:546)
  at org.elasticsearch.server@8.13.0-SNAPSHOT/org.elasticsearch.cluster.service.ClusterApplierService.applyChanges(ClusterApplierService.java:505)
  at org.elasticsearch.server@8.13.0-SNAPSHOT/org.elasticsearch.cluster.service.ClusterApplierService.runTask(ClusterApplierService.java:429)
  at org.elasticsearch.server@8.13.0-SNAPSHOT/org.elasticsearch.cluster.service.ClusterApplierService$UpdateTask.run(ClusterApplierService.java:154)
  at org.elasticsearch.server@8.13.0-SNAPSHOT/org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:917)
  at org.elasticsearch.server@8.13.0-SNAPSHOT/org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:217)
  at org.elasticsearch.server@8.13.0-SNAPSHOT/org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:183)
  at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
  at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
  at java.base/java.lang.Thread.run(Thread.java:1583)

Mute incoming

@droberts195 (Contributor)

The two linked build scans show different problems. The one in the issue description is an ML bug:

[2024-01-02T16:31:39,773][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [test-cluster-0] uncaught exception in thread [elasticsearch[test-cluster-0][ml_utility][T#1]]
org.elasticsearch.ElasticsearchException: Failed to connect to pytorch process for job trained-model-full-cluster-restart
        at org.elasticsearch.xpack.core.ml.utils.ExceptionsHelper.serverError(ExceptionsHelper.java:66) ~[?:?]
        at org.elasticsearch.xpack.ml.inference.pytorch.process.NativePyTorchProcessFactory.createProcess(NativePyTorchProcessFactory.java:98) ~[?:?]
        at org.elasticsearch.xpack.ml.inference.pytorch.process.NativePyTorchProcessFactory.createProcess(NativePyTorchProcessFactory.java:31) ~[?:?]
        at org.elasticsearch.xpack.ml.inference.deployment.DeploymentManager$ProcessContext.startAndLoad(DeploymentManager.java:500) ~[?:?]
        at org.elasticsearch.xpack.ml.inference.deployment.DeploymentManager.lambda$startDeployment$4(DeploymentManager.java:204) ~[?:?]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:917) ~[elasticsearch-8.13.0-SNAPSHOT.jar:?]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) ~[?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) ~[?:?]
        at java.lang.Thread.run(Thread.java:1583) ~[?:?]
Caused by: java.io.FileNotFoundException: /dev/shm/bk/bk-agent-prod-gcp-1704208765331878120/elastic/elasticsearch-pull-request/x-pack/qa/full-cluster-restart/build/testrun/v8.11.4_bwcTest/temp/test-cluster11748905467589772195/test-cluster-0/tmp/pytorch_inference_trained-model-full-cluster-restart_log_29291 (No such file or directory)
        at java.io.FileInputStream.open0(Native Method) ~[?:?]
        at java.io.FileInputStream.open(FileInputStream.java:213) ~[?:?]
        at java.io.FileInputStream.<init>(FileInputStream.java:152) ~[?:?]
        at java.io.FileInputStream.<init>(FileInputStream.java:106) ~[?:?]
        at org.elasticsearch.xpack.ml.utils.NamedPipeHelper$PrivilegedInputPipeOpener.run(NamedPipeHelper.java:288) ~[?:?]
        at org.elasticsearch.xpack.ml.utils.NamedPipeHelper$PrivilegedInputPipeOpener.run(NamedPipeHelper.java:277) ~[?:?]
        at java.security.AccessController.doPrivileged(AccessController.java:319) ~[?:?]
        at org.elasticsearch.xpack.ml.utils.NamedPipeHelper.openNamedPipeInputStream(NamedPipeHelper.java:130) ~[?:?]
        at org.elasticsearch.xpack.ml.utils.NamedPipeHelper.openNamedPipeInputStream(NamedPipeHelper.java:97) ~[?:?]
        at org.elasticsearch.xpack.ml.process.ProcessPipes.connectLogStream(ProcessPipes.java:158) ~[?:?]
        at org.elasticsearch.xpack.ml.process.AbstractNativeProcess.start(AbstractNativeProcess.java:92) ~[?:?]
        at org.elasticsearch.xpack.ml.inference.pytorch.process.NativePyTorchProcessFactory.createProcess(NativePyTorchProcessFactory.java:89) ~[?:?]
        ... 7 more

There's been a problem connecting the logging stream on a pytorch_inference process, but that shouldn't end up in the uncaught exception handler and crash the whole node.
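The containment described above can be sketched as follows. This is a hypothetical illustration (names are made up, not the actual `DeploymentManager` code): catch failures while connecting to the native process and fail only the affected deployment, rather than letting the exception escape to the thread's uncaught-exception handler and take down the node.

```java
import java.util.function.Consumer;

// Hypothetical sketch: contain a process-connection failure so it fails only
// the model deployment instead of crashing the whole node.
public class StartDeploymentSketch {
    public static void startAndLoad(Runnable connectToProcess, Consumer<Exception> failDeployment) {
        try {
            // may throw, e.g. a FileNotFoundException for a missing named pipe,
            // wrapped in a RuntimeException
            connectToProcess.run();
        } catch (RuntimeException e) {
            // fail just this model deployment; the node keeps running
            failDeployment.accept(e);
        }
    }
}
```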

The problem in #103808 (comment) is different:

[2024-01-02T01:19:28,501][INFO ][o.e.n.Node               ] [test-cluster-0] starting ...
[2024-01-02T01:19:28,593][INFO ][o.e.x.s.c.f.PersistentCache] [test-cluster-0] persistent cache index loaded
[2024-01-02T01:19:28,609][INFO ][o.e.x.d.l.DeprecationIndexingComponent] [test-cluster-0] deprecation component started
[2024-01-02T01:19:29,381][INFO ][o.e.t.TransportService   ] [test-cluster-0] publish_address {127.0.0.1:36689}, bound_addresses {127.0.0.1:36689}
[2024-01-02T01:19:29,534][WARN ][o.e.n.Node               ] [test-cluster-0] unexpected exception while waiting for http server to close
java.util.concurrent.ExecutionException: java.lang.AssertionError: INITIALIZED -> STOPPED
        at java.util.concurrent.FutureTask.report(FutureTask.java:122) ~[?:?]
        at java.util.concurrent.FutureTask.get(FutureTask.java:191) ~[?:?]
        at org.elasticsearch.node.Node.prepareForClose(Node.java:594) ~[elasticsearch-8.12.0-SNAPSHOT.jar:?]
        at org.elasticsearch.bootstrap.Elasticsearch.shutdown(Elasticsearch.java:468) ~[elasticsearch-8.12.0-SNAPSHOT.jar:?]
        at java.lang.Thread.run(Thread.java:833) ~[?:?]   
Caused by: java.lang.AssertionError: INITIALIZED -> STOPPED 
        at org.elasticsearch.common.component.Lifecycle.canMoveToStopped(Lifecycle.java:127) ~[elasticsearch-8.12.0-SNAPSHOT.jar:?]
        at org.elasticsearch.common.component.AbstractLifecycleComponent.stop(AbstractLifecycleComponent.java:73) ~[elasticsearch-8.12.0-SNAPSHOT.jar:?]
        at org.elasticsearch.node.Node.lambda$prepareForClose$25(Node.java:584) ~[elasticsearch-8.12.0-SNAPSHOT.jar:?]
        at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
        ... 1 more
[2024-01-02T01:19:29,561][INFO ][o.e.n.Node               ] [test-cluster-0] stopping ...
[2024-01-02T01:19:30,174][INFO ][o.e.n.Node               ] [test-cluster-0] stopped 
[2024-01-02T01:19:30,175][INFO ][o.e.n.Node               ] [test-cluster-0] closing ...
[2024-01-02T01:19:30,202][INFO ][o.e.c.c.ClusterBootstrapService] [test-cluster-0] this node has not joined a bootstrapped cluster yet; [cluster.initial_master_nodes] is set to [test-cluster-0, test-cluster-1]
[2024-01-02T01:19:30,204][ERROR][o.e.b.Elasticsearch      ] [test-cluster-0] fatal exception while booting Elasticsearch
java.lang.AssertionError: CLOSED -> STARTED
        at org.elasticsearch.common.component.Lifecycle.canMoveToStarted(Lifecycle.java:109) ~[elasticsearch-8.12.0-SNAPSHOT.jar:?]
        at org.elasticsearch.common.component.AbstractLifecycleComponent.start(AbstractLifecycleComponent.java:44) ~[elasticsearch-8.12.0-SNAPSHOT.jar:?]
        at org.elasticsearch.node.Node.start(Node.java:358) ~[elasticsearch-8.12.0-SNAPSHOT.jar:?]
        at org.elasticsearch.bootstrap.Elasticsearch.start(Elasticsearch.java:458) ~[elasticsearch-8.12.0-SNAPSHOT.jar:?]
        at org.elasticsearch.bootstrap.Elasticsearch.initPhase3(Elasticsearch.java:251) ~[elasticsearch-8.12.0-SNAPSHOT.jar:?]
        at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:73) ~[elasticsearch-8.12.0-SNAPSHOT.jar:?]
[2024-01-02T01:19:30,239][INFO ][o.e.n.Node               ] [test-cluster-0] closed

The nodes seem to be deciding to shut down part way through startup and trip assertions while doing this.
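The two assertions imply a lifecycle state machine in which stopping is only legal from STARTED and a CLOSED component can never be restarted. A minimal sketch of that invariant (illustrative only, not the actual `org.elasticsearch.common.component.Lifecycle` code):

```java
// Minimal sketch of the lifecycle transitions implied by the assertion
// messages "INITIALIZED -> STOPPED" and "CLOSED -> STARTED" above.
public class LifecycleSketch {
    public enum State { INITIALIZED, STARTED, STOPPED, CLOSED }

    // stopping is only legal from STARTED, so "INITIALIZED -> STOPPED" trips an assertion
    public static boolean canMoveToStopped(State state) {
        return state == State.STARTED;
    }

    // a CLOSED component can never start again, hence "CLOSED -> STARTED" is fatal
    public static boolean canMoveToStarted(State state) {
        return state == State.INITIALIZED || state == State.STOPPED;
    }
}
```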

@droberts195 (Contributor)

The pytorch_inference failure and subsequent uncaught exception happened on a different test in #103868.

Both of these failures were using the same kernel, 5.15.0-1047-gcp, which might be related. It's the kernel from Ubuntu 22.04 though, so not that new.

@droberts195 (Contributor)

Whatever the problem is, 8.11.4 doesn't seem to have it, while 8.13.0 does:

bash-3.2$ grep controller build/testrun/v8.11.4_bwcTest/temp/test-cluster11748905467589772195/test-cluster-0/logs/test-cluster.log
[2024-01-02T16:26:15,931][INFO ][o.e.x.m.p.l.CppLogMessageHandler] [test-cluster-0] [controller/14867] [Main.cc@123] controller (64 bit): Version 8.11.4-SNAPSHOT (Build 53b28b969d0b18) Copyright (c) 2024 Elasticsearch BV
[2024-01-02T16:26:46,237][DEBUG][o.e.x.m.p.l.CppLogMessageHandler] [test-cluster-0] [controller/14867] [CDetachedProcessSpawner.cc@301] Spawned './pytorch_inference' with PID 22035
[2024-01-02T16:26:46,240][DEBUG][o.e.x.m.p.l.CppLogMessageHandler] [test-cluster-0] [controller/14867] [CResponseJsonWriter.cc@42] Wrote controller response - id: 1 success: true reason: Process './pytorch_inference' started
[2024-01-02T16:26:47,364][INFO ][o.e.x.m.p.l.CppLogMessageHandler] [test-cluster-0] [controller/14867] [Main.cc@176] ML controller exiting
[2024-01-02T16:26:47,365][DEBUG][o.e.x.m.p.l.CppLogMessageHandler] [test-cluster-0] [controller/14867] [CDetachedProcessSpawner.cc@158] Shutting down spawned process tracker thread
[2024-01-02T16:26:47,365][INFO ][o.e.x.m.p.NativeController] [test-cluster-0] Native controller process has stopped - no new native processes can be started
[2024-01-02T16:28:14,790][INFO ][o.e.x.m.p.l.CppLogMessageHandler] [test-cluster-0] [controller/29868] [Main.cc@123] controller (64 bit): Version 8.13.0-SNAPSHOT (Build c9c232240dd04f) Copyright (c) 2023 Elasticsearch BV
[2024-01-02T16:28:29,453][DEBUG][o.e.x.m.p.l.CppLogMessageHandler] [test-cluster-0] [controller/29868] [CDetachedProcessSpawner.cc@301] Spawned './pytorch_inference' with PID 34663
[2024-01-02T16:28:29,454][DEBUG][o.e.x.m.p.l.CppLogMessageHandler] [test-cluster-0] [controller/29868] [CResponseJsonWriter.cc@42] Wrote controller response - id: 1 success: true reason: Process './pytorch_inference' started
[2024-01-02T16:28:29,503][WARN ][o.e.x.m.p.l.CppLogMessageHandler] [test-cluster-0] [controller/29868] [CDetachedProcessSpawner.cc@202] Child process with PID 34663 has exited with exit code 127
[2024-01-02T16:32:49,735][INFO ][o.e.x.m.p.l.CppLogMessageHandler] [test-cluster-0] [controller/29868] [Main.cc@176] ML controller exiting
[2024-01-02T16:32:49,735][DEBUG][o.e.x.m.p.l.CppLogMessageHandler] [test-cluster-0] [controller/29868] [CDetachedProcessSpawner.cc@158] Shutting down spawned process tracker thread
[2024-01-02T16:32:49,736][INFO ][o.e.x.m.p.NativeController] [test-cluster-0] Native controller process has stopped - no new native processes can be started

So I was thinking this would be caused by reverting IPEX, but c9c232240dd04f is actually the ml-cpp commit from 6th December, before any of the IPEX shenanigans of the last couple of days. Another ml-cpp difference between 8.11 and 8.12/8.13 is the Boost upgrade. But something must have changed elsewhere as well, because nothing had changed in ml-cpp for nearly a month.

droberts195 added a commit to droberts195/elasticsearch that referenced this issue Jan 3, 2024
A couple of test failures (elastic#103808 and elastic#103868) reveal that
an exception thrown while connecting to the pytorch_inference
process can be uncaught and hence cause the whole node to stop.

This change does not fix the underlying problem of failure to
connect to the process that those issues relate to, but it
converts the error from one that crashes a whole node to one
that just fails the affected model deployment.
droberts195 added a commit that referenced this issue Jan 4, 2024
droberts195 added a commit to droberts195/elasticsearch that referenced this issue Jan 4, 2024
elasticsearchmachine pushed a commit that referenced this issue Jan 4, 2024
…3911)

droberts195 added a commit to droberts195/elasticsearch that referenced this issue Feb 22, 2024
The failure of elastic#103808 might be fixed by the upgrade to PyTorch 2.1.2.
It's not good that we have this entire suite muted, so I'll try
unmuting for a bit and see if we get any repeat failures.
@droberts195 (Contributor)

I've opened #105728 to see if upgrading to PyTorch 2.1.2 has fixed the problem that caused pytorch_inference to crash.

droberts195 added a commit that referenced this issue Feb 22, 2024
@elasticsearchmachine elasticsearchmachine added the needs:risk Requires assignment of a risk label (low, medium, blocker) label Apr 24, 2024
@maxhniebergall (Member)

It seems this test has been unmuted since February and hasn't caused any more failures.
