
[CI] MLModelDeploymentFullClusterRestartIT bwc tests failing #103808

Closed
DaveCTurner opened this issue Jan 2, 2024 · 8 comments
Labels
:ml (Machine learning), needs:risk (Requires assignment of a risk label: low, medium, blocker), Team:ML (Meta label for the ML team), >test-failure (Triaged test failures from CI)

Comments

@DaveCTurner (Contributor)

Build scan:
https://gradle-enterprise.elastic.co/s/d66ig3jihd6pq/tests/:x-pack:qa:full-cluster-restart:v8.11.4%23bwcTest/org.elasticsearch.xpack.restart.MLModelDeploymentFullClusterRestartIT/testDeploymentSurvivesRestart%20%7Bcluster=UPGRADED%7D

Reproduction line:

./gradlew ':x-pack:qa:full-cluster-restart:v8.11.4#bwcTest' -Dtests.class="org.elasticsearch.xpack.restart.MLModelDeploymentFullClusterRestartIT" -Dtests.method="testDeploymentSurvivesRestart {cluster=UPGRADED}" -Dtests.seed=339D93ECF7B44080 -Dtests.bwc=true -Dtests.locale=ga -Dtests.timezone=Africa/Algiers -Druntime.java=21

Applicable branches:
main

Reproduces locally?:
Didn't try

Failure history:
Failure dashboard for org.elasticsearch.xpack.restart.MLModelDeploymentFullClusterRestartIT#testDeploymentSurvivesRestart {cluster=UPGRADED}

Failure excerpt:

java.lang.AssertionError: Inference failed

  at __randomizedtesting.SeedInfo.seed([339D93ECF7B44080:DAC69A8C811AA6A0]:0)
  at org.elasticsearch.xpack.restart.MLModelDeploymentFullClusterRestartIT.lambda$testDeploymentSurvivesRestart$2(MLModelDeploymentFullClusterRestartIT.java:120)
  at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:1278)
  at org.elasticsearch.xpack.restart.MLModelDeploymentFullClusterRestartIT.testDeploymentSurvivesRestart(MLModelDeploymentFullClusterRestartIT.java:113)
  at jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103)
  at java.lang.reflect.Method.invoke(Method.java:580)
  at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1758)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:946)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:982)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:996)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.tests.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:48)
  at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
  at org.apache.lucene.tests.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:45)
  at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
  at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:843)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:490)
  at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:955)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:840)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:891)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:902)
  at org.elasticsearch.test.cluster.local.DefaultLocalElasticsearchCluster$1.evaluate(DefaultLocalElasticsearchCluster.java:47)
  at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.tests.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:38)
  at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
  at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.tests.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
  at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
  at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
  at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
  at org.apache.lucene.tests.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:47)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl.lambda$forkTimeoutingTask$0(ThreadLeakControl.java:850)
  at java.lang.Thread.run(Thread.java:1583)

  Caused by: org.elasticsearch.client.ResponseException: method [POST], host [http://[::1]:37373], URI [/_ml/trained_models/trained-model-full-cluster-restart/deployment/_infer], status line [HTTP/1.1 408 Request Timeout]
  Warnings: [[POST /_ml/trained_models/{model_id}/deployment/_infer] is deprecated! Use [POST /_ml/trained_models/{model_id}/_infer] instead.]
  {"error":{"root_cause":[{"type":"status_exception","reason":"timeout [10s] waiting for inference result"}],"type":"status_exception","reason":"timeout [10s] waiting for inference result"},"status":408}

    at org.elasticsearch.client.RestClient.convertResponse(RestClient.java:351)
    at org.elasticsearch.client.RestClient.performRequest(RestClient.java:317)
    at org.elasticsearch.client.RestClient.performRequest(RestClient.java:292)
    at org.elasticsearch.xpack.restart.MLModelDeploymentFullClusterRestartIT.infer(MLModelDeploymentFullClusterRestartIT.java:233)
    at org.elasticsearch.xpack.restart.MLModelDeploymentFullClusterRestartIT.assertInfer(MLModelDeploymentFullClusterRestartIT.java:145)
    at org.elasticsearch.xpack.restart.MLModelDeploymentFullClusterRestartIT.lambda$testDeploymentSurvivesRestart$2(MLModelDeploymentFullClusterRestartIT.java:115)
    at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:1278)
    at org.elasticsearch.xpack.restart.MLModelDeploymentFullClusterRestartIT.testDeploymentSurvivesRestart(MLModelDeploymentFullClusterRestartIT.java:113)
    at jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103)
    at java.lang.reflect.Method.invoke(Method.java:580)
    at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1758)
    at com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:946)
    at com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:982)
    at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:996)
    at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
    at org.apache.lucene.tests.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:48)
    at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
    at org.apache.lucene.tests.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:45)
    at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
    at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
    at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
    at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
    at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:843)
    at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:490)
    at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:955)
    at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:840)
    at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:891)
    at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:902)
    at org.elasticsearch.test.cluster.local.DefaultLocalElasticsearchCluster$1.evaluate(DefaultLocalElasticsearchCluster.java:47)
    at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
    at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
    at org.apache.lucene.tests.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:38)
    at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
    at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
    at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
    at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
    at org.apache.lucene.tests.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
    at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
    at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
    at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
    at org.apache.lucene.tests.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:47)
    at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
    at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
    at com.carrotsearch.randomizedtesting.ThreadLeakControl.lambda$forkTimeoutingTask$0(ThreadLeakControl.java:850)
    at java.lang.Thread.run(Thread.java:1583)
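For context, the outer `AssertionError: Inference failed` comes from `ESTestCase.assertBusy`, which retries the inference assertion until a deadline and rethrows the last failure. A simplified, illustrative sketch of that retry pattern (the real implementation backs off between attempts and takes a configurable timeout):

```java
public class AssertBusySketch {
    // Illustrative sketch of the ESTestCase.assertBusy pattern, not the real code:
    // retry the assertion until it passes or the deadline expires, then rethrow
    // the most recent failure (which is what surfaces in the excerpt above).
    public static void assertBusy(Runnable assertion, long timeoutMillis) throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (true) {
            try {
                assertion.run();
                return; // condition satisfied
            } catch (AssertionError e) {
                if (System.currentTimeMillis() >= deadline) {
                    throw e; // timed out: surface the last failure
                }
                Thread.sleep(10); // wait and retry while the cluster settles
            }
        }
    }
}
```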

@DaveCTurner DaveCTurner added :ml Machine learning >test-failure Triaged test failures from CI labels Jan 2, 2024
@elasticsearchmachine elasticsearchmachine added blocker Team:ML Meta label for the ML team labels Jan 2, 2024
@elasticsearchmachine (Collaborator)

Pinging @elastic/ml-core (Team:ML)

@DaveCTurner (Contributor, Author)

The linked failure was on a PR branch but not one which changes anything in this area, and there have been other related failures today e.g. https://gradle-enterprise.elastic.co/s/ecrinp2ljstju/tests/task/:x-pack:qa:full-cluster-restart:v8.12.0%23bwcTest/details/org.elasticsearch.xpack.restart.MLModelDeploymentFullClusterRestartIT

@davidkyle (Member)

This is probably related to #103591, as there is a timeout waiting for the process to finish: "Timed out waiting for pytorch results processor to complete for model id trained-model-full-cluster-restart".

[2024-01-02T16:31:39,769][WARN ][o.e.x.m.p.AbstractNativeProcess] [test-cluster-0] [trained-model-full-cluster-restart] Exception closing the running pytorch_inference process
java.util.concurrent.TimeoutException: Timed out waiting for pytorch results processor to complete for model id trained-model-full-cluster-restart
  at org.elasticsearch.ml@8.13.0-SNAPSHOT/org.elasticsearch.xpack.ml.inference.pytorch.process.PyTorchResultProcessor.awaitCompletion(PyTorchResultProcessor.java:326)
  at org.elasticsearch.ml@8.13.0-SNAPSHOT/org.elasticsearch.xpack.ml.inference.deployment.DeploymentManager$ProcessContext.lambda$startAndLoad$1(DeploymentManager.java:503)
  at org.elasticsearch.ml@8.13.0-SNAPSHOT/org.elasticsearch.xpack.ml.inference.pytorch.process.NativePyTorchProcess.afterProcessInStreamClose(NativePyTorchProcess.java:81)
  at org.elasticsearch.ml@8.13.0-SNAPSHOT/org.elasticsearch.xpack.ml.process.AbstractNativeProcess.close(AbstractNativeProcess.java:192)
  at org.elasticsearch.base@8.13.0-SNAPSHOT/org.elasticsearch.core.IOUtils.close(IOUtils.java:71)
  at org.elasticsearch.ml@8.13.0-SNAPSHOT/org.elasticsearch.xpack.ml.inference.pytorch.process.NativePyTorchProcessFactory.createProcess(NativePyTorchProcessFactory.java:94)
  at org.elasticsearch.ml@8.13.0-SNAPSHOT/org.elasticsearch.xpack.ml.inference.pytorch.process.NativePyTorchProcessFactory.createProcess(NativePyTorchProcessFactory.java:31)
  at org.elasticsearch.ml@8.13.0-SNAPSHOT/org.elasticsearch.xpack.ml.inference.deployment.DeploymentManager$ProcessContext.startAndLoad(DeploymentManager.java:500)
  at org.elasticsearch.ml@8.13.0-SNAPSHOT/org.elasticsearch.xpack.ml.inference.deployment.DeploymentManager.lambda$startDeployment$4(DeploymentManager.java:204)
  at org.elasticsearch.server@8.13.0-SNAPSHOT/org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:917)
  at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
  at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
  at java.base/java.lang.Thread.run(Thread.java:1583)
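The `awaitCompletion` timeout in the trace above can be sketched as a latch that the results-processing thread counts down when it finishes, with `close()` waiting on it for a bounded time. This is an illustrative sketch only, not the actual `PyTorchResultProcessor` code:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Illustrative sketch of the completion-wait pattern: the processor thread
// signals completion via a latch; awaitCompletion throws the TimeoutException
// seen in the log if the processor never finishes in time.
public class ResultProcessorSketch {
    private final CountDownLatch completion = new CountDownLatch(1);

    // called by the results-processing thread once the result stream is exhausted
    public void onProcessingFinished() {
        completion.countDown();
    }

    public void awaitCompletion(long timeout, TimeUnit unit) throws TimeoutException, InterruptedException {
        if (completion.await(timeout, unit) == false) {
            throw new TimeoutException("Timed out waiting for results processor to complete");
        }
    }
}
```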

There is also an NPE in the model assignment code to investigate:

[2024-01-02T16:28:23,953][WARN ][o.e.c.s.ClusterApplierService] [test-cluster-0] failed to notify ClusterStateListener
java.lang.NullPointerException
  at java.base/java.util.Objects.requireNonNull(Objects.java:233)
  at org.elasticsearch.ml@8.13.0-SNAPSHOT/org.elasticsearch.xpack.ml.inference.assignment.TrainedModelAssignmentClusterService.detectReasonIfMlJobsStopped(TrainedModelAssignmentClusterService.java:1060)
  at org.elasticsearch.ml@8.13.0-SNAPSHOT/org.elasticsearch.xpack.ml.inference.assignment.TrainedModelAssignmentClusterService.detectReasonToRebalanceModels(TrainedModelAssignmentClusterService.java:1037)
  at org.elasticsearch.ml@8.13.0-SNAPSHOT/org.elasticsearch.xpack.ml.inference.assignment.TrainedModelAssignmentClusterService.clusterChanged(TrainedModelAssignmentClusterService.java:182)
  at org.elasticsearch.server@8.13.0-SNAPSHOT/org.elasticsearch.cluster.service.ClusterApplierService.callClusterStateListener(ClusterApplierService.java:560)
  at org.elasticsearch.server@8.13.0-SNAPSHOT/org.elasticsearch.cluster.service.ClusterApplierService.callClusterStateListeners(ClusterApplierService.java:546)
  at org.elasticsearch.server@8.13.0-SNAPSHOT/org.elasticsearch.cluster.service.ClusterApplierService.applyChanges(ClusterApplierService.java:505)
  at org.elasticsearch.server@8.13.0-SNAPSHOT/org.elasticsearch.cluster.service.ClusterApplierService.runTask(ClusterApplierService.java:429)
  at org.elasticsearch.server@8.13.0-SNAPSHOT/org.elasticsearch.cluster.service.ClusterApplierService$UpdateTask.run(ClusterApplierService.java:154)
  at org.elasticsearch.server@8.13.0-SNAPSHOT/org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:917)
  at org.elasticsearch.server@8.13.0-SNAPSHOT/org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:217)
  at org.elasticsearch.server@8.13.0-SNAPSHOT/org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:183)
  at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
  at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
  at java.base/java.lang.Thread.run(Thread.java:1583)

Mute incoming

@droberts195 (Contributor)

The two linked build scans show different problems. The one in the issue description is an ML bug:

[2024-01-02T16:31:39,773][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [test-cluster-0] uncaught exception in thread [elasticsearch[test-cluster-0][ml_utility][T#1]]
org.elasticsearch.ElasticsearchException: Failed to connect to pytorch process for job trained-model-full-cluster-restart
        at org.elasticsearch.xpack.core.ml.utils.ExceptionsHelper.serverError(ExceptionsHelper.java:66) ~[?:?]
        at org.elasticsearch.xpack.ml.inference.pytorch.process.NativePyTorchProcessFactory.createProcess(NativePyTorchProcessFactory.java:98) ~[?:?]
        at org.elasticsearch.xpack.ml.inference.pytorch.process.NativePyTorchProcessFactory.createProcess(NativePyTorchProcessFactory.java:31) ~[?:?]
        at org.elasticsearch.xpack.ml.inference.deployment.DeploymentManager$ProcessContext.startAndLoad(DeploymentManager.java:500) ~[?:?]
        at org.elasticsearch.xpack.ml.inference.deployment.DeploymentManager.lambda$startDeployment$4(DeploymentManager.java:204) ~[?:?]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:917) ~[elasticsearch-8.13.0-SNAPSHOT.jar:?]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) ~[?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) ~[?:?]
        at java.lang.Thread.run(Thread.java:1583) ~[?:?]
Caused by: java.io.FileNotFoundException: /dev/shm/bk/bk-agent-prod-gcp-1704208765331878120/elastic/elasticsearch-pull-request/x-pack/qa/full-cluster-restart/build/testrun/v8.11.4_bwcTest/temp/test-cluster11748905467589772195/test-cluster-0/tmp/pytorch_inference_trained-model-full-cluster-restart_log_29291 (No such file or directory)
        at java.io.FileInputStream.open0(Native Method) ~[?:?]
        at java.io.FileInputStream.open(FileInputStream.java:213) ~[?:?]
        at java.io.FileInputStream.<init>(FileInputStream.java:152) ~[?:?]
        at java.io.FileInputStream.<init>(FileInputStream.java:106) ~[?:?]
        at org.elasticsearch.xpack.ml.utils.NamedPipeHelper$PrivilegedInputPipeOpener.run(NamedPipeHelper.java:288) ~[?:?]
        at org.elasticsearch.xpack.ml.utils.NamedPipeHelper$PrivilegedInputPipeOpener.run(NamedPipeHelper.java:277) ~[?:?]
        at java.security.AccessController.doPrivileged(AccessController.java:319) ~[?:?]
        at org.elasticsearch.xpack.ml.utils.NamedPipeHelper.openNamedPipeInputStream(NamedPipeHelper.java:130) ~[?:?]
        at org.elasticsearch.xpack.ml.utils.NamedPipeHelper.openNamedPipeInputStream(NamedPipeHelper.java:97) ~[?:?]
        at org.elasticsearch.xpack.ml.process.ProcessPipes.connectLogStream(ProcessPipes.java:158) ~[?:?]
        at org.elasticsearch.xpack.ml.process.AbstractNativeProcess.start(AbstractNativeProcess.java:92) ~[?:?]
        at org.elasticsearch.xpack.ml.inference.pytorch.process.NativePyTorchProcessFactory.createProcess(NativePyTorchProcessFactory.java:89) ~[?:?]
        ... 7 more

There's been a problem connecting the logging stream on a pytorch_inference process, but that shouldn't end up in the uncaught exception handler and crash the whole node.
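The containment described above can be sketched as follows. This is a hypothetical illustration (names are made up, not the actual `DeploymentManager` code): catch failures while connecting to the native process and fail only the affected deployment, rather than letting the exception escape to the thread's uncaught-exception handler and take down the node.

```java
import java.util.function.Consumer;

// Hypothetical sketch: contain a process-connection failure so it fails only
// the model deployment instead of crashing the whole node.
public class StartDeploymentSketch {
    public static void startAndLoad(Runnable connectToProcess, Consumer<Exception> failDeployment) {
        try {
            // may throw, e.g. a FileNotFoundException for a missing named pipe,
            // wrapped in a RuntimeException
            connectToProcess.run();
        } catch (RuntimeException e) {
            // fail just this model deployment; the node keeps running
            failDeployment.accept(e);
        }
    }
}
```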

The problem in #103808 (comment) is different:

[2024-01-02T01:19:28,501][INFO ][o.e.n.Node               ] [test-cluster-0] starting ...
[2024-01-02T01:19:28,593][INFO ][o.e.x.s.c.f.PersistentCache] [test-cluster-0] persistent cache index loaded
[2024-01-02T01:19:28,609][INFO ][o.e.x.d.l.DeprecationIndexingComponent] [test-cluster-0] deprecation component started
[2024-01-02T01:19:29,381][INFO ][o.e.t.TransportService   ] [test-cluster-0] publish_address {127.0.0.1:36689}, bound_addresses {127.0.0.1:36689}
[2024-01-02T01:19:29,534][WARN ][o.e.n.Node               ] [test-cluster-0] unexpected exception while waiting for http server to close
java.util.concurrent.ExecutionException: java.lang.AssertionError: INITIALIZED -> STOPPED
        at java.util.concurrent.FutureTask.report(FutureTask.java:122) ~[?:?]
        at java.util.concurrent.FutureTask.get(FutureTask.java:191) ~[?:?]
        at org.elasticsearch.node.Node.prepareForClose(Node.java:594) ~[elasticsearch-8.12.0-SNAPSHOT.jar:?]
        at org.elasticsearch.bootstrap.Elasticsearch.shutdown(Elasticsearch.java:468) ~[elasticsearch-8.12.0-SNAPSHOT.jar:?]
        at java.lang.Thread.run(Thread.java:833) ~[?:?]   
Caused by: java.lang.AssertionError: INITIALIZED -> STOPPED 
        at org.elasticsearch.common.component.Lifecycle.canMoveToStopped(Lifecycle.java:127) ~[elasticsearch-8.12.0-SNAPSHOT.jar:?]
        at org.elasticsearch.common.component.AbstractLifecycleComponent.stop(AbstractLifecycleComponent.java:73) ~[elasticsearch-8.12.0-SNAPSHOT.jar:?]
        at org.elasticsearch.node.Node.lambda$prepareForClose$25(Node.java:584) ~[elasticsearch-8.12.0-SNAPSHOT.jar:?]
        at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
        ... 1 more
[2024-01-02T01:19:29,561][INFO ][o.e.n.Node               ] [test-cluster-0] stopping ...
[2024-01-02T01:19:30,174][INFO ][o.e.n.Node               ] [test-cluster-0] stopped 
[2024-01-02T01:19:30,175][INFO ][o.e.n.Node               ] [test-cluster-0] closing ...
[2024-01-02T01:19:30,202][INFO ][o.e.c.c.ClusterBootstrapService] [test-cluster-0] this node has not joined a bootstrapped cluster yet; [cluster.initial_master_nodes] is set to [test-cluster-0, test-cluster-1]
[2024-01-02T01:19:30,204][ERROR][o.e.b.Elasticsearch      ] [test-cluster-0] fatal exception while booting Elasticsearch
java.lang.AssertionError: CLOSED -> STARTED
        at org.elasticsearch.common.component.Lifecycle.canMoveToStarted(Lifecycle.java:109) ~[elasticsearch-8.12.0-SNAPSHOT.jar:?]
        at org.elasticsearch.common.component.AbstractLifecycleComponent.start(AbstractLifecycleComponent.java:44) ~[elasticsearch-8.12.0-SNAPSHOT.jar:?]
        at org.elasticsearch.node.Node.start(Node.java:358) ~[elasticsearch-8.12.0-SNAPSHOT.jar:?]
        at org.elasticsearch.bootstrap.Elasticsearch.start(Elasticsearch.java:458) ~[elasticsearch-8.12.0-SNAPSHOT.jar:?]
        at org.elasticsearch.bootstrap.Elasticsearch.initPhase3(Elasticsearch.java:251) ~[elasticsearch-8.12.0-SNAPSHOT.jar:?]
        at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:73) ~[elasticsearch-8.12.0-SNAPSHOT.jar:?]
[2024-01-02T01:19:30,239][INFO ][o.e.n.Node               ] [test-cluster-0] closed

The nodes seem to be deciding to shut down part way through startup and trip assertions while doing this.
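The two assertions imply a lifecycle state machine in which stopping is only legal from STARTED and a CLOSED component can never be restarted. A minimal sketch of that invariant (illustrative only, not the actual `org.elasticsearch.common.component.Lifecycle` code):

```java
// Minimal sketch of the lifecycle transitions implied by the assertion
// messages "INITIALIZED -> STOPPED" and "CLOSED -> STARTED" above.
public class LifecycleSketch {
    public enum State { INITIALIZED, STARTED, STOPPED, CLOSED }

    // stopping is only legal from STARTED, so "INITIALIZED -> STOPPED" trips an assertion
    public static boolean canMoveToStopped(State state) {
        return state == State.STARTED;
    }

    // a CLOSED component can never start again, hence "CLOSED -> STARTED" is fatal
    public static boolean canMoveToStarted(State state) {
        return state == State.INITIALIZED || state == State.STOPPED;
    }
}
```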

@droberts195 (Contributor)

The pytorch_inference failure and subsequent uncaught exception happened on a different test in #103868.

Both of these failures were using the same kernel, 5.15.0-1047-gcp, which might be related. It's the kernel from Ubuntu 22.04 though, so not that new.

@droberts195 (Contributor)

Whatever the problem is, 8.11.4 doesn't seem to have it, while 8.13.0 does:

bash-3.2$ grep controller build/testrun/v8.11.4_bwcTest/temp/test-cluster11748905467589772195/test-cluster-0/logs/test-cluster.log
[2024-01-02T16:26:15,931][INFO ][o.e.x.m.p.l.CppLogMessageHandler] [test-cluster-0] [controller/14867] [Main.cc@123] controller (64 bit): Version 8.11.4-SNAPSHOT (Build 53b28b969d0b18) Copyright (c) 2024 Elasticsearch BV
[2024-01-02T16:26:46,237][DEBUG][o.e.x.m.p.l.CppLogMessageHandler] [test-cluster-0] [controller/14867] [CDetachedProcessSpawner.cc@301] Spawned './pytorch_inference' with PID 22035
[2024-01-02T16:26:46,240][DEBUG][o.e.x.m.p.l.CppLogMessageHandler] [test-cluster-0] [controller/14867] [CResponseJsonWriter.cc@42] Wrote controller response - id: 1 success: true reason: Process './pytorch_inference' started
[2024-01-02T16:26:47,364][INFO ][o.e.x.m.p.l.CppLogMessageHandler] [test-cluster-0] [controller/14867] [Main.cc@176] ML controller exiting
[2024-01-02T16:26:47,365][DEBUG][o.e.x.m.p.l.CppLogMessageHandler] [test-cluster-0] [controller/14867] [CDetachedProcessSpawner.cc@158] Shutting down spawned process tracker thread
[2024-01-02T16:26:47,365][INFO ][o.e.x.m.p.NativeController] [test-cluster-0] Native controller process has stopped - no new native processes can be started
[2024-01-02T16:28:14,790][INFO ][o.e.x.m.p.l.CppLogMessageHandler] [test-cluster-0] [controller/29868] [Main.cc@123] controller (64 bit): Version 8.13.0-SNAPSHOT (Build c9c232240dd04f) Copyright (c) 2023 Elasticsearch BV
[2024-01-02T16:28:29,453][DEBUG][o.e.x.m.p.l.CppLogMessageHandler] [test-cluster-0] [controller/29868] [CDetachedProcessSpawner.cc@301] Spawned './pytorch_inference' with PID 34663
[2024-01-02T16:28:29,454][DEBUG][o.e.x.m.p.l.CppLogMessageHandler] [test-cluster-0] [controller/29868] [CResponseJsonWriter.cc@42] Wrote controller response - id: 1 success: true reason: Process './pytorch_inference' started
[2024-01-02T16:28:29,503][WARN ][o.e.x.m.p.l.CppLogMessageHandler] [test-cluster-0] [controller/29868] [CDetachedProcessSpawner.cc@202] Child process with PID 34663 has exited with exit code 127
[2024-01-02T16:32:49,735][INFO ][o.e.x.m.p.l.CppLogMessageHandler] [test-cluster-0] [controller/29868] [Main.cc@176] ML controller exiting
[2024-01-02T16:32:49,735][DEBUG][o.e.x.m.p.l.CppLogMessageHandler] [test-cluster-0] [controller/29868] [CDetachedProcessSpawner.cc@158] Shutting down spawned process tracker thread
[2024-01-02T16:32:49,736][INFO ][o.e.x.m.p.NativeController] [test-cluster-0] Native controller process has stopped - no new native processes can be started

So I was thinking this would be caused by reverting IPEX, but c9c232240dd04f is actually the ml-cpp commit from 6th December, before any of the IPEX shenanigans of the last couple of days. Another ml-cpp difference between 8.11 and 8.12/8.13 is the Boost upgrade. But something must have changed elsewhere as well, because nothing had changed in ml-cpp for nearly a month.

droberts195 added a commit to droberts195/elasticsearch that referenced this issue Jan 3, 2024
A couple of test failures (elastic#103808 and elastic#103868) reveal that
an exception thrown while connecting to the pytorch_inference
process can be uncaught and hence cause the whole node to stop.

This change does not fix the underlying problem of failure to
connect to the process that those issues relate to, but it
converts the error from one that crashes a whole node to one
that just fails the affected model deployment.
droberts195 added a commit that referenced this issue Jan 4, 2024
droberts195 added a commit to droberts195/elasticsearch that referenced this issue Jan 4, 2024
elasticsearchmachine pushed a commit that referenced this issue Jan 4, 2024
…3911)

droberts195 added a commit to droberts195/elasticsearch that referenced this issue Feb 22, 2024
The failure of elastic#103808 might be fixed by the upgrade to PyTorch 2.1.2.
It's not good that we have this entire suite muted, so I'll try
unmuting for a bit and see if we get any repeat failures.
@droberts195 (Contributor)

I've opened #105728 to see if upgrading to PyTorch 2.1.2 has fixed the problem that caused pytorch_inference to crash.

droberts195 added a commit that referenced this issue Feb 22, 2024
@elasticsearchmachine elasticsearchmachine added the needs:risk Requires assignment of a risk label (low, medium, blocker) label Apr 24, 2024
@maxhniebergall (Member)

It seems this test has been unmuted since February and hasn't caused any more failures.
