[CI] upgraded_cluster/30_ml_jobs_crud/Test open old jobs failed with .ml-state-write not an alias #59011

Closed
droberts195 opened this issue Jul 3, 2020 · 5 comments · Fixed by #59027
Labels
:ml (Machine learning), >test-failure (Triaged test failures from CI)

Comments

@droberts195
Contributor

Build scan: https://gradle-enterprise.elastic.co/s/svde5ts33nfki

Repro line:

./gradlew ':x-pack:qa:rolling-upgrade:v7.6.2#upgradedClusterTest' --tests "org.elasticsearch.upgrades.UpgradeClusterClientYamlTestSuiteIT.test {p0=upgraded_cluster/30_ml_jobs_crud/Test open old jobs}" \
  -Dtests.seed=95C888B6E2499870 \
  -Dtests.security.manager=true \
  -Dtests.locale=ar-EG \
  -Dtests.timezone=Pacific/Gambier \
  -Druntime.java=8 \
  -Dtests.rest.suite=upgraded_cluster

Reproduces locally?: No

Applicable branches: 7.8 (upgraded from 7.6)

Failure history:

https://build-stats.elastic.co/app/kibana#/discover?_g=(refreshInterval:(pause:!t,value:0),time:(from:'2020-03-03T13:01:09.914Z',mode:absolute,to:'2020-07-03T12:01:15.600Z'))&_a=(columns:!(_source),index:b646ed00-7efc-11e8-bf69-63c8ef516157,interval:auto,query:(language:lucene,query:'%22Test%20open%20old%20jobs%22'),sort:!(process.time-start,desc))

This test has failed many times over the years, but for completely different reasons. The earliest failure I could find with this particular problem was on 5th March 2020.

Failure excerpt:

java.lang.AssertionError: Failure at [upgraded_cluster/30_ml_jobs_crud:110]: field [] doesn't have a true value
   Expected: not a string equal to "false" ignoring case
   but: was "false"

The relevant bit of the YAML file is:

  - do:
      indices.exists_alias:
        name: ".ml-state-write"
  - is_true: ''

We have seen the problem of .ml-state-write being a concrete index rather than an alias in clusters outside of CI. It seems that, very occasionally, it can happen in CI too.
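
For anyone who hits this outside of CI, a quick way to tell the two situations apart is to ask the cluster whether the name resolves as an alias at all. The snippet below is only an illustrative sketch using the Java high-level REST client - checkMlStateWrite is a made-up helper name and the client construction is assumed to exist elsewhere:

import org.elasticsearch.action.admin.indices.alias.get.GetAliasesRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.client.indices.GetIndexRequest;

class MlStateWriteCheck {
    // Reports whether ".ml-state-write" resolves as an alias or has been taken
    // by a concrete index of that name (the broken state seen in this failure).
    static void checkMlStateWrite(RestHighLevelClient client) throws Exception {
        boolean isAlias = client.indices().existsAlias(
            new GetAliasesRequest(".ml-state-write"), RequestOptions.DEFAULT);
        boolean nameExists = client.indices().exists(
            new GetIndexRequest(".ml-state-write"), RequestOptions.DEFAULT);
        if (nameExists && isAlias == false) {
            System.out.println(".ml-state-write exists but is a concrete index, not an alias");
        }
    }
}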

droberts195 added the >test-failure (Triaged test failures from CI) and :ml (Machine learning) labels on Jul 3, 2020
@elasticmachine
Collaborator

Pinging @elastic/ml-core (:ml)

@droberts195
Contributor Author

Interestingly, in x-pack/qa/rolling-upgrade/src/test/resources/rest-api-spec/test/old_cluster/30_ml_jobs_crud.yml we have this:

setup:
  - do:
      index:
        index: .ml-state
        id: "dummy-document-to-make-index-creation-idempotent"
        body: >
          {
          }

  - do:
      cluster.health:
        index: [".ml-state"]
        wait_for_status: green

That strikes me as very dodgy, because it implies that our production code couldn't cope with a possible race condition around the creation of the .ml-state index.

More recently this setup won't even help, which probably explains why these tests sometimes fail now. The index we actually use is .ml-state-000001, with .ml-state-write as its write alias. .ml-state is tolerated as a possible leftover from old versions, but in 7.7 and above it would not be expected to have the .ml-state-write alias pointing at it.

Theory for what happened:

  • Start off in 7.6. .ml-state gets created and we wait for green status, so its replica is assigned and the index is guaranteed to survive a single node being replaced.
  • Upgrade some nodes to 7.8, but at least one still on 7.6. One of the 7.8 nodes decides to create .ml-state-000001 and moves alias .ml-state-write to point at it. The node that takes the primary is a 7.6 node. The replica still hasn't been assigned.
  • Upgrade the remaining nodes to 7.8. The 7.6 node that took the .ml-state-000001 primary is replaced with a 7.8 node. Since the replica had not yet been assigned, the entire .ml-state-000001 index is lost. Since the alias is stored with the index, the .ml-state-write alias is also lost.

@droberts195
Contributor Author

We also have quite a few open issues for "all shards failed" against our internal state or stats indices: #54887, #55221, #55807, #57102, and #58841

I am thinking that MlIndexAndAlias.createIndexAndAliasIfNecessary should wait for shards to be allocated for the index it creates before returning. I will create a PR for this.
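
For context, the kind of wait I have in mind is roughly the following. This is just a sketch of the idea using the high-level REST client for readability - the real change would live in MlIndexAndAlias and use the internal client, and waitForIndexReady is a made-up name:

import org.elasticsearch.action.admin.cluster.health.ClusterHealthRequest;
import org.elasticsearch.action.admin.cluster.health.ClusterHealthResponse;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;

class WaitForShards {
    // After creating an internal index, block until its primary shards are active
    // so that an immediate search cannot fail with "all shards failed".
    static void waitForIndexReady(RestHighLevelClient client, String index) throws Exception {
        ClusterHealthRequest health = new ClusterHealthRequest(index)
            .waitForYellowStatus()               // primaries assigned
            .waitForNoInitializingShards(true);  // and nothing still initializing
        ClusterHealthResponse response = client.cluster().health(health, RequestOptions.DEFAULT);
        if (response.isTimedOut()) {
            throw new IllegalStateException("timed out waiting for shards of [" + index + "]");
        }
    }
}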

droberts195 added a commit to droberts195/elasticsearch that referenced this issue Jul 3, 2020
There have been a few test failures that are likely caused by tests
performing actions that use ML indices immediately after the actions
that create those ML indices.  Currently this can result in attempts
to search the newly created index before its shards have initialized.

This change makes the method that creates the internal ML indices
that have been affected by this problem (state and stats) wait for
the shards to be initialized before returning.

Fixes elastic#54887
Fixes elastic#55221
Fixes elastic#55807
Fixes elastic#57102
Fixes elastic#58841
Fixes elastic#59011
@przemekwitek
Contributor

That strikes me as very dodgy, because it implies that our production code couldn't cope with possible race conditions with the creation of the .ml-state index.

I added this setup stanza when I was renaming .ml-state to .ml-state-000001 (see https://github.com/elastic/elasticsearch/pull/52510/files#diff-90e5b005c474a5a63dafd0aa6ce50cbb). Back then I thought the realistic setup for the old (i.e. pre-upgrade) cluster was to have the .ml-state index created from the template and in a green state.

@droberts195
Contributor Author

Back then I thought the realistic setup for the old (i.e. before upgrade) cluster was to have the .ml-state index created from the template and in a green state.

Oh, I see. Yes, it does make sense in the case where the old cluster is on a version that creates .ml-state-000001 itself, because it simulates a .ml-state left over from a much older version. So on the master branch it's fine.

The complication is that in the 7.x/7.8/7.7 branches the old cluster is sometimes on a version that uses .ml-state itself, and in that case the extra setup code makes the old cluster go through a different code path from the one it would have followed if it had created the index itself. The interesting thing is that the setup creates the .ml-state index but not the alias, which forces the old cluster code down the path where it only creates the alias, not the index. Usually, when starting from scratch, it would create the index and alias simultaneously.

So I bet what this failure shows is that we have a race condition on the code path where we need to create the .ml-state-write alias but not the .ml-state index. (The .ml-state-write alias predates ILM on the state index - it was added in 6.7 by #37483.)
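
To make the two paths concrete, here is a rough illustration (not the actual MlIndexAndAlias code; it only builds the requests, and the class and method names are invented):

import org.elasticsearch.action.admin.indices.alias.Alias;
import org.elasticsearch.action.admin.indices.alias.IndicesAliasesRequest;
import org.elasticsearch.client.indices.CreateIndexRequest;

class MlStatePaths {
    // Path 1: no state index exists yet - the index and its write alias are created
    // together in a single request, so there is no window without the alias.
    static CreateIndexRequest createIndexAndAlias() {
        return new CreateIndexRequest(".ml-state-000001")
            .alias(new Alias(".ml-state-write").writeIndex(true));
    }

    // Path 2: a leftover .ml-state already exists - only the alias is added, as a
    // separate cluster state update, which is the path that looks racy here.
    static IndicesAliasesRequest addAliasOnly() {
        return new IndicesAliasesRequest()
            .addAliasAction(IndicesAliasesRequest.AliasActions.add()
                .index(".ml-state")
                .alias(".ml-state-write")
                .writeIndex(true));
    }
}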

droberts195 added a commit that referenced this issue Jul 6, 2020
…#59027)

There have been a few test failures that are likely caused by tests
performing actions that use ML indices immediately after the actions
that create those ML indices.  Currently this can result in attempts
to search the newly created index before its shards have initialized.

This change makes the method that creates the internal ML indices
that have been affected by this problem (state and stats) wait for
the shards to be initialized before returning.

Fixes #54887
Fixes #55221
Fixes #55807
Fixes #57102
Fixes #58841
Fixes #59011