[ML] Update index mappings on process start, not job open #37607

Closed
droberts195 opened this issue Jan 18, 2019 · 7 comments

@droberts195
Contributor

At present we update index mappings for the state and results indices in TransportOpenJobAction. This dates back to the 5.4 to 5.5 upgrade, when we knew 5.4 jobs would not run in 5.5 because there were special checks to prevent it.

Unfortunately, updating the index mappings when a job is opened is not sufficient to ensure the mappings are correct by the time documents requiring them are indexed. When a rolling upgrade is done with ML jobs open, documents can be indexed before the mappings are correct. These documents cause dynamic mappings to be created, and then when a subsequent job open is called (possibly for a different job) an error results because the existing mappings cannot be updated.

The solution is to update the mappings on process start, not on job open. This is similar to the change made in e194d8e on #37483. (Thankfully with that one we noticed the problem in the initial review phase.)

Although the problem has existed since 5.6, version 6.5 is more likely to suffer from it because (a) the validation for enabled=false has been tightened up in #33933 and (b) in 6.5 we introduced the multi_bucket_impact field with mapping type double.
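
To illustrate the kind of clash described above, here is a minimal Python sketch (not taken from the Elasticsearch code base) that compares the `properties` section of a live index mapping with the intended one and reports fields whose types differ. The function name `find_mapping_clashes` and the sample mappings are hypothetical; the `float` type in the sample assumes the field was first indexed with a fractional JSON number, which dynamic mapping would map to `float` rather than `double`.

```python
def find_mapping_clashes(actual_props, desired_props, prefix=""):
    """Return (field_path, actual_type, desired_type) for every field whose
    type in the live mapping differs from the desired mapping."""
    clashes = []
    for name, desired in desired_props.items():
        actual = actual_props.get(name)
        if actual is None:
            continue  # field not mapped yet, so it can still be added cleanly
        path = prefix + name
        # A field with no explicit "type" is treated as an object here.
        actual_type = actual.get("type", "object")
        desired_type = desired.get("type", "object")
        if actual_type != desired_type:
            clashes.append((path, actual_type, desired_type))
        elif "properties" in desired:
            # Same type on both sides: descend into sub-fields.
            clashes.extend(find_mapping_clashes(
                actual.get("properties", {}), desired["properties"], path + "."))
    return clashes


# Hypothetical example: a document was indexed before the mapping update,
# so dynamic mapping chose "float" where the results mapping wants "double".
actual = {"multi_bucket_impact": {"type": "float"}, "job_id": {"type": "keyword"}}
desired = {"multi_bucket_impact": {"type": "double"}, "job_id": {"type": "keyword"}}
clashes = find_mapping_clashes(actual, desired)
print(clashes)
```

A non-empty result for a results index would suggest that index needs the reindex-based recovery, since the type of an existing field cannot be changed in place.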

The only workaround to recover from dynamic mappings that clash with the desired mappings is to reindex the affected index while preserving all aliases, and this is hard. Therefore we should fix this as a priority for 6.6.1.

The fix will only stop the mappings inconsistency being created in the future. It will not help anyone who has already suffered from mappings inconsistency. I will paste the steps to recover by reindexing into this issue once they are validated.

@droberts195 droberts195 added >bug :ml Machine learning labels Jan 18, 2019
@elasticmachine
Collaborator

Pinging @elastic/ml-core

@droberts195 droberts195 self-assigned this Jan 18, 2019
droberts195 added a commit to droberts195/elasticsearch that referenced this issue Jan 22, 2019
This change moves the update to the results index mappings
from the open job action to the code that starts the
autodetect process.

When a rolling upgrade is performed we need to update the
mappings for already-open jobs that are reassigned from an
old version node to a new version node, but the open job
action is not called in this case.

Closes elastic#37607
droberts195 added a commit that referenced this issue Jan 23, 2019
droberts195 added a commit to droberts195/elasticsearch that referenced this issue Jan 23, 2019
droberts195 added a commit that referenced this issue Jan 28, 2019
droberts195 added a commit that referenced this issue Jan 29, 2019
@smalenfant

I would love to see the manual steps if possible. I have a few ML jobs stuck in close due to this issue.

@droberts195
Contributor Author

droberts195 commented Feb 12, 2019

Here are the steps needed to fix the issue (as run from the dev tools console):

  1. Stop all datafeeds and close all jobs and tell other users not to open any more until this process is complete. It is probably good to note which jobs were open in order to reopen them at the end of this process.
POST _xpack/ml/datafeeds/*/_stop
POST _xpack/ml/anomaly_detectors/*/_close
  2. Get aliases of ML results indices
GET .ml-anomalies-*/_alias

The response should be stored in a file (let's call it ml_results_alias_response.txt)

  3. Reindex all ML results indices into a temporary index. This has to be done for each index starting with .ml-anomalies-.
POST _reindex
{
  "source": {
    "index": "{index_name}"
  },
  "dest": {
    "index": "tmp-{index_name}"
  }
}
  4. Delete original indices. This has to be done for each index starting with .ml-anomalies-.
DELETE {index_name}
  5. Reindex temporary indices back to their original names.
POST _reindex
{
  "source": {
    "index": "tmp-{index_name}"
  },
  "dest": {
    "index": "{index_name}"
  }
}
  6. Restore aliases. This script generates the necessary body to the post aliases request. It is a bash script but it does depend on jq to be available. It can be used with the file from step 2.
./gen_post_aliases_body.sh < ml_results_alias_response.txt

Then copy the output and use it as the body in the following request:

POST _aliases
{BODY}
  7. Delete the temporary indices created

  8. Reopen any jobs that were closed in step 1.

This process will ensure the results indices have the correct mappings after the upgrade to 6.5.x.
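
For readers without bash and jq, the request bodies used in the steps above can also be generated with a few lines of Python. This is a hedged sketch, not the attached gen_post_aliases_body.sh script, whose contents are not shown in this thread. The function names `reindex_bodies` and `aliases_body`, the index name `.ml-anomalies-shared`, and the sample alias response are all hypothetical. It assumes the saved `GET .ml-anomalies-*/_alias` response has the usual shape of index name, then `aliases`, then alias name, then alias properties, and that those properties (such as `filter`) can be passed straight through to an `add` action.

```python
import json


def reindex_bodies(index_name, tmp_prefix="tmp-"):
    """Return the two _reindex request bodies for one index: first into the
    temporary index, then back to the original name."""
    tmp_name = tmp_prefix + index_name
    to_tmp = {"source": {"index": index_name}, "dest": {"index": tmp_name}}
    from_tmp = {"source": {"index": tmp_name}, "dest": {"index": index_name}}
    return to_tmp, from_tmp


def aliases_body(alias_response):
    """Turn a saved GET <indices>/_alias response into a POST _aliases body
    that re-adds every alias to the index it pointed at."""
    actions = []
    for index_name in sorted(alias_response):
        aliases = alias_response[index_name].get("aliases", {})
        for alias_name in sorted(aliases):
            # Carry over alias properties (for example "filter") unchanged.
            add = {**aliases[alias_name], "index": index_name, "alias": alias_name}
            actions.append({"add": add})
    return {"actions": actions}


# Hypothetical saved response for a single shared results index.
saved = {
    ".ml-anomalies-shared": {
        "aliases": {
            ".ml-anomalies-ttms": {"filter": {"term": {"job_id": "ttms"}}},
            ".ml-anomalies-.write-ttms": {},
        }
    }
}

to_tmp, from_tmp = reindex_bodies(".ml-anomalies-shared")
body = aliases_body(saved)
print(json.dumps(to_tmp, indent=2))
print(json.dumps(body, indent=2))
```

With a real saved response, `aliases_body(json.load(open("ml_results_alias_response.txt")))` would produce the body to paste into `POST _aliases`. It is worth checking the generated actions against the saved file before deleting anything.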

@smalenfant

@droberts195 Thanks for this info. One thing missing is the script; it looks like I don't have access to that repo (404).

@droberts195
Contributor Author

Sorry about that @smalenfant. I edited the big comment above to make the script an attachment of this issue. I had to rename it with a .txt extension to do this, so after downloading rename the file to remove the .txt extension and chmod +x it.

@smalenfant

@droberts195 We didn't have time to go through the re-indexing process, but we did upgrade to 6.6.1. The mapping problem went away, although my jobs can't seem to open at all now. Might be a totally different issue.

POST _xpack/ml/anomaly_detectors/ttms/_open
{
 "statusCode": 504,
 "error": "Gateway Time-out",
 "message": "Client request timeout"
}

@droberts195
Contributor Author

@smalenfant like you say, your new problem could be something completely different. If you have a support contract please open a support case for it. Then our support team can lead you through the process of gathering enough information to diagnose what's wrong. If you don't have a support contract please ask on the Discuss forum. Tag your question with the machine-learning tag so we don't miss it.
