.fleet-actions-results data stream cannot be restored via the fleet feature state #89261

Open
romain-chanu opened this issue Aug 11, 2022 · 5 comments
Labels: >bug, :Core/Infra/Core, Supportability, Team:Core/Infra

Comments

romain-chanu commented Aug 11, 2022

Elasticsearch Version

8.3.3

Installed Plugins

No response

Java Version

bundled

OS Version

Deployment in ESS

Problem Description

.fleet-actions-results data stream cannot be restored via the fleet feature state.

Consider the following scenario (observed in the field in ESS):

  1. Due to an unforeseen situation, the cluster becomes red with the following red indices:
health status index                                        uuid                   pri rep docs.count docs.deleted store.size pri.store.size sth
green  open   .ds-.fleet-actions-results-2022.05.04-000002 eZO3mXu3RYOZpygHvC2dgQ   1   1          0            0       450b           225b false
red    open   .ds-.fleet-actions-results-2022.06.03-000003 iBbSWmHaQbqJFn_aBVqaYg   1   1                                                   false
red    open   .ds-.fleet-actions-results-2022.07.03-000004 sF3-S4uoQkybpm7ujaZBVg   1   1                                                   false
red    open   .ds-.fleet-actions-results-2022.08.02-000006 t-U-Wrd_RpqZUqSS2a3TqA   1   1                                                   false
red    open   .fleet-actions-7                             8zgOKVzdQIeS_YGq_JX--w   1   1                                                   false
red    open   .fleet-agents-7                              p7sWhvhPRaWQ_unOHIJQTQ   1   1                                                   false
red    open   .fleet-artifacts-7                           iingfeghRJ2bfqLAGFt0Aw   1   1                                                   false
red    open   .fleet-enrollment-api-keys-7                 8J1tyEuJSfyhMxf5HsfU2A   1   1                                                   false
red    open   .fleet-policies-7                            HufDBhgBQraUYlNosY1ysg   1   1                                                   false
red    open   .fleet-policies-leader-7                     jpqhCaF9SL-S0AjlWqa6xg   1   1                                                   false
red    open   .fleet-servers-7                             5xdgNy-kSXSdsWZbM8mRHw   1   1                                                   false
  2. The user attempts to restore the fleet feature state using the following restore snapshot API request:
POST _snapshot/found-snapshots/cloud-snapshot-2022.08.08-lywsv4teqe-zj3ygvjkria/_restore?wait_for_completion=false
{
  "indices": "-*",
  "ignore_unavailable": "true",
  "include_global_state": "false",
  "include_aliases": "false",
  "feature_states": [
   "fleet"
  ]
}
  3. The above request fails with the following error:
{
  "error": {
    "root_cause": [
      {
        "type": "snapshot_restore_exception",
        "reason": "[found-snapshots:cloud-snapshot-2022.08.08-lywsv4teqe-zj3ygvjkria/H3i28HlrSiKyrLaiDCE6uA] cannot restore index [.ds-.fleet-actions-results-2022.06.03-000003] because an open index with same name already exists in the cluster. Either close or delete the existing index or restore the index under a different name by providing a rename pattern and replacement name"
      }
    ],
    "type": "snapshot_restore_exception",
    "reason": "[found-snapshots:cloud-snapshot-2022.08.08-lywsv4teqe-zj3ygvjkria/H3i28HlrSiKyrLaiDCE6uA] cannot restore index [.ds-.fleet-actions-results-2022.06.03-000003] because an open index with same name already exists in the cluster. Either close or delete the existing index or restore the index under a different name by providing a rename pattern and replacement name"
  },
  "status": 500
}
  4. Checking the fleet feature state, it seems that the SystemIndexDescriptor (cf. the code) does contain the .fleet-actions-results-* pattern. A couple of guesses about the reported problem:
  • The implementation only considers regular indices and not data streams?
  • The implementation considers the data stream but fails to close the backing indices before restoring them?
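For anyone triaging a similar failure, here is a minimal Python sketch that pulls the conflicting index name out of the error reason above and checks whether it looks like a data stream backing index. It assumes the current backing index naming scheme (.ds-<data-stream>-<yyyy.MM.dd>-<generation>); the helper names are hypothetical and nothing here is taken from the Elasticsearch code base.

```python
import re

# Matches the index name quoted in a snapshot_restore_exception reason.
CONFLICT_RE = re.compile(r"cannot restore index \[([^\]]+)\]")

# Assumed backing index naming: .ds-<data-stream>-<yyyy.MM.dd>-<generation>
BACKING_INDEX_RE = re.compile(
    r"^\.ds-(?P<stream>.+)-(?P<date>\d{4}\.\d{2}\.\d{2})-(?P<generation>\d{6})$"
)


def conflicting_index(reason):
    """Return the index name from a restore error reason, or None if absent."""
    match = CONFLICT_RE.search(reason)
    return match.group(1) if match else None


def parse_backing_index(index_name):
    """Return (data_stream, date, generation), or None for a regular index."""
    match = BACKING_INDEX_RE.match(index_name)
    if match is None:
        return None
    return match.group("stream"), match.group("date"), int(match.group("generation"))


# Reason string copied (shortened) from the error in this report.
reason = (
    "[found-snapshots:cloud-snapshot-2022.08.08-lywsv4teqe-zj3ygvjkria/H3i28HlrSiKyrLaiDCE6uA] "
    "cannot restore index [.ds-.fleet-actions-results-2022.06.03-000003] "
    "because an open index with same name already exists in the cluster."
)
index = conflicting_index(reason)
print(index, parse_backing_index(index))
```

Run against the error in this report, the conflicting name parses as a backing index of the .fleet-actions-results data stream, whereas a name such as .fleet-agents-7 does not parse as a backing index at all. That fits either guess above: the conflict comes from a data stream, not from a regular system index.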

Steps to Reproduce

  1. Create a cluster running version 8.3.3 and deploy an Elastic Agent with the Osquery Manager integration.
  2. Run a new live Osquery.
  3. Observe that the .fleet-actions-results data stream is created with the respective backing indices.
  4. Restore the fleet feature state using the restore snapshot API and observe the same error as above.

Workaround

  1. Create the fleet_superuser role:
POST _security/role/fleet_superuser
{
  "indices": [
    {
      "names": [
        ".fleet*"
      ],
      "privileges": [
        "all"
      ],
      "allow_restricted_indices": true
    }
  ]
}
  2. Create the temp_user user with the superuser and fleet_superuser roles:
POST _security/user/temp_user
{
  "password": "temp_password",
  "roles": [
    "superuser",
    "fleet_superuser"
  ]
}
  3. Close the .fleet-actions-results backing indices using the following cURL command:
curl -k -XPOST --user temp_user:temp_password -H 'x-elastic-product-origin:fleet' https://$CLUSTER_ADDRESS/.ds-.fleet-actions-results-2022.05.04-000002,.ds-.fleet-actions-results-2022.06.03-000003,.ds-.fleet-actions-results-2022.07.03-000004,.ds-.fleet-actions-results-2022.08.02-000006/_close

Note: when running the cURL command on Windows, wrap the header in double quotes instead of single quotes: "x-elastic-product-origin:fleet"

  4. Restore the fleet feature state:
POST _snapshot/found-snapshots/cloud-snapshot-2022.08.08-lywsv4teqe-zj3ygvjkria/_restore?wait_for_completion=false
{
  "indices": "-*",
  "ignore_unavailable": "true",
  "include_global_state": "false",
  "include_aliases": "false",
  "feature_states": [
    "fleet"
  ]
}
  5. Delete the temp_user user:
DELETE _security/user/temp_user
  6. Delete the fleet_superuser role:
DELETE _security/role/fleet_superuser
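
The workaround is order-sensitive (the temporary role and user must exist before the close, and be removed after the restore), so here is a rough Python sketch that only assembles the six requests in sequence. It sends nothing: authentication and the HTTP client are left out, and the workaround_plan function and its dictionary layout are hypothetical conveniences, not part of any Elastic tooling.

```python
# Header the workaround sends when closing the backing indices.
FLEET_ORIGIN_HEADER = {"x-elastic-product-origin": "fleet"}


def workaround_plan(repository, snapshot, backing_indices, password):
    """Return the workaround steps as an ordered list of request descriptions."""
    role_body = {
        "indices": [
            {
                "names": [".fleet*"],
                "privileges": ["all"],
                "allow_restricted_indices": True,
            }
        ]
    }
    user_body = {"password": password, "roles": ["superuser", "fleet_superuser"]}
    # JSON booleans are used here; the request in the issue passes them as strings.
    restore_body = {
        "indices": "-*",
        "ignore_unavailable": True,
        "include_global_state": False,
        "include_aliases": False,
        "feature_states": ["fleet"],
    }
    # All backing indices are closed in one call, comma-separated in the path.
    close_path = "/" + ",".join(backing_indices) + "/_close"
    restore_path = f"/_snapshot/{repository}/{snapshot}/_restore?wait_for_completion=false"

    def step(method, path, body=None, headers=None):
        return {"method": method, "path": path, "body": body, "headers": headers or {}}

    return [
        step("POST", "/_security/role/fleet_superuser", role_body),
        step("POST", "/_security/user/temp_user", user_body),
        step("POST", close_path, headers=FLEET_ORIGIN_HEADER),
        step("POST", restore_path, restore_body),
        step("DELETE", "/_security/user/temp_user"),
        step("DELETE", "/_security/role/fleet_superuser"),
    ]


plan = workaround_plan(
    repository="found-snapshots",
    snapshot="cloud-snapshot-2022.08.08-lywsv4teqe-zj3ygvjkria",
    backing_indices=[
        ".ds-.fleet-actions-results-2022.05.04-000002",
        ".ds-.fleet-actions-results-2022.06.03-000003",
        ".ds-.fleet-actions-results-2022.07.03-000004",
        ".ds-.fleet-actions-results-2022.08.02-000006",
    ],
    password="temp_password",
)
for number, request in enumerate(plan, start=1):
    print(number, request["method"], request["path"])
```

Keeping the steps as plain data makes it easy to review the sequence before running it, and to guarantee the cleanup steps are always the last two.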

Logs (if relevant)

No response

@romain-chanu added the >bug, needs:triage, and :Distributed Coordination/Snapshot/Restore labels on Aug 11, 2022
@elasticsearchmachine
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

@elasticsearchmachine added the Team:Distributed (Obsolete) label and removed the needs:triage label on Aug 12, 2022
@romain-chanu added the needs:triage label on Aug 12, 2022
@elasticsearchmachine removed the needs:triage label on Aug 12, 2022
@arteam added the :Core/Infra/Core label and removed the :Distributed Coordination/Snapshot/Restore and Team:Distributed (Obsolete) labels on Aug 17, 2022
@elasticsearchmachine added the Team:Core/Infra label on Aug 17, 2022
@arteam added the :Distributed Coordination/Snapshot/Restore label and removed the :Core/Infra/Core label on Aug 17, 2022
@elastic deleted a comment from elasticsearchmachine on Aug 17, 2022
@elasticsearchmachine added the Team:Distributed (Obsolete) label and removed the Team:Core/Infra label on Aug 17, 2022
@williamrandolph
Contributor

When restoring system indices (not data streams) from a snapshot, the user isn't able to close or delete the system index, so we delete the existing system indices as we restore. It looks like we need to do the same thing for system data streams (or, if it's something we're already supposed to do, hunt for a bug or race condition that could be causing the problem). Since core/infra added this logic as part of the system indices project, it's fine with me if this issue is assigned to us.
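
To make the gap described here concrete, the following is a toy Python model, not Elasticsearch's actual restore code. It assumes the cluster state can be reduced to a set of open index names plus a map from data stream name to backing indices, and every name in it (ClusterModel, indices_to_delete, restore_conflicts, the handle_data_streams flag) is hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class ClusterModel:
    """Toy cluster state: open index names and data stream -> backing indices."""
    open_indices: Set[str] = field(default_factory=set)
    data_streams: Dict[str, List[str]] = field(default_factory=dict)


def indices_to_delete(cluster, system_indices, system_data_streams, handle_data_streams):
    """Existing indices to delete before restoring a feature state."""
    # Behaviour described for system indices: delete the existing copy on restore.
    doomed = {name for name in system_indices if name in cluster.open_indices}
    if handle_data_streams:
        # Proposed equivalent: also drop every backing index of a restored stream.
        for stream in system_data_streams:
            doomed.update(cluster.data_streams.get(stream, []))
    return doomed


def restore_conflicts(cluster, snapshot_indices, doomed):
    """Snapshot indices that would still collide with an open index."""
    remaining = cluster.open_indices - doomed
    return sorted(name for name in snapshot_indices if name in remaining)


backing = ".ds-.fleet-actions-results-2022.06.03-000003"
cluster = ClusterModel(
    open_indices={".fleet-agents-7", backing},
    data_streams={".fleet-actions-results": [backing]},
)
snapshot_indices = [".fleet-agents-7", backing]

for handle in (False, True):
    doomed = indices_to_delete(
        cluster, [".fleet-agents-7"], [".fleet-actions-results"], handle
    )
    print(handle, restore_conflicts(cluster, snapshot_indices, doomed))
```

In this model, handling only system indices leaves the backing index as a conflict, which mirrors the reported error; also handling system data streams leaves none. Whether the real fix belongs in that step or in a bug elsewhere is exactly the open question in this comment.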

@williamrandolph self-assigned this on Aug 24, 2022
@romain-chanu
Author

@williamrandolph -

When restoring system indices (not data streams) from a snapshot, the user isn't able to close or delete the system index, so we delete the existing system indices as we restore. It looks like we need to do the same thing for system data streams (or, if it's something we're already supposed to do, hunt for a bug or race condition that could be causing the problem). Since core/infra added this logic as part of the system indices project, it's fine with me if this issue is assigned to us.

I believe pull request #75860 is related. We probably missed something in the restore logic.

@Leaf-Lin added the :Core/Infra/Core and Team:Core/Infra labels and removed the :Distributed Coordination/Snapshot/Restore and Team:Distributed (Obsolete) labels on Nov 7, 2022
@elasticsearchmachine
Collaborator

Pinging @elastic/es-core-infra (Team:Core/Infra)

@Leaf-Lin
Contributor

Leaf-Lin commented Nov 7, 2022

Based on the comment above, I've relabeled this to Core/Infra:

Since core/infra added this logic as part of the system indices project, it's fine with me if this issue is assigned to us.
