-
Hello, I have a distributed deployment with multiple sensors, a manager, and a search node, using many Elastic Agent integrations. The manager has a 200 GB disk, and the search node has multiple TBs. Overnight the manager's /nsm partition hit 100% usage and was completely full. This shut down Elasticsearch via the disk watermark; my errors were identical to this: https://discuss.elastic.co/t/how-to-solve-we-couldnt-log-you-in-please-try-again-error-in-kibana/332658/4

I see in /nsm/elasticsearch/indices/ that full indices are being stored on the manager, but my understanding was that only the search nodes should be storing them. How does 2.4 handle deleting logs based on directory size? In 2.3 this was handled by salt/curator/action/delete.yml, but looking at that file in 2.4 it seems entirely broken: log_size_limit isn't being set properly, and the search nodes don't have Curator installed on them at all.

My manager is a VM and I had a snapshot, so I was able to roll back to before this error, though that means I've lost the SO logs from that period. I'm assuming this problem will arise again as the /nsm directory on the manager keeps growing. So, is there a way to make sure this directory doesn't reach 100% storage utilization, or a way to make the manager not store indices and just utilize the search nodes?
-
Can you run sudo so-elasticsearch-query _cat/shards and share the output?
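For reference, a quick way to tally where the started shards live from that command's output. The sample lines and node names below are hypothetical, just to make the pipeline runnable standalone; in practice you would pipe the real `sudo so-elasticsearch-query _cat/shards` output straight into the awk stage (column 4 is the shard state, column 8 is the node name):

```shell
# Hypothetical _cat/shards output; replace this heredoc with the real command.
cat <<'EOF' > /tmp/shards.txt
logs-zeek.conn-default 0 p STARTED 120000 5gb 10.0.0.2 so-searchnode
logs-zeek.conn-default 0 r STARTED 120000 5gb 10.0.0.1 so-manager
logs-zeek.dns-default  0 p STARTED  80000 2gb 10.0.0.2 so-searchnode
EOF
# Count STARTED shards per node; any shards listed under the manager
# confirm that data is being allocated there.
awk '$4 == "STARTED" {print $8}' /tmp/shards.txt | sort | uniq -c | sort -rn
```

If the manager shows up in that tally at all, shards are being allocated to it rather than staying on the search node.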
-
The search node is automatically relocating shards to the manager. I caught this happening in the act, but I'm unsure what's causing it.
A fix to stop the automatic relocation would be great, as it will eventually fill up the manager's /nsm disk, hit the Elasticsearch flood-stage watermark, and therefore stop ingesting logs, even though the search node still has plenty of space.
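For anyone hitting the flood stage in the meantime, the disk watermarks themselves are ordinary cluster settings, so they can be raised as a stopgap while the root cause is sorted out. A sketch (the percentages here are illustrative examples, not tuned recommendations):

```
PUT _cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.disk.watermark.low": "90%",
    "cluster.routing.allocation.disk.watermark.high": "95%",
    "cluster.routing.allocation.disk.watermark.flood_stage": "97%"
  }
}
```

This only buys headroom; it doesn't stop shards from landing on the manager in the first place.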
-
Another update: After running
-
Wanted to give this thread a bump... I'm having the same issue on a fresh SO 2.4.30 distributed install. As reported above, some Zeek data streams are being stored on the manager node, which led to /nsm reaching 100% and caused manager services to fail. Modifying the Elasticsearch watermarks (thanks @EddieN17) appears to be preventing further /nsm space issues. Is there any way to prevent Zeek data streams from being stored on the manager?
The following should keep indices and data streams from being allocated to your manager. Replace the IP with the IP address of your manager.
PUT _cluster/settings
{
  "transient": {
    "cluster.routing.allocation.exclude._ip": "10.0.0.1"
  }
}
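One caveat worth noting: transient cluster settings are cleared on a full cluster restart. If you want the exclusion to survive restarts, the persistent form of the same setting should work (same placeholder IP as above):

```
PUT _cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.exclude._ip": "10.0.0.1"
  }
}
```

You can confirm it took effect with GET _cluster/settings; existing shards on the excluded node should then drain back to the search node.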