
All storages are offline after restarting Nebula services #5398

Closed
mxsavchenko opened this issue Mar 13, 2023 · 15 comments
Labels
affects/none PR/issue: this bug affects none version. process/fixed Process of bug severity/none Severity of bug type/bug Type: something is unexpected

Comments

@mxsavchenko

  • Installation: Docker
  • OS: AlmaLinux 8.5
  • CPU: Intel xeon 4116
  • Commit id (db3c1b3)
  • Database size: ~400Gb
  • Settings: default

Hi, I have a Nebula cluster on 3 nodes (graph/meta/storage), originally installed as v3.2.1.
A few days ago I wanted to upgrade to version 3.4.0. I stopped all services (graph/meta/storage) on all nodes, updated the Docker image to 3.4.0, and started the services again, but the storage does not reach the ONLINE state after loading parts; switching back to version 3.2.1 gives the same problem. In the storage logs, the leader is constantly being re-elected, and each node seems to take the leader role at random; the console keeps flipping the storages between OFFLINE and ONLINE, and once all 3 storages have loaded their parts, they go OFFLINE.

show hosts graph;
+-----------+------+----------+---------+--------------+---------+
| Host      | Port | Status   | Role    | Git Info Sha | Version |
+-----------+------+----------+---------+--------------+---------+
| "graphd0" | 9669 | "ONLINE" | "GRAPH" | "db3c1b3"    | "3.4.0" |
| "graphd1" | 9669 | "ONLINE" | "GRAPH" | "db3c1b3"    | "3.4.0" |
| "graphd2" | 9669 | "ONLINE" | "GRAPH" | "db3c1b3"    | "3.4.0" |
+-----------+------+----------+---------+--------------+---------+

#####################

show hosts meta;
+----------+------+----------+--------+--------------+---------+
| Host     | Port | Status   | Role   | Git Info Sha | Version |
+----------+------+----------+--------+--------------+---------+
| "metad2" | 9559 | "ONLINE" | "META" | "db3c1b3"    | "3.4.0" |
| "metad0" | 9559 | "ONLINE" | "META" | "db3c1b3"    | "3.4.0" |
| "metad1" | 9559 | "ONLINE" | "META" | "db3c1b3"    | "3.4.0" |
+----------+------+----------+--------+--------------+---------+

#####################

show hosts storage;
+-------------+------+-----------+-----------+--------------+---------+
| Host        | Port | Status    | Role      | Git Info Sha | Version |
+-------------+------+-----------+-----------+--------------+---------+
| "storaged0" | 9779 | "OFFLINE" | "STORAGE" | "db3c1b3"    | "3.4.0" |
| "storaged1" | 9779 | "OFFLINE" | "STORAGE" | "db3c1b3"    | "3.4.0" |
| "storaged2" | 9779 | "OFFLINE" | "STORAGE" | "db3c1b3"    | "3.4.0" |
+-------------+------+-----------+-----------+--------------+---------+

Logs from storaged0/storaged1 in a zip archive:
logs.zip

@mxsavchenko mxsavchenko added the type/bug Type: something is unexpected label Mar 13, 2023
@github-actions github-actions bot added affects/none PR/issue: this bug affects none version. severity/none Severity of bug labels Mar 13, 2023
@mxsavchenko
Author

Errors in storage after increasing the log level:

I20230314 09:48:25.417989 43 RaftPart.cpp:1256] [Port: 9780, Space: 69, Part: 10] Receive response about askForVote from "storaged2":9780, error code is E_RAFT_UNKNOWN_PART, isPreVote = 1
I20230314 09:48:25.418040 43 RaftPart.cpp:1283] [Port: 9780, Space: 69, Part: 10] Did not get enough votes from election of term 11, isPreVote = 1
I20230314 09:48:26.581475 73 RaftPart.cpp:1289] [Port: 9780, Space: 64, Part: 13] Start leader election...
I20230314 09:48:26.582363 73 RaftPart.cpp:1317] [Port: 9780, Space: 64, Part: 13] Sending out an election request (space = 64, part = 13, term = 14, lastLogId = 304921783, lastLogTerm = 13, candidateIP = storaged0, candidatePort = 9780), isPreVote = 1
I20230314 09:48:26.582995 43 RaftPart.cpp:1256] [Port: 9780, Space: 64, Part: 13] Receive response about askForVote from "storaged2":9780, error code is E_RAFT_UNKNOWN_PART, isPreVote = 1
I20230314 09:48:26.583040 43 RaftPart.cpp:1283] [Port: 9780, Space: 64, Part: 13] Did not get enough votes from election of term 14, isPreVote = 1
I20230314 09:48:28.321467 73 RaftPart.cpp:1289] [Port: 9780, Space: 64, Part: 13] Start leader election...
I20230314 09:48:28.321529 73 RaftPart.cpp:1317] [Port: 9780, Space: 64, Part: 13] Sending out an election request (space = 64, part = 13, term = 14, lastLogId = 304921783, lastLogTerm = 13, candidateIP = storaged0, candidatePort = 9780), isPreVote = 1
I20230314 09:48:28.322351 43 RaftPart.cpp:1256] [Port: 9780, Space: 64, Part: 13] Receive response about askForVote from "storaged2":9780, error code is E_RAFT_UNKNOWN_PART, isPreVote = 1
I20230314 09:48:28.322397 43 RaftPart.cpp:1283] [Port: 9780, Space: 64, Part: 13] Did not get enough votes from election of term 14, isPreVote = 1

@mxsavchenko
Author

And another question: is there any way to speed up the loading of parts after restarting the Nebula storage? Maybe some configuration parameter controls this? Currently it takes me about 3 hours to load the parts.

@wenhaocs
Contributor

E_RAFT_UNKNOWN_PART typically indicates that the part is not found on your storaged2. Let me check why this happened. BTW, how many parts do you have?
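For reference, the partition count and distribution per space can be checked from the console; a minimal sketch, where `my_space` is a placeholder name:

```ngql
// Switch to the space whose partitions you want to inspect
USE my_space;
// Lists each partition with its leader and peer hosts
SHOW PARTS;
// The partition count is also visible in the space definition
SHOW CREATE SPACE my_space;
```

If a host returns E_RAFT_UNKNOWN_PART for a partition that SHOW PARTS says it should hold, that replica has likely not (yet) opened the part locally.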

@pengweisong
Contributor

pengweisong commented Mar 15, 2023

> And another question: is there any way to speed up the loading of parts after restarting the Nebula storage? Maybe some configuration parameter controls this? Currently it takes me about 3 hours to load the parts.

How many replicas did you set for each part? From the log, it looks like 2 instead of 3.
Which disk type did you use, HDD or SSD? An HDD may be slow to start, especially when it runs into RocksDB compaction.

@wey-gu
Contributor

wey-gu commented Mar 15, 2023

> How many replicas did you set for each part? From the log, it looks like 2 instead of 3.

We should consider rejecting even numbers for the replication factor:

#5380

@mxsavchenko
Author

mxsavchenko commented Mar 15, 2023

> And another question: is there any way to speed up the loading of parts after restarting the Nebula storage? Maybe some configuration parameter controls this? Currently it takes me about 3 hours to load the parts.

> How many replicas did you set for each part? From the log, it looks like 2 instead of 3. Which disk type did you use, HDD or SSD? An HDD may be slow to start, especially when it runs into RocksDB compaction.

Every space has 16 partitions and a replication factor of 2; the disks are SSD.

@wey-gu
Contributor

wey-gu commented Mar 15, 2023

We should not configure the replication factor as an even number; maybe we should have banned such a configuration when creating spaces.

Could you wipe the cluster and recreate the spaces with a replication factor of 1 (non-HA) or 3 (HA)?
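As a sketch of the recreation step (the space name and vid_type are placeholders; match them to your own schema):

```ngql
// An odd replica_factor lets Raft reach a majority quorum (2 of 3)
CREATE SPACE my_space (partition_num = 16, replica_factor = 3, vid_type = FIXED_STRING(32));
```

With replica_factor 2, a partition needs both copies to form a majority, so losing either replica (or any network split) leaves it unable to elect a leader.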

@mxsavchenko
Author

> We should not configure the replication factor as an even number; maybe we should have banned such a configuration when creating spaces.

> Could you wipe the cluster and recreate the spaces with a replication factor of 1 (non-HA) or 3 (HA)?

I can wipe the cluster, but I have no backups. Is there any other way to recover my data?

@wey-gu
Contributor

wey-gu commented Mar 15, 2023

@wenhaocs @pengweisong I think copying data from some of the storaged instances to the others would do the job, right?

@pengweisong
Contributor

pengweisong commented Mar 17, 2023

Have you executed the balance data command?

@kikimo
Contributor

kikimo commented Mar 17, 2023

Is the network stable? And what about the I/O and CPU load on the storage servers?

@mxsavchenko
Author

> Have you executed the balance data command?

No, but all storages are OFFLINE; will that help?
Should I try BALANCE LEADER?
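For reference, in NebulaGraph 3.x leader balancing is submitted as a job; a sketch, noting that it only takes effect while the storaged hosts are ONLINE, so it would not help in the OFFLINE state described here:

```ngql
// Rebalances Raft leaders evenly across ONLINE storage hosts
SUBMIT JOB BALANCE LEADER;
// Check the job's progress (the job id is printed by the submit command)
SHOW JOBS;
```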

@mxsavchenko
Author

> Is the network stable? And what about the I/O and CPU load on the storage servers?

Yes, the network is stable, and the other resources are fine as well.

@pengweisong
Contributor

pengweisong commented Mar 17, 2023

> No, but all storages are OFFLINE; will that help? Should I try BALANCE LEADER?

No, do not execute any balance data command; it would be a disaster when you only have 2 copies.

@QingZ11
Contributor

QingZ11 commented May 5, 2023

@mxsavchenko Hi, I have noticed that the issue you created hasn’t been updated for nearly a month, so I have to close it for now. If you have any new updates, you are welcome to reopen this issue anytime.

Thanks a lot for your contribution anyway 😊

6 participants