Please check the FAQ documentation before raising an issue
Describe the bug (required)
Extra parts found after performing a balance operation. We have a space with 512 parts * 3 replicas = 1536 parts. After some balance operations we found that the 1536 parts grew to 1537 and then 1538 parts (i.e. one and then two partitions ended up with an extra replica), and we also found that two storage hosts went offline later.
We found that the two offline storaged processes were still running with very high CPU. store1 CPU usage:
top - 11:31:08 up 8 days, 23:16, 1 user, load average: 112.18, 108.17, 107.48
Tasks: 8 total, 2 running, 6 sleeping, 0 stopped, 0 zombie
%Cpu(s): 58.9 us, 15.0 sy, 0.0 ni, 23.9 id, 0.0 wa, 0.0 hi, 2.2 si, 0.0 st
MiB Mem : 257583.7 total, 88205.3 free, 76023.6 used, 93354.8 buff/cache
MiB Swap: 0.0 total, 0.0 free, 0.0 used. 179741.8 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
14 root 20 0 9514504 109448 36152 R 3801 0.0 48245:06 nebula-storaged
1 root 20 0 1104 4 0 S 0.0 0.0 0:01.07 docker-init
7 root 20 0 3976 2960 2716 S 0.0 0.0 0:00.00 run-storage.sh
12 root 20 0 12176 4328 3408 S 0.0 0.0 0:00.00 sshd
15 root 20 0 2508 520 460 S 0.0 0.0 0:00.00 sleep
5602 root 20 0 13580 8868 7452 S 0.0 0.0 0:00.01 sshd
5613 root 20 0 4240 3444 2976 S 0.0 0.0 0:00.00 bash
5618 root 20 0 6180 3296 2752 R 0.0 0.0 0:00.00 top
store2 CPU usage:
top - 11:33:10 up 8 days, 23:18, 1 user, load average: 108.27, 107.95, 107.48
Tasks: 8 total, 2 running, 6 sleeping, 0 stopped, 0 zombie
%Cpu(s): 61.4 us, 16.6 sy, 0.0 ni, 19.5 id, 0.0 wa, 0.0 hi, 2.5 si, 0.0 st
MiB Mem : 257583.7 total, 88130.5 free, 76069.1 used, 93384.1 buff/cache
MiB Swap: 0.0 total, 0.0 free, 0.0 used. 179696.1 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
14 root 20 0 9795.5m 596444 36496 R 3586 0.2 39474:19 nebula-storaged
1 root 20 0 1104 4 0 S 0.0 0.0 0:01.08 docker-init
7 root 20 0 3976 2996 2748 S 0.0 0.0 0:00.00 run-storage.sh
12 root 20 0 12176 4184 3264 S 0.0 0.0 0:00.00 sshd
15 root 20 0 2508 588 524 S 0.0 0.0 0:00.00 sleep
6604 root 20 0 13584 8892 7472 S 0.0 0.0 0:00.02 sshd
6615 root 20 0 4240 3412 2952 S 0.0 0.0 0:00.00 bash
6620 root 20 0 6180 3280 2732 R 0.0 0.0 0:00.00 top
pstack shows that many threads inside the storaged process are trying to acquire an RWSpinLock:
root@store4:~# cat new.txt | grep -i spinlock
#3 0x000000000353ae1b in folly::RWSpinLock::lock_shared()+74 in /root/src/new-balance-nebula/build/bin/nebula-storaged at RWSpinLock.h:212
#4 0x000000000353b017 in ReadHolder()+44 in /root/src/new-balance-nebula/build/bin/nebula-storaged at RWSpinLock.h:320
#0 inlined in folly::RWSpinLock::fetch_add at atomic_base.h:541
#0 0x000000000353af62 in folly::RWSpinLock::try_lock_shared()+48 in /root/src/new-balance-nebula/build/bin/nebula-storaged at RWSpinLock.h:278
#1 0x000000000353adf3 in folly::RWSpinLock::lock_shared()+34 in /root/src/new-balance-nebula/build/bin/nebula-storaged at RWSpinLock.h:210
#2 0x000000000353b017 in ReadHolder()+44 in /root/src/new-balance-nebula/build/bin/nebula-storaged at RWSpinLock.h:320
#0 inlined in folly::RWSpinLock::fetch_add at atomic_base.h:541
#0 0x000000000353af62 in folly::RWSpinLock::try_lock_shared()+48 in /root/src/new-balance-nebula/build/bin/nebula-storaged at RWSpinLock.h:278
#1 0x000000000353adf3 in folly::RWSpinLock::lock_shared()+34 in /root/src/new-balance-nebula/build/bin/nebula-storaged at RWSpinLock.h:210
#2 0x000000000353b017 in ReadHolder()+44 in /root/src/new-balance-nebula/build/bin/nebula-storaged at RWSpinLock.h:320
#0 inlined in folly::RWSpinLock::fetch_add at atomic_base.h:541
#0 0x000000000353af62 in folly::RWSpinLock::try_lock_shared()+48 in /root/src/new-balance-nebula/build/bin/nebula-storaged at RWSpinLock.h:278
#1 0x000000000353adf3 in folly::RWSpinLock::lock_shared()+34 in /root/src/new-balance-nebula/build/bin/nebula-storaged at RWSpinLock.h:210
#2 0x000000000353b017 in ReadHolder()+44 in /root/src/new-balance-nebula/build/bin/nebula-storaged at RWSpinLock.h:320
#0 0x000000000353af9d in folly::RWSpinLock::try_lock_shared()+107 in /root/src/new-balance-nebula/build/bin/nebula-storaged at RWSpinLock.h:281
#1 0x000000000353adf3 in folly::RWSpinLock::lock_shared()+34 in /root/src/new-balance-nebula/build/bin/nebula-storaged at RWSpinLock.h:210
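The frames at RWSpinLock.h:210-281 belong to the reader path of folly::RWSpinLock, which is a busy-spin lock: lock_shared() simply retries try_lock_shared() in a loop, and every failed attempt still performs two atomic read-modify-writes (the fetch_add inlined at atomic_base.h:541 above). The sketch below is a simplified, self-contained approximation of that reader path (illustrative only, not the exact folly source); it shows why many reader threads contending with a writer that never releases the lock can keep the CPUs saturated as in the top output above.

#include <atomic>
#include <cstdint>
#include <thread>

// Simplified approximation of the folly::RWSpinLock reader path
// (see folly's RWSpinLock.h); illustrative, not the upstream code.
class RWSpinLockSketch {
  enum : int32_t { READER = 4, UPGRADED = 2, WRITER = 1 };
  std::atomic<int32_t> bits_{0};

 public:
  bool try_lock_shared() {
    // Optimistically bump the reader count (the fetch_add seen inlined
    // at atomic_base.h:541 in the stack above) ...
    int32_t value = bits_.fetch_add(READER, std::memory_order_acquire);
    if (value & (WRITER | UPGRADED)) {
      // ... and roll it back if a writer holds or is acquiring the lock.
      bits_.fetch_add(-READER, std::memory_order_release);
      return false;
    }
    return true;
  }

  void lock_shared() {
    // Busy-spin until the optimistic acquire succeeds; every failed round
    // still costs two atomic RMWs, so blocked readers burn CPU instead of
    // sleeping.
    uint32_t count = 0;
    while (!try_lock_shared()) {
      if (++count > 1000) {
        std::this_thread::yield();
      }
    }
  }

  void unlock_shared() { bits_.fetch_add(-READER, std::memory_order_release); }
};

int main() {
  RWSpinLockSketch lock;
  lock.lock_shared();    // uncontended readers pass straight through,
  lock.unlock_shared();  // but if WRITER/UPGRADED stays set, lock_shared() spins forever
  return 0;
}

If something holds (or never clears) the writer/upgraded bits of such a lock, every reader thread ends up in this loop, which would match both the ~3600-3800% CPU above and the fact that almost every sampled frame is in lock_shared()/try_lock_shared().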
We failed to re-enable the storage log (logging had been disabled when storaged was started); stderr.txt shows:
Could not create logging file: Too many open files
COULD NOT CREATE A LOGGINGFILE 20220107-045351.14!Could not create logging file: Too many open files
COULD NOT CREATE A LOGGINGFILE 20220107-045351.14!Could not create logging file: Too many open files
COULD NOT CREATE A LOGGINGFILE 20220107-045351.14!Could not create logging file: Too many open files
COULD NOT CREATE A LOGGINGFILE 20220107-045423.14!Could not create logging file: Too many open files
COULD NOT CREATE A LOGGINGFILE 20220107-045423.14!Could not create logging file: Too many open files
COULD NOT CREATE A LOGGINGFILE 20220107-045423.14!Could not create logging file: Too many open files
COULD NOT CREATE A LOGGINGFILE 20220107-045455.14!Could not create logging file: Too many open files
COULD NOT CREATE A LOGGINGFILE 20220107-045455.14!Could not create logging file: Too many open files
COULD NOT CREATE A LOGGINGFILE 20220107-045455.14!Could not create logging file: Too many open files
COULD NOT CREATE A LOGGINGFILE 20220107-045528.14!Could not create logging file: Too many open files
COULD NOT CREATE A LOGGINGFILE 20220107-045528.14!Could not create logging file: Too many open files
COULD NOT CREATE A LOGGINGFILE 20220107-045528.14!Could not create logging file: Too many open files
COULD NOT CREATE A LOGGINGFILE 20220107-045600.14!Could not create logging file: Too many open files
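The "Too many open files" messages mean glog cannot open a new log file because the process has exhausted its file-descriptor limit, which is also why re-enabling the storage log failed. As a quick check (a minimal standalone sketch, not part of nebula), the limit the storaged process actually sees can be printed with getrlimit and compared with the number of entries under /proc/<pid>/fd of the storaged process:

#include <sys/resource.h>
#include <cstdio>

// Print the soft/hard open-file limits (RLIMIT_NOFILE) of the current process.
// Run it inside the same container as nebula-storaged so it inherits the same
// limits (assumes Linux; this helper is not part of the nebula code base).
int main() {
  struct rlimit rl;
  if (getrlimit(RLIMIT_NOFILE, &rl) != 0) {
    perror("getrlimit");
    return 1;
  }
  std::printf("RLIMIT_NOFILE soft=%llu hard=%llu\n",
              static_cast<unsigned long long>(rl.rlim_cur),
              static_cast<unsigned long long>(rl.rlim_max));
  return 0;
}

If the soft limit turns out to be low for this workload, raising it (for example with ulimit -n before starting storaged, or in the container configuration) should at least let the log file be created again so the underlying problem can be inspected.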
The storaged process also failed to exit after receiving SIGTERM. Stack dump before issuing SIGTERM:
before.txt
Stack dump after issuing SIGTERM:
after.txt
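One plausible way the spinlock contention and the failure to exit are connected (purely illustrative; this is not nebula's actual shutdown code): a thread that is busy-spinning inside lock_shared() never gets back to a point where it would notice a cooperative stop request, so a SIGTERM handler that only sets a flag cannot bring the process down. The toy program below uses hypothetical names (stopping, lockBits) and deliberately never terminates, mimicking the hang observed here:

#include <atomic>
#include <chrono>
#include <csignal>
#include <cstdint>
#include <cstdio>
#include <thread>

std::atomic<bool> stopping{false};    // flag a SIGTERM handler would set
std::atomic<int32_t> lockBits{1};     // pretend a writer holds the lock forever

void worker() {
  while (!stopping.load()) {          // the shutdown check ...
    while (lockBits.load() & 1) {     // ... is never reached again once the
      std::this_thread::yield();      // thread enters this inner spin loop
    }
  }
}

void onTerm(int) { stopping.store(true); }  // SIGTERM only flips the flag

int main() {
  std::signal(SIGTERM, onTerm);
  std::thread t(worker);
  std::this_thread::sleep_for(std::chrono::seconds(1));
  std::raise(SIGTERM);                // requests shutdown ...
  t.join();                           // ... but never returns: the process
  std::printf("clean exit\n");        // "fails to exit", as seen above
  return 0;
}

Whether that is what happens inside storaged would have to be confirmed from before.txt/after.txt; the sketch only illustrates why a spinning thread and an unanswered SIGTERM tend to go together.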
Your Environments (required)
uname -a
g++ --version or clang++ --version
lscpu
How To Reproduce (required)
Steps to reproduce the behavior:
We find that the storage cluster goes insane during the last step.
Expected behavior
Additional context