Please check the FAQ documentation before raising an issue
Describe the bug (required)
Extra parts found after performing a balance operation. We have a space with 512 parts * 3 replicas = 1536 parts. After some balance operations we found that the 1536 parts grew to 1537 and then 1538 parts (i.e. one and then two partitions ended up with an extra replica), and we also found that two storage hosts went offline later.
We found that the two offline storaged processes were still running with very high CPU. store1 CPU usage:
top - 11:31:08 up 8 days, 23:16, 1 user, load average: 112.18, 108.17, 107.48
Tasks: 8 total, 2 running, 6 sleeping, 0 stopped, 0 zombie
%Cpu(s): 58.9 us, 15.0 sy, 0.0 ni, 23.9 id, 0.0 wa, 0.0 hi, 2.2 si, 0.0 st
MiB Mem : 257583.7 total, 88205.3 free, 76023.6 used, 93354.8 buff/cache
MiB Swap: 0.0 total, 0.0 free, 0.0 used. 179741.8 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
14 root 20 0 9514504 109448 36152 R 3801 0.0 48245:06 nebula-storaged
1 root 20 0 1104 4 0 S 0.0 0.0 0:01.07 docker-init
7 root 20 0 3976 2960 2716 S 0.0 0.0 0:00.00 run-storage.sh
12 root 20 0 12176 4328 3408 S 0.0 0.0 0:00.00 sshd
15 root 20 0 2508 520 460 S 0.0 0.0 0:00.00 sleep
5602 root 20 0 13580 8868 7452 S 0.0 0.0 0:00.01 sshd
5613 root 20 0 4240 3444 2976 S 0.0 0.0 0:00.00 bash
5618 root 20 0 6180 3296 2752 R 0.0 0.0 0:00.00 top
store2 CPU usage:
top - 11:33:10 up 8 days, 23:18, 1 user, load average: 108.27, 107.95, 107.48
Tasks: 8 total, 2 running, 6 sleeping, 0 stopped, 0 zombie
%Cpu(s): 61.4 us, 16.6 sy, 0.0 ni, 19.5 id, 0.0 wa, 0.0 hi, 2.5 si, 0.0 st
MiB Mem : 257583.7 total, 88130.5 free, 76069.1 used, 93384.1 buff/cache
MiB Swap: 0.0 total, 0.0 free, 0.0 used. 179696.1 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
14 root 20 0 9795.5m 596444 36496 R 3586 0.2 39474:19 nebula-storaged
1 root 20 0 1104 4 0 S 0.0 0.0 0:01.08 docker-init
7 root 20 0 3976 2996 2748 S 0.0 0.0 0:00.00 run-storage.sh
12 root 20 0 12176 4184 3264 S 0.0 0.0 0:00.00 sshd
15 root 20 0 2508 588 524 S 0.0 0.0 0:00.00 sleep
6604 root 20 0 13584 8892 7472 S 0.0 0.0 0:00.02 sshd
6615 root 20 0 4240 3412 2952 S 0.0 0.0 0:00.00 bash
6620 root 20 0 6180 3280 2732 R 0.0 0.0 0:00.00 top
pstack shows that many threads inside the storaged process are trying to acquire an RWSpinLock:
root@store4:~# cat new.txt | grep -i spinlock
#3 0x000000000353ae1b in folly::RWSpinLock::lock_shared()+74 in /root/src/new-balance-nebula/build/bin/nebula-storaged at RWSpinLock.h:212
#4 0x000000000353b017 in ReadHolder()+44 in /root/src/new-balance-nebula/build/bin/nebula-storaged at RWSpinLock.h:320
#0 inlined in folly::RWSpinLock::fetch_add at atomic_base.h:541
#0 0x000000000353af62 in folly::RWSpinLock::try_lock_shared()+48 in /root/src/new-balance-nebula/build/bin/nebula-storaged at RWSpinLock.h:278
#1 0x000000000353adf3 in folly::RWSpinLock::lock_shared()+34 in /root/src/new-balance-nebula/build/bin/nebula-storaged at RWSpinLock.h:210
#2 0x000000000353b017 in ReadHolder()+44 in /root/src/new-balance-nebula/build/bin/nebula-storaged at RWSpinLock.h:320
#0 inlined in folly::RWSpinLock::fetch_add at atomic_base.h:541
#0 0x000000000353af62 in folly::RWSpinLock::try_lock_shared()+48 in /root/src/new-balance-nebula/build/bin/nebula-storaged at RWSpinLock.h:278
#1 0x000000000353adf3 in folly::RWSpinLock::lock_shared()+34 in /root/src/new-balance-nebula/build/bin/nebula-storaged at RWSpinLock.h:210
#2 0x000000000353b017 in ReadHolder()+44 in /root/src/new-balance-nebula/build/bin/nebula-storaged at RWSpinLock.h:320
#0 inlined in folly::RWSpinLock::fetch_add at atomic_base.h:541
#0 0x000000000353af62 in folly::RWSpinLock::try_lock_shared()+48 in /root/src/new-balance-nebula/build/bin/nebula-storaged at RWSpinLock.h:278
#1 0x000000000353adf3 in folly::RWSpinLock::lock_shared()+34 in /root/src/new-balance-nebula/build/bin/nebula-storaged at RWSpinLock.h:210
#2 0x000000000353b017 in ReadHolder()+44 in /root/src/new-balance-nebula/build/bin/nebula-storaged at RWSpinLock.h:320
#0 0x000000000353af9d in folly::RWSpinLock::try_lock_shared()+107 in /root/src/new-balance-nebula/build/bin/nebula-storaged at RWSpinLock.h:281
#1 0x000000000353adf3 in folly::RWSpinLock::lock_shared()+34 in /root/src/new-balance-nebula/build/bin/nebula-storaged at RWSpinLock.h:210
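The frames at RWSpinLock.h:210-281 belong to the reader path of folly::RWSpinLock, which is a busy-spin lock: lock_shared() simply retries try_lock_shared() in a loop, and every failed attempt still performs two atomic read-modify-writes (the fetch_add inlined at atomic_base.h:541 above). The sketch below is a simplified, self-contained approximation of that reader path (illustrative only, not the exact folly source); it shows why many reader threads contending with a writer that never releases the lock can keep the CPUs saturated as in the top output above.

#include <atomic>
#include <cstdint>
#include <thread>

// Simplified approximation of the folly::RWSpinLock reader path
// (see folly's RWSpinLock.h); illustrative, not the upstream code.
class RWSpinLockSketch {
  enum : int32_t { READER = 4, UPGRADED = 2, WRITER = 1 };
  std::atomic<int32_t> bits_{0};

 public:
  bool try_lock_shared() {
    // Optimistically bump the reader count (the fetch_add seen inlined
    // at atomic_base.h:541 in the stack above) ...
    int32_t value = bits_.fetch_add(READER, std::memory_order_acquire);
    if (value & (WRITER | UPGRADED)) {
      // ... and roll it back if a writer holds or is acquiring the lock.
      bits_.fetch_add(-READER, std::memory_order_release);
      return false;
    }
    return true;
  }

  void lock_shared() {
    // Busy-spin until the optimistic acquire succeeds; every failed round
    // still costs two atomic RMWs, so blocked readers burn CPU instead of
    // sleeping.
    uint32_t count = 0;
    while (!try_lock_shared()) {
      if (++count > 1000) {
        std::this_thread::yield();
      }
    }
  }

  void unlock_shared() { bits_.fetch_add(-READER, std::memory_order_release); }
};

int main() {
  RWSpinLockSketch lock;
  lock.lock_shared();    // uncontended readers pass straight through,
  lock.unlock_shared();  // but if WRITER/UPGRADED stays set, lock_shared() spins forever
  return 0;
}

If something holds (or never clears) the writer/upgraded bits of such a lock, every reader thread ends up in this loop, which would match both the ~3600-3800% CPU above and the fact that almost every sampled frame is in lock_shared()/try_lock_shared().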
We failed to re-enable the storage log (logging had been disabled when storaged was started); stderr.txt shows:
Could not create logging file: Too many open files
COULD NOT CREATE A LOGGINGFILE 20220107-045351.14!Could not create logging file: Too many open files
COULD NOT CREATE A LOGGINGFILE 20220107-045351.14!Could not create logging file: Too many open files
COULD NOT CREATE A LOGGINGFILE 20220107-045351.14!Could not create logging file: Too many open files
COULD NOT CREATE A LOGGINGFILE 20220107-045423.14!Could not create logging file: Too many open files
COULD NOT CREATE A LOGGINGFILE 20220107-045423.14!Could not create logging file: Too many open files
COULD NOT CREATE A LOGGINGFILE 20220107-045423.14!Could not create logging file: Too many open files
COULD NOT CREATE A LOGGINGFILE 20220107-045455.14!Could not create logging file: Too many open files
COULD NOT CREATE A LOGGINGFILE 20220107-045455.14!Could not create logging file: Too many open files
COULD NOT CREATE A LOGGINGFILE 20220107-045455.14!Could not create logging file: Too many open files
COULD NOT CREATE A LOGGINGFILE 20220107-045528.14!Could not create logging file: Too many open files
COULD NOT CREATE A LOGGINGFILE 20220107-045528.14!Could not create logging file: Too many open files
COULD NOT CREATE A LOGGINGFILE 20220107-045528.14!Could not create logging file: Too many open files
COULD NOT CREATE A LOGGINGFILE 20220107-045600.14!Could not create logging file: Too many open files
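The "Too many open files" messages mean glog cannot open a new log file because the process has exhausted its file-descriptor limit, which is also why re-enabling the storage log failed. As a quick check (a minimal standalone sketch, not part of nebula), the limit the storaged process actually sees can be printed with getrlimit and compared with the number of entries under /proc/<pid>/fd of the storaged process:

#include <sys/resource.h>
#include <cstdio>

// Print the soft/hard open-file limits (RLIMIT_NOFILE) of the current process.
// Run it inside the same container as nebula-storaged so it inherits the same
// limits (assumes Linux; this helper is not part of the nebula code base).
int main() {
  struct rlimit rl;
  if (getrlimit(RLIMIT_NOFILE, &rl) != 0) {
    perror("getrlimit");
    return 1;
  }
  std::printf("RLIMIT_NOFILE soft=%llu hard=%llu\n",
              static_cast<unsigned long long>(rl.rlim_cur),
              static_cast<unsigned long long>(rl.rlim_max));
  return 0;
}

If the soft limit turns out to be low for this workload, raising it (for example with ulimit -n before starting storaged, or in the container configuration) should at least let the log file be created again so the underlying problem can be inspected.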
The storaged process also failed to exit after receiving SIGTERM. Stack dump before issuing SIGTERM:
before.txt
Stack dump after issuing SIGTERM:
after.txt
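One plausible way the spinlock contention and the failure to exit are connected (purely illustrative; this is not nebula's actual shutdown code): a thread that is busy-spinning inside lock_shared() never gets back to a point where it would notice a cooperative stop request, so a SIGTERM handler that only sets a flag cannot bring the process down. The toy program below uses hypothetical names (stopping, lockBits) and deliberately never terminates, mimicking the hang observed here:

#include <atomic>
#include <chrono>
#include <csignal>
#include <cstdint>
#include <cstdio>
#include <thread>

std::atomic<bool> stopping{false};    // flag a SIGTERM handler would set
std::atomic<int32_t> lockBits{1};     // pretend a writer holds the lock forever

void worker() {
  while (!stopping.load()) {          // the shutdown check ...
    while (lockBits.load() & 1) {     // ... is never reached again once the
      std::this_thread::yield();      // thread enters this inner spin loop
    }
  }
}

void onTerm(int) { stopping.store(true); }  // SIGTERM only flips the flag

int main() {
  std::signal(SIGTERM, onTerm);
  std::thread t(worker);
  std::this_thread::sleep_for(std::chrono::seconds(1));
  std::raise(SIGTERM);                // requests shutdown ...
  t.join();                           // ... but never returns: the process
  std::printf("clean exit\n");        // "fails to exit", as seen above
  return 0;
}

Whether that is what happens inside storaged would have to be confirmed from before.txt/after.txt; the sketch only illustrates why a spinning thread and an unanswered SIGTERM tend to go together.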
Your Environments (required)
uname -a
g++ --version or clang++ --version
lscpu
How To Reproduce (required)
Steps to reproduce the behavior:
We find that the storage cluster goes insane during the last step.
Expected behavior
Additional context