Metad not receive heartbeat from storaged #4890

Closed
robcinko opened this issue Nov 16, 2022 · 7 comments

Comments

@robcinko

Hi,

We had to reboot the k8s workers (for maintenance) on which the Nebula cluster is running. After the workers booted up, all Nebula services were running, but 2 of the 3 storaged instances are not sending heartbeats to metad, and 2 of the 3 storaged pods show as OFFLINE (SHOW HOSTS in nebula-console). Sometimes the affected storaged pods send a heartbeat and go ONLINE, but after some time they go OFFLINE again.

Do you have any suggestions on how to fix this?

Thank you very much.

@wey-gu
Contributor

wey-gu commented Nov 16, 2022

Dear @robcinko

Could you please check the logs under the log volumes of the storaged pods?

@robcinko
Author

Hello @wey-gu

If you mean that I should check the PVC that the storaged pod uses as its log volume, there are no events there.
Events: <none>

@wey-gu
Contributor

wey-gu commented Nov 16, 2022

@robcinko the logs are not wired to k8s; they are file-based. Could you open a shell in the pod and print the logs there?

[root@nebula-storaged-0 nebula]# pwd
/usr/local/nebula
[root@nebula-storaged-0 nebula]# ls logs
nebula-storaged.ERROR                                               nebula-storaged.nebula-storaged-0.root.log.INFO.20221115-042115.1
nebula-storaged.INFO                                                nebula-storaged.nebula-storaged-0.root.log.INFO.20221116-065927.1
nebula-storaged.WARNING                                             nebula-storaged.nebula-storaged-0.root.log.WARNING.20221114-094645.1
nebula-storaged.nebula-storaged-0.root.log.ERROR.20221114-094645.1  nebula-storaged.nebula-storaged-0.root.log.WARNING.20221115-042115.1
nebula-storaged.nebula-storaged-0.root.log.ERROR.20221115-042125.1  nebula-storaged.nebula-storaged-0.root.log.WARNING.20221116-065927.1
nebula-storaged.nebula-storaged-0.root.log.ERROR.20221116-065947.1  storaged-stderr.log
nebula-storaged.nebula-storaged-0.root.log.INFO.20221114-094645.1   storaged-stdout.log
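
If opening an interactive shell is inconvenient, the same files can probably be read from outside the pod with `kubectl exec`. The helper below is only a sketch: the function name `storaged_log_tail`, the default namespace `nebula`, and the default of 50 lines are assumptions for illustration, and it assumes `tail` is present in the container image. The log directory is the one shown in the listing above.

```shell
#!/usr/bin/env bash
# Print the last lines of a storaged log file from inside a pod.
# Assumptions (hypothetical, adjust to your cluster):
#   - namespace defaults to "nebula"
#   - logs live under /usr/local/nebula/logs, as in the listing above
#   - `tail` exists in the container image
storaged_log_tail() {
  local pod="$1"                 # e.g. nebula-storaged-0
  local severity="${2:-ERROR}"   # ERROR, WARNING, or INFO
  local namespace="${3:-nebula}"
  local lines="${4:-50}"
  kubectl exec -n "$namespace" "$pod" -- \
    tail -n "$lines" "/usr/local/nebula/logs/nebula-storaged.${severity}"
}

# Example call (placeholder pod name):
#   storaged_log_tail nebula-storaged-0 ERROR
```

If the pod runs more than one container, `kubectl exec` may also need `-c <container>` to pick the storaged container.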

@robcinko
Author

robcinko commented Nov 16, 2022

Oh, my apologies for misunderstanding.

The only pod that is sending heartbeats is nebula-cluster-storaged-1.

[root@nebula-cluster-storaged-0 nebula]# cat logs/nebula-storaged.ERROR
Log file created at: 2022/11/16 06:33:06
Running on machine: nebula-cluster-storaged-0
Running duration (h:mm:ss): 0:00:00
Log line format: [IWEF]yyyymmdd hh:mm:ss.uuuuuu threadid file:line] msg
E20221116 06:33:06.538713 39 AddEdgesProcessor.cpp:116] Space 165, Edge -170 invalid
E20221116 06:33:06.539002 39 AddEdgesProcessor.cpp:116] Space 165, Edge -170 invalid
E20221116 06:33:06.539022 39 AddEdgesProcessor.cpp:116] Space 165, Edge -170 invalid
E20221116 06:33:06.539039 39 AddEdgesProcessor.cpp:116] Space 165, Edge -170 invalid
E20221116 06:33:06.539057 39 AddEdgesProcessor.cpp:116] Space 165, Edge -170 invalid
E20221116 06:33:06.539074 39 AddEdgesProcessor.cpp:116] Space 165, Edge -170 invalid
E20221116 06:33:06.539091 39 AddEdgesProcessor.cpp:116] Space 165, Edge -170 invalid
E20221116 06:33:06.539109 39 AddEdgesProcessor.cpp:116] Space 165, Edge -170 invalid
E20221116 06:33:06.539126 39 AddEdgesProcessor.cpp:116] Space 165, Edge -170 invalid
E20221116 06:33:06.539152 39 AddEdgesProcessor.cpp:116] Space 165, Edge -170 invalid
E20221116 06:33:06.539170 39 AddEdgesProcessor.cpp:116] Space 165, Edge -170 invalid
E20221116 06:33:06.539187 39 AddEdgesProcessor.cpp:116] Space 165, Edge -170 invalid
E20221116 06:33:06.539204 39 AddEdgesProcessor.cpp:116] Space 165, Edge -170 invalid
E20221116 06:33:06.539222 39 AddEdgesProcessor.cpp:116] Space 165, Edge -170 invalid
E20221116 06:33:06.539239 39 AddEdgesProcessor.cpp:116] Space 165, Edge -170 invalid
E20221116 06:33:06.539256 39 AddEdgesProcessor.cpp:116] Space 165, Edge -170 invalid
E20221116 06:39:10.150209 72 MetaClient.cpp:490] Get edge schemas failed for spaceId 2, Space not existed!
E20221116 06:39:10.150283 72 MetaClient.cpp:318] Load Schemas Failed
E20221116 08:19:32.309896 178 MetaClient.cpp:1902] Space 167 not found!
E20221116 08:20:17.762274 178 MetaClient.cpp:1902] Space 167 not found!
E20221116 08:55:42.313542 18 Serializer.h:43] Thrift serialization is only defined for structs and unions, not containers thereof. Attemping to deserialize a value of type nebula::HostAddr.

[root@nebula-cluster-storaged-1 nebula]# cat logs/nebula-storaged.ERROR
Log file created at: 2022/11/15 13:03:17
Running on machine: nebula-cluster-storaged-1
Running duration (h:mm:ss): 0:00:00
Log line format: [IWEF]yyyymmdd hh:mm:ss.uuuuuu threadid file:line] msg
E20221115 13:03:17.143180 51 ThriftClientManager-inl.h:70] Failed to resolve address for 'nebula-cluster-metad-0.nebula-cluster-metad-headless.nebula.svc.cluster.local': Name or service not known (error=-2): Unknown error -2
E20221115 13:10:49.631786 71 MetaClient.cpp:744] Send request to "nebula-cluster-metad-2.nebula-cluster-metad-headless.nebula.svc.cluster.local":9559, exceed retry limit
E20221115 13:10:49.631862 71 MetaClient.cpp:745] RpcResponse exception: apache::thrift::transport::TTransportException: Failed to write to remote endpoint. Wrote 0 bytes. AsyncSocketException: AsyncSocketException: write timed out during connection, type = Timed out
E20221115 13:10:49.631933 72 MetaClient.cpp:178] Heartbeat failed, status:RPC failure in MetaClient: apache::thrift::transport::TTransportException: Failed to write to remote endpoint. Wrote 0 bytes. AsyncSocketException: AsyncSocketException: write timed out during connection, type = Timed out
E20221115 13:11:00.651211 60 ThriftClientManager-inl.h:70] Failed to resolve address for 'nebula-cluster-metad-2.nebula-cluster-metad-headless.nebula.svc.cluster.local': Name or service not known (error=-2): Unknown error -2
E20221115 13:11:01.660354 60 ThriftClientManager-inl.h:70] Failed to resolve address for 'nebula-cluster-metad-2.nebula-cluster-metad-headless.nebula.svc.cluster.local': Name or service not known (error=-2): Unknown error -2
E20221115 13:11:02.664425 72 MetaClient.cpp:178] Heartbeat failed, status:Machine not existed!
E20221115 13:11:12.679991 67 ThriftClientManager-inl.h:70] Failed to resolve address for 'nebula-cluster-metad-2.nebula-cluster-metad-headless.nebula.svc.cluster.local': Name or service not known (error=-2): Unknown error -2
E20221115 13:11:13.687848 67 ThriftClientManager-inl.h:70] Failed to resolve address for 'nebula-cluster-metad-2.nebula-cluster-metad-headless.nebula.svc.cluster.local': Name or service not known (error=-2): Unknown error -2
E20221115 13:11:14.694576 67 ThriftClientManager-inl.h:70] Failed to resolve address for 'nebula-cluster-metad-2.nebula-cluster-metad-headless.nebula.svc.cluster.local': Name or service not known (error=-2): Unknown error -2
E20221115 13:11:15.699323 67 ThriftClientManager-inl.h:70] Failed to resolve address for 'nebula-cluster-metad-2.nebula-cluster-metad-headless.nebula.svc.cluster.local': Name or service not known (error=-2): Unknown error -2
E20221115 13:11:15.699527 67 MetaClient.cpp:744] Send request to "nebula-cluster-metad-2.nebula-cluster-metad-headless.nebula.svc.cluster.local":9559, exceed retry limit
E20221115 13:11:15.699550 67 MetaClient.cpp:745] RpcResponse exception: apache::thrift::transport::TTransportException: Connection not open: apache::thrift::transport::TTransportException: AsyncSocketException: setReadCallback() called with socket in invalid state, type = Socket not open
E20221115 13:11:15.699678 72 MetaClient.cpp:178] Heartbeat failed, status:RPC failure in MetaClient: apache::thrift::transport::TTransportException: Connection not open: apache::thrift::transport::TTransportException: AsyncSocketException: setReadCallback() called with socket in invalid state, type = Socket not open
E20221116 08:55:41.079568 25 Serializer.h:43] Thrift serialization is only defined for structs and unions, not containers thereof. Attemping to serialize a value of type nebula::HostAddr.

Log file created at: 2022/11/15 13:11:10
Running on machine: nebula-cluster-storaged-2
Running duration (h:mm:ss): 0:00:00
Log line format: [IWEF]yyyymmdd hh:mm:ss.uuuuuu threadid file:line] msg
E20221115 13:11:10.855696 51 ThriftClientManager-inl.h:70] Failed to resolve address for 'nebula-cluster-metad-2.nebula-cluster-metad-headless.nebula.svc.cluster.local': Name or service not known (error=-2): Unknown error -2
E20221115 13:11:12.864549 51 ThriftClientManager-inl.h:70] Failed to resolve address for 'nebula-cluster-metad-2.nebula-cluster-metad-headless.nebula.svc.cluster.local': Name or service not known (error=-2): Unknown error -2
E20221115 13:11:12.864735 51 MetaClient.cpp:744] Send request to "nebula-cluster-metad-2.nebula-cluster-metad-headless.nebula.svc.cluster.local":9559, exceed retry limit
E20221115 13:11:12.864763 51 MetaClient.cpp:745] RpcResponse exception: apache::thrift::transport::TTransportException: Connection not open: apache::thrift::transport::TTransportException: AsyncSocketException: setReadCallback() called with socket in invalid state, type = Socket not open
E20221115 13:11:12.864846 1 MetaClient.cpp:98] Heartbeat failed, status:RPC failure in MetaClient: apache::thrift::transport::TTransportException: Connection not open: apache::thrift::transport::TTransportException: AsyncSocketException: setReadCallback() called with socket in invalid state, type = Socket not open
E20221116 06:33:06.157292 44 AddEdgesProcessor.cpp:116] Space 165, Edge -170 invalid
E20221116 06:33:06.157379 44 AddEdgesProcessor.cpp:116] Space 165, Edge -170 invalid
E20221116 06:33:06.157399 44 AddEdgesProcessor.cpp:116] Space 165, Edge -170 invalid
E20221116 06:33:06.157418 44 AddEdgesProcessor.cpp:116] Space 165, Edge 170 invalid
E20221116 06:33:06.157435 44 AddEdgesProcessor.cpp:116] Space 165, Edge -170 invalid
E20221116 06:33:06.157454 44 AddEdgesProcessor.cpp:116] Space 165, Edge -170 invalid
E20221116 06:33:06.157472 44 AddEdgesProcessor.cpp:116] Space 165, Edge -170 invalid
E20221116 06:33:06.157490 44 AddEdgesProcessor.cpp:116] Space 165, Edge -170 invalid
E20221116 06:33:06.157508 44 AddEdgesProcessor.cpp:116] Space 165, Edge -170 invalid
E20221116 06:33:06.157526 44 AddEdgesProcessor.cpp:116] Space 165, Edge -170 invalid
E20221116 06:39:12.657861 72 MetaClient.cpp:295] Get parts allocation failed for spaceId 2, status Key not existed!
E20221116 08:16:22.724211 72 MetaClient.cpp:295] Get parts allocation failed for spaceId 165, status Key not existed!
E20221116 08:55:42.731007 16 Serializer.h:43] Thrift serialization is only defined for structs and unions, not containers thereof. Attemping to deserialize a value of type nebula::HostAddr.

@wey-gu
Contributor

wey-gu commented Nov 16, 2022

It seems that storaged could not resolve metad-0's FQDN, nebula-cluster-metad-0.nebula-cluster-metad-headless.nebula.svc.cluster.local. Could you troubleshoot the network/DNS of the cluster?
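
As a rough starting point for that troubleshooting, the sketch below does two things: it summarises which hostnames storaged failed to resolve, based only on the `Failed to resolve address for '<host>'` message format visible in the logs above, and it tries to resolve a name from inside a pod. The function names `unresolved_hosts` and `check_dns_from_pod` and the default namespace `nebula` are hypothetical, and the second function assumes `getent` is available in the container image.

```shell
#!/usr/bin/env bash
# Count DNS resolution failures per hostname in a storaged ERROR log.
# Matches the "Failed to resolve address for '<host>'" lines quoted above.
unresolved_hosts() {
  local logfile="$1"
  grep -o "Failed to resolve address for '[^']*'" "$logfile" \
    | sed "s/^Failed to resolve address for '//; s/'\$//" \
    | sort | uniq -c | sort -rn
}

# Try to resolve a hostname from inside a pod.
# Assumptions (hypothetical): namespace defaults to "nebula",
# and `getent` exists in the container image.
check_dns_from_pod() {
  local pod="$1" host="$2" namespace="${3:-nebula}"
  kubectl exec -n "$namespace" "$pod" -- getent hosts "$host"
}

# Example calls (run the first inside the pod, or on a copy of the log):
#   unresolved_hosts /usr/local/nebula/logs/nebula-storaged.ERROR
#   check_dns_from_pod nebula-cluster-storaged-0 \
#     nebula-cluster-metad-0.nebula-cluster-metad-headless.nebula.svc.cluster.local
```

If the name fails to resolve from inside the pod as well, the problem is more likely in the cluster DNS than in Nebula itself.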

@robcinko
Author

Yes, it seems we had a problem with CoreDNS; it is solved now.

Thank you for your help. 😍

@wey-gu
Contributor

wey-gu commented Nov 16, 2022

You are always welcome @robcinko :D
