I think ThriftClientManager's ClientMap has some problem, it will be caused `Connection reset by peer` #5413

Milittle · 2023-03-18T16:45:29Z

Describe the bug (required)

As a title and reference the follow topic:
https://discuss.nebula-graph.com.cn/t/topic/10803

How To Reproduce(required)

Occasionally

Expected behavior

graphd send rpc to storage, it can be successful.

Additional context

If the cluster run long time, graphd send rpc to storage, it will be show the Connection reset by peer, if we restart graphd, do not touch any other things, the error was gone, so I guess from the following code:

nebula/src/common/thrift/ThriftClientManager-inl.h

Line 42 in 9fee385

if (transport == nullptr || transport->hangup()) {

the ThriftClientManager's ClientMap AsyncClient judgement, it has some problem here:

I try to dig the folly AsyncSocket, find some method to judge the connection normal or not, maybe be can help, but need you guys to confirm this worked or not, thx:

The AsyncSocket has some method:

bool good() const override;

virtual bool isClosedByPeer() const {
  return (
      state_ == StateEnum::CLOSED &&
      (readErr_ == READ_EOF || readErr_ == READ_ERROR));
}

virtual bool isClosedBySelf() const {
  return (
      state_ == StateEnum::CLOSED &&
      (readErr_ != READ_EOF && readErr_ != READ_ERROR));
}

I think the above check in

nebula/src/common/thrift/ThriftClientManager-inl.h

Line 42 in 9fee385

if (transport == nullptr || transport->hangup()) {

this line can be changed such like this can be fixed this problem, Raise this problem becuase our ClientMap connection has some problem can not be removed in here:

if (transport == nullptr || transport->hangup() || !transport->good() || transport->isClosedByPeer() || transport->isClosedBySelf()) {

Anyway, thanks, and pls discuss this problem.

The text was updated successfully, but these errors were encountered:

critical27 · 2023-03-20T08:05:03Z

I can't recall it correctly, it has been a long while. IIRC, channel->good() will call transport->good(), you could read the PR vesoft-inc/nebula-common#431.

As for isClosedByPeer and isClosedBySelf, maybe it will help in your case. You'd could add it in your code, and test is in your env. You could limit the io thread in graphd and storaged to 1, set loglevel to v=5 (or v=6, don't remember clearly) to see the log in AsyncSocket, or a tcpdump from both graphd and storaged will help a lot to analyze the problem.

critical27 · 2023-03-20T08:13:54Z

BTW, if you retry to query a lot of times, the Connection reset by peer will be gone, because the channel is a thread_local object, you need to refresh every channel in the ThriftClientMananger.

Milittle · 2023-03-21T15:13:51Z

thx~

Milittle added the type/bug Type: something is unexpected label Mar 18, 2023

github-actions bot added affects/none PR/issue: this bug affects none version. severity/none Severity of bug labels Mar 18, 2023

Sophie-Xie closed this as completed Mar 24, 2023

github-actions bot added the process/fixed Process of bug label Mar 24, 2023

wey-gu mentioned this issue Mar 25, 2023

Weekly Report 2023-03-24 vesoft-inc/nebula-community#393

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

I think ThriftClientManager's ClientMap has some problem, it will be caused `Connection reset by peer` #5413

I think ThriftClientManager's ClientMap has some problem, it will be caused `Connection reset by peer` #5413

Milittle commented Mar 18, 2023 •

edited

Loading

critical27 commented Mar 20, 2023

critical27 commented Mar 20, 2023

Milittle commented Mar 21, 2023

I think ThriftClientManager's ClientMap has some problem, it will be caused Connection reset by peer #5413

I think ThriftClientManager's ClientMap has some problem, it will be caused Connection reset by peer #5413

Comments

Milittle commented Mar 18, 2023 • edited Loading

critical27 commented Mar 20, 2023

critical27 commented Mar 20, 2023

Milittle commented Mar 21, 2023

I think ThriftClientManager's ClientMap has some problem, it will be caused `Connection reset by peer` #5413

I think ThriftClientManager's ClientMap has some problem, it will be caused `Connection reset by peer` #5413

Milittle commented Mar 18, 2023 •

edited

Loading