Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

I think ThriftClientManager's ClientMap has some problem, it will be caused Connection reset by peer #5413

Closed
Milittle opened this issue Mar 18, 2023 · 3 comments
Labels
affects/none PR/issue: this bug affects none version. process/fixed Process of bug severity/none Severity of bug type/bug Type: something is unexpected

Comments

@Milittle
Copy link
Contributor

Milittle commented Mar 18, 2023

Describe the bug (required)

As a title and reference the follow topic:
https://discuss.nebula-graph.com.cn/t/topic/10803

How To Reproduce(required)

Occasionally

Expected behavior

graphd send rpc to storage, it can be successful.

Additional context

If the cluster run long time, graphd send rpc to storage, it will be show the Connection reset by peer, if we restart graphd, do not touch any other things, the error was gone, so I guess from the following code:

if (transport == nullptr || transport->hangup()) {

the ThriftClientManager's ClientMap AsyncClient judgement, it has some problem here:

I try to dig the folly AsyncSocket, find some method to judge the connection normal or not, maybe be can help, but need you guys to confirm this worked or not, thx:

The AsyncSocket has some method:

bool good() const override;

virtual bool isClosedByPeer() const {
  return (
      state_ == StateEnum::CLOSED &&
      (readErr_ == READ_EOF || readErr_ == READ_ERROR));
}

virtual bool isClosedBySelf() const {
  return (
      state_ == StateEnum::CLOSED &&
      (readErr_ != READ_EOF && readErr_ != READ_ERROR));
}

I think the above check in

if (transport == nullptr || transport->hangup()) {

this line can be changed such like this can be fixed this problem, Raise this problem becuase our ClientMap connection has some problem can not be removed in here:

if (transport == nullptr || transport->hangup() || !transport->good() || transport->isClosedByPeer() || transport->isClosedBySelf()) {

Anyway, thanks, and pls discuss this problem.

@Milittle Milittle added the type/bug Type: something is unexpected label Mar 18, 2023
@github-actions github-actions bot added affects/none PR/issue: this bug affects none version. severity/none Severity of bug labels Mar 18, 2023
@critical27
Copy link
Contributor

I can't recall it correctly, it has been a long while. IIRC, channel->good() will call transport->good(), you could read the PR vesoft-inc/nebula-common#431.

As for isClosedByPeer and isClosedBySelf, maybe it will help in your case. You'd could add it in your code, and test is in your env. You could limit the io thread in graphd and storaged to 1, set loglevel to v=5 (or v=6, don't remember clearly) to see the log in AsyncSocket, or a tcpdump from both graphd and storaged will help a lot to analyze the problem.

@critical27
Copy link
Contributor

BTW, if you retry to query a lot of times, the Connection reset by peer will be gone, because the channel is a thread_local object, you need to refresh every channel in the ThriftClientMananger.

@Milittle
Copy link
Contributor Author

thx~

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
affects/none PR/issue: this bug affects none version. process/fixed Process of bug severity/none Severity of bug type/bug Type: something is unexpected
Projects
None yet
Development

No branches or pull requests

3 participants