Handle epipe errors #871

n-oden · 2023-01-13T22:04:32Z

Presently, if the tcp session to the server closes remotely (e.g. if the clickhouse-server process restarts), we will emit broken pipe errors potentially until we reach ConnMaxLifetime.

Since EPIPE means that the TCP session is dead (i.e. we received a RST packet from the server), there is no point in attempting to proceed past that point: set the connection as closed.

n-oden · 2023-01-13T22:08:35Z

n.b. I got a bit lost in the weeds trying to create a reasonable test case for this: the container-management setup in your test suite can't serialize testcontainer.Container, so env.Container is nil by the time you're on the far side of GetTestEnvironment(), and even if you work around that by directly instantiating a new env in the test so that you can call container.Stop(), I couldn't think of a reliable way to assert that we had caught the EPIPE and closed the connection -- but if you have any ideas I'm all ears!

rf · 2023-01-13T22:22:20Z

#844 is most likely related!

n-oden · 2023-01-17T14:57:04Z

@jkaflik Could I trouble you for a look at this?

jkaflik · 2023-01-17T15:42:24Z

@n-oden I will take a look to see whether we can make it reproducible in tests

n-oden · 2023-01-17T17:43:35Z

Reproducing it is conceptually simple: create a client that continuously writes to a clickhouse db, then restart the clickhouse server while the client is writing and observe the infinite flood of broken pipe errors in the logs.

In practice with the integration test framework you have here, I'm not sure how we'd implement that while still keeping test times reasonable and results deterministic: you'd want to stop and start the container while writes were happening and I'm not sure how you'd do that with any precision. I'll keep poking at it but am very open to suggestions!

jkaflik · 2023-01-17T18:12:55Z

@n-oden I looked briefly at the issue. Can you please let me know the version you use? Also, I assume you are using the database/sql interface as your proposal implements a driver.ErrBadConn.

jkaflik · 2023-01-17T18:32:06Z

I agree we should mark the connection as closed. I will look at how we can achieve the same result for the database/sql driver, but without remapping an error at the native interface level.

n-oden · 2023-01-17T18:49:17Z

@jkaflik we're using v2.5.0, having recently updated from v1. And correct, we're using the database/sql interface.

jkaflik · 2023-01-18T14:09:35Z

@n-oden while I agree the connection should be marked as closed and the SQL driver on top should receive driver.ErrBadConn, I struggle to reproduce the following:

we will emit broken pipe errors potentially until we reach ConnMaxLifetime.

Restarting the CH server just before writing to the socket (conn.go:259) gives me the broken pipe, but afterwards the connection is re-established and everything works fine. In between, I get a few connection refused errors during dial. This tells me that the conn is abandoned and a new one is dialed until the server is back and accepting connections.

It has the same behavior no matter if we introduce your change or not.

My use case is:

conn := clickhouse.OpenDB( ... )
conn.SetMaxOpenConns(1)

for {
  scope, err := conn.Begin()
  stmt, err := scope.Prepare("INSERT....")
  stmt.Exec(...)
  scope.Commit()
}

I think there is some difference in how you do that and how you are able to reproduce it.

n-oden · 2023-01-18T15:47:56Z

@jkaflik I've been having the exact same issue trying to reproduce the error in the test harness here, which is frustrating, but we have definitely seen the described behavior on our production systems. Happy to do a screenshare or post a recording somewhere if that would help!

jkaflik · 2023-01-18T17:16:46Z

@n-oden anything that brings me closer to this issue is welcome. Can you also let me know what/when library functions you call?

n-oden · 2023-01-18T17:51:23Z

drop me a line -- n@oden.io

Presently, if the TCP session to the server closes, we will emit broken pipe errors until either the connection comes back to "idle" state or until we reach ConnMaxLifetime. Since EPIPE means that the TCP session is dead (i.e. we received a RST packet from the server), there is no point in attempting to proceed past that point: set the connection as closed.

n-oden · 2023-01-19T19:31:22Z

@jkaflik updated as we discussed!

jkaflik · 2023-01-19T19:35:59Z

@n-oden thanks! As we agreed, I'm merging it in order to mitigate the issue on your side.

Let's continue with #844 since it might have the same root cause.

n-oden force-pushed the handle-epipe branch from a51c9cd to 6328c62 January 13, 2023 22:16

jkaflik self-assigned this Jan 17, 2023

jkaflik added the bug label Jan 17, 2023

n-oden force-pushed the handle-epipe branch from 6328c62 to ff1f0ac January 19, 2023 19:30

jkaflik merged commit 936a983 into ClickHouse:main Jan 19, 2023

n-oden deleted the handle-epipe branch January 19, 2023 21:06
