Proposal: Pool/Conn, handling connection timeouts. #2586


Open
monkey92t opened this issue May 7, 2023 · 12 comments

@monkey92t
Collaborator

monkey92t commented May 7, 2023

Recently, we have received many questions about go-redis creating a large number of connections. When a connection is deemed no longer usable due to a context cancellation or a read/write timeout, we remove it from the connection pool. We have to do this because the connection may contain dirty read/write data, which could contaminate subsequent command invocations.

However, when ReadTimeout and WriteTimeout are set to relatively small values and the network between the client and the redis-server is poor, a large number of command timeouts occur, leading go-redis to create numerous new connections and further worsening the situation.

  1. Add a connection recycling mechanism that marks network connections, e.g. "timed out - unread residual data" (the most common error). We would place these connections into a recycle bin, where periodic checks read and discard any residual dirty data. Once a connection is confirmed to be usable, it is returned to the connection pool. However, this approach would keep more network connections open than the value specified by PoolSize, which might surprise users, and its effectiveness is limited when network conditions are severely degraded.

  2. Add a network connection firewall that prevents the creation of new connections when widespread connection rebuilding is detected; periodic health checks would then be initiated. We could also limit the creation of new connections by periodically injecting a certain number of healthy connections into the connection pool (similar to a token bucket algorithm); see the sketch after this list.

  3. Change the go-redis command read/write mechanism by implementing a read/write separation approach. Network connections would no longer carry timeout semantics; commands and their expected results would instead be passed through mechanisms such as Go channels. In other words, the connection becomes a stateless transport device.
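
To make item 2 concrete, here is a minimal sketch of the token-bucket idea, assuming a hypothetical dialLimiter that would sit in front of whatever function the pool uses to dial new connections; none of these names exist in go-redis today.

```go
package pool

import (
	"context"
	"errors"
	"net"
	"time"
)

// errDialThrottled would be returned when the bucket is empty, i.e. when a
// storm of timeouts is already rebuilding many connections.
var errDialThrottled = errors.New("redis: too many new connections, dial throttled")

// dialLimiter refills a fixed number of dial tokens per interval, capping
// how many new connections can be created while the network is degraded.
type dialLimiter struct {
	tokens chan struct{}
}

func newDialLimiter(perInterval int, interval time.Duration) *dialLimiter {
	l := &dialLimiter{tokens: make(chan struct{}, perInterval)}
	for i := 0; i < perInterval; i++ {
		l.tokens <- struct{}{} // start with a full bucket
	}
	go func() {
		ticker := time.NewTicker(interval)
		defer ticker.Stop()
		for range ticker.C {
			for i := 0; i < perInterval; i++ {
				select {
				case l.tokens <- struct{}{}:
				default: // bucket already full
				}
			}
		}
	}()
	return l
}

// dial creates a new connection only if a token is available; otherwise the
// caller has to keep waiting for an existing connection or fail fast.
func (l *dialLimiter) dial(ctx context.Context, dialFn func(context.Context) (net.Conn, error)) (net.Conn, error) {
	select {
	case <-l.tokens:
		return dialFn(ctx)
	default:
		return nil, errDialThrottled
	}
}
```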

@rueian
Contributor

rueian commented May 8, 2023

Upvote for proposal 3. That is how rueidis handles concurrent requests, and it enables higher throughput through pipelining. It will also make implementing client-side caching easier.

@monkey92t
Collaborator Author

@rueian Thanks.

Plan 3 requires us to make extensive code changes and conduct further testing.

I tested Plan 1, and it helped somewhat with minor network fluctuations. However, when network latency went from 1ms to 10ms, Plan 1 no longer made sense.

I think maybe we should give Plan 3 a try.

Additionally, I have been hesitant about the client-side cache you mentioned. I believe that client-side caching consumes some resources on the Redis server, so we need to handle it with caution. We should treat different kinds of keys differently. For example, for frequently modified keys or complex data structures such as hashes and lists, we should not use caching, because a change to any field of a hash would trigger an invalidation of the cached value.

However, in some cases, users store static data such as order information in a hash. This hash table will not be modified in the business logic, so it is no different from a regular key and can be included in the cache.

Last year, I conducted some tests, and tracking all keys in a project without considering their nature actually increased the burden on the Redis server. If we use a broadcasting mode, most clients receive notifications that are irrelevant to them. My suggestion is that we should provide a cache API or a lower-level API to assist users in implementing the caching they need.

@rueian
Contributor

rueian commented May 8, 2023

My suggestion is that we should provide a cache API or a lower-level API to assist users in implementing the caching they need.

Totally agree with you. I have the same concern about the burden that client-side caching adds on servers or on clients.

As a result, in rueidis I made a separate API that lets users opt in only the keys they want to cache locally, not all keys.
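
To illustrate the opt-in idea in go-redis terms, here is a rough sketch of a wrapper that caches only the keys the caller explicitly routes through it. The OptInCache type and its methods are invented for illustration and are not an existing API; a real implementation would also subscribe to CLIENT TRACKING invalidation messages instead of relying only on a local TTL.

```go
package cache

import (
	"context"
	"sync"
	"time"

	"github.com/redis/go-redis/v9"
)

type entry struct {
	val       string
	expiresAt time.Time
}

// OptInCache caches only keys read through GetCached; keys read through the
// plain client are never cached, so mutable keys stay uncached by default.
// This type is a hypothetical illustration, not part of go-redis.
type OptInCache struct {
	rdb *redis.Client
	mu  sync.RWMutex
	m   map[string]entry
}

func NewOptInCache(rdb *redis.Client) *OptInCache {
	return &OptInCache{rdb: rdb, m: make(map[string]entry)}
}

// GetCached serves the key from the local cache when a fresh entry exists,
// and otherwise fetches it from Redis and caches the reply for ttl.
func (c *OptInCache) GetCached(ctx context.Context, key string, ttl time.Duration) (string, error) {
	c.mu.RLock()
	e, ok := c.m[key]
	c.mu.RUnlock()
	if ok && time.Now().Before(e.expiresAt) {
		return e.val, nil
	}

	val, err := c.rdb.Get(ctx, key).Result()
	if err != nil {
		return "", err
	}

	c.mu.Lock()
	c.m[key] = entry{val: val, expiresAt: time.Now().Add(ttl)}
	c.mu.Unlock()
	return val, nil
}
```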

@SoulPancake
Contributor

SoulPancake commented May 10, 2023

Some ideas for proceeding with proposal 3:

We can have a command queue of some sort, and instead of sending commands directly via the connection, we can put them into a queue. Each command could be a struct containing the command itself, a context (to handle timeouts at the command level), and a channel to send the result back.

Then we can have a command dispatcher, which may be a separate goroutine (or multiple goroutines for concurrency), responsible for dispatching commands from the queue to available connections. The dispatcher could select a connection from the pool, send the command, and then put the connection back into the pool.

After sending a command, the dispatcher could start a goroutine to listen for the response on that connection. Once a response is received, it could be sent back to the caller via the channel in the Command struct.

Thoughts regarding timeouts:
If the context associated with a command times out before the response is received, the response listener could simply discard the response when it eventually arrives, ensuring that the connection stays clean for subsequent commands.
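
A rough sketch of that shape, assuming a hypothetical conn interface standing in for a pooled connection; the struct and function names are illustrative only.

```go
package dispatch

import "context"

type result struct {
	reply interface{}
	err   error
}

// command pairs the request with its context and a channel for the reply.
type command struct {
	args []interface{}
	ctx  context.Context
	done chan result // buffered (capacity 1) so the listener never blocks
}

// conn stands in for a pooled connection; writeCmd/readReply are
// hypothetical helpers, not go-redis internals.
type conn interface {
	writeCmd(args []interface{}) error
	readReply() (interface{}, error)
}

// dispatcher pulls commands from the queue and runs each one on a connection
// taken from the pool. The reply is always read in full before the connection
// goes back to the pool, so a caller that timed out leaves no residual data.
func dispatcher(queue <-chan *command, pool chan conn) {
	for cmd := range queue {
		cn := <-pool
		go func(cmd *command, cn conn) {
			defer func() { pool <- cn }()
			if err := cn.writeCmd(cmd.args); err != nil {
				// A real implementation would discard the connection here,
				// since a partial write leaves it in an unknown state.
				cmd.done <- result{err: err}
				return
			}
			reply, err := cn.readReply()
			cmd.done <- result{reply: reply, err: err} // buffered, never blocks
		}(cmd, cn)
	}
}

// submit shows the caller side: if the context expires first, the reply is
// simply discarded into the buffered channel and the connection stays clean.
func submit(ctx context.Context, queue chan<- *command, args []interface{}) (interface{}, error) {
	cmd := &command{args: args, ctx: ctx, done: make(chan result, 1)}
	queue <- cmd
	select {
	case res := <-cmd.done:
		return res.reply, res.err
	case <-cmd.ctx.Done():
		return nil, cmd.ctx.Err()
	}
}
```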

There may still be edge cases to think about.
@monkey92t
Should we design the system here (we can use Excalidraw if that helps) before proceeding with an implementation? I'm interested in working on this / collaborating on this.

(edit: Can we perhaps implement the client-side caching separately from this?)

@monkey92t
Collaborator Author

Yes, my only concern is that Solution 3 requires a significant amount of code modification and testing, which could introduce instability.

@shaolei

shaolei commented May 17, 2023

Yes, my only concern is that Solution 3 requires a significant amount of code modification and testing, which could introduce instability.

I've also run into this problem recently while load testing a project: a large number of connection pool timeouts appeared, and the traces showed some commands taking more than 3 seconds to execute, all with the default client configuration.
Is there any way to work around this temporarily, for example rolling back to v8 or changing some client settings?

@vmihailenco
Collaborator

vmihailenco commented May 17, 2023

We would place these connections into a recycle bin, where periodic checks read and discard any residual dirty data. Once a connection is confirmed to be usable

I doubt this can be done, because the data can arrive a minute after the timeout - see https://redis.uptrace.dev/guide/go-redis-debugging.html#large-number-of-open-connections and the tcp_fin_timeout setting.

Add a network connection firewall that prevents the creation of new connections when widespread connection rebuilding is detected; periodic health checks would then be initiated.

This sounds a bit like dialErrorsNum + lastDialError we already have in the pool. Perhaps we should improve it.

Change the go-redis command read/write mechanism by implementing a read/write separation approach. Network connections would no longer carry timeout semantics; commands and their expected results would instead be passed through mechanisms such as Go channels. In other words, the connection becomes a stateless transport device.

I believe you are trading latency for throughput here. Some users might like it, some might not.

Also, it is a big change, and timeouts will still happen, so it is not a 100% bulletproof solution.

🤷

@monkey92t
Collaborator Author

@vmihailenco

I doubt this can be done, because the data can arrive a minute after the timeout - see https://redis.uptrace.dev/guide/go-redis-debugging.html#large-number-of-open-connections and the tcp_fin_timeout setting.

Here, what I mean is that instead of closing the connection when a timeout occurs, we place it in a recycle bin where there are goroutines constantly performing health checks on the connection. In other words, we might keep more connections than the specified PoolSize.
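
A compact sketch of that recycle bin, with a single draining goroutine and a hypothetical recycleBin type; the names and the 100ms drain window are illustrative, not a proposed final design.

```go
package pool

import (
	"net"
	"time"
)

// recycleBin receives connections that hit a timeout instead of closing them.
// A background goroutine tries to drain residual data; connections that go
// quiet are offered back as healthy, the rest are closed.
type recycleBin struct {
	in      chan net.Conn
	healthy chan net.Conn // connections ready to rejoin the pool
}

func newRecycleBin(size int) *recycleBin {
	rb := &recycleBin{
		in:      make(chan net.Conn, size),
		healthy: make(chan net.Conn, size),
	}
	go rb.loop()
	return rb
}

func (rb *recycleBin) loop() {
	buf := make([]byte, 4096)
	for cn := range rb.in {
		// Read with a short deadline to catch any late reply still in flight.
		_ = cn.SetReadDeadline(time.Now().Add(100 * time.Millisecond))
		_, err := cn.Read(buf)
		if ne, ok := err.(net.Error); ok && ne.Timeout() {
			// Nothing arrived within the window: treat the connection as clean.
			_ = cn.SetReadDeadline(time.Time{})
			select {
			case rb.healthy <- cn:
			default:
				_ = cn.Close() // no room to keep extra connections around
			}
			continue
		}
		// Residual data arrived or the read failed outright. A real
		// implementation would drain and recheck a few times; here we close.
		_ = cn.Close()
	}
}
```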

I believe you are trading latency for throughput here. Some users might like it, some might not.

Here, we utilize two goroutines, one for reading and one for writing, for each net.Conn. Each command is immediately sent without any delay, and we wait for the message results using channels. This approach costs an additional channel send and receive per command, but it makes it much more convenient to handle contexts and timeout waiting.
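
A sketch of those two loops, assuming pre-encoded command bytes and a deliberately simplified one-line reply reader; the names are illustrative. Because replies on a Redis connection come back in request order, a FIFO channel of waiting callers is enough to match replies to requests.

```go
package rwsep

import (
	"bufio"
	"net"
)

type result struct {
	line string
	err  error
}

// request carries pre-encoded command bytes plus the channel the caller waits on.
type request struct {
	raw  []byte
	done chan result // buffered with capacity 1
}

// startLoops attaches one writer goroutine and one reader goroutine to a
// net.Conn. The caller sends a request, then selects on req.done and its own
// context, so timeouts never dirty the connection: the reader still consumes
// the reply even if the caller has already given up.
func startLoops(cn net.Conn, requests <-chan *request) {
	inflight := make(chan *request, 128)

	// Writer: send each command immediately, then record the waiting caller.
	go func() {
		defer close(inflight)
		for req := range requests {
			if _, err := cn.Write(req.raw); err != nil {
				// A real implementation would tear the connection down here,
				// since a partial write breaks the reply ordering.
				req.done <- result{err: err}
				return
			}
			inflight <- req
		}
	}()

	// Reader: match each reply, in order, to the oldest waiting request.
	go func() {
		rd := bufio.NewReader(cn)
		for req := range inflight {
			line, err := rd.ReadString('\n')
			req.done <- result{line: line, err: err}
		}
	}()
}
```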

@ethanvc

ethanvc commented May 19, 2023

Upvote for proposal 3 too. This proposal is essentially what HTTP/2 does; the main difference from HTTP/2 is the lack of a way to balance all requests over one connection.

But we would still face the same problem, because the WATCH command needs a dedicated connection to execute.

@vmihailenco
Collaborator

Here, what I mean is that instead of closing the connection when a timeout occurs, we place it in a recycle bin where there are goroutines constantly performing health checks on the connection. In other words, we might keep more connections than the specified PoolSize.

What I mean is that you would need to keep the connection in the recycle bin for at least one minute to make sure no unsolicited data will arrive. I'd assume that is too much time to be really useful...

Each command is immediately sent without any delay, and we wait for the message results using channels.

Okay, without buffering/pipelining you don't get extra throughput :) The latency will be better, but each command will cost an additional chan send and receive just to handle timeouts better. Such a trade-off does not look attractive to me, but others might disagree.

Async execution with buffering/pipelining looks more interesting, but it should be optional / configurable since there are trade-offs / decisions to be made. Perhaps we might even make buffering optional...

@chayim
Contributor

chayim commented Jun 5, 2023

Agreed with @vmihailenco. The first option is probably no longer valuable by the time the timeout has occurred, and number 3 creates (even) higher memory usage, especially in highly scaled applications. I like retries + health checks, which are features available in other clients.

Async execution is great, but there are several use cases where sync is vital, specifically data types and applications that rely on synchronous reads. I myself have this need today.

@sgp
Contributor

sgp commented Jul 31, 2024

I'm curious whether option (3) is a version of what StackExchange.Redis has implemented with its approach to multiplexing. It seems to be a good idea for many applications, except for things like blocking calls (BPOP, etc.).
