Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RUST-556 POC of maxConnecting #259

Merged
merged 15 commits into from
Nov 2, 2020

Conversation

patrickfreed
Copy link
Contributor

RUST-556

This PR implements a POC of the maxConnecting requirement of the "Avoiding Connection Storms" project. In doing so, it also re-architects the connection pool to use a channel based WaitQueue rather than semaphore based one. This was done to allow for easier and more effective maintenance of the wait queue now that we need to support arbitrary conditions for exiting the queue (e.g. maxConnecting).

I realize that a rewrite of this magnitude typically goes through a design phase, so I apologize for suddenly dropping it like this. I decided it was the preferable choice for this POC largely because a counting semaphore wait queue was frustrating to extend to use in this case, and that this lockless + channel-based pattern is a much more common aproach in the async world for accessing shared state than our previous lock heavy one. Maintaining the old pool and adding support for maxConnecting would introduce even more locking while also producing a suboptimal solution (impossible or difficult to ensure fairness with multiple semaphores/locks in this case), so this seemed like an good time for a rewrite. Additionally, this is a proof-of-concept after all, so I figured it would be a good time to proof-of-concept what an idiomatic async pool would look like too. I think that I should have opted for a pool like this back when I originally converted the old pool to async, but I was lacking in experience with async patterns at the time. Ironically, the original sync pool with its queue of condvars was somewhat similar to this one.

Overview of new pool

I'll give a brief overview of the layout of the new pool here to make digesting the actual code a little easier. We have the same clonable ConnectionPool type as before with the same functionality; however, when we create a new pool, instead of creating a ConnectionPoolInner type that is arc'd and wrapping that, we start a worker task (ConnectionPoolWorker) that opens up a few different channels, and we store senders to those channels (ConnectionRequester, PoolManager) on the ConnectionPool. The worker task is in charge of all changes and access to the connection pool state; to do anything with the pool, a you need to send a request via one of the two senders mentioned before. To check out a connection for example, a thread sends a ConnectionRequest via a ConnectionRequester and waits for the worker to respond. To perform any other pool interactions (e.g. clearing, checking in), a thread uses the PoolManager to send the appropriate requests. The ConnectionRequesters act like strong references to the pool in that they keep the worker alive until they are all dropped, whereas PoolManagers are like weak references and can still be in scope when the worker ceases execution. A ConnectionPool instance holds onto both a ConnectionRequester and a PoolManager, but a Connection for example only holds on to a PoolManager in order to check itself back in. Instead of having a separate background task for ensuring minPoolSize, the worker task just takes care of that.

In summary:

  • ConnectionPoolInner gone, replaced with a background worker task ConnectionPoolWorker that has exclusive access to pool state
  • ConnectionPoolWorker listens on a few channels for incoming requests, responding to them as necessary
  • Threads can check out connections via ConnectionRequesters. The ConnectionPoolWorker quits once all the requesters are dropped
  • Threads can modify the pool state via PoolManagers. These do not keep the worker running and may no-op if outlive the worker.
  • Connections use PoolManagers to check themselves back in

Some notes

Given this architecture, it was trivially easy to introduce maxConnecting: we simply don't poll the ConnectionRequestReceiver when the pool is empty and has too many pending connections. The actual changes required to implement that part are all in the last few commits or so. Additionally, we now have a context which always has exclusive access to the pool state, making it much easier to reason about. Lastly, the drop implementations of the various connection types have been simplified with the removal of ConnectionPoolInner and DroppedConnectionState.

Copy link
Contributor Author

@patrickfreed patrickfreed left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The clippy failures seem unrelated to this PR. The other failures are either due to the max files issue or should be fixed in #258 I think.

Cargo.toml Show resolved Hide resolved

const TEST_DESCRIPTIONS_TO_SKIP: &[&str] = &[
"must destroy checked in connection if pool has been closed",
"must throw error if checkOut is called on a closed pool",
];

const EVENT_DELAY: Duration = Duration::from_millis(500);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is just a temporary solution until the CMAP spec test changes in #258 go in. I wanted to avoid duplicating them here and just opted for the easy solution. Even with this generous delay, we still sometimes don't see the events on Windows with async-std, which is a bit concerning. The failures should go away once the aforementioned changes are merged in though.

@patrickfreed patrickfreed marked this pull request as ready for review October 10, 2020 01:07
Copy link
Contributor

@saghm saghm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've reviewed the refactor of the pool and added comments for it. I'll come back and do a second review of the updates for maxConnecting later this week

src/cmap/connection_requester.rs Outdated Show resolved Hide resolved
src/cmap/test/mod.rs Outdated Show resolved Hide resolved
src/runtime/join_handle.rs Show resolved Hide resolved
src/cmap/mod.rs Outdated Show resolved Hide resolved
src/cmap/worker.rs Outdated Show resolved Hide resolved
src/cmap/worker.rs Outdated Show resolved Hide resolved
src/cmap/manager.rs Outdated Show resolved Hide resolved
@kmahar
Copy link
Contributor

kmahar commented Oct 15, 2020

I’m going to mainly focus my review on the design you describe and the big picture here, rather than the code. I trust that Isabel and Sam can confirm your code does what you say it does 🙂

I mainly just have questions about the overall architecture. I don't think I necessarily know enough about CMAP or what an idiomatic async pool would look like to have strong opinions, so it might be worthwhile having Matt weigh in on this since he is familiar with the past pool designs, and also with maxConnecting, but I will defer to your judgement on whether that is needed.

We have the same clonable ConnectionPool type

This isn’t that important but just curious, IIUC something being clonable means you can make a deep copy of it, right? Why do we need that capability for our connection pool?

So what I understand so far is that (I think) there is a 1:1 pool-worker relationship. Is there also a single manager and a single requester per pool? I thought so from the overview, but then this sentence confused me, because it sounds like there are multiple requesters which a single worker is paying attention to:

The ConnectionPoolWorker quits once all the requesters are dropped

Also, at what point is a requester dropped?

@saghm
Copy link
Contributor

saghm commented Oct 16, 2020

IIUC something being clonable means you can make a deep copy of it, right? Why do we need that capability for our connection pool?

Generally, yes, this is correct. However, Rust's mechanism for reference counted types is to define types in the standard library that implement the same Clone trait as other clonable types but with semantics that share the underlying data. In this case, we utilize those types for all of the mutable state in the connection pool to make it act as if there's a shallow copy. We use the same strategy for Client, Database, and Collection types to allow users to make copies to send to different threads/tasks.

@kmahar
Copy link
Contributor

kmahar commented Oct 16, 2020

@saghm Ahh I see, that makes sense.

Tangential question: it makes sense to me that a client would be a reference type, since you want to be able to share all of its underlying resources between objects that reference it. But why make databases and collections reference types, too? Would doing so mean that copying them ends up deep copying everything they store references to (such clients)?

I'm comparing this to Swift where if you copy a value type - say MongoDatabase which has a property that is a reference type - say MongoClient - the copy of the database will automatically refer to the same client as the original database referred to.

@saghm
Copy link
Contributor

saghm commented Oct 16, 2020

You raise a good point here; I don't think we technically need to make the Database and Collection reference counted right now since the only mutable data they have (the internals of the underlying client) are already reference counted, so it doesn't actually make any semantic difference from the users perspective. This is all internal implementation right now though, so the only real effect of making Database and Collection reference counted right now is that it causes the small amount of data specific to the Database and Collection types (their names, read preferences, and read/write concerns) allocated on the heap rather than the stack.

@kmahar
Copy link
Contributor

kmahar commented Oct 16, 2020

Gotcha, thanks for the explanation!

Copy link
Contributor Author

@patrickfreed patrickfreed left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So what I understand so far is that (I think) there is a 1:1 pool-worker relationship. Is there also a single manager and a single requester per pool? I thought so from the overview, but then this sentence confused me, because it sounds like there are multiple requesters which a single worker is paying attention to. Also, at what point is a requester dropped?

This is partially explained by the cloning discussion above. While there is a 1:1 relationship between requesters and ConnectionPools, there can be many requesters for a single worker because each instance of ConnectionPool has one and can be cloned. A requester is dropped when the ConnectionPool instance that owns it is dropped. ConnectionPools also contain managers, so there can be many managers per worker, and since checked out Connections also have managers, there isn't a 1:1 relationship between pools and managers for a given worker.

src/cmap/connection_requester.rs Outdated Show resolved Hide resolved
src/cmap/mod.rs Outdated Show resolved Hide resolved
src/cmap/worker.rs Outdated Show resolved Hide resolved
src/runtime/join_handle.rs Show resolved Hide resolved
src/cmap/manager.rs Outdated Show resolved Hide resolved
src/cmap/test/mod.rs Outdated Show resolved Hide resolved
src/cmap/worker.rs Outdated Show resolved Hide resolved
@saghm
Copy link
Contributor

saghm commented Oct 21, 2020

The lint failures are likely due to the new minor version of Rust that came out since you started this PR; Isabel merged a PR to fix them last week, so if you rebase with master, I think they should go away

@patrickfreed patrickfreed force-pushed the RUST-556/maxConnecting branch from dbe3ad5 to 45eda2c Compare October 21, 2020 17:42
async fn execute(mut self) {
let mut maintenance_interval = RUNTIME.interval(Duration::from_millis(500));

loop {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I moved the unfinished request cache to the receiver itself, which made this loop/select a little simpler.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome, this does make it a lot easier to follow!

@patrickfreed
Copy link
Contributor Author

Just rebased, hopefully that fixes them.

@patrickfreed patrickfreed force-pushed the RUST-556/maxConnecting branch from 45eda2c to 09b1fc8 Compare October 21, 2020 18:42
Copy link
Contributor

@saghm saghm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All the questions I had about this have been answered, and the fully green build is super nice, so I think this is ready for my LGTM!

@kmahar
Copy link
Contributor

kmahar commented Nov 2, 2020

@patrickfreed Thanks for the explanation, that makes sense. Design SGTM then.

@patrickfreed patrickfreed merged commit 89a8e9c into mongodb:master Nov 2, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants