-
Notifications
You must be signed in to change notification settings - Fork 445
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
refactor(page_service): Timeline gate guard holding + cancellation + shutdown #8339
Merged
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
The remaining use of the `Tenant` object is to check `tenant.cancel`. That check is incorrect [if the pageserver hosts multiple shards](#7427 (comment)). I'll fix that in a future PR where I completely eliminate the holding of `Tenant/Timeline` objects across requests. See [my code RFC](#8286) for the high level idea.
problame
force-pushed
the
problame/slow-detach-fix
branch
from
July 11, 2024 14:46
8bdea5c
to
5f78c8f
Compare
3150 tests run: 3029 passed, 0 failed, 121 skipped (full report)Code coverage* (full report)
* collected from Rust tests only The comment gets automatically updated with the latest test results
38b0f3c at 2024-07-31T14:17:39.538Z :recycle: |
problame
force-pushed
the
problame/slow-detach-fix
branch
from
July 12, 2024 18:56
ddfb450
to
e157ea8
Compare
…ler::process_query" This reverts commit fef082b. Actually, we can just store the cancellation token in PageServiceHandler
problame
force-pushed
the
problame/slow-detach-fix
branch
from
July 12, 2024 19:01
e157ea8
to
1cfe9d8
Compare
problame
added a commit
that referenced
this pull request
Jul 15, 2024
This operation isn't used in practice, so let's remove it. Context: in #8339
Trouble with the sync mutex now :( Async mutex doesn't work because we need to lock from destructors.
problame
changed the title
refactor(page_service): decouple from Mgr/Tenant/Timeline lifecycle
refactor(page_service): Timeline gate guard holding + cancellation + shutdown
Jul 29, 2024
jcsp
reviewed
Jul 31, 2024
jcsp
approved these changes
Jul 31, 2024
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This approach makes sense.
It's quite verbose, but that's justified by unit testing & the general encapsulation of the cache/handle concept in handle.rs
Just one request for change: let's not decrease the active tenant timeout in this PR.
arpad-m
pushed a commit
that referenced
this pull request
Aug 5, 2024
…shutdown (#8339) Since the introduction of sharding, the protocol handling loop in `handle_pagerequests` cannot know anymore which concrete `Tenant`/`Timeline` object any of the incoming `PagestreamFeMessage` resolves to. In fact, one message might resolve to one `Tenant`/`Timeline` while the next one may resolve to another one. To avoid going to tenant manager, we added the `shard_timelines` which acted as an ever-growing cache that held timeline gate guards open for the lifetime of the connection. The consequence of holding the gate guards open was that we had to be sensitive to every cached `Timeline::cancel` on each interaction with the network connection, so that Timeline shutdown would not have to wait for network connection interaction. We can do better than that, meaning more efficiency & better abstraction. I proposed a sketch for it in * #8286 and this PR implements an evolution of that sketch. The main idea is is that `mod page_service` shall be solely concerned with the following: 1. receiving requests by speaking the protocol / pagestream subprotocol 2. dispatching the request to a corresponding method on the correct shard/`Timeline` object 3. sending response by speaking the protocol / pagestream subprotocol. The cancellation sensitivity responsibilities are clear cut: * while in `page_service` code, sensitivity to page_service cancellation is sufficient * while in `Timeline` code, sensitivity to `Timeline::cancel` is sufficient To enforce these responsibilities, we introduce the notion of a `timeline::handle::Handle` to a `Timeline` object that is checked out from a `timeline::handle::Cache` for **each request**. The `Handle` derefs to `Timeline` and is supposed to be used for a single async method invocation on `Timeline`. See the lengthy doc comment in `mod handle` for details of the design.
koivunej
reviewed
Aug 7, 2024
koivunej
reviewed
Aug 7, 2024
koivunej
reviewed
Aug 7, 2024
koivunej
added a commit
that referenced
this pull request
Aug 7, 2024
We've noticed increased memory usage with the latest release. Drain the joinset of `page_service` connection handlers to avoid leaking them until shutdown. An alternative would be to use a TaskTracker. TaskTracker was not discussed in original PR #8339 review, so not hot fixing it in here either.
arpad-m
pushed a commit
that referenced
this pull request
Aug 7, 2024
We've noticed increased memory usage with the latest release. Drain the joinset of `page_service` connection handlers to avoid leaking them until shutdown. An alternative would be to use a TaskTracker. TaskTracker was not discussed in original PR #8339 review, so not hot fixing it in here either.
problame
added a commit
that referenced
this pull request
Aug 20, 2024
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Since the introduction of sharding, the protocol handling loop in
handle_pagerequests
cannot know anymore which concreteTenant
/Timeline
object any of the incomingPagestreamFeMessage
resolves to.In fact, one message might resolve to one
Tenant
/Timeline
whilethe next one may resolve to another one.
To avoid going to tenant manager, we added the
shard_timelines
which acted as an ever-growing cache that held timeline gate guards open for the lifetime of the connection.The consequence of holding the gate guards open was that we had to be sensitive to every cached
Timeline::cancel
on each interaction with the network connection, so that Timeline shutdown would not have to wait for network connection interaction.We can do better than that, meaning more efficiency & better abstraction.
I proposed a sketch for it in
and this PR implements an evolution of that sketch.
The main idea is is that
mod page_service
shall be solely concerned with the following:Timeline
objectThe cancellation sensitivity responsibilities are clear cut:
page_service
code, sensitivity to page_service cancellation is sufficientTimeline
code, sensitivity toTimeline::cancel
is sufficientTo enforce these responsibilities, we introduce the notion of a
timeline::handle::Handle
to aTimeline
object that is checked out from atimeline::handle::Cache
for each request.The
Handle
derefs toTimeline
and is supposed to be used for a single async method invocation onTimeline
.See the lengthy doc comment in
mod handle
for details of the design.