
refactor(page_service): Timeline gate guard holding + cancellation + shutdown #8339

Merged (51 commits, Jul 31, 2024)

Conversation

problame
Contributor

@problame problame commented Jul 10, 2024

Since the introduction of sharding, the protocol handling loop in handle_pagerequests can no longer know which concrete Tenant/Timeline object an incoming PagestreamFeMessage resolves to.
In fact, one message might resolve to one Tenant/Timeline while
the next resolves to another.

To avoid going to the tenant manager for every message, we added shard_timelines, an ever-growing cache that held timeline gate guards open for the lifetime of the connection.
Because those gate guards stayed open, we had to check every cached Timeline::cancel on each interaction with the network connection, so that Timeline shutdown would not be stuck waiting on the connection.

We can do better than that, gaining both efficiency and a cleaner abstraction.
I proposed a sketch for this in #8286, and this PR implements an evolution of that sketch.

The main idea is that mod page_service shall be solely concerned with the following:

  1. receiving requests by speaking the protocol / pagestream subprotocol
  2. dispatching the request to a corresponding method on the correct shard/Timeline object
  3. sending response by speaking the protocol / pagestream subprotocol.
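As a rough illustration of that three-step shape (receive, dispatch, respond), here is a std-only Rust sketch. The types and method names below are simplified stand-ins, not the pageserver's real definitions:

```rust
// Hypothetical, std-only sketch of the three steps; the real pageserver
// types (PagestreamFeMessage, Timeline) are far richer than these stand-ins.
use std::collections::HashMap;

enum PagestreamFeMessage {
    GetPage { shard: u8, key: u64 },
}

struct Timeline {
    id: u8,
}

impl Timeline {
    // Stand-in for the real per-shard request method.
    fn get_page(&self, key: u64) -> Vec<u8> {
        vec![self.id, key as u8]
    }
}

// 1. requests arrive already decoded from the pagestream subprotocol,
// 2. each one is dispatched to the shard/Timeline it resolves to,
// 3. the returned bytes stand in for the encoded pagestream response.
fn handle_pagerequests(
    msgs: Vec<PagestreamFeMessage>,
    shards: &HashMap<u8, Timeline>,
) -> Vec<Vec<u8>> {
    msgs.into_iter()
        .map(|msg| match msg {
            PagestreamFeMessage::GetPage { shard, key } => {
                // Resolution happens per message: the next message may
                // target a different shard than this one.
                let timeline = shards.get(&shard).expect("shard resolves");
                timeline.get_page(key)
            }
        })
        .collect()
}
```

The point of the sketch is that resolution is per message, which is exactly why a per-connection cache of Tenant/Timeline objects is the wrong unit.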

The cancellation sensitivity responsibilities are clear cut:

  • while in page_service code, sensitivity to page_service cancellation is sufficient
  • while in Timeline code, sensitivity to Timeline::cancel is sufficient
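A minimal sketch of that split, using a std AtomicBool as a stand-in for the real cancellation tokens (the names here are illustrative, not the actual API):

```rust
// Illustrative only: a std AtomicBool standing in for real cancellation tokens.
use std::sync::atomic::{AtomicBool, Ordering};

struct CancellationToken(AtomicBool);

impl CancellationToken {
    fn new() -> Self {
        CancellationToken(AtomicBool::new(false))
    }
    fn cancel(&self) {
        self.0.store(true, Ordering::SeqCst);
    }
    fn is_cancelled(&self) -> bool {
        self.0.load(Ordering::SeqCst)
    }
}

// page_service only watches its own (connection-level) token; once the
// request is dispatched, the Timeline method is responsible for watching
// Timeline::cancel itself.
fn serve_one_request(conn_cancel: &CancellationToken) -> Result<&'static str, &'static str> {
    if conn_cancel.is_cancelled() {
        return Err("connection shutting down");
    }
    Ok("dispatched; Timeline code checks Timeline::cancel internally")
}
```

With this split, page_service no longer needs to enumerate every cached Timeline's cancel token on each network interaction.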

To enforce these responsibilities, we introduce the notion of a timeline::handle::Handle to a Timeline object that is checked out from a timeline::handle::Cache for each request.
The Handle derefs to Timeline and is supposed to be used for a single async method invocation on Timeline.
See the lengthy doc comment in mod handle for details of the design.
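To convey the checkout-and-deref idea, here is a heavily simplified, hypothetical sketch: a Weak reference serves as the fast path, and a resolver closure stands in for going back to the tenant manager on the slow path. None of these names are the real timeline::handle API:

```rust
// Heavily simplified, hypothetical sketch of checkout-and-deref; the real
// timeline::handle module adds gate guards and shutdown coordination.
use std::ops::Deref;
use std::sync::{Arc, Weak};

struct Timeline;

impl Timeline {
    fn get_page(&self) -> &'static str {
        "page"
    }
}

// Checked out for a single request; derefs to Timeline so request code
// can call Timeline methods directly.
struct Handle(Arc<Timeline>);

impl Deref for Handle {
    type Target = Timeline;
    fn deref(&self) -> &Timeline {
        &self.0
    }
}

struct Cache {
    cached: Weak<Timeline>,
}

impl Cache {
    fn get(&mut self, resolve: impl FnOnce() -> Arc<Timeline>) -> Handle {
        match self.cached.upgrade() {
            // Fast path: the previously resolved Timeline is still usable.
            Some(timeline) => Handle(timeline),
            // Slow path: re-resolve (in the real design, via the tenant
            // manager, entering the timeline's gate).
            None => {
                let timeline = resolve();
                self.cached = Arc::downgrade(&timeline);
                Handle(timeline)
            }
        }
    }
}
```

Because the cache holds only a weak reference in this sketch, dropping the checked-out Handle is what releases the Timeline; the real module's doc comment describes how this interacts with gates and shutdown.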

The remaining use of the `Tenant` object is to check `tenant.cancel`.
That check is incorrect [if the pageserver hosts multiple shards](#7427 (comment)).
I'll fix that in a future PR where I completely eliminate the holding
of `Tenant/Timeline` objects across requests.
See [my code RFC](#8286) for the
high level idea.

github-actions bot commented Jul 11, 2024

3150 tests run: 3029 passed, 0 failed, 121 skipped (full report)


Code coverage* (full report)

  • functions: 32.7% (7069 of 21609 functions)
  • lines: 50.1% (56430 of 112548 lines)

* collected from Rust tests only


This comment is automatically updated with the latest test results (38b0f3c at 2024-07-31T14:17:39.538Z).

problame added a commit that referenced this pull request Jul 15, 2024
This operation isn't used in practice, so let's remove it.

Context: in #8339
@problame problame changed the title refactor(page_service): decouple from Mgr/Tenant/Timeline lifecycle refactor(page_service): Timeline gate guard holding + cancellation + shutdown Jul 29, 2024
@problame problame marked this pull request as ready for review July 29, 2024 13:39
@problame problame requested a review from a team as a code owner July 29, 2024 13:39
@problame problame requested a review from jcsp July 29, 2024 13:39
Collaborator

@jcsp jcsp left a comment


This approach makes sense.

It's quite verbose, but that's justified by unit testing & the general encapsulation of the cache/handle concept in handle.rs

Just one request for change: let's not decrease the active tenant timeout in this PR.

@problame problame merged commit 4e3b70e into main Jul 31, 2024
65 checks passed
@problame problame deleted the problame/slow-detach-fix branch July 31, 2024 15:05
arpad-m pushed a commit that referenced this pull request Aug 5, 2024
skyzh added a commit that referenced this pull request Aug 7, 2024
koivunej added a commit that referenced this pull request Aug 7, 2024
We've noticed increased memory usage with the latest release. Drain the
joinset of `page_service` connection handlers to avoid leaking them
until shutdown. An alternative would be to use a TaskTracker.
TaskTracker was not discussed in the original PR #8339 review, so we are not hot-fixing it here either.
arpad-m pushed a commit that referenced this pull request Aug 7, 2024
problame added a commit that referenced this pull request Aug 20, 2024
… to dis-incentivize global tasks via task_mgr in the future

(As of #8339 all remaining
task_mgr usage is tenant or timeline scoped.)