pageserver: stuck detach operation #7427

Closed
20 of 23 tasks
problame opened this issue Apr 18, 2024 · 12 comments
Labels
c/storage/pageserver (Component: storage: pageserver) · p/high (High priority: use for bugs that need prompt attention, such as crashes or possible corruptions) · t/bug (Issue Type: Bug) · t/on_call_followup · triaged (bugs that were already triaged)

Comments


problame commented Apr 18, 2024

Stuck `/location_config` operation while transitioning from attached to secondary (i.e. `Tenant::shutdown`), clearly a bug in the pageserver. We mitigated by restarting the pageserver, but we should debug this.

https://neondb.slack.com/archives/C03H1K0PGKH/p1713453288707669?thread_ts=1713444234.032949&cid=C03H1K0PGKH

This occurred during a `/location_config` call transitioning a pageserver from attached to secondary.

Root Cause Analysis

#7427 (comment)

tl;dr: no direct smoking gun, but the symptoms so far point to `page_service` keeping the gate open for too long.

Fixing It

Refactor `page_service` so that the gate guard lifetime is easy to understand and obviously bounded to the requests that are in flight at the time of timeline shutdown.

Sketch: #8286
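To make the intent concrete, here is a minimal sketch (with a hypothetical, simplified `Gate` type, not the actual pageserver `utils::sync::gate` API): the guard is acquired per request and dropped when the request returns, so closing the gate at shutdown only waits for requests that are actually in flight.

```rust
use tokio::sync::{Semaphore, SemaphorePermit};

/// Hypothetical, simplified stand-in for the timeline gate: `enter()` hands out
/// a guard while the gate is open, `close()` waits for all guards to be dropped.
struct Gate {
    sem: Semaphore,
    max: u32,
}

impl Gate {
    fn new(max: u32) -> Self {
        Self { sem: Semaphore::new(max as usize), max }
    }
    fn enter(&self) -> Option<SemaphorePermit<'_>> {
        // `None` means the gate has already been closed.
        self.sem.try_acquire().ok()
    }
    async fn close(&self) {
        // Closing means acquiring every permit, i.e. waiting until all
        // outstanding guards have been dropped, then refusing new entries.
        let _all = self.sem.acquire_many(self.max).await.expect("closed twice");
        self.sem.close();
    }
}

async fn handle_one_request(gate: &Gate, req: u32) {
    // The guard is scoped to this request only; it is dropped when the function
    // returns and is never stashed in per-connection state that outlives it.
    let Some(_guard) = gate.enter() else {
        println!("request {req}: timeline shutting down, rejecting");
        return;
    };
    println!("request {req}: served");
}

#[tokio::main]
async fn main() {
    let gate = Gate::new(100);
    handle_one_request(&gate, 1).await;
    handle_one_request(&gate, 2).await;
    // Shutdown only has to wait for requests that are still in flight.
    gate.close().await;
    println!("gate closed, safe to finish timeline shutdown");
}
```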

Tasks

Follow-Ups

@problame added the t/bug, c/storage/pageserver, and t/on_call_followup labels Apr 18, 2024
problame commented:

Happened again during that migration.


jcsp commented Apr 18, 2024

Suspected to be a regression from recent changes around shutdown: #7233

@jcsp changed the title from "stuck detach operation" to "pageserver: stuck detach operation" Apr 18, 2024
@jcsp added the triaged label May 2, 2024

jcsp commented May 2, 2024

Next steps: look carefully at timeline shutdown code.


jcsp commented Jun 20, 2024

We have plenty of examples of `kept the gate from closing` messages in the field.

We see them in `handle_get_page_at_lsn_request` and in `initial_size_calculation{...}:logical_size_calculation_task:get_or_maybe_download`.

Those are signals that something in those code paths is not promptly respecting a cancellation token. However, they may not be responsible for the actual hangs: that message is printed when a gate guard eventually releases the gate, not when it holds the gate forever.
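For reference, the shape those code paths are expected to have is roughly the following (a minimal sketch using `tokio_util::sync::CancellationToken`; `download_layer` is a made-up stand-in for the slow step): any long await performed while a gate guard is held is raced against the cancellation token, so the guard is released promptly once shutdown starts.

```rust
use std::time::Duration;
use tokio_util::sync::CancellationToken;

/// Made-up stand-in for a slow step such as an on-demand layer download.
async fn download_layer() -> anyhow::Result<()> {
    tokio::time::sleep(Duration::from_secs(3600)).await;
    Ok(())
}

/// Every long await is raced against the cancellation token, so the enclosing
/// gate guard is dropped promptly once shutdown begins.
async fn get_or_maybe_download(cancel: &CancellationToken) -> anyhow::Result<()> {
    tokio::select! {
        res = download_layer() => res,
        _ = cancel.cancelled() => anyhow::bail!("cancelled, shutting down"),
    }
}

#[tokio::main]
async fn main() {
    let cancel = CancellationToken::new();
    let child = cancel.child_token();
    let task = tokio::spawn(async move { get_or_maybe_download(&child).await });

    // Simulated shutdown: the task returns promptly instead of holding its
    // gate guard until the hour-long download completes.
    cancel.cancel();
    println!("{:?}", task.await.unwrap());
}
```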

@jcsp added the p/high label Jun 20, 2024

jcsp commented Jun 28, 2024

Let's invest some time next week in looking for issues that might have been introduced in #7233 -- I think a couple of hours of quality time staring at the code might yield results.

This week I spent some time writing a test (https://github.com/neondatabase/neon/tree/jcsp/shutdown-under-load-test) to try to reproduce shutdown hangs, without any success. This isn't entirely surprising: in the field we see roughly one hang per ~10,000 migrations, so it's clearly something pretty niche.

problame added a commit that referenced this issue Jul 2, 2024
Generally counter pairs are preferred over gauges.
In this case, I found myself asking what the typical rate of accepted
page_service connections on a pageserver is, and I couldn't answer it
with the gauge metric.

There are a few dashboards using this metric:
https://github.com/search?q=repo%3Aneondatabase%2Fgrafana-dashboard-export%20pageserver_live_connections&type=code

I'll convert them to use the new metric once this PR reaches prod.

refs #7427
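For illustration, here is the difference in a minimal sketch using the `prometheus` crate (the metric names are made up, not the ones added in the PR): a started/finished counter pair answers both "how many connections are live" and "what is the accept rate", while a gauge only answers the former.

```rust
use prometheus::{register_int_counter, register_int_gauge, IntCounter, IntGauge};

fn main() {
    // Gauge: only answers "how many connections are live right now".
    let live: IntGauge = register_int_gauge!(
        "example_live_connections",
        "live page_service connections"
    )
    .unwrap();

    // Counter pair: rate(started) gives the accept rate over time, and
    // started - finished still recovers the live count.
    let started: IntCounter = register_int_counter!(
        "example_connections_started_total",
        "page_service connections accepted"
    )
    .unwrap();
    let finished: IntCounter = register_int_counter!(
        "example_connections_finished_total",
        "page_service connections finished"
    )
    .unwrap();

    // On accept:
    live.inc();
    started.inc();
    // On connection close:
    live.dec();
    finished.inc();

    assert_eq!(started.get() - finished.get(), live.get() as u64);
}
```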
problame added a commit that referenced this issue Jul 2, 2024
Before this PR, during timeline shutdown, we'd occasionally see
log lines like this one:

```
2024-06-26T18:28:11.063402Z  INFO initial_size_calculation{tenant_id=$TENANT,shard_id=0000 timeline_id=$TIMELINE}:logical_size_calculation_task:get_or_maybe_download{layer=000000000000000000000000000000000000-000000067F0001A3950001C1630100000000__0000000D88265898}: layer file download failed, and caller has been cancelled: Cancelled, shutting down
Stack backtrace:
   0: <core::result::Result<T,F> as core::ops::try_trait::FromResidual<core::result::Result<core::convert::Infallible,E>>>::from_residual
             at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/core/src/result.rs:1964:27
      pageserver::tenant::remote_timeline_client::RemoteTimelineClient::download_layer_file::{{closure}}
             at /home/nonroot/pageserver/src/tenant/remote_timeline_client.rs:531:13
      pageserver::tenant::storage_layer::layer::LayerInner::download_and_init::{{closure}}
             at /home/nonroot/pageserver/src/tenant/storage_layer/layer.rs:1136:14
      pageserver::tenant::storage_layer::layer::LayerInner::download_init_and_wait::{{closure}}::{{closure}}
             at /home/nonroot/pageserver/src/tenant/storage_layer/layer.rs:1082:74
```

We can eliminate the anyhow backtrace with no loss of information
because the conversion to anyhow::Error happens in exactly one place.

refs #7427
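For background on the mechanics involved (a hedged sketch of how `anyhow` formats errors, not the PR's actual diff): the backtrace shows up because the error is printed with `{:?}` (Debug), while `{:#}` prints the full context chain without a backtrace; when the typed-to-`anyhow::Error` conversion happens in exactly one place, that chain already pinpoints the origin.

```rust
use anyhow::Context;

// Assumed example: a single call site that turns a typed error (here, an
// io::Error) into anyhow::Error with context attached.
fn download_layer_file() -> anyhow::Result<Vec<u8>> {
    std::fs::read("layer-file-that-does-not-exist")
        .context("layer file download failed")
}

fn main() {
    if let Err(e) = download_layer_file() {
        // Debug formatting: includes a captured backtrace when one exists.
        eprintln!("debug: {e:?}");
        // Alternate display formatting: full context chain, no backtrace.
        eprintln!("display: {e:#}");
    }
}
```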
problame added a commit that referenced this issue Jul 5, 2024
…ncellationToken

Preliminary refactoring while working on #7427
and specifically #8286
problame added a commit that referenced this issue Jul 9, 2024
…ncellationToken (#8295)

Preliminary refactoring while working on
#7427
and specifically #8286
problame added a commit that referenced this issue Jul 9, 2024
…to mgmt API (#8292)

I want to fix bugs in `page_service`
([issue](#7427)), and the `import basebackup` / `import wal` commands
stand in the way and make the refactoring more complicated.

We don't use these methods in practice anyway, but there have been some
objections to removing the functionality completely.

So, this PR preserves the existing functionality but moves it into the
HTTP management API.

Note that I don't try to fix existing bugs in the code, specifically not
fixing
* it only ever worked correctly for unsharded tenants
* it doesn't clean up on error

All errors are mapped to `ApiError::InternalServerError`.
problame commented:

After lengthy investigation, I have no definitive smoking gun, but the hottest lead based on log message correlation is that `HandlerTimeline` holds the `Timeline::gate` open.

More details here (internal doc): https://www.notion.so/neondatabase/2024-07-01-stuck-detach-investigation-f0d4ae0247b347ab9bff95355fce1b25?pvs=4#c93179d978bc41d1a47f4b3821b12b81
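Roughly, the suspected pattern looks like the following sketch (hypothetical simplified types, not the actual pageserver code): the per-connection cache stores a handle that owns a gate guard, so the guard lives for the lifetime of the connection rather than the request, and an idle connection keeps `Timeline::gate` from closing.

```rust
use std::sync::Arc;
use tokio::sync::{OwnedSemaphorePermit, Semaphore};

struct Timeline {
    gate: Arc<Semaphore>, // stand-in for the real timeline gate
}

/// Stand-in for `HandlerTimeline`: a cached timeline handle plus the guard
/// that keeps the timeline's gate open.
#[allow(dead_code)]
struct HandlerTimeline {
    timeline: Arc<Timeline>,
    _gate_guard: OwnedSemaphorePermit,
}

struct PageServiceConnection {
    // Cached across requests for the lifetime of the connection: as long as
    // this is populated, the gate cannot close and detach cannot finish.
    cached: Option<HandlerTimeline>,
}

impl PageServiceConnection {
    async fn get_handle(&mut self, timeline: &Arc<Timeline>) -> &HandlerTimeline {
        if self.cached.is_none() {
            let guard = timeline.gate.clone().acquire_owned().await.unwrap();
            self.cached = Some(HandlerTimeline {
                timeline: Arc::clone(timeline),
                _gate_guard: guard,
            });
        }
        self.cached.as_ref().unwrap()
    }
}

#[tokio::main]
async fn main() {
    let timeline = Arc::new(Timeline {
        gate: Arc::new(Semaphore::new(1)),
    });
    let mut conn = PageServiceConnection { cached: None };
    let _handle = conn.get_handle(&timeline).await;

    // Shutdown/detach wants to "close the gate" (take the only permit), but it
    // cannot proceed until the idle connection drops its cached handle.
    assert!(timeline.gate.try_acquire().is_err());
    println!("gate held open by an idle connection's cached handle");
}
```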

problame commented:

Also, the check for the first (as in `ShardSelector::First`) tenant in `handle_pagerequests` is insufficient if a pageserver hosts multiple tenant shards. For example, if we're hosting shard 1 and shard 2, and `::First` resolved to shard 1, but we want to detach shard 2, then the request loop is watching the wrong cancellation token.
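A sketch of the mismatch (hypothetical simplified types; the real resolution goes through the tenant manager): the connection resolves one shard up front and then only watches that shard's cancellation token, so a detach of a different shard on the same pageserver goes unnoticed by the request loop.

```rust
use std::collections::BTreeMap;
use tokio_util::sync::CancellationToken;

struct TenantShard {
    cancel: CancellationToken,
}

fn main() {
    // The pageserver hosts shard 1 and shard 2 of the same tenant.
    let mut shards = BTreeMap::new();
    shards.insert(1u8, TenantShard { cancel: CancellationToken::new() });
    shards.insert(2u8, TenantShard { cancel: CancellationToken::new() });

    // ShardSelector::First-style resolution: pick the lowest shard number.
    let first = shards.values().next().unwrap();

    // Detach shard 2: its token gets cancelled, shard 1's does not.
    shards.get(&2).unwrap().cancel.cancel();

    // The request loop only checks the token it resolved up front, so it keeps
    // serving (and keeps shard 2's gate open) while shard 2 tries to detach.
    assert!(!first.cancel.is_cancelled());
    println!("request loop never observes shard 2's cancellation");
}
```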

problame added a commit that referenced this issue Jul 11, 2024
`trace_read_requests` is a per `Tenant`-object option.
But the `handle_pagerequests` loop doesn't know which
`Tenant` object (i.e., which shard) the request is for.

The remaining use of the `Tenant` object is to check `tenant.cancel`.
That check is incorrect [if the pageserver hosts multiple
shards](#7427 (comment)).
I'll fix that in a future PR where I completely eliminate the holding
of `Tenant/Timeline` objects across requests.
See [my code RFC](#8286) for the high-level idea.

Note that we can always bring back the tracing functionality if we need it.
But since it's actually about logging the `page_service` wire bytes,
it should be a `page_service`-level config option, not per-Tenant.
And for enabling tracing on a single connection, we can implement
a `set pageserver_trace_connection;` option.
problame added a commit that referenced this issue Jul 11, 2024
The remaining use of the `Tenant` object is to check `tenant.cancel`.
That check is incorrect [if the pageserver hosts multiple shards](#7427 (comment)).
I'll fix that in a future PR where I completely eliminate the holding
of `Tenant/Timeline` objects across requests.
See [my code RFC](#8286) for the high-level idea.

problame commented Jul 15, 2024

This week: finish the preliminary refactorings and get the big PR ready for review so we can soak it in staging next week.

problame commented:

This week:

- PR got reviewed and merged on Wednesday
- Next: observe staging, watching for growing connection-count metrics, slow detaches, or slow deletions

Next week:

- PR will roll into prod.
- Should be safe to revert.

Week after:


problame commented Aug 12, 2024

Status update:

This week:

problame commented:

Status update:

problame commented:

Calling victory on this: there were some minor issues exposed by enabling `chaosInterval`, but none related to slow detaches.
