pageserver: stuck detach operation #7427
Comments
Happened again during that migration.
Suspected regression from changes around shutdown recently: #7233
Next steps: look carefully at timeline shutdown code.
We have plenty of examples of this kind of warning, and we see them regularly in production logs. Those are signals that something in those code paths is not promptly respecting a cancellation token. However, they may not be responsible for the actual hangs: that message is printed when a gate guard eventually releases the gate, not if it holds it forever.
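For illustration, here is a minimal stand-alone sketch of the pattern under discussion, using a tokio `Semaphore` as a stand-in for the pageserver's gate; the names, the warning text, and the 10-second threshold are assumptions for the example, not the actual pageserver code:

```rust
use std::{sync::Arc, time::Duration};
use tokio::sync::{Semaphore, SemaphorePermit};
use tokio_util::sync::CancellationToken;

// Permit pool standing in for "any number of concurrent gate guards".
const MAX_GUARDS: u32 = 1_000_000;

// "Closing the gate" = acquiring every permit, i.e. waiting for all guards to drop.
async fn close_gate(gate: &Semaphore) -> SemaphorePermit<'_> {
    match tokio::time::timeout(Duration::from_secs(10), gate.acquire_many(MAX_GUARDS)).await {
        Ok(permits) => permits.expect("gate semaphore is never closed"),
        Err(_elapsed) => {
            // Analogue of the warning discussed above: it is only observed once a slow
            // guard is eventually released; a guard that is never released keeps us
            // waiting here forever, with no further message.
            eprintln!("timeline shutdown: still waiting for gate guards to be released");
            gate.acquire_many(MAX_GUARDS)
                .await
                .expect("gate semaphore is never closed")
        }
    }
}

async fn request_handler(gate: &Semaphore, cancel: CancellationToken) {
    // One gate guard for the duration of this piece of work.
    let _guard = gate.acquire().await.expect("gate semaphore is never closed");
    tokio::select! {
        // Respecting the cancellation token promptly means the guard is dropped as soon
        // as shutdown starts, not whenever the work happens to finish.
        _ = cancel.cancelled() => {}
        _ = tokio::time::sleep(Duration::from_secs(3600)) => {} // stand-in for real work
    }
    // _guard dropped here.
}

#[tokio::main]
async fn main() {
    let gate = Arc::new(Semaphore::new(MAX_GUARDS as usize));
    let cancel = CancellationToken::new();
    let task = tokio::spawn({
        let (gate, cancel) = (gate.clone(), cancel.clone());
        async move { request_handler(&gate, cancel).await }
    });
    cancel.cancel(); // begin shutdown
    task.await.unwrap(); // the handler drops its guard promptly...
    let _closed = close_gate(&gate).await; // ...so this returns without the warning
}
```

The point of the sketch: a handler that selects on the cancellation token releases its guard as soon as shutdown starts; one that ignores the token either produces the long-wait warning eventually, or hangs the gate close forever with no message at all.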
Let's invest some time next week in looking for issues that might have been introduced in #7233; I think a couple of hours of quality time staring at the code might yield results. This week I spent some time writing a test (https://github.com/neondatabase/neon/tree/jcsp/shutdown-under-load-test) to try to reproduce shutdown hangs, without any success. That isn't completely surprising: in the field we see one hang out of every ~10,000 migrations, so it's clearly something pretty niche.
Generally counter pairs are preferred over gauges. In this case, I found myself asking what the typical rate of accepted page_service connections on a pageserver is, and I couldn't answer it with the gauge metric. There are a few dashboards using this metric: https://github.com/search?q=repo%3Aneondatabase%2Fgrafana-dashboard-export%20pageserver_live_connections&type=code I'll convert them to use the new metric once this PR reaches prod. refs #7427
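As a sketch of the counter-pair idea (metric names other than `pageserver_live_connections` are illustrative, and this uses the `prometheus` crate directly rather than the pageserver's own metrics wrappers):

```rust
use prometheus::{register_int_counter, register_int_gauge, IntCounter, IntGauge};

// Metric names below (other than pageserver_live_connections) are illustrative only.
struct PageServiceConnMetrics {
    started: IntCounter,
    finished: IntCounter,
    live: IntGauge, // old-style gauge, shown only for comparison
}

impl PageServiceConnMetrics {
    fn new() -> prometheus::Result<Self> {
        Ok(Self {
            started: register_int_counter!(
                "pageserver_page_service_connections_started_total",
                "page_service connections accepted since process start"
            )?,
            finished: register_int_counter!(
                "pageserver_page_service_connections_finished_total",
                "page_service connections closed since process start"
            )?,
            live: register_int_gauge!(
                "pageserver_live_connections",
                "currently open page_service connections"
            )?,
        })
    }

    fn on_accept(&self) {
        self.started.inc();
        self.live.inc();
    }

    fn on_close(&self) {
        self.finished.inc();
        self.live.dec();
    }
}

// With the counter pair, both questions are answerable in PromQL:
//   live connections:  started_total - finished_total
//   acceptance rate:   rate(started_total[5m])
// The gauge alone can only answer the first.
```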
Before this PR, during timeline shutdown, we'd occasionally see log lines like this one:

```
2024-06-26T18:28:11.063402Z INFO initial_size_calculation{tenant_id=$TENANT,shard_id=0000 timeline_id=$TIMELINE}:logical_size_calculation_task:get_or_maybe_download{layer=000000000000000000000000000000000000-000000067F0001A3950001C1630100000000__0000000D88265898}: layer file download failed, and caller has been cancelled: Cancelled, shutting down
Stack backtrace:
   0: <core::result::Result<T,F> as core::ops::try_trait::FromResidual<core::result::Result<core::convert::Infallible,E>>>::from_residual
             at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/core/src/result.rs:1964:27
      pageserver::tenant::remote_timeline_client::RemoteTimelineClient::download_layer_file::{{closure}}
             at /home/nonroot/pageserver/src/tenant/remote_timeline_client.rs:531:13
      pageserver::tenant::storage_layer::layer::LayerInner::download_and_init::{{closure}}
             at /home/nonroot/pageserver/src/tenant/storage_layer/layer.rs:1136:14
      pageserver::tenant::storage_layer::layer::LayerInner::download_init_and_wait::{{closure}}::{{closure}}
             at /home/nonroot/pageserver/src/tenant/storage_layer/layer.rs:1082:74
```

We can eliminate the anyhow backtrace with no loss of information, because the conversion to anyhow::Error happens in exactly one place. refs #7427
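A schematic example of the change described; the error type, call sites, and log message handling here are simplified stand-ins for the real download path, not the actual pageserver code:

```rust
// Simplified stand-in for the typed error used along the download path.
#[derive(Debug, thiserror::Error)]
enum DownloadError {
    #[error("Cancelled, shutting down")]
    Cancelled,
    #[error("download failed: {0}")]
    Other(#[from] std::io::Error),
}

// Before: `?` converted the typed error into anyhow::Error deep inside the call chain,
// so even the expected "cancelled during shutdown" case logged a full backtrace.
// After: keep the typed error through the chain and convert to anyhow::Error in exactly
// one place, where the expected cancellation case can be logged without a backtrace.
fn handle(result: Result<(), DownloadError>) -> anyhow::Result<()> {
    match result {
        Ok(()) => Ok(()),
        Err(DownloadError::Cancelled) => {
            tracing::info!("layer file download failed, and caller has been cancelled");
            Ok(())
        }
        Err(other) => Err(anyhow::Error::new(other).context("download layer file")),
    }
}

fn main() {
    // Expected-cancellation path: no backtrace, just an info-level line
    // (visible if a tracing subscriber is installed).
    handle(Err(DownloadError::Cancelled)).unwrap();
}
```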
…to mgmt API (#8292) I want to fix bugs in `page_service` ([issue](#7427)), and the `import basebackup` / `import wal` code paths stand in the way and make the refactoring more complicated. We don't use these methods in practice anyway, but there have been some objections to removing the functionality completely. So, this PR preserves the existing functionality but moves it into the HTTP management API. Note that I don't try to fix existing bugs in the code; specifically, I am not fixing that:
- it only ever worked correctly for unsharded tenants
- it doesn't clean up on error

All errors are mapped to `ApiError::InternalServerError`.
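Roughly the shape such a management-API endpoint could take, sketched with axum purely for illustration; the route path, handler signature, and placeholder import function are assumptions, and the pageserver's actual HTTP stack and `ApiError` type differ:

```rust
use axum::{body::Bytes, extract::Path, http::StatusCode, routing::put, Router};

// Hypothetical handler: accepts the basebackup bytes over the management API.
async fn import_basebackup(
    Path((tenant_id, timeline_id)): Path<(String, String)>,
    body: Bytes,
) -> Result<StatusCode, (StatusCode, String)> {
    do_import(&tenant_id, &timeline_id, &body)
        .await
        .map(|()| StatusCode::OK)
        // As in the PR description: every failure is reported as an internal server error.
        .map_err(|e| (StatusCode::INTERNAL_SERVER_ERROR, format!("{e:#}")))
}

async fn do_import(_tenant: &str, _timeline: &str, _data: &[u8]) -> anyhow::Result<()> {
    Ok(()) // placeholder for the actual import logic
}

fn router() -> Router {
    Router::new().route(
        "/v1/tenant/:tenant_id/timeline/:timeline_id/import_basebackup",
        put(import_basebackup),
    )
}
```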
After lengthy investigation, I have no definitive smoking gun, but the hottest lead based on log message correlation is that page_service keeps the timeline gate open for too long. More details here (internal doc): https://www.notion.so/neondatabase/2024-07-01-stuck-detach-investigation-f0d4ae0247b347ab9bff95355fce1b25?pvs=4#c93179d978bc41d1a47f4b3821b12b81
Also, the check for the first (as in first-resolved) `Tenant`'s cancellation token is insufficient when the pageserver hosts multiple shards; see the next comment.
`trace_read_requests` is a per-`Tenant`-object option, but the `handle_pagerequests` loop doesn't know which `Tenant` object (i.e., which shard) the request is for. The remaining use of the `Tenant` object is to check `tenant.cancel`. That check is incorrect [if the pageserver hosts multiple shards](#7427 (comment)). I'll fix that in a future PR where I completely eliminate the holding of `Tenant`/`Timeline` objects across requests. See [my code RFC](#8286) for the high-level idea. Note that we can always bring the tracing functionality back if we need it. But since it's actually about logging the `page_service` wire bytes, it should be a `page_service`-level config option, not per-Tenant. And for enabling tracing on a single connection, we could implement a `set pageserver_trace_connection;` option.
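A small self-contained illustration of the incorrect check and the per-request alternative; all types here are stand-ins, not the real `Tenant`/shard types:

```rust
use tokio_util::sync::CancellationToken;

// Stand-in types for a shard and an incoming page request.
struct Shard { cancel: CancellationToken }
struct Request { key: u64 }

#[derive(Debug)]
enum QueryError { Shutdown }

fn resolve_shard_for<'a>(shards: &'a [Shard], req: &Request) -> &'a Shard {
    // Toy routing: key modulo shard count.
    &shards[(req.key as usize) % shards.len()]
}

fn check_cancel(connection_tenant: &Shard, shards: &[Shard], req: &Request) -> Result<(), QueryError> {
    // Bug shape described above: only the shard captured at connection setup is checked,
    // so a detach of a different shard on the same pageserver goes unnoticed and the
    // request keeps running, holding its gate guard.
    if connection_tenant.cancel.is_cancelled() {
        return Err(QueryError::Shutdown);
    }

    // Fix direction (#8286): resolve the shard per request and check its token instead
    // of anything scoped to the connection.
    let shard = resolve_shard_for(shards, req);
    if shard.cancel.is_cancelled() {
        return Err(QueryError::Shutdown);
    }
    Ok(())
}

fn main() {
    let shards = vec![
        Shard { cancel: CancellationToken::new() },
        Shard { cancel: CancellationToken::new() },
    ];
    shards[1].cancel.cancel(); // shard 1 is being detached
    // The connection captured shard 0; a request routed to shard 1 is only
    // rejected by the per-request check, not by the connection-scoped one.
    assert!(check_cancel(&shards[0], &shards, &Request { key: 1 }).is_err());
}
```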
This week: finish preliminary refactorings and get the big PR ready for review, so we can soak it in staging next week.
Calling victory on this; there were some minor issues exposed by the …
Stuck /location_config operation while transitioning from attached to secondary (i.e. Tenant::shutdown), clearly a bug in pageserver. We mitigated by restarting the pageserver, but we should debug this.
https://neondb.slack.com/archives/C03H1K0PGKH/p1713453288707669?thread_ts=1713444234.032949&cid=C03H1K0PGKH
This occurred during a /location_config call transitioning a pageserver from attached to secondary.
Root Cause Analysis
#7427 (comment)
tl;dr: no direct smoking gun, but symptoms so far point to page_service keeping the gate open for too long.
Fixing It
Refactor page_service so that gate guard lifetime is easy to understand and obviously bounded to the requests that are ongoing at the time of timeline shutdown.
Sketch: #8286
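A compact sketch of what "guard lifetime bounded to the ongoing request" means in practice, reusing the Semaphore-as-gate stand-in from the earlier sketch; names are illustrative, not the real page_service code:

```rust
use std::{sync::Arc, time::Duration};
use tokio::sync::{mpsc, Semaphore};

const MAX_GUARDS: u32 = 1_000_000;

async fn page_service_connection(gate: Arc<Semaphore>, mut requests: mpsc::Receiver<u64>) {
    // Guard lifetime bounded to a single request: the connection holds no guard while
    // it sits idle waiting for the next message, so an idle connection can never stall
    // timeline shutdown.
    while let Some(req) = requests.recv().await {
        let _guard = gate.acquire().await.expect("gate semaphore is never closed");
        handle_request(req).await;
        // _guard dropped at the end of each loop iteration.
    }
}

async fn handle_request(_req: u64) {
    tokio::time::sleep(Duration::from_millis(1)).await; // stand-in for real work
}

#[tokio::main]
async fn main() {
    let gate = Arc::new(Semaphore::new(MAX_GUARDS as usize));
    let (tx, rx) = mpsc::channel(8);
    let conn = tokio::spawn(page_service_connection(gate.clone(), rx));
    tx.send(42).await.unwrap();
    drop(tx); // connection drains its queue and exits
    conn.await.unwrap();
    // Closing the gate (acquiring all permits) now succeeds immediately.
    let _closed = gate.acquire_many(MAX_GUARDS).await.unwrap();
}
```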
Tasks
- `page_service`'s `import basebackup` / `import wal` to mgmt API (#8292)
- `trace_read_requests` (#8338)
- `get_last_record_rlsn` (#8244)
- `show <tenant_id>` (#8372)
- `task_mgr` for most global tasks (#8449)

Follow-Ups

- `TenantId` (#8462)