assertion failed: self.historic_layers.remove(&LayerRTreeObject::new(layer)).is_some() #3387

Closed
LizardWizzard opened this issue Jan 19, 2023 · 17 comments
Assignees
Labels
c/storage/pageserver Component: storage: pageserver t/bug Issue Type: Bug

Comments

LizardWizzard (Contributor) commented Jan 19, 2023

Steps to reproduce

Investigating.

Expected result

No panics

Actual result

One task panicked. This also caused problems for other tenants, which became unavailable.

Part of the log containing stacktrace:

2023-01-19T16:15:20.672271Z  INFO download complete: /storage/pageserver/data/tenants/X/timelines/Y/000000067F00004002000004EB0000000007-030000000000000000000000000000000002__0000>
2023-01-19T16:15:20.576264Z  INFO synthetic_size_worker:calculate_synthetic_size{tenant_id=X}:gather_size_inputs{tenant_id=X}: on-demand downloading remote layer remote 73e1f731e8>
thread 'background op worker' panicked at 'assertion failed: self.historic_layers.remove(&LayerRTreeObject::new(layer)).is_some()', /home/nonroot/pageserver/src/tenant/layer_map.rs:397:9
stack backtrace:
   0: rust_begin_unwind
             at /rustc/90743e7298aca107ddaa0c202a4d3604e29bfeb6/library/std/src/panicking.rs:575:5
   1: core::panicking::panic_fmt
             at /rustc/90743e7298aca107ddaa0c202a4d3604e29bfeb6/library/core/src/panicking.rs:65:14
   2: core::panicking::panic
             at /rustc/90743e7298aca107ddaa0c202a4d3604e29bfeb6/library/core/src/panicking.rs:115:5
   3: pageserver::tenant::layer_map::LayerMap<L>::remove_historic
             at /home/nonroot/pageserver/src/tenant/layer_map.rs:397:9
   4: pageserver::tenant::timeline::Timeline::download_remote_layer::{{closure}}::{{closure}}::{{closure}}
             at /home/nonroot/pageserver/src/tenant/timeline.rs:3086:25
   5: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
             at /rustc/90743e7298aca107ddaa0c202a4d3604e29bfeb6/library/core/src/future/mod.rs:91:19
   6: <core::panic::unwind_safe::AssertUnwindSafe<F> as core::future::future::Future>::poll
             at /rustc/90743e7298aca107ddaa0c202a4d3604e29bfeb6/library/core/src/panic/unwind_safe.rs:296:9
   7: <futures_util::future::future::catch_unwind::CatchUnwind<Fut> as core::future::future::Future>::poll::{{closure}}
             at /home/nonroot/.cargo/registry/src/github.com-1ecc6299db9ec823/futures-util-0.3.25/src/future/future/catch_unwind.rs:36:42
   8: <core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once
             at /rustc/90743e7298aca107ddaa0c202a4d3604e29bfeb6/library/core/src/panic/unwind_safe.rs:271:9
   9: std::panicking::try::do_call
             at /rustc/90743e7298aca107ddaa0c202a4d3604e29bfeb6/library/std/src/panicking.rs:483:40
  10: std::panicking::try
             at /rustc/90743e7298aca107ddaa0c202a4d3604e29bfeb6/library/std/src/panicking.rs:447:19
  11: std::panic::catch_unwind
             at /rustc/90743e7298aca107ddaa0c202a4d3604e29bfeb6/library/std/src/panic.rs:137:14
  12: <futures_util::future::future::catch_unwind::CatchUnwind<Fut> as core::future::future::Future>::poll
             at /home/nonroot/.cargo/registry/src/github.com-1ecc6299db9ec823/futures-util-0.3.25/src/future/future/catch_unwind.rs:36:9
  13: <tokio::task::task_local::TaskLocalFuture<T,F> as core::future::future::Future>::poll::{{closure}}
             at /home/nonroot/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.24.1/src/task/task_local.rs:348:35
  14: tokio::task::task_local::LocalKey<T>::scope_inner
             at /home/nonroot/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.24.1/src/task/task_local.rs:233:19
  15: <tokio::task::task_local::TaskLocalFuture<T,F> as core::future::future::Future>::poll
             at /home/nonroot/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.24.1/src/task/task_local.rs:345:13
  16: <tokio::task::task_local::TaskLocalFuture<T,F> as core::future::future::Future>::poll::{{closure}}
             at /home/nonroot/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.24.1/src/task/task_local.rs:348:35
  17: tokio::task::task_local::LocalKey<T>::scope_inner
             at /home/nonroot/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.24.1/src/task/task_local.rs:233:19
  18: <tokio::task::task_local::TaskLocalFuture<T,F> as core::future::future::Future>::poll
             at /home/nonroot/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.24.1/src/task/task_local.rs:345:13
  19: pageserver::task_mgr::task_wrapper::{{closure}}
             at /home/nonroot/pageserver/src/task_mgr.rs:326:9
  20: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
             at /rustc/90743e7298aca107ddaa0c202a4d3604e29bfeb6/library/core/src/future/mod.rs:91:19
  21: tokio::runtime::task::core::Core<T,S>::poll::{{closure}}
             at /home/nonroot/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.24.1/src/runtime/task/core.rs:223:17
  22: tokio::loom::std::unsafe_cell::UnsafeCell<T>::with_mut
             at /home/nonroot/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.24.1/src/loom/std/unsafe_cell.rs:14:9
  23: tokio::runtime::task::core::Core<T,S>::poll
             at /home/nonroot/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.24.1/src/runtime/task/core.rs:212:13
  24: tokio::runtime::task::harness::poll_future::{{closure}}
             at /home/nonroot/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.24.1/src/runtime/task/harness.rs:476:19
  25: <core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once
             at /rustc/90743e7298aca107ddaa0c202a4d3604e29bfeb6/library/core/src/panic/unwind_safe.rs:271:9
  26: std::panicking::try::do_call
             at /rustc/90743e7298aca107ddaa0c202a4d3604e29bfeb6/library/std/src/panicking.rs:483:40
  27: std::panicking::try
             at /rustc/90743e7298aca107ddaa0c202a4d3604e29bfeb6/library/std/src/panicking.rs:447:19
  28: std::panic::catch_unwind
             at /rustc/90743e7298aca107ddaa0c202a4d3604e29bfeb6/library/std/src/panic.rs:137:14
  29: tokio::runtime::task::harness::poll_future
             at /home/nonroot/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.24.1/src/runtime/task/harness.rs:464:18
  30: tokio::runtime::task::harness::Harness<T,S>::poll_inner
             at /home/nonroot/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.24.1/src/runtime/task/harness.rs:198:27
  31: tokio::runtime::task::harness::Harness<T,S>::poll
             at /home/nonroot/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.24.1/src/runtime/task/harness.rs:152:15
  32: tokio::runtime::scheduler::multi_thread::worker::Context::run_task
  33: tokio::runtime::scheduler::multi_thread::worker::Context::run
  34: tokio::macros::scoped_tls::ScopedKey<T>::set
  35: tokio::runtime::scheduler::multi_thread::worker::run
  36: tokio::loom::std::unsafe_cell::UnsafeCell<T>::with_mut
  37: tokio::runtime::task::core::Core<T,S>::poll
  38: tokio::runtime::task::harness::Harness<T,S>::poll
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
2023-01-19T16:15:20.873463Z ERROR Task 'download layer 000000067F00004002000004EB0000000007-030000000000000000000000000000000002__000000001E6DE619-000000001E7B0B41' tenant_id: Some(X),  timeline_id: Some(Y) panicked: Any { .. }
2023-01-19T16:15:20.873542Z ERROR synthetic_size_worker: failed to calculate synthetic size for tenant 1acf99ebac0

Environment

prod

Logs, links

@LizardWizzard LizardWizzard added the t/bug Issue Type: Bug label Jan 19, 2023
LizardWizzard (Author) commented:

Most probably this assertion doesn't play well with on-demand download.

knizhnik (Contributor) commented:

> Most probably this assertion doesn't play well with on-demand download.

But it seems to me that even with on-demand download layer map at page server should contain all layers, shouldn't it?

LizardWizzard (Author) commented:

It should. The error happens when we've downloaded a layer and are trying to remove the existing remote layer to replace it with the downloaded one. Looking further.

LizardWizzard (Author) commented Jan 20, 2023

This is the sequence:

2023-01-19T13:11:06.337285Z  INFO attach{tenant_id=X}:load_remote_timeline{timeline_id=Y}:reconcile_with_remote: remote layer does not exist locally, creating remote layer: 000000067F00004002000004EB0000000007-030000000000000000000000000000000002__000000001E6DE619-000000001E7B0B41
2023-01-19T16:15:20.576264Z  INFO synthetic_size_worker:calculate_synthetic_size{tenant_id=X}:gather_size_inputs{tenant_id=X}: on-demand downloading remote layer remote Y/000000067F00004002000004EB0000000007-030000000000000000000000000000000002__000000001E6DE619-000000001E7B0B41
2023-01-19T16:15:20.576240Z  INFO synthetic_size_worker:calculate_synthetic_size{tenant_id=X}:gather_size_inputs{tenant_id=X}: need remote layer remote Y/000000067F00004002000004EB0000000007-030000000000000000000000000000000002__000000001E6DE619-000000001E7B0B41
2023-01-19T16:15:20.672271Z  INFO download complete: /storage/pageserver/data/tenants/X/timelines/Y/000000067F00004002000004EB0000000007-030000000000000000000000000000000002__000000001E6DE619-000000001E7B0B41
thread 'background op worker' panicked at 'assertion failed: self.historic_layers.remove(&LayerRTreeObject::new(layer)).is_some()', /home/nonroot/pageserver/src/tenant/layer_map.rs:397:9
stack backtrace:

LizardWizzard (Author) commented:

I have a hypothesis. In get_reconstruct_data, the comment says that the download happens without holding the layer map lock. That makes sense on one hand; on the other hand, while we're downloading the layer, some other task (e.g. compaction) can remove that layer from the map, and when the download finishes it will fail to replace the old layer with the new one. This explanation makes sense to me, but I don't have proof that this is what happened.
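The suspected interleaving can be sketched with plain std types (a HashSet standing in for the layer map; all names here are illustrative, not the real pageserver types):

```rust
use std::collections::HashSet;

// Returns true when the replace-style removal finds nothing, i.e. the
// situation `assert!(self.historic_layers.remove(..).is_some())` trips on.
fn replace_fails_after_concurrent_removal() -> bool {
    let mut layer_map: HashSet<&'static str> = HashSet::new();
    layer_map.insert("remote-layer-A");

    // 1. The download task notes the remote layer, then releases the
    //    layer map lock while the (slow) download runs.
    let to_replace = "remote-layer-A";

    // 2. Meanwhile another task (e.g. compaction) removes that layer.
    layer_map.remove(to_replace);

    // 3. The download completes and tries to swap in the local layer,
    //    but the removal that should make room for it finds nothing.
    !layer_map.remove(to_replace)
}

fn main() {
    println!("{}", replace_fails_after_concurrent_removal());
}
```

In the real code the two removals race through the same lock, but the effect on the map is the same as this serialized version.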

LizardWizzard (Author) commented:

Update:

Per sync with @shanyp and @problame

The reason this led to unavailability is the incorrectly set shutdown_on_error flag for synthetic size calculation, combined with shutdown_on_error not working (#3402).

The above hypothesis still looks valid to me. @hlinnaka, wdyt? Do you think there can be other reasons for this behavior?

We need to decide what would be the proper fix. Available options:

  • Treat this as a normal scenario: insert the layer even if there is no existing layer to replace.
    • We need to verify that this works correctly, since it may resurrect layers after deletion (most likely via compaction, less likely via gc).
    • Theoretically it can cause some perf degradation if the resurrected layer participates in lookups (it shouldn't).
    • It will occupy space on disk; who will clean it up, and when? Eviction? Gc? Tempfile + restart?
  • Come up with something like buffer pinning in the postgres page cache, so we won't remove layers that have an in-progress download.
    • Can cause contention if we wait until things get downloaded.
    • Will probably be easier to reason about.
  • Do pinning, but cancel the download task if we delete the layer.
    • Need to be careful with cancellation, not leaving temp files behind, etc.

The idea is not to rush the fix and to do it properly. In the meantime I'll use a workaround for the migration (the download-all-layers script by @problame).
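A minimal sketch of the first option above (hypothetical API and types; the real layer map is an R-tree keyed by LayerRTreeObject, not a set): have the replace report whether the old layer was still present instead of asserting, and skip the insert when it is gone, so a concurrently deleted layer is not resurrected:

```rust
use std::collections::HashSet;

#[derive(Debug, PartialEq)]
enum Replacement {
    Replaced,
    /// The old layer was removed concurrently; the new one is NOT
    /// inserted, so gc/compaction decisions are not undone.
    NotFound,
}

// Hypothetical replacement for the panicking remove-then-insert pair.
fn replace_historic(map: &mut HashSet<String>, old: &str, new: &str) -> Replacement {
    if map.remove(old) {
        map.insert(new.to_string());
        Replacement::Replaced
    } else {
        Replacement::NotFound
    }
}

fn main() {
    let mut map: HashSet<String> = HashSet::new();
    map.insert("remote-layer-A".to_string());

    // Normal case: the remote layer is still present, so it is swapped out.
    assert_eq!(
        replace_historic(&mut map, "remote-layer-A", "local-layer-A"),
        Replacement::Replaced
    );

    // Raced case: the layer is already gone; nothing is resurrected.
    assert_eq!(
        replace_historic(&mut map, "remote-layer-B", "local-layer-B"),
        Replacement::NotFound
    );
    assert!(!map.contains("local-layer-B"));
    println!("ok");
}
```

The NotFound arm is where the "who cleans up the already-downloaded file on disk" question from the first bullet still has to be answered.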

@shanyp shanyp added the c/storage Component: storage label Jan 23, 2023
@koivunej koivunej self-assigned this Jan 24, 2023
@shanyp shanyp added c/storage/pageserver Component: storage: pageserver and removed c/storage Component: storage labels Jan 24, 2023
koivunej (Member) commented Jan 24, 2023

Correct me if I misremember, but we discussed that we should first try to figure out how to reproduce this, and only then move on to ignoring/fixing it. I triple-checked the logs; there's nothing there to support the above hypothesis for today's case (assuming there was nothing for the earlier one).

Doesn't reproduce locally with:

  • attach, background compaction, tenant_size
  • attach, tenant_size
  • attach, running compute, tenant_size
  • attach, running compute, compact, tenant_size

I did manage to reproduce the compute refusing to start with the sent tarball, but didn't record it. This was on a local dev build with incremental compilation (many CGUs => better chance of hitting Arc::ptr_eq).

koivunej: This comment was marked as outdated.

koivunej (Member) commented Jan 25, 2023

@problame suggested I use a failpoint to stretch the logical_size_calculation duration so that it overlaps gc, but I don't think that should be enough even if layers were removed, because we didn't see the affected layer being removed anywhere.

I am thinking of the following solutions:

koivunej (Member) commented:

I don't want to edit #3387 (comment) completely, so adding fixes here:

Looking now, it seems that "Calculate logical size for timeline Y at LSN" happened every 10 min. From the logs it appears that the last gc completed around 15s before the next tenant_size calculation, for which we observe:

  1. need to on-demand download
  2. on-demand download completion
  3. panic while trying to replace on layer map
  4. finally, a log message showing that it was Tenant::refresh_gc_info which failed:
    2023-01-24T04:59:32.805866Z ERROR synthetic_size_worker: failed to calculate synthetic size for tenant 1acf99ebac07d01ea7abf3bd104e8e55: Failed to refresh gc_info before gathering inputs
    

Tenant::gather_size_inputs calls Tenant::refresh_gc_info first, and only later does it start to calculate logical sizes, which explains why there is no "Calculating logical size for Y at LSN" message.

So to clarify: logical size calculation was not running throughout the gcs. Marking my previous comment outdated. I will now inspect the pitr calculation to see if there's anything obviously wrong there.
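The call ordering described above can be sketched like this (function names follow the comment; the bodies are stand-ins, with the failure simulated by an Err):

```rust
// refresh_gc_info runs before any logical size calculation, so when it
// fails, the "Calculating logical size" step (and its log line) is
// never reached.
fn refresh_gc_info() -> Result<(), String> {
    Err("panic while replacing the on-demand downloaded layer".to_string())
}

fn calculate_logical_sizes() -> Result<(), String> {
    println!("Calculating logical size ..."); // never printed in this scenario
    Ok(())
}

fn gather_size_inputs() -> Result<(), String> {
    refresh_gc_info()?; // fails first ...
    calculate_logical_sizes() // ... so this never runs
}

fn main() {
    assert!(gather_size_inputs().is_err());
    println!("gather_size_inputs failed before logical size calculation");
}
```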

koivunej (Member) commented Jan 25, 2023

After a bit of a detour with #3446, my latest idea for reproduction: 91b6ac2 (in the commit message). Still not there. The correct gc_horizon value is 87772040 (I have zero intuition for these), but no layers end up downloaded after gc at the same cutoff 0/1F3B4320 by tenant_size's refresh_gc_info.

koivunej: This comment was marked as outdated.

koivunej (Member) commented Jan 26, 2023

Reproduced the layer download itself with the release-2722 binary (sha256: dbbde2386583ba8f138ab813a116c84e499d53e73d229233b313daf0f8657e22) by just refresh_gc_info, with no panic. This required a highly hacky, time-specific tenant configuration:

```toml
[tenant_config]
pitr_interval = "713007s"
gc_horizon = 87772040
```

Where the pitr interval is approximately `select extract(epoch from ('7 days' + (('now'::timestamptz) - ('2023-01-25 06:34:32.436+00:00'::timestamp))));`.
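For illustration, the same arithmetic can be done without SQL (the reference timestamp is the one quoted above; `pitr_interval_secs` is a made-up helper, not part of the pageserver):

```rust
// Recompute the hacky pitr_interval: 7 days plus the time elapsed since
// a fixed reference instant (2023-01-25 06:34:32 UTC as a unix timestamp).
fn pitr_interval_secs(now_unix: u64) -> u64 {
    let reference = 1_674_628_472u64; // 2023-01-25 06:34:32 UTC
    let seven_days = 7 * 24 * 3600;
    seven_days + now_unix.saturating_sub(reference)
}

fn main() {
    // At the reference instant itself the interval is exactly 7 days.
    assert_eq!(pitr_interval_secs(1_674_628_472), 604_800);
    // 108207 s (~30 h) later it matches the 713007s in the config above.
    assert_eq!(pitr_interval_secs(1_674_628_472 + 108_207), 713_007);
    println!("ok");
}
```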

koivunej (Member) commented Feb 13, 2023

As later found out in #3589, we always fail with "cannot iterate a remote layer" for RemoteLayers, so this assertion error could not be compaction-related (had compaction touched a RemoteLayer, it would have failed).

problame (Contributor) commented:

Let's close this issue because the code is gone?

koivunej (Member) commented:

The related problematic/missed log line has now changed after #3664; see #3431 (comment).

I think this could be closed, unless we still want to try to reproduce it with the old binary. To recap, since then (or inspired by this issue) we've:

shanyp (Contributor) commented Feb 22, 2023

let's close this one
