
test_pageserver_chaos: duplicate L1 in test #4088

Closed
koivunej opened this issue Apr 26, 2023 · 5 comments

@koivunej (Member)

https://neon-github-public-dev.s3.amazonaws.com/reports/main/debug/4809176454/index.html#suites/094ca0c3798926b0b3676f7b1a8a5bdb/28eced2609521275/

Tripped on an ERROR message, reproduced here in full (except the stack trace):

2023-04-26T13:40:16.307329Z ERROR compaction_loop{tenant_id=3da6434c11e2f4a29310e74fd4ae849d}: Compaction failed, retrying in 2s: Attempt to insert duplicate layer 000000067F000032AC000000000000000001-000000067F000032AC000040000000000064__000000000AAE7281-0000000014A8F1C9 in layer map

Stack backtrace:
   0: <pageserver::tenant::layer_map::LayerMap<dyn pageserver::tenant::storage_layer::PersistentLayer>>::insert_historic_noflush
             at pageserver/src/tenant/layer_map.rs:279:13
   1: <pageserver::tenant::timeline::Timeline>::compact_level0::{closure#0}
             at pageserver/src/tenant/timeline.rs:3370:13
   2: <pageserver::tenant::timeline::Timeline>::compact_inner::{closure#0}
             at pageserver/src/tenant/timeline.rs:784:21
   3: <pageserver::tenant::timeline::Timeline>::compact::{closure#0}
             at pageserver/src/tenant/timeline.rs:642:46
   4: <tracing::instrument::Instrumented<<pageserver::tenant::timeline::Timeline>::compact::{closure#0}> as core::future::future::Future>::poll
             at .cargo/registry/src/github.com-1ecc6299db9ec823/tracing-0.1.37/src/instrument.rs:272:9
   5: <pageserver::tenant::Tenant>::compaction_iteration::{closure#0}
             at pageserver/src/tenant.rs:1301:17
   6: pageserver::tenant::tasks::compaction_loop::{closure#0}::{closure#0}
             at pageserver/src/tenant/tasks.rs:93:66
   7: pageserver::tenant::tasks::compaction_loop::{closure#0}
             at pageserver/src/tenant/tasks.rs:113:5
   8: <tracing::instrument::Instrumented<pageserver::tenant::tasks::compaction_loop::{closure#0}> as core::future::future::Future>::poll
             at .cargo/registry/src/github.com-1ecc6299db9ec823/tracing-0.1.37/src/instrument.rs:272:9
   9: pageserver::tenant::tasks::start_background_loops::{closure#0}
             at pageserver/src/tenant/tasks.rs:29:17

While inspecting the logs we noted:

2023-04-26T13:40:09.809567Z  INFO load{tenant_id=3da6434c11e2f4a29310e74fd4ae849d}:load_local_timeline{timeline_id=df40998dc92e8dc44fdef54dc6b39a10}: removing temp timeline file at /tmp/test_output/test_pageserver_chaos/repo/tenants/3da6434c11e2f4a29310e74fd4ae849d/timelines/df40998dc92e8dc44fdef54dc6b39a10/000000067F000032AC0000400000000000C9-XXX__000000000AAE7281-0000000014A8F1C9.64FgQ0WE.___temp

We deduced this to be a different L1 from the duplicated one, because it has a different key_start. The name is produced by:

conf.timeline_path(&timeline_id, &tenant_id).join(format!(
    "{}-XXX__{:016X}-{:016X}.{}.{}",
    key_start,
    u64::from(lsn_range.start),
    u64::from(lsn_range.end),
    rand_string,
    TEMP_FILE_SUFFIX,
))
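
As a sanity check, plugging the values from that log line into the same format string reproduces the temp file name. This is an illustrative sketch only: the variables below are plain stand-ins, not the real pageserver Key/Lsn types or the TEMP_FILE_SUFFIX constant.

fn main() {
    // Stand-in values taken from the log line above.
    let key_start = "000000067F000032AC0000400000000000C9";
    let lsn_start: u64 = 0x0000_0000_0AAE_7281;
    let lsn_end: u64 = 0x0000_0000_14A8_F1C9;
    let rand_string = "64FgQ0WE";
    let temp_file_suffix = "___temp";

    let name = format!(
        "{}-XXX__{:016X}-{:016X}.{}.{}",
        key_start, lsn_start, lsn_end, rand_string, temp_file_suffix
    );
    assert_eq!(
        name,
        "000000067F000032AC0000400000000000C9-XXX__000000000AAE7281-0000000014A8F1C9.64FgQ0WE.___temp"
    );
}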

Slack thread: https://neondb.slack.com/archives/C033QLM5P7D/p1682521379341769?thread_ts=1682519017.949719&cid=C033QLM5P7D

koivunej added the c/storage/pageserver (Component: storage: pageserver) label on Apr 26, 2023
@koivunej (Member Author)

Questions for gauging the severity with regard to #4086:

  • we didn't get to see the next compaction; what would have happened? A permanent error would have been bad

koivunej mentioned this issue Apr 26, 2023
@koivunej (Member Author)

The duplicate check was introduced in #3869.

koivunej added a commit that referenced this issue Apr 28, 2023
This reverts commit 732acc5.

Reverted PR: #3869

As noted in PR #4094, we do in fact try to insert duplicates into the
layer map if L0->L1 compaction is interrupted. We do not have a proper
fix for that right now, and we are in a hurry to make a release to
production, so revert the related changes to the state we currently have
in production. We know that we have a bug here, but it is better to live
with the bug that we have had in production for a long time than to rush
a fix to production without testing it in staging first.

Cc: #4094, #4088
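
To restate the failure mode as a minimal sketch (illustrative stand-ins, not the actual LayerMap code): an interrupted L0->L1 compaction leaves the newly produced L1 already registered, so the retried compaction generates a layer with exactly the same deterministic name and the duplicate check fires.

use std::collections::HashSet;

// Stand-in for LayerMap::insert_historic_noflush and its duplicate check.
fn insert_historic(map: &mut HashSet<String>, name: &str) -> Result<(), String> {
    if !map.insert(name.to_string()) {
        return Err(format!("Attempt to insert duplicate layer {name} in layer map"));
    }
    Ok(())
}

fn main() {
    let mut layer_map = HashSet::new();
    // Placeholder layer file name; real names encode the key range and LSN range.
    let l1 = "KEYSTART-KEYEND__LSNSTART-LSNEND";

    // First compaction attempt produces the L1 and registers it, but is
    // interrupted before it completes.
    insert_historic(&mut layer_map, l1).unwrap();

    // The retried compaction works over the same L0s and produces exactly the
    // same layer name again, tripping the duplicate check.
    assert!(insert_historic(&mut layer_map, l1).is_err());
}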
vadim2404 pushed a commit that referenced this issue Apr 28, 2023
@koivunej (Member Author)

I created a duplicate, #4690, so it seems this has been re-introduced; I guess we have to revert it again?

@koivunej (Member Author) commented Jul 12, 2023

Closing #4690 and reproducing its description here:


https://neon-github-public-dev.s3.amazonaws.com/reports/pr-4688/5529507997/index.html#suites/094ca0c3798926b0b3676f7b1a8a5bdb/7f9758333d0ceb71?attachment=2dadde58b3b49194

2023-07-12T08:59:17.461023Z ERROR compaction_loop{tenant_id=bf09b81bcfb7b2b97b9515a1ee45c6b2}:compact_timeline{timeline=7cb49508f7e06cc3d398c14b74c46203}:panic{thread=background op worker location=pageserver/src/tenant/timeline.rs:146:13}: overwriting a layer: PersistentLayerDesc { tenant_id: bf09b81bcfb7b2b97b9515a1ee45c6b2, timeline_id: 7cb49508f7e06cc3d398c14b74c46203, key_range: Key { field1: 0, field2: 0, field3: 0, field4: 0, field5: 0, field6: 0 }..Key { field1: 0, field2: 1663, field3: 1, field4: 2701, field5: 1, field6: 0 }, lsn_range: 0/149F078..0/17D6FFE9, is_delta: true, is_incremental: true, file_size: 5013504 }

Stack backtrace:
   0: utils::logging::tracing_panic_hook
             at /libs/utils/src/logging.rs:166:21
   1: <alloc::boxed::Box<F,A> as core::ops::function::Fn<Args>>::call
             at /rustc/90c541806f23a127002de5b4038be731ba1458ca/library/alloc/src/boxed.rs:1987:9
   2: std::panicking::rust_panic_with_hook
             at /rustc/90c541806f23a127002de5b4038be731ba1458ca/library/std/src/panicking.rs:695:13
   3: std::panicking::begin_panic_handler::{{closure}}
             at /rustc/90c541806f23a127002de5b4038be731ba1458ca/library/std/src/panicking.rs:582:13
   4: std::sys_common::backtrace::__rust_end_short_backtrace
             at /rustc/90c541806f23a127002de5b4038be731ba1458ca/library/std/src/sys_common/backtrace.rs:150:18
   5: rust_begin_unwind
             at /rustc/90c541806f23a127002de5b4038be731ba1458ca/library/std/src/panicking.rs:578:5
   6: core::panicking::panic_fmt
             at /rustc/90c541806f23a127002de5b4038be731ba1458ca/library/core/src/panicking.rs:67:14
   7: <pageserver::tenant::timeline::LayerFileManager<dyn pageserver::tenant::storage_layer::PersistentLayer>>::insert
             at /pageserver/src/tenant/timeline.rs:146:13
   8: <pageserver::tenant::timeline::Timeline>::insert_historic_layer
             at /pageserver/src/tenant/timeline.rs:2285:9
   9: <pageserver::tenant::timeline::Timeline>::compact_level0::{closure#0}
             at /pageserver/src/tenant/timeline.rs:3949:13
  10: <pageserver::tenant::timeline::Timeline>::compact_inner::{closure#0}
             at /pageserver/src/tenant/timeline.rs:844:21
  11: <pageserver::tenant::timeline::Timeline>::compact::{closure#0}
             at /pageserver/src/tenant/timeline.rs:703:46

Formatting the end of the reported key_range shows that it does not span the whole keyspace:

let key = Key { field1: 0, field2: 1663, field3: 1, field4: 2701, field5: 1, field6: 0 };
assert_eq!("000000067F0000000100000A8D0100000000", format!("{key}"));

So this was an L1 (a partial key range), not an L0.

skyzh added a commit that referenced this issue Jul 17, 2023
## Problem

Compactions might generate files of exactly the same name as before
compaction due to our naming of layer files. This could have already
caused some mess in the system, and is known to cause some issues like
#4088. Therefore, we now
consider duplicated layers in the post-compaction process to avoid
violating the layer map duplicate checks.

Related previous work: closes #4094. The error was reported in #4690 and #4088.

## Summary of changes

If a file already exists in the layer map before the compaction, do not
modify the layer map and do not delete the file. The file on disk at
that time should be the new one overwritten by the compaction process.
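
A minimal sketch of that skip-on-duplicate handling (illustrative names and types, not the actual pageserver LayerMap API):

use std::collections::HashSet;

type LayerName = String;

// After compaction, register only layers the map does not already know.
// For a name that is already present, the file on disk has just been rewritten
// with identical content, so neither the map entry nor the file is touched.
fn install_compaction_results(layer_map: &mut HashSet<LayerName>, new_layers: Vec<LayerName>) {
    for layer in new_layers {
        if layer_map.contains(&layer) {
            continue; // duplicate: skip instead of re-inserting or deleting
        }
        layer_map.insert(layer);
    }
}

fn main() {
    let mut layer_map: HashSet<LayerName> = ["L1_A".to_string()].into_iter().collect();
    install_compaction_results(&mut layer_map, vec!["L1_A".into(), "L1_B".into()]);
    assert_eq!(layer_map.len(), 2);
}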

This PR also adds a test case with a fail point that produces exactly
the same set of files.

This bypassing behavior is safe because the produced layer files have
the same content / are the same representation of the original file.

An alternative might be directly removing the duplicate check in the
layer map, but I feel it would be good if we can prevent that in the
first place.

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
Co-authored-by: Konstantin Knizhnik <knizhnik@garret.ru>
Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
Co-authored-by: Joonas Koivunen <joonas@neon.tech>
@koivunej (Member Author) commented Aug 4, 2023

This is now handled by #4696. The last recorded flakiness was on 2023-07-18 and the fix was committed 2023-07-17 10:26:29 -0400. However, the flakiness was on a branch, so this is all good; closing.
