Mixed shard repair reproducer #8435

Deexie · 2024-08-26T14:17:44Z

Reproducer for mixed shard repair, used to choose the best solution for scylladb/scylladb#18269.

Sets up a 3-node cluster on AWS with 1TB of data and runs repair.

It will be run on Jenkins with the following configurations:

on master; each node has the same number of shards (60)
on master; nodes will have 60, 59, 58 shards
POC1 (repair: mixed shard repair - poc1 scylladb#20295); 60, 59, 58
POC2 (repair: mixed shard repair - poc2 scylladb#20296); 60, 59, 58

denesb · 2024-08-29T07:28:02Z

I am not familiar with the SCT code, but the description looks good to me.
Did you get a chance to run the test? How do the numbers look?

Deexie · 2024-09-02T15:43:08Z

change instance type to i3.16xlarge

Deexie · 2024-09-03T12:51:58Z

change shards count

Deexie · 2024-09-04T07:23:22Z

change loaders instance
split data population

Deexie · 2024-09-05T14:20:09Z

master-60-59-58
test duration: 1h51m
repair time: 936.2321102619171s (15min)
argus: https://argus.scylladb.com/test_runs?state=WyI4YTI3MGEyYS05OWE2LTQwYzYtODkxNS1lMzhiOGFiOGQ2OGMiXQ
non-LSA memory:

Deexie · 2024-09-06T09:06:51Z

master 60-60-60
test duration: 1h 24min
repair time: 553.5129013061523s (9 min)

Deexie · 2024-09-06T09:09:35Z

poc1-60-59-58
test duration: 3h 2min
repair time: 5473.398208618164s (1.5h)
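
For comparison, the three repair times reported so far can be put side by side. This is a minimal sketch that uses only the numbers from the comments above; the dictionary keys are just labels for the runs, and the values are assumed to be in seconds.

```python
# Repair times in seconds, copied from the run reports above.
repair_times_s = {
    "master-60-60-60": 553.5129013061523,
    "master-60-59-58": 936.2321102619171,
    "poc1-60-59-58": 5473.398208618164,
}

baseline = repair_times_s["master-60-60-60"]

# Slowdown of each run relative to the uniform-shard baseline.
slowdown = {name: t / baseline for name, t in repair_times_s.items()}

for name, factor in slowdown.items():
    print(f"{name}: {repair_times_s[name] / 60:.1f} min, {factor:.2f}x baseline")
```

On these numbers the mixed-shard run on master is roughly 1.7x slower than the uniform run, and POC1 is roughly 10x slower.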

Deexie · 2024-09-06T09:11:22Z

poc2-60-59-58
failed after: 7h 45min

02:09:38  error running operation: std::system_error (error system:104, recv: Connection reset by peer)
02:09:38  ----- LAST WARNING EVENT -----------------------------------------------------
02:09:38  2024-09-05 19:38:43.928 <2024-09-05 19:38:43.699>: (DatabaseLogEvent Severity.WARNING) period_type=one-time event_id=c571cbf1-5131-4da8-8563-3d5ed86ec7bb: type=WARNING regex=(^WARNING|!\s*?WARNING).*\[shard.*\] line_number=109442 node=ubuntu-mixed-sh-db-node-a7786369-2
02:09:38  2024-09-05T19:38:43.699+00:00 ubuntu-mixed-sh-db-node-a7786369-2  !WARNING | scylla[15842]:  [shard  0: gms] seastar_memory - oversized allocation: 1069056 bytes. This is non-fatal, but could lead to latency and/or fragmentation issues. Please report: at 0x6102d3e 0x6103350 0x6103658 0x5bafde2 0x5bb2585 0x45e9158 0x45e255a 0x5c0591f 0x5c06e9a 0x5c08077 0x5c07428 0x5b97593 0x5b968f3 0x13cf2f5 0x13d0cb0 0x13cd713 /opt/scylladb/libreloc/libc.so.6+0x2a087 /opt/scylladb/libreloc/libc.so.6+0x2a14a 0x13cad94
02:09:38  ----- LAST NORMAL EVENT ------------------------------------------------------
02:09:38  2024-09-05 19:38:10.473: (PrometheusAlertManagerEvent Severity.NORMAL) period_type=end event_id=d69143fc-2503-4eb1-b248-61a7a9171077 duration=1h35m59s: alert_name=InstanceDown node=10.4.1.235 start=2024-09-05T18:02:07.408Z end=2024-09-05T18:06:07.408Z description=10.4.1.235 has been down for more than 30 seconds. updated=2024-09-05T18:02:07.412Z state=active fingerprint=45469a7e312b47e8 labels={'alertname': 'InstanceDown', 'cluster': 'my-cluster', 'dc': 'eu-west-1', 'instance': '10.4.1.235', 'job': 'scylla', 'monitor': 'scylla-monitor', 'severity': '3'}
02:09:38  ================================================================================

decoded:

[Backtrace #0]
void seastar::backtrace<seastar::current_backtrace_tasklocal()::$_0>(seastar::current_backtrace_tasklocal()::$_0&&) at ./build/release/seastar/./seastar/include/seastar/util/backtrace.hh:68
 (inlined by) seastar::current_backtrace_tasklocal() at ./build/release/seastar/./build/release/seastar/./seastar/src/util/backtrace.cc:97
seastar::current_tasktrace() at ./build/release/seastar/./build/release/seastar/./seastar/src/util/backtrace.cc:148
seastar::current_backtrace() at ./build/release/seastar/./build/release/seastar/./seastar/src/util/backtrace.cc:181
seastar::memory::cpu_pages::warn_large_allocation(unsigned long) at ./build/release/seastar/./build/release/seastar/./seastar/src/core/memory.cc:849
 (inlined by) seastar::memory::cpu_pages::check_large_allocation(unsigned long) at ./build/release/seastar/./build/release/seastar/./seastar/src/core/memory.cc:912
 (inlined by) seastar::memory::cpu_pages::allocate_large(unsigned int, bool) at ./build/release/seastar/./build/release/seastar/./seastar/src/core/memory.cc:919
 (inlined by) seastar::memory::allocate_large(unsigned long, bool) at ./build/release/seastar/./build/release/seastar/./seastar/src/core/memory.cc:1542
 (inlined by) seastar::memory::allocate_slowpath(unsigned long) at ./build/release/seastar/./build/release/seastar/./seastar/src/core/memory.cc:1688
malloc at ./build/release/seastar/./build/release/seastar/./seastar/src/core/memory.cc:1707
service::raft_sys_table_storage::load_log() at ././seastar/include/seastar/core/sstring.hh:167
std::__n4861::coroutine_handle<seastar::internal::coroutine_traits_base<boost::container::deque<seastar::lw_shared_ptr<raft::log_entry const>, void, void> >::promise_type>::resume() const at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/coroutine:242
 (inlined by) seastar::internal::coroutine_traits_base<boost::container::deque<seastar::lw_shared_ptr<raft::log_entry const>, void, void> >::promise_type::run_and_dispose() at ././seastar/include/seastar/core/coroutine.hh:80
seastar::reactor::run_tasks(seastar::reactor::task_queue&) at ./build/release/seastar/./build/release/seastar/./seastar/src/core/reactor.cc:2577
seastar::reactor::run_some_tasks() at ./build/release/seastar/./build/release/seastar/./seastar/src/core/reactor.cc:3043
seastar::reactor::do_run() at ./build/release/seastar/./build/release/seastar/./seastar/src/core/reactor.cc:3211
seastar::reactor::run() at ./build/release/seastar/./build/release/seastar/./seastar/src/core/reactor.cc:3101
seastar::app_template::run_deprecated(int, char**, std::function<void ()>&&) at ./build/release/seastar/./build/release/seastar/./seastar/src/core/app-template.cc:276
seastar::app_template::run(int, char**, std::function<seastar::future<int> ()>&&) at ./build/release/seastar/./build/release/seastar/./seastar/src/core/app-template.cc:167
scylla_main(int, char**) at ././main.cc:700
std::function<int (int, char**)>::operator()(int, char**) const at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/std_function.h:591
main at ././main.cc:2246
/data/scylla-s3-reloc.cache/by-build-id/f8ada775ee7b1210127d4237f218442ce59c3ae3/extracted/scylla/libreloc/libc.so.6: ELF 64-bit LSB shared object, x86-64, version 1 (GNU/Linux), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, BuildID[sha1]=8f53abaad945a669f2bdcd25f471d80e077568ef, for GNU/Linux 3.2.0, not stripped

__libc_start_call_main at ??:?
__libc_start_main_alias_2 at :?
_start at ??:?

asias · 2024-09-20T00:42:12Z

@Deexie How did you execute the new SCT test introduced in this PR? Did you run it through Jenkins? Could you share the details?

denesb · 2024-12-12T09:10:32Z

@pehala please review.

pehala

I see we have the db_nodes_shards_selection option, which we use in asymmetrical test cases. Would this work here? Or could it be extended to work here, without so many changes in the core?

Deexie · 2024-12-12T13:36:00Z

> I see we have the db_nodes_shards_selection option, which we use in asymmetrical test cases. Would this work here? Or could it be extended to work here, without so many changes in the core?

With this option, we can have different numbers of shards, but they are taken randomly. Here, we can specify the exact number of shards per node. If we reuse db_nodes_shards_selection, then I think we still need to propagate the number of shards on each node.

Originally, it was done for AWS only. Maybe that's a good way.

roydahan · 2024-12-12T17:57:07Z

Please note that "features" tests aren't triggered regularly (especially not automatically).
They are a good way to test a specific feature quickly, but they are not part of the regression testing of releases.

Until now we had the asymmetrical longevities to exercise this path. That is obviously not enough, since they didn't detect the issue we had in the field.
I recommend either refactoring the current longevity or adding a new longevity that will exercise this "feature".

denesb · 2024-12-20T12:52:51Z

@Deexie what is the status of this PR?

denesb · 2025-01-20T12:49:07Z

@Deexie what is the status here?

Deexie · 2025-01-20T14:10:26Z

> @Deexie what is the status here?

I'm getting back to it. Currently, the PR contains a feature that enables setting the shard number for each server and the test that was used in the mixed shard issue. I do not see how to achieve what's tested here without the custom shard num feature, nor how to make it a regression test that runs periodically.

> I see we have the db_nodes_shards_selection option, which we use in asymmetrical test cases. Would this work here? Or could it be extended to work here, without so many changes in the core?

@pehala please see my response above (#8435 (comment)). Do you think that the change may get in as is? Does it need additional testing? Should I run it with each backend and check whether the number of cores is as specified?

I don't think we can go with db_nodes_shards_selection.

> Please note that "features" tests aren't triggered regularly (especially not automatically). They are a good way to test a specific feature quickly, but they are not part of the regression testing of releases.
>
> Until now we had the asymmetrical longevities to exercise this path. That is obviously not enough, since they didn't detect the issue we had in the field. I recommend either refactoring the current longevity or adding a new longevity that will exercise this "feature".

@roydahan This test wasn't meant to run periodically. The bug was examined based on metrics and logs. I don't know how to convert this into longevity.

roydahan · 2025-01-20T18:30:15Z

> @roydahan This test wasn't meant to run periodically. The bug was examined based on metrics and logs. I don't know how to convert this into longevity.

Maybe one simple way is to change the configuration of the current "asymmetric" longevities to use "nodes_smp: [X, Y, Z]" instead of the current random selection, with the shard counts that we think will stress this feature the most.
You can do that by adding another configuration file like the one here https://github.com/scylladb/scylla-cluster-tests/blob/master/configurations/db-nodes-shards-random.yaml and switching some of the longevities that use that file to your new config file.

fruch · 2025-01-20T22:26:20Z

sdcm/sct_config.py

@@ -499,6 +499,9 @@ class SCTConfiguration(dict):
             In case of random option - Scylla will start with different (random) shards on every node of the cluster
             """),

+        dict(name="nodes_smp", env="SCT_NODES_SMP", type=list,
+             help="List of shard numbers of nodes in Scylla cluster; list of int, like [4, 5, 3]"),


Suggested change

-             help="List of shard numbers of nodes in Scylla cluster; list of int, like [4, 5, 3]"),
+             help="List of shard number to set per node in Scylla cluster; list of int, like [4, 5, 3]"),

I wonder how it would work with multi-dc cases:

region_name: 'eu-west-1 us-east-1'
n_db_nodes: '2 1'
nodes_smp: [12, 12, 15]

The number is based on node_index and I think it does not depend on dc
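
To illustrate the reply above, here is a minimal sketch of a mapping that depends only on a cluster-wide node index. The helper name `smp_by_dc` and the assumption that nodes are numbered consecutively across regions, in the order given by `n_db_nodes`, are illustrative and not taken from the SCT code.

```python
def smp_by_dc(n_db_nodes: str, nodes_smp: list[int]) -> list[list[int]]:
    """Split a flat per-node shard list into one list per DC.

    Hypothetical helper: assumes node_index runs consecutively across DCs,
    in the order the DCs appear in n_db_nodes.
    """
    counts = [int(n) for n in n_db_nodes.split()]
    if sum(counts) != len(nodes_smp):
        raise ValueError("nodes_smp must have one entry per DB node")

    result, node_index = [], 0
    for count in counts:
        # Each DC takes the next `count` entries of the flat list.
        result.append(nodes_smp[node_index:node_index + count])
        node_index += count
    return result


# The multi-DC example from the question: 2 nodes in the first region, 1 in the second.
print(smp_by_dc("2 1", [12, 12, 15]))
```

Under that assumption, the two eu-west-1 nodes would get 12 shards each and the single us-east-1 node would get 15.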

fruch

LGTM

we might be able to name the configuration option a bit better
default argument values shouldn't be mutable
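
The second point most likely refers to Python's mutable default arguments: a default such as `[]` is created once, when the function is defined, and is then shared by every call that omits the argument. Below is a minimal sketch of the pitfall and the usual `None` fix; the function names are made up for illustration and do not come from this PR.

```python
def add_node_bad(smp, nodes_smp=[]):
    # The same list object is reused on every call that omits nodes_smp.
    nodes_smp.append(smp)
    return nodes_smp


def add_node_good(smp, nodes_smp=None):
    # A fresh list is created per call when the caller passes nothing.
    if nodes_smp is None:
        nodes_smp = []
    nodes_smp.append(smp)
    return nodes_smp


print(add_node_bad(60), add_node_bad(59))    # both results are one shared list
print(add_node_good(60), add_node_good(59))  # two independent lists
```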

Deexie · 2025-01-24T13:31:04Z

use None as a default param value
rename nodes_smp to smp_per_db_node_mapping
use str_or_list_or_eval type for smp_per_db_node_mapping
add pipelines with custom shard number for some tests that run with random shard num
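
The `str_or_list_or_eval` change above suggests the option can be given either as a list or as a string. A rough sketch of that kind of normalization follows; `parse_smp_mapping` is a hypothetical helper built on `ast.literal_eval`, not SCT's actual `str_or_list_or_eval` implementation.

```python
import ast


def parse_smp_mapping(value):
    """Normalize a per-node shard setting to a list of positive ints.

    Hypothetical helper: accepts a list such as [4, 5, 3] or its string
    form "[4, 5, 3]" (e.g. as it might arrive from an environment variable).
    """
    if isinstance(value, str):
        # literal_eval only parses Python literals; it does not run code.
        value = ast.literal_eval(value)
    if not isinstance(value, list):
        raise ValueError(f"expected a list of shard counts, got {value!r}")
    if not all(isinstance(n, int) and not isinstance(n, bool) and n > 0 for n in value):
        raise ValueError(f"shard counts must be positive integers: {value!r}")
    return value


print(parse_smp_mapping("[4, 5, 3]"))
print(parse_smp_mapping([60, 59, 58]))
```

Validating early like this would surface a malformed value at configuration time rather than when a node is started.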

Deexie · 2025-01-24T13:33:54Z

modify smp_per_db_node_mapping description

scylladbbot · 2025-01-27T09:54:25Z

@Deexie new branch branch-2025.1 was added, please add backport label if needed

Add custom shard number config for Scylla clusters.

…es with custom shard number
Copy asymmetric Jenkins longevity pipelines and set custom shard number for them.

Deexie · 2025-01-28T15:17:48Z

drop excessive self arg

github-actions bot assigned Deexie Aug 26, 2024

Deexie requested review from asias and denesb August 26, 2024 14:54

Deexie force-pushed the mixed-shard-repair branch 6 times, most recently from 05d9461 to 3060c15 Compare August 27, 2024 14:52

Deexie force-pushed the mixed-shard-repair branch from 3060c15 to 7f38a55 Compare September 2, 2024 15:42

Deexie force-pushed the mixed-shard-repair branch from 7f38a55 to 9502ed2 Compare September 3, 2024 12:50

Deexie force-pushed the mixed-shard-repair branch from 9502ed2 to d09d884 Compare September 4, 2024 07:21

Deexie force-pushed the mixed-shard-repair branch 4 times, most recently from 4b62220 to 13c631d Compare September 5, 2024 10:51

Deexie force-pushed the mixed-shard-repair branch 2 times, most recently from 3f3986d to b3929c0 Compare September 13, 2024 16:28

Deexie mentioned this pull request Sep 17, 2024

repair very slow on mixed shard clusters scylladb/scylladb#18269

Closed

Deexie force-pushed the mixed-shard-repair branch 2 times, most recently from 7fca2ff to a948bb9 Compare September 19, 2024 13:05

pehala reviewed Dec 12, 2024

roydahan requested a review from fruch January 20, 2025 18:30