fix(test-case): update 5000 tables test case configuration #9843

vponomaryov · 2025-01-16T17:54:11Z

List of changes:

- Disable per-table metrics due to their significant perf impact.
- Enable cluster health checks, which work with this case just fine.
- Decrease the nemesis interval from 60 minutes to just 3, keeping in mind that the health checks will take some time too.
- Reduce the stress time for each of the 5000 commands. With 20 minutes per cmd, test runs will take about 1.5 days instead of 2.5 days.
- Reduce the number of loaders from 5 to 3 to use resources more efficiently. In the current case the bottleneck is RAM.

Note that this scenario hits the following bug if the destroy_data_then_repair nemesis gets triggered against its setup:

https://github.com/scylladb/scylla-enterprise/issues/5093

Testing

scylla-staging/valerii/vp-scale-5000-tables-test#7

PR pre-checks (self review)

I added the relevant backport labels
I didn't leave commented-out/debugging code

Reminders

Add New configuration option and document them (in sdcm/sct_config.py)
Add unit tests to cover my changes (under unit-test/ folder)
Update the Readme/doc folder relevant to this change (if needed)

test-cases/scale/longevity-5000-tables.yaml

mykaul · 2025-01-19T11:49:33Z

Is this with tablets or with vnodes? (we probably need for both and compare between them)

fruch

LGTM

fruch · 2025-01-19T13:24:43Z

> Is this with tablets or with vnodes? (we probably need for both and compare between them)

The test can run with both, and the issue mentioned above happened with both.

You can see the summary here:
https://github.com/scylladb/qa-tasks/issues/1820#issuecomment-2596369089

But it doesn't seem like anyone is attending to the issue raised by @vponomaryov.

The 'test_longevity.py::test_test_user_batch_custom_time' unit test uses the 'test-cases/scale/longevity-5000-tables.yaml' config file to run a short longevity test which triggers a nemesis.

If health checks are enabled, the "nemesis call" runs much longer because the health checks are performed twice: before and after the nemesis. The nemesis lock is held for that whole time, and the problem is that the call keeps running even after this test finishes. So any other unit test which tries to get the nemesis lock will stumble upon a held lock for 10+ minutes.

It started happening after the merge of the PR [1], which enabled health checks in the mentioned config file.

The affected test is the following:
- test_nemesis.py::test_list_nemesis_of_added_disrupt_methods

Alphabetically, it runs after the 'test_longevity.py::test_test_user_batch_custom_time' one.

So, disable health checks to avoid side effects and redundant work which was not planned when the test was written.

[1] scylladb#9843

(cherry picked from commit a78d65f) [trailer present on the backported copies of this commit]
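The lock contention described in the commit message above can be illustrated with a small standalone sketch. It does not use the real framework code: `NEMESIS_LOCK`, `nemesis_call`, `first_test` and `second_test_can_lock` are hypothetical names, and the durations are scaled down from minutes to fractions of a second. It only shows the general pattern of a background call that keeps a shared lock after the test that started it has returned.

```python
import threading
import time

# Hypothetical stand-in for the shared nemesis lock (assumption: the real
# lock lives in the test framework and is not reproduced here).
NEMESIS_LOCK = threading.Lock()


def nemesis_call(health_check_seconds: float, lock_taken: threading.Event) -> None:
    """Hold the lock for: health check + disruption + health check."""
    with NEMESIS_LOCK:
        lock_taken.set()                  # tell the caller the lock is held
        time.sleep(health_check_seconds)  # health check before the nemesis
        time.sleep(0.05)                  # the disruption itself
        time.sleep(health_check_seconds)  # health check after the nemesis


def first_test(health_check_seconds: float) -> threading.Thread:
    """Start a nemesis in the background and return without waiting for it."""
    lock_taken = threading.Event()
    worker = threading.Thread(
        target=nemesis_call, args=(health_check_seconds, lock_taken), daemon=True
    )
    worker.start()
    lock_taken.wait()  # make sure the worker really owns the lock
    return worker      # the "test" ends here, the worker keeps running


def second_test_can_lock(timeout: float) -> bool:
    """Try to take the nemesis lock the way a later test would."""
    acquired = NEMESIS_LOCK.acquire(timeout=timeout)
    if acquired:
        NEMESIS_LOCK.release()
    return acquired


def run_scenario(health_check_seconds: float) -> bool:
    """Return True if the later test manages to get the lock in time."""
    worker = first_test(health_check_seconds)
    result = second_test_can_lock(timeout=0.5)
    worker.join()  # clean up so consecutive scenarios do not overlap
    return result
```

Calling `run_scenario(0.0)` models the config with health checks disabled: the lock is released quickly and the later test can take it. Calling `run_scenario(0.5)` models health checks being enabled: the lock outlives the first test and the later test times out waiting for it.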

github-actions bot assigned vponomaryov Jan 16, 2025

vponomaryov added the backport/2024.2, backport/6.2, backport/2024.1, and backport/6.1 labels Jan 16, 2025

vponomaryov requested review from fruch and roydahan January 16, 2025 18:00

roydahan reviewed Jan 16, 2025

test-cases/scale/longevity-5000-tables.yaml (resolved)

roydahan previously approved these changes Jan 16, 2025

vponomaryov dismissed roydahan’s stale review via 6db9fa9 January 17, 2025 10:28

vponomaryov force-pushed the update-5000-tables-senario branch from 878caa3 to 6db9fa9 January 17, 2025 10:28

vponomaryov mentioned this pull request Jan 17, 2025

Monitoring node runs out of RAM and CPU resources with growth of the tables number and data in it scylladb/scylla-monitoring#2429 (Open)

vponomaryov requested a review from roydahan January 17, 2025 11:06

fruch approved these changes Jan 19, 2025

fruch merged commit 0c7fa60 into scylladb:master Jan 19, 2025
6 checks passed

scylladbbot added the promoted-to-master label Jan 19, 2025

scylladbbot added the backport/6.1-done, backport/6.2-done, and backport/2024.1-done labels and removed the backport/6.1, backport/6.2, and backport/2024.1 labels Jan 19, 2025

vponomaryov mentioned this pull request Jan 21, 2025

fix(unit-tests): stop triggering health checks in longevity unit test #9888 (Merged)

scylladbbot mentioned this pull request Jan 21, 2025

[Backport 2024.1] fix(unit-tests): stop triggering health checks in longevity unit test #9890 (Closed)
[Backport 2024.2] fix(unit-tests): stop triggering health checks in longevity unit test #9891 (Merged)
[Backport 6.1] fix(unit-tests): stop triggering health checks in longevity unit test #9892 (Closed)
[Backport 6.2] fix(unit-tests): stop triggering health checks in longevity unit test #9893 (Merged)

scylladbbot added the backport/2024.2-done label and removed the backport/2024.2 label Jan 22, 2025
