fix(test-case): update 5000 tables test case configuration #9843

Merged
merged 1 commit into from
Jan 19, 2025

Conversation

vponomaryov
Contributor

@vponomaryov vponomaryov commented Jan 16, 2025

List of changes:

  • Disable per-table metrics due to their significant performance impact.
  • Enable cluster health checks, which work fine with this case.
  • Decrease the nemesis interval from 60 minutes to just 3, keeping in mind that health checks will take some time too.
  • Reduce the stress time for each of the 5000 commands. With 20 minutes per command, test runs will take about 1.5 days instead of 2.5 days.
  • Reduce the number of loaders from 5 to 3 to use resources more efficiently. In the current case the bottleneck is RAM.

Note that this scenario hits the following bug:

  • scylladb/scylla-enterprise#5093

It happens if the destroy_data_then_repair nemesis gets triggered against the setup of this scenario.
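
To make the stress-time arithmetic in the list of changes easier to reason about, here is a minimal sketch. It assumes the 5000 commands run in sequential batches, with every command in a batch running in parallel. The function name and the `batch_size` value are hypothetical and are not taken from the test configuration; setup time and nemesis overhead are ignored.

```python
import math


def estimate_run_minutes(total_commands: int, batch_size: int, minutes_per_command: int) -> int:
    """Estimate total stress time for commands executed in sequential batches.

    Assumes all commands in a batch run in parallel and batches do not overlap.
    """
    batches = math.ceil(total_commands / batch_size)
    return batches * minutes_per_command


# batch_size=50 is a made-up value used only for illustration.
minutes = estimate_run_minutes(total_commands=5000, batch_size=50, minutes_per_command=20)
print(f"{minutes} minutes = {minutes / (60 * 24):.2f} days")  # 2000 minutes = 1.39 days
```

Under these assumptions the total scales linearly with the per-command stress time, which is why shortening each command shortens the whole run.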

Testing

PR pre-checks (self review)

  • I added the relevant backport labels
  • I didn't leave commented-out/debugging code

Reminders

  • Add New configuration option and document them (in sdcm/sct_config.py)
  • Add unit tests to cover my changes (under unit-test/ folder)
  • Update the Readme/doc folder relevant to this change (if needed)

@vponomaryov vponomaryov added backport/2024.2 Need backport to 2024.2 backport/6.2 backport/2024.1 Need backport to 2024.1 backport/6.1 Need backport to 6.1 labels Jan 16, 2025
@vponomaryov vponomaryov requested review from fruch and roydahan January 16, 2025 18:00
roydahan
roydahan previously approved these changes Jan 16, 2025
List of changes:
- Disable per-table metrics due to significant perf impact.
- Enable cluster health checks which work with this case just fine.
- Decrease the nemesis interval from 60 minutes to just 3, keeping
  in mind that health checks will take some time too.
- Reduce the stress time for each of the 5000 commands.
  With 20 minutes per command, test runs will take about 1.5 days
  instead of 2.5 days.
- Reduce the number of loaders from 5 to 3 to use resources more
  efficiently. In the current case the bottleneck is RAM.

Note that this scenario hits the following bug:
- scylladb/scylla-enterprise#5093

It happens if the 'destroy_data_then_repair' nemesis gets triggered
against the setup of this scenario.
@mykaul
Contributor

mykaul commented Jan 19, 2025

Is this with tablets or with vnodes? (We probably need to run it with both and compare them.)

Contributor

@fruch fruch left a comment

LGTM

@fruch
Contributor

fruch commented Jan 19, 2025

Is this with tablets or with vnodes? (We probably need to run it with both and compare them.)

The test can run with both, and the issue mentioned above happened with both.

You can see the summary here:
https://github.com/scylladb/qa-tasks/issues/1820#issuecomment-2596369089

But it doesn't seem like anyone is attending to the issue raised by @vponomaryov.

@scylladbbot scylladbbot added backport/6.1-done Commit backported to 6.1 backport/6.2-done backport/2024.1-done Commit backported to 2024.1 and removed backport/6.1 Need backport to 6.1 backport/6.2 backport/2024.1 Need backport to 2024.1 labels Jan 19, 2025
vponomaryov added a commit to vponomaryov/scylla-cluster-tests that referenced this pull request Jan 21, 2025
The 'test_longevity.py::test_test_user_batch_custom_time' unit test uses
the 'test-cases/scale/longevity-5000-tables.yaml' config file to run
a short longevity test which triggers a nemesis.

If health checks are enabled, the "nemesis call" runs much longer
because health checks are performed twice: before and after the
nemesis. While it is ongoing, the nemesis lock is held.
The problem is that it keeps running even after this test finishes.
So, any other unit test which tries to get the nemesis lock will stumble
upon a held lock for 10+ minutes.

It started happening after the merge of the PR [1] which enabled health
checks in the mentioned config file.
The affected test is the following:
- test_nemesis.py::test_list_nemesis_of_added_disrupt_methods

Alphabetically, it runs after
the 'test_longevity.py::test_test_user_batch_custom_time' one.

So, disable health checks to avoid side effects and redundant
work which was not planned when the test was written.

[1] scylladb#9843
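
The lock contention described in this commit message can be illustrated with a small, self-contained sketch. It is not the scylla-cluster-tests implementation: `nemesis_lock`, `run_nemesis`, `next_test_can_lock` and `simulate` are hypothetical names, and the sleeps stand in for health checks and the disruption itself, scaled down from minutes to fractions of a second.

```python
import threading
import time

# Hypothetical stand-in for the lock shared by all nemesis runs.
nemesis_lock = threading.Lock()


def run_nemesis(health_check_seconds: float, with_health_checks: bool) -> None:
    """Hold the lock for the whole disruption, including optional health checks."""
    with nemesis_lock:
        if with_health_checks:
            time.sleep(health_check_seconds)  # health check before the nemesis
        time.sleep(0.01)  # the disruption itself
        if with_health_checks:
            time.sleep(health_check_seconds)  # health check after the nemesis


def next_test_can_lock(timeout: float) -> bool:
    """Model a later unit test trying to take the same lock."""
    acquired = nemesis_lock.acquire(timeout=timeout)
    if acquired:
        nemesis_lock.release()
    return acquired


def simulate(with_health_checks: bool) -> bool:
    """Start a nemesis thread, let the first test 'finish', then try the lock."""
    worker = threading.Thread(
        target=run_nemesis, args=(0.3, with_health_checks), daemon=True
    )
    worker.start()
    time.sleep(0.05)  # the first test ends while the thread may still be running
    result = next_test_can_lock(timeout=0.1)
    worker.join()
    return result


print("with health checks:", simulate(True))      # lock is still held
print("without health checks:", simulate(False))  # lock is already free
```

In this sketch the background thread outlives the test that started it, so with health checks enabled the next caller finds the lock held, which mirrors the failure mode described above.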
vponomaryov added a commit to vponomaryov/scylla-cluster-tests that referenced this pull request Jan 21, 2025
vponomaryov added a commit to vponomaryov/scylla-cluster-tests that referenced this pull request Jan 21, 2025
vponomaryov added a commit to vponomaryov/scylla-cluster-tests that referenced this pull request Jan 21, 2025
vponomaryov added a commit to vponomaryov/scylla-cluster-tests that referenced this pull request Jan 21, 2025
fruch pushed a commit that referenced this pull request Jan 21, 2025
scylladbbot pushed a commit to scylladbbot/scylla-cluster-tests that referenced this pull request Jan 21, 2025
(cherry picked from commit a78d65f)
scylladbbot pushed a commit to scylladbbot/scylla-cluster-tests that referenced this pull request Jan 21, 2025
(cherry picked from commit a78d65f)
scylladbbot pushed a commit to scylladbbot/scylla-cluster-tests that referenced this pull request Jan 21, 2025
(cherry picked from commit a78d65f)
scylladbbot pushed a commit to scylladbbot/scylla-cluster-tests that referenced this pull request Jan 21, 2025
(cherry picked from commit a78d65f)
fruch pushed a commit that referenced this pull request Jan 21, 2025
(cherry picked from commit a78d65f)
fruch pushed a commit that referenced this pull request Jan 21, 2025
(cherry picked from commit a78d65f)
@scylladbbot scylladbbot added backport/2024.2-done Commit backported to 2024.2 and removed backport/2024.2 Need backport to 2024.2 labels Jan 22, 2025
Labels
backport/6.1-done Commit backported to 6.1 backport/6.2-done backport/2024.1-done Commit backported to 2024.1 backport/2024.2-done Commit backported to 2024.2 promoted-to-master
5 participants