Starting scylla server and populating keyspace causes "Storage I/O error: 28: No space left on device" and shutdown #22020
Comments
@yarongilor - I'm not sure what the expected behavior was, without increasing the cluster size?
First, @yarongilor, I think that the title is not good enough.
@mykaul, the scenario description is rephrased to: run a setup to 90% disk utilization; the massive write load is stopped once it reaches 90%. So the bottom-line question is why starting the scylla service triggers something like 64 GB of additional disk space usage.
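For what it's worth, one way to see where that extra space lands once the service starts (a sketch; paths assume the default Scylla data layout) is to compare directory sizes before the shutdown and after the restart:
$ sudo du -sh /var/lib/scylla/data /var/lib/scylla/commitlog /var/lib/scylla/hints
$ df -h /var/lib/scylla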
It's all about populating keyspace1, and specifically the compaction splitting sstables:
A different (but possibly related?) problem was found on another run, during the restart_then_repair nemesis.
Based on the scenario you are describing, it should be reasonably easy to create a non-SCT reproducer for this issue, right?
Not sure it's that easy or readily available.
What would the problematic part be, limiting storage space? I think we could demonstrate the issue with lower utilization but a still-visible spike in storage space after restart.
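A rough sketch of what such a reproducer could look like outside SCT (the node address, stress parameters and timings are placeholders to be tuned, and the data path assumes the default Scylla layout):
# Fill the data disk towards the target utilization (repeat / adjust n= as needed):
$ cassandra-stress write n=100000000 -node 127.0.0.1 -rate threads=50
$ df -h /var/lib/scylla
# Stop scylla for 5 minutes, start it back, and watch disk usage afterwards:
$ sudo systemctl stop scylla-server
$ sleep 300
$ sudo systemctl start scylla-server
$ watch -n 30 df -h /var/lib/scylla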
Could be commitlog replay. Did you shut down the server cleanly? @elcallio, a clean shutdown marks all commitlog segments as free, does it not?
64 GB is exactly the amount of space allocated to the commitlog on i4i.4xlarge. However, half of it should be free under normal operation (and all of it after a clean shutdown).
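One way to gather data for the commitlog-replay theory (a sketch, assuming the default Scylla layout) is to record the commitlog directory contents right after the clean shutdown and again once the node is back up:
$ sudo ls -lh /var/lib/scylla/commitlog
$ sudo du -sh /var/lib/scylla/commitlog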
The shutdown was basically clean, like:
The issue is reproduced on another run, during a "rolling restart" nemesis.
We are constantly hitting #20054 here. For the original report: when data ingestion stops, which happens around 16:25, we have 4 tablets with an average size of 83.6 GB. This particular issue is mentioned in the test plan and can be overcome by setting a higher initial tablet count. Note also that it is not only node-3's storage utilization that gradually increased, but all of them:
Furthermore, node-1 also hit 100%, reported the error, and got restarted:
@yarongilor Can you re-run the test with an initial tablet count of 32/64 in order to avoid tablet splits? Once applied, I suspect it will work fine.
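For reference, a sketch of how a higher initial tablet count could be set when creating the stress keyspace (the tablets option syntax is version-dependent, and the keyspace name, replication settings and the count of 64 are assumptions to be adapted to the actual test setup):
$ cqlsh -e "CREATE KEYSPACE keyspace1 WITH replication = {'class': 'NetworkTopologyStrategy', 'replication_factor': 3} AND tablets = {'initial': 64};"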
The issue is not reproduced when setting 64 initial tablets.
That's great! I think we should increase the priority of #20054 to P0; this error proves that it can easily happen in production, and I do not think we can recommend tablets for production with this behaviour. @bhalevy @paszkow @swasik WDYT?
We indeed hit it constantly when testing 90% utilization. Note, however, that in all our tests we start from an empty table with 1 tablet per shard, while in production you will have more mature servers. So the problem might not be that severe out there. We shall definitely fix it.
This problem will be less severe in practice with ongoing work here: #22024 |
I am not sure we will; as far as I know, we don't have migration from vnodes, so customers wanting to try out tablets will need to start with a fresh cluster and thus have an increased chance of hitting this.
In that case, I think either #22024 or #20054 should be P0.
Packages
Scylla version: 2024.3.0~dev-20241215.811c9ccb7f91 with build-id cf31bbad95480fbbafa9cb498cf0a54cd58c7485
Kernel Version: 6.8.0-1020-aws
Issue description
Run a setup to 90% disk utilization. The massive write load is stopped once it reaches 90%.
Ran an SCT nemesis of disrupt_stop_wait_start_scylla_server, which stops scylla for 5 minutes and then starts it back. There was no significant load on the cluster at that time - mainly a read load and perhaps 3 writes per second.
So no disk utilization growth is expected at this time. Yet, once scylla started, the utilization started increasing as well, unexpectedly.
The scylla service is started and the node's disk utilization gradually increases until it reaches 100%, and the start command fails after 10 minutes.
Log of the node-3 event:
node-3 failure:
grafana shows the node's disk utilization climbing up to 100%:
SCT failure error event:
Impact
The node is down.
How frequently does it reproduce?
Installation details
Cluster size: 3 nodes (i4i.large)
Scylla Nodes used in this run:
OS / Image: ami-0a3508c8059b5dc39 (aws: undefined_region)
Test: byo-longevity-test-yg2
Test id: 2c23b329-4757-4f06-b60c-fc222590dcf4
Test name: scylla-staging/yarongilor/byo-longevity-test-yg2
Test method: longevity_test.LongevityTest.test_custom_time
Test config file(s):
Logs and commands
$ hydra investigate show-monitor 2c23b329-4757-4f06-b60c-fc222590dcf4
$ hydra investigate show-logs 2c23b329-4757-4f06-b60c-fc222590dcf4
Logs:
Jenkins job URL
Argus