perftune.py should allow rollback / revert the changes it made #2350
Comments
At a minimum, the script should save the settings before the tuning for a manual revert. While the sysctl and disk settings are easy to revert now, the IRQ ones are hard.
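For illustration only, a minimal sketch of what saving such a pre-tuning snapshot could look like - this is a hypothetical standalone helper, not part of perftune.py, and the exact set of sysctls and files the script touches is only an assumption here:

```python
#!/usr/bin/env python3
# Illustrative sketch only, not part of perftune.py: record the current values
# of the kinds of settings the script tunes, so they can be restored later.
import glob
import json
import pathlib

def read(path):
    try:
        return pathlib.Path(path).read_text().strip()
    except OSError:
        return None

# Which sysctls perftune.py actually touches is an assumption here;
# adjust the list to whatever your version of the script changes.
SYSCTLS = ["fs.aio-max-nr"]

snapshot = {
    # per-IRQ CPU affinity masks, e.g. /proc/irq/42/smp_affinity -> "ffffffff"
    "irq_affinity": {p: read(p) for p in glob.glob("/proc/irq/*/smp_affinity")},
    # default mask applied to newly registered IRQs
    "default_affinity": read("/proc/irq/default_smp_affinity"),
    # sysctl values are exposed under /proc/sys with dots turned into slashes
    "sysctl": {k: read("/proc/sys/" + k.replace(".", "/")) for k in SYSCTLS},
    # per-disk scheduler and nomerges settings
    "block": {p: read(p)
              for p in glob.glob("/sys/block/*/queue/scheduler")
              + glob.glob("/sys/block/*/queue/nomerges")},
}

pathlib.Path("perftune_snapshot.json").write_text(json.dumps(snapshot, indent=2))
```

A revert could then read this file back and re-apply the saved values.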
@vladzcloudius - thoughts?
Thanks for the quick response, @mykaul! Our Scylla configuration for the Scylla Operator is as follows:

datacenter: XXX
racks:
  - name: YYY
    scyllaConfig: "scylla-config"
    scyllaAgentConfig: "scylla-agent-config"
    members: 7
    storage:
      storageClassName: local-raid-disks
      capacity: 2200G # this is only the initial size, the actual is 3000G now (see https://github.com/scylladb/scylla-operator/issues/402)
    agentResources:
      # requests and limits here need to be equal to make Scylla have Guaranteed QoS class
      requests:
        cpu: 150m
        memory: 768M
      limits:
        cpu: 150m
        memory: 768M
    resources:
      # requests and limits here need to be equal to make Scylla have Guaranteed QoS class
      limits:
        cpu: 31
        memory: 108Gi
      requests:
        cpu: 31
        memory: 108Gi

The output with the changes it applied on one of the nodes looked like this: […]
We managed to revert the disk and sysctl settings on all nodes, but that alone didn't help. We reverted all the settings on some nodes by rebooting them. Then we wanted to revert the IRQ masks on all the nodes, but we realized that we didn't know what the settings were before, as on the rebooted nodes they are not simply set back to the defaults. Therefore ultimately we reverted everything by rebooting all the nodes.
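For what it's worth, if a pre-tuning snapshot like the one sketched above existed, putting the IRQ masks back could look roughly like this - again only an illustration that assumes the hypothetical perftune_snapshot.json format, not an existing tool:

```python
#!/usr/bin/env python3
# Illustrative sketch only: put saved IRQ affinity masks back from a snapshot
# file that maps /proc/irq/<n>/smp_affinity paths to their pre-tuning values.
import json
import pathlib

snapshot = json.loads(pathlib.Path("perftune_snapshot.json").read_text())

for path, mask in snapshot.get("irq_affinity", {}).items():
    if mask is None:
        continue
    try:
        # Equivalent of: echo <old mask> > /proc/irq/<n>/smp_affinity
        pathlib.Path(path).write_text(mask + "\n")
    except OSError as err:
        # Per-CPU IRQs, or IRQs that disappeared since the snapshot, reject this.
        print(f"skipping {path}: {err}")

default = snapshot.get("default_affinity")
if default:
    # Mask used for IRQs registered after the restore.
    pathlib.Path("/proc/irq/default_smp_affinity").write_text(default + "\n")
```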
The only still open GH issue out of the above that is related to perftune.py is […]. perftune.py is supposed to be quite trustworthy - especially if you use the version from the […]. I'm not familiar with any open bug related to […]. As to your request to revert the tuning: this would require backing up the configuration of all the values it tunes. This is a nice feature when you play with things.
Respectfully, this issue is related to perftune. It may not be very clearly visible on the screenshot, but our average write times have increased from ~500ms to ~3000ms (so became about 6× higher), while the 95th percentile has increased from ~2500ms to ~17500ms (about 7× higher)! The read times have been affected as well, although less painfully.
Well, we did use it and it broke our performance. Then it was very hard to revert the changes, as with the local SSDs on GKE the node restart caused the Scylla nodes to fall into a restart loop. We had to hack them into thinking they are replacing themselves, to make them start without bootstrapping as new nodes. It didn't work for one node and it did bootstrap, which took more than 10 hours. Overall we spent 3 days reverting the optimisations, so I think there is a need for a revert feature. We would be happy to help with this by providing some PRs, but we would probably need some guidance, maybe over Slack?
We used the version bundled with Scylla 5.4.9.
This should get its own issue (in Scylla) and we can look at it there, if we understand what changes were made (which I assume is doable, since you reverted them).
Sure, I can open an issue in https://github.com/scylladb/scylladb, if that's a more appropriate place.
Well, from the perftune logs (above) we know what the settings were after changing them, but we don't exactly know what they were before. That's the point.
@gdubicki you need to keep in mind that you should only (!!) use perftune.py in conjunction with the corresponding Scylla CPU pinning.
Was that the case?
If you mean using the static CPU manager policy with the Guaranteed QoS class, then we did that. See our config in this comment. But maybe it was wrong to allocate 31 cores for Scylla out of a 32-core machine? Should we leave some cores free here? 🤔
We have run perftune using the Scylla Operator (v. 1.13.0), so it's done in whatever way the operator does it.
I don't know, to be frank. We have just configured Scylla as in this comment, on an n2d-standard-32 node, and enabled the performance tuning that uses perftune.
I think you need to use 'cpuset' to ensure pods get a static CPU assignment.
It might be interesting that this was logged by Scylla when starting on the restarted nodes: […]

The first measurement looks about right for nodes with 8 local NVMe SSDs in GCP, but all the other results are very bad. Note that this is from the restarts done to disable the perftune optimizations.

...however, the values ultimately written to the config file look roughly like this on all the nodes: […]
...except one, which has substantially lower values for writes: […]

...but I suppose it's a measurement error.
According to this doc, this is done automatically, and for our nodes it is set like this right now: […]
Different io_properties.yaml files are interesting. Either it's some issue, or you got a lemon. That happens :-/
What bugs me is the question whether it is right to assign 31 cores on a 32-core machine to the Scylla pod? Shouldn’t I leave a bit more free for other workloads? (But note that these are dedicated nodes for Scylla; the only other workloads are other Scylla pods, Datadog and kube-system.)
You are asking the wrong question - the question is how many cores you should dedicate to network IRQ handling vs. Scylla cores. That's a ratio you need to ensure is reasonable. Scylla can work on fewer cores - it's up to you how many you wish to have. With very few cores, we don't even use dedicated cores for network processing. That's what perftune does (among other things). Specifically, 31 out of 32 doesn't make sense to me. Should be more to networking.
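Just to make the ratio point concrete, here is a toy calculation - my own illustrative rule of thumb, not perftune.py's actual heuristic (the script chooses between its mq/sq/sq_split modes based on the machine's topology):

```python
# Toy illustration only - not perftune.py's real heuristic.
def split_cores(total_cores: int, irq_share: float = 1 / 8) -> tuple[int, int]:
    """Reserve roughly one core in eight for network IRQ handling."""
    irq_cores = max(1, round(total_cores * irq_share))
    return irq_cores, total_cores - irq_cores

irq, compute = split_cores(32)
print(f"{irq} IRQ cores, {compute} Scylla cores")  # -> 4 IRQ cores, 28 Scylla cores
```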
This doesn't look correct indeed. May I see the content of […]?
The page above is a bit unclear about what needs to be run where, but allow me to reiterate: […]
Let me know if there are more questions I can help you with, @gdubicki?
Thanks @vladzcloudius!
I don't have a copy of it from before reverting the tuning, and now this file does not exist on my Scylla nodes.
I guess that we would need to ask the Scylla Operator team about whether this is done this way. cc @tnozicka |
Installation details
Scylla version (or git commit hash): 5.4.9
Cluster size: 7 nodes
OS (RHEL/CentOS/Ubuntu/AWS AMI): Ubuntu 22.04.4 LTS (GNU/Linux 5.15.0-1059-gke x86_64)
Hardware details (for performance issues)
Platform (physical/VM/cloud instance type/docker): GKE, v1.29.5-gke.1192000
Hardware: n2d-standard-32, min. CPU platform: AMD Milan
Disks: (SSD/HDD, count): 8 x local SSD
We have run perftune.py on our cluster for the first time, and after the changes were applied our Scylla read and write times jumped (the change was applied a bit before 21:00): […]

So far the only way we found to completely revert the changes was to restart the Scylla nodes, but it's a long and painful procedure.
I think that, especially as there are quite a lot of known issues with this tool (e.g. scylladb/scylladb#14873, scylladb/scylladb#10600, #1297, #1698, #1008 and maybe more), there should be a feature implemented in perftune.py to be able to revert to the defaults.