feat: k8s horizontal pod autoscaling #677
Conversation
Hey @gabor-boros, it's great to see that you are working on auto-scaling Open edX! I'm going to repeat the same arguments that I listed here: #675 (comment)
Hey @regisb, I'm glad you appreciate it! We could manage this change as a plugin; though, with all respect, I believe it should be part of Tutor's core, just like the K8s services. Although I'm not aware of edX/2U's or other providers' setups, I think autoscaling is pretty much a must-have feature. In the referenced comment you wrote the following:
The question is how we count users. Do you count a user as a literal person/company, or as an instance provisioned using Tutor? If the former, these settings probably won't be changed by 50% of users, but more like ~20-25%, as one of the core values of Kubernetes is the capability to manage resources at scale. On the other hand, if we count users as instances, the percentage is way higher (I would assume 60-70%), especially if we take into consideration that companies usually provision multiple instances. For example, at OpenCraft, we would provision about 100 instances per cluster per environment (staging and production).
@gabor-boros 👍 from my side

- I tested this by running `kubectl get hpa` on the staging cluster
- I read through the code
Having been in this same situation with a couple of PRs that either I or someone on my team worked on, allow me to chime in here. It looks like your policy is, by and large, "don't add new configuration settings to Tutor; use a plugin instead." That's all well and good, but please make that policy explicit: currently, the contributor documentation does not mention any of this, and it's not at all self-evident that people should default to not adding new configuration parameters. Also, the "will this be relevant to more than 30-50% of Tutor users" consideration that you apply in the comment on #675 strikes me as a bit of a footgun:
We had this discussion something like 2 years ago: should Tutor concern itself with Kubernetes at all, or should it just get out of the way and not care about container orchestration? If it's the latter, then the k8s support doesn't really belong in core in the first place. And if even the OpenCraft folks are apparently now coming round to using it, there is clearly substantial demand. So I would guess that leaves you with two options:
My vote would be for option 1. What do you think?
It's not so black and white, but in general this is correct, yes. Basically, adding a new setting to Tutor core is a red flag to me.
100% agree. The docs don't do a good job of explaining which changes are welcome in Tutor core, as opposed to plugins. I consider this an issue, so I just created #683.
The "30-50% rule" is really not a rule per se, and I should not have worded it in that way. I'll attempt to clarify: in my mind, the k8s integration in Tutor is already a plugin. In the future, I expect that many components that are currently part of tutor "core" will be separated as plugins. For instance, we might end up with tutor-openedx-compose and tutor-openedx-k8s, which will take care of the docker-compose and k8s integration of Open edX, respectively. With that in mind, the "30-50% rule" becomes: "does a large minority of It really boils down to the amount of new settings. If auto-scaling could be easily enabled with just one or two new settings, without any impact to users who do not need auto-scaling, then I would be extremely happy to merge this PR. I think auto-scaling is awesome! But I'm not prepared to add so much complexity to Tutor core. @gabor-boros I would like to reiterate that it would be fantastic if Tutor users could easily enable auto-scaling on Kubernetes. At this point, I think that the best way to achieve that is via a plugin, but I'm ready to change my mind if the implementation does not require so many new settings. What's your opinion? |
As far as I can tell, Grove was never meant to replace Tutor in any sense; instead, it utilizes Tutor as an "Open edX instance provisioner" (I found no better term for it). The reason OpenCraft members are collaborating more frequently is that the tool is getting close to production use, therefore we need to implement the features that are necessary for production -- like autoscaling.
> Turning it […]
Well, this could quickly lead to a chicken-and-egg problem. I can adjust the implementation to have 4 new settings (listed above), though the next thing people would probably want is fine-tuning the min/max pod count, and so on. Which is a valid use case, as no two instances are identical.
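To make the proliferation concrete: even a modest per-service min/max knob multiplies quickly. A hedged sketch, with setting names invented for illustration (these are not the four settings referred to above):

```python
# Hypothetical plugin-level defaults: two knobs per service already means
# eight settings for lms/cms and their workers -- before touching CPU or
# memory targets, scale-down policies, etc.
from tutor import hooks

hooks.Filters.CONFIG_DEFAULTS.add_items(
    [
        ("AUTOSCALING_LMS_MIN_REPLICAS", 1),
        ("AUTOSCALING_LMS_MAX_REPLICAS", 10),
        ("AUTOSCALING_LMS_WORKER_MIN_REPLICAS", 1),
        ("AUTOSCALING_LMS_WORKER_MAX_REPLICAS", 8),
        ("AUTOSCALING_CMS_MIN_REPLICAS", 1),
        ("AUTOSCALING_CMS_MAX_REPLICAS", 4),
        ("AUTOSCALING_CMS_WORKER_MIN_REPLICAS", 1),
        ("AUTOSCALING_CMS_WORKER_MAX_REPLICAS", 4),
    ]
)
```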
If I cannot convince you by the end of the week why it is essential for K8s users to have autoscaling with precise parameter configuration, I'll take a deep breath and create a plugin instead. But I feel that if a plugin is a must-have in most cases, the feature should be part of core. Take the mock library as an example. cc: @keithgg
@gabor-boros Thanks for this PR. At FIRAcademy, we have also tried enabling autoscaling in K8s, in a way that is a little bit similar to your solution. I once considered making a PR to Tutor, but I found it was difficult to standardize the configurations. The production environment is complex, and in ours we need even more settings. For example, when autoscaling happens, we want some of the pods to be created on the original K8s nodes, and others to be created on an elastic virtual node if the pod count exceeds the limit. We ended up creating a Tutor plugin to support autoscaling. I am not sure whether all the added settings make sense to others in their production environments, although they all make sense to me. I'm just putting my experience here for reference.
@pcliupc thank you for sharing your thoughts and experience! Is that plugin open source by any chance? If so, maybe we could collaborate on it.
I am not sure whether it is useful for you, as it relies on some Alibaba Cloud APIs. But you can find an initial version here: https://github.com/pcliupc/tutor-contrib-aliyunhpaplugin
The more I think about this, the more I believe this absolutely should live in Tutor core, if […]

However, if, as @regisb argues, […]
@fghaas I had the exact same discussion with @keithgg on Friday about the possible outcome of moving K8s to a standalone plugin. I couldn't agree more with you; the way you described it needs no further explanation, I believe. Having K8s support as a plugin would definitely result in a dependency tree. Since the dependency resolution would have to live in Tutor core, it would blow up the code base more than having k8s as a first-class citizen does. Especially considering how "easy" dependency management is, it would probably result in huge issues and bugs (look at pip, cargo, npm, etc., which took years of hard work to arrive at dependency resolution that isn't hated).
@regisb any input on this? |
@gabor-boros I think it doesn't make a ton of sense to have this discussion across several PRs (this one, #686, #675; there are probably more). There's a greater issue to be discussed here: the stance of "additional config values bad, plugins good" (when the plugin approach has the deficiency that we've discussed here) is quite likely to cause people to fork Tutor, run their local forks with additional config tweaks, and then we'll end up in the same mess that we had with the old Ansible-based "native" installation. If you agree with that concern of mine, then this is probably a good discussion to have on Discourse. What do you think?
@fghaas I don't think so either, though judging by the dates of the latest comments, there doesn't seem to be any conversation going on. At least not one that would help move things forward.
I couldn't agree more. Being transparent about my opinion: if Tutor goes that way, the so-called "dependency hell" will arrive soon, and there will be plugins-of-plugins-of-plugins-of..., which is even worse than having many forks.
I definitely do agree. It would be good to resolve this conversation, as it is blocking other work. For example, because of this we already have to use a fork -- since we don't know what the resolution of this conversation will be. Once it's resolved, we would be able to create a plugin/upstream the changes/something else, and get rid of the fork. Would you start the conversation there?
@gabor-boros How about you start it? I have pointed this problem out several times in the past and have gotten nowhere with it; it would be good to get some additional voices. In particular, it would be good if you pointed out that you're already running a fork, for the aforementioned reasons.
The way I see it, the dependency tree will mostly be handled by PyPI itself. It consists of just two elements: […]
I think that this implementation resolves the matter of dependency management in a rather simple way, wouldn't you agree?
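For illustration, this is presumably how the PyPI side would work: a plugin declares other plugins (and Tutor itself) as install requirements, and pip resolves the tree, so Tutor core never has to. The package and module names below are hypothetical:

```python
# setup.py for a hypothetical tutor-contrib-autoscaling plugin.
from setuptools import setup

setup(
    name="tutor-contrib-autoscaling",
    version="0.1.0",
    packages=["tutorautoscaling"],
    install_requires=[
        "tutor>=14.0.0,<15.0.0",
        # A plugin-on-plugin dependency: pip, not Tutor core, resolves it.
        "tutor-contrib-somebase>=1.0.0",
    ],
    entry_points={
        # Tutor discovers v1 plugins through this entry point group.
        "tutor.plugin.v1": ["autoscaling = tutorautoscaling.plugin"],
    },
)
```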
I'm OK with that, but I'm going to repeat there what I already stated here: there is no way that we are adding 32 new settings (+40%) to Tutor core to handle auto-scaling. I suspect that the conversation will rapidly turn into a debate about who will be in charge of maintaining this complex feature. I am not a volunteer. If someone is, they should create and maintain a plugin.
@gabor-boros I find your comment quite unfair. I have responded to all of your comments in a detailed manner. I also consider that I can take a couple of days off from responding to non-urgent GitHub comments, whether for professional or personal reasons. I'd like to remind you that Nutmeg was released on June 9th and that it represented a significant amount of work for me.
I'm sorry if you felt that way; my intention was not to be passive-aggressive at all. Simply, given the turnaround time for the initial comments, the slowed-down pace suggested a loss of interest in a compromise. Thank you for the reminder; I followed the changes closely to see what to prepare for. Speaking of changes and plugins:
I appreciate your detailed comments. Do you have anything on your radar about the extraction of k8s/docker-compose/etc.? By that I don't necessarily mean a timeline, but more like how you imagine the outsourcing would work. Who would maintain the plugin? Would it be something distributed by the community? Would it be under the overhangio GitHub org? Since this change seemingly won't be merged, what would be the next step that is aligned with your plans? And again, no passive-aggressiveness; I just want to see how I can unblock myself without maintaining a fork, while being able to give something back to the community that others will use too -- maybe @fghaas too.
It's not going to happen before September, unfortunately. I'm going to be off most of July and August. I hope to be able to allocate more time to Tutor refactoring after the summer holidays. Many design decisions will have to be made during the extraction, so I'd rather make the changes myself. This work is part of a broader epic to extract pretty much everything related to Open edX from Tutor and make Tutor a general-purpose tool for configuration management and deployment. See this issue for more information. The plugin would be open source (AGPLv3), hosted in the overhangio organization, and we would be looking for a maintainer (see the maintainers program). If we don't find one, I would be in charge.
I think that you/OpenCraft should step in and create a plugin, leveraging the work done in #686 (which we still need to merge).
(force-pushed from 4607ed0 to 600d5dc; commits signed off by Gabor Boros <gabor.brs@gmail.com>)
@regisb the approach has been refactored and now adds only the bare minimum of configuration. Could you please take a look when you have time?
I'm not sure that the new patches you introduce in this PR are necessary. Can you please have a look at #686 and check whether the "k8s-override" patch is enough for your needs?
Well, it is "enough" as it would allow the override of the whole K8s resource definition, though that's not my intention. I don't even would like to override the deployment's If I'm not mistaken, to extend the deployment's resources the way that #686 implements, I would need to redefine the majority of the deployment, however, I only want to extend its resource need definition. I see why you didn't want to introduce ~30 variables: increased maintenance. To see it from the other direction, if a plugin maintainer have to redefine the majority of a deployment, the plugin maintainer would be in the same shoes. I assume your goal is to keep Tutor slim, though, in my honest opinion, defining If you are aginst the patches because it is called |
As far as I understand, you do not need to redefine the entire deployment; this is the whole point of strategic merge patches (https://kubernetes.io/docs/tasks/manage-kubernetes-objects/kustomization/#customizing). Unless I'm mistaken, you would only have to write the following patch(es):
Can you please try this out and confirm? |
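The patch content itself isn't shown above. A minimal sketch of what such a strategic merge patch might look like, assuming #686's "k8s-override" patch feeds kustomize's patchesStrategicMerge and that the LMS container is named lms (both assumptions):

```python
from tutor import hooks

hooks.Filters.ENV_PATCHES.add_item(
    (
        "k8s-override",
        """
apiVersion: apps/v1
kind: Deployment
metadata:
  name: lms
spec:
  template:
    spec:
      containers:
        # "name" is the merge key, so only this container's resources are
        # merged in; the rest of the Deployment stays untouched.
        - name: lms
          resources:
            requests:
              memory: 2Gi
""",
    )
)
```

Because the container name acts as the merge key, only the resources block is overlaid; everything else in the existing Deployment is preserved.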
@regisb I can confirm it is working, so I'm closing this.
Description
This PR adds new patches to the LMS, CMS, LMS worker, and CMS worker deployments to control resource limits for HPA as necessary. It also allows setting the new CMS_MEMORY_REQUEST/LMS_MEMORY_REQUEST config values. (See the usage sketch after the lists below.)
Patches
- k8s-cms-deployment-resources
- k8s-cms-worker-deployment-resources
- k8s-lms-deployment-resources
- k8s-lms-worker-deployment-resources
Resource request config
- CMS_MEMORY_REQUEST
- LMS_MEMORY_REQUEST
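A hedged usage sketch of the above: the values are illustrative, and the exact YAML shape each patch point expects depends on where the templates render it, which this description doesn't show.

```python
from tutor import hooks

# Force values for the new memory-request settings introduced by this PR
# (a user could equally set these in config.yml).
hooks.Filters.CONFIG_OVERRIDES.add_items(
    [
        ("LMS_MEMORY_REQUEST", "2Gi"),
        ("CMS_MEMORY_REQUEST", "1Gi"),
    ]
)

# Extend the LMS container's resource definition through one of the new
# patch points, e.g. to add a memory limit on top of the request.
hooks.Filters.ENV_PATCHES.add_item(
    (
        "k8s-lms-deployment-resources",
        """
limits:
  memory: 4Gi
""",
    )
)
```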
Supporting information
The autoscaling is done by a plugin.
Dependencies
N/A
Testing instructions
Deadline
None
Other information
(Screenshots: before scaling, after scaling, memory usage.)