
feat: k8s horizontal pod autoscaling #677

Closed
wants to merge 2 commits

Conversation

@gabor-boros (Contributor) commented May 30, 2022

Description

This PR adds new patches to the LMS, CMS, LMS worker, and CMS worker deployments to control resource limits for HPA as necessary. It also allows setting the new CMS_MEMORY_REQUEST and LMS_MEMORY_REQUEST configuration values.

Patches

  • k8s-cms-deployment-resources
  • k8s-cms-worker-deployment-resources
  • k8s-lms-deployment-resources
  • k8s-lms-worker-deployment-resources

Resource request config

  • CMS_MEMORY_REQUEST
  • LMS_MEMORY_REQUEST
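For illustration, with these values set, the rendered deployments would carry a resources section along the following lines (a minimal sketch; the 512Mi value is only an example, and the exact output is defined by the patches above):

resources:
  requests:
    memory: 512Mi  # example value, taken from LMS_MEMORY_REQUEST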

Supporting information

The autoscaling is done by a plugin.
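For context, the kind of HorizontalPodAutoscaler manifest such a plugin would render looks roughly like this (a sketch using the standard autoscaling/v2 API; the replica counts and target utilization are illustrative, not values taken from the plugin):

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: lms
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: lms
  minReplicas: 1
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80

Note that utilization-based scaling only works if the target containers declare resource requests, which is why the CMS_MEMORY_REQUEST/LMS_MEMORY_REQUEST settings above are needed.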

Dependencies

N/A

Testing instructions

  1. Proofread the code
  2. Try setting custom resource requests
  3. Validate that the custom settings are applied

Deadline

None

Other information

before scaling: [screenshot, 2022-06-29 at 9:47:24]

after scaling: [screenshot, 2022-06-29 at 9:47:28]

memory usage: [screenshot, 2022-06-29 at 9:51:33]

@regisb (Contributor) commented May 30, 2022

Hey @gabor-boros, it's great to see that you are working on auto-scaling Open edX! I'm going to repeat the same arguments that I listed here: #675 (comment)
Basically, I'd rather avoid adding many settings to the Tutor configuration. Instead, I want to make it easy for anyone to auto-scale using a plugin -- that you would develop and maintain? It seems to me that almost all the changes that you propose could already be added to a plugin, except for the resource limitations. Would the approach that I describe in my comment above (using a "k8s/override.yml" file) be satisfactory to you?

@gabor-boros (Contributor Author):

Hey @regisb, I'm glad you appreciate it!

This change is something we could manage as a plugin, though, with all respect, I believe it should be part of Tutor's core, just as the K8s services are. Although I'm not aware of edX/2U's or other providers' setups, I think autoscaling is something of a must-have feature.

In the referenced comment you wrote the following:

Will a large proportion (~30-50%) of users actually want to fiddle with these settings?

The question is how we count users. Do you count a user as a literal person/company, or as an instance provisioned using Tutor? If the former, these settings probably won't be changed by 50% of users, but more likely by ~20-25%, since one of the core values of Kubernetes is the ability to manage resources at scale.

On the other hand, if we count users as instances, the percentage is way higher (I would assume 60-70%). Especially if we take into consideration that companies usually provision multiple instances. For example, at OpenCraft, we would provision about 100 instances per cluster per environment (staging and production).

@keithgg (Contributor) left a review comment:

@gabor-boros 👍 from my side

  • I tested this by running kubectl get hpa on the staging cluster
  • I read through the code

@fghaas (Contributor) commented Jun 7, 2022

@regisb

Hey @gabor-boros, it's great to see that you are working on auto-scaling Open edX! I'm going to repeat the same arguments that I listed here: #675 (comment)

Having been in this same situation with a couple of PRs that either I or someone on my team worked on, allow me to chime in here.

It looks like your policy is, by and large, "don't add new configuration settings to Tutor, use a plugin instead." That's all well and good, but please make that policy explicit. Currently, the contributor documentation does not mention any of this. It's not at all self-evident that people should default to not adding new configuration parameters.

Also, the "will this be relevant to more than 30-50% of Tutor users" consideration that you make in the comment on #675 strikes me as a bit of a footgun:

  • Currently, it's safe to say that fewer than 30-50% of Tutor users use Kubernetes.
  • Thus, no addition to tutor k8s functionality can be expected to meet that threshold. In other words, whatever someone wants to add to tutor k8s, you get to say no because "not enough users care about it".
  • Therefore, your policy keeps features that are valuable only to Kubernetes users from landing in core.
  • And that makes it unnecessarily difficult to use Tutor, meaningfully, in a Kubernetes context.

We had this discussion something like 2 years ago: should Tutor concern itself with Kubernetes at all, or should it just get out of the way and not care about container orchestration? If it's the latter, then tutor k8s has no business existing in Tutor core. But you made a conscious decision to keep it in Tutor core. So it's safe to assume that Tutor does want to concern itself with being easy to deploy and use in Kubernetes. (And I agree with that! Just to be clear.)

If even the OpenCraft folks are apparently now coming round to using tutor k8s rather than the Grove approach they previously tried, then that means that tutor k8s is more useful to more people.

So I would guess that that leaves you with two options:

  1. accept patches that are useful only to tutor k8s users, even if that means they will be helpful to less than 30% of Tutor users overall, or
  2. refactor tutor k8s so that it moves out of core and into a plugin.

My vote would be for option 1. What do you think?

@regisb (Contributor) commented Jun 7, 2022

It looks like your policy is, by and large, "don't add new configuration settings to Tutor, use a plugin instead."

It's not so black and white, but in general this is correct, yes. Basically, adding a new setting to Tutor core is a red flag to me.

That's all well and good, but please make that policy explicit.

100% agree. The docs don't do a good job of explaining which changes are welcome in Tutor core, as opposed to plugins. I consider this an issue, so I just created #683.

Also, the "will this be relevant to more than 30-50% of Tutor users" consideration that you make in the comment on #675 strikes me as a bit of a footgun:

  • Currently, it's safe to say that fewer than 30-50% of Tutor users use Kubernetes.

The "30-50% rule" is really not a rule per se, and I should not have worded it in that way. I'll attempt to clarify: in my mind, the k8s integration in Tutor is already a plugin. In the future, I expect that many components that are currently part of tutor "core" will be separated as plugins. For instance, we might end up with tutor-openedx-compose and tutor-openedx-k8s, which will take care of the docker-compose and k8s integration of Open edX, respectively.

With that in mind, the "30-50% rule" becomes: "does a large minority of tutor k8s users need this feature?" I may very well agree that auto-scaling is important for many people who run tutor k8s -- a majority, even. But there is absolutely no way that we are going to add 32 new settings to Tutor core to implement this feature, which will eventually be used by a tiny percentage of the global Tutor user population.

It really boils down to the number of new settings. If auto-scaling could be easily enabled with just one or two new settings, without any impact on users who do not need auto-scaling, then I would be extremely happy to merge this PR. I think auto-scaling is awesome! But I'm not prepared to add so much complexity to Tutor core.

@gabor-boros I would like to reiterate that it would be fantastic if Tutor users could easily enable auto-scaling on Kubernetes. At this point, I think that the best way to achieve that is via a plugin, but I'm ready to change my mind if the implementation does not require so many new settings. What's your opinion?

@gabor-boros (Contributor Author):

@fghaas

If even the OpenCraft folks are apparently now coming round to using tutor k8s rather than the Grove approach they previously tried, then that means that tutor k8s is more useful to more people.

As far as I can tell, Grove was never meant to replace Tutor in any sense; instead, it utilizes Tutor as an "Open edX instance provisioner" (I found no better term for it). The reason OpenCraft members are collaborating more frequently is that the tool is getting close to production use, so we need to implement the features that are necessary for production -- like autoscaling.

@regisb

I would like to reiterate that it would be fantastic if Tutor users could easily enable auto-scaling on Kubernetes.

Turning it on and off is easy; set CMS_AUTOSCALING, CMS_WORKER_AUTOSCALING, LMS_AUTOSCALING, or LMS_WORKER_AUTOSCALING to true depending on your needs.
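In config.yml terms (using the setting names proposed above; a sketch, not the final implementation), that would look like:

LMS_AUTOSCALING: true
LMS_WORKER_AUTOSCALING: true
CMS_AUTOSCALING: false
CMS_WORKER_AUTOSCALING: false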

I'm ready to change my mind if the implementation does not require so many new settings.

Well, this could quickly lead to a chicken-and-egg problem. I can adjust the implementation to have 4 new settings (listed above), though the next thing people will probably want is fine-tuning of the min/max pod counts, and so on. That is a valid use case, as no two instances are identical.

What's your opinion?

If I cannot convince you by the end of the week why it would be essential for K8s users to have autoscaling with precise parameter configuration, I'll take a deep breath and create a plugin instead. But I feel that if a plugin is a must-have in most cases, it should be a core feature. Take the mock library as an example: it was a must-have third-party package for so many projects that it was eventually absorbed into Python's standard library as unittest.mock.

cc: @keithgg

@pcliupc (Contributor) commented Jun 9, 2022

@gabor-boros Thanks for this PR. At FIRAcademy, we have also tried enabling autoscaling in K8s, in a way that is a little bit similar to your solution. I once considered making a PR to Tutor, but I found it difficult to standardize the configurations.

The production environment is complex, and in our production we need even more settings. For example, when autoscaling happens, we want some of the pods to be created on the original K8s nodes, and others to be created on an elastic virtual node if the pod count exceeds a limit. We ended up creating a Tutor plugin to support autoscaling. I am not sure whether all the added settings make sense to others in their production environments, although they all make sense to me.

I'm just putting my experience here for reference.

@gabor-boros (Contributor Author):

@pcliupc thank you for sharing your thoughts and experience! Is that plugin open-source by any chance? If so, we might be able to collaborate on it.

@pcliupc (Contributor) commented Jun 9, 2022

@pcliupc thank you for sharing your thoughts and experience! Is that plugin open-source by any chance? If so, we might be able to collaborate on it.

I am not sure whether it will be useful for you, as it relies on some Alibaba Cloud APIs. But you can find an initial version here: https://github.com/pcliupc/tutor-contrib-aliyunhpaplugin.

@fghaas (Contributor) commented Jun 10, 2022

The more I think about this, the more I believe this absolutely should live in Tutor core, if tutor k8s is to remain in core. Here's why:

  • There are already a bunch of plugins that add additional Kubernetes functionality. Some only add Jobs (or CronJobs), but others add full Deployments with Pods that have the same uptime requirements as the LMS and CMS pods.
  • Many of them will require autoscaling sooner or later.
  • It would be silly to expect them all to implement their autoscaling functionality separately. Such an approach would mean that all plugin authors end up reinventing (or at least reimplementing) very similar wheels.

However, if, as @regisb argues, tutor k8s already is more or less a plugin, and will be factored out of Tutor core in the future to become a "real" plugin, then where do we go from there? Do we create a plugin hierarchy, where some plugins depend on others and inherit their functionality? If so, how will we keep track of such a plugin tree? How do we express dependencies? Sure, we can have package dependencies such that installing a plugin package auto-installs other plugin packages it depends on, but how do we extend a similar logic to enabling plugins?

@gabor-boros (Contributor Author):

@fghaas I had the exact same discussion with @keithgg on Friday about the possible outcome of moving K8s to a standalone plugin.

I couldn't agree more with you; the way you described it needs no further explanation, I believe. Having K8s support as a plugin would definitely result in a dependency tree. Since the dependency-tree resolution must live in Tutor core, that would blow up the code base more than having K8s as a first-class citizen of it. Especially when we consider how "easy" dependency management is, it would probably result in huge issues and bugs (look at pip, cargo, npm, etc., which took years of hard work to arrive at dependency resolution that isn't hated).

@gabor-boros (Contributor Author):

@regisb any input on this?

@fghaas (Contributor) commented Jun 17, 2022

@gabor-boros I don't think it makes a ton of sense to have this discussion across several PRs (this one, #686, #675; there are probably more). There's a greater issue to be discussed here: the stance of "additional config values bad, plugins good" (when the plugin approach has the deficiency that we've discussed here) is quite likely to cause people to fork Tutor, run their local forks with additional config tweaks, and then we'll end up in the same mess that we had with the old Ansible-based "native" installation.

If you agree with that concern of mine, then this is probably a good discussion to be had on Discourse. What do you think?

@gabor-boros (Contributor Author):

@fghaas I don't think so either, though judging by the dates of the latest comments, there isn't much conversation going on. At least not one that would help move things forward.

There's a greater issue to be discussed here: the stance of "additional config values bad, plugins good" (when the plugin approach has the deficiency that we've discussed here) is quite likely to cause people to fork Tutor, run their local forks with additional config tweaks, and then we'll end up in the same mess that we had with old Ansible-based "native" installation.

I couldn't agree more. To be transparent about my opinion: if Tutor goes that way, the so-called "dependency hell" will arrive soon, and there will be plugins-of-plugins-of-plugins-of..., which is even worse than having many forks.

If you agree with that concern of mine, then this is probably a good discussion to be had on Discourse. What do you think?

I definitely do agree. It would be good to resolve this conversation, as it is blocking other work. For example, because of this we already have to use a fork, since we don't know what the resolution of this conversation will be. Once that's resolved, we would be able to create a plugin, upstream the changes, or something else -- and get rid of the fork.

Would you start the conversation there?

@fghaas (Contributor) commented Jun 20, 2022

@gabor-boros How about you start it? I have pointed this problem out several times in the past and have gotten nowhere; it would be good to get some additional voices. In particular, it would be good if you pointed out that you're already running a fork, for the aforementioned reasons.

@regisb (Contributor) commented Jun 20, 2022

I couldn't agree more with you; the way you described it needs no further explanation, I believe. Having K8s support as a plugin would definitely result in a dependency tree. Since the dependency-tree resolution must live in Tutor core, that would blow up the code base more than having K8s as a first-class citizen of it.

Sure, we can have package dependencies such that installing a plugin package auto-installs other plugin packages it depends on, but how do we extend a similar logic to enabling plugins?

The way I see it, the dependency tree will mostly be handled by PyPI itself. It consists of just two elements:

  1. Plugin requirements should be added to the plugin's setup.py (see for instance the tutor-ecommerce dependencies).
  2. In the plugin source, verify that the dependent plugins are enabled. If not, raise an exception.

I think that this implementation resolves the matter of dependency management in a rather simple way, wouldn't you agree?

If you agree with that concern of mine, then this is probably a good discussion to be had on Discourse. What do you think?

I'm OK with that, but I'm going to repeat there what I already stated here: there is no way that we are adding 32 new settings (+40%) to Tutor core to handle auto-scaling. I suspect that the conversation is rapidly going to turn into a debate on who will be in charge of maintaining this complex feature. I am not a volunteer. If someone is, they should create and maintain a plugin.

I don't think so either, though judging by the dates of the latest comments, there isn't much conversation going on.

@gabor-boros I find your comment quite unfair. I have responded to all of your comments in a detailed manner. I also consider that I can take a couple days off from responding to non-urgent GitHub comments, either for professional or personal reasons. I'd like to remind you that Nutmeg was released on June 9th and that it represented a significant amount of work for me.

@gabor-boros (Contributor Author):

I find your comment quite unfair. I have responded to all of your comments in a detailed manner. I also consider that I can take a couple days off from responding to non-urgent GitHub comments, either for professional or personal reasons. I'd like to remind you that Nutmeg was released on June 9th and that it represented a significant amount of work for me.

I'm sorry if you felt that way. My intention was not to be passive-aggressive at all.

Simply put, considering the turnaround time for the initial comments, the slower pace seemed to indicate a loss of interest in reaching a compromise.

Thank you for the reminder; I followed the changes closely to see what to prepare for.


Speaking of changes and plugins:

I'm OK with that, but I'm going to repeat there what I already stated here: there is no way that we are adding 32 new settings (+40%) to Tutor core to handle auto-scaling.

I appreciate your detailed comments. Do you have anything on your radar regarding the extraction of k8s/docker-compose/etc.? By that I don't necessarily mean a timeline, but rather how you imagine the extraction playing out. Who would maintain the plugin? Would it be something maintained by the community? Would it live under the overhangio GitHub org?

Since this change seemingly won't be merged, what would be the next step that is aligned with your plans?

And again, no passive-aggressiveness intended; I just want to see how I can unblock myself without maintaining a fork, while giving something back to the community that others will use too -- maybe @fghaas as well.

@regisb (Contributor) commented Jun 21, 2022

Do you have anything on your radar regarding the extraction of k8s/docker-compose/etc.? By that I don't necessarily mean a timeline, but rather how you imagine the extraction playing out.

It's not going to happen before September, unfortunately. I'm going to be off most of July and August. I hope to be able to allocate more time to Tutor refactoring after the summer holidays. Many design decisions will have to be made during the extraction, so I'd rather make the changes myself. This work is part of a broader epic to extract pretty much everything related to Open edX from Tutor and make Tutor a general-purpose tool for configuration management and deployment. See this issue for more information.

The plugin would be open source (AGPLv3), hosted in the overhangio organization, and we would be looking for a maintainer (see the maintainers program). If we don't find one, I would be in charge.

Since this change seemingly won't be merged, what would be the next step that is aligned with your plans?

I think that you/OpenCraft should step in and create a plugin, leveraging the work done in #686 (which we still need to merge).

@fghaas (Contributor) commented Jun 24, 2022

@regisb, speaking of #686, did you notice the updates from @foadlind?

2 commits added (Signed-off-by: Gabor Boros <gabor.brs@gmail.com>)
@gabor-boros (Contributor Author) commented Jun 29, 2022

@regisb the approach has been refactored and now adds only the bare minimum of configuration. Could you please take a look when you have time?

gabor-boros marked this pull request as ready for review on June 29, 2022 at 15:27.
@regisb (Contributor) commented Jul 4, 2022

I'm not sure that the new patches you introduce in this PR are necessary. Can you please have a look at #686 and check whether the "k8s-override" patch is enough for your needs?

@gabor-boros (Contributor Author):

@regisb

I'm not sure that the new patches you introduce in this PR are necessary. Can you please have a look at #686 and check whether the "k8s-override" patch is enough for your needs?

Well, it is "enough" in that it allows overriding the whole K8s resource definition, though that's not my intention. I don't want to override the deployment's resources at all, just extend them.

If I'm not mistaken, to extend the deployment's resources the way #686 implements it, I would need to redefine the majority of the deployment; however, I only want to extend its resource definition.

I see why you didn't want to introduce ~30 variables: increased maintenance. Seen from the other direction, though, if a plugin maintainer has to redefine the majority of a deployment, they end up in the same position.

I assume your goal is to keep Tutor slim, though, in my honest opinion, defining limits for a deployment's resources and extending its requests should have been part of the Tutor templates the day resource requests were added. I'm not complaining at all, since all of us have different needs, but limiting resource usage is an essential capability.

If you are against the patches because they are called *-resources, I'm more than happy to introduce new variables to allow defining CPU requests and resource limits, though that would introduce more variables than the patches I added here.

@regisb (Contributor) commented Jul 4, 2022

If I'm not mistaken, to extend the deployment's resources the way #686 implements it, I would need to redefine the majority of the deployment; however, I only want to extend its resource definition.

As far as I understand, you do not need to redefine the entire deployment; this is the whole point of strategic merge patches: https://kubernetes.io/docs/tasks/manage-kubernetes-objects/kustomization/#customizing

Unless I'm mistaken, you would only have to write the following patch(es):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: lms
spec:
  template:
    spec:
      containers:
        - name: lms
          resources: <put your changes here>
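For reference, such a patch could conceivably be shipped as a single-file YAML plugin that fills the "k8s-override" patch from #686 (a sketch assuming the usual name/version/patches plugin layout; the plugin name and memory value are made up):

name: lms-resources
version: 0.1.0
patches:
  k8s-override: |
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: lms
    spec:
      template:
        spec:
          containers:
            - name: lms
              resources:
                requests:
                  memory: 512Mi  # example value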

Can you please try this out and confirm?

@gabor-boros (Contributor Author):

@regisb I can confirm it is working, so closing this.
