Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OEP-45 :: ADR-002: Deploying Open edX on Kubernetes Using Helm #372

Merged
merged 7 commits into from
Feb 9, 2023

Conversation

bradenmacdonald
Copy link
Contributor

@bradenmacdonald bradenmacdonald commented Aug 19, 2022

This amends OEP-45 to say that we're planning to use Helm to create a standardized way to provision a k8s cluster so that it can host multiple Open edX instances.

@openedx-webhooks openedx-webhooks added the open-source-contribution PR author is not from Axim or 2U label Aug 19, 2022
@openedx-webhooks
Copy link

openedx-webhooks commented Aug 19, 2022

Thanks for the pull request, @bradenmacdonald! Please note that it may take us up to several weeks or months to complete a review and merge your PR.

Feel free to add as much of the following information to the ticket as you can:

  • supporting documentation
  • Open edX discussion forum threads
  • timeline information ("this must be merged by XX date", and why that is)
  • partner information ("this is a course on edx.org")
  • any other information that can help Product understand the context for the PR

All technical communication about the code itself will be done via the GitHub pull request interface. As a reminder, our process documentation is here.

Please let us know once your PR is ready for our review and all tests are green.

@bradenmacdonald bradenmacdonald marked this pull request as draft August 19, 2022 20:15
@natabene
Copy link

@bradenmacdonald Thank you for the contribution, I will be waiting for green light from you.

@felipemontoya felipemontoya self-requested a review August 30, 2022 14:10
Charts are build with the Go language templating engine which is a drawback since it means adding a new language to the ecosystem with a new learning curve. This is considered acceptable due to the large adoption of Helm in the devops community.


**Other**
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bradenmacdonald do you have anything else in mind for this section?

Building more Kubernetes support into Tutor
===========================================

Enhancing the current support for tutor core was not considered since it would go against one of the goals of the project as `explained by the author`_.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@regisb perhaps you would like to look at this. Am I correctly characterizing your position here?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this is correct. I gave some additional thoughts in this other comment: overhangio/tutor#677 (comment)

I still think it would be overkill to add so many (32!) new configuration settings to Tutor core to add features such as auto-scaling to Open edX. But I believe that these options would work great in a Tutor plugin.

My position has always been that, when possible, new non-essential/complex features should be added via Tutor plugins, and I encourage you to reconsider this option. By moving to Helm, you are cutting yourself from the Tutor plugin ecosystem. You will have to re-implement all of the Tutor templates, configuration settings and CLI. Not just for Tutor core, but for other plugins as well (ecommerce and discovery, among many others).

Together, we managed to implement an extension mechanism that allows anyone to modify the kustomize manifests and to easily implement the features you need (see this PR overhangio/tutor#686). I believe this is a great way to extend the capabilities of Open edX in Kubernetes and I do not quite understand why you don't use it.

It's quite possible that the current implementation of the Tutor plugin API makes it difficult to run Helm. For instance, I think that the Jinja2/Helm templating languages might be incompatible. But I would be more than happy to improve the plugin API to make it play well with Helm.

At the very least, I think that this OEP should include a detailed section of what cannot be achieved using Tutor plugins. It should also explain how you plan on maintaining compatibility with existing plugins.

I'm pretty sure that we could work around the difficulties without cutting ourselves from the rest of the plugin ecosystem.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the extended context. The approach you have taken with tutor has been very strict and I think you would not have the ease of install that you have so far without it. This OEP however is aimed more towards repeatability with installations that can not do without a good deal of tunning. This means instances with very large throughput, complex ingress for multi-tenant hosting and other restrictions on the kubernetes side like self hosted clusters that may not have backing of cloud services for the k8s api.

My position has always been that, when possible, new non-essential/complex features should be added via Tutor plugins, and I encourage you to reconsider this option. By moving to Helm, you are cutting yourself from the Tutor plugin ecosystem. You will have to re-implement all of the Tutor templates, configuration settings and CLI. Not just for Tutor core, but for other plugins as well (ecommerce and discovery, among many others).

It is not our goal to remove ourselves from the tutor ecosystem. We are active maintainers of plugins like tutor-xqueue and contrib-codejail and tools like TVM. We are using tutor for all our development and plan to continue doing so.
At the same time, for operations, we want fine grained control over every variable and we also want composability with every Helm subchart. Some of our goals (like multi tenant clusters) are specifically in what you decided to cut from the core as design principles of tutor core. Having Helm charts is a way for us to reach some of those goals while at the same time producing something some members of the community can also benefit from.

Together, we managed to implement an extension mechanism that allows anyone to modify the kustomize manifests and to easily implement the features you need (see this PR overhangio/tutor#686). I believe this is a great way to extend the capabilities of Open edX in Kubernetes and I do not quite understand why you don't use it.

Using layers to override parts of the manifest files was a part of what we did in the drydock plugin and we still had issues with composition of plugins and plugins overriding each other. The k8s-override patches are great and we will be using them in our plugins to better maintain the k8s capabilities.

At the very least, I think that this OEP should include a detailed section of what cannot be achieved using Tutor plugins. It should also explain how you plan on maintaining compatibility with existing plugins.

Given the flexibility that tutor allows into the plugin system, I do not think there is anything that cannot be achieved using tutor plugins.
Still as an operator of arrays of instances I see a lot of value in producing helm charts to manage them.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you manage to integrate your Helm charts with existing Tutor plugins, then I think that it would be a net gain. I'm looking forward to seeing the implementation. It is my opinion that plugin compatibility should be added to this OEP, to make it clear that it is an important topic. Your other arguments are also pretty reasonable: maybe also add them to the OEP? It would also help to clarify what is the intended audience for Helm charts: is it anyone running Open edX on k8s or just multi-tenant operators? Will it be possible for existing tutor k8s users to migrate to tutor helm?
Lastly: do you really need an OEP for that? I'm not opposed, but creating an OEP seems overkill to me.

Copy link
Contributor Author

@bradenmacdonald bradenmacdonald Sep 9, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@regisb I think these Helm charts will primarily focus on multi-tenant installations. From what I understand, Tutor is fundamentally designed to manage single instances at a time and regardless of what plugins you write, it's hard to do things like deploy a cluster with a shared database, shared load balancer, and other shared resources using Tutor.

That's why we're exploring using Tutor+plugins for everything up to the point of deploying onto a cluster (i.e. customizing, building images, doing upgrades, other management), but controlling the deployment onto the cluster using Helm.

Or alternately, maybe it would make sense to deploy the cluster (DBs, LB, monitoring, autoscaling) using Helm and let Tutor deploy each individual instance onto that cluster. In the latter case, I don't think a Tutor plugin to deploy the cluster (without instances yet) makes any sense? But certainly it would make sense to ensure that Tutor and its plugins are compatible with the Helm-provisioned multi-tenant clusters. From looking at an example of a Tutor plugin's k8s deployment spec, I don't see anything that would be incompatible.

Perhaps the way that load balancers work would need to be changed. There cannot be a central Caddyfile to configure a shared load balancer. Rather, each IDA needs to have a service annotation or Custom Resource which the shared, stateless load balancer detects and uses to automatically route traffic and provision HTTPS certs.

I'll need to investigate a bit more, it definitely seems that there's more to consider here.

Lastly: do you really need an OEP for that? I'm not opposed, but creating an OEP seems overkill to me.

I'm not sure either. Maybe it is overkill. It seems like if it's what most providers in the community are doing, we should document it somewhere and encourage consistency and collaboration. And currently other ops-related decisions are documented using OEPs.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just chiming in:

[...] what cannot be achieved using Tutor plugins [...]

This reminds me one thing, existing K8s resource definitions cannot be removed by using the strategic merge patching mechanism.

@antoviaque
Copy link
Contributor

@bradenmacdonald @felipemontoya Thank you again for putting this OEP together! What's needed to get it out of the draft state, and move forward with this?

@bradenmacdonald
Copy link
Contributor Author

@antoviaque @felipemontoya I think we need to do some more technical discovery to better understand how this proposal could be made compatible with Tutor's existing k8s support, so that people can take advantage of Tutor plugins in a multi-tenant k8s installation that's managed by Helm, without needing to modify or rewrite the k8s part of each plugin.

@antoviaque
Copy link
Contributor

@bradenmacdonald OK - do we have some room to do that now? I wouldn't want to lose the momentum we were building around collaborating on this, so let me know if we need to prioritize to find the time for it.

@bradenmacdonald
Copy link
Contributor Author

@antoviaque Yeah I think I can start poking away at it next week, but I don't really have a ton of time to spend on it.
What about you @felipemontoya ? Do you agree with my suggested next steps, or have something else in mind?

@felipemontoya
Copy link
Member

@bradenmacdonald I also have found difficulties to spend a lot of time in this. However the reasoning you wrote in a previous comment

That's why we're exploring using Tutor+plugins for everything up to the point of deploying onto a cluster (i.e. customizing, building images, doing upgrades, other management), but controlling the deployment onto the cluster using Helm.

is similar to what we have been doing so far. Tutor + plugins to get the cluster first installed and then maintaining is a different beast.

Or alternately, maybe it would make sense to deploy the cluster (DBs, LB, monitoring, autoscaling) using Helm and let Tutor deploy each individual instance onto that cluster.

I think this is an approach that we could explore simultaneously to the compatibility layer between tutor and helm.
I'll put some time aside this sprint to explore this concept further.

@MoisesGSalas fyi.

@bradenmacdonald
Copy link
Contributor Author

Update: I got some time this week so have started going some testing and prototyping. Will report back in a week or so.

@felipemontoya felipemontoya force-pushed the braden-felipe/kubernetes-helm branch from 8283284 to 8353fc1 Compare December 1, 2022 02:29
@felipemontoya
Copy link
Member

As per our conversation in the last sync meeting, I altered this PR to turn the draft OEP into an ADR belonging to OEP-45.

I also included the requirement that the Helm charts must be compatible with single openedx instances managed by tutor to address what was the most pressing concern of this proposal.

@felipemontoya felipemontoya force-pushed the braden-felipe/kubernetes-helm branch from 8353fc1 to 9aa3119 Compare December 1, 2022 02:34
@bradenmacdonald
Copy link
Contributor Author

Thanks @felipemontoya, looks good to me!

@felipemontoya felipemontoya changed the title OEP-59: Deploying Open edX on Kubernetes Using Helm OEP-45 :: ADR-002: Deploying Open edX on Kubernetes Using Helm Dec 5, 2022
@itsjeyd
Copy link

itsjeyd commented Dec 6, 2022

Hey @felipemontoya @bradenmacdonald, just checking in to see if this PR should still be in Draft status after the latest batch of changes?

@felipemontoya felipemontoya marked this pull request as ready for review December 6, 2022 19:18
@bradenmacdonald
Copy link
Contributor Author

@itsjeyd Sure, it's ready for review now. Though it's much more important that people test out and contribute to https://github.com/open-craft/tutor-contrib-multi (see the issues listed on that repo) to understand this project rather than just reading this small ADR.

@antoviaque
Copy link
Contributor

@sarina @e0d @itsjeyd As we get into starting the formal review phase of this OEP, are there any steps you would like us to take to follow proper procedure?

If you haven't already seen it, besides the contents of this OEP, you'll find the full context on this initiative on the forum threads:

During the meeting from https://discuss.openedx.org/t/deploying-open-edx-on-kubernetes-using-helm/8771 yesterday we discussed opening a longer review period than usual, until the next meeting on January 10th -- due to the holidays, and the need to use the review period to complete concrete technical validation steps. @felipemontoya will be announcing this review period on the forum shortly.

To also help making it clear that we intend to do this project as a shared collaborative effort within the framework of the Open edX project, it would also be useful to move the repository containing the code & tasks list to the github openedx organization - would it be possible to do this now?

@sarina
Copy link
Contributor

sarina commented Dec 7, 2022

Hi @antoviaque - I haven't been involved in this at all so I can't speak to it except for your specific question about moving the repo to the openedx org. For that please file a request ticket of type Systems Request and we will review it/discuss it there.

@antoviaque
Copy link
Contributor

@sarina Thanks for the quick reply! Ticket created: openedx/axim-engineering#577

@itsjeyd
Copy link

itsjeyd commented Dec 8, 2022

@antoviaque From the perspective of OSPR management the main thing to do will be to update the status of the PR to In Review on the contributions board. @mphilbrick211 Could you please take care of that?

That should be it, procedure-wise.

@felipemontoya felipemontoya force-pushed the braden-felipe/kubernetes-helm branch from c66e61c to f8256b3 Compare February 7, 2023 00:51
@felipemontoya
Copy link
Member

I changed the status to 'provisional'. Now I need an approval to be able to continue the merge process.

Copy link
Member

@felipemontoya felipemontoya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm leaving my approval here already.
However since I'm author of this and arbiter of the original of the OEP I think we should have two approvals on top of mine.

@bradenmacdonald @regisb @gabor-boros @jfavellar90

@felipemontoya
Copy link
Member

@arbrandes I forgot to tag you as well.

@itsjeyd itsjeyd added core committer and removed waiting on author PR author needs to resolve review requests, answer questions, fix tests, etc. labels Feb 8, 2023
Copy link
Member

@kdmccormick kdmccormick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@felipemontoya @bradenmacdonald Sorry I'm late to the game and thank you for this very helpful ADR. I have no blocking concerns, just a couple clarifying questions.


For Open edX operators who wish to deploy Open edX on Kubernetes, community-maintained `Helm Charts`_ will be used to simplify the process of deploying a kubernetes cluster capable of loading one or more Open edX instances.

Building on top of the `decision 0001`_ those single instances will in turn be managed by Tutor and the Tutor plugin ecosystem.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to check my understanding, the implication here is that we will not use to Tutor render Helm charts from Jinja2 templates, right? If so, I am glad to hear that. Templates-rendering-templates sounds a bit too much like the edx/configuration for my comfort.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the idea is to use Helm charts (with no involvement of Tutor) to deploy shared resources onto the cluster (load balancing with HTTPS, logging, monitoring, autoscaling, ES, Redis, maybe MySQL etc. too). And then once that cluster is provisioned, another workflow (which may vary per provider and is out of scope of this project for now) will use Tutor to build Open edX images and deploy instances onto the cluster using those images.

Wherever possible we want to use k8s primitives so that instances are "auto-detected" as they get deployed - e.g. each instance deploys K8S Ingress objects so that the central load balancer detects the deployment and provisions HTTPS certs and starts routing traffic automatically. Where this is not possible, a tutor plugin or init script will be used to do things like provision database users in the shared database as each instance is deployed.


For most of the history of the Open edX project, production deployments were done primarily using Virtual Machines that had been provisioned using the ansible scripts in the ``edx/configuration`` repository. With the adoption of the `decision 0001`_ from OEP-0045 that selects Tutor as the replacement for ``edx/configuration`` some large instances and operators are encountering use cases that are not supported. Multiple instances of Open edX sharing some infrastructure is the first of such cases.

A popular way to deploy containerized applications is to use `Kubernetes`_. However, Kubernetes has a lot of complexity so the overhead it brings is usually not worthwhile for small, singe-instance deployments. For that reason (as explained in OEP-45) as well as a general lack of experience with Kubernetes, to date there has been no officially recommended way to deploy Open edX on Kubernetes. Although Tutor does include basic support for Kubernetes, it is focused only on a basic working deployment, and lacks the flexibility and capability that operators require for large, multi-instance deployments.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding:

large, multi-instance

Do you think this initiative would also be useful to an organization with large, single-instance deployments? In particular, I am thinking of 2U and edx.org. (@jmbowman)

Copy link
Contributor Author

@bradenmacdonald bradenmacdonald Feb 8, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We hope that there is significant overlap and that we can align on best practices, but there will likely be some differences in the approach due to the unique requirements that edx.org has and the single-instance nature.

But yes, although it's not an explicit design goal, I do hope that for anyone who wants to deploy a single highly-scalable instance onto their k8s cluster, that this Helm chart will still be the best approach for that in general too.

@bradenmacdonald
Copy link
Contributor Author

@felipemontoya

However since I'm author of this and arbiter of the original of the OEP I think we should have two approvals on top of mine.

👍🏻 I also approve but since GitHub considers me the opener of this PR I can't do it formally.

Copy link
Member

@kdmccormick kdmccormick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm no expert in any of this, but the ADR is easy to follow and sounds very well-reasoned, so on that basis I give it a 👍🏻

@felipemontoya
Copy link
Member

Thanks to everyone that commented, requested changes and participated in the conversacion.With the current approvals and the 'provisional' status we are ready to merge this.

@felipemontoya felipemontoya merged commit c5cf3fc into master Feb 9, 2023
@felipemontoya felipemontoya deleted the braden-felipe/kubernetes-helm branch February 9, 2023 21:21
@openedx-webhooks
Copy link

@bradenmacdonald 🎉 Your pull request was merged! Please take a moment to answer a two question survey so we can improve your experience in the future.

@angonz
Copy link

angonz commented Feb 10, 2023

As discussed in our meeting last tuesday, Feb 7th, I suggest changing the title of the ADR. The current title suggests that we are trying to install Open edX using helm charts, and it's not the idea.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
open-source-contribution PR author is not from Axim or 2U
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.