OEP-45 :: ADR-002: Deploying Open edX on Kubernetes Using Helm #372

bradenmacdonald · 2022-08-19T20:15:06Z

This amends OEP-45 to say that we're planning to use Helm to create a standardized way to provision a k8s cluster so that it can host multiple Open edX instances.

openedx-webhooks · 2022-08-19T20:15:10Z

Thanks for the pull request, @bradenmacdonald! Please note that it may take us up to several weeks or months to complete a review and merge your PR.

Feel free to add as much of the following information to the ticket as you can:

supporting documentation
Open edX discussion forum threads
timeline information ("this must be merged by XX date", and why that is)
partner information ("this is a course on edx.org")
any other information that can help Product understand the context for the PR

All technical communication about the code itself will be done via the GitHub pull request interface. As a reminder, our process documentation is here.

Please let us know once your PR is ready for our review and all tests are green.

natabene · 2022-08-22T14:42:17Z

@bradenmacdonald Thank you for the contribution, I will be waiting for green light from you.

felipemontoya · 2022-08-30T19:37:21Z

oeps/architectural-decisions/oep-0059-arch-kubernetes-helm.rst

+Charts are build with the Go language templating engine which is a drawback since it means adding a new language to the ecosystem with a new learning curve. This is considered acceptable due to the large adoption of Helm in the devops community.
+
+
+**Other**


@bradenmacdonald do you have anything else in mind for this section?

felipemontoya · 2022-08-30T19:38:17Z

oeps/architectural-decisions/oep-0059-arch-kubernetes-helm.rst

+Building more Kubernetes support into Tutor
+===========================================
+
+Enhancing the current support for tutor core was not considered since it would go against one of the goals of the project as `explained by the author`_.


@regisb perhaps you would like to look at this. Am I correctly characterizing your position here?

Yes, this is correct. I gave some additional thoughts in this other comment: overhangio/tutor#677 (comment)

I still think it would be overkill to add so many (32!) new configuration settings to Tutor core to add features such as auto-scaling to Open edX. But I believe that these options would work great in a Tutor plugin.

My position has always been that, when possible, new non-essential/complex features should be added via Tutor plugins, and I encourage you to reconsider this option. By moving to Helm, you are cutting yourself from the Tutor plugin ecosystem. You will have to re-implement all of the Tutor templates, configuration settings and CLI. Not just for Tutor core, but for other plugins as well (ecommerce and discovery, among many others).

Together, we managed to implement an extension mechanism that allows anyone to modify the kustomize manifests and to easily implement the features you need (see this PR overhangio/tutor#686). I believe this is a great way to extend the capabilities of Open edX in Kubernetes and I do not quite understand why you don't use it.

It's quite possible that the current implementation of the Tutor plugin API makes it difficult to run Helm. For instance, I think that the Jinja2/Helm templating languages might be incompatible. But I would be more than happy to improve the plugin API to make it play well with Helm.

At the very least, I think that this OEP should include a detailed section of what cannot be achieved using Tutor plugins. It should also explain how you plan on maintaining compatibility with existing plugins.

I'm pretty sure that we could work around the difficulties without cutting ourselves from the rest of the plugin ecosystem.

Thank you for the extended context. The approach you have taken with tutor has been very strict and I think you would not have the ease of install that you have so far without it. This OEP however is aimed more towards repeatability with installations that can not do without a good deal of tunning. This means instances with very large throughput, complex ingress for multi-tenant hosting and other restrictions on the kubernetes side like self hosted clusters that may not have backing of cloud services for the k8s api.

My position has always been that, when possible, new non-essential/complex features should be added via Tutor plugins, and I encourage you to reconsider this option. By moving to Helm, you are cutting yourself from the Tutor plugin ecosystem. You will have to re-implement all of the Tutor templates, configuration settings and CLI. Not just for Tutor core, but for other plugins as well (ecommerce and discovery, among many others).

It is not our goal to remove ourselves from the tutor ecosystem. We are active maintainers of plugins like tutor-xqueue and contrib-codejail and tools like TVM. We are using tutor for all our development and plan to continue doing so.
At the same time, for operations, we want fine grained control over every variable and we also want composability with every Helm subchart. Some of our goals (like multi tenant clusters) are specifically in what you decided to cut from the core as design principles of tutor core. Having Helm charts is a way for us to reach some of those goals while at the same time producing something some members of the community can also benefit from.

Together, we managed to implement an extension mechanism that allows anyone to modify the kustomize manifests and to easily implement the features you need (see this PR overhangio/tutor#686). I believe this is a great way to extend the capabilities of Open edX in Kubernetes and I do not quite understand why you don't use it.

Using layers to override parts of the manifest files was a part of what we did in the drydock plugin and we still had issues with composition of plugins and plugins overriding each other. The k8s-override patches are great and we will be using them in our plugins to better maintain the k8s capabilities.

At the very least, I think that this OEP should include a detailed section of what cannot be achieved using Tutor plugins. It should also explain how you plan on maintaining compatibility with existing plugins.

Given the flexibility that tutor allows into the plugin system, I do not think there is anything that cannot be achieved using tutor plugins.
Still as an operator of arrays of instances I see a lot of value in producing helm charts to manage them.

If you manage to integrate your Helm charts with existing Tutor plugins, then I think that it would be a net gain. I'm looking forward to seeing the implementation. It is my opinion that plugin compatibility should be added to this OEP, to make it clear that it is an important topic. Your other arguments are also pretty reasonable: maybe also add them to the OEP? It would also help to clarify what is the intended audience for Helm charts: is it anyone running Open edX on k8s or just multi-tenant operators? Will it be possible for existing tutor k8s users to migrate to tutor helm?
Lastly: do you really need an OEP for that? I'm not opposed, but creating an OEP seems overkill to me.

@regisb I think these Helm charts will primarily focus on multi-tenant installations. From what I understand, Tutor is fundamentally designed to manage single instances at a time and regardless of what plugins you write, it's hard to do things like deploy a cluster with a shared database, shared load balancer, and other shared resources using Tutor.

That's why we're exploring using Tutor+plugins for everything up to the point of deploying onto a cluster (i.e. customizing, building images, doing upgrades, other management), but controlling the deployment onto the cluster using Helm.

Or alternately, maybe it would make sense to deploy the cluster (DBs, LB, monitoring, autoscaling) using Helm and let Tutor deploy each individual instance onto that cluster. In the latter case, I don't think a Tutor plugin to deploy the cluster (without instances yet) makes any sense? But certainly it would make sense to ensure that Tutor and its plugins are compatible with the Helm-provisioned multi-tenant clusters. From looking at an example of a Tutor plugin's k8s deployment spec, I don't see anything that would be incompatible.

Perhaps the way that load balancers work would need to be changed. There cannot be a central Caddyfile to configure a shared load balancer. Rather, each IDA needs to have a service annotation or Custom Resource which the shared, stateless load balancer detects and uses to automatically route traffic and provision HTTPS certs.

I'll need to investigate a bit more, it definitely seems that there's more to consider here.

Lastly: do you really need an OEP for that? I'm not opposed, but creating an OEP seems overkill to me.

I'm not sure either. Maybe it is overkill. It seems like if it's what most providers in the community are doing, we should document it somewhere and encourage consistency and collaboration. And currently other ops-related decisions are documented using OEPs.

Just chiming in:

[...] what cannot be achieved using Tutor plugins [...]

This reminds me one thing, existing K8s resource definitions cannot be removed by using the strategic merge patching mechanism.

antoviaque · 2022-10-18T01:50:18Z

@bradenmacdonald @felipemontoya Thank you again for putting this OEP together! What's needed to get it out of the draft state, and move forward with this?

bradenmacdonald · 2022-10-19T17:23:56Z

@antoviaque @felipemontoya I think we need to do some more technical discovery to better understand how this proposal could be made compatible with Tutor's existing k8s support, so that people can take advantage of Tutor plugins in a multi-tenant k8s installation that's managed by Helm, without needing to modify or rewrite the k8s part of each plugin.

antoviaque · 2022-10-20T01:58:07Z

@bradenmacdonald OK - do we have some room to do that now? I wouldn't want to lose the momentum we were building around collaborating on this, so let me know if we need to prioritize to find the time for it.

bradenmacdonald · 2022-10-21T23:40:43Z

@antoviaque Yeah I think I can start poking away at it next week, but I don't really have a ton of time to spend on it.
What about you @felipemontoya ? Do you agree with my suggested next steps, or have something else in mind?

felipemontoya · 2022-10-26T21:05:42Z

@bradenmacdonald I also have found difficulties to spend a lot of time in this. However the reasoning you wrote in a previous comment

That's why we're exploring using Tutor+plugins for everything up to the point of deploying onto a cluster (i.e. customizing, building images, doing upgrades, other management), but controlling the deployment onto the cluster using Helm.

is similar to what we have been doing so far. Tutor + plugins to get the cluster first installed and then maintaining is a different beast.

Or alternately, maybe it would make sense to deploy the cluster (DBs, LB, monitoring, autoscaling) using Helm and let Tutor deploy each individual instance onto that cluster.

I think this is an approach that we could explore simultaneously to the compatibility layer between tutor and helm.
I'll put some time aside this sprint to explore this concept further.

@MoisesGSalas fyi.

bradenmacdonald · 2022-11-03T00:12:34Z

Update: I got some time this week so have started going some testing and prototyping. Will report back in a week or so.

felipemontoya · 2022-12-01T02:31:23Z

As per our conversation in the last sync meeting, I altered this PR to turn the draft OEP into an ADR belonging to OEP-45.

I also included the requirement that the Helm charts must be compatible with single openedx instances managed by tutor to address what was the most pressing concern of this proposal.

bradenmacdonald · 2022-12-02T21:31:39Z

Thanks @felipemontoya, looks good to me!

itsjeyd · 2022-12-06T13:26:57Z

Hey @felipemontoya @bradenmacdonald, just checking in to see if this PR should still be in Draft status after the latest batch of changes?

bradenmacdonald · 2022-12-06T20:01:06Z

@itsjeyd Sure, it's ready for review now. Though it's much more important that people test out and contribute to https://github.com/open-craft/tutor-contrib-multi (see the issues listed on that repo) to understand this project rather than just reading this small ADR.

antoviaque · 2022-12-07T20:29:56Z

@sarina @e0d @itsjeyd As we get into starting the formal review phase of this OEP, are there any steps you would like us to take to follow proper procedure?

If you haven't already seen it, besides the contents of this OEP, you'll find the full context on this initiative on the forum threads:

During the meeting from https://discuss.openedx.org/t/deploying-open-edx-on-kubernetes-using-helm/8771 yesterday we discussed opening a longer review period than usual, until the next meeting on January 10th -- due to the holidays, and the need to use the review period to complete concrete technical validation steps. @felipemontoya will be announcing this review period on the forum shortly.

To also help making it clear that we intend to do this project as a shared collaborative effort within the framework of the Open edX project, it would also be useful to move the repository containing the code & tasks list to the github openedx organization - would it be possible to do this now?

sarina · 2022-12-07T20:36:34Z

Hi @antoviaque - I haven't been involved in this at all so I can't speak to it except for your specific question about moving the repo to the openedx org. For that please file a request ticket of type Systems Request and we will review it/discuss it there.

antoviaque · 2022-12-07T20:46:46Z

@sarina Thanks for the quick reply! Ticket created: openedx/axim-engineering#577

itsjeyd · 2022-12-08T15:37:06Z

@antoviaque From the perspective of OSPR management the main thing to do will be to update the status of the PR to In Review on the contributions board. @mphilbrick211 Could you please take care of that?

That should be it, procedure-wise.

Co-authored-by: Adolfo R. Brandes <adolfo@opencraft.com>

felipemontoya · 2023-02-07T00:53:34Z

I changed the status to 'provisional'. Now I need an approval to be able to continue the merge process.

felipemontoya

I'm leaving my approval here already.
However since I'm author of this and arbiter of the original of the OEP I think we should have two approvals on top of mine.

@bradenmacdonald @regisb @gabor-boros @jfavellar90

felipemontoya · 2023-02-07T18:06:17Z

@arbrandes I forgot to tag you as well.

kdmccormick

@felipemontoya @bradenmacdonald Sorry I'm late to the game and thank you for this very helpful ADR. I have no blocking concerns, just a couple clarifying questions.

kdmccormick · 2023-02-08T13:33:29Z

oeps/architectural-decisions/oep-0045/decisions/0002-helm-for-multi-instance-k8s.rst

+
+For Open edX operators who wish to deploy Open edX on Kubernetes, community-maintained `Helm Charts`_ will be used to simplify the process of deploying a kubernetes cluster capable of loading one or more Open edX instances.
+
+Building on top of the `decision 0001`_ those single instances will in turn be managed by Tutor and the Tutor plugin ecosystem.


Just to check my understanding, the implication here is that we will not use to Tutor render Helm charts from Jinja2 templates, right? If so, I am glad to hear that. Templates-rendering-templates sounds a bit too much like the edx/configuration for my comfort.

Yes, the idea is to use Helm charts (with no involvement of Tutor) to deploy shared resources onto the cluster (load balancing with HTTPS, logging, monitoring, autoscaling, ES, Redis, maybe MySQL etc. too). And then once that cluster is provisioned, another workflow (which may vary per provider and is out of scope of this project for now) will use Tutor to build Open edX images and deploy instances onto the cluster using those images.

Wherever possible we want to use k8s primitives so that instances are "auto-detected" as they get deployed - e.g. each instance deploys K8S Ingress objects so that the central load balancer detects the deployment and provisions HTTPS certs and starts routing traffic automatically. Where this is not possible, a tutor plugin or init script will be used to do things like provision database users in the shared database as each instance is deployed.

kdmccormick · 2023-02-08T13:36:16Z

oeps/architectural-decisions/oep-0045/decisions/0002-helm-for-multi-instance-k8s.rst

+
+For most of the history of the Open edX project, production deployments were done primarily using Virtual Machines that had been provisioned using the ansible scripts in the ``edx/configuration`` repository. With the adoption of the `decision 0001`_ from OEP-0045 that selects Tutor as the replacement for ``edx/configuration`` some large instances and operators are encountering use cases that are not supported. Multiple instances of Open edX sharing some infrastructure is the first of such cases.
+
+A popular way to deploy containerized applications is to use `Kubernetes`_. However, Kubernetes has a lot of complexity so the overhead it brings is usually not worthwhile for small, singe-instance deployments. For that reason (as explained in OEP-45) as well as a general lack of experience with Kubernetes, to date there has been no officially recommended way to deploy Open edX on Kubernetes. Although Tutor does include basic support for Kubernetes, it is focused only on a basic working deployment, and lacks the flexibility and capability that operators require for large, multi-instance deployments.


Regarding:

large, multi-instance

Do you think this initiative would also be useful to an organization with large, single-instance deployments? In particular, I am thinking of 2U and edx.org. (@jmbowman)

We hope that there is significant overlap and that we can align on best practices, but there will likely be some differences in the approach due to the unique requirements that edx.org has and the single-instance nature.

But yes, although it's not an explicit design goal, I do hope that for anyone who wants to deploy a single highly-scalable instance onto their k8s cluster, that this Helm chart will still be the best approach for that in general too.

bradenmacdonald · 2023-02-08T20:18:38Z

@felipemontoya

However since I'm author of this and arbiter of the original of the OEP I think we should have two approvals on top of mine.

👍🏻 I also approve but since GitHub considers me the opener of this PR I can't do it formally.

kdmccormick

I'm no expert in any of this, but the ADR is easy to follow and sounds very well-reasoned, so on that basis I give it a 👍🏻

felipemontoya · 2023-02-09T21:18:10Z

Thanks to everyone that commented, requested changes and participated in the conversacion.With the current approvals and the 'provisional' status we are ready to merge this.

openedx-webhooks · 2023-02-09T21:21:08Z

@bradenmacdonald 🎉 Your pull request was merged! Please take a moment to answer a two question survey so we can improve your experience in the future.

angonz · 2023-02-10T11:52:01Z

As discussed in our meeting last tuesday, Feb 7th, I suggest changing the title of the ADR. The current title suggests that we are trying to install Open edX using helm charts, and it's not the idea.

openedx-webhooks added the open-source-contribution PR author is not from Axim or 2U label Aug 19, 2022

bradenmacdonald marked this pull request as draft August 19, 2022 20:15

felipemontoya self-requested a review August 30, 2022 14:10

felipemontoya reviewed Aug 30, 2022

View reviewed changes

felipemontoya force-pushed the braden-felipe/kubernetes-helm branch from 8283284 to 8353fc1 Compare December 1, 2022 02:29

felipemontoya force-pushed the braden-felipe/kubernetes-helm branch from 8353fc1 to 9aa3119 Compare December 1, 2022 02:34

felipemontoya changed the title ~~OEP-59: Deploying Open edX on Kubernetes Using Helm~~ OEP-45 :: ADR-002: Deploying Open edX on Kubernetes Using Helm Dec 5, 2022

felipemontoya marked this pull request as ready for review December 6, 2022 19:18

antoviaque mentioned this pull request Dec 7, 2022

Move repo to openedx org openedx/openedx-k8s-harmony#6

Closed

This was referenced Dec 7, 2022

Announce formal review period of the proposal openedx/openedx-k8s-harmony#8

Closed

[GH Request] Moving the repository containing the Kubernetes/Helm deployment support openedx/axim-engineering#577

Closed

feanil assigned antoviaque and bradenmacdonald Dec 22, 2022

This was referenced Jan 30, 2023

Adolfo's reviews (2023-01-23) openedx/axim-engineering#620

Closed

docs: reducing complexity openedx-unsupported/frontend-lib-content-components#220

Closed

itsjeyd added waiting on author PR author needs to resolve review requests, answer questions, fix tests, etc. core committer labels Jan 31, 2023

openedx-webhooks removed the core committer label Feb 3, 2023

arbrandes mentioned this pull request Feb 6, 2023

Adolfo's reviews (2023-02-06) openedx/axim-engineering#655

Closed

22 tasks

bradenmacdonald and others added 7 commits February 6, 2023 19:46

docs: A new OEP related to deploying Open edX on Kubernetes with Helm

815d0c9

docs: Adding Helm context and other alternatives considered

3131d37

docs: moving oep-59 to decisions of oep-45

5a60ebd

docs: altering format to match ADR

2bd0f2b

docs: renaming ADR file

60775b8

docs: addressing feedback from the PR and discuss forum

bcece17

docs: setting the status as provisional

f8256b3

Co-authored-by: Adolfo R. Brandes <adolfo@opencraft.com>

felipemontoya force-pushed the braden-felipe/kubernetes-helm branch from c66e61c to f8256b3 Compare February 7, 2023 00:51

felipemontoya approved these changes Feb 7, 2023

View reviewed changes

itsjeyd added core committer and removed waiting on author PR author needs to resolve review requests, answer questions, fix tests, etc. labels Feb 8, 2023

kdmccormick reviewed Feb 8, 2023

View reviewed changes

openedx-webhooks removed the core committer label Feb 8, 2023

kdmccormick approved these changes Feb 8, 2023

View reviewed changes

jfavellar90 approved these changes Feb 9, 2023

View reviewed changes

felipemontoya merged commit c5cf3fc into master Feb 9, 2023

felipemontoya deleted the braden-felipe/kubernetes-helm branch February 9, 2023 21:21

sarina mentioned this pull request Jan 31, 2025

Update OEP-45 ADR-2 to reflect current reality #668

Open

		Charts are build with the Go language templating engine which is a drawback since it means adding a new language to the ecosystem with a new learning curve. This is considered acceptable due to the large adoption of Helm in the devops community.


		Other


		For Open edX operators who wish to deploy Open edX on Kubernetes, community-maintained `Helm Charts`_ will be used to simplify the process of deploying a kubernetes cluster capable of loading one or more Open edX instances.

		Building on top of the `decision 0001`_ those single instances will in turn be managed by Tutor and the Tutor plugin ecosystem.


		For most of the history of the Open edX project, production deployments were done primarily using Virtual Machines that had been provisioned using the ansible scripts in the ``edx/configuration`` repository. With the adoption of the `decision 0001`_ from OEP-0045 that selects Tutor as the replacement for ``edx/configuration`` some large instances and operators are encountering use cases that are not supported. Multiple instances of Open edX sharing some infrastructure is the first of such cases.

		A popular way to deploy containerized applications is to use `Kubernetes`_. However, Kubernetes has a lot of complexity so the overhead it brings is usually not worthwhile for small, singe-instance deployments. For that reason (as explained in OEP-45) as well as a general lack of experience with Kubernetes, to date there has been no officially recommended way to deploy Open edX on Kubernetes. Although Tutor does include basic support for Kubernetes, it is focused only on a basic working deployment, and lacks the flexibility and capability that operators require for large, multi-instance deployments.

OEP-45 :: ADR-002: Deploying Open edX on Kubernetes Using Helm #372

OEP-45 :: ADR-002: Deploying Open edX on Kubernetes Using Helm #372

Conversation

bradenmacdonald commented Aug 19, 2022 • edited Loading

openedx-webhooks commented Aug 19, 2022 • edited Loading

natabene commented Aug 22, 2022

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

bradenmacdonald Sep 9, 2022 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

antoviaque commented Oct 18, 2022

bradenmacdonald commented Oct 19, 2022

antoviaque commented Oct 20, 2022

bradenmacdonald commented Oct 21, 2022

felipemontoya commented Oct 26, 2022

bradenmacdonald commented Nov 3, 2022

felipemontoya commented Dec 1, 2022

bradenmacdonald commented Dec 2, 2022

itsjeyd commented Dec 6, 2022

bradenmacdonald commented Dec 6, 2022

antoviaque commented Dec 7, 2022

sarina commented Dec 7, 2022

antoviaque commented Dec 7, 2022

itsjeyd commented Dec 8, 2022

felipemontoya commented Feb 7, 2023

felipemontoya left a comment

Choose a reason for hiding this comment

felipemontoya commented Feb 7, 2023

kdmccormick left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

bradenmacdonald Feb 8, 2023 • edited Loading

Choose a reason for hiding this comment

bradenmacdonald commented Feb 8, 2023

kdmccormick left a comment

Choose a reason for hiding this comment

felipemontoya commented Feb 9, 2023

openedx-webhooks commented Feb 9, 2023

angonz commented Feb 10, 2023

bradenmacdonald commented Aug 19, 2022 •

edited

Loading

openedx-webhooks commented Aug 19, 2022 •

edited

Loading

bradenmacdonald Sep 9, 2022 •

edited

Loading

bradenmacdonald Feb 8, 2023 •

edited

Loading