Add a section on how to use helm stable/postgresql for the hub #575
Conversation
Thanks for the PR, @gsemet! I appreciate the doc addition, although I think it should take a significantly different form. I've left inline comments, but my overall opinion is:
- We shouldn't recommend people run complex RDBMSes on Kubernetes at this time.
- Even when the time comes that running complex RDBMSes on Kubernetes works OK, we should still link people to instructions rather than provide them ourselves. This reduces the support burden on this project, prevents us from accidentally introducing security holes, and makes it less likely that we provide users with out-of-date information. I think @choldgraf has thoughts on this too, around 'when to put content here vs when to link to content'.
What I would like to have docs for is:
- How to connect to an external postgres database for the hub db (see the config sketch after this comment). Your PR has this info, and I think it will be very useful.
- When to use an external database? The only reasons I can think of are:
  - You want faster hub restarts.
  - You don't have a fast enough PVC provider.
  - You already have a postgresql / mysql database you want to keep using.
- How to determine when the db is corrupted / dropped, and what to do when this happens.

I appreciate your PRs and work in the jupyterhub/kubernetes community. Do you think we can steer this PR into one of the suggested doc areas above? Hopefully, this feedback is useful and not overly negative.
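For illustration, a minimal sketch of what the 'connect to an external postgres database' docs could show, using the chart's `hub.db` options; the host, database name, and credentials below are placeholders, not values from this PR:

```yaml
# Hypothetical values.yaml fragment: point the hub at an externally managed
# PostgreSQL server instead of the default sqlite-on-PVC database.
hub:
  db:
    type: postgres
    # user, host, port, and database name are placeholders
    url: postgresql://jupyterhub@db.example.com:5432/jupyterhub
    # in a real deployment, inject this via a Kubernetes secret rather than
    # writing it into values.yaml
    password: "<postgres-password>"
```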
    - ``mysql``
    - ``postgres``

    Using PostgreSql or MySQL provides the main advantage to allow seemless update
Can you explain a bit more what you mean by this paragraph? What differences are you seeing between sqlite / postgres? When the hub pod restarts, the following sequence of actions occurs:
1. The running hub pod dies.
2. A new hub pod is started and scheduled.
3. (if using sqlite on a PVC) If the new pod is on a different node than the old pod (which we've taken steps to avoid by using pod affinity rules; see the sketch after this comment), the disk must detach from the old node and attach to the new node.
4. The hub process starts and checks all user servers to see if they are running (this takes the longest time).
5. It resumes service.

Using an external database will make step (3) go faster, but (4) is the biggest cause for concern.
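The 'pod affinity rules' mentioned in step 3 look roughly like the following Kubernetes pod spec fragment; the label selector here is an assumption for illustration, not necessarily the chart's actual labels:

```yaml
# Illustrative only: prefer scheduling the new hub pod on the same node as an
# existing pod labelled component=hub, so a sqlite PVC does not have to detach
# from one node and attach to another.
affinity:
  podAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchLabels:
              component: hub
          topologyKey: kubernetes.io/hostname
```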
1. With sqlite, "running hub pod dies": yes. With postgres, I clearly see the old pod wait for the new pod to be up before being terminated.
2. ok
3. ok (I guess the termination of the old hub occurs here)
4. ok (I guess a potential DB schema migration can also occur here, so the service might be unavailable if this step takes too long)
5. ok

I'll check more carefully tomorrow, but from my experiments and from reading the hub deployment, we can see that a rolling update strategy is applied. In any case, we clearly see a better "user experience" when using postgresql, even without a PVC (so a restart of the pod means losing the data), than with the sqlite backend.
Indeed, a rolling update strategy is applied when using postgres! I've reverted that in #576 - I thought I already had, but had forgotten! Apologies for the confusion!
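For readers following along, the difference under discussion is the Deployment update strategy. A minimal sketch of the two options, using standard Kubernetes fields (nothing chart-specific):

```yaml
# Deployment spec fragment. #576 moves the hub back to Recreate: the old hub
# is fully stopped before the new one starts, so two hub processes never talk
# to the same database at once.
spec:
  strategy:
    type: Recreate
  # The smoother behavior observed with postgres corresponds to:
  # strategy:
  #   type: RollingUpdate   # new hub starts while the old one still serves
```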
doc/source/advanced.md (outdated):

    SQLite and MySQL/PostgreSQL are for new user, in order to allow them to log in
    even when a new hub is starting).

    The easiest way of running PostgreSQL for the Hub is to use Helm
On more thought, I don't think we should be encouraging users to run their postgresql databases on Kubernetes. Doing so is hard (https://medium.com/kokster/postgresql-on-kubernetes-the-right-way-part-one-d174ee8a56e3 and elsewhere on the web), and when the postgresql pod dies your hub won't be accessible - even though the hub process itself might be up. This gives us two points of failure rather than the current one, and for not much benefit - there will only ever be one hub process talking to the database.
I think the 'easiest' way to use an external database is to use a hosted cloud database (like RDS / Cloud SQL), or to set one up on an external machine. Running databases on Kubernetes right now is something I'd consider 'super advanced', and it should probably also use something like https://github.com/CrunchyData/postgres-operator rather than the helm chart.
yes, storing critical data on Kubernetes is hard. The intent of this proposal is not to mislead people into believing a simple solution exists with this helm chart.
This is an option admins can have, and for a POC, for tests, or even for real usage at small scale, PostgreSQL on a PVC is great and makes the hub happy. At worst, in case of a database crash, we only lose the current connections. That is still better than the sqlite backend, even if that also works fine.
The idea is to give people options. I can add a link to a document, or similar, highlighting that HA is hard to achieve with PG as with any other database, and that the best option is to use the DB service provided by your cloud provider.
I work on bare metal, so we do not have any PG-as-a-service yet (it should come at some point). But we use PG intensively on Kubernetes, of course only for non-critical information (i.e., where restarting from an empty DB is not a big deal).
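For context, running the database this way boils down to chart values along these lines; the parameter names are from memory of the old stable/postgresql chart and may differ between versions, and the credentials are placeholders. The persistence toggle is the one discussed a few comments below:

```yaml
# Hypothetical pg-values.yaml for `helm install stable/postgresql -f pg-values.yaml`
postgresUser: jupyterhub
postgresPassword: "<postgres-password>"
postgresDatabase: jupyterhub
persistence:
  # true: the database survives pod restarts via a PVC
  # false: data lives inside the pod and is lost when the postgres pod restarts
  enabled: true
```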
doc/source/advanced.md (outdated):

    lost when the PostgreSQL server will restart. Do not use `--recreate-pods`
    helm option. It will not cause any data loss for user, just the proxy will
    not be able to route logged users.
    - even without persistance, your hub can still be upgraded smoothly for the
This relies on the postgres pod never dying, which is not true.
sure, but it's up to the admin to choose persistence or not. Pretty easy to do with the helm config.
doc/source/advanced.md (outdated):

    users, because a Hub restart will not cause the PostgreSQL pod to restart.
    - The Hub will use Rolling Update, so a new hub will be started while the
    other is still running and the transition will be transparent for the user.
    - if case of DB loss or reset, simply ensure all your server are shutdown
I think a separate section on 'how to recover from database corruption' might be useful outside of this.
I can remove this line; recovering from db corruption is a big problem on its own. It was just here to say "no big deal, there is no data loss, just restart everything and it will work again".
doc/source/advanced.md (outdated):

    not be able to route logged users.
    - even without persistance, your hub can still be upgraded smoothly for the
    users, because a Hub restart will not cause the PostgreSQL pod to restart.
    - The Hub will use Rolling Update, so a new hub will be started while the
This is not true I think, and might actually cause corruption because at times two hub processes might be talking to the same database. The hub is not designed to work like that.
I am referring to this section of the configuration. It does keep the old hub running and accepting new people while the new hub instance is initializing. Maybe there is a short outage, but at least it does not take very long.
Ah, good catch! @minrk pointed out the problems that could cause in #367 (review) but I forgot to make a revert. I've opened #576 now to revert that change properly.
I think #576 should help with the confusion a bit - I thought we had done that PR a long time ago, clearly I had forgotten! Hopefully my comments make a little bit more sense in that context!
    It is advised to use the Database as a Service solution that your Cloud
    provider may provide you.

    You have the option to run PostgreSQL for the Hub using Helm 'stable/postgresql'
Can we instead provide content like:

    You can run postgresql either via your cloud provider's hosted offering (links to Amazon RDS, Google Cloud SQL, Azure Database), or by setting it up yourself manually. There is an experimental method of running it directly on kubernetes (link to the helm chart), but we do not recommend it.

I'd prefer we have that rather than provide direct instructions on using the helm chart here.
doc/source/advanced.md (outdated):

    Note the following points when using PostgreSQL as database for the Hub:

    - If you do not use a persistant volume on PostgreSQL, your database will be
This only applies if the postgres database is set up with helm, which we should recommend against. Possibly remove this whole section?
I would not choose for the user; I prefer giving them the information and the limitations of each solution, so they can make their own choices depending on their own situation.
We are dealing with people who are deploying JupyterHub in a Kubernetes cluster; this population should know what they are doing. It is always better to give them hints and links to more information, and I learned a lot when reading the zero2jupyterhub documentation.
I guess you would like to prevent people from making mistakes, which is honorable, but if I were writing a scrum story for this feature I would formulate it as: "as a devops engineer deploying jupyterhub in my kubernetes cluster, I would like to be able to use a postgresql server started on my cluster when I do not have a DBaaS option".
Also, it may be an argument from authority, but there are so many helm charts in the 'stable' section of the official repository that already provide an optional postgresql dependency; they can't all be wrong, can they? It is always optional: if the admin has a better option, they should not enable this dependency and should use their DBaaS instead.
I may appear a bit pushy on this one, but switching to pg gave a better user experience on our bare-metal jupyterhub installation, and I would like to document it for others to try.
Reference jupyterhub#567 Signed-off-by: Gaetan Semet <gaetan@xeberon.net>
This is a lot of great conversation, thanks @gsemet and @yuvipanda! Just a quick note that is worth considering: I think that @gsemet's comments make sense if we can assume that our jupyterhub k8s deployers are relatively experienced ops people, but I don't think that we can assume this. In addition to more experienced people, Z2JH is also aimed towards relative kubernetes newbies with little experience in this kind of thing (e.g. a teacher that wants to use a jupyterhub for a 2-day workshop). We need to be careful not to overload these people with information, because many will simply start following whatever steps are there and dig themselves into a hole...
@choldgraf Regardless of deployer experience, I think the core problem is that the biggest advantage here (rolling updates) is actually a bug in z2jh, and will be 'fixed' when #576 gets merged. There are still great reasons to use postgres, but only in specific circumstances. I still do think that running postgres on kubernetes is a bad idea and not one we should recommend for anyone though.
@gsemet that's great to hear it's been smoother for you! Can you give me a few more details on how your setup looks, and what problems you ran into with the default sqlite setup? Thanks!
Hello. So, as background, I have some experience with sqlite and mysql; I worked with Buildbot to maintain a CI for 2000 Android developers. But I do not consider myself a DB expert at all. I did a lot of sqlite/sqlalchemy/alembic work, and it gave us a hard time maintaining it. That is to tell you that I can maybe help you on this part, and it is always good for my own personal skills to work on these hard subjects (DB, concurrency, migration) with more experienced people.

I set up a jupyterhub using the marvelous z2jh guide. I really appreciated the fact that it takes people by the hand through the deployment, so it helped me a lot in learning more about kube/helm stuff. We now use kubernetes daily at work for all our microservices, but in a very self-cooked way. Too bad for maintenance. Now jupyterhub is running on our bare-metal kubernetes cluster, behind a Traefik Ingress, without any problem.

First I started with sqlite on a PVC. It worked well indeed, but there is no "smooth upgrade", i.e., no rolling update. New users or admins going on… So I switched to PostgreSQL, using the helm chart because it is simpler. Also, we do not have a DBaaS yet, so we cannot ask for a safe postgres database with high availability and so on. It will be available at some point, and I understand your point that admins should always use the DBaaS provided by their cloud provider, like Google, AWS or OpenStack. But that is simply not an option yet. And our JupyterHub cluster is not intended for high availability; it just should work as well as possible.

We had a discontinuity of the hub today (the node died). The proxy also died, so users had to wait for it to come back. This is not a problem for us, but I guess high availability for the proxy is more of a concern (or switch to a traefik proxy? :) ). After switching to postgres, we have a rolling update for the hub, so new users can still log in. I did not experience any "corruption" like mentioned here, but I think our environment does not stress it enough. I think there is a small discontinuity (like I described before), but it is so short I cannot say how long it takes (I should measure it).

I would like to propose a values.yaml option to bring this behavior back, but under an opt-in trigger (False by default; users that want this rolling update can still enable it at their own risk). I am for proposing users (= people who know or are learning how to deploy in kubernetes) the best "good" choices by default, while still letting them enable some "experimental" features on their own. Power users can still fork the project and maintain their own modifications, but I really feel charts and helm are exactly made for feature gating (letting people make their own choices).

I agree and disagree with you, @choldgraf. Maybe we can move all this information to a "very experimental features" page or a wiki/blog section, but there is no such thing as "too much information". There is only relevant information and useless noise. In the end it is only a matter of organisation, and of properly choosing the defaults (Convention over Configuration; if a willing user wants to deviate from the convention, he still can... at a higher cost of maintainability, but that is his responsibility).
Thanks for the detailed information, @gsemet! I highly appreciate it! I agree that in your case using the postgres helm chart was a good option! I have two primary thoughts:
I think my answer to question 1 is: yes, but in very broad terms that allow overriding various kubernetes bits. For example, in the case of #576, we might allow users to override the entire 'updateStrategy' field. This requires that whoever wants to play with rolling updates does so with full knowledge of kubernetes norms, and can play with it with full power without having to rely on us to support various features. I'm very pro allowing overrides of all parts of the Kubernetes objects we create.

Question 2 is really the core of this issue, I think. As JupyterHub maintainers, we have limited time and resources to support people who might run into problems. We want to make sure we only provide info we can support in some form. This limits what we can have here, but it also allows us the time and space to build things, support the community as best we can, and keep the guide up to date and tested whenever we make releases, without burning the community out. However, the negative consequence of this is that we do not have a space for 'these things worked for people in specific circumstances, and you can try them too!'. Such a space would come with the clear expectation that it is not 'supported' in the same way the core z2jh guide is, and would rely on a much wider community of practice to keep it up to date. A recent example of this is running z2jh on OpenStack. Lots of people are trying to do this and it'd be great to have a space for them to collaborate - but most of the info is related to running k8s on OpenStack, and z2jh itself is not a good place for it. Any OpenStack info we have in the guide is going to be incomplete and out of date very soon, since none of us are OpenStack experts. I think the situation is similar for postgres. This community space could also be a space to try out different approaches before moving something into z2jh itself.

We currently don't have any guidelines about these 'levels of support', so writing this out is a very good idea. My preference would be for us to start another repo that acts as a wiki, with very liberal editing rights (pretty much anyone can write anything there) and a specific scope (to be discussed). This would allow space for advanced (and specific) use cases to be documented and organically adopted, without increasing the support burden on us. What do you think, @gsemet?
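A hypothetical shape for the override floated above; this key did not exist in the chart at the time, so the name and placement are assumptions for illustration only:

```yaml
# Hypothetical values.yaml fragment letting deployers opt back into rolling
# updates for the hub, at their own risk; the chart default would stay Recreate.
hub:
  deploymentStrategy:
    type: RollingUpdate
```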
I also filed #584 for the 'hub is down for a while when changing config' issue, which should help with at least some of the problems!
Regarding this part, is there a "helm" way of doing it? Because even with the… The Community section is perfect for me! Later, some parts could be promoted to the official jupyter support team. A wiki is fine; I do not see, sadly, this section in the sphinx doc, even if I am more comfortable with that than with wiki editing. Maybe you do not want to review these changes and prefer to let the "community" organize itself.
Hi @gsemet and @yuvipanda. Thank you @gsemet for your PR, and sorry you experienced it going stale. I want to reach a conclusion for this PR, so I've spent some time considering it with fresh eyes. I rebased this PR and had a look. I concluded that I'd like to close this PR for these reasons:
I'm going ahead and closing this issue without awaiting a response since the PR has been stale for a long time, but please feel free to discuss this further.
@yuvipanda tell me if a documentation patch like this is ok for you.
Reference #567