Systemd 'conversion' & GAT updates #63

Closed
hexylena opened this issue Jun 28, 2022 · 15 comments

@hexylena
Member

hexylena commented Jun 28, 2022

hey y'all. I'm .. struggling a bit, with the updates to gravity from systemd. Currently it feels like I'm losing a lot of things that I use on a daily basis, but maybe I've just missed the equivalent replacements?

  • journalctl --since "30 minutes ago" -u galaxy, I don't think there's an equivalent, I used this daily to find problems that happened at a specific time point. I don't really want to go back to looking at timestamps and parsing them with my brain.
  • journalctl -fu galaxy-handler-*, I can't seem to use wildcards. I used this regularly since I don't know which handler was serving a request, and I wanted to check which one errored, easily.
  • journalctl -fu galaxy, period. Now I have to know the log location (or as the galaxy user, run /srv/galaxy/venv/bin/galaxyctl --state-dir /srv/galaxy/var/gravity/ follow handler gunicorn). This feels like a major downgrade from the sheer simplicity of systemd (yes I said it.)
  • cgroups & memory limits. I know we'll have new controls maybe, but, I miss being able to micromanage which types of handlers and units got how much memory. I don't want to apply it at the global level
  • log rotation, journald was handling that, now we have to do it ourselves, or, I guess add logfile_maxbytes but I liked my old log rotation.
  • log cleaning, now I have to manually delete them, rather than being able to rely on journalctl --vacuum-size=500M (great for removing old stuff when in a VM with limited space!) or journalctl --vacuum-time=2d (especially nice for deleting old logs, and being sure we aren't holding GDPR type data for too long.)

Then there's some open questions for me

  • tusd, is this now getting merged in? Do we throw away the tutorial or is the ansible role/config still required? Do we now manage this some other way? I don't know what changes are required here.
  • gx-it-proxy, same as above.

Pro: I know this would put all of us on a more standardised platform, and a lot closer to what main does (which feels good in theory!!). I know systemd isn't always well received, and doesn't work as well for "laptop" users, I understand/get/like the portability of supervisord, especially for all the small/laptop users who are replacing run.sh (and main.)

Con: But I think bigger sites are really getting a lot of benefits from it that we really like. Many small galaxies started in the past 3-4 years all use systemd, journald. All of these admins will have to update what they're doing.

I guess I'm asking:

  • Do y'all believe this is the one true path, using gravity to control galaxy? I don't want to duplicate effort, not when there's a lot of knowledge encoded in gravity. I don't have bandwidth to re-implement as systemd units currently.
  • If this is the true path, could we maybe get more hands updating the training, since this seems to affect a lot of it. Is there some way this can be a part of the release/release-testing process? That devs who understand the product are making sure the admin training is using current best practices everywhere? I'm not sure I'm capable of catching them, as a non-experienced gravity user.
@jdavcs
Member

jdavcs commented Jun 29, 2022

  • If this is the true path, could we maybe get more hands updating the training, since this seems to affect a lot of it. Is there some way this can be a part of the release/release-testing process? That devs who understand the product are making sure the admin training is using current best practices everywhere? I'm not sure I'm capable of catching them, as a non-experienced gravity user.

+1. For that to happen, the release testing team must be made up of at least 1 admin and at least 1 dev who is sufficiently comfortable with gravity. That, and a revision of how we do release testing: we have a reasonably well-structured approach to testing end-user-facing stuff, but absolutely not admin-facing stuff. Maybe we should change that.

@natefoo
Member

natefoo commented Jun 29, 2022

  • journalctl --since "30 minutes ago" -u galaxy

Yeah, no equivalent to this.
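
For flat log files, the time-based filtering can be approximated outside of Gravity. The following is only a minimal sketch, not something galaxyctl provides: it assumes every log record contains a "YYYY-MM-DD HH:MM:SS" timestamp somewhere in the line, and the names lines_since and TIMESTAMP_RE are made up for illustration.

```python
import re
import sys
from datetime import datetime, timedelta

# Assumption: each log record carries a "YYYY-MM-DD HH:MM:SS" timestamp
# somewhere in the line. Adjust the pattern if your log format differs.
TIMESTAMP_RE = re.compile(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}")


def lines_since(lines, cutoff):
    """Yield the lines whose timestamp is at or after `cutoff`.

    Lines without a timestamp (for example traceback continuation lines)
    follow the decision made for the most recent timestamped line.
    """
    keep = False
    for line in lines:
        match = TIMESTAMP_RE.search(line)
        if match:
            stamp = datetime.strptime(match.group(0), "%Y-%m-%d %H:%M:%S")
            keep = stamp >= cutoff
        if keep:
            yield line


if __name__ == "__main__":
    # Print everything from the last 30 minutes for each file given as an argument.
    cutoff = datetime.now() - timedelta(minutes=30)
    for path in sys.argv[1:]:
        with open(path, errors="replace") as handle:
            sys.stdout.writelines(lines_since(handle, cutoff))
```

Saved under a hypothetical name such as logs_since.py, it would be run against one or more log files, e.g. the handler logs shown further down. It compares against local time and ignores time zones, so treat it as a starting point only.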

  • journalctl -fu galaxy-handler-*

No wildcard, but you can follow multiple with galaxyctl follow <service> [service ...]:

(venv) [g2test@galaxy07 log]$ galaxyctl follow test_handler0 test_handler1
==> /srv/galaxy/test/log/test_handler0.log <==
...
==> /srv/galaxy/test/log/test_handler1.log <==
...

I agree this is more annoying though, wildcard is a good feature worth adding if it's not too much trouble. It should apply to many of the subcommands like restart, etc.
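
To make the wildcard idea concrete, here is a rough sketch of how patterns could be expanded against the configured service names before a subcommand acts on them. This is not existing Gravity code; expand_service_patterns is a hypothetical helper, and the service list in the usage note is only an example.

```python
from fnmatch import fnmatchcase


def expand_service_patterns(patterns, known_services):
    """Expand shell-style patterns into concrete service names.

    The result follows the order of the patterns given, and each service
    appears at most once. A pattern that matches nothing raises ValueError
    so that typos are not silently ignored.
    """
    selected = []
    for pattern in patterns:
        matches = [name for name in known_services if fnmatchcase(name, pattern)]
        if not matches:
            raise ValueError(f"no configured service matches {pattern!r}")
        for name in matches:
            if name not in selected:
                selected.append(name)
    return selected
```

With services named gunicorn, celery, celery-beat, test_handler0 and test_handler1, the pattern test_handler* would expand to the two handlers, which could then be passed to follow, restart, and so on. The pattern has to be quoted on the command line so the shell does not try to expand it first.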

  • journalctl -fu galaxy ... /srv/galaxy/venv/bin/galaxyctl --state-dir /srv/galaxy/var/gravity/ follow handler gunicorn

We default $GRAVITY_STATE_DIR in the venv's activate script, so once activating you should only need to galaxyctl follow [service ...]. In my case, I added a ~galaxy/bin/galaxyctl that is just a shell script that activates and passes the args on to galaxyctl.

You could also configure systemd to run galaxyctl start --foreground without the --quiet that the Ansible role adds if you want all the output in journald. But this does include all of the configured services (celery, celery-beat, tusd, etc.).

You can also set log_dir in the gravity section of galaxy.yml to a more convenient directory and then follow the log files directly with tail.

In general with journalctl, personally, I have always been less comfortable having to use journald over flat log files, but I readily admit that is mostly just me being crusty and antiquated.

  • cgroups & memory limits

Yeah, no individual controls here. I am not sure how much complexity we would need to add to implement this via the supervisor process manager and if we'd be wasting our time doing it over alternatives (see below).

log rotation, log cleaning

Conversely I always find it difficult to find older logs (if they are even available) with journald, but again, this is mostly my fault. We should probably however at least default the rotation option if flat files are used. logfile_backups can control cleaning but obviously without the many options/features of journald.
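
For the age-based cleanup that journalctl --vacuum-time gives you, a small script run from cron could stand in when flat files are used. The sketch below is an assumption-laden illustration, not part of Gravity: it assumes rotated files are named like gunicorn.log.1, gunicorn.log.2 (so the live gunicorn.log is never touched), and remove_old_logs is a made-up name.

```python
import time
from pathlib import Path


def remove_old_logs(log_dir, max_age_days, pattern="*.log.*", now=None):
    """Delete rotated log files older than `max_age_days`.

    Only files matching `pattern` are considered; the default is an
    assumption about how rotated files are named. Returns the list of
    removed paths so the caller can report what was deleted.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    removed = []
    for path in sorted(Path(log_dir).glob(pattern)):
        if path.is_file() and path.stat().st_mtime < cutoff:
            path.unlink()
            removed.append(path)
    return removed
```

Calling remove_old_logs("/srv/galaxy/test/log", 2) would roughly correspond to a two-day retention policy for that directory. It decides by file modification time, so a rotated file is kept until its newest entry is older than the limit.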

tusd

Still has to be installed somehow (e.g. via the role) but some of the stuff that we include to set it up to run in the role no longer needs to be done if you run it with gravity. However, #55 is important and needs to be addressed.

gx-it-proxy

Same. Installation is still external to gravity.

Pro/Con

I largely agree, although one thing I was never satisfied with was that everything was disjointed and basically all done via a bunch of disparate ansible roles that you figure out how to use via various trainings that have to be put together in the right order. Gravity sticks everything in one place to try to reduce the number of pieces you have to assemble yourself. With the addition of celery and removal of mules this was important, since you could no longer start and run a fairly featureful Galaxy with a single command. Although most Galaxies deployed using the training method in the past few years have used systemd/journald, they also ran mostly from a single service unit, something that would be broken with the addition of all the new processes, and had to be addressed in one way or another. As you said, you don't have the bandwidth to reimplement as systemd units, and we chose not to take that route (yet) either since it would be a solution for one specific case (production deployment with root privs on Linux).

All of that said...

Gravity was built to be fairly modular and agnostic with respect to the process manager, and so with some effort it would be possible to implement a systemd process manager as an alternative to supervisord. One issue is Gravity expecting to be run as the Galaxy user - and while you can run Galaxy from systemd user mode (The galaxyproject.galaxy Ansible role supports it now, for example), many admins would want it to work at the standard system/root level. This might mean creating a standalone Gravity venv for systemd installs rather than using Gravity from the Galaxy venv.

And yes, absolutely, we need more hands (especially mine, since I am one of the folks most intimately familiar with all of this) on the updating... I have been underwater all of this week and I know we're running out of time before the training, but I am still planning to make a massive push to get everything updated.

@hexylena
Member Author

@natefoo thanks for the considered response, I appreciate it and agree with basically everything you've said, especially

I was never satisfied with was that everything was disjointed and basically all done via a bunch of disparate ansible roles that you figure out how to use via various trainings that have to be put together in the right order

And completely sympathise that having it all in one place (somehow) makes a lot of sense. This always was absolute rubbish, the mix 'n match of unofficial roles. Hopefully the collections and new ansible stuff you're working on helps there, I'd love to have all of the roles in one place with example playbooks.

systemd as an alternative process manager makes this a lot more interesting to me, and gives me a lot of hope for this then. I would not mind installing it twice, that starts to feel like the best of both worlds; gravity as a single synchronized point, and systemd for all of the aforementioned features.

and I know we're running out of time before the training

rest, relax then :)

After discussion with @jdavcs, @Slugger70, and @cat-bro, and based on our lack of knowledge about gravity (and the lack of a release), we've decided to re-use the old videos, staying on 22.01 for this GAT. We're going to use the archival training materials for the first day.

@hexylena
Member Author

hexylena commented Jun 30, 2022

Conversely I always find it difficult to find older logs (if they are even available) with journald

should we write a journalctl training module for our old crusty folks? 😃 There's bound to be more who aren't comfortable with it yet, and lots of nice advanced features we can discuss.

edit: galaxyproject/training-material#3525

@natefoo
Member

natefoo commented Jun 30, 2022

should we write a journalctl training module for our old crusty folks? 😃 There's bound to be more who aren't comfortable with it yet, and lots of nice advanced features we can discuss.

Yes, let's call it "Helena teaches Nate how to use journald" and then I will forget it all the next day anyway as I grumble about how UNIX was supposed to be all about files.

@natefoo
Member

natefoo commented Aug 30, 2022

Work is underway in my systemd-pm branch.

Gravity has always had support for managing multiple Galaxy instances, but most of this is hidden from you if you only have a single instance. With supervisor, when you have multiple instances we use supervisor's group feature to keep the service configs separated. This will need to work differently with systemd. Here are my ideas:

1. Use systemd service instances

We could in theory use systemd service instances (systemctl <verb> galaxy-<service>@<instance_name>.service), but this really only works if your Galaxy instance components are all laid out in exactly the same structure since all we have to work with is the unit template var %i. To illustrate what I mean, let's say you had an instance service unit containing:

ExecStart=/srv/galaxy/%i/venv/bin/gunicorn [args]
WorkingDirectory=/srv/galaxy/%i/galaxy
Environment=GALAXY_CONFIG_FILE=/srv/galaxy/%i/config/galaxy.yml

So your venv always has to be at /srv/galaxy/<instance_name>/venv, your Galaxy root dir has to be /srv/galaxy/<instance_name>/galaxy, and so on.

2. Use service instances but Gravity writes out an environment file

The service instance file would look like this:

Environment=GRAVITY_STATE_DIR=/srv/gravity
EnvironmentFile=$GRAVITY_STATE_DIR/systemd/env-%i.conf
ExecStart=$GALAXY_VENV/bin/gunicorn [args]
WorkingDirectory=$GALAXY_ROOT

Gravity would write out /srv/gravity/systemd/env-<instance_name>.conf with all the necessary vars:

GALAXY_VENV=/home/galaxy/venv
GALAXY_ROOT=/nfs/galaxy
GALAXY_CONFIG_FILE=/home/galaxy/galaxy.yml

This allows each instance to store its components wherever, but requires Gravity to keep a separate file up to date, which might be a bit more obtuse to admins.

3. Don't use systemd service instances

This is the "simplest" method, you just end up with a bunch of service units for each instance of Galaxy, named like:

  • galaxy-<instance>-gunicorn.service
  • galaxy-<instance>-celery.service
  • etc.

Conclusion

Whatever we do, I was planning to do the same thing in systemd when there is only a single instance configured as we do in supervisor: hide all the extra instance stuff. So in the case of a single instance, you'd have just galaxy-gunicorn.service, etc.

Since multiple instances is likely a not-often-used feature, I am not going to worry too much about this, but wanted to throw it out there in case you (or anyone else) had a better idea.
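
As a rough illustration of the naming under option 3 combined with the single-instance behaviour described in the conclusion, the logic could look something like this. The function unit_names and the instance names in the usage note are hypothetical; this is a sketch of the idea, not the code in the systemd-pm branch.

```python
def unit_names(instances, services):
    """Return systemd unit names following the option 3 scheme.

    `instances` and `services` are lists of names. With exactly one
    instance the instance part is omitted, mirroring how the extra
    instance details are hidden in the supervisor setup.
    """
    if len(instances) == 1:
        return [f"galaxy-{service}.service" for service in services]
    return [
        f"galaxy-{instance}-{service}.service"
        for instance in instances
        for service in services
    ]
```

For one instance with gunicorn and celery this gives galaxy-gunicorn.service and galaxy-celery.service. For two instances named, say, test and main, it gives galaxy-test-gunicorn.service, galaxy-test-celery.service, galaxy-main-gunicorn.service and galaxy-main-celery.service.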

@natefoo
Member

natefoo commented Sep 7, 2022

Some thoughts on my own options.

  1. This is just too rigid.
  2. Tried playing around with this a bit and it seems as though you can't use env vars in User/Group/WorkingDirectory nor in the path to the command in ExecStart (although there are some tricks like bash -c ... for the latter, but this gets kind of hacky imo).
  3. So this is probably the best/simplest of these 3 options, even if it's not the most "systemd-ish."

I did discover one additional option that does appear to work:

  4. Use instance conf dirs

The service instance unit file (e.g. galaxy-gunicorn@.service) would omit things like User/Group/WorkingDirectory/ExecStart, these would instead be set in /etc/systemd/system/galaxy-gunicorn@<instance_name>.service.d/. Additionally, you can reduce repetition by separating the instance-level stuff from the service level stuff like this:

# cat galaxy-@work2.service.d/40-gravity.conf 
[Service]
User=nate
Group=nate
WorkingDirectory=/home/nate/work/galaxy2
Environment=GALAXY_CONFIG_FILE=/home/nate/work/galaxy2/config/galaxy.yml
# cat galaxy-gunicorn@work2.service.d/50-gravity.conf
[Service]
ExecStart=/home/nate/work/galaxy2/.venv/bin/gunicorn 'galaxy.webapps.galaxy.fast_factory:factory()' --timeout 300 --pythonpath lib -k galaxy.webapps.galaxy.workers.Worker --workers=1 --config python:galaxy.web_stack.gunicorn_config --pid /tmp/galaxy.pid

That said, although it's more "systemd-ish," Gravity still ends up managing a bunch of unit files, at least one per service per instance, plus the top-level service instance config for each service, and probably a target unit for each instance (which I haven't even tested yet). In a non-instance setup you're going to end up having fewer unit files, just galaxy-<instance_name>-gunicorn.service, galaxy-<instance_name>-tusd.service, etc.

@hexylena
Member Author

hexylena commented Sep 8, 2022

I have a strong strong preference for 3, galaxy-<instance>-<service>.service, since I used to use @$i for separate handlers. The multiple instance case is so so rare, it doesn't make sense to force it into the "systemd" way of @....

4 is not my preference, multiple files is more headache for admins to mentally resolve into what the unit actually looks like.

@natefoo
Member

natefoo commented Sep 8, 2022

Perfect, this will also be the simplest. Thanks!

multiple files is more headache for admins to mentally resolve into what the unit actually looks like.

Agreed, although TIL systemctl cat and systemctl show.

@hexylena
Member Author

hexylena commented Sep 8, 2022

Yeah, those are nice (but then still have to figure out if it's overridden, which file, and blahhhh). My personal opinion is systemctl cat should be equal to cat (except in case of distro managed units), but that's probably a weird opinion only I hold.

@natefoo
Member

natefoo commented Sep 13, 2022

My systemd-pm branch was getting a little unwieldy and so I broke it up and refined it in more focused chunks. #69 and #72 get us most of the way there.

@natefoo
Member

natefoo commented Oct 17, 2022

The rest is in #77.

@natefoo
Member

natefoo commented Oct 17, 2022

Still need the ability to configure limits as you can in the Ansible role, I suppose.
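
Until limits are configurable in Gravity itself, one way an admin could set them per unit is a systemd drop-in. The sketch below only renders the path and content of such a drop-in; memory_limit_dropin and the 50-limits.conf file name are invented for illustration, and it assumes system-level units under /etc/systemd/system and the MemoryMax= directive from systemd's resource control settings.

```python
from pathlib import PurePosixPath


def memory_limit_dropin(unit, memory_max, dropin_name="50-limits.conf"):
    """Return (path, content) for a drop-in that caps a unit's memory.

    `unit` is a full unit name such as "galaxy-gunicorn.service" and
    `memory_max` is a value systemd accepts for MemoryMax=, e.g. "8G".
    Nothing is written to disk; the caller decides what to do with it.
    """
    path = PurePosixPath("/etc/systemd/system") / f"{unit}.d" / dropin_name
    content = f"[Service]\nMemoryMax={memory_max}\n"
    return str(path), content
```

After writing such a file, systemctl daemon-reload and a restart of the unit would be needed for the limit to take effect. Because each service has its own unit, different handlers can get different limits rather than one global value.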

@hexylena
Member Author

Limits would be ace!

@natefoo
Member

natefoo commented Sep 7, 2023

I think this is all taken care of now but if not feel free to re-open or spin anything remaining out into their own issue(s).

@natefoo natefoo closed this as completed Sep 7, 2023