Replies: 1 comment 1 reply
Thank you for writing this up. Sounds pretty thorough... what would be the process for upgrading the DB server, though? Similar process? It seems a bit harder to test, but I don't know much about this. It also seems like the script may need to be updated for every Ubuntu version upgrade. Want to note here that Jason and I did quite a bit of tweaking to the MySQL config on the db server based on advice from people on Stack Overflow, and this improved performance pretty dramatically. So let's be sure to copy that config in whatever clone/migration we do; it is not a default config!
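If it helps, here is a rough sketch of how the tuned settings could be captured from the running db server and later diffed against a fresh clone to confirm nothing was lost. This assumes the `pymysql` package and a read-only account; the account name is a placeholder, not something that exists today.

```python
# Sketch only: dump the running MySQL global variables so the tuned values
# can be diffed against a fresh clone's defaults. Hostname is the real db
# server; the read-only account is a placeholder.
import os
import pymysql

def dump_globals(host, user, password, outfile):
    conn = pymysql.connect(host=host, user=user, password=password)
    try:
        with conn.cursor() as cur:
            cur.execute("SHOW GLOBAL VARIABLES")
            rows = sorted(cur.fetchall())
    finally:
        conn.close()
    with open(outfile, "w") as f:
        for name, value in rows:
            f.write(f"{name} = {value}\n")

if __name__ == "__main__":
    # Run against the current db server and again against a clone,
    # then `diff` the two output files to confirm the tuning carried over.
    dump_globals(
        "db-2020-10.mushroomobserver.org",
        "readonly",                      # placeholder account
        os.environ["MYSQL_PASSWORD"],
        "db-2020-10-globals.txt",
    )
```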
There was a recent security issue with the version of the operating system we use for MO. We were able to address it with a fairly simple upgrade, but it raised some questions in my mind about how to better track such things and try them out in a safe environment before committing to them. At the moment we have the following servers at Digital Ocean:
mushroomobserver.org - Main production server
db-2020-10.mushroomobserver.org - Our database server
test.mushroomobserver.org - A recently created server for testing some of the changes to the Create Observation workflow
Small Updates
The approach I took wasn't really well thought out and took some unnecessary risks. Specifically, I simply ran an automated update on the production server and then rebooted without doing any testing. Only after that did I update the test server. Fortunately it worked this time, but in retrospect I really should have started with the test server and verified that things were working before touching production. Luckily we didn't have to change anything on the database server in this case; if we had, we would have run the same risk with no separate test database server to try it on first. We could clone that server and try an upgrade, but I don't know how much that would cost (it's our most expensive server) or exactly how we'd test it once the clone was made.
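To make the "test server first, production second" ordering harder to skip, the small updates could be wrapped in something like the sketch below. It assumes passwordless SSH and sudo on both servers, and that a 200 from the front page is an adequate smoke test; that last assumption is probably too weak and would need fleshing out.

```python
# Sketch of a test-first update run: upgrade the test server, smoke-test it,
# and only then touch production. Hostnames are the real ones; everything
# else is illustrative.
import subprocess
import urllib.request

def apt_upgrade(host):
    subprocess.run(
        ["ssh", host, "sudo apt-get update && sudo apt-get -y upgrade"],
        check=True,
    )

def healthy(url):
    try:
        return urllib.request.urlopen(url, timeout=30).status == 200
    except OSError:
        return False

if __name__ == "__main__":
    apt_upgrade("test.mushroomobserver.org")
    if not healthy("https://test.mushroomobserver.org/"):
        raise SystemExit("Test server unhealthy after upgrade; not touching production.")
    apt_upgrade("mushroomobserver.org")
```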
Big Updates
The second issue is that the versions of the operating system we have on these systems are getting out of date. The database server is particularly old (2020) and no longer "supported". The production and test servers are a bit more up to date (2022), but should also be upgraded. Our process for big upgrades like this is pretty cumbersome at the moment: essentially we build a new server by hand, work on it until everything runs correctly, and then swap it in for the old one. Build and swap is not an unreasonable approach, but I think a lot could be done to simplify that setup process using scripts and some better automation tools.
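The "build" half of build-and-swap is the obvious place to start scripting. Something along these lines would create the replacement droplet through the DigitalOcean API (v2); the region, size, and image slugs and the tag are placeholders, not our actual values, and it needs the `requests` package plus an API token.

```python
# Illustrative sketch of provisioning a replacement server via the
# DigitalOcean API so build-and-swap can be scripted end to end.
import os
import requests

API = "https://api.digitalocean.com/v2"
HEADERS = {"Authorization": f"Bearer {os.environ['DO_TOKEN']}"}

def create_droplet(name):
    resp = requests.post(
        f"{API}/droplets",
        headers=HEADERS,
        json={
            "name": name,
            "region": "nyc1",             # placeholder region
            "size": "s-2vcpu-4gb",        # placeholder size slug
            "image": "ubuntu-24-04-x64",  # target OS release
            "tags": ["mo-candidate"],
        },
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["droplet"]["id"]
```

Once the droplet exists, a setup script kept in the repo (or a configuration tool) would install the app, and DNS or a floating IP would only be swapped after the new server passes the same checks we run against test.mushroomobserver.org.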
Scalability
The other big concern I have is how scalable our current production server is. Specifically, what happens if 10-20 folks are simultaneously uploading observations to the system? To assess this concern I spent some time analyzing past behavior on the site. From what I can tell, observation uploads take 6-7 seconds on average but can sometimes take around 30 seconds (based on a day and a half's worth of data). Over that time 113 observations were created.
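To turn the "10-20 simultaneous uploads" question into a measurement, we could point a small concurrency harness like the one below at the test server. A real test would need authenticated sessions and multipart image posts; the URL and worker count here are just placeholders to show the shape of the harness.

```python
# Bare-bones concurrency harness: fire N simultaneous requests at the test
# server and report per-request latency. Not a substitute for a real upload
# test, just the skeleton of one.
import time
from concurrent.futures import ThreadPoolExecutor
import requests

def timed_request(url):
    start = time.monotonic()
    resp = requests.get(url, timeout=60)
    return resp.status_code, time.monotonic() - start

if __name__ == "__main__":
    url = "https://test.mushroomobserver.org/"  # placeholder endpoint
    with ThreadPoolExecutor(max_workers=20) as pool:
        results = list(pool.map(timed_request, [url] * 20))
    for status, seconds in results:
        print(f"{status}  {seconds:.1f}s")
```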
Looking at it from another angle, our maximum number of observations in a day over the last year is 238. I would expect an event like NEMF or NAMA could increase that by 4x or more, and the uploads would often be concentrated at specific times of day (when folks return from forays). The maximum we've ever had uploaded in a day was 959 on July 17, 2012, which looks like a day that Jason decided to create a very large number of lichen observations (896). The next highest (865) was on 2014-09-03, when Christian created 794 from a trip to Alaska.
Assuming that Jason was pushing the system in some automated way, it looks like he was averaging 15s per upload based on the time between successive observations (the median was 6s).
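For reference, the inter-observation gaps above could be recomputed straight from the database along these lines. The table and column names (`observations`, `user_id`, `created_at`), the database name, and the user id are my guesses/placeholders, and it assumes the `pymysql` package.

```python
# Sketch: compute gaps between consecutive observation timestamps for one
# user on one day, then report mean and median.
import os
import statistics
import pymysql

def upload_gaps(conn, user_id, day):
    with conn.cursor() as cur:
        cur.execute(
            "SELECT created_at FROM observations "
            "WHERE user_id = %s AND DATE(created_at) = %s "
            "ORDER BY created_at",
            (user_id, day),
        )
        times = [row[0] for row in cur.fetchall()]
    return [(b - a).total_seconds() for a, b in zip(times, times[1:])]

if __name__ == "__main__":
    conn = pymysql.connect(host="db-2020-10.mushroomobserver.org",
                           user="readonly",            # placeholder account
                           password=os.environ["MYSQL_PASSWORD"],
                           database="mo")               # placeholder db name
    gaps = upload_gaps(conn, user_id=1, day="2012-07-17")  # user id illustrative
    print(f"mean {statistics.mean(gaps):.1f}s, median {statistics.median(gaps):.1f}s")
```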
To address these performance concerns I think we'll want to enable running with multiple app servers. We could consider focusing on threading instead, but I expect that would require a lot more development and would still have a ceiling based on how many cores the server has (our current server has 2 CPUs). Running multiple servers would be similar to what we're currently doing with the test and production servers, except we'd put them behind a load balancer that distributes requests based on load. In theory this approach would allow us to spin servers up and down as needed. It would also give us an easy way to provide a real maintenance page when we need to bring the system down during a code deployment or an operating system update.
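If we went the DigitalOcean load balancer route, the spin-up/spin-down piece could look roughly like this: attach or detach app-server droplets from the balancer through the API. The load balancer id and droplet ids are placeholders; this is just a sketch of the idea, not a worked-out deployment flow.

```python
# Sketch: add or remove app-server droplets behind a DigitalOcean load
# balancer. Requires the `requests` package and a DO API token.
import os
import requests

API = "https://api.digitalocean.com/v2"
HEADERS = {"Authorization": f"Bearer {os.environ['DO_TOKEN']}"}

def set_membership(lb_id, droplet_ids, attach=True):
    method = requests.post if attach else requests.delete
    resp = method(
        f"{API}/load_balancers/{lb_id}/droplets",
        headers=HEADERS,
        json={"droplet_ids": droplet_ids},
        timeout=30,
    )
    resp.raise_for_status()
```

During a deploy or OS upgrade we could detach one server, upgrade it, and reattach it once it passes health checks; for a full outage we could detach everything and point the balancer at a small droplet that serves only a maintenance page.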
Suggested tasks