Migrate internal monitoring to free 3rd party service #208
Comments
I created a repo with JSON definitions for (remote) monitors and alerts, which are automatically deployed to the DataSet account on push / merge: https://github.com/StackStorm/dataset-scalyr-resources. I started with a private repo, but if the repo won't contain any secrets, we can also make it public. For checks that are not HTTP-based, we will need to define agent-based monitors (that also includes HTTPS cert and domain expiration, since the remote monitors don't directly support that functionality). Having said that, we need to decide how to install and manage the agent, and on which hosts (just cicd, or some other hosts as well?). Ideally we would use an infra-as-code approach for installing the agent and managing its config. One option would be to store the agent config in the same repo (dataset-scalyr-resources) and pull the config down during the agent install / deploy job. Another would be to store it in the repo that contains the code (chef cookbook or whatever) to install the agent, although I would prefer the first approach since it keeps all the config files in a single location. Another thing: which email address should alerts go to?
OK, so far all the host (agent) based checks for st2cicd have been ported - http://monitoring001:3000/#/client/sensu/st2cicd042.uswest2.stackstorm.net. Which other clients / hosts do we want to port the checks for? In other words, on which other hosts does the agent need to be installed and the monitors and alerts set up?
The st2cicd host is sufficient; we'll likely get rid of everything else.
@armab I believe I have migrated all the checks for st2cicd now. This includes the "remote" SSL cert and domain expiry checks, but those use an agent monitor since DataSet doesn't support those remote checks natively. It would be good if you could double-check that nothing is missing when you get a chance - https://app.scalyr.com/alerts?teamToken=BLSvhkqnK81b_wD2KhjsoQ--. I also still need to adjust some thresholds and verify that all alerts are set up correctly, i.e. that they trigger when they should. I also set up log ingestion for the StackStorm services logs in case they help us with troubleshooting. They seem to be low volume, so they shouldn't cause any log volume related issues. If they do, we can always disable them. The only exception is MongoDB: that log seems to grow like crazy, so I removed that file from ingestion.
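For anyone wondering what an agent-side SSL cert expiry check boils down to, here is a minimal Python sketch using only the standard library. It is not the monitor that was actually deployed and it does not use any DataSet / Scalyr agent API; the function names, the 30-day threshold, and the output format are all hypothetical, chosen only for illustration.

```python
import socket
import ssl
import time
from typing import Optional

SECONDS_PER_DAY = 86400


def days_until_expiry(not_after: str, now: Optional[float] = None) -> float:
    """Return the days left before a certificate's notAfter timestamp.

    `not_after` uses the format returned by SSLSocket.getpeercert(),
    e.g. "Jan  5 09:34:43 2018 GMT". A negative result means expired.
    """
    expires_at = ssl.cert_time_to_seconds(not_after)
    current = time.time() if now is None else now
    return (expires_at - current) / SECONDS_PER_DAY


def fetch_not_after(host: str, port: int = 443, timeout: float = 10.0) -> str:
    """Open a TLS connection and return the peer certificate's notAfter field."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def check_host(host: str, warn_days: float = 30.0) -> str:
    """Build a one-line status message (hypothetical format) for a host."""
    remaining = days_until_expiry(fetch_not_after(host))
    status = "WARN" if remaining < warn_days else "OK"
    return f"{status} host={host} cert_days_remaining={remaining:.1f}"
```

Calling `check_host("example.com")` (a placeholder hostname) from a periodic job and emitting the result as a log line or metric would give an alert something to trigger on once the remaining days drop below the chosen threshold.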
Internal infrastructure includes a st2monitoring server with the dashboard and client checks (services, memory, processes, ports) for each internal infra node, including the st2cicd server, as well as external checks (APIs, SSL cert expiry, domains, ST2 websites availability health checks).

In order to reduce the amount of infra, costs, and moving pieces, and to rely less on AWS resources (see https://github.com/orgs/StackStorm/projects/27), remove the st2monitoring server and start migrating to a free 3rd party service for monitoring and alerting.
For example, we could use Scalyr (where @Kami works).

There are several sub-tasks here:
Example with external checks:
![monitoring](https://user-images.githubusercontent.com/1533818/157028397-a84369b0-6155-44f9-a882-6dae2767645a.gif)
Example for st2cicd server:
![image](https://user-images.githubusercontent.com/1533818/157029691-a863f35f-7f5d-4358-9a25-3cc27799b768.png)
Finishing the first part, migrating the external checks, would already be great. At that point we can remove the monitoring server, which would save us $60/mo in AWS.