Migrate internal monitoring to free 3rd party service #208

Open
arm4b opened this issue Mar 7, 2022 · 5 comments
@arm4b
Member

arm4b commented Mar 7, 2022

Internal infrastructure includes an st2monitoring server with a dashboard and client checks (services, memory, processes, ports) for each internal infra node, including the st2cicd server, as well as external checks (APIs, SSL cert expiry, domains, ST2 website availability health checks).

To reduce the amount of infra, costs, and moving pieces, and to rely less on AWS resources (see https://github.com/orgs/StackStorm/projects/27), we should remove the st2monitoring server and start migrating to a free 3rd party service for monitoring and alerting.

For example, we could use Scalyr (where @Kami works).

There are several sub-tasks here:

  • research the monitoring/alerting platform (check whether Scalyr is a good fit)
  • create and configure a 3rd party monitoring account for the st2 TSC
    • shared account/email with the TSC
    • monitoring alerts should go to #opstown Slack
  • set up external API/web checks:
    • APIs
    • SSL + expiry
    • Domains + expiry
    • Health checks:
      • stackstorm.com
      • stackstorm.org
      • index.stackstorm.org
      • helm.stackstorm.com
      • api.stackstorm.com
      • docs.stackstorm.com
      • st2cicd webhook endpoints
  • create internal checks for st2cicd:
    • via 3rd party monitoring agent/client
    • migrate st2cicd internal checks: memory, CPU, services, processes, etc.
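
To make the external health checks concrete, here is a minimal sketch of what one probe per host could look like. It is not how any particular 3rd party service implements them: the `https://` scheme, the 10 second timeout, and the "2xx/3xx means healthy" rule are all assumptions, and `fetch_status` / `run_checks` are hypothetical helper names.

```python
import urllib.error
import urllib.request

# Hosts copied from the health-check list above.
# Probing "https://<host>/" is an assumption, not something stated in the issue.
HEALTH_CHECK_HOSTS = [
    "stackstorm.com",
    "stackstorm.org",
    "index.stackstorm.org",
    "helm.stackstorm.com",
    "api.stackstorm.com",
    "docs.stackstorm.com",
]


def fetch_status(url, timeout=10):
    """Return the HTTP status code for url, or None if the request fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        # The server answered, but with a 4xx/5xx status.
        return exc.code
    except (urllib.error.URLError, OSError):
        # DNS failure, refused connection, TLS error, timeout, ...
        return None


def run_checks(hosts, fetch=fetch_status):
    """Map each host to True (healthy) or False (should alert)."""
    results = {}
    for host in hosts:
        status = fetch(f"https://{host}/")
        results[host] = status is not None and 200 <= status < 400
    return results
```

Calling `run_checks(HEALTH_CHECK_HOSTS)` returns a dict of host to boolean; the `fetch` parameter exists only so the logic can be exercised without network access.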

Example with external checks:
[image: monitoring]

Example for st2cicd server:
[image]

Finishing just the first part, migrating the external checks, would already be great. At that point we can remove the monitoring server, which would save us $60/mo in AWS.

@arm4b arm4b added the infra label Mar 7, 2022
@Kami
Member

Kami commented Mar 25, 2022

I created a repo with JSON definitions for (remote) monitors and alerts which are automatically deployed to the DataSet account on push / merge - https://github.com/StackStorm/dataset-scalyr-resources.
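
Since the definitions are plain JSON that gets deployed on push, a cheap safeguard is to check that every file parses before deploying. The following is only a sketch of such a pre-deploy step, not something the repo is known to do: `find_invalid_json` is a hypothetical helper, and it assumes nothing about the DataSet schema beyond the files being `*.json`.

```python
import json
from pathlib import Path


def find_invalid_json(root):
    """Return (path, error message) pairs for each *.json file under root that fails to parse."""
    problems = []
    for path in sorted(Path(root).rglob("*.json")):
        try:
            json.loads(path.read_text(encoding="utf-8"))
        except (json.JSONDecodeError, UnicodeDecodeError) as exc:
            problems.append((str(path), str(exc)))
    return problems
```

A CI job could run this against the checkout and fail the deploy if the returned list is non-empty.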

To begin with, I started with a private repo, but if the repo won't contain any secrets, we can also make it public.

For other non-HTTP based monitors, we will need to define agent based monitors (that also includes HTTP cert and domain expiration since that functionality is not directly supported by the remote monitors).
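
For illustration, the cert expiration part of such an agent-based monitor could be approximated with the Python standard library as below. This is a sketch under assumptions, not the DataSet agent's actual monitor code: the function names are hypothetical and the 14 day warning threshold is made up.

```python
import socket
import ssl
import time


def fetch_not_after(hostname, port=443, timeout=10):
    """Return the peer certificate's notAfter string, e.g. 'Jan  1 00:00:00 2030 GMT'."""
    context = ssl.create_default_context()
    with socket.create_connection((hostname, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=hostname) as tls:
            return tls.getpeercert()["notAfter"]


def days_until_expiry(not_after, now=None):
    """Whole days between now and the certificate's notAfter timestamp."""
    now = time.time() if now is None else now
    expires_at = ssl.cert_time_to_seconds(not_after)
    return int((expires_at - now) // 86400)


def cert_is_expiring(not_after, warn_days=14, now=None):
    """True if the certificate expires in fewer than warn_days days (hypothetical threshold)."""
    return days_until_expiry(not_after, now=now) < warn_days
```

For example, `cert_is_expiring(fetch_not_after("docs.stackstorm.com"))` would be the condition that raises an alert. Domain registration expiry needs a different data source (WHOIS/RDAP) and is not covered here.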

Having said that - we need to decide how to install and manage the agent and on which hosts (just cicd or also some other host?).

Ideally we would use an infra-as-code approach for installing the agent and managing the agent config. One option would be to store the agent config in the same repo (dataset-scalyr-resources) and then pull the config down during the agent install / deploy job. Another would be to store it in the repo which contains the code (chef cookbook or whatever) to install the agent - although I would prefer the first approach, to keep all the config files in a single location.

Another thing - which email address should alerts go to? redacted@ or do we have a dedicated address for alerts?

@arm4b
Member Author

arm4b commented Mar 25, 2022

For the alerts, the #opstown Slack monitoring channel would work best, as other alerts already go there.
In the past you already played with that in the same channel:
[image]

@Kami
Member

Kami commented Mar 25, 2022

OK, so far all the host (agent) based checks for st2cicd have been ported - http://monitoring001:3000/#/client/sensu/st2cicd042.uswest2.stackstorm.net.

Which other clients / hosts do we want to port the checks for? In other words, on which other hosts do we need to install the agent and set up monitors and alerts?

@arm4b
Member Author

arm4b commented Mar 25, 2022

The st2cicd host is sufficient, we'll likely get rid of everything else.

@Kami
Member

Kami commented Mar 25, 2022

@armab I believe I migrated all the checks for st2cicd now. This includes "remote" SSL cert and domain expiry checks, but those use an agent monitor since DataSet doesn't support those remote checks natively.

It would be good if you double-checked that nothing is missing when you get a chance - https://app.scalyr.com/alerts?teamToken=BLSvhkqnK81b_wD2KhjsoQ--.

I also still need to adjust some thresholds and verify that all alerts are indeed set up correctly, i.e. that they trigger when they should.

I also set up log ingestion for StackStorm service logs in case they help us with troubleshooting. They seem to be low volume, so they shouldn't cause any log volume related issues. If they do, we can always disable them. The only exception is MongoDB: that log seems to grow like crazy, so I removed that file.
