
Create Observability Stack for Monitoring and Logging #3

Open
MooseQuest opened this issue Mar 17, 2020 · 9 comments
Assignees
Labels
k8s infra Requires work on ops-facing workloads which support k8s app

Comments

@MooseQuest
Collaborator

Generating the observability stack serves the following purposes:

  • Allows for monitoring of both the cluster and the application.
  • Identifies resource contention and scaling issues through metrics.
  • Allows developers to pinpoint errors and surface them for ops alerts.

Components to generate:

  • The data pipeline which will deliver logs from the environment and application
  • The metrics visualization
  • A data management layer, either on the cluster or providing connectivity to another component which will surface the data

Technologies and software to consider:

  • Elasticsearch
  • Grafana
  • Prometheus
  • Splunk
@MooseQuest MooseQuest self-assigned this Mar 17, 2020
@lottspot lottspot self-assigned this Mar 18, 2020
@lottspot
Contributor

Going to try out this pre-rolled stack as a starting point: https://github.com/coreos/kube-prometheus

Even if all goes well, this doesn't get us a logging stack; just metrics and monitoring.

@lottspot
Contributor

The rollout went well and we have metrics dashboards running at https://metrics.chime-live-cluster.phl.io/

The manifests used for the rollout are currently sitting in the issues/3 branch, where they will remain until the freeze on PRs to master is lifted.

themightychris added a commit that referenced this issue Mar 19, 2020
Add prometheus+grafana to k8s; refactor other infra manifests
@mariekers mariekers removed the devops label Mar 20, 2020
@lottspot lottspot added the k8s infra Requires work on ops-facing workloads which support k8s app label Mar 20, 2020
quinn-dougherty pushed a commit that referenced this issue Mar 25, 2020
ckoerber pushed a commit that referenced this issue Apr 1, 2020
@rcknplyr
Collaborator

@lottspot would we consider this completed?

@lottspot
Contributor

We don't have anything capturing logs yet, so this is technically not completed.

@MooseQuest
Collaborator Author

I'll be pushing up what we have so far onto a branch and will reference here.

@mariekers
Contributor

Would someone be interested in telling a non-devops person how this differs from #32 ?

@fxdgear
Contributor

fxdgear commented Apr 22, 2020

Just going to leave a few comments here for posterity:

I had a conversation with @MooseQuest and he told me that Elasticsearch was installed on the dev k8s cluster.

Elasticsearch was installed following the instructions here: https://www.elastic.co/guide/en/cloud-on-k8s/current/k8s-quickstart.html

For reference, the instructions there are for installing Elastic Cloud, which is a service for managing multiple Elasticsearch deployments. Think of this as https://cloud.elastic.co on prem, meaning you will have a web interface for managing multiple ES clusters. You can upgrade, manage backups, etc. It's a great service, but it might be overkill to have an Elastic Cloud service for each CHIME deployment.

My recommendation is that each deployment of CHIME have a single deployment of Elasticsearch.

To deploy Elasticsearch (and the elastic stack at large), I would recommend using the Elasticsearch Helm Charts.

Elasticsearch Helm chart requirements are:

  • Helm >=2.8.0 and <3.0.0 (see parent README for more details)
  • Kubernetes >=1.8
  • Minimum cluster requirements include the following to run this chart with default settings. All of these settings are configurable.
    • Three Kubernetes nodes to respect the default "hard" affinity settings
    • 1GB of RAM for the JVM heap

Elasticsearch, being a distributed system, operates on a high-availability model, meaning the minimum number of Elasticsearch nodes should be 3. This is why the Kubernetes cluster must have at least 3 nodes: it allows the Elasticsearch cluster to survive a Kubernetes node failure.
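
As a rough illustration of the three-node reasoning above, here is a small sketch. It assumes that electing a master needs a strict majority of the master-eligible nodes; the function names are hypothetical and not part of any Elasticsearch API.

```python
def quorum(master_eligible_nodes: int) -> int:
    """Smallest strict majority of the master-eligible nodes."""
    if master_eligible_nodes < 1:
        raise ValueError("need at least one node")
    return master_eligible_nodes // 2 + 1


def tolerated_failures(master_eligible_nodes: int) -> int:
    """How many nodes can be lost while a majority is still reachable."""
    return master_eligible_nodes - quorum(master_eligible_nodes)


# Columns: node count, majority needed, node failures survived.
for n in (1, 2, 3):
    print(n, quorum(n), tolerated_failures(n))
```

Under that assumption, one or two nodes cannot lose any member, while three nodes can lose one, which matches the "hard" anti-affinity default of one Elasticsearch pod per Kubernetes node.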

Using the helm charts also gives us the added benefit of being able to deploy:

  • elasticsearch
  • filebeat
  • metricbeat
  • kibana
  • apm-server

Of these:

  • Filebeat can be configured to read the logs from pods in the k8s cluster and ship the logs to Elasticsearch.
  • Metricbeat can be configured to collect metrics from the k8s cluster and ship them to Elasticsearch.
  • APM Server is a service that runs on the k8s cluster, accepts APM data from the various applications deployed in the cluster, and ships that data to Elasticsearch.

The benefit of having all this data going into Elasticsearch is that you can use Kibana to visualize all these different data sources in one place.

Kibana also has a "logs" app which lets you tail logs as they arrive in Elasticsearch. You can even filter on k8s labels, pod names, namespaces, etc.
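
To make the filtering idea concrete, here is a hedged sketch of an Elasticsearch search body that narrows logs to one namespace. It assumes the logs were shipped by Filebeat with Kubernetes metadata attached, so that fields named `kubernetes.namespace` and `kubernetes.pod.name` exist; the `chime` namespace, the pod prefix, and the helper name are hypothetical.

```python
import json


def logs_query(namespace: str, pod_prefix: str = "", size: int = 50) -> dict:
    """Build a search body for the newest log lines in one namespace."""
    # Assumed field names; they depend on how Filebeat is configured.
    filters = [{"term": {"kubernetes.namespace": namespace}}]
    if pod_prefix:
        filters.append({"prefix": {"kubernetes.pod.name": pod_prefix}})
    return {
        "size": size,
        "sort": [{"@timestamp": {"order": "desc"}}],
        "query": {"bool": {"filter": filters}},
    }


# Hypothetical namespace and pod prefix, for illustration only.
print(json.dumps(logs_query("chime", pod_prefix="chime-"), indent=2))
```

The Kibana logs app applies this kind of filter through its search bar, so a body like this is mostly useful for scripted checks against the same data.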

The Elastic APM service currently supports the following languages:

  • Go
  • Java
  • .NET
  • Node.js
  • Python
  • Ruby

@themightychris
Member

themightychris commented Apr 22, 2020

@fxdgear long term, we're not looking to give each deployment of CHIME its own cluster. That was a stop-gap measure to proceed quickly. Eventually, we want to have a single prod cluster hosting many civic applications including chime, alternate versions of chime, follow-up projects related to chime, and other local civic projects. We are thinking that each project would be within its own namespace.

We need an infrastructure that gets us as close as possible to each project/namespace being free-when-idle. Any cluster services that we need to deploy per project/namespace will create poor economics for us. We have very modest funding within which we need to be able to host a large number of low-traffic projects sustainably for many years. At any given time, only a small number of projects, if any, will have high traffic. It's kind of the inverse of most enterprise use cases.

Given that, would you adjust your recommendations at all?

@fxdgear
Contributor

fxdgear commented Apr 22, 2020

@themightychris Thanks for the quick response.

Given the long-term goal of a single K8s cluster with multiple namespaces, here is what I would recommend:

  • Deploy the elastic stack into its own namespace
    • APM
    • *beats
    • Elasticsearch
    • Kibana
  • Configure the *beats to read from ALL namespaces
  • Set up APM Server to run as an internal service (i.e. no ingress)
    • Configure the apps which send APM data to APM Server to use the full service name, i.e. service-name.namespace.svc.cluster.local

The end goal here (wrt the elastic stack) is a single deployment of the elastic tooling, configured in a way that lets you add and remove namespaces (i.e. various CHIME-related projects and deployments).

But you end up with a singular entity to monitor ALL your deployments.

This was not explicit in my previous comment, but the goal is that whether you end up with multiple k8s clusters or a single k8s cluster, you still only need a single elastic stack deployment per k8s cluster.

This strategy will scale regardless.
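
For the internal APM Server address mentioned in the list above, a minimal sketch of how an app could derive it. The service name `apm-server` and namespace `elastic` are hypothetical, `cluster.local` is assumed to be the cluster domain, and 8200 is assumed to be left at APM Server's default listen port.

```python
def service_fqdn(service: str, namespace: str,
                 cluster_domain: str = "cluster.local") -> str:
    """Full in-cluster DNS name of a Kubernetes Service."""
    return f"{service}.{namespace}.svc.{cluster_domain}"


def apm_server_url(service: str, namespace: str, port: int = 8200) -> str:
    """URL an APM agent in any namespace could use to reach APM Server."""
    return f"http://{service_fqdn(service, namespace)}:{port}"


# Hypothetical service and namespace names, for illustration only.
print(apm_server_url("apm-server", "elastic"))
```

Using the full name rather than the short service name matters here because the apps live in other namespaces than the elastic stack.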


On another note, depending on the volume of logs/metrics, you may or may not run out of disk space for storing data in Elasticsearch. There are a couple of ways to handle this.

If you have a policy on the length of time you are required (or want) to store logs, you can do any of the following:

  1. Increase the disk size of your PVC to account for the amount of data you need to store.
  2. Schedule snapshots of the data to store outside the cluster.
  3. Use rollups (basically storing older data with lower fidelity).
  4. Finally, use ILM (Index Lifecycle Management) to automate a lot of this and ensure your disks don't fill up with stale data.
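
As an illustration of the ILM option, here is a sketch that builds the body of a simple lifecycle policy: roll the write index over while it is "hot", then delete indices once they pass a retention age. The ages and sizes are placeholder assumptions, not recommendations, and the helper name is hypothetical.

```python
import json


def retention_policy(rollover_age: str = "7d",
                     rollover_size: str = "20gb",
                     delete_after: str = "30d") -> dict:
    """Build an ILM policy body with a hot phase and a delete phase."""
    return {
        "policy": {
            "phases": {
                # Start a new index once the current one is old or large enough.
                "hot": {
                    "actions": {
                        "rollover": {
                            "max_age": rollover_age,
                            "max_size": rollover_size,
                        }
                    }
                },
                # Drop indices once they are older than the retention period.
                "delete": {
                    "min_age": delete_after,
                    "actions": {"delete": {}},
                },
            }
        }
    }


# Placeholder values; pick them from your actual retention requirements.
print(json.dumps(retention_policy(), indent=2))
```

A body like this would be sent to Elasticsearch's ILM policy endpoint (`PUT _ilm/policy/<policy-name>`) and then referenced from the index template used by the beats.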


6 participants