GNIP-50: GeoNode monitoring #3137

cezio · 2017-06-22T15:47:54Z

GNIP: geonode monitoring

Overview

GeoNode monitoring is an infrastructure for extracting and presenting information on an installation's health status and on resource (layers, maps, documents) usage. Monitoring is an additional Django/GeoNode application which will:

collect data (events) from software components
calculate usage statistics (metrics)
display them in a human-friendly form
provide a way to set fault thresholds and send alerts when those thresholds are reached

GeoNode monitoring is not limited to plain GeoNode: it will also collect data from accompanying GeoServer instances, and hardware resource usage data from the operating system.

Proposed by

Cezary Statkiewicz GeoSolutions

Assigned to release

None yet.

Motivation

GeoNode lacks information on resource usage and system health, which is a problem whenever operators want insight into a running system. This was formulated as a significant problem by GFDRR’s Innovation Lab, which, through the Open Data for Resilience Initiative, has assisted in the creation of National and Regional Geospatial data sharing platforms since 2010. Many of these platforms were deployed outside formal data centers and have administrators with other responsibilities unrelated to GeoNode. Some real problems raised:

Some services become unavailable with time, requiring follow up with the hosting institutions.
Given that GeoNode allows any registered user to upload datasets, some GeoNodes may contain layers that are partially configured and not accessible.
In order to know what resources the public finds most valuable, we need to understand whether or not a GeoNode in a particular location is being actively used: new layers are being added or modified, new users are registering on the site, recurring users are viewing or downloading data, the origin of the users visiting the site, etc.
When errors occur on a GeoNode they are logged in different files but the administrator may not have enough experience to diagnose and fix issues. It is desirable to send exceptions and errors back to a central registry where they can be categorized and studied by consultants helping the administrator.
Since most GeoNodes start with only a handful of users, we need tools that track performance metrics (hard drive usage, memory usage, CPU usage, bandwidth usage, statistics on the time of HTTP responses, etc.) to help identify when hardware upgrades are needed.

Technically, the data needed to deduce such information could be extracted in various ways (client-side analytics, log parsing, external monitoring), but each has its drawbacks, and none would show the full picture. Also, existing or previous attempts (GeoHealthCheck, geonode-monitor) are quite incomplete and focus mostly on measuring external visibility/state only.

This proposal introduces a contrib monitoring application, which would provide insights into actual usage of data and perform health checks of the underlying system. The application should be optional, although there are a few integration points in GeoNode core.

Note that this is not a replacement for full-fledged monitoring systems like Zabbix or Nagios. GeoNode monitoring is a simplified solution, especially from the user's perspective. However, while GeoNode monitoring can work in stand-alone mode, it could also be integrated with 3rd party systems as a data source (not covered by this GNIP).

Proposal

GN monitoring has two main tasks:

collecting data from probes,
calculating usage statistics and presenting them.

Collecting data starts with recording each request with its context: besides the basic HTTP context, it should also contain information about the resources used, the service that was used, more detailed information on the client, etc. A similar data structure is already available in GeoServer.

Data collection may be implemented in several ways. By default, data will be pulled from probes, although there should also be a way for probes to push data to the collector. Monitoring should also be ready to handle data exchange through AMQP, which will be the future default way of handling notifications. Data collection can be performed periodically or continuously in real time.

Statistics are calculated periodically and aggregated into fixed-length periods. Aggregated data would contain general statistics and per-resource statistics, so the presentation layer can present system status from the overview down to layer level without much recalculation.

Architecture overview

Monitoring is composed of several components, described below. Note that these are logical units. The code should reside in the geonode.contrib.monitoring module as a Django application.

GeoNode probes

Probes are points of integration in GeoNode core which record its activity. They are built with:

middleware - requests are marked with start/end times and, after the view is processed, the request is recorded to the database with context information,
views - core views for layers, maps and documents should mark the request with the resources it affects.
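
As an illustration of the middleware/view split described above, here is a minimal sketch in plain Python, following the callable-class shape used by Django middleware. The class name `MonitoringMiddleware`, the `monitoring_resources` attribute and the list-based `sink` (standing in for the database write) are assumptions made for this example, not the names used in the actual implementation.

```python
import time


class MonitoringMiddleware:
    """Hypothetical request-recording middleware (illustrative names only)."""

    def __init__(self, get_response, sink=None):
        self.get_response = get_response
        # `sink` stands in for the database; here it is a plain list.
        self.sink = sink if sink is not None else []

    def __call__(self, request):
        # Views append the resources they touch to this list.
        request.monitoring_resources = []
        started = time.time()          # wall-clock start, for reporting
        tick = time.monotonic()        # monotonic clock, for the duration
        try:
            return self.get_response(request)
        finally:
            # Record the request with its context once the view has run.
            self.sink.append({
                "path": getattr(request, "path", None),
                "method": getattr(request, "method", None),
                "started": started,
                "duration": time.monotonic() - tick,
                "resources": list(request.monitoring_resources),
            })


def layer_detail(request):
    """Hypothetical core view that marks the request with the layer it serves."""
    request.monitoring_resources.append(("layer", "geonode:roads"))
    return "layer page"
```

In this sketch the view only annotates the request; recording happens in one place, in the middleware, so views do not need to know how or where the data is stored.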

GeoServer probes

GeoServer provides a Monitoring/Audit API, which can be used for this purpose. GeoServer improvements will be handled outside this GNIP.

System-level probes

GeoNode monitoring can collect system-level data (CPU usage, memory usage, disk usage). System-level data can be extracted by reading system indicators from the GeoNode and GeoServer processes and exposing them with the Status API in GeoServer. GeoNode would have an Expose API, a set of views which present system-level data as of the moment of the request.
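
To make the idea of a host-type snapshot concrete, the sketch below gathers a few system indicators using only the Python standard library. The function name `host_snapshot` and the dictionary keys are assumptions for illustration; memory usage is left out because the standard library has no portable call for it, and a real probe would likely rely on a dedicated library.

```python
import os
import shutil
import time


def host_snapshot(path="."):
    """Return system-level indicators valid at the moment of the call."""
    disk = shutil.disk_usage(path)
    try:
        # Load average is only available on Unix-like systems.
        load_1m = os.getloadavg()[0]
    except (AttributeError, OSError):
        load_1m = None
    return {
        "timestamp": time.time(),
        "cpu_count": os.cpu_count(),
        "load_1m": load_1m,
        "disk_total_bytes": disk.total,
        "disk_used_bytes": disk.used,
        "disk_free_bytes": disk.free,
    }
```

A view in the Expose API could return such a dictionary serialized as JSON, so the collector sees values valid at the time of the request.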

Collector

This is the core element of monitoring, because it connects both main functions. The collector provides the following facilities:

receive or acquire raw data from probes, in any of the following ways:
- actively query probes for data, over HTTP or another transport (pull),
- expose a view reachable from probes, to which they report their data (push),
- hybrid: using the upcoming AMQP infrastructure, probes publish events to a queue and the collector acts as a consumer, collecting them from the broker,
normalize, calculate and store metrics,
expose metrics for the status UI and notifications.

The collector can be run as:

a periodic command, pulling data from probes (implemented as collect_metrics),
a long-running process (as an AMQP consumer),
a view, passively receiving data from probes.
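
The pull and push modes listed above can share one entry point for incoming events. The following is a minimal sketch under that assumption; the `Collector` class, its method names and the in-memory `events` list are hypothetical and only illustrate the flow, with probes modelled as plain callables instead of HTTP endpoints.

```python
from typing import Callable, Dict, Iterable, List

Event = Dict[str, object]


class Collector:
    """Hypothetical collector supporting both pull and push."""

    def __init__(self) -> None:
        self.probes: Dict[str, Callable[[], Iterable[Event]]] = {}
        self.events: List[Event] = []  # stands in for persistent storage

    def register_probe(self, service: str, fetch: Callable[[], Iterable[Event]]) -> None:
        self.probes[service] = fetch

    def receive(self, service: str, event: Event) -> None:
        """Push mode: a probe (or a queue consumer) hands over one event."""
        normalized = dict(event)
        normalized["service"] = service
        self.events.append(normalized)

    def pull(self) -> int:
        """Pull mode: query every registered probe, return the event count."""
        received = 0
        for service, fetch in self.probes.items():
            for event in fetch():
                self.receive(service, event)
                received += 1
        return received
```

A periodic command would call `pull()` on a schedule, while a view or an AMQP consumer would call `receive()` for each incoming event.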

Dashboard/Status UI

(Note: these are designs, not the actual implementation.)

main view:

list of captured exceptions

exception details

notifications configuration

response statistics

resources statistics

Status UI is a set of views and a client-side application that present metrics. The user should get the main indicators in simplified form (to judge whether the system is working properly), and have more detailed data a few clicks away. Status UI should also provide a way to configure notifications and the collector.

Notifications

Monitoring Notifications shouldn't be confused with the GeoNode notifications app, which is a separate entity. However, Monitoring Notifications will use general notifications as a backend for sending alerts. The user should be able to configure thresholds for certain indicators, each of which can consist of several metrics. Notifications will check the metrics for each indicator after each metrics calculation, and send alerts when alarm conditions are met.
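
A threshold check of the kind described above could look like the sketch below. The `Threshold` class, the `check_indicator` function and the metric names `cpu.usage` and `response.time` are assumptions chosen for this example; sending the alert itself is left to the notifications backend.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Threshold:
    """Allowed range for one metric; either bound may be left unset."""
    metric: str
    min_value: Optional[float] = None
    max_value: Optional[float] = None

    def is_breached(self, value: float) -> bool:
        if self.min_value is not None and value < self.min_value:
            return True
        if self.max_value is not None and value > self.max_value:
            return True
        return False


def check_indicator(name: str, thresholds: List[Threshold],
                    metrics: Dict[str, float]) -> List[str]:
    """Return one alert message per breached threshold of an indicator."""
    alerts = []
    for threshold in thresholds:
        value = metrics.get(threshold.metric)
        if value is None:
            continue  # metric was not calculated for this period
        if threshold.is_breached(value):
            alerts.append(f"{name}: {threshold.metric}={value} outside allowed range")
    return alerts


health = [Threshold("cpu.usage", max_value=90.0),
          Threshold("response.time", max_value=2.0)]
alerts = check_indicator("health", health,
                         {"cpu.usage": 95.0, "response.time": 0.4})
```

Here `alerts` holds a single message for `cpu.usage`, since only that metric exceeds its configured maximum.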

Beacon

Beacon is an API that exposes the current status of GeoNode for external monitoring.

Data model

Collected data

There are different types of probes, providing different data. Two base types are distinguished: service type and host type. A service type probe provides a stream of events from a service (GeoNode, GeoServer). The stream can contain past data or be provided in real time. A host type probe provides data for the current moment only.

The collector will get the following data from probes:

request with context (client location, affected resources, timing, errors),
exception information for errors that occurred during request processing,
system-level data.

Data will be aggregated and stored in fixed-length periods. For near-present data, periods should be 1-5 minutes; for older data, periods could be longer.
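
Assigning an event to a fixed-length period amounts to rounding its timestamp down to a period boundary. The sketch below shows one way to do it, assuming periods are aligned to the Unix epoch in UTC; the function name `period_for` is hypothetical.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def period_for(moment, length=timedelta(minutes=1)):
    """Return (valid_from, valid_to) of the fixed-length period containing `moment`."""
    seconds = int(length.total_seconds())
    elapsed = int((moment - EPOCH).total_seconds())
    start = EPOCH + timedelta(seconds=elapsed - elapsed % seconds)
    return start, start + length


event_time = datetime(2017, 6, 22, 15, 47, 54, tzinfo=timezone.utc)
valid_from, valid_to = period_for(event_time, timedelta(minutes=5))
# valid_from is 15:45:00 UTC and valid_to is 15:50:00 UTC on 2017-06-22
```

Older data could be re-aggregated by calling the same function with a longer `length`, for example one hour.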

Metrics

A metric is an aggregated value for a specific indicator. There are three types of metrics:

value (where we store a value and count its occurrences within a period of time, for example: request method)
rate (where we store the average rate within a period of time, for example: net interface tx/rx transfer rates)
count (where we count occurrences within a specific period of time, for example: error count, net interface tx/rx bytes for a given period)

While the metric types seem similar, they are handled differently when they are aggregated in the API.
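
The proposal does not spell out how each type is aggregated, so the following is only a sketch of one plausible reading: counts are summed, rates are averaged weighted by period length, and values are merged per distinct value. All function names are hypothetical.

```python
from collections import Counter


def merge_count(samples):
    """Count metrics: totals over adjacent periods simply add up."""
    return sum(samples)


def merge_rate(samples):
    """Rate metrics: average weighted by period length.

    `samples` is a list of (rate, period_seconds) pairs.
    """
    total_seconds = sum(seconds for _, seconds in samples)
    if not total_seconds:
        return 0.0
    return sum(rate * seconds for rate, seconds in samples) / total_seconds


def merge_value(samples):
    """Value metrics: add up occurrences per distinct value.

    `samples` is a list of (value, occurrences) pairs.
    """
    merged = Counter()
    for value, occurrences in samples:
        merged[value] += occurrences
    return dict(merged)
```

Under these assumptions, summing two rates or averaging two counts would give wrong results, which is why the type has to be known at aggregation time.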

A metric has several main properties:

a valid period (valid_from, valid_to),
the service for which it is calculated,
a name, like request.ip or request.count (defined in the MetricName model),
a numeric value.

Additionally, a metric can be associated with:

a specific resource (layer, map, document),
an OWS service type (names stored in the OWSService model),
a free-text label.

This metric organization allows different levels of granularity (per-service, per-metric, per-resource, etc.) and further aggregation (increased intervals, aggregating the total request count from the sum of requests to specific resources, etc.).
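
The properties listed above map naturally onto a flat record. The sketch below uses a plain dataclass instead of a Django model; only `valid_from`, `valid_to` and the metric name `request.count` come from this proposal, while the class name `MetricValue`, the remaining field names and the `total` helper are assumptions for illustration.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, Optional


@dataclass(frozen=True)
class MetricValue:
    valid_from: datetime
    valid_to: datetime
    service: str
    name: str
    value: float
    resource: Optional[str] = None      # e.g. a layer, map or document
    ows_service: Optional[str] = None   # e.g. "WMS"
    label: Optional[str] = None         # free-text label


def total(metrics: Iterable[MetricValue], name: str, service: str) -> float:
    """Aggregate per-resource values into one per-service total."""
    return sum(m.value for m in metrics
               if m.name == name and m.service == service)
```

With per-resource rows stored this way, the total request count for a service is the sum of `request.count` over its resources, as described above.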

Errors

Errors captured by GN or GS are stored along with request details and are exposed through a dedicated API endpoint. Error information contains:

error class
error message
stack trace
request context
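
The four pieces of error information can be extracted from a Python exception with the standard library alone. This is a sketch for the GeoNode side only; the function name `describe_error` and the dictionary keys are assumptions.

```python
import traceback


def describe_error(exc, request_context):
    """Build a storable record from an exception and its request context."""
    cls = type(exc)
    return {
        "error_class": f"{cls.__module__}.{cls.__qualname__}",
        "message": str(exc),
        "stack_trace": "".join(
            traceback.format_exception(cls, exc, exc.__traceback__)
        ),
        "request": dict(request_context),
    }


try:
    {}["missing"]
except KeyError as error:
    record = describe_error(error, {"path": "/layers/", "method": "GET"})
```

Storing the stack trace as text keeps the record independent of the process that raised the error, so it can be served later through the API.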

Monitoring API

Detailed API description: https://github.com/geosolutions-it/geonode/wiki/Monitoring:-API


afabiani · 2017-06-26T09:05:57Z

+1

capooti · 2017-06-27T21:12:34Z

Great. Have also a look at Hypermap: https://github.com/cga-harvard/HHypermap

We use it to track health check of thousands of services and layers, including our GeoNode instance (WorldMap). Here is our live instance: http://hh.worldmap.harvard.edu/

For example here is the situation for WorldMap: http://hh.worldmap.harvard.edu/registry/hypermap/service/2a96b71c-96b2-4432-b31f-219c45f3fc52/

cezio · 2017-06-29T14:52:30Z

@capooti thanks. looks interesting, but correct me if i'm wrong here: this is just external visibility check, right?

capooti · 2017-06-29T20:09:03Z

We test services and layers using OWSLib and ArcREST.
Test for a service (and time response) is done getting the capability document.
Test for a layer (and time response) is done with a GetMap (or similar for Arc REST Services)

safezpa · 2017-07-02T13:13:57Z

I think ELK may be another nice solution for monitoring.

cezio · 2017-10-31T13:47:10Z

code merged in, closing

afabiani added the enhancement and gnip (A GeoNodeImprovementProcess Issue) labels Jun 26, 2017

cezio mentioned this issue Aug 23, 2017

[monitoring] GNIP for monitoring geosolutions-it/geonode#209

Closed

cezio mentioned this issue Sep 26, 2017

Monitoring #3322

Merged

cezio closed this as completed Oct 31, 2017

afabiani changed the title ~~GNIP: GeoNode monitoring~~ GNIP-50: GeoNode monitoring Aug 22, 2019
