Metrics #18

Open
novoj opened this issue Feb 24, 2023 · 15 comments · Fixed by #504

@novoj
Collaborator

novoj commented Feb 24, 2023

Introduce metrics into evitaDB. The metrics servlet should start as a separate API on a different port (or as part of the system API). Although we are used to the Prometheus API, we should analyze other options, namely OpenTelemetry.

Metrics proposals

System metrics

  • JVM metrics
  • Liveness probe metrics
  • Readiness probe metrics (with labeled API)
  • Java errors counter
  • evitaDB InternalError counter

Storage metrics

Transactions

  • number of active transactions
  • age of transaction in seconds
  • number of commits
  • number of rollbacks
  • age of oldest active transaction in seconds
  • number of transactions writing to disk
  • oldest WAL record age in seconds (write history)
  • number of WAL records to process
  • number of WAL records processed
  • latency between transaction commit and finalization
  • latency of transaction stage execution time
  • WAL off-heap memory size (+ used off-heap memory)
  • transaction queue lag - for each stage

Storage

Per collection

  • offset index size (memory)
  • offset index waste size (disk)
  • offset index active dataset size (disk)
  • number of opened file handles
  • non-flushed record count
  • non-flushed record size in Bytes
  • max record size in bytes
  • record count
  • record count per record type
  • offset index size - total (disk)
  • oldest record age in seconds (read history)
  • compaction occurrences
  • compaction time in seconds

Per catalog

  • collection count
  • offset index size (memory) - ∑ of entity collection
  • offset index waste size (disk) - ∑ of entity collection
  • offset index active dataset size (disk) - ∑ of entity collection
  • number of opened file handles - ∑ of entity collection + WAL + file transactions
  • record count - ∑ of entity collection
  • offset index size - total (disk) - ∑ of entity collection
  • total folder size - total (disk)
  • oldest catalog header version in seconds (read history)
  • compaction occurrences - ∑ of all
  • compaction time in seconds - ∑ of all
  • number of cached opened outputs (ObservableOutputKeeper)
  • number of WAL cached locations (CatalogWriteAheadLog)

Per instance

  • catalog count
  • offset index size (memory) - ∑ of catalogs
  • offset index waste size (disk) - ∑ of catalogs
  • offset index active dataset size (disk) - ∑ of catalogs
  • number of opened file handles - ∑ of catalogs
  • record count - ∑ of catalogs
  • offset index size - total (disk) - ∑ of catalogs
  • total folders size - total (disk) - ∑ of catalogs
  • compaction occurrences - ∑ of all compactions
  • compaction time in seconds - ∑ of all compactions
  • number of cached opened outputs (ObservableOutputKeeper) - ∑ of catalogs
  • number of WAL cached locations (CatalogWriteAheadLog) - ∑ of catalogs

Engine metrics

Queries

  • query process time (tag: catalog, collection)
  • query complexity (tag: catalog, collection)
  • query records returned (tag: catalog, collection)
  • query records fetched from disk - count (tag: catalog, collection)
  • query records fetched from disk - size in Bytes (tag: catalog, collection)
  • active sessions (tag: catalog)
  • sessions killed (tag: catalog)
  • queries per session
  • age of sessions in seconds
  • age of oldest session in seconds

Per instance

  • query process time (tag: catalog, collection) - ∑ of catalogs
  • query complexity (tag: catalog, collection) - ∑ of catalogs
  • query records returned (tag: catalog, collection) - ∑ of catalogs
  • active sessions (tag: catalog) - ∑ of catalogs
  • executor threads
  • executor used threads (tag: process name, catalog)
  • executor thread execution time (tag: process name, catalog)

Cache

  • cache size in Bytes
  • duration of cache re-evaluation
  • number of records in cache
  • number of records per type in cache
  • size of records in Bytes
  • size of records in Bytes per type in cache
  • cache adepts weighted and found wanting
  • cache adepts elevated to records
  • cache records in cool-down
  • cache records surviving
  • cache records overall complexity per type
  • cache hits
  • cache misses
  • cache records initialized
  • anteroom record count
  • anteroom cycles wasted

Web API metrics

  • active requests (tag: catalog, tag: API[gRPC,REST,GraphQL])
  • request count (+ requests per second, tag: catalog, API[gRPC,REST,GraphQL], result:[TIMEOUT, ERROR, OK])
  • thread pool size
  • used threads
  • ingress Bytes (tag: catalog, API[gRPC,REST,GraphQL])
  • egress Bytes (tag: catalog, API[gRPC,REST,GraphQL])
  • query process time API overhead (tag: catalog, collection, API[gRPC,REST,GraphQL])
  • query process time with API overhead (tag: catalog, collection, API[gRPC,REST,GraphQL])
  • JSON query deserialization into internal structures (tag: catalog, collection, API[gRPC, REST, GraphQL])
  • query require constraints reconstruction time (tag: catalog, collection, API[GraphQL])
  • input data deserialization time (tag: catalog, collection, API[gRPC,REST, GraphQL])
  • API schema building time - new and refresh (tag: catalog, API[REST, GraphQL])
  • API refresh count (tag: catalog, API[REST, GraphQL])
  • API schema DSL lines count (tag: catalog, API[REST, GraphQL]) ??
  • number of API endpoints (tag: catalog, API[REST, GraphQL]) ??
  • number of gRPC messages (sent / received, with status ok / error / canceled)
  • number of gRPC messages per second (tag: catalog, methodName)
  • latency of gRPC messages (histogram)
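
The tagged metrics above (for example the request count with catalog, API and result tags) could be modelled roughly as follows. This is only a sketch under assumptions: the class name `TaggedRequestCounter` and the metric name `sketch_api_requests_total` are made up for illustration, and label values are not escaped, which real code would have to handle.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentSkipListMap;
import java.util.concurrent.atomic.LongAdder;

/** Hypothetical sketch: a request counter with one series per (catalog, api, result) tag combination. */
final class TaggedRequestCounter {
    // A sorted map keeps the rendered output in a stable order between scrapes.
    private final Map<String, LongAdder> series = new ConcurrentSkipListMap<>();

    /** Increments the series identified by the three tag values (values are assumed to need no escaping). */
    void increment(String catalog, String api, String result) {
        String labels = "catalog=\"" + catalog + "\",api=\"" + api + "\",result=\"" + result + "\"";
        series.computeIfAbsent(labels, key -> new LongAdder()).increment();
    }

    /** Renders all series in the Prometheus text exposition format. */
    String render() {
        StringBuilder sb = new StringBuilder("# TYPE sketch_api_requests_total counter\n");
        series.forEach((labels, count) ->
            sb.append("sketch_api_requests_total{").append(labels).append("} ")
              .append(count.sum()).append('\n'));
        return sb.toString();
    }
}
```

Keeping the tags as labels of a single metric, rather than encoding them into the metric name, is what lets a dashboard filter or sum by catalog or API later.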
@novoj novoj self-assigned this Feb 24, 2023
@novoj novoj added the enhancement New feature or request label Feb 24, 2023
@novoj novoj added this to the Alpha milestone Feb 24, 2023
@novoj novoj removed this from the Alpha milestone Jul 18, 2023
@smejdil

smejdil commented Sep 20, 2023

I will be happy to help and build Zabbix template "evitaDB by Prom"

@novoj
Collaborator Author

novoj commented Sep 20, 2023

> I will be happy to help and build Zabbix template "evitaDB by Prom"

We'll get in touch before we start working on this issue. The ETA is December 2023 / January 2024.

@novoj
Collaborator Author

novoj commented Dec 7, 2023

@novoj
Collaborator Author

novoj commented Dec 12, 2023

Interesting slide - three pillars of observability:

image

I think it might be beneficial to provide basic access to all three of them in evitaLab.

@novoj
Collaborator Author

novoj commented Dec 13, 2023

I'd suggest creating a prototype where:

@novoj
Collaborator Author

novoj commented Dec 13, 2023

It would also be interesting to test https://www.jaegertracing.io/ and its integration into https://grafana.com/docs/grafana/latest/datasources/jaeger/ - it's somewhat similar to our #148, and we should discuss whether it makes sense to move toward a standard instead of our proprietary solution (the principle should be very similar, so it shouldn't be hard to migrate).

@novoj
Collaborator Author

novoj commented Dec 13, 2023

This should help us too: https://plugins.jetbrains.com/plugin/20937-java-jfr-profiler

@smejdil

smejdil commented Dec 16, 2023

> I'd suggest creating a prototype where:
>
> * try to implement example JFR events according to blog post: https://www.morling.dev/blog/rest-api-monitoring-with-custom-jdk-flight-recorder-events/
> * create event that covers evitaDB `QueryPlan` execution
> * try to record / stream events and visualize them
> * try to filter them by predefined template (e.g. collection type for example)
> * try to integrate them to metrics: https://opentelemetry.io/docs/instrumentation/java/ and measure the slowdown of the system (I'd like to integrate directly with OpenTelemetry and avoid MicroProfile Metrics)
> * open servlets for Prometheus scraping
> * create example dashboard in Grafana

If evitaDB is able to expose metrics in the Prometheus format, it is possible to get them into Zabbix using https://www.zabbix.com/documentation/current/en/manual/config/items/itemtypes/prometheus. It is also possible to use the LLD (low-level discovery) technique for some schema instances etc.
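
For the `QueryPlan` item in the prototype list quoted above, a custom JFR event could look roughly like the sketch below. It uses only the JDK's `jdk.jfr` API. The event name, its fields and the `executeTraced` helper are assumptions made for illustration, and the `IntSupplier` merely stands in for real query execution. It is not evitaDB's actual event class.

```java
import java.util.function.IntSupplier;
import jdk.jfr.Category;
import jdk.jfr.Event;
import jdk.jfr.Label;
import jdk.jfr.Name;

/** Hypothetical sketch of a custom JFR event wrapped around a query execution. */
final class QueryPlanJfrSketch {

    /** Custom event; the name and all fields are assumptions made for this example. */
    @Name("sketch.QueryPlanExecuted")
    @Label("Query plan executed")
    @Category({"Sketch", "Query"})
    static class QueryPlanExecutedEvent extends Event {
        @Label("Catalog")
        String catalog;
        @Label("Collection")
        String collection;
        @Label("Records returned")
        int recordsReturned;
    }

    /** Runs the supplied query (a stand-in for QueryPlan execution) and records one event around it. */
    static int executeTraced(String catalog, String collection, IntSupplier query) {
        QueryPlanExecutedEvent event = new QueryPlanExecutedEvent();
        event.catalog = catalog;
        event.collection = collection;
        event.begin(); // marks the start time of the event
        try {
            int returned = query.getAsInt();
            event.recordsReturned = returned;
            return returned;
        } finally {
            // Ends the event; it is written only when a running recording has this event enabled.
            event.commit();
        }
    }

    public static void main(String[] args) {
        System.out.println(executeTraced("evita", "Product", () -> 3));
    }
}
```

Without an active recording the `commit()` call writes nothing. Starting the JVM with the `-XX:StartFlightRecording` option, or consuming events through `jdk.jfr.consumer.RecordingStream` (JDK 14 and later), would be one way to try the record / stream step.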

@novoj
Collaborator Author

novoj commented Jan 4, 2024

Notes from first prototype showdown and what needs to be added into the prototype:

  • how to filter which JFR events get generated (= stored)
    • maybe prepare an alternative to JvmMetrics for evitaMetrics
    • check how to make the JFR Event "enabled" setting work properly
  • try to see if some JVM events can be "disabled" - e.g. the flamegraph must be quite demanding?!
  • find out how to plug in the OpenTelemetry abstraction
  • metrics will have their own API

@novoj
Collaborator Author

novoj commented Jan 5, 2024

We've been advised by Láďa Prskavec to stick to Prometheus metrics and not to use OTEL for database monitoring purposes. The recommendation was:

  • expose metrics via Prometheus endpoint
  • log data including "tracing information" - i.e. client id + request id to logs in standardized format

OTEL is then used on the SRE side to integrate multiple vendors together.

As a localhost tracing viewer, we've been advised to use https://github.com/CtrlSpice/otel-desktop-viewer, and for a shared Grafana service https://grafana.com/oss/tempo/
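
The second recommendation (logging the client id and request id in a standardized format) might be sketched as below. The comment above does not say which format is meant, so the `key=value` layout, the field names and the `TraceLogSketch` class are assumptions made purely for illustration.

```java
import java.time.Instant;

/** Hypothetical sketch: one log line carrying tracing information as key=value pairs. */
final class TraceLogSketch {

    /** Formats a log line with the client id and request id; quotes in the message are escaped. */
    static String logLine(Instant timestamp, String level, String clientId, String requestId, String message) {
        return "ts=" + timestamp
            + " level=" + level
            + " client_id=" + clientId
            + " request_id=" + requestId
            + " msg=\"" + message.replace("\"", "\\\"") + "\"";
    }

    public static void main(String[] args) {
        System.out.println(logLine(Instant.now(), "INFO", "client-1", "req-42", "query executed"));
    }
}
```

With the same two ids present on every line, log entries for one request can be grouped on the SRE side without the database depending on a tracing library.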

Khertys pushed a commit that referenced this issue Jan 23, 2024
Khertys pushed a commit that referenced this issue Jan 30, 2024
…ules which could be enabled via linking the libraries to the project
Khertys pushed a commit that referenced this issue Jan 30, 2024
novoj added a commit that referenced this issue Jan 31, 2024
Refactoring during team status.
Khertys pushed a commit that referenced this issue Feb 1, 2024
…ded order to ExternalApiProviderRegistrar.java to ensure that ObservabilityProviderRegistrar loads before GrpcProviderRegistrar
novoj added a commit that referenced this issue Feb 6, 2024
Introduce metrics into the evitaDB. The servlet for metric should start as separate API on different port (or part of a system API). Although we are used to Prometheus API, we should analyze different options - namely Open Telemetry.
@novoj
Collaborator Author

novoj commented Feb 6, 2024

A very minimal set of metrics is published at: http://demo.evitadb.io:5557/observability/metrics ... when we get the prototype up and running, we'll expand the metrics list according to the set defined in the issue header.
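
A scrape endpoint of this shape can be approximated with the JDK's built-in `com.sun.net.httpserver.HttpServer`. This is only a sketch under assumptions: the class name, the `sketch_transaction_commits_total` metric and the `COMMITS` counter are hypothetical, only the `/observability/metrics` path is taken from the URL above, and nothing here reflects how evitaDB actually implements its observability API.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical sketch: a JDK-only HTTP endpoint serving one counter in the Prometheus text format. */
final class MetricsEndpointSketch {
    /** Example counter; application code would increment it on each commit. */
    static final AtomicLong COMMITS = new AtomicLong();

    /** Renders the current counter value with its HELP and TYPE comments. */
    static String render() {
        return "# HELP sketch_transaction_commits_total Number of committed transactions\n"
            + "# TYPE sketch_transaction_commits_total counter\n"
            + "sketch_transaction_commits_total " + COMMITS.get() + "\n";
    }

    /** Starts the server on the given port (0 picks a free port) and returns it so the caller can stop it. */
    static HttpServer start(int port) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress(port), 0);
        server.createContext("/observability/metrics", exchange -> {
            byte[] body = render().getBytes(StandardCharsets.UTF_8);
            exchange.getResponseHeaders().add("Content-Type", "text/plain; version=0.0.4; charset=utf-8");
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(body);
            }
        });
        server.start();
        return server;
    }
}
```

Calling `MetricsEndpointSketch.start(5557)` would serve the counter on a port separate from the other APIs, and `stop(0)` on the returned server shuts it down again.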

@lukashornych
Collaborator

We need to update the Monitor docs https://evitadb.io/documentation/operate/monitor with the new IDs.

@novoj
Collaborator Author

novoj commented Feb 6, 2024

We probably don't need to suffix metric names with the type - it's already visible in the metrics output as a comment:

# HELP io_evitadb_core_metric_event_query_plan_step_executed_event_timegauge Time taken
# TYPE io_evitadb_core_metric_event_query_plan_step_executed_event_timegauge gauge
io_evitadb_core_metric_event_query_plan_step_executed_event_timegauge 72500.0
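
To illustrate the point, the sketch below builds a metric name without the type suffix and still emits the type in the `# TYPE` comment. The way the name is derived here (event class name converted to snake case plus the field name) is only a guess based on the sample above, and `MetricNameSketch` is a hypothetical helper, not evitaDB code.

```java
import java.util.Locale;

/** Hypothetical sketch: metric naming without a type suffix. */
final class MetricNameSketch {

    /** Converts a fully qualified event class name and a field name into a snake_case metric name. */
    static String metricName(String eventClassName, String field) {
        String snake = eventClassName
            .replaceAll("([a-z0-9])([A-Z])", "$1_$2") // split camelCase words
            .replace('.', '_')                        // package separators become underscores
            .toLowerCase(Locale.ROOT);
        return snake + "_" + field;
    }

    /** Renders one sample; the metric type appears only in the TYPE comment. */
    static String render(String name, String help, String type, double value) {
        return "# HELP " + name + " " + help + "\n"
            + "# TYPE " + name + " " + type + "\n"
            + name + " " + value + "\n";
    }

    public static void main(String[] args) {
        String name = metricName("io.evitadb.core.metric.event.QueryPlanStepExecutedEvent", "time");
        System.out.print(render(name, "Time taken", "gauge", 72500.0));
    }
}
```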

novoj added a commit that referenced this issue May 26, 2024
novoj added a commit that referenced this issue May 26, 2024
novoj added a commit that referenced this issue May 26, 2024
novoj added a commit that referenced this issue May 27, 2024
novoj added a commit that referenced this issue May 27, 2024
Unfortunately, Prometheus doesn't propagate native histograms in the text format (see prometheus/prometheus#11265), only in the Protobuf format, which is not easily scrapeable.
novoj added a commit that referenced this issue May 27, 2024
@novoj
Collaborator Author

novoj commented May 27, 2024

The initial version of metrics is done and was released in version 2024.7:

Image

The issue won't be closed since there are still some metrics missing and also we need to properly document them.

novoj added a commit that referenced this issue Jun 5, 2024
novoj added a commit that referenced this issue Jun 7, 2024
(cherry picked from commit 07453ab)
lukashornych added a commit that referenced this issue Jun 11, 2024
feat(#18): GraphQL API JFR events and metrics
lukashornych added a commit that referenced this issue Jun 17, 2024
lukashornych added a commit that referenced this issue Jun 17, 2024
feat(#18): REST and GraphQL API metrics improvements
lukashornych added a commit that referenced this issue Jun 17, 2024
feat(#18): add OpenAPI operation ID to REST metrics to distinguish requests
lukashornych added a commit that referenced this issue Jun 17, 2024
feat(#18): count REST endpoints metric
@novoj
Collaborator Author

novoj commented Jun 24, 2024

Most of the metrics are done by now. We also have a nice-looking dashboard in Grafana that's becoming useful. I'm postponing closing this issue; the rest will be finalized later. We still need to:

  • visualize cache metrics, but this depends on Cache inconsistency #37 being solved
  • visualize thread pool usage
  • finalize documentation (descriptions etc.) - the base is already available at https://evitadb.io/documentation/operate/observe?lang=evitaql#metrics
  • export the Grafana dashboard to JSON, document it, and make it available for download - we also know that the current filters will not meet the requirements of K8S pod selection, so we need to investigate this as well


4 participants