This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

grafana MAU metrics not working when using a background worker #14622

Closed
janonym1 opened this issue Dec 5, 2022 · 1 comment · Fixed by #14644
Labels
O-Uncommon Most users are unlikely to come across this or unexpected workflow S-Minor Blocks non-critical functionality, workarounds exist. T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues.

Comments

@janonym1

janonym1 commented Dec 5, 2022

Description

When using the Grafana template, the "MAU limits" metric does not work because, in a Synapse setup with workers, the MAU is reported by the background worker instead of the synapse/master process. This became really relevant with the deprecation of legacy metrics, and there have already been a lot of fixes for the Grafana template: a651479

Steps to reproduce

-) Install Synapse + workers with Slavi's playbook
-) Install an external Prometheus server and use the templated Prometheus config
-) Import the Grafana template and view the MAU Limits metric

A pretty typical prometheus template config (which gets generated with the playbook) looks something like:

  - job_name: 'synapse'
    metrics_path: /metrics/synapse/main-process
    scheme: https
    basic_auth:
      username: prometheus
      password_file: /path/to/passwordfile
    static_configs:
      - targets: ['matrix.domain.com:443']
        labels:
          instance: "prod"
          job: "master"
          index: "0"
[...]
  - job_name: 'matrix-synapse-worker-background-0'
    metrics_path: /metrics/synapse/worker/background-0
    scheme: https
    basic_auth:
      username: prometheus
      password_file: /path/to/passwordfile
    static_configs:
      - targets: ['matrix.domain.com:443']
        labels:
          worker_id: background-0
          job: "background"
          app: generic_worker
          instance: "prod"

Most other metrics work fine.

Homeserver

synapse homeserver with a lot of workers

Synapse Version

synapse 1.71

Installation Method

Docker (matrixdotorg/synapse)

Database

dedicated PostgreSQL 14 DB

Workers

Multiple workers

Platform

Ubuntu 20.04
12 cores, 20 GB RAM, 1 Gbit/s network, dedicated PostgreSQL DB cluster
Installed with Slavi's Ansible playbook, running on a VM

Configuration

presence disabled, cache_factor 5.0

Relevant log output

"MAU Limits

No data

"

Anything else that would be useful to know?

The affected lines in the Grafana template are L11299 and L11313.

synapse_admin_mau_current{instance="$instance", job=~"(hhs_)?synapse"} does not work in my case, because none of my job labels match that pattern. A workaround would be to remove the job matcher: synapse_admin_mau_current{instance="$instance"}. However, I then get one time series for every worker, which I think is unintended.

Specifying the job label to background works better: synapse_admin_mau_current{instance="$instance", job="background"} but I assume that breaks for most monolith synapse setups.
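
To illustrate why the template's selector returns "No data" with the scrape config above: Prometheus regex matchers are fully anchored, so job=~"(hhs_)?synapse" accepts only the values synapse and hhs_synapse. The sketch below imitates that with Python's re.fullmatch (Python's re module is not the RE2 engine Prometheus uses, but the two agree for a pattern this simple). The helper name matches_job is made up for illustration; the job values are the ones from the config in this issue.

```python
import re

# Job selector used by the Grafana template for the MAU panel.
TEMPLATE_JOB_PATTERN = r"(hhs_)?synapse"


def matches_job(job: str, pattern: str = TEMPLATE_JOB_PATTERN) -> bool:
    """Imitate a Prometheus job=~"..." matcher, which is fully anchored."""
    return re.fullmatch(pattern, job) is not None


# Job labels from the scrape config in this issue ("master", "background"),
# followed by the two values the template's pattern actually accepts.
for job in ["master", "background", "synapse", "hhs_synapse"]:
    print(f"{job!r}: {matches_job(job)}")
```

Neither "master" nor "background" matches, so the panel has no series to draw. By the same reasoning, a hard-coded job="background" would select nothing on a setup whose only job label is "synapse".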

@reivilibre suggested using something like max over (job) { ... } but I am not sure how to get around this best.

One possible workaround that works nicely for me is just using max over the expression without using a job variable: max(synapse_admin_mau_current{instance="$instance"}) and max(synapse_admin_mau_max{instance="$instance"}).

However, I don't know whether that also works for monolithic Synapse instances or whether there are any other drawbacks, since I am not well versed in writing Grafana templates.
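
As a rough sketch of what the max() workaround does: the selector filters only on instance, ignores the job label, and max() collapses all remaining series into a single value. The function name max_for_instance and all sample values below are hypothetical and only meant to show the shape of the aggregation, not real Synapse output.

```python
from typing import Optional

# One series: (labels, latest sample value).
Series = tuple[dict[str, str], float]


def max_for_instance(series: list[Series], instance: str) -> Optional[float]:
    """Imitate max(metric{instance="..."}): filter on instance, ignore job,
    and collapse all matching series into one value."""
    values = [value for labels, value in series
              if labels.get("instance") == instance]
    # No matching series corresponds to "No data" in the panel.
    return max(values) if values else None


# Hypothetical samples of synapse_admin_mau_current; the numbers are invented.
worker_setup = [
    ({"instance": "prod", "job": "master"}, 0.0),
    ({"instance": "prod", "job": "background"}, 1234.0),
]
monolith_setup = [
    ({"instance": "prod", "job": "synapse"}, 1234.0),
]

print(max_for_instance(worker_setup, "prod"))
print(max_for_instance(monolith_setup, "prod"))
```

Under these assumptions both setups yield one value per instance regardless of which job label carries the metric. On a monolith there is only one series, so max() simply returns it.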

@reivilibre
Contributor

Thanks! I've put up #14644 that uses the max() workaround, which seems to work well on this end and I don't see why it shouldn't work for monoliths.

@reivilibre reivilibre added S-Minor Blocks non-critical functionality, workarounds exist. T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues. O-Uncommon Most users are unlikely to come across this or unexpected workflow labels Dec 8, 2022