Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add logfire.instrument_system_metrics() #373

Merged
merged 55 commits into from
Aug 22, 2024

Conversation

alexmojaki
Copy link
Contributor

@alexmojaki alexmojaki commented Aug 8, 2024

Breaking change!

With this, you must call logfire.instrument_system_metrics() to enable basic system metrics, previously system metrics were always sent if you had the appropriate package installed.


Background:

In preparation for starting to charge we looked our usage from our larger beta users and realised a lot of companies were sending us a lot of system metrics which we are guessing they didn't care about, but would be charged a lot for.

This was because the standard opentelemetry-instrumentation-system-metrics logic was collecting around 400 metrics per minute per host — stuff like this:

metric_name count
system.disk.operations 74
system.disk.time 74
system.disk.io 74
system.cpu.time 64
system.cpu.utilization 64
system.memory.utilization 10
system.memory.usage 10
system.network.io 6
system.network.dropped_packets 6
system.network.errors 6

(This is from looking at a machine running on render)

Most people don't need 219 datapoints on disk usage every minute!

Solution

The solution (described in detail in the docs changes in this PR) are:

  1. Make system metrics opt-in, logfire will now only send them if you call logfire.instrument_system_metrics()
  2. Send a curated set of the most useful system metrics, reducing the metric volume by around 98% in common cases

Feedback on whether we should adjust the set of metrics sent in "basic" mode very welcome.

Copy link

cloudflare-workers-and-pages bot commented Aug 8, 2024

Deploying logfire-docs with  Cloudflare Pages  Cloudflare Pages

Latest commit: 0ba76a7
Status: ✅  Deploy successful!
Preview URL: https://5e05a01f.logfire-docs.pages.dev
Branch Preview URL: https://alex-instrument-system-metri.logfire-docs.pages.dev

View logs

Copy link

codecov bot commented Aug 8, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 100.00%. Comparing base (5d9a16f) to head (0ba76a7).
Report is 1 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff            @@
##              main      #373   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files          124       126    +2     
  Lines         9193      9256   +63     
  Branches      1198      1206    +8     
=========================================
+ Hits          9193      9256   +63     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.


## Configuration

By default, `instrument_system_metrics` collects only the metrics it needs to display the 'Basic System Metrics' dashboard. You can choose exactly which metrics to collect and how much data to collect about each metric. The default is equivalent to this:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current 'Basic System Metrics' dashboard template won't work with this new set of reduced metrics. I will create a PR with a new template. That PR will need to be ready to merge before these changes can be merged, and it should be deployed approximately when this is released.

When users upgrade the SDK, system metrics will stop being collected, if they were being collected at all. Since their collection was the implicit default, we can't warn them about this in code, and must rely on the changelog and announcements in slack. If the only thing that users do is add logfire.instrument_system_metrics() then existing basic system metrics dashboards will still not work, so we need to make it clear that they also need to switch to the new template.

The old template is still useful though, because it works with OTEL conventions, so system metrics collected by means other than the logfire SDK can be used with the old template. So we should keep and document both. We need to figure out how to name them to clarify the difference, e.g. 'Basic System Metrics (Logfire)' and 'Basic System Metrics (OpenTelemetry)'.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alexmojaki alexmojaki marked this pull request as ready for review August 19, 2024 11:37
@alexmojaki alexmojaki requested review from Kludex and dmontagu August 19, 2024 11:37
Copy link
Member

@samuelcolvin samuelcolvin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry to be so slow, otherwise looks great.

docs/integrations/system_metrics.md Outdated Show resolved Hide resolved
logfire/_internal/config.py Outdated Show resolved Hide resolved
docs/integrations/system_metrics.md Outdated Show resolved Hide resolved
@alexmojaki alexmojaki merged commit ffe0c25 into main Aug 22, 2024
13 checks passed
@alexmojaki alexmojaki deleted the alex/instrument-system-metrics branch August 22, 2024 11:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants