Proposal: store CPU samples in nanoseconds #1602

Closed

petethepig opened this issue Oct 5, 2022 · 0 comments
Labels
backend Mostly go code

Comments


petethepig commented Oct 5, 2022

This came up during the discussion in #1589 and vasi-stripe#1.

@kolesnikovae mentioned that there's a flaw in our storage engine:

The sample rate is stored per segment (series) and gets overridden by the value ingested last. This could be a problem if there are two profiles with distinct time units (e.g., seconds and milliseconds).

@vasi-stripe proposed storing CPU samples in nanoseconds instead:

Silly idea: What if we always chose nanosecond sampling rate? Would that be worse in any way?


I think this is a great idea. Before we commit to it, we need to figure out two things:

  • space requirements
  • how to handle old data already stored on disk

Space requirements

I imagine this new way of storing data will be less efficient, because we use varints when we serialize integers to disk, so storing 1,000,000,000 takes more space than storing 100 (5 bytes instead of 1). Without experimentation it's hard to say how much less efficient it would be in practice. We do compress data once it lands on disk, so in theory this shouldn't be too much of a problem, but we should still take some measurements before we commit to this.
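To get a concrete sense of the per-value cost, here's a quick check using Go's encoding/binary varint encoding; the exact on-disk numbers depend on our serializer, but the variable-length scheme is the same:

```go
package main

import (
	"encoding/binary"
	"fmt"
)

func main() {
	buf := make([]byte, binary.MaxVarintLen64)
	// 100 samples at 100 Hz vs. the same duration (1 s) in nanoseconds.
	for _, v := range []uint64{100, 1_000_000_000} {
		n := binary.PutUvarint(buf, v)
		fmt.Printf("%d -> %d varint byte(s)\n", v, n)
	}
	// Output:
	// 100 -> 1 varint byte(s)
	// 1000000000 -> 5 varint byte(s)
}
```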

Also, we might want to consider alternative units, e.g.:

  • microseconds
  • milliseconds

Milliseconds are only one decimal digit less efficient than storing sample counts at 100 Hz, and I imagine they are precise enough for most of our integrations.
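For example, a single 10 ms sample (one tick at 100 Hz) would be stored as 1 (sample count), 10 (ms), 10,000 (µs), or 10,000,000 (ns) — roughly 1, 1, 2, and 4 varint bytes respectively under the same encoding as above.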

How to handle old data already stored on disk

I think the easiest approach would be to introduce a new format version for trees: new data will be written with nanosecond precision, and old data read in the v1 format will be normalized to nanoseconds during deserialization.
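As a rough sketch of what the read path could look like (the version constants, function names, and deserializer shape below are hypothetical, not our actual storage code):

```go
package storage

import "time"

// Tree format versions; illustrative names only.
const (
	treeFormatV1 = 1 // values are sample counts at a per-segment sample rate
	treeFormatV2 = 2 // values are already in nanoseconds
)

// normalizeToNanos converts a v1 sample count to nanoseconds using the
// segment's sample rate. Assumes sampleRateHz > 0; may overflow for
// extremely large sample counts, which a real implementation would guard.
func normalizeToNanos(samples uint64, sampleRateHz uint32) uint64 {
	return samples * uint64(time.Second) / uint64(sampleRateHz)
}

// readValue returns a node value in nanoseconds regardless of the
// on-disk format version.
func readValue(version int, raw uint64, sampleRateHz uint32) uint64 {
	if version == treeFormatV1 {
		return normalizeToNanos(raw, sampleRateHz)
	}
	return raw // v2 data is already in nanoseconds
}
```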
