@kolesnikovae mentioned that there's a flaw in our storage engine:
Sample rate is stored per segment (series) and gets overridden with the value ingested last. This could be a problem if there are two profiles with distinct time units (e.g. seconds and milliseconds).
@vasi-stripe proposed storing CPU samples in nanoseconds instead:
Silly idea: What if we always chose nanosecond sampling rate? Would that be worse in any way?
I think this is a great idea. Before we commit to it we need to figure out two things:
space requirements
how to handle old data already stored on disk
Space requirements
I imagine this new way of storing data will be less space-efficient because we use varints when we serialize integers to disk, so storing 1,000,000,000 takes more bytes than storing 100 (5 bytes instead of 1). Without experimentation it's hard to say how much less efficient it will be. We do compress data when it lands on disk, so in theory this shouldn't be too much of a problem, but we should still take some measurements before we commit to this.
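As a quick sanity check on those byte counts, here's a minimal sketch using Go's encoding/binary unsigned varints; our serializer's exact varint flavor may differ, but the sizes match any standard LEB128-style encoding:

```go
package main

import (
	"encoding/binary"
	"fmt"
)

func main() {
	buf := make([]byte, binary.MaxVarintLen64)
	for _, v := range []uint64{100, 1_000_000, 1_000_000_000} {
		// PutUvarint returns the number of bytes written for this value.
		n := binary.PutUvarint(buf, v)
		fmt.Printf("%13d -> %d byte(s)\n", v, n)
	}
}
```

This prints 1, 3, and 5 bytes respectively, so a value that fits in 1 byte today would take 5 bytes in nanoseconds, before compression.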
Also, we might want to consider other alternative units, e.g.:
microseconds
milliseconds
Milliseconds are only one digit less space-efficient than storing sample counts at 100Hz, and I imagine they are good enough for most of our integrations. A rough comparison of all three units follows below.
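To put rough numbers on it (assuming the same unsigned varint encoding as above): a single 10ms sample at 100Hz is stored today as the count 1 (1 byte); the same sample is 10 in milliseconds (1 byte), 10,000 in microseconds (2 bytes), and 10,000,000 in nanoseconds (4 bytes), before compression.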
How to handle old data already stored on disk
I think the easiest approach would be to introduce a new version for trees. New data will be written with nanosecond precision, and old data read in the v1 format will be normalized to nanoseconds during deserialization.
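A minimal sketch of what that read-path normalization could look like, assuming v1 trees carry raw sample counts and the segment's sample rate is available at deserialization time; the names (normalizeToNanos, nanosPerSample) and the 100Hz fallback are hypothetical, not the actual storage-engine API:

```go
package storage

// nanosPerSample converts a segment's sample rate (Hz) into the duration of a
// single sample in nanoseconds, e.g. 100Hz -> 10,000,000ns.
func nanosPerSample(sampleRateHz uint32) uint64 {
	if sampleRateHz == 0 {
		sampleRateHz = 100 // assumed fallback for legacy segments with no rate recorded
	}
	return 1_000_000_000 / uint64(sampleRateHz)
}

// normalizeToNanos rewrites v1 per-node sample counts as nanoseconds while a
// v1 tree is being deserialized. Trees written in the new version would
// already be in nanoseconds and skip this step.
func normalizeToNanos(samples []uint64, sampleRateHz uint32) {
	factor := nanosPerSample(sampleRateHz)
	for i := range samples {
		samples[i] *= factor
	}
}
```

This keeps the conversion a one-time cost paid only when old v1 data is read; everything written under the new tree version is in nanoseconds from the start.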
For context, this came up during the discussion in #1589 and vasi-stripe#1.