
TimescaleDB suitable as long term prometheus data storage? #810

Closed

Firefishy opened this issue Dec 15, 2022 · 14 comments

Comments

@Firefishy
Member

Evaluate TimescaleDB as long-term Prometheus data storage.

- Storage requirements?
- Retention period?
- Hardware requirements?
- Autovacuum issue resolved?

@tomhughes
Member

The stability issues do seem to be resolved and we currently have three months of data, with the database size looking like:

[chart: database size over the last three months]

It levels off in early November, but that aligns with a reduction in samples being captured and in the size of the main database. That was around the time we switched from Cisco to Juniper in AMS, so maybe we're just collecting less data from the Junipers?

Hardware-wise stormfly-03 is coping, but it's reasonably well loaded, which is why I suggested moving it to one of the new machines.

@jpds

jpds commented Apr 2, 2023

The recommended way to do long-term Prometheus storage is to use an object store with either Thanos or Mimir.

I use Thanos for my own infra deployments, as it's more "edge-focused" whereas Mimir has a more "centralized" architecture, but they both accomplish the end goal in different ways.

@tomhughes
Member

I can't even comprehend what would possess somebody to use an object store for that... I think there are plenty of alternate solutions I'd look at before going down that route if we decided to dump Promscale.

Even ignoring the question of suitability (and I can't really imagine how it even works to use an object store for this), the cost (or complication if running our own store) would likely rule it out.

@jpds

jpds commented Apr 2, 2023

> I can't even comprehend what would possess somebody to use an object store for that

Because an object store is a perfect place to store arbitrary application data (and Prometheus data blocks are just that: arbitrary application data), whilst decoupling the storage layer from the application layer, which allows the latter to be easily horizontally scaled as required.
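
To make that concrete, here's a rough sketch of the sidecar model, assuming a boto3 S3 client and a made-up bucket name (this is not Thanos's actual code): Prometheus writes immutable block directories on local disk, and a sidecar ships each one to the bucket as plain write-once objects.

```python
# Illustrative only: a sidecar-style uploader for immutable TSDB blocks.
# Bucket name and paths are made up; not any project's real code.
import os

import boto3

s3 = boto3.client("s3")

def upload_block(block_dir: str, bucket: str) -> None:
    """Upload every file in one immutable TSDB block directory."""
    block_id = os.path.basename(block_dir)
    for root, _dirs, files in os.walk(block_dir):
        for name in files:
            path = os.path.join(root, name)
            key = f"{block_id}/{os.path.relpath(path, block_dir)}"
            s3.upload_file(path, bucket, key)  # blocks never change, so no in-place updates

# upload_block("/var/lib/prometheus/<block-ulid>", "metrics-long-term")
```

Because the blocks are write-once, the uploader never has to rewrite an existing object, which is exactly the access pattern object stores are good at.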

> I think there are plenty of alternate solutions

Sure, you could go and build an alternative, but both projects I mentioned in my previous comment were founded by long-time Prometheus maintainers, and both are known to scale to over a billion metrics.

> Even ignoring the question of suitability (and I can't really imagine how it even works to use an object store for this)

It's all explained in their respective design documents.

> the cost

The design document also has an estimated costs section: https://thanos.io/tip/thanos/design.md/#cost

> (or complication if running our own store) would likely rule it out.

I'm quite happily using https://garagehq.deuxfleurs.fr/ on my own hardware for my own Thanos deployments; it fully supports geo-replication and can be deployed on any hardware in a matter of hours.

@tomhughes
Member

My point was that an object store is good for storing discrete items, but metrics are continuous, not discrete.

If you shard the data by time, then a query has to retrieve lots of objects, which is both slow and expensive; but if you shard it by metric, then you have to update objects as the data changes over time.

I imagine in reality they are sharding by both metric and time, but that still leaves all the issues of having to retrieve lots of objects to resolve queries.
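
For what it's worth, that is roughly what the object-store designs do: data is split into immutable blocks by time range, each carrying its own index, so ingestion only ever appends new objects and a query only has to touch the blocks that overlap its window. A minimal sketch, with made-up structures rather than any project's real API:

```python
# Illustrative sketch of time-sharded, immutable blocks (made-up
# structures, not any project's real API). Queries skip every block
# whose time range doesn't overlap the query window.
from dataclasses import dataclass, field

@dataclass
class Block:
    min_t: int  # inclusive start of the block's time range
    max_t: int  # exclusive end of the block's time range
    series: dict = field(default_factory=dict)  # metric -> [(ts, value), ...]

def query(blocks, metric, start, end):
    out = []
    for b in blocks:
        if b.max_t <= start or b.min_t >= end:
            continue  # block entirely outside the window: never fetched
        for ts, v in b.series.get(metric, []):
            if start <= ts < end:
                out.append((ts, v))
    return sorted(out)

# Ingestion only ever appends new immutable blocks; old objects are never rewritten.
blocks = [
    Block(0, 7_200, {"node_load1": [(60, 0.4), (3_600, 0.9)]}),
    Block(7_200, 14_400, {"node_load1": [(7_260, 1.2)]}),
]
print(query(blocks, "node_load1", 3_000, 8_000))  # [(3600, 0.9), (7260, 1.2)]
```

Compaction later merges small blocks into larger ones, which keeps the number of objects a long-range query has to fetch manageable.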

I would probably have put InfluxDB at the top of my list of alternate solutions, but they seem to have taken a step away from Prometheus integration in 2.x, so maybe VictoriaMetrics would be top of my list now.

@jpds

jpds commented Apr 2, 2023

> If you shard the data by time, then a query has to retrieve lots of objects, which is both slow and expensive; but if you shard it by metric, then you have to update objects as the data changes over time.
>
> I imagine in reality they are sharding by both metric and time, but that still leaves all the issues of having to retrieve lots of objects to resolve queries.

So you design a component that pre-caches and presents which data it has access to. Typically people sat in front of Grafana only care about the latest data, and would only query older data (and thus require the store API to fetch it) on an as-needed basis.

You can also use a querier which supports fetching from both an object store and from Prometheus instances at the same time, for more real-time metrics.
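
A toy sketch of that fan-out, assuming each store advertises the time range it holds (as the Thanos Store API does); the names and data here are made up:

```python
# Toy querier fan-out: contact only the stores whose advertised time
# range overlaps the query window, then merge their results.
class Store:
    def __init__(self, name, min_t, max_t, data):
        self.name, self.min_t, self.max_t = name, min_t, max_t
        self.data = data  # metric name -> [(timestamp, value), ...]

    def select(self, metric, start, end):
        return [(t, v) for t, v in self.data.get(metric, []) if start <= t < end]

def fan_out(stores, metric, start, end):
    hits = [s for s in stores if s.max_t > start and s.min_t < end]
    merged = sorted(p for s in hits for p in s.select(metric, start, end))
    return merged, [s.name for s in hits]

bucket = Store("object-store", 0, 90_000, {"up": [(10, 1), (50_000, 1)]})
prom = Store("prometheus", 80_000, 100_000, {"up": [(95_000, 0)]})
print(fan_out([bucket, prom], "up", 40_000, 100_000))
# ([(50000, 1), (95000, 0)], ['object-store', 'prometheus'])
```

A dashboard query over the last hour would only ever hit the live Prometheus store; the bucket is consulted just for the occasional long-range query.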

> so maybe VictoriaMetrics would be top of my list now.

Going by https://docs.victoriametrics.com/Cluster-VictoriaMetrics.html#architecture-overview, this was designed by a team who hadn't heard of an object store and instead built their own at the "vmstorage" layer.

There's also a large object-storage thread on their GitHub issue tracker.

Another thing to consider before adopting a solution is whether a project backed by a single entity is a good idea in terms of long-term longevity, or whether to use one which is under a foundation such as the CNCF. Not usually something I care about, but plenty of companies I've worked for factor it into their decision making.

@jpds

jpds commented Apr 2, 2023

Also just found: https://wikitech.wikimedia.org/wiki/Thanos (though as someone with previous experience with Swift, I'd say you'd definitely want to use Garage instead, as it's orders of magnitude easier to manage).

@SuperQ

SuperQ commented Apr 4, 2023

FYI, Promscale, the Prometheus to TimescaleDB adapter, has been deprecated and abandoned.

@tomhughes
Member

Thanks for the heads up - confirmation is at timescale/promscale#1836.

Guess we will have to start evaluating alternatives then :-(

@SuperQ

SuperQ commented Apr 4, 2023

Based on the Prometheus dashboard I found, it seems like you've got about 1.5M metrics and 84k samples/sec. You seem to be using about 185GiB of storage for 31d of Prometheus retention. This seems a little higher than I would expect, but not wildly out of whack.

It doesn't seem like you would need an external TSDB for this size of setup. You only have local disk, and you only have one Prometheus instance. Why not just change the retention parameters on Prometheus and keep storing it internally? A couple of TiB of storage is no big deal for Prometheus to handle.

@SuperQ

SuperQ commented Apr 4, 2023

Actually, I take that back; I must have done the math wrong. The reported storage use is about 1 byte per sample, which is about as optimal as you can get with Prometheus. With the 1.5TiB of storage you're using with Promscale, you could have 8 months of Prometheus retention.
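
The arithmetic checks out (figures taken from the comments above; Prometheus's local retention itself is set with the --storage.tsdb.retention.time flag):

```python
# Back-of-envelope check of the figures quoted above.
samples_per_sec = 84_000
retention_days = 31
local_gib = 185        # Prometheus local storage for 31d of retention
promscale_tib = 1.5    # storage currently used by Promscale

samples = samples_per_sec * retention_days * 86_400  # ~2.25e11 samples
bytes_per_sample = local_gib * 2**30 / samples       # ~0.9 bytes/sample
months = promscale_tib * 1024 / local_gib            # ~8.3 31-day months
print(f"{bytes_per_sample:.2f} B/sample, ~{months:.1f} months in 1.5 TiB")
```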

@tomhughes
Member

The intention was always to move it to a better machine in due course, so more storage is not necessarily an issue.

@SuperQ

SuperQ commented Apr 4, 2023

Seems like the easiest solution. Prometheus only really needs external long-term storage for much larger multi-instance deployments, where you have multiple clusters whose data needs aggregating. For example, my $dayjob has tens to hundreds of TiB of data in Thanos S3 buckets backing hundreds of Prometheus instances in various clusters.

@pnorman
Collaborator

pnorman commented May 19, 2023

Retention time was set in openstreetmap/chef@2ca02ae
