TimescaleDB suitable as long term prometheus data storage? #810
The stability issues do seem to be resolved and we currently have three months of data, with the database size shown in the attached graph. It levels off in early November, but that aligns with a reduction in samples being captured and in the size of the main database. That was around the time we switched from Cisco to Juniper in AMS, so maybe we're just collecting less data from the Junipers? Hardware-wise stormfly-03 is coping, but it's reasonably well loaded by it, which is why I suggested moving it to one of the new machines.
The recommended way to do long-term Prometheus storage is to use an object store with either Thanos or Mimir. I use Thanos for my own infra deployments, as it's more "edge-focused", whereas Mimir has a more "centralized" architecture, but they both accomplish the same end goal in different ways.
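For reference, a minimal sketch of the Thanos side of that recommendation, assuming an S3-compatible bucket; the bucket name, endpoint and credentials are placeholders, and Mimir is configured differently:

```sh
# Sketch only: a Thanos sidecar uploading Prometheus TSDB blocks to an
# S3-compatible object store. Bucket, endpoint and keys are placeholders.
cat > /etc/thanos/objstore.yml <<'EOF'
type: S3
config:
  bucket: "prometheus-long-term"
  endpoint: "s3.example.com"
  access_key: "REPLACE_ME"
  secret_key: "REPLACE_ME"
EOF

thanos sidecar \
  --prometheus.url=http://localhost:9090 \
  --tsdb.path=/var/lib/prometheus \
  --objstore.config-file=/etc/thanos/objstore.yml
```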
I can't even comprehend what would possess somebody to use an object store for that... I think there are plenty of alternative solutions I'd look at before going down that route if we decided to dump Promscale. Even ignoring the question of suitability (and I can't really imagine how using an object store for this even works), the cost (or complication, if running our own store) would likely rule it out.
Because an object store is a perfect place to store arbitrary application data (and Prometheus data blocks are just that: arbitrary application data) whilst decoupling the storage layer from the application layer, which allows the latter to be easily scaled horizontally as required.
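To make "Prometheus data blocks" concrete, this is roughly what one block looks like once uploaded to a bucket; the bucket name and (truncated) ULID are illustrative:

```
prometheus-long-term/        # hypothetical bucket
└── 01HXQ2.../               # one TSDB block, named by ULID
    ├── chunks/000001        # compressed samples, grouped by series
    ├── index                # inverted index over series labels
    └── meta.json            # time range, compaction level, external labels
```

Prometheus writes these blocks locally in roughly two-hour windows (later compacted into larger ones), and the Thanos sidecar ships each completed block to the bucket as-is, which is how continuous metrics end up as discrete, immutable objects.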
Sure, you could go and build an alternative, but both projects I mentioned in my previous comment were founded by long-time Prometheus maintainers:
And both are known to scale to over a billion metrics:
It's all explained in their respective design documents:
The design document also has an [estimated] costs section: https://thanos.io/tip/thanos/design.md/#cost
I'm quite happily using https://garagehq.deuxfleurs.fr/ on my own hardware for my own Thanos deployments; it fully supports geo-replication and can be deployed on any hardware in a matter of hours.
My point was that an object store is good for storing discrete items, but metrics are continuous, not discrete. If you shard the data by time then a query has to retrieve lots of objects, which is both slow and expensive, but if you shard it by metric then you have to update objects as the data changes over time. I imagine in reality they shard by both metric and time, but that still leaves all the issues of having to retrieve lots of objects to resolve queries. I would probably have put InfluxDB at the top of my list of alternative solutions, but they seem to have taken a step away from Prometheus integration in 2.x, so maybe VictoriaMetrics would be top of my list now.
So you design a component that pre-caches and presents which data it has access to. Typically, people sat in front of Grafana only care about the latest data, whilst they would only query older data (and thus require the store API to fetch it) on an as-needed basis. You can also use a querier which supports fetching from both an object store and from Prometheus instances at the same time for more real-time metrics.
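As a sketch of that "query both tiers" idea in Thanos terms, a single Thanos Query instance can fan out to the sidecar (recent data still on the Prometheus host) and to the store gateway (older blocks in the bucket); the hostnames and ports below are hypothetical:

```sh
# Sketch only: one querier fronting both recent and historical data.
thanos query \
  --http-address=0.0.0.0:10902 \
  --store=prometheus-sidecar.example.net:10901 \
  --store=store-gateway.example.net:10901
```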
Going by https://docs.victoriametrics.com/Cluster-VictoriaMetrics.html#architecture-overview, this was designed by a team who hadn't heard of an object store and instead built their own at the "vmstorage" layer. There's also a [large] object storage thread on their GitHub issue tracker. Another thing to consider before adopting a solution is whether a project backed by a single entity is a good idea in terms of longevity, or whether to use one which is under a foundation such as the CNCF. That's not usually something I care about, but plenty of companies I've worked for factor it into their decision-making.
Also just found: https://wikitech.wikimedia.org/wiki/Thanos (however, speaking as someone with previous experience of Swift, you'd definitely want to use garage instead, as it's orders of magnitude easier to manage).
FYI, Promscale, the Prometheus to TimescaleDB adapter, has been deprecated and abandoned.
Thanks for the heads-up; confirmation is at timescale/promscale#1836. Guess we will have to start evaluating alternatives then :-(
Based on the Prometheus dashboard I found, it seems like you've got about 1.5M metrics and 84k samples/sec, and you're using about 185GiB of storage for 31d of Prometheus retention. That's a little higher than I would expect, but not wildly out of whack. A setup of this size doesn't seem to need an external TSDB: you only have local disk, and you only have one Prometheus instance. Why not just change the retention parameters on Prometheus and keep storing the data internally? A couple of TiB of storage is no big deal for Prometheus to handle.
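A minimal sketch of that "just raise the local retention" option, using Prometheus' standard retention flags; the path and values here are illustrative, not the ones actually deployed:

```sh
# Illustrative values only; real retention should be sized to the disk.
prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.path=/var/lib/prometheus/metrics2 \
  --storage.tsdb.retention.time=240d \
  --storage.tsdb.retention.size=1500GB
```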
Actually, I take that back; I must have done the math wrong. The reported storage use works out at about 1 byte per sample (84k samples/sec over 31 days is roughly 225 billion samples against ~199GB of storage), which is about as optimal as you can get with Prometheus. With the 1.5TiB of storage you're currently using with Promscale, you could have around 8 months of Prometheus retention.
The intention was always to move it to a better machine in due course, so more storage is not necessarily an issue.
Seems like the easiest solution. Prometheus only really needs external long-term retention for much larger multi-instance deployments, where you have multiple clusters and need to aggregate data across them. For example, my $dayjob has tens to hundreds of TiB of data in Thanos S3 buckets backing hundreds of Prometheus instances in various clusters.
Retention time was set in openstreetmap/chef@2ca02ae |
Evaluate TimescaleDB as long-term Prometheus data storage.
Storage requirements?
Retention period?
Hardware requirements?
Autovacuum issue resolved?