How large can block_retention reasonably be? #3728
-
We are deploying Tempo via the tempo-distributed Helm Chart. The default Helm value for `block_retention` is quite low, and for our use-case we are thinking we'd want to retain traces for much longer... like 2 years! Let's say we create 1000 traces each day, each with 10000 x 1KB spans. Assuming storage capacity is not a concern, is this amount of traffic sustainable with a 2-year retention? Appreciate any tips, thanks!
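For a rough sense of the volume those numbers imply (ignoring compression and block-format overhead, so this is only a ballpark):

$$
1000\ \tfrac{\text{traces}}{\text{day}} \times 10000\ \tfrac{\text{spans}}{\text{trace}} \times 1\ \text{KB} \approx 10\ \tfrac{\text{GB}}{\text{day}},
\qquad
10\ \tfrac{\text{GB}}{\text{day}} \times 730\ \text{days} \approx 7.3\ \text{TB}
$$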
Replies: 1 comment 1 reply
-
Hi, this is a good question. We have worked with a few places using longer-term storage to meet auditing requirements, and I would be interested to hear more about your use case.
Agree, I think the main reason for these low defaults is to prevent other, worse issues that are likely on a new install: filling disks, unexpected object storage costs, high latency from lack of appropriate scaling. As the operator becomes more experienced and tunes the cluster, retention can be bumped as well. 30d is more of what I consider the sweet spot. Traces are frequently used for troubleshooting live systems, and the value of a trace diminishes as it ages and departs from the current state of the system.
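For reference, bumping retention is a single compactor setting. A minimal sketch, assuming the tempo-distributed chart passes this block through to Tempo's compactor config (the exact values.yaml path varies by chart version, so check your chart before copying):

```yaml
# Tempo compactor configuration (tempo.yaml); the value shown is illustrative.
compactor:
  compaction:
    # How long to keep blocks after they are written. The Tempo default is
    # 336h (14d); 720h is roughly the 30d "sweet spot" mentioned above.
    block_retention: 720h
```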
I provide detail about scaling and tuning below, but then it occurred to me that the biggest risk is the block format. 2y retention means the block format must be stable in Tempo for 2 years. I'm not sure we can provide that guarantee at this stage, as we are making rapid changes in this area for new features. We just deleted the vParquet1 format, which was added in June 2022. That is just short of 2y :) Note: Tempo doesn't upgrade existing blocks to new formats because of the large overhead. We just let ingesters start writing the new format, and old formats naturally cycle out (blocks deleted).

But assuming the format is stable for 2y, then: I think the main thing to consider is the total amount of data. A cluster will require similar scale and resources at X TB of data regardless of whether it is spread out over 2 years or 14d. If configured ideally, there could be the same number and size of blocks in both scenarios, and Tempo will be fine. Because a 2-year retention will hold roughly 50x the data of a 14d install, in my mind the question is: can we 50x the number of blocks on this cluster?

The main pressure will be on the read path, as trace lookup must inspect all blocks. A higher number of queriers, and tuning around the frontend->querier path, will be needed (job parallelism, etc.). Caching will be required for bloom filters and page i/o (a rough config sketch is at the end of this reply). The scaling here is based on the number and size of blocks: the same amount of work is required to scan 100K blocks whether they are spread across 14d or 2y.

The next pressure will be on compactors. With short retention like 14d, under-scaled compactors can go unnoticed because blocks that weren't able to be well-compacted get deleted anyway. But at 2 years, it will be important that compactors are able to keep up. A higher number of compactors will be needed for sure. I would also increase compaction_window. The default is 1h, but at low volume this sets a high floor on the minimum number of blocks. I would try 4-24h, adjusting as needed to keep the block list around 100K, which is on the upper end of comfort in my experience (sketch below). A larger compaction_window only works for low-volume, long-retention. (High-volume clusters typically are 5m!)

The highest pressure on object storage is likely to be polling/listing blocks due to a large block list (sketch below). Ingesters probably don't need much work. They must scale for the write volume, which is not a problem for low-volume/high-retention.

One more thing to consider: the Grafana UI can be configured to provide a time range when looking up a trace. That means when you look up a trace by ID, instead of scanning the whole 2y, Tempo only looks in a subset like the last 30 days. This helps performance, but requires you to know roughly the timeframe for the trace. Therefore it is not commonly used, but it might be helpful in this scenario.
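A rough sketch of the read-path knobs I mean, with illustrative values only. Parameter names are from Tempo 2.x configuration docs as I recall them, so treat this as a starting point and verify against your version:

```yaml
# Query-path parallelism (values are illustrative, not recommendations).
query_frontend:
  trace_by_id:
    query_shards: 100        # shards per trace-by-ID query; more shards = more parallel block checks
  search:
    concurrent_jobs: 2000    # frontend -> querier job parallelism for search

querier:
  max_concurrent_queries: 20 # raise alongside the number of querier replicas

# Caching for bloom filters and parquet page/footer i/o. The cache config
# layout and role names changed across Tempo releases (storage.trace.cache
# vs. a top-level cache block), so double-check the docs for your version.
cache:
  caches:
    - roles: [bloom, parquet-footer]
      memcached:
        host: memcached      # assumed memcached service name
```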
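On the compactor side, same caveats apply; shown here with the 2y retention from the question:

```yaml
# Compactor tuning for a low-volume / long-retention cluster (illustrative).
compactor:
  compaction:
    block_retention: 17520h   # 2 years
    compaction_window: 6h     # default is 1h; try something in the 4-24h range
                              # and adjust to keep the block list around ~100K
```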
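And the blocklist polling settings I'd keep an eye on; again, names are from memory of the storage block, so verify against your version's config docs:

```yaml
# Blocklist polling against object storage. With ~100K blocks, polling and
# the tenant index matter more than usual (values are illustrative).
storage:
  trace:
    blocklist_poll: 5m               # how often the block list is refreshed; raising it
                                     # reduces list pressure at the cost of a staler view
    blocklist_poll_concurrency: 50   # parallel list/read operations during a poll
```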