mainnet 0 storage growing 10x faster than 6GB/month norm #4106
Comments
1.3G /root/.agoric/data/state.db Most of the data is occupied by the application db, and we don't even have the agoric sdk deployed? |
Just to note, the "application" in this context means the "Tendermint Application", which is "Cosmos SDK", nothing to do with the yet-to-be-enabled agoric-sdk layer. So this should be something that folks familiar with Cosmos can help with. |
It's less clear to me that this is an unusual amount of disk space. On Nov 28 I asked in
Another data point, from Nov 22: figment.io reported:
I asked if they would try to diagnose the problem, but I don't recall seeing any reply. p.s. this seems relevant to #2292 |
Hi folks, glancing at disk usage in
I believe |
It's common to Cosmos-SDK chains, as noted in #4106 (comment). We're looking for an experienced cosmos validator to look into this more closely and let us know whether 61G is normal and, if not, suggest a diagnosis. |
@dtribble We need to find a validator who can reliably give us statistics. |
This seems to not actually be an issue in practice. @Tomas-Eminger can you provide concrete numbers from your validator? Should we close this issue? |
It is growing pretty fast IMO....
|
Hi @Tomas-Eminger ... any thoughts on why? It would be great if you could help us diagnose this. |
I'm not sure what's going on here (and it might be a genuine problem), but it's outside the scope of SwingSet, whose database is in a different set of files entirely. |
I'd start with @michaelfig. If he's not the one it's likely he'll have a better idea of who should be. |
@michaelfig please put an estimate on this, for at least the work to find the problem, and give the appropriate area label. There are some numbers from one of the validators above. |
These are our current figures:
And this is our overall disk space growth over the last couple of days, in gigabytes:
*Edit: |
I’d like for someone to enable the I'll be running a follower soon, so will be able to see this myself, but other folks' data would be helpful too. |
du -h data, from my box: 127GB so far. Also take a look at this. It could be that we need to set a custom pruning strategy, as the default seems to have an issue? |
Here is a 6hr kv store trace: stategrowth (warning: 3 GB) |
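To see which database under the data directory accounts for the growth, a per-entry breakdown is more useful than a single total. The following is a minimal sketch, not a tool used in this thread: `dir_sizes` is a hypothetical helper, and it sums apparent file sizes, so its numbers can differ slightly from what `du` reports for allocated disk blocks.

```python
import os


def dir_sizes(data_dir):
    """Return {entry_name: total_bytes} for each top-level entry in data_dir.

    Hypothetical helper for illustration. Sizes are apparent file sizes
    (st_size), not allocated blocks, and symlinks are not followed.
    """
    sizes = {}
    for entry in os.scandir(data_dir):
        if entry.is_dir(follow_symlinks=False):
            total = 0
            for root, _dirs, files in os.walk(entry.path):
                for name in files:
                    path = os.path.join(root, name)
                    if not os.path.islink(path):
                        total += os.path.getsize(path)
            sizes[entry.name] = total
        elif entry.is_file(follow_symlinks=False):
            sizes[entry.name] = entry.stat(follow_symlinks=False).st_size
    return sizes


def report(data_dir):
    """Print entries from largest to smallest, in GB."""
    for name, size in sorted(dir_sizes(data_dir).items(), key=lambda kv: -kv[1]):
        print(f"{size / 1e9:8.2f} GB  {name}")
```

Calling `report(...)` on the node's data directory once a day and comparing the output would show whether the growth is concentrated in one database.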
From @zmanian's Here is a typical diff between two block's writes: https://gist.github.com/michaelfig/2badd5fd418798acca2ce883a6f48a6b So it really looks like the |
How are things going here? |
tldr: we need a hero to diagnose this. Agoric has budgeted a day of @michaelfig's time, but other things are higher priorities, so it's not likely to happen soon.
In discord I saw a new validator ask how much storage would be used and the answer (148GB on March 3) shocked them. I thought others followed up by sharing their pruning settings, but I can't find them.
informal snapshot sharing
Catching up, but slowly
One validator reported their sync times 10 minutes apart; I projected 56hrs to catch up on 764hrs of chain time.
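A projection like this can be made from two sync-status readings taken a known interval apart. The sketch below is one plausible way to do the arithmetic, not necessarily the method used here: it assumes the chain keeps producing blocks at real-time pace while the node syncs, so the node only closes the gap at its replay rate minus one. The function name and the example numbers are hypothetical.

```python
def project_catch_up_hours(lag_hours, chain_hours_advanced, wall_hours_elapsed):
    """Estimate wall-clock hours until a syncing node reaches the chain tip.

    lag_hours: how far behind the tip the node currently is (chain time).
    chain_hours_advanced: chain time the node replayed between two readings.
    wall_hours_elapsed: wall-clock time between those two readings.

    Assumes the replay rate stays constant and the tip advances one hour of
    chain time per wall-clock hour.
    """
    rate = chain_hours_advanced / wall_hours_elapsed
    if rate <= 1:
        raise ValueError("node is not gaining on the chain tip")
    return lag_hours / (rate - 1)


# Hypothetical readings: the node replayed 11 hours of chain time in 1 hour
# of wall time while 100 hours behind.
print(project_catch_up_hours(100, 11, 1))
```

Under this model, 56hrs to cover 764hrs of chain time corresponds to replaying a little under 15 hours of chain time per wall-clock hour.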
|
@warner just set up a follower node. After spending a couple days catching up, he reports: |
@arirubinstein asked: how much do our nodes need to keep available? @michaelfig says for IBC, our RPC nodes need to remember the whole bonding period (21 days). |
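To turn the 21-day requirement into a number of block heights, it has to be divided by the block time. The sketch below assumes an average block time of 6 seconds, which is a placeholder for illustration and not a figure from this thread; substitute the measured value for the chain.

```python
import math

SECONDS_PER_DAY = 86_400


def blocks_in_window(window_days, block_time_seconds):
    """Number of block heights produced in window_days at a given block time."""
    return math.ceil(window_days * SECONDS_PER_DAY / block_time_seconds)


# ASSUMPTION: 6-second blocks. 21 days is the period quoted above.
print(blocks_in_window(21, 6))  # 302400
```

A node meant to serve IBC queries would then need to retain at least that many recent heights, whichever setting is used to express it.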
this could be an upstream Tendermint issue |
These settings were just shared in a message in discord:
# default: ...
# custom: allow pruning options to be manually specified through 'pruning-keep-recent', 'pruning-keep-every', 'pruning-interval'
pruning = "custom"
# These are applied if and only if the pruning strategy is custom
pruning-keep-recent = "500"
pruning-keep-every = "1000"
pruning-interval = "10" |
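To get a feel for what these three values mean for disk usage, here is a simplified model of which heights survive pruning. It is a sketch under assumptions, not the Cosmos SDK implementation: it treats `pruning-keep-recent` as "keep the latest N heights", `pruning-keep-every` as "also keep every height divisible by K", and ignores `pruning-interval`, which in this model only affects how often deletion runs, not what is eventually kept. Exact boundary behavior in the SDK may differ.

```python
def retained_heights(latest, keep_recent, keep_every):
    """Heights assumed to remain on disk once pruning has caught up.

    Simplified model for illustration only; not the Cosmos SDK code.
    """
    # The most recent keep_recent heights are always kept.
    kept = set(range(max(1, latest - keep_recent + 1), latest + 1))
    # Periodic snapshots: every keep_every-th height is kept as well.
    if keep_every > 0:
        kept.update(range(keep_every, latest + 1, keep_every))
    return sorted(kept)


# With the settings quoted above, at a hypothetical height of 2600:
heights = retained_heights(2600, keep_recent=500, keep_every=1000)
print(len(heights), heights[:3])  # 502 [1000, 2000, 2101]
```

In this model the retained set grows by one height per `keep_every` blocks instead of one per block, which is why custom pruning would be expected to slow storage growth.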
@arirubinstein and I talked this over. The 33GB in 15 days likely includes the fast sync period, when storage is consumed considerably faster than steady-state. Our experience does support the estimate of 6GB/month in the runbook. Our default-pruning node is still syncing, but with the following pruning settings:
... we see 3GB/month (14.35 - 14.25 = 0.1 GB / day): For reference, from another Cosmos-SDK chain: Desmos docs on pruning |
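The arithmetic behind these rates can be written out as follows. This is a small sketch with a hypothetical helper name, using a 30-day month; the input figures are the ones quoted in this thread.

```python
def gb_per_month(start_gb, end_gb, days, days_per_month=30):
    """Extrapolate a monthly growth rate from two disk-usage readings."""
    return (end_gb - start_gb) / days * days_per_month


# Pruned node: 14.25 GB to 14.35 GB over one day, roughly 3 GB/month.
pruned = gb_per_month(14.25, 14.35, days=1)

# Reported default-pruning growth: 33 GB in 15 days, roughly 66 GB/month.
reported = gb_per_month(0, 33, days=15)

# Ratio against the 6 GB/month runbook estimate, roughly 11x.
print(round(pruned, 2), round(reported, 2), round(reported / 6, 1))
```

The 33GB-in-15-days figure extrapolates to about 11 times the runbook estimate, although, as noted above, it likely includes the faster fast-sync period.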
Recently I was migrating my agoric validator node to a new server. Previously I had been using default pruning; my new node was using: When I switched the nodes, I was able to sign blocks of other validators, but when my node was responsible for the block I was unable to sign it. (This error was present before changing the node to a validator.)
tmkms told me that I was double signing:
IDK if it's relevant, but I would rather publish my experience here. When I started over with unsafe-reset-all, downloaded the polkachu snapshot, and set pruning to default, everything started working normally :) |
An "update" on my disk: the delta looks to be ≈1.3 GB a day on average. Edit: Added start and end blocks for the period.
|
I know the issue is closed, but here is an update on the size of our Agoric data folder.
|
I wonder if this has something to do with compaction/garbage collection in LevelDB, and whether LevelDB might be delaying compaction if a lot of disk space is available? |
@nitronit , @arirubinstein is the one who really knows what's going on around here. I'm leaving this to him. |
First of all, a clarification on my side: we use GoLevelDB. @zmanian - Thanks, I think you might be onto something. A garbage collection/compaction error would make sense, since the growth 'matches' #4106 (comment). I can theoretically see the benefit of delaying compaction, as it enhances block speed(?), but I can't find any mention of it in the GoLevelDB docs/repo, which makes me lean towards an error rather than intentional delaying. @gavinly also hinting that this might be an upstream Tendermint issue, #4106 (comment), makes it more likely in my ('amateur') opinion. I will keep an eye on whether chains using newer TM versions are seeing the same issues with default settings, and keep monitoring our other Agoric nodes running with custom pruning, which indeed looks to work. Sorry to bother you all. |
On the contrary! Please do continue to share your experience! |
edit: Diagnosis / Resolution
Describe the bug
I see reports of storage space growing 33GB in 15 days
https://discord.com/channels/585576150827532298/755164695849205942/911761990349905920
Why is it growing that fast?
Expected behavior
Our hardware baseline estimates ~6GB/month
Platform Environment
default pruning, transaction indexing, etc.
Additional context
mainnet 0