State pruning in the cumulus embedded relay chain is not working #2850
CC @skunert
Hmm, the parameters look correct to me; the relay chain should prune. Otherwise, your node is running fine and is at the tip of the chain? You could run with
Yes, we will, and we will report back. One thing we can already say is that several of our nodes are affected by this, and each of them stopped pruning states at a different height of the relay chain history. Memory usage seems to be OK (below 4 GB).
Does the log line
This is indeed suspicious!
Alright, I was able to figure out the issue, and it is an issue on our side. See PR: moondance-labs/tanssi#382. The way I discovered it was by comparing synchronizing a
Now I see that state pruning is working. However, after synchronization I still see a bunch of log lines like
And I don't know the reason for this, but I see that you have a similar issue: #19. The parameters we are using are fast-sync
I can put the raw specs here if you want to try it out. In any case, this looks like a more common problem than the one we were observing, so thank you for the input!
What is the idea behind these changes? To me it looks like this just extracts the logic from here into the tanssi node without changing anything. I have doubts that this fixes the underlying cause.
Whenever we send out a finality or import notification in the node, we pin the underlying block and state that the notification references. The pin means that this block is not to be pruned. When the notification is dropped, the pin is released and the block can be pruned again. Since you are seeing this log line, the pinning cache is full, which means that some notification was not dropped.
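For illustration, here is a minimal, self-contained sketch of that lifecycle; the types and names below are hypothetical, and the real pinning logic inside the Substrate client database is more involved:

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

// All types here are hypothetical; they only model the pin/unpin lifecycle.
type Hash = [u8; 32];

/// Reference count of active pins per block hash.
#[derive(Default)]
struct PinCache {
    pins: Mutex<HashMap<Hash, u32>>,
}

impl PinCache {
    /// The pruning logic would skip any block that is still pinned.
    fn is_pinned(&self, hash: &Hash) -> bool {
        self.pins.lock().unwrap().contains_key(hash)
    }
}

/// Pin a block; the returned handle keeps it from being pruned.
fn pin(cache: &Arc<PinCache>, hash: Hash) -> PinHandle {
    *cache.pins.lock().unwrap().entry(hash).or_insert(0) += 1;
    PinHandle { cache: Arc::clone(cache), hash }
}

/// Handle held by a notification; dropping it releases the pin.
struct PinHandle {
    cache: Arc<PinCache>,
    hash: Hash,
}

impl Drop for PinHandle {
    fn drop(&mut self) {
        let mut pins = self.cache.pins.lock().unwrap();
        if let Some(count) = pins.get_mut(&self.hash) {
            *count -= 1;
            if *count == 0 {
                pins.remove(&self.hash);
            }
        }
    }
}

/// An import or finality notification keeps its pin alive for as long as it exists.
struct ImportNotification {
    hash: Hash,
    _pin: PinHandle,
}

fn main() {
    let cache = Arc::new(PinCache::default());
    let hash = [1u8; 32];

    let notification = ImportNotification { hash, _pin: pin(&cache, hash) };
    assert!(cache.is_pinned(&notification.hash)); // not pruned while the notification is alive

    // A subscriber that never drops its notifications keeps blocks pinned and fills the cache.
    drop(notification);
    assert!(!cache.is_pinned(&hash)); // pin released, the block can be pruned again
}
```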
Actually, yes, it was a false positive on my side; I see it working without these changes too. I will close that PR. We will keep investigating to see if we can understand what is going on, but at least I see state pruning working now when I sync from scratch.
Okay. I will also continue investigating.
Hey @skunert, we have been able to investigate a bit more, and here is what we found. We are using the 1.3.0 polkadot version (we are using
I have managed to reproduce the behavior in which state pruning is not working by doing the following:
After that the log lines that we see are:
Some of those logs are added by us, but two things stand out to me here:
We can confirm that after this, state pruning stops working and never prunes anymore. Does this shed some light on the issue?
@girazoki Can you elaborate on the steps to arrive at this? Is this correct?:
Are these steps correct?
Yes, I tried it with both 50 and 2000, so maybe the logs reflect the trial where I tried with 50. But in both cases I observed this pattern.
Can you try to upgrade to 1.4.0, or at least apply the changes from this PR? #2228 Your node is reading
Yes, we can try! Do we need to resync from scratch, or can we apply it on one of the nodes whose database is not being pruned?
Please try with a fresh node to verify if this fixes the issue.
@skunert, we have been running the collators with that change for one week, and state pruning appears to be working in all of them now. So I guess this issue can be considered resolved. Thank you so much!
We are running RPC nodes for tanssi with the embedded relay chain from cumulus, with the following parameters:
As you can see, it should do both state pruning and block pruning, keeping only 2000 blocks from the relay chain. Block pruning is working well, but state pruning is not. We have tested this by asking for the state of specific storage keys on blocks that are more than 1M blocks old, and we are still able to fetch them. Initially I thought it could be related to #1358, but we upgraded our nodes to the 1.3.0 release and still observe the same behavior.
Is there anything we are missing? Any parameter we are setting wrong?
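For reference, a probe along the following lines illustrates the kind of check (not necessarily the exact commands we ran): the RPC port, block number, and dependencies are illustrative assumptions, while `chain_getBlockHash` and `state_getStorage` are the standard Substrate JSON-RPC methods. If the state of a sufficiently old block is still returned, pruning has not discarded it:

```rust
// Assumed dependencies:
// jsonrpsee = { version = "0.22", features = ["http-client"] }
// tokio = { version = "1", features = ["full"] }
// anyhow = "1"
use jsonrpsee::core::client::ClientT;
use jsonrpsee::http_client::HttpClientBuilder;
use jsonrpsee::rpc_params;

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    // Hypothetical endpoint: point this at the embedded relay chain's RPC port.
    let client = HttpClientBuilder::default().build("http://127.0.0.1:9945")?;

    // Pick a block far behind the configured pruning window (illustrative number).
    let old_hash: Option<String> = client
        .request("chain_getBlockHash", rpc_params![1_000_000u32])
        .await?;
    let old_hash = old_hash.expect("block should exist on a synced node");

    // ":code" (hex 0x3a636f6465) is a storage key present in every block's state.
    let probe = client
        .request::<Option<String>, _>("state_getStorage", rpc_params!["0x3a636f6465", old_hash])
        .await;

    match probe {
        Ok(Some(_)) => println!("old state is still queryable -> it was not pruned"),
        Ok(None) | Err(_) => println!("old state is unavailable -> pruning discarded it"),
    }
    Ok(())
}
```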