State in the cumulus embedded relay is not being pruned #2850

Closed · girazoki opened this issue Jan 4, 2024 · 17 comments
Labels: I10-unconfirmed (Issue might be valid, but it's not yet known.)

Comments

girazoki (Contributor) commented Jan 4, 2024

We are running RPC nodes for Tanssi with the embedded relay chain from cumulus, with the following parameters:

--chain=/chain-network/parachain-raw-no-bootnodes-specs.json
      --base-path=/data/para
      --database=paritydb
      --rpc-port=9944
      --prometheus-port=9615
      --prometheus-external
      --name=$(POD_NAME)
      --node-key-file=/p2p-private-keys/$(POD_NAME)
      --listen-addr=/ip4/0.0.0.0/tcp/30333
      --public-addr=/dns4/$(POD_NAME).a.stagenet.tanssi.network/tcp/30333
      --execution=wasm
      --state-pruning=archive
      --blocks-pruning=archive
      --telemetry-url=wss://telemetry.polkadot.io/submit/ 0
      --rpc-cors=all
      --unsafe-rpc-external
      --rpc-methods=safe
      --db-cache=1024
      --trie-cache-size=1073741824
      --detailed-log-output
      --rpc-max-connections=400
      --bootnodes=/dns4/fraa-stagebox-boot-0.a.stagenet.tanssi.network/tcp/30333/p2p/12D3KooWMM7TQebpc95rcDZfXPapuG3B9FKqk3C1FyDgB1toGfDN
      --bootnodes=/dns4/fraa-stagebox-boot-1.a.stagenet.tanssi.network/tcp/30333/p2p/12D3KooWCQSoevCxrdkTNi5cMwxFjEbidEonYp8HqSNE9DdP27VJ
      --
      --chain=/chain-network/relay-raw-no-bootnodes-specs.json
      --base-path=/data/relay
      --database=paritydb
      --rpc-port=9945
      --prometheus-port=9616
      --prometheus-external
      --name=$(POD_NAME)-relay
      --node-key-file=/p2p-private-keys/$(POD_NAME)
      --listen-addr=/ip4/0.0.0.0/tcp/30334
      --public-addr=/dns4/$(POD_NAME).a.stagenet.tanssi.network/tcp/30334
      --public-addr=/dns4/$(POD_NAME)/tcp/30334
      --pool-limit=0
      --db-cache=128
      --execution=wasm
      --state-pruning=2000
      --blocks-pruning=2000
      --rpc-cors=all
      --rpc-methods=safe
      --unsafe-rpc-external
      --sync=fast
      --bootnodes=/dns4/frag3-stagenet-relay-val-0.g.moondev.network/tcp/30334/p2p/12D3KooWKvtM52fPRSdAnKBsGmST7VHvpKYeoSYuaAv5JDuAvFCc
      --bootnodes=/dns4/frag3-stagenet-relay-val-1.g.moondev.network/tcp/30334/p2p/12D3KooWQYLjopFtjojRBfTKkLFq2Untq9yG7gBjmAE8xcHFKbyq
      --bootnodes=/dns4/frag3-stagenet-relay-val-2.g.moondev.network/tcp/30334/p2p/12D3KooWMAtGe8cnVrg3qGmiwNjNaeVrpWaCTj82PGWN7PBx2tth
      --bootnodes=/dns4/frag3-stagenet-relay-val-3.g.moondev.network/tcp/30334/p2p/12D3KooWLKAf36uqBBug5W5KJhsSnn9JHFCcw8ykMkhQvW7Eus3U

As you can see, it should do both state pruning and block pruning, keeping only the last 2000 blocks of the relay chain. Block pruning is working well, but state pruning is not. We have tested this by querying the state of specific storage keys at blocks that are more than 1M blocks old, and we are still able to fetch them. Initially I thought it could be related to #1358, but we upgraded our nodes to the 1.3.0 release and still observe the same behavior.
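For illustration, a minimal sketch of the kind of check we run (the node URL, storage key, and block number below are placeholders, and it assumes the standard chain_getBlockHash and state_getStorage JSON-RPC methods are reachable on the embedded relay's RPC port):

```python
# Sketch: query state at a block far behind the pruning window. If state
# pruning worked, the node would answer with a "State already discarded" error.
import json
import urllib.request

NODE_RPC = "http://127.0.0.1:9945"   # embedded relay chain RPC port (placeholder)
STORAGE_KEY = "0x..."                # placeholder: substitute a real storage key
OLD_BLOCK_NUMBER = 11_000_000        # a block far outside --state-pruning=2000

def rpc(method, params):
    """Send a single JSON-RPC request to the node over HTTP."""
    payload = json.dumps({"jsonrpc": "2.0", "id": 1,
                          "method": method, "params": params}).encode()
    req = urllib.request.Request(NODE_RPC, data=payload,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

# Resolve the old block number to a hash, then ask for state at that block.
block_hash = rpc("chain_getBlockHash", [OLD_BLOCK_NUMBER])["result"]
answer = rpc("state_getStorage", [STORAGE_KEY, block_hash])

if "error" in answer:
    print("state pruned:", answer["error"].get("message"))
else:
    print("state still available (not pruned):", answer.get("result"))
```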

Is there anything that we are missing? Any parameter that we are setting wrong?

bkchr added the I10-unconfirmed label on Jan 4, 2024
bkchr (Member) commented Jan 4, 2024

CC @skunert

skunert (Contributor) commented Jan 4, 2024

Hmm, the parameters look correct to me; the relay chain should prune.
Do you see normal memory usage? This might be related to #1662, as there we also see some abnormal pruning behaviour.

Otherwise, is your node running fine and at the tip of the chain? You could run with -lstate-db=trace and post the log here to see if any irregular log entries get generated.

girazoki (Contributor, Author) commented Jan 4, 2024

Yes, we will, and we will report back. One thing we can already say is that we have several nodes affected by this, and each of them stopped pruning state at a different height of the relay chain history.

The memory usage seems to be OK (below 4GB)

girazoki (Contributor, Author) commented Jan 4, 2024

Does the log line [Relaychain] Pinned block cache limit reached. Evicting value. hash = 0x28a7…f60d mean something related to state pruning? I started seeing these around the block range at which we always stop pruning state when syncing from scratch.

skunert (Contributor) commented Jan 5, 2024

This is indeed suspicious!
Which code are you using? Can you point me to the branch? I want to reproduce this.

girazoki (Contributor, Author) commented Jan 5, 2024

Alright, I was able to figure out the issue, and it is an issue on our side. See PR: moondance-labs/tanssi#382

The way I discovered it was by comparing syncing a tanssi-node vs a polkadot-parachain node of the same polkadot version, 1.3.0. While polkadot-parachain immediately yielded an error of the form

cannot query the runtime API version: Api called for an unknown Block: State already discarded for 0x7a8a016ffb89b376c3dead2d1f3ca9892c759223022da74e0a98ade58f5a9a68

tanssi-node was never showing any of these. I am still unsure what difference the PR makes, but I will investigate in cumulus.

Now I see that state pruning is working. However, after synchronization I still see a bunch of log lines like

[Relaychain] cannot query the runtime API version: Api called for an unknown Block: State already discarded for 0x7a8a016ffb89b376c3dead2d1f3ca9892c759223022da74e0a98ade58f5a9a68

And I don't know the reason for this, but I see that you have a similar issue: #19

The parameters we are using (with fast sync) are:

./polkadot-parachain --chain dancebox-raw-specs.json --database=paritydb --rpc-port=9944 --state-pruning archive --blocks-pruning archive --db-cache 1024 --bootnodes=/dns4/fraa-stagebox-boot-0.a.stagenet.tanssi.network/tcp/30333/p2p/12D3KooWMM7TQebpc95rcDZfXPapuG3B9FKqk3C1FyDgB1toGfDN --bootnodes=/dns4/fraa-stagebox-boot-1.a.stagenet.tanssi.network/tcp/30333/p2p/12D3KooWCQSoevCxrdkTNi5cMwxFjEbidEonYp8HqSNE9DdP27VJ -- --chain westend-raw-specs.json --db-cache=128 --database=paritydb --rpc-port=9945 --state-pruning=2000 --blocks-pruning=2000 --rpc-cors=all --sync=fast --bootnodes=/dns4/frag3-stagenet-relay-val-0.g.moondev.network/tcp/30334/p2p/12D3KooWKvtM52fPRSdAnKBsGmST7VHvpKYeoSYuaAv5JDuAvFCc --bootnodes=/dns4/frag3-stagenet-relay-val-1.g.moondev.network/tcp/30334/p2p/12D3KooWQYLjopFtjojRBfTKkLFq2Untq9yG7gBjmAE8xcHFKbyq --bootnodes=/dns4/frag3-stagenet-relay-val-2.g.moondev.network/tcp/30334/p2p/12D3KooWMAtGe8cnVrg3qGmiwNjNaeVrpWaCTj82PGWN7PBx2tth --bootnodes=/dns4/frag3-stagenet-relay-val-3.g.moondev.network/tcp/30334/p2p/12D3KooWLKAf36uqBBug5W5KJhsSnn9JHFCcw8ykMkhQvW7Eus3U --tmp

I can put the raw specs here if you want to try it out. In any case this looks like a more common problem than the one we were observing, so thank you for the input!

skunert (Contributor) commented Jan 5, 2024

> Alright, I was able to figure out the issue, and it is an issue on our side. See PR: moondance-labs/tanssi#382

What is the idea behind these changes? To me it looks like it just extracts the logic from here into the tanssi node without changing anything. I have doubts that this fixes the underlying cause.

> Does the log line [Relaychain] Pinned block cache limit reached. Evicting value. hash = 0x28a7…f60d mean something related to state pruning?

Whenever we send out a finality or import notification in the node, we pin the underlying block and state that the notification references. The pin means that this block is not to be pruned. When the notification is dropped, the pin is released and the block can be pruned again. Since you are seeing this log line, the pinning cache is full, which means that some notification was not dropped.
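To illustrate the idea (this is a rough Python sketch, not the actual Substrate pinning code), the behaviour can be thought of as a reference-counted cache with a size limit:

```python
from collections import OrderedDict

class PinnedBlockCache:
    """Toy model of the pinning cache described above: each live notification
    holds a pin on its block, pinned blocks are exempt from pruning, and when
    the cache is full the oldest entry is evicted (the log line quoted earlier)."""

    def __init__(self, limit):
        self.limit = limit
        self.pins = OrderedDict()   # block hash -> number of live pins

    def pin(self, block_hash):
        # Called when an import/finality notification referencing the block is sent out.
        if block_hash not in self.pins and len(self.pins) >= self.limit:
            evicted, _ = self.pins.popitem(last=False)
            print(f"Pinned block cache limit reached. Evicting value. hash = {evicted}")
        self.pins[block_hash] = self.pins.get(block_hash, 0) + 1

    def unpin(self, block_hash):
        # Called when the notification holding the pin is dropped.
        if block_hash in self.pins:
            self.pins[block_hash] -= 1
            if self.pins[block_hash] == 0:
                del self.pins[block_hash]

    def can_prune(self, block_hash):
        # Pruning must skip any block that still has a live pin.
        return block_hash not in self.pins

# If consumers never drop their notifications, unpin() is never called, the
# cache fills up, and the eviction message starts appearing in the logs.
```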

girazoki (Contributor, Author) commented Jan 5, 2024

Actually, yes, it was a false positive on my side; I see it working without these changes too. I will close that PR. We will keep investigating to see if we understand what is going on... but at least I see state pruning working now when I sync from scratch.

skunert (Contributor) commented Jan 5, 2024

> Actually, yes, it was a false positive on my side; I see it working without these changes too. I will close that PR. We will keep investigating to see if we understand what is going on... but at least I see state pruning working now when I sync from scratch.

Okay, I will also continue investigating.
Btw, what is your motivation for using sync=fast vs sync=warp?

girazoki (Contributor, Author) commented Jan 10, 2024

Hey @skunert we have been able to investigate a bit more and here is what we found.

We are using polkadot version 1.3.0 (with polkadot-parachain), but with additional logs in state-db: https://github.com/moondance-labs/polkadot-sdk/tree/girazoki-1.3.0-additional-logs. You can check the logs we added here: moondance-labs@1be1b22

I have managed to reproduce the behavior in which state pruning is not working by doing the following:

  • Sync polkadot-parachain with the embedded relay (we can put the specs here if it's needed, but I suspect doing it with statemint would work as well), but reboot the node before you reach the end of the first state-pruning window (that is, while the window size is still lower than it should be; with the arguments provided above, less than 2000).

After that the log lines that we see are:

2024-01-10 14:00:33.616 TRACE main state-db: [Relaychain] StateDb settings: Constrained(Constraints { max_blocks: Some(50) }). Ref-counting: false
2024-01-10 14:00:33.617 TRACE main state-db: [Relaychain] Reading uncanonicalized journal. Last canonicalized #12743547 (0xfbaea2a0cd072e50a15d50b21ca9ea8c1b3975dd1389ce685ea504db248053fd)
2024-01-10 14:00:33.621 TRACE main state-db: [Relaychain] Uncanonicalized journal entry 12743548.0 (0xd72a7a4879fe647e9e4eeef648fdf5998a7a19548465c75bfcf2eabf74c1541c) (106 inserted, 111 deleted)
2024-01-10 14:00:33.648 TRACE main state-db: [Relaychain] Uncanonicalized journal entry 12743549.0 (0x6e991e545e6870c266fa03854fa3e584b86033da5dd9d1ff0dcb54fc70f10936) (96 inserted, 87 deleted)
2024-01-10 14:00:33.654 TRACE main state-db: [Relaychain] Uncanonicalized journal entry 12743549.1 (0x1179c09b8ef5612b86cc3529711c3ba5eba3de6d9483cb2dee649842a9800df6) (99 inserted, 90 deleted)
2024-01-10 14:00:33.693 TRACE main state-db: [Relaychain] Finished reading uncanonicalized journal, 3 entries
2024-01-10 14:00:33.693 TRACE main state-db: [Relaychain] Reading pruning journal for the database-backed queue. Pending #0

----------------------

2024-01-08 17:45:22.830 DEBUG tokio-runtime-worker state-db: [Relaychain] canonicalizing block 0x3c23ae79e9babf90f68ce334a42e9a59fb0e50c822843a82e5d1eda9fded8aa0
2024-01-08 17:45:22.830 TRACE tokio-runtime-worker state-db: [Relaychain] Canonicalizing 0x3c23ae79e9babf90f68ce334a42e9a59fb0e50c822843a82e5d1eda9fded8aa0
2024-01-08 17:45:22.830 TRACE tokio-runtime-worker state-db::pin: [Relaychain] Pinned non-canon block: 0x3c23ae79e9babf90f68ce334a42e9a59fb0e50c822843a82e5d1eda9fded8aa0
2024-01-08 17:45:22.830 TRACE tokio-runtime-worker state-db: [Relaychain] Discarding 1 records
2024-01-08 17:45:22.830 DEBUG tokio-runtime-worker state-db: [Relaychain] db backed last  Some(12717001)
2024-01-08 17:45:22.830 DEBUG tokio-runtime-worker state-db: [Relaychain] StateDb prunning set to 12717002
2024-01-08 17:45:22.830 DEBUG tokio-runtime-worker state-db: [Relaychain] db backed last  Some(12717001)
2024-01-08 17:45:22.830 DEBUG tokio-runtime-worker state-db: [Relaychain] db backed last  Some(12717001)
2024-01-08 17:45:22.830 DEBUG tokio-runtime-worker state-db: [Relaychain] Adding to pruning window: 0x3c23ae79e9babf90f68ce334a42e9a59fb0e50c822843a82e5d1eda9fded8aa0 (99 inserted, 106 deleted)
2024-01-08 17:45:22.830 TRACE tokio-runtime-worker state-db: [Relaychain] Importing 12717002, base=0
2024-01-08 17:45:22.830 DEBUG tokio-runtime-worker state-db: [Relaychain] Entering pruning loop
2024-01-08 17:45:22.830 DEBUG tokio-runtime-worker state-db: [Relaychain] db backed last  Some(12717002)
2024-01-08 17:45:22.830 DEBUG tokio-runtime-worker state-db: [Relaychain] next hash Ok(None)
2024-01-08 17:45:22.830 DEBUG tokio-runtime-worker state-db: [Relaychain] base 0
2024-01-08 17:45:22.830 DEBUG tokio-runtime-worker state-db: [Relaychain] popfront db backed
2024-01-08 17:45:22.830 TRACE tokio-runtime-worker state-db: [Relaychain] Trying to prune when there's nothing to prune
2024-01-08 17:45:22.830 DEBUG tokio-runtime-worker state-db: [Relaychain] we should never reach this branch
2024-01-08 17:45:22.830 TRACE tokio-runtime-worker state-db::pin: [Relaychain] Discarding unpinned non-canon block: 0x3c23ae79e9babf90f68ce334a42e9a59fb0e50c822843a82e5d1eda9fded8aa0

Some of those logs are added by us, but two things stand out to me here:

  • Reading pruning journal for the database-backed queue. Pending #0
  • Trying to prune when there's nothing to prune

We can confirm that after this, state pruning stops working and nothing is ever pruned again.

Does this shed some light on the issue?

skunert (Contributor) commented Jan 12, 2024

@girazoki Can you elaborate on the steps to arrive at this?

Is this correct?

  • Start polkadot-parachain and sync the relay chain in embedded mode with sync=fast
  • Wait until sync is finished and state is downloaded
  • Then, before the next PRUNING_WINDOW_SIZE blocks are finalized, you reboot the node (2000 in your CLI from the first post, 50 in the logs from the last comment apparently)

Are these steps correct?

girazoki (Contributor, Author) commented

Yes, I tried it with both 50 and 2000, so maybe the logs reflect the trial where I used 50. But in both cases I observed this pattern.

skunert (Contributor) commented Jan 12, 2024

Can you try to upgrade to 1.4.0, or at least apply the changes from this PR? #2228

Your node is reading #0 from the LAST_PRUNED column. The PR above states that the column was not set for warp sync, but since we use state sync for fast sync too, it could fix your issue.
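As a rough illustration of why a missing last-pruned record matters (a simplified Python sketch, not the real sc-state-db code), a database-backed pruning queue restores its base from that record on startup; if the record was never written, the base falls back to 0, the journal entry it looks for does not exist, and every pruning attempt ends in "Trying to prune when there's nothing to prune":

```python
class DbBackedPruningQueue:
    """Simplified model of a database-backed pruning window. `db` stands in
    for the state-db column: it maps block numbers to pruning-journal records
    and holds a LAST_PRUNED-style marker."""

    def __init__(self, db, window_size):
        self.db = db
        self.window_size = window_size
        last_pruned = db.get("LAST_PRUNED")          # None if never written
        # If the marker is missing (e.g. the state arrived via state/warp sync
        # rather than block-by-block import), the base falls back to 0.
        self.base = (last_pruned + 1) if last_pruned is not None else 0

    def note_canonicalized(self, number, journal_record):
        # Each newly canonicalized block appends a journal record.
        self.db[number] = journal_record

    def prune_one(self):
        record = self.db.get(self.base)
        if record is None:
            # base == 0 while journal records only exist around the current
            # tip, so there is never anything to pop and the window only grows.
            print("Trying to prune when there's nothing to prune")
            return False
        # ... delete the state keys listed in `record` ...
        self.db["LAST_PRUNED"] = self.base
        del self.db[self.base]
        self.base += 1
        return True
```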

girazoki (Contributor, Author) commented

We can try, yes! Do we need to resync from scratch, or can we apply it on one of the nodes whose database is not being pruned?

skunert (Contributor) commented Jan 15, 2024

Please try with a fresh node to verify if this fixes the issue.

girazoki (Contributor, Author) commented

@skunert we have been running the collators with that change for one week, and state pruning now appears to be working in all of them. So I guess this issue can be considered resolved.

Thank you so much!

skunert (Contributor) commented Jan 29, 2024

Good to hear, thanks for reporting back! There is also #2935 in the works, which will get rid of the error message you have been seeing regarding #19.

skunert closed this as completed on Jan 29, 2024